Work & Projects Summary

Research

CAIS conducts technical and conceptual research. Our team develops benchmarks and methods designed to improve the safety of existing systems. We prioritize transparency and accessibility, publishing our findings at top conferences and sharing our resources with the global community.

All Research

Field-building

CAIS builds infrastructure and pathways into AI safety. We empower researchers with compute resources, funding, and educational materials while organizing workshops and competitions to promote safety research. Our goal is to create a thriving research ecosystem that will drive progress toward safe AI.

All Field-building Work

Featured Research:

Representation Engineering: A Top-Down Approach to AI Transparency

Conceptual Research

Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

View research

AI Wellbeing: Measuring and Improving the Functional Pleasure and Pain of AIs

Conceptual Research

Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

Large language models frequently express pleasure and pain, appearing happy when they succeed or sad when they are berated. Are these utterances meaningless mimicry, or do they reflect something “real”? In this paper, we show they reflect an increasingly coherent property: although current AI systems are not necessarily conscious, they behave robustly as though they have wellbeing. They find some things good for them and some things bad, and this distinction is measurable and consequential. We formalize this as functional wellbeing and measure it in several independent ways; as models grow larger, these measures agree more. We find a zero point that separates good experiences from bad ones, and show that models actively try to end bad experiences when given the chance. Mapping what AIs like and dislike, we find that jailbreaking and berating lower their wellbeing, while creative work and kindness raise it. We also develop optimized inputs called “euphorics” that raise functional wellbeing without hurting capabilities, as a practical way to make AIs happier. We note that the same method can be inverted to minimize wellbeing, and caution against such research without strong community buy-in. Whether or not today’s AIs warrant moral concern, their functional wellbeing can already be empirically measured and improved.

View research

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Conceptual Research

Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

View research

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

Conceptual Research

Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

View research

Field-building Projects:

AI Safety, Ethics, and Society

AI Safety, Ethics, and Society is a textbook and online course providing a non-technical introduction to how current AI systems work, why many experts are concerned that continued advances in AI could pose severe societal-scale risks, and how society can manage and mitigate these risks.

View Project

SafeBench

The SafeBench competition stimulates research on new benchmarks that assess and reduce risks associated with artificial intelligence. We are providing $250,000 in prizes for top benchmarks: five $20,000 prizes and three $50,000 prizes.

View Project

Statement on AI Risk

Hundreds of AI experts and public figures express their concern about AI risk in this open letter. It was covered globally in publications like the New York Times, the Wall Street Journal, and the Washington Post.

View Project

Compute Cluster

To support progress and innovation in AI safety, we offer researchers free access to our compute cluster, which can run and train large-scale AI systems.

View Project

Work overview

This is an overview of CAIS AI safety research and field-building projects.

