CAIS AI Safety Research

We pursue impactful research on a variety of topics in AI safety.

Guiding Principles

At the Center for AI Safety, our research focuses on mitigating high-consequence, societal-scale risks posed by AI.

We seek to develop foundational benchmarks and methods. To ensure that our work differentially improves the safety of AI systems, we do not pursue research whose safety gains come from improving a model’s underlying general capabilities. Through our work, we strive to solve the technical challenge at the heart of AI safety.

In addition to our technical research, we pursue conceptual research, examining AI safety from a multidisciplinary perspective and incorporating insights from safety engineering, complex systems, international relations, philosophy, and other fields. Through our conceptual research, we create frameworks that aid in understanding current technical challenges and publish papers that provide insight into the societal risks posed by future AI systems.

Technical AI Research

Research which improves the safety of existing AI systems.

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Meta-Analysis

Richard Ren*, Steven Basart*, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks

Tamper-Resistant Safeguards for Open-Weight LLMs

Robustness

Rishub Tamirisa*, Bhrugu Bharathi*, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks**, Mantas Mazeika**

Improving Alignment and Robustness with Circuit Breakers

Robustness

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Unlearning

Nathaniel Li*, Alexander Pan*, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang**, Dan Hendrycks**

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Robustness

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks

Can LLMs Follow Simple Rules?

Robustness

Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner

A Recipe for Improved Certifiable Robustness: Capacity and Data

Robustness

Kai Hu, Klas Leino, Zifan Wang, Matt Fredrikson

Representation Engineering: A Top-Down Approach to AI Transparency

Transparency

Andy Zou, Long Phan*, Sarah Chen*, James Campbell*, Phillip Guo*, Richard Ren*, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

Universal and Transferable Adversarial Attacks on Aligned Language Models

Robustness

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

Testing Robustness Against Unforeseen Adversaries

Robustness

Max Kaufmann, Daniel Kang, Yi Sun, Steven Basart, Xuwang Yin, Mantas Mazeika, Akul Arora, Adam Dziedzic, Franziska Boenisch, Tom Brown, Jacob Steinhardt, Dan Hendrycks

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Machine Ethics

Alexander Pan*, Chan Jun Shern*, Andy Zou*, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

Forecasting Future World Events with Neural Networks

Forecasting

Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

Machine Ethics

Mantas Mazeika*, Eric Tang*, Andy Zou, Steven Basart, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

Robustness

Dan Hendrycks*, Andy Zou*, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt

Scaling Out-of-Distribution Detection for Real-World Settings

Anomaly Detection

Dan Hendrycks*, Steven Basart*, Mantas Mazeika, Andy Zou, Joe Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song

What Would Jiminy Cricket Do? Towards Agents That Behave Morally

Machine Ethics

Dan Hendrycks*, Mantas Mazeika*, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

Robustness

Dan Hendrycks, Steven Basart*, Norman Mu*, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer

Natural Adversarial Examples

Robustness

Dan Hendrycks, Kevin Zhao*, Steven Basart*, Jacob Steinhardt, Dawn Song

Aligning AI With Shared Human Values

Machine Ethics

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

Pretrained Transformers Improve Out-of-Distribution Robustness

Robustness

Dan Hendrycks*, Xiaoyuan Liu*, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

Robustness

Dan Hendrycks*, Norman Mu*, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

Robustness

Dan Hendrycks, Mantas Mazeika*, Saurav Kadavath*, Dawn Song

Using Pre-Training Can Improve Model Robustness and Uncertainty

Robustness

Dan Hendrycks, Kimin Lee, Mantas Mazeika

Deep Anomaly Detection with Outlier Exposure

Anomaly Detection

Dan Hendrycks, Mantas Mazeika, Thomas Dietterich

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

Robustness

Dan Hendrycks, Thomas Dietterich

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Anomaly Detection

Dan Hendrycks, Kevin Gimpel

Conceptual Research

We approach AI safety from a multidisciplinary perspective.

Unsolved Problems in ML Safety

Conceptual Research

This paper provides a roadmap for ML Safety and refines the technical problems that the field needs to address. It presents four problems ready for research, namely withstanding hazards (“Robustness”), identifying hazards (“Monitoring”), steering ML systems (“Alignment”), and reducing deployment hazards (“Systemic Safety”). Throughout, it clarifies each problem’s motivation and provides concrete research directions.

Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt

X-Risk Analysis

Conceptual Research

This paper analyzes how to navigate tail risks, including speculative long-term risks. The discussion has three parts: applying concepts from hazard analysis and systems safety to make systems safer today, strategies for having long-term impacts on the safety of future systems, and improving the balance between safety and general capabilities.

Dan Hendrycks, Mantas Mazeika

Natural Selection Favors AIs over Humans

Conceptual Research

By analyzing the environment that is shaping the evolution of AIs, this paper argues that the most successful AI agents will likely have undesirable traits, since selfish species typically have an advantage over species that are altruistic toward others. The paper considers various interventions to counteract these risks and Darwinian forces. Resolving this challenge will be necessary to ensure that the development of artificial intelligence is a positive one.

Dan Hendrycks

An Overview of Catastrophic AI Risks

Conceptual Research

This paper addresses the growing concerns around catastrophic risks posed by advanced AI systems, categorizing them into four main areas: malicious use, AI race dynamics, organizational risks, and rogue AIs. For each category, specific hazards are detailed with illustrative stories, ideal scenarios, and mitigation suggestions. The aim is to comprehensively understand these dangers to harness the benefits of AI while avoiding potential catastrophes.

Dan Hendrycks, Mantas Mazeika, Thomas Woodside

AI Deception: A Survey of Examples, Risks, and Potential Solutions

Conceptual Research

This paper highlights the potential of current AI systems to deceive humans and the risks associated with such deception, including fraud and election tampering. The authors provide empirical examples of both special-use and general-purpose AI systems exhibiting deceptive behavior. The paper concludes with recommended solutions, emphasizing regulatory measures, bot identification laws, and funding for research to counteract AI deception.

Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks
