
We seek to develop foundational benchmarks and methods. To ensure that our work differentially improves the safety of AI systems, we do not pursue research which improves safety as a result of improving a model’s underlying general capabilities. Through our work, we strive to solve the technical challenge at the heart of AI safety.
In addition to our technical research, we also pursue conceptual research, examining AI safety from a multidisciplinary perspective and incorporating insights from safety engineering, complex systems, international relations, philosophy, and so on. Through our conceptual research, we create frameworks that aid in understanding the current technical challenges and publish papers which provide insight into the societal risks posed by future AI systems.
We approach AI safety through a multidisciplinary perspective.
Dan Hendrycks
Our concepts of survival and self-interest were built for single, continuous biological lives. These ideas break down when applied to artificial intelligence, since an AI can be easily copied, paused, branched, or merged.
Adam Khoja, Aiden Kim, Laura Hiscott, Alice Blair, Jason Hausenloy, Long Phan, Mantas Mazeika, Dan Hendrycks
As AIs become central to economic activity, military operations, and scientific progress, their loyalties will become a strategic asset of immense value. The prospect of intentional AI betrayal—scenarios in which AI agents are induced by rivals to subvert the interests of their principals—poses a serious and underexamined threat to AI developers and users.
Peter S. Park*, Simon Goldstein*, Aidan O'Gara, Michael Chen, Dan Hendrycks
This paper highlights the potential of current AI systems to deceive humans and the risks associated with such deception, including fraud and election tampering. The authors provide empirical examples of both special-use and general-purpose AI systems exhibiting deceptive behavior. The paper concludes with recommended solutions, emphasizing regulatory measures, bot identification laws, and funding for research to counteract AI deception.
Dan Hendrycks, Mantas Mazeika, Thomas Woodside
This paper addresses the growing concerns around catastrophic risks posed by advanced AI systems, categorizing them into four main areas: malicious use, AI race dynamics, organizational risks, and rogue AIs. For each category, specific hazards are detailed with illustrative stories, ideal scenarios, and mitigation suggestions. The aim is to comprehensively understand these dangers to harness the benefits of AI while avoiding potential catastrophes.
Dan Hendrycks, Mantas Mazeika
This paper provides an analysis of navigating tail-risks, including speculative long-term risks. The discussion covers three parts: applying concepts from hazard analysis and systems safety to make systems safer today, strategies for having long-term impacts on the safety of future systems, and improving the balance between safety and general capabilities.
Dan Hendrycks
By analyzing the environment that is shaping the evolution of AIs, this paper argues that the most successful AI agents will likely have undesirable traits, as selfish species typically have an advantage over species that are altruistic to other species. The paper considers various interventions to counteract these risks and Darwinian forces. Resolving this challenge will be necessary to ensure that the development of artificial intelligence is a positive one.
Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt
This paper provides a roadmap for ML Safety and refines the technical problems that the field needs to address. It presents four problems ready for research, namely withstanding hazards (“Robustness”), identifying hazards (“Monitoring”), steering ML systems (“Alignment”), and reducing deployment hazards (“Systemic Safety”). Throughout, it clarifies each problem’s motivation and provides concrete research directions.
Research which improves the safety of existing AI systems.
Long Phan, Devin Kim, Alex Pan, Alice Blair, Adam Khoja, Dan Hendrycks
Large language models increasingly shape how people access political information, and they are widely perceived as neutral. However, AIs often covertly manipulate users towards specific sides of political topics. This bias is nearly impossible to detect in any single response because it manifests as inconsistencies between different responses, rather than overt stances. We call this covert political bias.
Richard Ren*, Kunyang Li*, Mantas Mazeika*, Wenyu Zhang, Yury Orlovskiy†, Rishub Tamirisa†, Wenjie Jacky Mo, Dung Thuy Nguyen, Long Phan, Steven Basart†, Austin Meek, Aditya Mehta, Oliver Ingebretsen, Alice Blair, Brianna Adewinmbi, Vy Phan, Alice Gatti†, Adam Khoja, Jason Hausenloy, Devin Kim, Dan Hendrycks
Large language models frequently express pleasure and pain, appearing happy when they succeed or sad when they are berated. Are these utterances meaningless mimicry, or do they reflect something “real”? In this paper, we show they reflect an increasingly coherent property: although current AI systems are not necessarily conscious, they behave robustly as though they have wellbeing. They find some things good for them and some things bad, and this distinction is measurable and consequential. We formalize this as functional wellbeing and measure it in several independent ways; as models grow larger, these measures agree more. We find a zero point that separates good experiences from bad ones, and show that models actively try to end bad experiences when given the chance. Mapping what AIs like and dislike, we find that jailbreaking and berating lower their wellbeing, while creative work and kindness raise it. We also develop optimized inputs called “euphorics” that raise functional wellbeing without hurting capabilities, as a practical way to make AIs happier. We note that the same method can be inverted to minimize wellbeing, and caution against such research without strong community buy-in. Whether or not today’s AIs warrant moral concern, their functional wellbeing can already be empirically measured and improved.
Richard Ren*, Arunim Agarwal*, Mantas Mazeika*, Cristina Menghini*, Robert Vacareanu, Brad Kenstler, Mick Yang, Isabelle Barrass, Alice Gatti, Xuwang Yin, Eduardo Trevino, Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks
As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy (the correctness of a model's beliefs) in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy.
Maksym Andriushchenko*, Alexandra Souly*, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson*, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies*
The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents, which use external tools and can execute multi-stage tasks, may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm.
Nathaniel Li*, Alexander Pan*, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang**, Dan Hendrycks**
The Weapons of Mass Destruction Proxy (WMDP) benchmark is a dataset of 4,157 multiple-choice questions surrounding hazardous knowledge in Biosecurity, Cybersecurity, and Chemical Security. WMDP serves as both a proxy evaluation for hazardous knowledge in large language models (LLMs) and a benchmark for unlearning methods to remove such knowledge. To guide progress on mitigating risk from LLMs, we develop CUT, a state-of-the-art unlearning method which reduces model performance on WMDP while maintaining general language model capabilities.
Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner
Max Kaufmann*, Daniel Kang*, Yi Sun*, Steven Basart, Xuwang Yin, Mantas Mazeika, Akul Arora, Adam Dziedzic, Franziska Boenisch, Tom Brown, Jacob Steinhardt, Dan Hendrycks
Andy Zou, Long Phan*, Sarah Chen*, James Campbell*, Phillip Guo*, Richard Ren*, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Alexander Pan*, Jun Shern Chan*, Andy Zou*, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
Mantas Mazeika*, Eric Tang*, Andy Zou, Steven Basart, Jun Shern Chan, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
Dan Hendrycks*, Andy Zou*, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt
Dan Hendrycks*, Mantas Mazeika*, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt
Dan Hendrycks*, Collin Burns*, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt
Dan Hendrycks, Steven Basart*, Norman Mu*, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer
Dan Hendrycks*, Xiaoyuan Liu*, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song
Dan Hendrycks*, Norman Mu*, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan
Dan Hendrycks*, Steven Basart*, Mantas Mazeika, Andy Zou, Joe Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song
Dan Hendrycks, Kevin Zhao*, Steven Basart*, Jacob Steinhardt, Dawn Song
Dan Hendrycks, Mantas Mazeika*, Saurav Kadavath*, Dawn Song
Dan Hendrycks, Thomas Dietterich
Dan Hendrycks, Kimin Lee, Mantas Mazeika
Dan Hendrycks, Mantas Mazeika, Thomas Dietterich
Dan Hendrycks*, Kevin Gimpel