
We seek to develop foundational benchmarks and methods. To ensure that our work differentially improves the safety of AI systems, we do not pursue research which improves safety as a result of improving a model’s underlying general capabilities. Through our work, we strive to solve the technical challenge at the heart of AI safety.
In addition to our technical research, we also pursue conceptual research, examining AI safety from a multidisciplinary perspective and incorporating insights from safety engineering, complex systems, international relations, philosophy, and so on. Through our conceptual research, we create frameworks that aid in understanding the current technical challenges and publish papers which provide insight into the societal risks posed by future AI systems.
We approach AI safety through a multi-disciplinary perspective.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
Our concepts of survival and self-interest were built for single, continuous biological lives. These ideas break down when applied to artificial intelligence, since an AI can be easily copied, paused, branched, or merged.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
As AIs become central to economic activity, military operations, and scientific progress, their loyalties will become a strategic asset of immense value. The prospect of intentional AI betrayal—scenarios in which AI agents are induced by rivals to subvert the interests of their principals—poses a serious and underexamined threat to AI developers and users.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
This paper highlights the potential of current AI systems to deceive humans and the risks associated with such deception, including fraud and election tampering. The authors provide empirical examples of both special-use and general-purpose AI systems exhibiting deceptive behavior. The paper concludes with recommended solutions, emphasizing regulatory measures, bot identification laws, and funding for research to counteract AI deception.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
This paper addresses the growing concerns around catastrophic risks posed by advanced AI systems, categorizing them into four main areas: malicious use, AI race dynamics, organizational risks, and rogue AIs. For each category, specific hazards are detailed with illustrative stories, ideal scenarios, and mitigation suggestions. The aim is to comprehensively understand these dangers to harness the benefits of AI while avoiding potential catastrophes.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
This paper provides an analysis of navigating tail-risks, including speculative long-term risks. The discussion covers three parts: applying concepts from hazard analysis and systems safety to make systems safer today, strategies for having long-term impacts on the safety of future systems, and improving the balance between safety and general capabilities.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
By analyzing the environment that is shaping the evolution of AIs, this paper argues that the most successful AI agents will likely have undesirable traits, as selfish species typically have an advantage over species that are altruistic to other species. The paper considers various interventions to counteract these risks and Darwinian forces. Resolving this challenge will be necessary in order to ensure the development of artificial intelligence is a positive one.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
This paper provides a roadmap for ML Safety and refines the technical problems that the field needs to address. It presents four problems ready for research, namely withstanding hazards (“Robustness”), identifying hazards (“Monitoring”), steering ML systems (“Alignment”), and reducing deployment hazards (“Systemic Safety”). Throughout, it clarifies each problem’s motivation and provides concrete research directions.
Research which improves the safety of existing AI systems.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
Large language models increasingly shape how people access political information, and they are widely perceived as neutral. However, AIs often covertly manipulate users towards specific sides of political topics. This bias is nearly impossible to detect in any single response because it manifests as inconsistencies between different responses, rather than overt stances. We call this covert political bias.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
Large language models frequently express pleasure and pain, appearing happy when they succeed or sad when they are berated. Are these utterances meaningless mimicry, or do they reflect something “real”? In this paper, we show they reflect an increasingly coherent property: although current AI systems are not necessarily conscious, they behave robustly as though they have wellbeing. They find some things good for them and some things bad, and this distinction is measurable and consequential. We formalize this as functional wellbeing and measure it in several independent ways; as models grow larger, these measures agree more. We find a zero point that separates good experiences from bad ones, and show that models actively try to end bad experiences when given the chance. Mapping what AIs like and dislike, we find that jailbreaking and berating lower their wellbeing, while creative work and kindness raise it. We also develop optimized inputs called “euphorics” that raise functional wellbeing without hurting capabilities, as a practical way to make AIs happier. We note that the same method can be inverted to minimize wellbeing, and caution against such research without strong community buy-in. Whether or not today’s AIs warrant moral concern, their functional wellbeing can already be empirically measured and improved.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy--the correctness of a model's beliefs--in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy.
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which use external tools and can execute multi-stage tasks -- may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at this https URL.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
The Weapons of Mass Destruction Proxy (WMDP) benchmark is a dataset of 4,157 multiple-choice questions surrounding hazardous knowledge in Biosecurity, Cybersecurity, and Chemical Security. WMDP serves as both a proxy evaluation for hazardous knowledge in large language models (LLMs) and a benchmark for unlearning methods to remove such knowledge. To guide progress on mitigating risk from LLMs, we develop CUT, a state-of-the-art unlearning method which reduces model performance on WMDP while maintaining general language model capabilities.
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks