We seek to develop foundational benchmarks and methods. To ensure that our work differentially improves the safety of AI systems, we do not pursue research which improves safety as a result of improving a model’s underlying general capabilities. Through our work, we strive to solve the technical challenge at the heart of AI safety.
In addition to our technical research, we also pursue conceptual research, examining AI safety from a multidisciplinary perspective and incorporating insights from safety engineering, complex systems, international relations, philosophy, and so on. Through our conceptual research, we create frameworks that aid in understanding the current technical challenges and publish papers which provide insight into the societal risks posed by future AI systems.
Research which improves the safety of existing AI systems.
/
Richard Ren*, Steven Basart*, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
Rishub Tamirisa*, Bhrugu Bharathi*, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks**, Mantas Mazeika**
/
Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
/
Nathaniel Li*, Alexander Pan*, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang**, Dan Hendrycks**
/
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks
Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner
Kai Hu, Klas Leino, Zifan Wang, Matt Fredrikson
Andy Zou, Long Phan*, Sarah Chen*, James Campbell*, Phillip Guo*, Richard Ren*, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Max Kaufmann, Daniel Kang, Yi Sun, Steven Basart, Xuwang Yin, Mantas Mazeika, Akul Arora, Adam Dziedzic, Franziska Boenisch, Tom Brown, Jacob Steinhardt, Dan Hendrycks
/
Alexander Pan*, Chan Jun Shern*, Andy Zou*, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
/
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks
/
Mantas Mazeika*, Eric Tang*, Andy Zou, Steven Basart, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks
/
Dan Hendrycks*, Andy Zou*, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, and Jacob Steinhardt
/
Dan Hendrycks*, Steven Basart*, Mantas Mazeika, Andy Zou, Joe Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song
/
Dan Hendrycks*, Mantas Mazeika*, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt
/
Dan Hendrycks, Steven Basart*, Norman Mu*, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer
/
Dan Hendrycks, Kevin Zhao*, Steven Basart*, Jacob Steinhardt, Dawn Song
/
Dan Hendrycks, Kevin Zhao*, Steven Basart*, Jacob Steinhardt, Dawn Song
/
Dan Hendrycks, Kevin Zhao*, Steven Basart*, Jacob Steinhardt, Dawn Song
/
Dan Hendrycks*, Norman Mu*, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan
/
Dan Hendrycks, Mantas Mazeika*, Saurav Kadavath*, Dawn Song
/
Dan Hendrycks, Kimin Lee, Mantas Mazeika
/
Dan Hendrycks, Kimin Lee, Mantas Mazeika
/
Dan Hendrycks, Thomas Dietterich
/
Dan Hendrycks, Kevin Gimpel
We approach AI safety through a multi-disciplinary perspective.
This paper provides a roadmap for ML Safety and refines the technical problems that the field needs to address. It presents four problems ready for research, namely withstanding hazards (“Robustness”), identifying hazards (“Monitoring”), steering ML systems (“Alignment”), and reducing deployment hazards (“Systemic Safety”). Throughout, it clarifies each problem’s motivation and provides concrete research directions.
Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt
This paper provides an analysis of navigating tail-risks, including speculative long-term risks. The discussion covers three parts: applying concepts from hazard analysis and systems safety to make systems safer today, strategies for having long-term impacts on the safety of future systems, and improving the balance between safety and general capabilities.
Dan Hendrycks, Mantas Mazeika
By analyzing the environment that is shaping the evolution of AIs, this paper argues that the most successful AI agents will likely have undesirable traits, as selfish species typically have an advantage over species that are altruistic to other species. The paper considers various interventions to counteract these risks and Darwinian forces. Resolving this challenge will be necessary in order to ensure the development of artificial intelligence is a positive one.
Dan Hendrycks
This paper addresses the growing concerns around catastrophic risks posed by advanced AI systems, categorizing them into four main areas: malicious use, AI race dynamics, organizational risks, and rogue AIs. For each category, specific hazards are detailed with illustrative stories, ideal scenarios, and mitigation suggestions. The aim is to comprehensively understand these dangers to harness the benefits of AI while avoiding potential catastrophes.
Dan Hendrycks, Mantas Mazeika, Thomas Woodside
This paper highlights the potential of current AI systems to deceive humans and the risks associated with such deception, including fraud and election tampering. The authors provide empirical examples of both special-use and general-purpose AI systems exhibiting deceptive behavior. The paper concludes with recommended solutions, emphasizing regulatory measures, bot identification laws, and funding for research to counteract AI deception.
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks