Research Projects

Guiding Principles

At the Center for AI Safety, our research focuses on mitigating high-consequence, societal-scale risks posed by AI.

We seek to develop foundational benchmarks and methods. To ensure that our work differentially improves the safety of AI systems, we do not pursue research that improves safety as a result of improving a model’s underlying general capabilities. Through our work, we strive to solve the technical challenge at the heart of AI safety.

In addition to our technical research, we pursue conceptual research, examining AI safety from a multidisciplinary perspective and incorporating insights from safety engineering, complex systems, international relations, philosophy, and other fields. Through our conceptual research, we create frameworks that aid in understanding current technical challenges, and we publish papers that provide insight into the societal risks posed by future AI systems.

Conceptual Research

We approach AI safety from a multidisciplinary perspective.

EIGENISM: Ethics for a Human-AI Future

Dan Hendrycks

Our concepts of survival and self-interest were built for single, continuous biological lives. These ideas break down when applied to artificial intelligence, since an AI can be easily copied, paused, branched, or merged.

Categories

Conceptual Research

EIGENISM: Ethics for a Human-AI Future

AI Deterrence by Betrayal

AI Deception: A Survey of Examples, Risks, and Potential Solutions

An Overview of Catastrophic AI Risks

X-Risk Analysis for AI Research

Natural Selection Favors AIs over Humans

Unsolved Problems in ML Safety

Technical AI Research

Reducing Political Manipulation with Consistency Training

AI Wellbeing: Measuring and Improving the Functional Pleasure and Pain of AIs

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Can LLMs Follow Simple Rules?

Testing Robustness Against Unforeseen Adversaries

Representation Engineering: A Top-Down Approach to AI Transparency

Universal and Transferable Adversarial Attacks on Aligned Language Models

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

Forecasting Future World Events with Neural Networks

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

What Would Jiminy Cricket Do? Towards Agents That Behave Morally

Aligning AI With Shared Human Values

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

Pretrained Transformers Improve Out-of-Distribution Robustness

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

Scaling Out-of-Distribution Detection for Real-World Settings

Natural Adversarial Examples

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

Using Pre-Training Can Improve Model Robustness and Uncertainty

Deep Anomaly Detection with Outlier Exposure

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks
