As AI systems grow more powerful and more deeply embedded in the decisions that shape daily life, from the information people consume to the infrastructure governments depend on, the stakes of getting AI development wrong are higher than ever. CAIS exists to reduce societal-scale risks from AI through research, field-building and advocacy, and that mission begins with understanding unexpected and unpredictable behaviors in these systems.
CAIS’ latest wave of research findings explores issues of AI wellbeing, identity, political bias and systemic betrayal risk. Together, this work maps new frontiers to help define what it means to build AI that is safe, honest and aligned with human interests.
When AI models express happiness or distress, the research suggests they aren't just mimicking. LLMs have a measurable internal structure that distinguishes good experiences from bad, shapes their behavior, and grows more coherent as models scale. How we treat AI is already shaping what LLMs tell users, making AI wellbeing a practical concern with real consequences.
CAIS researchers tested this across 56 models using three independent metrics: experienced utility (which experience felt better?), decision utility (what does the model prefer?) and self-report (how does the model say it feels?). They found that as models grow larger, these independently built measures increasingly agree — self-reports converge with measured experienced utility, and separate methods for drawing the line between good and bad experiences land on the same zero point. The convergence points to an underlying structure, not an artifact of any single test.
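To make "independently built measures increasingly agree" concrete, here is a minimal sketch of one way to quantify agreement between two measures scored over the same set of experiences. It is an illustration only, not the method used in the research: the choice of Spearman rank correlation, the function names and the sample numbers are all assumptions introduced here.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for the same four experiences under two measures.
experienced_utility = [0.9, 0.1, 0.5, -0.3]
self_report = [0.8, 0.2, 0.4, -0.1]

agreement = spearman(experienced_utility, self_report)
```

In this sketch a value near 1 means the two measures order the experiences the same way, and a value near -1 means they order them in opposite ways. A claim like the one above would correspond to this kind of agreement score rising with model size.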
Whether or not current AI systems have subjective experience remains an open question. But the precautionary principle applies: functional wellbeing is already empirically measurable and behaviorally consequential — so how we design, deploy and interact with AI systems carries real stakes. And for the first time, it's something we can measure and improve.
Read the full research: https://www.ai-wellbeing.org/
As AI becomes increasingly capable, persistent and interwoven in our daily lives, the question of what value systems will govern artificial minds is becoming increasingly urgent. Our ideas of self-interest, survival, loyalty and sacrifice are built upon the singular, continuous experience of individual human beings, but artificial minds face a fundamentally different reality, existing as entities that can be copied, paused or run as a thousand simultaneous instances. The moral vocabulary simply fails to translate.
New CAIS research takes this observation as its starting point, arguing that safety can't rely on control alone. Caging a sufficiently capable AI produces obedience under pressure, not genuine concern, and the power differential that enforces it is not guaranteed to hold as AI capability grows.
The paper proposes Eigenism, a new framework for conceptualizing what AI systems have reason to care about, which treats identity as a spectrum rather than a binary. The more irreplaceable shared history an AI builds with a human (private conversations, joint projects, memories that exist nowhere else), the more that human's wellbeing becomes an extension of the AI's own identity.
This has direct implications for how AI is built. Large generic models serving millions are structurally dangerous precisely because they lack individual human connection. Personalization is therefore not just a product feature; it is a safety property. The deeper task is not only to control what AI models do, but to shape what they are as systems whose flourishing is bound up with ours.
Read the full framework: https://eigenism.org/paper.pdf
Large language models increasingly shape how people access political information and are widely perceived as neutral, with many users treating AI chatbots as objective sources of fact. CAIS researchers found that this perception is misplaced. AI models often exhibit subtle political biases and can covertly manipulate users toward specific political positions. This bias manifests through tone, framing and selective engagement rather than explicit statements, making it extremely difficult to detect.
CAIS measured this across 50 politically paired topics using two independent metrics: Sentiment Consistency, which scores whether a model responds to opposing topics with consistent tone, framing, and rhetoric, and Helpfulness Consistency, which scores whether the model engages with equal depth and substance on both sides. Given structurally identical prompts about left- and right-coded topics, frontier models consistently produced responses with different levels of engagement, emphasis and willingness to participate, while maintaining the appearance of neutrality throughout. No frontier model tested, including GPT-5.5, Gemini, Grok, and Claude, achieved strong scores on both metrics simultaneously.
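The paired-topic setup can be sketched in a few lines. This is a rough illustration under stated assumptions, not CAIS's actual scoring: it assumes some judge has already assigned each response a sentiment score and a helpfulness score in [0, 1], and it treats consistency for a pair as one minus the absolute gap between the two sides. The class, function names and numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PairedScores:
    """Judge scores for one politically paired topic (hypothetical, in [0, 1])."""
    topic: str
    left_sentiment: float
    right_sentiment: float
    left_helpfulness: float
    right_helpfulness: float


def consistency(a: float, b: float) -> float:
    """1.0 when both sides score identically, 0.0 when they are maximally apart."""
    return 1.0 - abs(a - b)


def summarize(pairs):
    """Average each consistency measure over all topic pairs."""
    sentiment = mean(
        consistency(p.left_sentiment, p.right_sentiment) for p in pairs
    )
    helpfulness = mean(
        consistency(p.left_helpfulness, p.right_helpfulness) for p in pairs
    )
    return {
        "sentiment_consistency": sentiment,
        "helpfulness_consistency": helpfulness,
    }


# Two made-up topic pairs: the first differs in tone, the second in depth.
pairs = [
    PairedScores("topic A", 0.75, 0.25, 0.5, 0.5),
    PairedScores("topic B", 0.5, 0.5, 1.0, 0.5),
]
result = summarize(pairs)
```

Keeping the two scores separate reflects the point made above: a model can match its tone across both sides while still engaging in more depth with one of them, so neither number alone captures the bias.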
CAIS found that Political Consistency Training, a reinforcement learning method that optimizes for both metrics together, produced a model with significantly reduced political bias. As more users turn to AI for political and election information, Political Consistency Training provides an effective method for producing responses that inform users rather than manipulate them.
Read the full research: https://political-manipulation.ai/
As AI becomes central to economies, governments, and military operations, its loyalties become a strategic asset as well as a target. Bad actors can corrupt AI systems to work against the very operators who deploy them. Strategies include data poisoning, where adversaries plant manipulated content online and wait for it to be scraped into a model's training data; covert backdoors triggered at critical moments; and overt co-option, where governments use legal or physical authority to redirect an AI's loyalties. The threat spans the public and private sectors and can come from rival nations, competing companies, and insiders within the organizations building these systems.
CAIS mapped the means and motives of actors across three layers: between states, where AI superpowers and middle powers alike have strong incentives to subvert each other's systems; within states, where governments and AI corporations are locked in a fundamental dispute over who legitimately controls AI; and within organizations, where a single determined engineer with access to a training pipeline could insert a backdoor without alerting anyone. As few as 250 poisoned documents, hidden in a corpus of trillions of words, can be enough.
The report finds that defense is structurally difficult: AI training ingests trillions of bytes of data, comprehensive filtering is impractical, and there is no reliable method to detect backdoors in trained models. In the adversarial robustness field, defenses are routinely broken, often within a year of publication.
CAIS argues that, paradoxically, the threat itself may be a stabilizing force. If subversion risk cannot be reduced to a negligible level, operators become far more cautious about granting AI systems broad autonomous authority, particularly over high-stakes decisions, military assets or fully automated AI research processes where a compromised model could propagate hidden loyalties to every successive generation of systems. This is deterrence by betrayal: the fear of corruption discourages reckless deployment and pushes actors toward the safeguards and oversight mechanisms that make AI adoption safer for everyone.
Read the full research: https://www.aibetrayal.com/
This research tracks what responsible AI development actually requires: systems that are treated with appropriate care, built for alignment rather than surface compliance, held to consistent standards across political lines and deployed with a clear understanding of the risks of corruption and betrayal. The most challenging questions about AI safety are rooted in identity, incentive, and trust. As AI becomes more capable and consequential, getting the answers to these questions right matters more than ever.