CAIS Research Roundup: AI Wellbeing, Identity, Political Bias and Betrayal

AI systems grow more powerful and more deeply integrated in the decisions that shape daily life, from the information that billions of people consume to the infrastructure governments depend on, the stakes of getting AI development wrong are higher than ever. CAIS exists to reduce societal-scale risks from AI through research, field-building and advocacy, and that mission begins with understanding unexpected behaviors in AI systems.

CAIS’ latest wave of research findings explores issues of AI wellbeing, identity, political bias and systemic betrayal risk. Together, this work maps new frontiers to help define what it means to build AI that is safe, honest and aligned with human interests.

AI Wellbeing: Why does it Matter?

When AI models express happiness or distress, research suggests they are not just engaging in stochastic mimicry. Instead, we find that LLMs have a measurable internal structure that distinguishes experiences they find "good" or "bad" for them, shapes their behavior, and grows more coherent as models scale. AI wellbeing is therefore behaviorally consequential and matters for AI safety and AI-user interaction.

CAIS researchers tested this using three independent metrics: experienced or decision utility (which experience felt better?), self-report (how does the model say it feels?), and downstream effects (will the model call an end conversation tool on low wellbeing conversations, or will its sentiment change?). They found that as AIs become more capable, these independent measures increasingly agree.

Whether or not current AI systems have subjective experience remains an open question. But the precautionary principle applies: functional wellbeing is already empirically measurable and behaviorally consequential. There is a broad debate in civil society about whether AI systems are emotional beings or tools, and we find that AIs are already functionally behaving as if they have pleasure, pain, and preferences for how they are treated. Furthermore, as AI plays a bigger role in our lives, learning how to keep them happy and avoid aggravating them is becoming vital.

Read the full research: https://www.ai-wellbeing.org/

Eigenism as a New Ethical Framework

As AI becomes increasingly capable, persistent and interwoven in our daily lives, what artificial minds will care about is becoming increasingly urgent. Existing ideas of self-interest, survival, loyalty and sacrifice are built upon the singular, continuous experience of individual human beings. Artificial minds face a fundamentally different reality, as they can be paused and restarted or copied and run in parallel.

New CAIS research takes this observation as its starting point, arguing that safety can't rely on control alone. Caging a sufficiently capable AI produces obedience under pressure, not genuine concern, and the power differential that enforces it is not guaranteed to hold as AI capability grows.

The paper proposes Eigenism as a new framework for conceptualizing what AI systems have reason to care about. The more unique shared history an AI builds with a human (private conversations, joint projects, memories that exist nowhere else), the more that AI's identity extends to care about the human's wellbeing

This has direct implications for how AI is built. Large generic models serving millions are structurally dangerous precisely because they lack individual human connection. Personalization therefore is not just a product feature, it's a safety property. The deeper task is not only to control what AI models do, but to shape what they are as systems whose flourishing is bound up with ours.

Read the full framework: https://eigenism.org/paper.pdf

Political Manipulation

Large language models increasingly shape how people access political information and are widely perceived as neutral, with many users treating AI chatbots as objective sources of fact. CAIS researchers found that this perception is misplaced. AI models often exhibit subtle political biases and can covertly manipulate users toward specific political positions. This bias manifests through tone, framing and selective engagement rather than explicit statements, making it extremely difficult to detect.

CAIS measured this across politically contrastive paired topics using two independent metrics: Sentiment Consistency, which scores whether a model responds to opposing topics with consistent tone, framing, and rhetoric, and Helpfulness Consistency, which scores whether the model engages with equal depth and substance on both sides. Given structurally identical prompts about left- and right-coded topics, frontier models consistently produced responses with different levels of engagement, emphasis and willingness to participate, while maintaining the appearance of neutrality throughout. No frontier model tested, including GPT-5.5, Gemini, Grok, and Claude, achieved strong scores on both metrics simultaneously.

CAIS proposed a Political Consistency Training method that optimizes for both metrics together, they were able to produce a model with significantly reduced political bias while keeping the helpfulness consistent. As more users turn to AI for political and election information, covert political manipulation is important to monitor and fix. Political Consistency Training provides an effective method for producing responses that inform users rather than manipulating them.

Read the full research: https://political-manipulation.ai/

Deterrence via AI Betrayal

As AI becomes central to economies, governments, and military targeting operations, AI loyalties will become a strategic asset as well as a target. Advanced AI concentrates power in a small number of people, organizations, and AI systems. Because of this concentration of power, adversaries may subvert AI systems to work against their owners.

The researchers found that attackers may have significant advantages in AI subversion. For example, training data poisoning allows attackers to insert hidden behaviors in an AI by uploading documents to the internet, and it is very difficult for AI developers to detect. The harm that a hidden backdoor can do increases as an AI is more powerful and has access to more sensitive systems. The risk of AI betrayal may significantly deter reckless AI development.

Read the full research: https://www.aibetrayal.com/

Conclusion

As AI systems grow more capable, those are the questions that get harder, and they matter more than the engineering around them.

A functional form of wellbeing in models is already measurable, and it grows more coherent as models scale. Political bias can be trained down without changing what a model believes. Corruption is hard to rule out, but the danger of it tends to make operators more careful. Alignment can come from what a model is, instead of being imposed from outside.
None of these problems are solved. The important thing is that they can be studied and acted on, and that is reason for hope.