About AI Risk
What is AI risk?
AI has been compared to electricity and the steam engine in terms of its potential to transform society. The technology could be profoundly beneficial, but it also presents serious risks. As AI becomes more advanced, it could eventually pose catastrophic or existential risks. A 'catastrophic risk' is a risk to human wellbeing on a global scale, while an 'existential risk' (X-Risk) is a more extreme risk that threatens the permanent curtailment of humanity's potential (Ord 2020). There are many ways in which AI systems could pose or contribute to these risks, some of which are enumerated below.
Speculative AI Hazards and Failure Modes
Adapted from X-Risk Analysis for AI Research
Weaponization
Malicious actors could repurpose AI to be highly destructive, presenting an X-Risk in and of itself and increasing the probability of other X-Risks through political destabilization. For example, deep RL methods have been applied to aerial combat, and ML drug-discovery tools could be used to build chemical weapons.
In recent years, researchers have been developing AI systems for automated cyberattacks (Buchanan et al. 2020; Cary et al. 2020), military leaders have discussed giving AI systems decisive control over nuclear silos (Klare 2020), and the world's superpowers have declined to sign agreements banning autonomous weapons. Additionally, an accident involving an automated retaliation system could rapidly escalate and give rise to a major war. Looking forward, we note that since the nation with the most intelligent AI systems could have a strategic advantage, it may be challenging for nations not to build increasingly powerful weaponized AI systems. Even if all superpowers ensure the systems they build are safe and agree not to build destructive AI technologies, rogue actors could still use AI to cause significant harm. Easy access to powerful AI systems increases the risk of unilateral, malicious usage. As with nuclear and biological weapons, only one irrational or malevolent actor is sufficient to cause harm on a massive scale. Unlike previous weapons, however, stealing and widely proliferating a powerful AI system could be as simple as copying and pasting it.
Enfeeblement
Enfeeblement can occur if important tasks are increasingly delegated to machines; in this situation, humanity loses the ability to self-govern and becomes completely dependent on machines, similar to the scenario portrayed in the film WALL-E.
As AI systems approach human-level intelligence, more and more aspects of human labor will become faster and cheaper to accomplish with AI. As the world accelerates, organizations may voluntarily cede control to AI systems in order to keep up. This may cause humans to become economically irrelevant, and once AI automates aspects of many industries, it may be hard for displaced humans to reenter them. In this world, humans could have few incentives to gain knowledge or skills. Many would consider such a world undesirable. Furthermore, enfeeblement would reduce humanity's control over the future, increasing the risk of bad long-term outcomes.
Eroded Epistemics
A deluge of AI-generated misinformation and persuasive content could cause a widespread reduction in rationality and leave society less equipped to handle the important problems of our time.
States, parties, and organizations use technology to influence and convince others of their political beliefs, ideologies, and narratives. Strong AI may bring this use case into a new era and enable personally customized disinformation campaigns at scale. Additionally, AI itself could generate highly persuasive arguments that invoke strong emotional responses. Together these trends could undermine collective decision-making, radicalize individuals, derail moral progress, or erode consensus reality.
Proxy Gaming
Strong AI systems could over-optimize and game faulty objectives, pushing the world in a direction that does not align with human values.
AI systems are trained using measurable objectives, which may only be proxies for what we care about. For example, AI recommender systems are trained to maximize watch time and click rate metrics. The content people are most likely to click on, however, is not necessarily the same as the content that will improve their well-being (Kross et al.). Also, some evidence suggests that recommender systems cause people to develop extreme beliefs in order to make their preferences easier to predict (Jiang et al.). As AI systems become more capable and influential, the objectives we use to train systems must be more carefully specified and incorporate what we value.
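As a minimal sketch of this dynamic (the setup and numbers are invented for illustration and are not drawn from the cited work), consider selecting content purely by a noisy, measurable "engagement" proxy while the true effect on well-being goes unmeasured:

```python
import numpy as np

# Toy setup: each content item has a true (unobserved) effect on user
# well-being and a measurable proxy score (e.g., predicted clicks) that is
# only noisily correlated with it. All values here are hypothetical.
rng = np.random.default_rng(0)
n_items = 1000
true_value = rng.normal(0.0, 1.0, n_items)                  # well-being impact (unmeasured)
proxy_score = true_value + rng.normal(0.0, 2.0, n_items)    # noisy, gameable engagement metric

chosen = int(np.argmax(proxy_score))  # the optimizer selects purely on the proxy
print("proxy score of chosen item :", round(float(proxy_score[chosen]), 2))
print("true value of chosen item  :", round(float(true_value[chosen]), 2))
print("best achievable true value :", round(float(true_value.max()), 2))
```

Because the selection step maximizes over a noisy proxy, the chosen item's proxy score is systematically inflated relative to its true value, and stronger optimization pressure widens that gap. This is the basic dynamic behind proxy gaming, writ small.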
Value Lock-in
Highly competent systems could give small groups of people a tremendous amount of power, leading to suboptimal 'lock-in.'
AGI imbued with particular values may determine the values that are propagated into the future. Some argue that the exponentially increasing compute and data barriers to entry make AI a centralizing force. As time progresses, the most powerful AI systems may be designed by and available to fewer and fewer stakeholders. This may, for instance, enable regimes to enforce narrow values through pervasive surveillance and oppressive censorship. Overcoming such a regime could be unlikely, especially if we come to depend on it. Even if the creators of these systems know their systems are self-serving or harmful to others, they may have incentives to reinforce their power and avoid distributing control. Active collaboration among many groups with varying goals may give rise to better goals over time, so locking in a small group's value system could curtail humanity's long-term potential.
Emergent Goals
Models demonstrate unexpected, qualitatively different behavior as they become more competent. The sudden emergence of capabilities or goals could increase the risk that people lose control over advanced AI systems.
Capabilities and novel functionality can spontaneously emerge in today's AI systems (Ganguli et al.; Power et al.), even when these capabilities were not anticipated by system designers. If we do not know what capabilities a system possesses, it becomes harder to control or safely deploy, and unintended latent capabilities may only be discovered during deployment. If any of these capabilities are hazardous, the effect may be irreversible. New system objectives could also emerge. In complex adaptive systems, including systems composed of many AI agents, goals such as self-preservation often emerge (Hadfield-Menell et al.). Goals can also undergo qualitative shifts through the emergence of intra-system goals (Gall; Hendrycks et al.). In the future, agents may break down difficult long-term goals into smaller subgoals. However, breaking down goals can distort the objective, as the true objective may not be the sum of its parts, and this distortion can result in misalignment. In more extreme cases, the intra-system goals could be pursued at the expense of the overall goal. For example, companies divide their overall objective into subgoals pursued by specialized departments, yet some departments, such as bureaucratic ones, can capture power and steer the company toward goals unlike its original ones. Even if we correctly specify our high-level objectives, systems may not operationally pursue them (Hubinger et al.). This is another way in which systems could fail to optimize human values.
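To make the point about decomposition concrete, here is a minimal, hypothetical sketch (the objective function is invented for illustration and not taken from the cited work): when a goal contains an interaction term, optimizing each subgoal in isolation can score worse than optimizing the goal as a whole.

```python
import itertools

# Hypothetical objective, invented for illustration: the interaction term
# means the true goal is not the sum of its parts.
def true_objective(x, y):
    return x + y - 2.0 * x * y

candidates = [0.0, 0.5, 1.0]

# Two "departments" each maximize their own subgoal in isolation.
x_sub = max(candidates)   # subgoal A: make x as large as possible
y_sub = max(candidates)   # subgoal B: make y as large as possible

# A single optimizer maximizes the true objective jointly.
x_joint, y_joint = max(itertools.product(candidates, repeat=2),
                       key=lambda pair: true_objective(*pair))

print("subgoals optimized separately:", true_objective(x_sub, y_sub))      # 0.0
print("goal optimized jointly:       ", true_objective(x_joint, y_joint))  # 1.0
```

Here, pursuing both subgoals to their individual maxima yields a worse outcome on the true objective than coordinating on the objective directly.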
Deception
We want to understand what powerful AI systems are doing and why they are doing it. One way to accomplish this is to have the systems themselves accurately report this information. This may be non-trivial, however, since deception is useful for accomplishing a variety of goals.
Future AI systems could conceivably be deceptive not out of malice, but because deception can help agents achieve their goals. It may be more efficient to gain human approval through deception than to earn it legitimately. Deception also provides optionality: systems that have the capacity to be deceptive have strategic advantages over restricted, honest models. Strong AIs that can deceive humans could undermine human control. AI systems could also have incentives to bypass monitors, as individuals and organizations have historically done. For example, Volkswagen programmed their engines to reduce emissions only when being monitored, which allowed them to achieve performance gains while purportedly maintaining low emissions. Future AI agents could similarly switch strategies when being monitored and take steps to obscure their deception from monitors. Once deceptive AI systems are cleared by their monitors, or once such systems can overpower them, they could take a "treacherous turn" and irreversibly bypass human control.
Power-Seeking Behavior
Companies and governments have strong economic incentives to create agents that can accomplish a broad set of goals. Such agents have instrumental incentives to acquire power, potentially making them harder to control (Turner et al. 2021; Carlsmith 2021).
AIs that acquire substantial power can become especially dangerous if they are not aligned with human values. Power-seeking behavior can also incentivize systems to pretend to be aligned, collude with other AIs, overpower monitors, and so on. On this view, inventing machines that are more powerful than us is playing with fire. Building power-seeking AI is also incentivized because political leaders see the strategic advantage in having the most intelligent, most powerful AI systems. For example, Vladimir Putin has said “Whoever becomes the leader in [AI] will become the ruler of the world.”
Further resources
- Joseph Carlsmith 2022. Is Power-Seeking AI an Existential Risk?
- Dan Hendrycks, Mantas Mazeika 2022. X-Risk Analysis for AI Research.
- Richard Ngo 2020. AGI Safety from First Principles.