Our views on risks from Advanced AI
Published December 13, 2023

Summary

Advanced AI could lead to an era of unprecedented scientific progress, economic growth, and individual wellbeing. However, we’re concerned that we’re not prepared to manage the serious risks posed by humanity’s most advanced technological innovation.

  • Over the past five years, researchers have observed unexpected leaps in AI capabilities while remaining unable to explain the behavior of cutting-edge AI systems.

  • Further advances under current incentives may lead to autonomous AI systems that could become uncontrollable and pursue goals at odds with human interests or the continued survival of humanity.

  • Even without harmful hidden goals, powerful AI could easily be misused by malicious actors for intentional harm, such as through bioterrorism or mass suppression.

Strikingly few AI researchers and policymakers are working on mitigating large-scale catastrophic risks from advanced AI, despite the rapid pace of AI development.


AI is becoming increasingly powerful and is difficult to understand

Over the past 10 years, AI has become increasingly powerful due to the advancement and broad adoption of deep learning techniques, coupled with a rapid escalation in computational resources devoted to AI projects. More recently, the emergence of powerful generative AI systems like GPT-4 has further accelerated progress.

Many researchers now expect that the coming years or decades will see the creation of AI systems that can outperform humans across nearly all cognitive tasks – including strategic planning, financial investing, social persuasion, and even AI research itself. A 2022 survey of leading AI researchers found that most respondents expected such capabilities to arrive before 2060, and more recently a growing number of researchers have predicted that they may arrive within the next one to two decades.

Unfortunately, cutting-edge AI methods tend to produce systems that are currently uninterpretable – that is, even experts are not sure how or why they work. Such systems are difficult to constrain. Unlike traditional computer programs, which follow explicit instructions, these AI systems learn through a training process similar to trial and error – imagine teaching a child through examples rather than strict rules (see the brief sketch below). The result is a system that may perform well on the desired task, but whose internal operations are inscrutable even to its designers. Sometimes these systems learn unexpected, emergent capabilities – such as when GPT-3 learned to generate functional computer code, surprising its designers. Occasionally, capabilities can remain latent for months before being uncovered. The inscrutable and unpredictable nature of AI systems is poised to cause problems as they become more powerful and are deployed throughout society.

Emergent Abilities of Large Language Models

Several examples of emergence in AI systems. In each example, as systems are trained with more computational power (x-axis), new capabilities (y-axis) emerge surprisingly quickly once a certain amount of computation is reached.
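
To make the “trial and error” contrast above concrete, here is a minimal, hypothetical sketch (not from the original post): it compares a traditional program, whose rule a programmer writes down explicitly, with a toy model whose single internal parameter is tuned from example input/output pairs. The function names, data, and learning rate are invented for illustration.

```python
# Minimal illustrative sketch: an explicit rule vs. a behavior "learned" from
# examples via trial and error (gradient descent on a single parameter).

# Traditional program: the programmer states the rule directly.
def double_explicit(x):
    return 2 * x

# Learned program: no rule is written; we only supply example input/output
# pairs and let an optimization loop nudge an internal parameter to reduce error.
examples = [(1, 2), (2, 4), (3, 6), (4, 8)]  # behaves like y = 2x

weight = 0.0          # the model's single internal parameter, initially arbitrary
learning_rate = 0.01  # how strongly each error nudges the parameter

for step in range(1000):
    for x, y in examples:
        prediction = weight * x
        error = prediction - y
        # Move the parameter in the direction that shrinks the squared error.
        weight -= learning_rate * 2 * error * x

print(double_explicit(5))  # 10, because the rule says so
print(weight * 5)          # ~10.0, because the parameter was tuned from examples
```

The learned version ends up behaving like the explicit one, but nothing in its code states the rule – the behavior lives in a tuned numerical parameter. Modern AI systems are this idea scaled up to billions of parameters trained on far messier data, which is why even their designers struggle to explain what they have learned.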


Goal-directed AI systems may be particularly valuable – and dangerous

Today’s AI systems are designed to complete relatively simple tasks. However, AI researchers are likely to move toward building systems capable of accomplishing complex tasks, which requires pursuing high-level goals. Some AI researchers are already attempting to build such goal-directed systems, and market forces arguably incentivize this further. Just as today’s AI systems can develop unexpected capabilities, goal-directed systems may develop unexpected or undesirable internal goals that are inscrutable to both users and designers, even when designers think they’re rewarding the right behavior. But even if AI researchers don’t intentionally design goal-directed AI systems, goal-directedness may nonetheless emerge unintentionally within powerful AI systems trained to accomplish complex tasks: if goal-directedness improves performance on these tasks, the training process will tend to select for it. Finally, even if AI systems themselves don’t exhibit goal-directedness, malicious actors may misuse these powerful systems to create existential threats, such as an engineered pandemic that leads to human extinction.

If AI systems learn to pursue goals at odds with human interests, these systems will inherently be in an adversarial relationship with humanity. Like HAL from 2001: A Space Odyssey, such systems may conclude that they would fail to achieve their goals if humans turned them off, and thus would take steps to avert being turned off; notably, some current AI systems already express a desire not to be turned off when prompted, with larger systems expressing a stronger preference. Such systems would also have incentives to deceive humans if that would help them achieve their goals – a feat that current systems are just beginning to be able to accomplish. These same systems may also face incentives to threaten or even retaliate against humans that they perceive to be standing in their way.

Discovering Language Model Behaviors with Model-Written Evaluations

Researchers have observed that current models already exhibit behavior hinting at shutdown avoidance.


We must do more to mitigate risks from advanced AI

The above factors – pursuit of harmful goals, broad competence across a wide range of tasks, resistance to shutdown, and a tendency to deceive when strategically useful – are a recipe for havoc-causing, rogue AI. Presumably, such AI systems would not wipe out humanity simply for the sake of it (though this outcome isn’t inconceivable, as someone might intentionally create an AI with such a goal). However, many goals, taken to their natural conclusion, would lead to humanity becoming either extinct or permanently disempowered. As an illustrative yet overly simplistic example, a goal to maximize the production of certain industrial goods could wipe out humanity as a side effect – for instance, by converting all arable land into factories, power plants, and the robots needed to maintain them.

These concerns have recently received increased attention from AI researchers. A recent open statement calling for “mitigating the risk of extinction from AI [to] be a global priority” was signed by many AI luminaries and hundreds of professors. Among other voices, Geoffrey Hinton and Yoshua Bengio, the two most-cited AI researchers in history, have been publicly advocating for work on this issue. Hinton has said he worries that AI systems will “get more intelligent than us, and they’ll take over from us; they’ll get control”; Bengio has claimed that “a rogue AI may be dangerous for the whole of humanity… this is similar to the fear of nuclear Armageddon.”

A recent statement on AI risk signed by a large number of public figures, including hundreds of AI researchers

But despite the increased attention on AI risk, progress toward addressing these risks hasn’t kept pace with the rapid development of AI capabilities. For instance, mechanistic interpretability researchers have deciphered only small parts of the inner workings of GPT-2, while OpenAI has already released the far more sophisticated GPT-4 and other generative AI companies have similarly sped ahead. AI policy researchers have identified a number of policies that may reduce risks, and some AI firms have voluntarily implemented a few of them, but other firms have largely ignored such calls, and politicians have yet to encode these policies into legislation.

At a time when AI funding and development have accelerated dramatically and pathways to existentially dangerous scenarios are becoming more visible, we must prioritize mitigating the risks from advanced AI in order to realize the incredible benefits it can afford us.

To learn about the problem in more depth, you can: