The shutdown problem: an AI engineering puzzle for decision theorists
Philosophical Studies
https://doi.org/10.1007/s11098-024-02153-3
Elliott Thornley1
Accepted: 8 April 2024
© The Author(s) 2024
Abstract
I explain and motivate the shutdown problem: the problem of designing artificial
agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent
or cause the pressing of the shutdown button, and (3) otherwise pursue goals
competently. I prove three theorems that make the difficulty precise. These theorems
suggest that agents satisfying some innocuous-seeming conditions will often try to
prevent or cause the pressing of the shutdown button, even in cases where it’s costly
to do so. I end by noting that these theorems can guide our search for solutions to the
problem.
Keywords The shutdown problem · Corrigibility · Constructive decision theory · AI
safety
1 Preamble
Tradition has it that decision theory splits into two branches. The descriptive branch
concerns how actual agents behave. The normative branch concerns how rational
agents behave. But there is also a lesser-known third branch: what we can call
‘constructive decision theory.’ It concerns how we want artificial agents to behave
and how we can create artificial agents that behave in those ways. I suggest that this
third branch is due for a growth spurt.
I make the case for studying constructive decision theory by explaining a
characteristic problem. The shutdown problem (Soares et al. 2015) is the problem of
designing artificial agents that (1) shut down when a shutdown button is pressed, (2)
don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise
pursue goals competently. This is not so much a philosophical problem as it is an
engineering problem. Nevertheless, I think philosophers and decision theorists
* Elliott Thornley
1
Global Priorities Institute, Faculty of Philosophy, University of Oxford, Trajan House, Mill
Street, Oxford OX2 0DJ, England
should consider it, for three reasons. First, the problem is important. As I argue in
the introduction, powerful artificial agents are on the horizon and it’s in our best
interests to ensure that they can be turned off. Second, the problem is interesting. I
hope this paper succeeds in conveying its interest. Third, philosophers and decision
theorists are well-placed to help solve the problem. I expect the solution to come in
the form of conditions governing artificial agents’ preferences, together with a proof
that these conditions give rise to shutdownable behaviour and a regimen for training
agents to satisfy the conditions. Philosophers and decision theorists have experience
supplying these kinds of conditions and proofs. We can ally with machine learning
engineers to design the training regimen.
2 Introduction
Call an artificial agent ‘shutdownable’ just in case it shuts down when we want it to
shut down. MuZero (Schrittwieser et al., 2020)—DeepMind’s game-playing AI—is
a shutdownable agent. We can say with some confidence that MuZero doesn’t know
that we humans could shut it down and can’t prevent us from shutting it down. And
so it doesn’t matter what (if anything) MuZero wants: simplifying slightly, whether
MuZero shuts down depends only on what we want.
That need not be true for all artificial agents. Imagine an agent—call it ‘Robot’—
that knows that we humans could shut it down and wants to achieve some goal.1
And imagine that Robot is powerful in the sense that it can interfere with our ability
to shut it down: perhaps Robot can disable its own off-switch. Powerful agents
like Robot won’t be shutdownable in the same way that MuZero is shutdownable.
Whether these agents shut down won’t depend only on what we want. It will also
depend on what they want.
Powerful artificial agents might not be far off. Frontier AI labs are now trying to
create agents that understand the wider world and pursue goals within it. As part of
this process, labs are connecting agents to the world in various ways: giving them
robot limbs, web-browsing abilities, and text-channels for communicating with
humans.2 Advanced agents could use these tools to prevent us shutting them down:
1
Or, if talk of artificial agents ‘knowing’ and ‘wanting’ is objectionable, we can imagine an agent that
acts like it knows that we humans could shut it down and acts like it wants to achieve some goal, in the
same way that MuZero acts like it knows that rooks are more valuable than knights and acts like it wants
to checkmate its opponent. From now on, I’ll often leave the ‘acts like’ implicit.
2
Google DeepMind (2023; Padalkar et al. 2023; Ahn et al. 2024), Google Research (2023), and Tesla
AI (2023) are each developing autonomous robots. Recent papers showcase AI-guided robots capable
of interpreting and carrying out multi-step instructions expressed in natural language (Ahn et al., 2022;
Brohan et al., 2023). Other papers report AI systems that can adapt to solve unfamiliar problems without
further training (Adaptive Agent Team, 2023), learn new physical tasks from as few as a hundred demonstrations (Bousmalis et al., 2023), beat human champions at drone racing (Kaufmann et al., 2023), and
perform well across domains as disparate as conversation, playing Atari, and stacking blocks with a robot
arm (Reed et al., 2022).
But the worry is not only about robots. Digital agents that resist shutdown (by copying themselves to
new servers, for example) would also be cause for concern. Future digital agents will likely be built on
they could disable their off-switches, make promises or threats, copy themselves to
new servers, block our access to their power-source, and many other things besides.
And although we cannot know for sure what goals these agents will have, many
goals incentivise preventing shutdown, for the simple reason that agents are better
able to achieve those goals by preventing shutdown (Omohundro, 2008, sec. 5;
Bostrom, 2012, sec. 2.1). As the AI researcher Stuart Russell puts it, ‘you can’t fetch
the coffee if you’re dead’ (2019, 141).
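The incentive just described can be illustrated with a toy expected-utility calculation. The sketch below is an illustration only and is not drawn from the paper's formal framework: it assumes a hypothetical agent that receives a fixed payoff (`goal_value`) only if it is not shut down, faces some probability of shutdown (`p_shutdown`), and can pay a cost (`prevention_cost`) to reduce that probability to zero. All function and parameter names are invented for the example.

```python
def expected_utility(p_shutdown: float, goal_value: float, cost: float = 0.0) -> float:
    """Expected utility of a hypothetical agent that achieves its goal
    (worth goal_value) only if it is not shut down, minus any cost paid."""
    return (1.0 - p_shutdown) * goal_value - cost


def prefers_preventing_shutdown(
    p_shutdown: float, goal_value: float, prevention_cost: float
) -> bool:
    """Return True if the toy agent gets strictly more expected utility from
    paying prevention_cost to rule out shutdown than from leaving the button alone."""
    leave_button_alone = expected_utility(p_shutdown, goal_value)
    # Assumption: paying the cost drives the shutdown probability to zero.
    prevent_shutdown = expected_utility(0.0, goal_value, prevention_cost)
    return prevent_shutdown > leave_button_alone


if __name__ == "__main__":
    # A cheap intervention is worth it for this agent; an expensive one is not.
    print(prefers_preventing_shutdown(p_shutdown=0.1, goal_value=10.0, prevention_cost=0.5))
    print(prefers_preventing_shutdown(p_shutdown=0.1, goal_value=10.0, prevention_cost=2.0))
```

On these assumptions, the toy agent prevents shutdown whenever the cost of doing so is less than `p_shutdown * goal_value`, whatever the goal happens to be. That is the sense in which many different goals can incentivise preventing shutdown.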
That’s a concerning prospect. If powerful artificial agents are coming, we want to
ensure that they’re both shutdownable (they shut down when we want them to shut
down) and useful (they otherwise pursue goals competently).3 Unfortunately (and
perhaps surprisingly), it’s hard to design powerful agents that are both shutdownable and useful. In this paper, I explain the difficulty. I take an axiomatic approach,
proving three theorems more general than others in the nascent literature on the
shutdown problem.4 These theorems suggest that agents satisfying some innocuous-seeming conditions will often try to prevent or cause the pressing of the shutdown
button, even in cases where it’s costly to do so.
Here’s a rough gloss on each theorem. The First Theorem links agents’ actions
to their preferences over o (...truncated)