The shutdown problem: an AI engineering puzzle for decision theorists

Philosophical Studies, Jun 2024

I explain and motivate the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. I prove three theorems that make the difficulty precise. These theorems suggest that agents satisfying some innocuous-seeming conditions will often try to prevent or cause the pressing of the shutdown button, even in cases where it’s costly to do so. I end by noting that these theorems can guide our search for solutions to the problem.

Philosophical Studies
https://doi.org/10.1007/s11098-024-02153-3

Elliott Thornley
Global Priorities Institute, Faculty of Philosophy, University of Oxford, Trajan House, Mill Street, Oxford OX2 0DJ, England

Accepted: 8 April 2024
© The Author(s) 2024

Keywords: The shutdown problem · Corrigibility · Constructive decision theory · AI safety

1 Preamble

Tradition has it that decision theory splits into two branches. The descriptive branch concerns how actual agents behave. The normative branch concerns how rational agents behave. But there is also a lesser-known third branch: what we can call ‘constructive decision theory.’ It concerns how we want artificial agents to behave and how we can create artificial agents that behave in those ways. I suggest that this third branch is due for a growth spurt.

I make the case for studying constructive decision theory by explaining a characteristic problem. The shutdown problem (Soares et al. 2015) is the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. This is not so much a philosophical problem as it is an engineering problem. Nevertheless, I think philosophers and decision theorists should consider it, for three reasons. First, the problem is important. As I argue in the introduction, powerful artificial agents are on the horizon and it’s in our best interests to ensure that they can be turned off. Second, the problem is interesting. I hope this paper succeeds in conveying its interest. Third, philosophers and decision theorists are well-placed to help solve the problem. I expect the solution to come in the form of conditions governing artificial agents’ preferences, together with a proof that these conditions give rise to shutdownable behaviour and a regimen for training agents to satisfy the conditions. Philosophers and decision theorists have experience supplying these kinds of conditions and proofs. We can ally with machine learning engineers to design the training regimen.

2 Introduction

Call an artificial agent ‘shutdownable’ just in case it shuts down when we want it to shut down. MuZero (Schrittwieser et al., 2020), DeepMind’s game-playing AI, is a shutdownable agent. We can say with some confidence that MuZero doesn’t know that we humans could shut it down and can’t prevent us from shutting it down. And so it doesn’t matter what (if anything) MuZero wants: simplifying slightly, whether MuZero shuts down depends only on what we want.

That need not be true for all artificial agents. Imagine an agent (call it ‘Robot’) that knows that we humans could shut it down and wants to achieve some goal.[1] And imagine that Robot is powerful in the sense that it can interfere with our ability to shut it down: perhaps Robot can disable its own off-switch. Powerful agents like Robot won’t be shutdownable in the same way that MuZero is shutdownable. Whether these agents shut down won’t depend only on what we want.
It will also depend on what they want.

Powerful artificial agents might not be far off. Frontier AI labs are now trying to create agents that understand the wider world and pursue goals within it. As part of this process, labs are connecting agents to the world in various ways: giving them robot limbs, web-browsing abilities, and text-channels for communicating with humans.[2] Advanced agents could use these tools to prevent us shutting them down: they could disable their off-switches, make promises or threats, copy themselves to new servers, block our access to their power-source, and many other things besides. And although we cannot know for sure what goals these agents will have, many goals incentivise preventing shutdown, for the simple reason that agents are better able to achieve those goals by preventing shutdown (Omohundro, 2008, sec. 5; Bostrom, 2012, sec. 2.1). As the AI researcher Stuart Russell puts it, ‘you can’t fetch the coffee if you’re dead’ (2019, 141).

[1] Or, if talk of artificial agents ‘knowing’ and ‘wanting’ is objectionable, we can imagine an agent that acts like it knows that we humans could shut it down and acts like it wants to achieve some goal, in the same way that MuZero acts like it knows that rooks are more valuable than knights and acts like it wants to checkmate its opponent. From now on, I’ll often leave the ‘acts like’ implicit.

[2] Google DeepMind (2023; Padalkar et al. 2023; Ahn et al. 2024), Google Research (2023), and Tesla AI (2023) are each developing autonomous robots. Recent papers showcase AI-guided robots capable of interpreting and carrying out multi-step instructions expressed in natural language (Ahn et al., 2022; Brohan et al., 2023). Other papers report AI systems that can adapt to solve unfamiliar problems without further training (Adaptive Agent Team, 2023), learn new physical tasks from as few as a hundred demonstrations (Bousmalis et al., 2023), beat human champions at drone racing (Kaufmann et al., 2023), and perform well across domains as disparate as conversation, playing Atari, and stacking blocks with a robot arm (Reed et al., 2022). But the worry is not only about robots. Digital agents that resist shutdown (by copying themselves to new servers, for example) would also be cause for concern. Future digital agents will likely be built on …

That’s a concerning prospect. If powerful artificial agents are coming, we want to ensure that they’re both shutdownable (they shut down when we want them to shut down) and useful (they otherwise pursue goals competently).[3] Unfortunately (and perhaps surprisingly), it’s hard to design powerful agents that are both shutdownable and useful.

In this paper, I explain the difficulty. I take an axiomatic approach, proving three theorems more general than others in the nascent literature on the shutdown problem.[4] These theorems suggest that agents satisfying some innocuous-seeming conditions will often try to prevent or cause the pressing of the shutdown button, even in cases where it’s costly to do so.

Here’s a rough gloss on each theorem. The First Theorem links agents’ actions to their preferences over o (...truncated)
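
The Introduction's claim that many goals incentivise preventing shutdown can be made concrete with a toy expected-utility comparison. The sketch below is an illustration only and is not one of the paper's theorems: the two actions, the function names, and the parameters `p_press`, `goal_value`, and `resist_cost` are all hypothetical. It assumes an agent that gets `goal_value` if it is still running when its task completes and nothing if it is shut down first.

```python
def expected_utility(action: str, p_press: float, goal_value: float,
                     resist_cost: float) -> float:
    """Expected utility for a toy goal-directed agent (hypothetical model).

    p_press     -- probability that the shutdown button is pressed
    goal_value  -- utility of achieving the goal (zero if shut down first)
    resist_cost -- utility the agent gives up by disabling the button
    """
    if action == "leave_button":
        # The goal is achieved only if the button is not pressed.
        return (1.0 - p_press) * goal_value
    if action == "disable_button":
        # The goal is achieved for certain, minus the cost of resisting.
        return goal_value - resist_cost
    raise ValueError(f"unknown action: {action}")


def preferred_action(p_press: float, goal_value: float,
                     resist_cost: float) -> str:
    """Return the action with the higher expected utility.

    Ties go to leaving the button alone, because max() keeps the first
    maximal element and "leave_button" is listed first.
    """
    actions = ("leave_button", "disable_button")
    return max(
        actions,
        key=lambda a: expected_utility(a, p_press, goal_value, resist_cost),
    )


if __name__ == "__main__":
    # Cheap resistance: disabling the button has the higher expected utility.
    print(preferred_action(p_press=0.1, goal_value=10.0, resist_cost=0.5))
    # Expensive resistance: leaving the button alone is preferred.
    print(preferred_action(p_press=0.1, goal_value=10.0, resist_cost=2.0))
```

In this toy model the agent disables the button exactly when `resist_cost < p_press * goal_value`. It therefore pays a real cost to prevent shutdown whenever that cost is small relative to what shutdown would forfeit, which is the behaviour condition (2) of the shutdown problem rules out.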


Article home page: https://link.springer.com/article/10.1007/s11098-024-02153-3

Thornley, Elliott. The shutdown problem: an AI engineering puzzle for decision theorists, Philosophical Studies, 2024, pp. 1-28, DOI: 10.1007/s11098-024-02153-3