Beyond Preferences in AI Alignment

Philosophical Studies
https://doi.org/10.1007/s11098-024-02249-w

Beyond Preferences in AI Alignment

Tan Zhi‑Xuan¹ · Micah Carroll² · Matija Franklin³ · Hal Ashton⁴

Accepted: 9 October 2024
© The Author(s) 2024

Abstract

The dominant practice of AI alignment assumes (1) that preferences are an adequate representation of human values, (2) that human rationality can be understood in terms of maximizing the satisfaction of preferences, and (3) that AI systems should be aligned with the preferences of one or more humans to ensure that they behave safely and in accordance with our values. Whether implicitly followed or explicitly endorsed, these commitments constitute what we term a preferentist approach to AI alignment. In this paper, we characterize and challenge the preferentist approach, describing conceptual and technical alternatives that are ripe for further research. We first survey the limits of rational choice theory as a descriptive model, explaining how preferences fail to capture the thick semantic content of human values, and how utility representations neglect the possible incommensurability of those values. We then critique the normativity of expected utility theory (EUT) for humans and AI, drawing upon arguments showing how rational agents need not comply with EUT, while highlighting how EUT is silent on which preferences are normatively acceptable. Finally, we argue that these limitations motivate a reframing of the targets of AI alignment: Instead of alignment with the preferences of a human user, developer, or humanity-writ-large, AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose assistant. Furthermore, these standards should be negotiated and agreed upon by all relevant stakeholders.
On this alternative conception of alignment, a multiplicity of AI systems will be able to serve diverse ends, aligned with normative standards that promote mutual benefit and limit harm despite our plural and divergent values.

Keywords Artificial intelligence · AI alignment · Preferences · Rational choice theory · Decision theory · Value theory

* Tan Zhi‑Xuan

¹ Massachusetts Institute of Technology, Cambridge, MA, USA
² University of California, Berkeley, CA, USA
³ University College London, London, UK
⁴ University of Cambridge, Cambridge, UK

1 Introduction

Recent progress in the capabilities of AI systems, as well as their increasing adoption in society, has led a growing number of researchers to worry about the impact of AI systems that are misaligned with human values. The roots of this concern vary, with some focused on the existential risks that may come with increasingly powerful autonomous systems (Carlsmith, 2022), while others take a broader view of the dangers and opportunities presented by potentially transformative AI technologies (Prunkl & Whittlestone, 2020; Lazar & Nelson, 2023). To address these challenges, AI alignment has emerged as a field, focused on the technical project of ensuring an AI system acts reliably in accordance with the values of one or more humans.

Yet terms like “human values” are notoriously imprecise, and it is unclear how to operationalize “values” in a sufficiently precise way that a machine could be aligned with them. One prominent approach is to define “values” in terms of human preferences, drawing upon the traditions of rational choice theory (Mishra, 2014), statistical decision theory (Berger, 2013), and their subsequent influence upon automated decision-making and reinforcement learning in AI (Sutton & Barto, 2018).
Whether explicitly adopted, or implicitly assumed in the guise of “reward” or “utility”, this preference-based approach dominates both the theory and practice of AI alignment. However, as proponents of this approach note themselves, aligning AI with human preferences faces numerous technical and philosophical challenges, including the problems of social choice, anti-social preferences, preference change, and the difficulty of inferring preferences from human behavior (Russell, 2019).

In this paper, we argue that to truly address such challenges, it is necessary to go beyond formulations of AI alignment that treat human preferences as ontologically, epistemologically, or normatively basic. Borrowing a term from the philosophy of welfare (Baber, 2011), we identify these formulations as part of a broadly preferentist approach to AI alignment, which we characterize in terms of four theses about the role of preferences in both descriptive and normative accounts of (human-aligned) decision-making:

Rational Choice Theory as a Descriptive Framework. Human behavior and decision-making is well-modeled as approximately maximizing the satisfaction of preferences, which can be represented as a utility or reward function.

Expected Utility Theory as a Normative Standard. Rational agency can be characterized as the maximization of expected utility. Moreover, AI systems should be designed and analyzed according to this normative standard.

Single-Principal Alignment as Preference Matching. For an AI system to be aligned to a single human principal, it should act so as to maximize the satisfaction of the preferences of that human.

Multi-Principal Alignment as Preference Aggregation. For AI systems to be aligned to multiple human principals, they should act so as to maximize the satisfaction of their aggregate preferences.

These four theses represent a cluster of views, not a unified theory of AI alignment.
Still, the ideas they represent are tightly linked, and most approaches to AI alignment assume two or more of the theses. For example, inverse reinforcement learning (Ng & Russell, 2000; Hadfield-Menell et al., 2016), reinforcement learning from human feedback (Akrour et al., 2014; Christiano et al., 2017; Ouyang et al., 2022), and direct preference optimization (Rafailov et al., 2024; Hejna et al., 2024) all assume that human preferences are well-modeled by a reward or utility function, which can then be optimized to produce aligned behavior. Similarly, worries about deceptive alignment (Hubinger et al., 2019) and goal misgeneralization (Di Langosco et al., 2022) are typically characterized as a mismatch between a learned utility function and the human-intended utility function; the solution is thus to ensure that the utility functions (and the preferences they represent) are closely matched. Of course, preferentism in AI alignment is not without its critics. Over the years, there has been considerable discussion as to whether its component theses are warranted (Shah, 2018; Eckersley, 2018; Hadfield-Menell & Hadfield, 2018; Wentworth, 2019, 2023; Gabriel, 2020; Vamplew et al., 2021; Garrabrant, 2022; Korinek & Balwit, 2022; Thornley, 2023), echoing similar debates in economics, decision theory, and philosophy. Nonetheless, it is apparent that the dominant practice (...truncated)