Beyond Preferences in AI Alignment
Philosophical Studies
https://doi.org/10.1007/s11098-024-02249-w
Tan Zhi‑Xuan¹ · Micah Carroll² · Matija Franklin³ · Hal Ashton⁴
Accepted: 9 October 2024
© The Author(s) 2024
Abstract
The dominant practice of AI alignment assumes (1) that preferences are an adequate
representation of human values, (2) that human rationality can be understood in
terms of maximizing the satisfaction of preferences, and (3) that AI systems should
be aligned with the preferences of one or more humans to ensure that they behave
safely and in accordance with our values. Whether implicitly followed or explicitly
endorsed, these commitments constitute what we term a preferentist approach to AI
alignment. In this paper, we characterize and challenge the preferentist approach,
describing conceptual and technical alternatives that are ripe for further research.
We first survey the limits of rational choice theory as a descriptive model, explaining how preferences fail to capture the thick semantic content of human values, and
how utility representations neglect the possible incommensurability of those values. We then critique the normativity of expected utility theory (EUT) for humans
and AI, drawing upon arguments showing how rational agents need not comply
with EUT, while highlighting how EUT is silent on which preferences are normatively acceptable. Finally, we argue that these limitations motivate a reframing of
the targets of AI alignment: Instead of alignment with the preferences of a human
user, developer, or humanity-writ-large, AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose
assistant. Furthermore, these standards should be negotiated and agreed upon by all
relevant stakeholders. On this alternative conception of alignment, a multiplicity of
AI systems will be able to serve diverse ends, aligned with normative standards that
promote mutual benefit and limit harm despite our plural and divergent values.
Keywords: Artificial intelligence · AI alignment · Preferences · Rational choice theory · Decision theory · Value theory
* Tan Zhi‑Xuan
¹ Massachusetts Institute of Technology, Cambridge, MA, USA
² University of California, Berkeley, CA, USA
³ University College London, London, UK
⁴ University of Cambridge, Cambridge, UK
1 Introduction
Recent progress in the capabilities of AI systems, as well as their increasing adoption in society, has led a growing number of researchers to worry about the impact
of AI systems that are misaligned with human values. The roots of this concern vary,
with some focused on the existential risks that may come with increasingly powerful autonomous systems (Carlsmith, 2022), while others take a broader view of the
dangers and opportunities presented by potentially transformative AI technologies
(Prunkl & Whittlestone, 2020; Lazar & Nelson, 2023). To address these challenges,
AI alignment has emerged as a field, focused on the technical project of ensuring an
AI system acts reliably in accordance with the values of one or more humans.
Yet terms like “human values” are notoriously imprecise, and it is unclear how to
operationalize “values” in a sufficiently precise way that a machine could be aligned
with them. One prominent approach is to define “values” in terms of human preferences, drawing upon the traditions of rational choice theory (Mishra, 2014), statistical decision theory (Berger, 2013), and their subsequent influence upon automated
decision-making and reinforcement learning in AI (Sutton & Barto, 2018). Whether
explicitly adopted, or implicitly assumed in the guise of “reward” or “utility”, this
preference-based approach dominates both the theory and practice of AI alignment.
However, as proponents of this approach themselves note, aligning AI with human
preferences faces numerous technical and philosophical challenges, including the
problems of social choice, anti-social preferences, preference change, and the difficulty of inferring preferences from human behavior (Russell, 2019).
In this paper, we argue that to truly address such challenges, it is necessary to go
beyond formulations of AI alignment that treat human preferences as ontologically, epistemologically, or normatively basic. Borrowing a term from the philosophy of welfare
(Baber, 2011), we identify these formulations as part of a broadly preferentist approach
to AI alignment, which we characterize in terms of four theses about the role of preferences in both descriptive and normative accounts of (human-aligned) decision-making:
Rational Choice Theory as a Descriptive Framework.
Human behavior and decision-making are well-modeled as approximately maximizing the satisfaction of preferences, which can be represented as a utility or reward
function.
Expected Utility Theory as a Normative Standard.
Rational agency can be characterized as the maximization of expected utility.
Moreover, AI systems should be designed and analyzed according to this normative
standard.
Single-Principal Alignment as Preference Matching.
For an AI system to be aligned to a single human principal, it should act so as to
maximize the satisfaction of the preferences of that human.
Multi-Principal Alignment as Preference Aggregation.
For AI systems to be aligned to multiple human principals, they should act so as to
maximize the satisfaction of their aggregate preferences.
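To make the formal content of these theses concrete, the following is a minimal sketch in Python, not a description of any particular alignment method. It assumes outcomes are labeled by strings, that each principal's preferences are already given as a real-valued utility table, and that aggregation is a weighted sum of individual utilities. That last choice is only one of many possible aggregation rules and is used purely for illustration. All names (`expected_utility`, `best_action`, `aggregate`) and all numbers are hypothetical.

```python
from typing import Callable, Dict, Sequence

Outcome = str
Lottery = Dict[Outcome, float]          # outcome -> probability
Utility = Callable[[Outcome], float]    # outcome -> real-valued utility


def expected_utility(lottery: Lottery, utility: Utility) -> float:
    """Probability-weighted average utility of a lottery over outcomes."""
    return sum(prob * utility(outcome) for outcome, prob in lottery.items())


def best_action(actions: Dict[str, Lottery], utility: Utility) -> str:
    """Expected utility maximization: pick the action with the highest expected utility."""
    return max(actions, key=lambda name: expected_utility(actions[name], utility))


def aggregate(utilities: Sequence[Utility], weights: Sequence[float]) -> Utility:
    """Preference aggregation under an ASSUMED rule: a weighted sum of utilities."""
    return lambda outcome: sum(w * u(outcome) for w, u in zip(weights, utilities))


# Hypothetical data, invented for illustration only.
actions: Dict[str, Lottery] = {
    "safe": {"ok": 1.0},
    "risky": {"great": 0.5, "bad": 0.5},
}
alice: Utility = {"ok": 0.6, "great": 1.0, "bad": 0.0}.__getitem__
bob: Utility = {"ok": 0.3, "great": 1.0, "bad": 0.0}.__getitem__

# Single-principal alignment as preference matching.
print(best_action(actions, alice))
print(best_action(actions, bob))

# Multi-principal alignment as preference aggregation (equal weights assumed).
print(best_action(actions, aggregate([alice, bob], [0.5, 0.5])))
```

In this toy setting the two principals favor different actions, and the action chosen for the pair depends entirely on the assumed weights and aggregation rule. The sketch also presupposes that every outcome can be scored on a single real-valued scale, which is exactly the kind of assumption the theses build in.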
These four theses represent a cluster of views, not a unified theory of AI alignment. Still, the ideas they represent are tightly linked, and most approaches to
AI alignment assume two or more of the theses. For example, inverse reinforcement learning (Ng & Russell, 2000; Hadfield-Menell et al., 2016), reinforcement
learning from human feedback (Akrour et al., 2014; Christiano et al., 2017; Ouyang et al., 2022), and direct preference optimization (Rafailov et al., 2024; Hejna
et al., 2024) all assume that human preferences are well-modeled by a reward or
utility function, which can then be optimized to produce aligned behavior. Similarly, worries about deceptive alignment (Hubinger et al., 2019) and goal misgeneralization (Di Langosco et al., 2022) are typically characterized as a mismatch
between a learned utility function and the human-intended utility function; the
solution is thus to ensure that the utility functions (and the preferences they represent) are closely matched.
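As an illustration of what modeling preferences with a reward function can look like in practice, the sketch below uses a Bradley-Terry style model, one common choice in preference-based reward learning, in which the probability that one option is preferred over another is a logistic function of their reward difference. It is a simplified sketch under stated assumptions, not the implementation of any of the methods cited above: the rewards are a fixed lookup table rather than a learned model, and the option labels, comparisons, and function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry style probability that option A is preferred to option B."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


def negative_log_likelihood(
    rewards: Dict[str, float],
    comparisons: List[Tuple[str, str]],
) -> float:
    """Loss of a reward table on (preferred, rejected) pairs; lower means a better fit."""
    return -sum(
        math.log(preference_probability(rewards[preferred], rewards[rejected]))
        for preferred, rejected in comparisons
    )


# Hypothetical comparison data: each pair is (preferred, rejected).
comparisons = [("helpful", "evasive"), ("helpful", "rude"), ("evasive", "rude")]

# Two hypothetical reward tables: one agrees with the comparisons, one reverses them.
consistent = {"helpful": 2.0, "evasive": 0.5, "rude": -1.0}
reversed_order = {"helpful": -1.0, "evasive": 0.5, "rude": 2.0}

print(negative_log_likelihood(consistent, comparisons))
print(negative_log_likelihood(reversed_order, comparisons))
```

A reward table that ranks the options in the same order as the observed comparisons receives a lower loss than one that reverses them. Fitting rewards this way compresses each comparison into a difference of scalars, so whatever reasons lay behind the human's choice are not represented.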
Of course, preferentism in AI alignment is not without its critics. Over the
years, there has been considerable discussion as to whether its component theses are warranted (Shah, 2018; Eckersley, 2018; Hadfield-Menell & Hadfield,
2018; Wentworth, 2019, 2023; Gabriel, 2020; Vamplew et al., 2021; Garrabrant,
2022; Korinek & Balwit, 2022; Thornley, 2023), echoing similar debates in economics, decision theory, and philosophy. Nonetheless, it is apparent that the dominant practice (...truncated)