Emergent cooperation from mutual acknowledgment exchange in multi-agent reinforcement learning
Autonomous Agents and Multi-Agent Systems
https://doi.org/10.1007/s10458-024-09666-5
(2024) 38:34
Thomy Phan (1, 2) · Felix Sommer (2) · Fabian Ritz (2) · Philipp Altmann (2) · Jonas Nüßlein (2) ·
Michael Kölle (2) · Lenz Belzner (3) · Claudia Linnhoff-Popien (2)
Accepted: 2 July 2024
This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may
apply 2024
Abstract
Peer incentivization (PI) is a recent approach where all agents learn to reward or penalize each other in a distributed fashion, which often leads to emergent cooperation. Current PI mechanisms implicitly assume a flawless communication channel in order to
exchange rewards. These rewards are directly incorporated into the learning process
without any chance to respond with feedback. Furthermore, most PI approaches rely on
global information, which limits scalability and applicability to real-world scenarios where
only local information is accessible. In this paper, we propose Mutual Acknowledgment
Token Exchange (MATE), a PI approach defined by a two-phase communication protocol
to exchange acknowledgment tokens as incentives to shape individual rewards mutually.
All agents condition their token transmissions on the locally estimated quality of their
own situations based on environmental rewards and received tokens. MATE is completely
decentralized and only requires local communication and information. We evaluate MATE
in three social dilemma domains. Our results show that MATE is able to achieve and maintain significantly higher levels of cooperation than previous PI approaches. In addition,
we evaluate the robustness of MATE in more realistic scenarios, where agents can deviate
from the protocol and communication failures can occur. We also evaluate the sensitivity
of MATE w.r.t. the choice of token values.
Keywords Multi-agent learning · Reinforcement learning · Mutual acknowledgments ·
Peer incentivization · Emergent cooperation
* Thomy Phan
1 University of Southern California, Los Angeles, USA
2 LMU Munich, Munich, Germany
3 Technische Hochschule Ingolstadt, Ingolstadt, Germany
1 Introduction
Many potential AI scenarios like autonomous driving [53], smart grids [14], or general IoT
scenarios [11], where multiple autonomous systems coexist within a shared environment,
can be naturally modeled as self-interested multi-agent systems (MAS) [7, 33]. In self-interested MAS, each autonomous system or agent attempts to achieve an individual goal while
adapting to its environment, i.e., other agents’ behavior [16]. Conflict and competition are
common in such systems due to opposing goals or shared resources [33, 41].
In order to maximize social welfare or efficiency in self-interested MAS, all agents
need to cooperate, which requires them to refrain from selfish and greedy behavior for the
greater good. The tension between individual and collective rationality is typically modeled as a social dilemma (SD) [46]. SDs can be temporally extended to sequential social
dilemmas (SSD) to model more realistic scenarios [30].
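The tension between individual and collective rationality can be made concrete with a two-player Prisoner's Dilemma. The following is a minimal sketch; the payoff values (3, 0, 5, 1) and the helper names are illustrative assumptions and are not taken from the paper.

```python
# Illustrative Prisoner's Dilemma (assumed payoff values, not from the paper).
# Actions: 0 = cooperate, 1 = defect.
# PAYOFF[(a_row, a_col)] = (reward of row player, reward of column player)
PAYOFF = {
    (0, 0): (3, 3),  # mutual cooperation
    (0, 1): (0, 5),  # row player is exploited
    (1, 0): (5, 0),  # row player exploits
    (1, 1): (1, 1),  # mutual defection
}


def best_response(opponent_action):
    """Action that maximizes the row player's own reward against a fixed opponent."""
    return max((0, 1), key=lambda a: PAYOFF[(a, opponent_action)][0])


def social_welfare(joint_action):
    """Sum of both players' rewards for a joint action."""
    return sum(PAYOFF[joint_action])


# Individual rationality: defection is the best response to either opponent action.
defect_dominant = all(best_response(b) == 1 for b in (0, 1))

# Collective rationality: mutual cooperation maximizes social welfare.
welfare = {joint: social_welfare(joint) for joint in PAYOFF}
```

Under these assumed payoffs, each player individually prefers to defect, yet the joint action with the highest social welfare is mutual cooperation, which is the defining property of a social dilemma.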
Multi-agent reinforcement learning (MARL) has become popular for modeling individually rational agents in SDs and SSDs to examine emergent behavior [7, 19, 30, 41, 48].
The goal of each agent is defined by an individual reward function. Non-cooperative game
theory and empirical studies have shown that naive MARL approaches commonly fail to
learn cooperative behavior due to individual selfishness and a lack of benevolence toward
other agents, which leads to defective behavior [3, 16, 30, 63].
One reason for mutual defection is non-stationarity, where naively learning agents do
not consider the learning dynamics of other agents but only adapt reactively [7, 22, 29,
60]. This can cause agents to defect from mutual cooperation, as studied extensively for
the Prisoner’s Dilemma [3, 16, 30, 46]. To mitigate this problem, some approaches propose to adapt the learning rate based on the outcome [6, 37, 66] or to incorporate information on other agents’ adaptations, like gradients or opponent models [16, 27, 32]. These
approaches are either tabular or require full observability to perceive each other’s behavior
and thus do not scale to complex domains. Furthermore, some approaches require knowledge about other agents’ objectives to estimate their degree of adaptation, thereby violating privacy [16, 32].
Another reason for mutual defection is the reward structure, which was found to be crucial for social intelligence [30, 54]. Prior work has shown that adequate reward formulations can lead to emergent cooperation in particular domains [4, 12, 13, 24, 42]. However,
finding an appropriate reward formulation for any domain is generally not trivial. Recent
approaches adapt the reward dynamically to drive all agents towards cooperation [24, 26,
27, 68]. Peer incentivization (PI) is a distributed approach where all agents learn to reward
or penalize each other, which often leads to emergent cooperation [36, 51, 64, 68]. Current
PI mechanisms implicitly assume a flawless communication channel in order to exchange
rewards. These rewards are assumed to be simply incorporated into the learning process
without any chance to respond with feedback. Furthermore, most PI approaches rely on
global information like joint actions [68], a central market function [51], or publicly available information [64], which limits scalability and applicability to real-world scenarios
where only local information is accessible.
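The basic idea of incorporating peer rewards directly into learning can be sketched as follows. This is a generic illustration under stated assumptions, not the mechanism of any specific cited approach: the function name `shape_rewards`, the matrix layout of `incentives`, and the omission of any cost to the sender are all hypothetical simplifications.

```python
def shape_rewards(env_rewards, incentives):
    """Add received peer incentives to each agent's environmental reward.

    env_rewards[j]   : environmental reward of agent j
    incentives[i][j] : incentive that agent i sends to agent j (hypothetical layout)

    Assumes a flawless channel: every incentive that is sent also arrives,
    and the receiver has no way to respond to it.
    """
    n = len(env_rewards)
    shaped = []
    for j in range(n):
        # Sum everything agent j receives from its peers (self-incentives ignored).
        received = sum(incentives[i][j] for i in range(n) if i != j)
        shaped.append(env_rewards[j] + received)
    return shaped


# Agent 0 earns an environmental reward and rewards both peers with 0.5 each.
example = shape_rewards(
    [1.0, 0.0, 0.0],
    [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]],
)
```

In this sketch the shaped rewards are used as-is by each learner, which mirrors the one-directional exchange described above: a dropped or manipulated entry in `incentives` would silently change the learning signal.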
Once emergent cooperation has been achieved, it needs to be maintained to withstand
social pressure, such as the tragedy of the commons, where many agents compete for scarce
resources such that the outcome is less efficient than possible [30, 41], or disturbances
like protocol defections or communication failures [3, 10]. Thus, reciprocity is important to establish stable cooperation, where social welfare is maintained over time without
deterioration by adequately responding to both cooperative and defective opponent behavior [2, 3, 47]. While reciprocity has already been considered in some prior learning rules
[6, 16, 32, 34], it has received very little attention in most PI approaches, where agents are
only able to exchange positive rewards to reach a consensus for cooperation, without any
penalization mechanism against potential exploitation [36, 51, 68]. The lack of reciprocity at the reward level can, therefore, lead to naive cooperation in PI, which can be easily
destabilized [28].
So far, penalization via negative rewards has been mostly provided by the environment rather than as a PI-based incentive [16, 28, 31]. However, the vast majority of SSD
work studies specialized environments like Harvest or Cleanup t (...truncated)