Emergent cooperation from mutual acknowledgment exchange in multi-agent reinforcement learning



Autonomous Agents and Multi-Agent Systems (2024) 38:34
https://doi.org/10.1007/s10458-024-09666-5

Emergent cooperation from mutual acknowledgment exchange in multi-agent reinforcement learning

Thomy Phan1,2 · Felix Sommer2 · Fabian Ritz2 · Philipp Altmann2 · Jonas Nüßlein2 · Michael Kölle2 · Lenz Belzner3 · Claudia Linnhoff-Popien2

Accepted: 2 July 2024
This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply 2024

Abstract

Peer incentivization (PI) is a recent approach where all agents learn to reward or penalize each other in a distributed fashion, which often leads to emergent cooperation. Current PI mechanisms implicitly assume a flawless communication channel in order to exchange rewards. These rewards are directly incorporated into the learning process without any chance to respond with feedback. Furthermore, most PI approaches rely on global information, which limits scalability and applicability to real-world scenarios where only local information is accessible. In this paper, we propose Mutual Acknowledgment Token Exchange (MATE), a PI approach defined by a two-phase communication protocol to exchange acknowledgment tokens as incentives to shape individual rewards mutually. All agents condition their token transmissions on the locally estimated quality of their own situations based on environmental rewards and received tokens. MATE is completely decentralized and only requires local communication and information. We evaluate MATE in three social dilemma domains. Our results show that MATE is able to achieve and maintain significantly higher levels of cooperation than previous PI approaches. In addition, we evaluate the robustness of MATE in more realistic scenarios, where agents can deviate from the protocol and communication failures can occur. We also evaluate the sensitivity of MATE w.r.t. the choice of token values.
Keywords Multi-agent learning · Reinforcement learning · Mutual acknowledgments · Peer incentivization · Emergent cooperation

* Thomy Phan
1 University of Southern California, Los Angeles, USA
2 LMU Munich, Munich, Germany
3 Technische Hochschule Ingolstadt, Ingolstadt, Germany

1 Introduction

Many potential AI scenarios like autonomous driving [53], smart grids [14], or general IoT scenarios [11], where multiple autonomous systems coexist within a shared environment, can be naturally modeled as self-interested multi-agent systems (MAS) [7, 33]. In self-interested MAS, each autonomous system or agent attempts to achieve an individual goal while adapting to its environment, i.e., other agents’ behavior [16]. Conflict and competition are common in such systems due to opposing goals or shared resources [33, 41].

In order to maximize social welfare or efficiency in self-interested MAS, all agents need to cooperate, which requires them to refrain from selfish and greedy behavior for the greater good. The tension between individual and collective rationality is typically modeled as a social dilemma (SD) [46]. SDs can be temporally extended to sequential social dilemmas (SSD) to model more realistic scenarios [30].

Multi-agent reinforcement learning (MARL) has become popular for modeling individually rational agents in SDs and SSDs to examine emergent behavior [7, 19, 30, 41, 48]. The goal of each agent is defined by an individual reward function. Non-cooperative game theory and empirical studies have shown that naive MARL approaches commonly fail to learn cooperative behavior due to individual selfishness and lacking benevolence toward other agents, which leads to defective behavior [3, 16, 30, 63].

One reason for mutual defection is non-stationarity, where naively learning agents do not consider the learning dynamics of other agents but only adapt reactively [7, 22, 29, 60].
This can cause agents to defect from mutual cooperation, as studied extensively for the Prisoner’s Dilemma [3, 16, 30, 46]. To mitigate this problem, some approaches propose to adapt the learning rate based on the outcome [6, 37, 66] or to incorporate information on other agents’ adaptations, like gradients or opponent models [16, 27, 32]. These approaches are either tabular or require full observability to perceive each other’s behavior and thus do not scale to complex domains. Furthermore, some approaches require knowledge about other agents’ objectives to estimate their degree of adaptation, thereby violating privacy [16, 32].

Another reason for mutual defection is the reward structure, which was found to be crucial for social intelligence [30, 54]. Prior work has shown that adequate reward formulations can lead to emergent cooperation in particular domains [4, 12, 13, 24, 42]. However, finding an appropriate reward formulation for any domain is generally not trivial. Recent approaches adapt the reward dynamically to drive all agents towards cooperation [24, 26, 27, 68].

Peer incentivization (PI) is a distributed approach where all agents learn to reward or penalize each other, which often leads to emergent cooperation [36, 51, 64, 68]. Current PI mechanisms implicitly assume a flawless communication channel in order to exchange rewards. These rewards are assumed to be simply incorporated into the learning process without any chance to respond with feedback. Furthermore, most PI approaches rely on global information like joint actions [68], a central market function [51], or publicly available information [64], which limits scalability and applicability to real-world scenarios where only local information is accessible.
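
As a rough illustration of the tension between individual and collective rationality discussed above, the following sketch encodes a one-shot Prisoner's Dilemma and checks each agent's best response with and without an additional reward for cooperating. The payoff values, the `incentive` parameter, and all function names are assumptions chosen for illustration and are not taken from the paper. The fixed cooperation bonus is a hand-coded stand-in for a peer incentive; it is neither a learned PI mechanism nor MATE.

```python
# Illustrative Prisoner's Dilemma payoffs (assumed values with T > R > P > S).
R, S, T, P = 3.0, 0.0, 5.0, 1.0  # reward, sucker, temptation, punishment
C, D = 0, 1                      # action indices: cooperate, defect

# PAYOFF[my_action][other_action] is my environmental reward.
PAYOFF = [[R, S], [T, P]]


def shaped_reward(my_action, other_action, incentive=0.0):
    """Environmental reward plus a hypothetical peer reward for cooperating.

    Simplification: the peer always pays `incentive` to a cooperating agent,
    and the cost of sending the incentive is ignored.
    """
    received = incentive if my_action == C else 0.0
    return PAYOFF[my_action][other_action] + received


def best_response(other_action, incentive=0.0):
    """Action that maximizes my shaped reward against a fixed opponent action."""
    return max((C, D), key=lambda a: shaped_reward(a, other_action, incentive))


def social_welfare(action_1, action_2):
    """Sum of both agents' environmental rewards (incentives excluded)."""
    return PAYOFF[action_1][action_2] + PAYOFF[action_2][action_1]


if __name__ == "__main__":
    for incentive in (0.0, 3.0):
        responses = [best_response(other, incentive) for other in (C, D)]
        print(f"incentive={incentive}: best responses to (C, D) = {responses}")
    print("welfare (C, C):", social_welfare(C, C))
    print("welfare (D, D):", social_welfare(D, D))
```

Under these assumed values, defection is the best response to either opponent action when no incentive is exchanged, although mutual cooperation yields higher social welfare than mutual defection. A cooperation bonus larger than max(T - R, P - S) = 2 makes cooperation the best response in this toy setting.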

Once emergent cooperation has been achieved, it needs to be maintained to withstand social pressure, such as the tragedy of the commons, where many agents compete for scarce resources such that the outcome is less efficient than possible [30, 41], or disturbances like protocol defections or communication failures [3, 10]. Thus, reciprocity is important to establish stable cooperation, where social welfare is maintained over time without deterioration by adequately responding to both cooperative and defective opponent behavior [2, 3, 47]. While reciprocity has already been considered in some prior learning rules [6, 16, 32, 34], it has received very little attention in most PI approaches, where agents are only able to exchange positive rewards to reach a consensus for cooperation, without any penalization mechanism against potential exploitation [36, 51, 68]. The lack of reciprocity at the reward level can, therefore, lead to naive cooperation in PI, which can be easily destabilized [28]. So far, penalization via negative rewards has been mostly provided by the environment rather than as a PI-based incentive [16, 28, 31]. However, the vast majority of SSD work studies specialized environments like Harvest or Cleanup t (...truncated)