Scalable conflict-free bandit algorithm using a quantum optical setup

npj Quantum Information, Article. Published in partnership with The University of New South Wales. https://doi.org/10.1038/s41534-026-01201-6
Kohei Konaka, André Röhm, Takatomo Mihana & Ryoichi Horisaki
Quantum optics utilizes the unique properties of light for computation or communication. In this work, we explore its ability to solve certain reinforcement learning tasks, with a particular view towards the scalability of the approach. Our method utilizes the Orbital Angular Momentum (OAM) of photons to solve the Competitive Multi-Armed Bandit (CMAB) problem while maximizing rewards. In particular, we encode each player’s preferences in the OAM amplitudes, while the phases are optimized to avoid conflicts. We find that the proposed system is capable of solving the CMAB problem with a scalable number of options and demonstrates improved performance over existing techniques. Our method utilizes quantum interference to guarantee conflict avoidance using purely physical attributes of light in a way impossible for a classical setup. As an example of a system with simple rules for solving complex tasks, our OAM-based method adds to the repertoire of functionality of quantum optics.
We are constantly forced to make decisions with limited information in the real world. Yet, we can also learn from our experience to avoid bad choices and prefer those that have previously led to good outcomes. Reinforcement learning1 is the framework that models such learning steps, in particular in stochastic environments with unknown reward distributions. Among the models of reinforcement learning, the Multi-Armed Bandit (MAB) problem2 is arguably the clearest framework for illustrating the central tension of this class of tasks.
In the MAB problem, we consider a finite number of options, called arms, which generate rewards according to certain probability distributions when selected. The basic scenario involves a single player who repeatedly chooses one arm from multiple options over a fixed number of trials, aiming to maximize the cumulative rewards. The player does not know the probability distributions from which the arms generate rewards. Therefore, the player must first perform “exploration,” selecting each arm at least a few times to identify those with higher expected rewards. However, to maximize cumulative rewards, the player also needs to perform “exploitation,” focusing on selecting the arm with higher expected rewards. Balancing exploration and exploitation effectively is crucial and the core aspect modeled by the MAB problem3. Due to the generality of this model, the MAB problem has been applied in various fields, ranging from online advertising optimization4,5 to clinical trials of new medicines6–8. Well-known algorithms for solving the MAB problem include the Softmax method9, Thompson sampling10, and the Upper confidence bound method11.
One extension of the MAB problem involves considering multiple players, particularly addressing the issue of reward division when conflicts in selection (referred to as selection conflicts) occur. This problem is known as the Competitive Multi-Armed Bandit (CMAB) problem12–15. In the CMAB problem, the goal is to maximize the sum of cumulative rewards across all players. Therefore, each player must engage in both “exploration” and “exploitation” while avoiding selection conflicts. The CMAB problem is relevant to various applications such as frequency allocation in wireless communications16, where selection conflicts result in performance degradation of individual devices due to multiple devices using the same frequency band.
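The bandit setting described above can be illustrated with a minimal classical sketch. It is not the method of this paper: it simply lets several independent players run a Softmax policy on Bernoulli arms and counts how often they collide. The function names, the Bernoulli reward model, the temperature value, and the equal-split rule for conflicting players are all assumptions made for illustration only.

```python
import math
import random


def softmax_probs(estimates, temperature):
    """Map estimated arm values to selection probabilities (Softmax policy)."""
    m = max(estimates)  # subtract the maximum for numerical stability
    weights = [math.exp((q - m) / temperature) for q in estimates]
    total = sum(weights)
    return [w / total for w in weights]


def run_cmab(reward_probs, n_players=2, n_trials=2000, temperature=0.1, seed=0):
    """Simulate independent Softmax players on Bernoulli arms.

    Assumption (hypothetical, for illustration): when several players pick
    the same arm, one reward draw is split equally among them.
    Returns (total reward over all players, fraction of trials with a conflict).
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [[0] * n_arms for _ in range(n_players)]
    estimates = [[0.0] * n_arms for _ in range(n_players)]
    total_reward = 0.0
    conflicts = 0
    for _ in range(n_trials):
        # Each player samples an arm from its own Softmax distribution,
        # without any knowledge of the other players' choices.
        choices = [
            rng.choices(range(n_arms),
                        weights=softmax_probs(estimates[p], temperature))[0]
            for p in range(n_players)
        ]
        if len(set(choices)) < n_players:
            conflicts += 1
        for arm in set(choices):
            pickers = [p for p in range(n_players) if choices[p] == arm]
            reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
            share = reward / len(pickers)  # equal split on conflict
            total_reward += reward
            for p in pickers:
                # Incremental mean of the rewards this player received.
                counts[p][arm] += 1
                estimates[p][arm] += (share - estimates[p][arm]) / counts[p][arm]
    return total_reward, conflicts / n_trials
```

Calling `run_cmab([0.9, 0.7, 0.2])` returns the summed reward and the conflict fraction for two players. Nothing in this sketch coordinates the players, so they are free to favor the same arm; it only makes the notions of exploration, exploitation, and selection conflict concrete.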
Direct communication between players (e.g., the wireless devices) is often undesirable from the perspective of time and energy efficiency. Consequently, a common assumption in the CMAB problem, which this paper also adopts, is that players cannot communicate to share selection information (which arms players select and whether the selected arms generate rewards) directly. In naive extensions of algorithms used for the MAB problem, it is extremely difficult to avoid selection conflicts without sharing selection information.
In recent years, there has been a growing interest in studying bandit problems from the perspective of quantum information and computation. This line of research can be broadly categorized into two directions. The first involves using quantum algorithms to solve classical bandit problems more efficiently. Specific examples include quantum algorithms for the best arm identification problem17,18 and regret minimization19,20, as well as the application of quantum neural networks to contextual bandits21. Furthermore, quantum algorithms for bandits in adversarial environments have also been investigated22. The second direction applies classical bandit algorithms to learn properties of quantum states efficiently, such as learning quantum states under minimal regret23–25 and best arm identification for entanglement detection26. While these studies have significantly advanced the field through algorithmic speedups and theoretical frameworks, they predominantly focus on the single-player MAB setting. The application of quantum principles to the CMAB problem remains largely unexplored. This highlights the necessity of a distinct approach: physical decision making, which exploits physical phenomena directly to solve coordination problems, rather than relying solely on software-based algorithms.
Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo-ku, Tokyo, Japan. npj Quantum Information | (2026) 12:44
In this context, recent studies have been conducted to solve the MAB and CMAB problems through physical decision-making using the properties of light. First, a decision-making system utilizing the polarization state of single photons was proposed for the two-armed MAB problem27. In this system, linearly polarized single photons are incident on a polarization beam splitter (PBS), followed by the detection of their polarization states. The detection result is mapped to the player’s arm selection; for instance, observing vertical polarization corresponds to selecting Arm L, while observing horizontal polarization corresponds to selecting Arm R. Furthermore, by dynamically adjusting the angle of a half-wave plate placed before the PBS based on the received rewards, the probability of selecting the arm with a higher expected reward can be increased. Although this system is limited to the two-armed MAB problem, a subsequent study proposed a hierarchical architecture to improve scalability regarding the number of arms28. Regarding the CMAB problem, a collective decision-making system utilizing the quantum interference of two polarized photons has also been proposed29. Designed for the two-player, two-armed CMAB scenario, this system (...truncated)