Scalable conflict-free bandit algorithm using a quantum optical setup
npj | quantum information
Article
Published in partnership with The University of New South Wales
https://doi.org/10.1038/s41534-026-01201-6
Kohei Konaka, André Röhm, Takatomo Mihana
& Ryoichi Horisaki
Quantum optics utilizes the unique properties of light for computation or communication. In this work,
we explore its ability to solve certain reinforcement learning tasks, with a particular view towards the
scalability of the approach. Our method utilizes the Orbital Angular Momentum (OAM) of photons to
solve the Competitive Multi-Armed Bandit (CMAB) problem while maximizing rewards. In particular,
we encode each player’s preferences in the OAM amplitudes, while the phases are optimized to avoid
conflicts. We find that the proposed system is capable of solving the CMAB problem with a scalable
number of options and demonstrates improved performance over existing techniques. Our method
utilizes quantum interference to guarantee conflict avoidance using purely physical attributes of light in
a way impossible for a classical setup. As an example of a system with simple rules for solving complex
tasks, our OAM-based method adds to the repertoire of functionality of quantum optics.
We are constantly forced to make decisions with limited information in the
real world. Yet, we can also learn from our experience to avoid bad choices
and prefer those that have previously led to good outcomes. Reinforcement
learning [1] is the framework that models such learning steps, in particular in
stochastic environments with unknown reward distributions. Among the
models of reinforcement learning, the Multi-Armed Bandit (MAB)
problem [2] is arguably the clearest framework for illustrating the central
tension of this class of tasks.
In the MAB problem, we consider a finite number of options, called
arms, which generate rewards according to certain probability distributions
when selected. The basic scenario involves a single player who repeatedly
chooses one arm from multiple options over a fixed number of trials, aiming
to maximize the cumulative rewards. The player does not know the
probability distributions from which the arms generate rewards. Therefore, the
player must first perform “exploration,” selecting each arm at least a few
times to identify those with higher expected rewards. However, to maximize
cumulative rewards, the player also needs to perform “exploitation,”
focusing on the arms with higher expected rewards. Balancing
exploration and exploitation effectively is crucial and is the core aspect
modeled by the MAB problem [3]. Due to the generality of this model, the
MAB problem has been applied in various fields, ranging from online
advertising optimization [4,5] to clinical trials of new medicines [6–8]. Well-known
algorithms for solving the MAB problem include the Softmax method [9],
Thompson sampling [10], and the Upper Confidence Bound method [11].
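The exploration-exploitation balance described above can be illustrated with a minimal Python sketch of softmax arm selection on a Bernoulli bandit. This is not the setup used in this paper: the function names, the temperature value, the Bernoulli reward model, and the sample-mean estimator are all assumptions made for illustration.

```python
import math
import random


def softmax_probs(estimates, temperature):
    """Turn estimated mean rewards into selection probabilities."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(estimates)
    weights = [math.exp((q - top) / temperature) for q in estimates]
    total = sum(weights)
    return [w / total for w in weights]


def run_softmax_bandit(reward_probs, trials, temperature=0.1, seed=0):
    """Play a Bernoulli bandit with softmax selection (illustrative only).

    reward_probs holds the true success probability of each arm. The player
    never reads it directly and only observes the sampled rewards.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms        # how often each arm was selected
    estimates = [0.0] * n_arms   # running sample mean of each arm's reward
    total_reward = 0
    for _ in range(trials):
        # Exploration: every arm keeps a nonzero selection probability.
        # Exploitation: arms with higher estimates are selected more often.
        probs = softmax_probs(estimates, temperature)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = 1 if rng.random() < reward_probs[arm] else 0
        counts[arm] += 1
        # Incremental update of the sample mean for the selected arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return counts, estimates, total_reward


if __name__ == "__main__":
    counts, estimates, total = run_softmax_bandit([0.2, 0.8], trials=2000)
    print("selections per arm:", counts)
    print("estimated rewards:", [round(q, 3) for q in estimates])
    print("cumulative reward:", total)
```

In this sketch, the temperature parameter sets the balance: a large value makes the selection close to uniform (more exploration), while a small value concentrates the selection on the arm with the highest current estimate (more exploitation).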
One extension of the MAB problem considers multiple
players, particularly addressing how rewards are divided when conflicts in
selection (referred to as selection conflicts) occur. This problem is known as
the Competitive Multi-Armed Bandit (CMAB) problem [12–15]. In the CMAB
problem, the goal is to maximize the sum of cumulative rewards across all
players. Therefore, each player must engage in both “exploration” and
“exploitation” while avoiding selection conflicts. The CMAB problem is
relevant to various applications such as frequency allocation in wireless
communications [16], where selection conflicts result in performance
degradation of individual devices due to multiple devices using the same
frequency band. Direct communication between players (e.g., the wireless
devices) is often undesirable from the perspective of time and energy
efficiency. Consequently, a common assumption in the CMAB problem, which
this paper also adopts, is that players cannot directly communicate to share selection
information (which arms players select and whether the selected arms
generate rewards). In naive extensions of algorithms used for the
MAB problem, it is extremely difficult to avoid selection conflicts without
sharing selection information.
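To make the difficulty concrete, the following Python sketch lets several players run the same softmax rule independently, without exchanging any selection information, and counts how often they collide. It is an illustrative toy model rather than the method of this paper. In particular, the equal split of a reward among colliding players, the Bernoulli rewards, and all function and parameter names are assumptions.

```python
import math
import random
from collections import Counter


def pick_softmax(estimates, temperature, rng):
    """Sample one arm index with softmax weights over the estimates."""
    top = max(estimates)
    weights = [math.exp((q - top) / temperature) for q in estimates]
    return rng.choices(range(len(estimates)), weights=weights)[0]


def run_independent_players(reward_probs, n_players, trials,
                            temperature=0.1, seed=0):
    """Simulate players that learn independently (no communication).

    Returns the fraction of trials with at least one selection conflict
    and the summed reward over all players.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [[0] * n_arms for _ in range(n_players)]
    estimates = [[0.0] * n_arms for _ in range(n_players)]
    conflicts = 0
    total_reward = 0.0
    for _ in range(trials):
        # Each player decides using only its own reward history.
        choices = [pick_softmax(estimates[p], temperature, rng)
                   for p in range(n_players)]
        crowd = Counter(choices)  # number of players on each selected arm
        if len(crowd) < n_players:
            conflicts += 1  # at least two players selected the same arm
        # Each selected arm pays out at most once per trial.
        payout = {arm: (1.0 if rng.random() < reward_probs[arm] else 0.0)
                  for arm in crowd}
        for p, arm in enumerate(choices):
            # Assumption: colliding players split the reward equally.
            share = payout[arm] / crowd[arm]
            counts[p][arm] += 1
            estimates[p][arm] += (share - estimates[p][arm]) / counts[p][arm]
            total_reward += share
    return conflicts / trials, total_reward


if __name__ == "__main__":
    rate, total = run_independent_players([0.2, 0.8], n_players=2, trials=2000)
    print("conflict rate:", round(rate, 3))
    print("summed reward:", round(total, 1))
```

Because every player in this toy model is drawn toward the arms it estimates to be best, one would expect frequent conflicts whenever a single arm is clearly more rewarding than the others, which is the failure mode of naive extensions noted above.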
In recent years, there has been a growing interest in studying bandit
problems from the perspective of quantum information and computation.
This line of research can be broadly categorized into two directions. The first
involves using quantum algorithms to solve classical bandit problems more
efficiently. Specific examples include quantum algorithms for the best-arm
identification problem [17,18] and regret minimization [19,20], as well as the
application of quantum neural networks to contextual bandits [21].
Furthermore, quantum algorithms for bandits in adversarial environments have
also been investigated [22]. The second direction applies classical bandit
algorithms to learn properties of quantum states efficiently, such as learning
quantum states under minimal regret [23–25] and best-arm identification for
entanglement detection [26]. While these studies have significantly advanced
the field through algorithmic speedups and theoretical frameworks, they
predominantly focus on the single-player MAB setting. The application of
quantum principles to the CMAB problem remains largely unexplored. This
highlights the necessity of a distinct approach: physical decision making,
which exploits physical phenomena directly to solve coordination problems,
rather than relying solely on software-based algorithms.
Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo-ku, Tokyo, Japan.
npj Quantum Information | (2026)12:44
In this context, recent studies have sought to solve the
MAB and CMAB problems through physical decision-making using the
properties of light. First, a decision-making system utilizing the
polarization state of single photons was proposed for the two-armed MAB
problem [27]. In this system, linearly polarized single photons are incident
on a polarization beam splitter (PBS), followed by the detection of their
polarization states. The detection result is mapped to the player’s arm
selection; for instance, observing vertical polarization corresponds to
selecting Arm L, while observing horizontal polarization corresponds to
selecting Arm R. Furthermore, by dynamically adjusting the angle of a
half-wave plate placed before the PBS based on the received rewards, the
probability of selecting the arm with a higher expected reward can be
increased. Although this system is limited to the two-armed MAB
problem, a subsequent study proposed a hierarchical architecture to
improve scalability regarding the number of arms [28]. Regarding the
CMAB problem, a collective decision-making system utilizing the
quantum interference of two polarized photons has also been
proposed [29]. Designed for the two-player, two-armed CMAB scenario,
this system (...truncated)