Scalable photonic reinforcement learning by time-division multiplexing of laser chaos
www.nature.com/scientificreports
OPEN
Received: 13 April 2018
Accepted: 5 July 2018
Published: xx xx xxxx
Makoto Naruse1, Takatomo Mihana2, Hirokazu Hori3, Hayato Saigo4, Kazuya Okamura5,
Mikio Hasegawa6 & Atsushi Uchida2
Reinforcement learning involves decision-making in dynamic and uncertain environments and
constitutes a crucial element of artificial intelligence. In our previous work, we experimentally
demonstrated that the ultrafast chaotic oscillatory dynamics of lasers can be used to efficiently solve
the two-armed bandit problem, which requires decision-making concerning a class of difficult trade-offs called the exploration–exploitation dilemma. However, only two selections were employed in that
research; hence, the scalability of the laser-chaos-based reinforcement learning should be clarified.
In this study, we demonstrated a scalable, pipelined principle of resolving the multi-armed bandit
problem by introducing time-division multiplexing of chaotically oscillated ultrafast time series. We present
experimental demonstrations in which bandit problems with up to 64 arms were successfully solved,
with laser chaos time series significantly outperforming quasiperiodic signals, computer-generated pseudorandom numbers, and coloured noise. Detailed analyses are also provided that
include performance comparisons among laser chaos signals generated in different physical conditions,
which coincide with the diffusivity inherent in the time series. This study paves the way for ultrafast
reinforcement learning by taking advantage of the ultrahigh bandwidth of light waves and practical
enabling technologies.
Recently, the use of photonics for information processing and artificial intelligence has been intensively studied
by exploiting the unique physical attributes of photons. The latest examples include a coherent Ising machine for
combinatorial optimization1, photonic reservoir computing to perform complex time-series predictions2,3, and
ultrafast random number generation using chaotic dynamics in lasers4,5 in which the ultrahigh bandwidth attributes of light bring novel advantages. Reinforcement learning, also called decision-making, is another important
branch of research which involves making decisions promptly and accurately in uncertain, dynamically changing
environments6 and constitutes the foundation of a variety of applications ranging from communication infrastructures7,8 and robotics9 to computer gaming10.
The multi-armed bandit problem (MAB) is known to be a fundamental reinforcement learning problem in which the goal is to maximize the total reward from multiple slot machines whose reward probabilities
are unknown and could dynamically change6. To solve the MAB, it is necessary to explore higher-reward slot
machines. However, too much exploration may result in excessive loss, whereas overly hasty decision-making or
insufficient exploration may lead to missing the best machine; thus, there is a trade-off referred to as the exploration–exploitation dilemma11.
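The trade-off described above can be made concrete with a small simulation. The following is a minimal sketch and not the method of this study: it uses the conventional epsilon-greedy rule driven by pseudorandom numbers, and the names `run_epsilon_greedy`, `reward_probs`, and `epsilon` are illustrative assumptions.

```python
import random


def run_epsilon_greedy(reward_probs, n_plays, epsilon, seed=0):
    """Play a Bernoulli multi-armed bandit with the epsilon-greedy rule.

    reward_probs: hit probability of each slot machine (unknown to the player).
    epsilon: fraction of plays spent on random exploration.
    Returns (total_reward, plays_per_arm).
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    plays = [0] * n_arms
    wins = [0] * n_arms
    total_reward = 0
    for _ in range(n_plays):
        # Explore at random with probability epsilon, and also until
        # every arm has been tried at least once.
        if rng.random() < epsilon or 0 in plays:
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest observed hit rate.
            arm = max(range(n_arms), key=lambda i: wins[i] / plays[i])
        reward = 1 if rng.random() < reward_probs[arm] else 0
        plays[arm] += 1
        wins[arm] += reward
        total_reward += reward
    return total_reward, plays


if __name__ == "__main__":
    total, plays = run_epsilon_greedy([0.2, 0.8], n_plays=5000, epsilon=0.1)
    print("total reward:", total, "plays per arm:", plays)
```

A large `epsilon` wastes plays on inferior machines, whereas a very small one risks settling on a machine that merely looked good early on; this is the dilemma in miniature.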
1Network System Research Institute, National Institute of Information and Communications Technology, 4-2-1
Nukui-kita, Koganei, Tokyo, 184-8795, Japan. 2Department of Information and Computer Sciences, Saitama
University, 255 Shimo-Okubo, Sakura-ku, Saitama City, Saitama, 338-8570, Japan. 3Interdisciplinary Graduate
School, University of Yamanashi, Takeda, Kofu, Yamanashi, 400-8510, Japan. 4Nagahama Institute of Bio-Science and
Technology, 1266 Tamura, Nagahama, Shiga, 526-0829, Japan. 5Graduate School of Informatics, Nagoya University,
Furo, Chikusa, Nagoya, Aichi, 464-8601, Japan. 6Department of Electrical Engineering, Tokyo University of Science,
6-3-1 Niijuku, Katsushika, Tokyo, 125-8585, Japan. Correspondence and requests for materials should be addressed
to M.N. (email: )
Scientific Reports | (2018) 8:10890 | DOI:10.1038/s41598-018-29117-y
In our previous study, we experimentally demonstrated that the ultrafast chaotic oscillatory dynamics of
lasers2–5 can be used to solve the MAB efficiently12,13. With a chaotic time series generated by a semiconductor
laser with a delayed feedback sampled at a maximum rate of 100 GSample/s followed by a digitization mechanism
with a variable threshold, ultrafast, adaptive, and accurate decision-making was demonstrated. Such ultrafast
decision-making is unachievable using conventional algorithms on digital computers11,14,15 that rely on pseudorandom numbers. It was also demonstrated that the decision-making performance is maximized by using an
optimal sampling interval that exactly coincides with the negative autocorrelation inherent in the chaotic time
series12. Moreover, even when assuming that pseudorandom numbers and coloured noise were available in such
a high-speed domain, the laser chaos method outperformed these alternatives; that is, chaotic dynamics yields
superior decision-making abilities12.
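The link between the sampling interval and the autocorrelation of the time series can be explored numerically. The sketch below estimates the sample autocorrelation of a series and locates the lag at which it is most negative. It is only an illustration under stated assumptions: `stand_in_signal` produces a noisy cosine, not laser chaos data, and all function names are hypothetical.

```python
import math
import random


def stand_in_signal(n, period=20, noise=0.1, seed=0):
    """Noisy cosine used purely as a placeholder for a measured time series."""
    rng = random.Random(seed)
    return [math.cos(2 * math.pi * i / period) + noise * rng.gauss(0, 1)
            for i in range(n)]


def autocorrelation(series, lag):
    """Sample autocorrelation coefficient of `series` at a non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var


def most_negative_lag(series, max_lag):
    """Return the lag in 1..max_lag with the lowest autocorrelation."""
    return min(range(1, max_lag + 1),
               key=lambda k: autocorrelation(series, k))


if __name__ == "__main__":
    signal = stand_in_signal(2000)
    lag = most_negative_lag(signal, max_lag=15)
    print("most negative autocorrelation at lag", lag,
          "value", round(autocorrelation(signal, lag), 3))
```

Applied to a measured chaotic waveform, the same scan would give a candidate sampling interval to compare against the interval at which decision-making performance peaks.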
However, only two options, or slot machines, were employed in the MAB investigated therein; that is, the
two-armed bandit problem was studied. Practical applications strongly demand a scalable principle and technologies for an N-armed bandit, where N
is a natural number. In addition, detailed insights into the
relations between the resulting decision-making abilities and properties of chaotic signal trains should be pursued
to achieve deeper physical understanding as well as performance optimization at the physical or photonic device
level.
In this study, we experimentally demonstrated a scalable photonic reinforcement learning principle based on
ultrafast chaotic oscillatory dynamics in semiconductor lasers. Taking advantage of the high-bandwidth attributes
of chaotic lasers, we incorporated the concept of time-division multiplexing into the decision-making strategy;
specifically, consecutively sampled chaotic signals were used in the proposed method to determine the identity of
the slot machine in a binary digit form.
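The idea of identifying a slot machine in binary digit form from consecutively sampled signals can be sketched as follows. This is a minimal illustration under assumptions, not the exact procedure of this study: the function name `select_arm`, the layout of `thresholds` (keyed by the bits decided so far), the default threshold of 0.0, and the convention that a sample above the threshold yields bit 1 are all hypothetical. The rule by which thresholds are adapted from rewards is omitted.

```python
def select_arm(samples, thresholds):
    """Map M consecutive samples to an arm index in the range 0 .. 2**M - 1.

    samples: M consecutively sampled signal levels, one per binary digit.
    thresholds: dict mapping the tuple of bits decided so far to the
        threshold used for the next bit (assumed layout); entries that
        are missing default to 0.0.
    """
    bits = ()
    for sample in samples:
        threshold = thresholds.get(bits, 0.0)
        bit = 1 if sample > threshold else 0  # assumed convention
        bits += (bit,)
    # Assemble the arm index with the first decided bit as the most
    # significant digit.
    arm = 0
    for bit in bits:
        arm = (arm << 1) | bit
    return arm


if __name__ == "__main__":
    # Three samples address one of 2**3 = 8 arms.
    print(select_arm([0.5, -0.2, 0.1], thresholds={}))
```

Under this scheme, a bandit with 64 arms needs six samples per decision, so the number of samples grows only logarithmically with the number of arms.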
In the recent literature on photonic decision-making, near-field-mediated optical excitation transfer16,17 and
single photon18,19 methods have been discussed; the former technique involves pursuing the diffraction-limit-free
spatial resolution20, whereas the latter reveals the benefits of the wave–particle duality of single light quanta21. A
promising approach for achieving scalability by means of near-field-coupled excitation transfer or single photons
is spatial parallelism; indeed, a hierarchical principle has been successfully demonstrated experimentally in solving the four-armed bandit problem using single photons19. In contrast, the high-bandwidth attributes of chaotic
lasers accommodate time-division multiplexing and have been successfully used in optical communications22.
In this study, we transformed the hierarchical decision-making strategy19 into the time domain, transcending
the barrier toward s (...truncated)