Scalable photonic reinforcement learning by time-division multiplexing of laser chaos

Scientific Reports, Jul 2018

Reinforcement learning involves decision-making in dynamic and uncertain environments and constitutes a crucial element of artificial intelligence. In our previous work, we experimentally demonstrated that the ultrafast chaotic oscillatory dynamics of lasers can be used to efficiently solve the two-armed bandit problem, which requires decision-making concerning a class of difficult trade-offs called the exploration–exploitation dilemma. However, only two selections were employed in that research; hence, the scalability of laser-chaos-based reinforcement learning remained to be clarified. In this study, we demonstrated a scalable, pipelined principle for solving the multi-armed bandit problem by introducing time-division multiplexing of chaotically oscillating ultrafast time series. We present experimental demonstrations in which bandit problems with up to 64 arms were successfully solved and in which laser chaos time series significantly outperformed quasiperiodic signals, computer-generated pseudorandom numbers, and coloured noise. Detailed analyses are also provided, including performance comparisons among laser chaos signals generated under different physical conditions, the results of which coincide with the diffusivity inherent in the time series. This study paves the way for ultrafast reinforcement learning by taking advantage of the ultrahigh bandwidths of light waves and practical enabling technologies.

Article PDF: https://www.nature.com/articles/s41598-018-29117-y.pdf

Scientific Reports (2018) 8:10890 | DOI: 10.1038/s41598-018-29117-y | Received: 13 April 2018; Accepted: 5 July 2018

Makoto Naruse¹, Takatomo Mihana², Hirokazu Hori³, Hayato Saigo⁴, Kazuya Okamura⁵, Mikio Hasegawa⁶ & Atsushi Uchida²

¹ Network System Research Institute, National Institute of Information and Communications Technology, 4-2-1 Nukui-kita, Koganei, Tokyo, 184-8795, Japan.
² Department of Information and Computer Sciences, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama City, Saitama, 338-8570, Japan.
³ Interdisciplinary Graduate School, University of Yamanashi, Takeda, Kofu, Yamanashi, 400-8510, Japan.
⁴ Nagahama Institute of Bio-Science and Technology, 1266 Tamura, Nagahama, Shiga, 526-0829, Japan.
⁵ Graduate School of Informatics, Nagoya University, Furo, Chikusa, Nagoya, Aichi, 464-8601, Japan.
⁶ Department of Electrical Engineering, Tokyo University of Science, 6-3-1 Niijuku, Katsushika, Tokyo, 125-8585, Japan.
Correspondence and requests for materials should be addressed to M.N.

Recently, the use of photonics for information processing and artificial intelligence has been intensively studied by exploiting the unique physical attributes of photons. The latest examples include a coherent Ising machine for combinatorial optimization [1], photonic reservoir computing to perform complex time-series predictions [2,3], and ultrafast random number generation using chaotic dynamics in lasers [4,5] in which the ultrahigh bandwidth attributes of light bring novel advantages. Reinforcement learning, also called decision-making, is another important branch of research which involves making decisions promptly and accurately in uncertain, dynamically changing environments [6] and constitutes the foundation of a variety of applications ranging from communication infrastructures [7,8] and robotics [9] to computer gaming [10].

The multi-armed bandit problem (MAB) is known to be a fundamental reinforcement learning problem in which the goal is to maximize the total reward from multiple slot machines whose reward probabilities are unknown and could dynamically change [6]. To solve the MAB, it is necessary to explore higher-reward slot machines. However, too much exploration may result in excessive loss, whereas too quick decision-making or insufficient exploration may lead to missing the best machine; thus, there is a trade-off referred to as the exploration–exploitation dilemma [11].

In our previous study, we experimentally demonstrated that the ultrafast chaotic oscillatory dynamics of lasers [2–5] can be used to solve the MAB efficiently [12,13]. With a chaotic time series generated by a semiconductor laser with a delayed feedback sampled at a maximum rate of 100 GSample/s followed by a digitization mechanism with a variable threshold, ultrafast, adaptive, and accurate decision-making was demonstrated. Such ultrafast decision-making is unachievable using conventional algorithms on digital computers [11,14,15] that rely on pseudorandom numbers. It was also demonstrated that the decision-making performance is maximized by using an optimal sampling interval that exactly coincides with the negative autocorrelation inherent in the chaotic time series [12]. Moreover, even when assuming that pseudorandom numbers and coloured noise were available in such a high-speed domain, the laser chaos method outperformed these alternatives; that is, chaotic dynamics yields superior decision-making abilities [12].

However, only two options, or slot machines, were employed in the MAB investigated therein; that is, the two-armed bandit problem was studied. A scalable principle and technologies toward an N-armed bandit with N being a natural number are strongly demanded for practical applications. In addition, detailed insights into the relations between the resulting decision-making abilities and properties of chaotic signal trains should be pursued to achieve deeper physical understanding as well as performance optimization at the physical or photonic device level.

In this study, we experimentally demonstrated a scalable photonic reinforcement learning principle based on ultrafast chaotic oscillatory dynamics in semiconductor lasers.
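For readers less familiar with the problem setting, the multi-armed bandit and the kind of conventional, pseudorandom-number-based strategy referred to above can be sketched in a few lines of Python. This is only an illustrative baseline. The epsilon-greedy rule is a standard textbook strategy chosen here as an example, not necessarily one of the algorithms the authors cite. The names `BernoulliBandit` and `epsilon_greedy` and all parameter values are assumptions introduced for this sketch.

```python
import random


class BernoulliBandit:
    """N slot machines; machine i pays reward 1 with probability probs[i]."""

    def __init__(self, probs, seed=0):
        self.probs = list(probs)
        self.rng = random.Random(seed)

    def play(self, arm):
        # Reward is 1 (win) or 0 (loss); the player never sees probs directly.
        return 1 if self.rng.random() < self.probs[arm] else 0


def epsilon_greedy(bandit, n_plays, epsilon=0.1, seed=1):
    """Illustrative baseline (an assumption, not the paper's method).

    With probability epsilon a random machine is explored; otherwise the
    machine with the best empirical reward rate so far is exploited.
    Returns (total_reward, per-machine play counts).
    """
    rng = random.Random(seed)  # pseudorandom numbers drive the exploration
    n_arms = len(bandit.probs)
    counts = [0] * n_arms
    wins = [0] * n_arms
    total = 0
    for _ in range(n_plays):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            # Exploit; machines not yet played are treated as rate 0.
            arm = max(
                range(n_arms),
                key=lambda i: wins[i] / counts[i] if counts[i] else 0.0,
            )
        reward = bandit.play(arm)
        counts[arm] += 1
        wins[arm] += reward
        total += reward
    return total, counts


if __name__ == "__main__":
    # Hypothetical reward probabilities for a four-armed example.
    bandit = BernoulliBandit([0.2, 0.4, 0.6, 0.8])
    total, counts = epsilon_greedy(bandit, n_plays=2000)
    print("total reward:", total, "plays per machine:", counts)
```

A larger `epsilon` spends more plays on exploration and loses reward on poor machines, while a smaller one risks settling on a suboptimal machine; this is the exploration–exploitation dilemma in miniature.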
Taking advantage of the high-bandwidth attributes of chaotic lasers, we incorporated the concept of time-division multiplexing into the decision-making strategy; specifically, consecutively sampled chaotic signals were used in the proposed method to determine the identity of the slot machine in a binary digit form. In the recent literature on photonic decision-making, near-field-mediated optical excitation transfer [16,17] and single photon [18,19] methods have been discussed; the former technique involves pursuing the diffraction-limit-free spatial resolution [20], whereas the latter reveals the benefits of the wave–particle duality of single light quanta [21]. A promising approach for achieving scalability by means of near-field-coupled excitation transfer or single photons is spatial parallelism; indeed, a hierarchical principle has been successfully demonstrated experimentally in solving the four-armed bandit problem using single photons [19]. In contrast, the high-bandwidth attributes of chaotic lasers accommodate time-division multiplexing and have been successfully used in optical communications [22]. In this study, we transformed the hierarchical decision-making strategy [19] into the time domain, transcending the barrier toward s (...truncated)
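The time-division idea described above, in which consecutive samples fix the identity of the slot machine one binary digit at a time, can be illustrated with a minimal Python sketch. Several details are assumptions made for illustration, since this excerpt does not specify them: the bit convention (a sample above the threshold gives 1), the default threshold of 0, and the layout of the threshold table, which is keyed by the bits already decided to mirror a hierarchical strategy. The rule for adapting the thresholds from rewards is not given in this excerpt and is deliberately left out. The function name `select_arm` is hypothetical.

```python
def select_arm(samples, thresholds):
    """Choose one of 2**M machines from M consecutive signal samples.

    samples    : sequence of M signal values in sampling order.
    thresholds : dict mapping the tuple of bits decided so far to the
                 threshold for the next bit (hypothetical layout; missing
                 entries default to 0.0, which is an assumption).
    Returns (arm_index, bits), with the most significant bit decided first.
    """
    bits = []
    for s in samples:
        th = thresholds.get(tuple(bits), 0.0)
        # Assumed convention: sample above threshold -> bit 1, else bit 0.
        bits.append(1 if s > th else 0)
    arm = 0
    for b in bits:
        arm = (arm << 1) | b  # assemble the binary machine index
    return arm, tuple(bits)


if __name__ == "__main__":
    # Three samples address one of 2**3 = 8 machines.
    print(select_arm([0.3, -0.2, 0.5], {}))           # (5, (1, 0, 1))
    # Raising the first-bit threshold steers the choice toward machines 0-3.
    print(select_arm([0.3, -0.2, 0.5], {(): 0.4}))    # (1, (0, 0, 1))
```

Under these assumptions, six consecutive samples would address the 64 machines mentioned in the abstract, and each threshold biases the choice between the two halves of the remaining candidates.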


Article home page: https://www.nature.com/articles/s41598-018-29117-y

Makoto Naruse, Takatomo Mihana, Hirokazu Hori, Hayato Saigo, Kazuya Okamura, Mikio Hasegawa, Atsushi Uchida. Scalable photonic reinforcement learning by time-division multiplexing of laser chaos, Scientific Reports, 2018, DOI: 10.1038/s41598-018-29117-y