Dynamic sub-route-based self-adaptive beam search Q-learning algorithm for traveling salesman problem

PLOS ONE

RESEARCH ARTICLE

Jin Zhang 1,2*, Qing Liu 1, XiaoHang Han 1

1 School of Computer and Information Engineering, Henan University, Kaifeng, Henan, China
2 Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, Henan, China

OPEN ACCESS

Citation: Zhang J, Liu Q, Han X (2023) Dynamic sub-route-based self-adaptive beam search Q-learning algorithm for traveling salesman problem. PLoS ONE 18(3): e0283207. https://doi.org/10.1371/journal.pone.0283207

Editor: Shih-Wei Lin, Chang Gung University, TAIWAN

Received: June 5, 2022. Accepted: March 3, 2023. Published: March 21, 2023.

Copyright: © 2023 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All relevant data are within the paper and instances are available in http://www.math.uwaterloo.ca/tsp/concorde/index.html.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Abstract

In this paper, a dynamic sub-route-based self-adaptive beam search Q-learning (DSRABSQL) algorithm is proposed that provides a reinforcement learning (RL) framework combined with local search to solve the traveling salesman problem (TSP). DSRABSQL builds upon the Q-learning (QL) algorithm. To address QL's slow convergence and low accuracy, four strategies are first designed within the QL framework: a weighting function-based reward matrix, a power function-based initial Q-table, a self-adaptive ε-beam search strategy, and a new Q-value update formula. On this basis, a self-adaptive beam search Q-learning (ABSQL) algorithm is designed.
Because the sub-route is not fully optimized in the ABSQL algorithm, a dynamic sub-route optimization strategy is introduced outside the QL framework, which yields the DSRABSQL algorithm. Experiments are conducted to compare QL, ABSQL, DSRABSQL, our previously proposed variable neighborhood discrete whale optimization algorithm, and two advanced reinforcement learning algorithms. The experimental results show that DSRABSQL significantly outperforms the other algorithms. In addition, two groups of algorithms are designed based on the QL and DSRABSQL algorithms to test the effectiveness of the five strategies. The results show that the dynamic sub-route optimization strategy and the self-adaptive ε-beam search strategy contribute the most on small-, medium-, and large-scale instances. At the same time, the four strategies within the QL framework collaborate, and this collaboration strengthens as the instance scale grows.

1. Introduction

Given a set of cities, the traveling salesman problem (TSP) is to find the shortest route along which a salesman visits every city exactly once before returning to the starting point. The TSP is a well-known combinatorial optimization problem with applications in many fields [1], such as transportation, circuit board design, production scheduling, and logistics distribution. Because the TSP is a classic NP-hard problem, numerous approaches have been proposed to solve it, most of which are exact or heuristic algorithms. Exact algorithms include branch-and-bound (BnB), cutting-plane, integer programming, and dynamic programming, all of which are used to obtain the global optimal solution by continuous iteration. For example, Pekny et al. (1990) [2] and Pesant et al.
(1998) [3] proposed the BnB method and its variants for the TSP, and the famous TSP solver Concorde (http://www.math.uwaterloo.ca/tsp/concorde/index.html) is based on the BnB algorithm. Sanches et al. (2017) [4] proposed a partitioned cross-improvement initial solution method to speed up Concorde. However, since the time cost of exact algorithms increases exponentially with the size of the instance, they are not suitable for large-scale applications.

Heuristic algorithms are widely used because of their high computational efficiency, and they can obtain a sub-optimal solution in a reasonable timeframe. Representative heuristic algorithms include the Lin-Kernighan heuristic (LKH), ant colony optimization (ACO) algorithm, genetic algorithm (GA), particle swarm optimization (PSO) algorithm, whale optimization algorithm (WOA), and gray wolf optimization (GWO). Building on these, various improved algorithms have been studied for solving the TSP. For example, by modifying the heuristic rules of LKH to improve its search strategy, Helsgaun (2000) [5] proposed an improved LKH algorithm. Ebadinezhad et al. (2020) [6] proposed an ACO algorithm that included a dynamic evaporation strategy (DEACO). Wang et al. (2022) [7] proposed a fine-grained fast parallel GA based on a ternary optical computer, and Zheng et al. (2022) [8] proposed a transfer learning-based PSO algorithm. Zhang et al. (2020) [9] proposed a variable neighborhood discrete WOA (VDWOA), while Panwar et al. (2021) [10] proposed a novel discrete GWO algorithm.

However, heuristic algorithms are mostly based on random search, which lacks both the ability to learn and a theoretical foundation. They also usually require heuristic rules tailored to a given problem. For this reason, designing unified solutions for combinatorial optimization problems such as the TSP has become a popular research topic in machine learning.
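
The trade-off between exact and heuristic methods described above can be illustrated with a minimal sketch. It is not taken from the paper: the function names (`tour_length`, `brute_force_tsp`, `nearest_neighbor_tsp`) and the four-city instance are hypothetical, and exhaustive enumeration and the nearest-neighbor rule stand in for the far more sophisticated exact and heuristic algorithms cited here.

```python
import itertools
import math


def tour_length(points, order):
    """Length of the closed tour that visits points in the given order."""
    n = len(order)
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % n]])
        for i in range(n)
    )


def brute_force_tsp(points):
    """Exact search (illustrative only): enumerate all (n-1)! tours from city 0."""
    n = len(points)
    candidates = ((0,) + perm for perm in itertools.permutations(range(1, n)))
    best = min(candidates, key=lambda order: tour_length(points, order))
    return list(best), tour_length(points, best)


def nearest_neighbor_tsp(points, start=0):
    """Greedy heuristic: always move to the closest unvisited city, O(n^2)."""
    unvisited = set(range(len(points))) - {start}
    order = [start]
    while unvisited:
        last = order[-1]
        nxt = min(unvisited, key=lambda j: math.dist(points[last], points[j]))
        unvisited.remove(nxt)
        order.append(nxt)
    return order, tour_length(points, order)


# Hypothetical instance: the four corners of a unit square.
points = [(0, 0), (0, 1), (1, 1), (1, 0)]
exact_order, exact_len = brute_force_tsp(points)
greedy_order, greedy_len = nearest_neighbor_tsp(points)
print(exact_order, exact_len)
print(greedy_order, greedy_len)
```

The exhaustive search is guaranteed to return an optimal tour but evaluates (n-1)! candidates, which is why it becomes impractical beyond a handful of cities. The greedy rule runs in quadratic time but offers no optimality guarantee on general instances.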
As one of the main machine-learning methods, reinforcement learning (RL) has strong decision-making and autonomous learning capabilities. RL is based on the Markov decision process (MDP), a mathematical model of sequential decision making that has a natural similarity to the TSP. Therefore, many recent algorithms have adopted RL for the TSP, and they can be divided into three categories.

1. Deep learning algorithm combined with RL. The deep learning (DL) algorithm combined with RL exploits the perception ability of DL and the decision-making ability of RL. Its fast solution speed and strong generalization ability give this combination great potential in finding approximate TSP solutions. For example, Vinyals et al. (2015) [11] trained a pointer network for the TSP using supervised learning. Bello et al. (2016) [12] used an actor-critic policy gradient method in a recurrent neural network. Dai et al. (2017) [13] proposed an S2V-DQN algorithm using a graph embedding network trained by deep Q-learning. Deudon et al. (2018) [14] trained a neural network for the TSP by policy gradient using the reinforcement learning rule with a critic. Ma et al. (2019) [15] used h (...truncated)