Dynamic sub-route-based self-adaptive beam search Q-learning algorithm for traveling salesman problem
PLOS ONE
RESEARCH ARTICLE
Jin Zhang1,2*, Qing Liu1, XiaoHang Han1
1 School of Computer and Information Engineering, Henan University, Kaifeng, Henan, China, 2 Henan Key
Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, Henan, China
OPEN ACCESS
Citation: Zhang J, Liu Q, Han X (2023) Dynamic
sub-route-based self-adaptive beam search Q-learning algorithm for traveling salesman problem.
PLoS ONE 18(3): e0283207. https://doi.org/10.1371/journal.pone.0283207
Editor: Shih-Wei Lin, Chang Gung University,
TAIWAN
Received: June 5, 2022
Accepted: March 3, 2023
Published: March 21, 2023
Copyright: © 2023 Zhang et al. This is an open
access article distributed under the terms of the
Creative Commons Attribution License, which
permits unrestricted use, distribution, and
reproduction in any medium, provided the original
author and source are credited.
Data Availability Statement: All relevant data are
within the paper, and the instances are available at
http://www.math.uwaterloo.ca/tsp/concorde/index.html.
Funding: The authors received no specific funding
for this work.
Competing interests: The authors have declared
that no competing interests exist.
Abstract
In this paper, a dynamic sub-route-based self-adaptive beam search Q-learning
(DSRABSQL) algorithm is proposed that provides a reinforcement learning (RL) framework
combined with local search to solve the traveling salesman problem (TSP). DSRABSQL
builds upon the Q-learning (QL) algorithm. To address QL's slow convergence
and low accuracy, four strategies are first designed within the QL framework: a weighting
function-based reward matrix, a power function-based initial Q-table, a self-adaptive ε-beam search strategy, and a new Q-value update formula. Then, a self-adaptive beam
search Q-learning (ABSQL) algorithm is designed. To solve the problem that the sub-route
is not fully optimized in the ABSQL algorithm, a dynamic sub-route optimization strategy is
introduced outside the QL framework, and then the DSRABSQL algorithm is designed.
Experiments are conducted to compare QL, ABSQL, DSRABSQL, our previously proposed
variable neighborhood discrete whale optimization algorithm, and two advanced reinforcement learning algorithms. The experimental results show that DSRABSQL significantly outperforms the other algorithms. In addition, two groups of algorithms are designed based on
the QL and DSRABSQL algorithms to test the effectiveness of the five strategies. From the
experimental results, it can be found that the dynamic sub-route optimization strategy and
self-adaptive ε-beam search strategy contribute the most for small-, medium-, and large-scale instances. At the same time, the four strategies within
the QL framework cooperate, and this cooperation strengthens as the instance scale grows.
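For orientation, the QL baseline that DSRABSQL builds upon can be sketched as plain tabular Q-learning with ε-greedy tour construction. This is a generic textbook formulation, not the authors' algorithm: the reward (negative edge length), the zero-initialized Q-table, the hyperparameters `alpha`, `gamma`, and `epsilon`, and the function name `q_learning_tsp` are all assumptions made for illustration, and none of the five strategies described above is implemented.

```python
import math
import random

def q_learning_tsp(coords, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Generic tabular Q-learning for the TSP (illustrative baseline only)."""
    rng = random.Random(seed)
    n = len(coords)
    dist = [[math.dist(a, b) for b in coords] for a in coords]
    # q[i][j]: estimated value of moving from city i to city j (assumed zero start).
    q = [[0.0] * n for _ in range(n)]
    best_tour, best_len = None, math.inf

    for _ in range(episodes):
        current = 0
        tour = [current]
        unvisited = set(range(1, n))
        while unvisited:
            choices = sorted(unvisited)
            if rng.random() < epsilon:
                nxt = rng.choice(choices)                          # explore
            else:
                nxt = max(choices, key=lambda c: q[current][c])    # exploit
            unvisited.remove(nxt)

            # Assumed reward: negative length of the edge just travelled.
            reward = -dist[current][nxt]
            if unvisited:
                # Bootstrap from the best move still available from the next city.
                future = max(q[nxt][c] for c in unvisited)
            else:
                # Last move: also charge the edge that closes the tour.
                reward -= dist[nxt][0]
                future = 0.0

            # Standard Q-learning update.
            q[current][nxt] += alpha * (reward + gamma * future - q[current][nxt])
            tour.append(nxt)
            current = nxt

        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len

# Hypothetical instance: the corners of a 3 x 4 rectangle.
tour, length = q_learning_tsp([(0, 0), (3, 0), (3, 4), (0, 4)])
```

Folding the closing edge into the final reward is one possible design choice; a fixed random seed is used so that repeated runs of the sketch are reproducible.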
1. Introduction
For a set of given cities, the traveling salesman problem (TSP) is to find the shortest route
along which a salesman visits all of the cities exactly once before returning to the starting
point. The TSP is a well-known combinatorial optimization problem with applications in
many fields [1], such as transportation, circuit board design, production scheduling, and logistics distribution.
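As a concrete illustration of this definition, the following minimal sketch computes the length of a closed route and finds the shortest one by exhaustive enumeration. The four-city coordinates and the function names `tour_length` and `brute_force_tsp` are assumptions made for illustration; they are not taken from the paper or its benchmark instances.

```python
import itertools
import math

def tour_length(tour, coords):
    """Length of the closed route that visits `tour` in order and returns to the start."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def brute_force_tsp(coords):
    """Enumerate every route that starts at city 0; practical only for tiny instances."""
    n = len(coords)
    candidates = ((0,) + perm for perm in itertools.permutations(range(1, n)))
    best = min(candidates, key=lambda t: tour_length(t, coords))
    return list(best), tour_length(best, coords)

# Hypothetical instance: the corners of a 3 x 4 rectangle.
coords = [(0, 0), (3, 0), (3, 4), (0, 4)]
tour, length = brute_force_tsp(coords)
```

Fixing the start city removes rotations of the same route, which still leaves (n - 1)! candidates to inspect.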
Because the TSP is a classic NP-hard problem, numerous approaches have been proposed to solve
it, most of which are exact or heuristic algorithms. Exact algorithms include branch-and-bound
(BnB), cutting-plane, integer programming, and dynamic programming, all of which
search for the global optimal solution by continuous iteration. For example, Pekny et al.
(1990) [2] and Pesant et al. (1998) [3] proposed the BnB method and its variants for the
TSP, and the famous TSP solver Concorde (http://www.math.uwaterloo.ca/tsp/concorde/index.html)
is based on the BnB algorithm. Sanches et al. (2017) [4] proposed a partitioned
cross-improvement initial solution method to speed up Concorde. However, since the time
cost of exact algorithms increases exponentially with the size of the instance, they are not
suitable for large-scale applications.
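To make the exponential growth concrete, here is a minimal sketch of the classical Held-Karp recurrence, a textbook instance of the dynamic programming approach named above. It is not the method of any cited work, and the 4 x 4 distance matrix is a hypothetical example. The table holds on the order of n * 2^n entries, which is why this approach is limited to small instances.

```python
import itertools

def held_karp(dist):
    """Exact TSP by dynamic programming over subsets; needs at least two cities."""
    n = len(dist)
    # best[(mask, j)]: shortest path that starts at city 0, visits exactly the
    # cities in bitmask `mask` (city 0 is never in the mask), and ends at city j.
    best = {(1 << j, j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in itertools.combinations(range(1, n), size):
            mask = sum(1 << j for j in subset)
            for j in subset:
                prev = mask ^ (1 << j)  # same subset without the end city j
                best[(mask, j)] = min(
                    best[(prev, k)] + dist[k][j] for k in subset if k != j
                )
    full = (1 << n) - 2  # every city except city 0
    # Close the route by returning from the last city to city 0.
    return min(best[(full, j)] + dist[j][0] for j in range(1, n))

# Hypothetical symmetric distance matrix for four cities.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
optimal = held_karp(dist)
```

For this matrix the three distinct routes have lengths 21, 18, and 29, so the sketch returns 18.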
Heuristic algorithms are widely used because of their high computational efficiency, and
they can obtain a sub-optimal solution in a reasonable timeframe. Representative heuristic
algorithms include the Lin-Kernighan heuristic (LKH), ant colony optimization (ACO) algorithm, genetic algorithm (GA), particle swarm optimization (PSO) algorithm, whale optimization algorithm (WOA), and gray wolf optimization (GWO). Building on these, various improved
algorithms have been studied for solving the TSP. For example, by modifying the heuristic rules of LKH to improve its search strategy, Helsgaun (2000) [5] proposed an improved
LKH algorithm. Ebadinezhad et al. (2020) [6] proposed an ACO algorithm that included a
dynamic evaporation strategy (DEACO). Wang et al. (2022) [7] proposed a fine-grained fast
parallel GA based on a ternary optical computer, and Zheng et al. (2022) [8] proposed a transfer learning-based PSO algorithm. Zhang et al. (2020) [9] proposed a variable
neighborhood discrete WOA (VDWOA), while Panwar et al. (2021) [10] proposed
a novel discrete GWO algorithm. However, heuristic algorithms are mostly based on random
search, which lacks both the ability to learn and a theoretical foundation. They also usually
require unique heuristic rules for a given problem. For this reason, designing unified solutions
for combinatorial optimization problems such as TSP has become a popular research topic in
machine learning.
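The trade-off described in this paragraph can be illustrated with a much simpler heuristic than those cited: nearest-neighbor construction followed by 2-opt improvement. This is a swapped-in textbook technique, not LKH or any of the population-based algorithms named above, and the coordinates and function names are hypothetical. Note that both steps rely on hand-written, problem-specific rules and give no optimality guarantee.

```python
import math

def tour_length(tour, coords):
    """Length of the closed route that visits `tour` in order and returns to the start."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def nearest_neighbor(coords, start=0):
    """Greedy construction: always move to the closest unvisited city."""
    unvisited = set(range(len(coords))) - {start}
    tour = [start]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda c: math.dist(coords[last], coords[c]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

def two_opt(tour, coords):
    """Reverse segments of the route for as long as a reversal shortens it."""
    tour = tour[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, coords) < tour_length(tour, coords) - 1e-12:
                    tour, improved = candidate, True
    return tour

# Hypothetical instance: six cities on two parallel rows.
coords = [(0, 0), (1, 5), (2, 0), (3, 5), (4, 0), (5, 5)]
initial = nearest_neighbor(coords)
refined = two_opt(initial, coords)
```

The sketch recomputes the full route length for every candidate to stay short; a practical implementation would evaluate only the two edges that change.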
As one of the main machine-learning methods, reinforcement learning (RL) has strong
decision-making and autonomous learning capabilities. RL is based on the Markov decision
process (MDP), a mathematical model of sequential decision making that is naturally similar
to the TSP. Therefore, many recent algorithms have adopted RL for the TSP, and they can be
divided into three categories.
1. Deep learning algorithm combined with RL. The deep learning (DL) algorithm combined
with RL exploits the perception ability of DL and the decision-making ability of RL. Its fast
solution speed and strong generalization ability give this combination great potential in
finding approximate TSP solutions. For example, Vinyals et al. (2015) [11] trained a
pointer network for the TSP by supervised learning. Bello et al. (2016) [12] used an actor–
critic policy gradient method in a recurrent neural network. Dai et al. (2017) [13] proposed
an S2V-DQN algorithm using a graph embedding network trained by deep Q-learning.
Deudon et al. (2018) [14] trained a neural network for TSP by policy gradient using the
reinforcement learning rule with a critic. Ma et al. (2019) [15] used h (...truncated)