Implementation of Deep Deterministic Policy Gradients for Controlling Dynamic Bipedal Walking.
biomimetics
Article
Chujun Liu 1, Andrew G. Lonsberry 1, Mark J. Nandor 1, Musa L. Audu 2, Alexander J. Lonsberry 1 and Roger D. Quinn 1,*
1 Department of Mechanical and Aerospace Engineering, Case Western Reserve University, Cleveland, OH 44106, USA
2 Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH 44106, USA
* Author to whom correspondence should be addressed.
Received: 12 November 2018; Accepted: 11 March 2019; Published: 22 March 2019
Abstract: A control system for bipedal walking in the sagittal plane was developed in simulation.
The biped model was built based on anthropometric data for a 1.8 m tall male of average build.
At the core of the controller is a deep deterministic policy gradient (DDPG) neural network that
was trained in GAZEBO, a physics simulator, to predict the ideal foot placement to maintain stable
walking despite external disturbances. The complexity of the DDPG network was decreased through
carefully selected state variables and a distributed control system. Additional controllers for the hip
joints during their stance phases and the ankle joint during toe-off phase help to stabilize the biped
during walking. The simulated biped can walk at a steady pace of approximately 1 m/s, and during
locomotion it can maintain stability with a 30 kg·m/s impulse applied forward on the torso or a
40 kg·m/s impulse applied rearward. It also maintains stable walking with a 10 kg backpack or a
25 kg front pack. The controller was trained on a 1.8 m tall model, but also stabilizes models 1.4–2.3 m
tall with no changes.
Keywords: biped; DDPG neural network; gait; stability
1. Introduction
Spinal cord injuries (SCI) can cause paralysis, resulting in minimal motor control and rendering
standing and walking impossible. Exoskeletons can help patients regain their ability to stand and walk
on their own. It has been established previously that combining functional neuromuscular stimulation
(FNS) with a powered, lower limb exoskeleton can restore locomotion to such individuals [1–3].
There remain many challenges in realizing such systems, given that each patient’s body is unique.
One of the primary problems needing more work is the generation of adaptive control systems for
stable walking and fall prevention. While much research has been invested in such control for legged
robots, there have been few applications of these methods to exoskeletons.
The design of algorithms for control of bipedal robot locomotion is a topic of intense research
interest [4–8] and several different methods have been developed. Some of these focus on the concept
of finding a zero-moment point (ZMP) about which to step. Kim and Oh [9] reported on a controller
using a ZMP-based technique with feedback from inertial sensor measurements. This controller has
three subcomponents: one that adjusts a pre-defined walking pattern, a second one for balance in
real-time using information from sensory feedback, and a third one for motion control based on
previous experience. The controller is successfully implemented on a robot that walks without the
need for support and without any extensive tuning of the controller parameters. A similar method [10],
Biomimetics 2019, 4, 28; doi:10.3390/biomimetics4010028
www.mdpi.com/journal/biomimetics
again featuring a controller composed of three modules, demonstrates bipedal stability using a ZMP
method. The three modules perform body inclination control, ZMP control, and foot adjustment
control. The last component is primarily invoked when the system encounters uneven terrain. The ZMP
reported by Yokoi et al. [10] is computed using torque sensors at the ankles and controlled via an
adjustment of the orientation of the trunk. In implementation, the robot can walk stably with a step
length of 0.2 m and a step period of 0.8 s. A key difference between these two works is in
how the foot position is chosen. In Kim and Oh [9], foot placement is defined as part of the desired
motion pattern, while, in Yokoi et al. [10], it is computed by inverse kinematics from the desired joint
angle trajectories.
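As a rough illustration of how a ZMP can be estimated from ankle sensing, the sketch below computes the sagittal ZMP coordinate from a pitch moment and ground reaction force measured at the ankle. It is not the formulation used by Yokoi et al. [10]. The function name, the sensor height parameter, and the sign convention (x forward, z up, loads expressed as exerted by the ground on the foot, sensor directly above the foot-frame origin) are assumptions made for this example.

```python
def zmp_from_ankle_sensor(tau_y, f_x, f_z, sensor_height=0.0):
    """Estimate the sagittal ZMP coordinate (m) for a single supporting foot.

    Hypothetical helper, not taken from the paper.
    tau_y         -- pitch moment measured at the sensor (N*m)
    f_x           -- horizontal (forward) ground reaction force (N)
    f_z           -- vertical ground reaction force (N), positive when loaded
    sensor_height -- height of the sensor above the ground (m)

    Taking moments about the sensor for a ground reaction applied at
    (p_x, 0, 0) gives tau_y = -sensor_height * f_x - p_x * f_z.
    """
    if f_z <= 0.0:
        # With no vertical load the foot is in the air and the ZMP is undefined.
        raise ValueError("foot is not loaded; ZMP is undefined")
    return (-tau_y - f_x * sensor_height) / f_z
```

With an 800 N vertical load and a pitch moment of -40 N·m, the estimate lies 0.05 m ahead of the sensor under this convention. A real implementation would also need to combine the readings of both feet during double support.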
In this paper, based on the concept of the ZMP, a simplified, but robust, algorithm for biped
locomotion is presented as the basis for control of an exoskeleton used to stabilize individuals with SCI.
As has been established in robotic biped locomotion, foot placement is a critical component. Each step
must be carefully planned based on feedback about the robot's current state [11]. By choosing the
next step carefully, the biped has been shown to keep its ZMP inside the support polygon and to
ensure that the center of mass (COM) does not diverge from the ZMP [12]. This strategy, and those
similar to it, depend on having a linear inverted pendulum model to find the desired ZMP and COM
trajectories. In ideal situations, the inverted pendulum model can describe the real biped system
well enough to predict the correct next step. However, our approach does not depend on having
a known, fixed dynamics model. Instead, the model is obtained through a learning process where
the data is used to train a neural network. The advantage of this approach is that it does not need
any prior knowledge about the system. Furthermore, the use of a neural network is superior to a
linearized dynamic model, as it can capture nonlinearities and make approximate or simplified models
unnecessary [13].
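For reference, the linear inverted pendulum model mentioned above relates the ZMP to the COM state through the standard expression p = x - (z_c / g) * x_ddot, where x is the horizontal COM position, z_c the constant COM height, and x_ddot the horizontal COM acceleration. The minimal sketch below evaluates this expression and checks the result against the sagittal extent of the support foot. The function names and the numerical values in the usage note are hypothetical and are not taken from the paper.

```python
G = 9.81  # gravitational acceleration (m/s^2)


def lipm_zmp(com_x, com_acc_x, com_height):
    """ZMP predicted by the linear inverted pendulum model (sagittal plane).

    com_x      -- horizontal COM position (m)
    com_acc_x  -- horizontal COM acceleration (m/s^2)
    com_height -- constant COM height z_c assumed by the model (m)
    """
    return com_x - (com_height / G) * com_acc_x


def zmp_inside_support(zmp_x, heel_x, toe_x):
    """Return True if the ZMP lies between the heel and toe of the stance foot.

    In the sagittal plane the support polygon reduces to this interval.
    """
    return heel_x <= zmp_x <= toe_x
```

For example, with an assumed COM height of 1.0 m, a COM 0.1 m ahead of the ankle and no horizontal acceleration, lipm_zmp(0.1, 0.0, 1.0) places the ZMP directly below the COM at 0.1 m, which is inside a foot spanning -0.05 m to 0.20 m. The learned controller described in this paper does not rely on such a fixed model; the sketch only shows what the model-based alternative computes.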
In future work, the methods presented here will be applied to control the user’s muscles and
powered lower limb exoskeleton based on an adaptable, reinforcement learning approach. To make
the system robust for any user, the control approach must be adaptable [14]. It should thus function
with limited a priori information about the individual. To accomplish this, we employ an exploratory
reinforcement learning approach based on deep Q-networks (DQN) [15]. Reinforcement learning
(RL) is a type of machine learning wherein a controller learns through trial and error. Over each
trial-and-error episode, the controller is graded by a reward function that indicates how well it is performing [16].
The goal is to maximize the total reward, and thereby produce a controller that accomplishes some
given task [17]. Because control of a biped is defined over a continuous state–action space, DQN is not
directly applicable: it natively handles only discrete action spaces. A variation of
DQN called deep deterministic policy gradient (DDPG) [18] is utilized here instead. Our system is
composed of three separate controllers designed to operate together to produce stable walking control.
One of the three controllers is a trained DDPG network and the other two consist of a conventional
proportional–integral–derivative (PID) feedback controller and an open-loop controller. The use of
three separa (...truncated)