A snake-inspired path planning algorithm based on reinforcement learning and self-motion for hyper-redundant manipulators

Redundant manipulators are flexible enough to adapt to complex environments, but their extra degrees of freedom demand dedicated controllers. Inspired by the morphology of snakes, we propose a path planning algorithm named Swinging Search and Crawling Control, which allows snake-like redundant manipulators to explore complex pipeline environments without collision. The proposed algorithm consists of the Swinging Search and the Crawling Control. In the Swinging Search, a collision-free manipulator configuration whose end-effector is at the target point is found by applying reinforcement learning to self-motion, instead of designing joint motions directly. The self-motion narrows the search space to the null space, and the reinforcement learning exploits information about the environment instead of searching blindly. Then, in the Crawling Control, the manipulator is controlled to crawl to the target point like a snake along the collision-free configuration. Only a single collision-free configuration needs to be searched, rather than collision-free configurations throughout the whole path planning process. Simulation experiments show that the algorithm can complete path planning tasks of hyper-redundant manipulators in complex environments. The 16-degrees-of-freedom and 24-degrees-of-freedom manipulators achieve 83.3% and 96.7% success rates in the pipe, respectively. In the concentric pipe, the 24-degrees-of-freedom manipulator has a success rate of 96.1%.


Introduction
In recent decades, more and more researchers have studied redundant manipulators for exploring complex narrow environments, such as natural gas pipes, sewer pipes, and the tailpipes of airplanes. A manipulator is termed kinematically redundant when it possesses more degrees of freedom (DoFs) than are needed to execute a given task. 1 The extra DoFs allow the manipulator to perform complex tasks, though they also make path planning a challenge.
To address the path planning of redundant manipulators, some early researchers proposed geometric methods.
Chirikjian and Burdick 2 proposed the backbone curve approach for hyper-redundant robot kinematics. Later, a shape control geometric analysis method 3 was proposed by Mochiyama to control the whole shape of a manipulator. But such algorithms are computationally intensive. To reduce computational complexity, Yahya et al. 4 proposed geometric constraints that limit the angles between the adjacent links to be equal. However, the flexibility of the manipulators is also limited due to the introduced constraints.
Other researchers proposed methods based on optimization theory, in which obstacle avoidance is achieved by optimizing an objective distance function between the manipulator and the obstacles. Zucker et al. 5 proposed the covariant Hamiltonian optimization for motion planning method, which converges quickly while generating a smooth collision-free trajectory. Zhao et al. 6 applied the ant colony algorithm to plan a collision-free optimal path for the end-effector. Collins and Shen 7 proposed an optimization framework for path planning with swarm optimization to efficiently compute approximate solutions to the collision-free path planning problem. The artificial potential field method is a classic path planning algorithm for mobile robots, which can also be applied to manipulators. 8,9 In this method, obstacles exert a repulsive virtual force on a point and the target exerts an attractive virtual force; their resultant pushes the point toward the target along a collision-free path. These methods require calculating the distance from the manipulator to the obstacles and therefore also require modeling the obstacle surfaces, which is sometimes difficult to accomplish.
In addition, the random sampling methods are efficient path planning algorithms for redundant manipulators. The probabilistic roadmaps (PRM) method 10 proposed by Kavraki et al. randomly samples in the configuration space to search for collision-free configurations. 11 The sampling results are stored in a PRM. For any given start and goal configurations of the manipulator, the motion planning can be transformed into the problem of connecting the two corresponding nodes in the roadmap. 12 Compared with the optimization theory methods, the random sampling methods do not require figuring out the objective distance function, which reduces the computational complexity and improves the algorithm applicability. However, as the DoFs increase, methods of this kind are also computationally complex.
To handle unknown environments in applications, Um and Ryu 13 proposed the RRT-Cabomba algorithm based on the sampling method RRT, 14 which controls a robot equipped with IPA sensitive skin to perceive and plan alternately. They further adopted the bug algorithm in RRT-Cabomba to avoid local minima.
With the development of artificial intelligence, applying genetic algorithms to the obstacle avoidance of redundant manipulators 15,16 is another way to solve the optimization problems in planning, without requiring the distance from the manipulator to the obstacles. As a search process, the genetic algorithm copies and mutates the parameters, then filters them to find the parameters that improve the fitness function, so as to control the manipulator to avoid obstacles. Reinforcement learning (RL) has also been applied to the path planning of redundant manipulators. Hua et al. 17 applied RL to train an agent to plan the end-effector motion and the self-motion [18][19][20] separately, where combining the gradient projection method 21 decouples the motion of redundant manipulators into end-effector motion and self-motion. In this algorithm, the agent has to learn the entire planning process without making full use of the nature of self-motion, which still causes great computational difficulty.
In nature, snakes crawl along the path of their heads during crawling. Their bodies will not collide with the environment, as long as the heads can pass smoothly. Inspired by this biological phenomenon, this article proposes a path planning algorithm named Swinging Search and Crawling Control (SSCC) for snake-like redundant manipulators. This algorithm consists of the Swinging Search and the Crawling Control, allowing the redundant manipulators with repetitive modules to explore the complex pipeline environments.
By self-motion, the Swinging Search generates collision-free configurations as the directive configurations, in which the end-effector is at the target point. The manipulator can then crawl into the pipe along the directive configurations via the Crawling Control. During crawling, each modular link repeats the motion of the foremost module.
The major contributions of this article are as follows.
(i) Inspired by the biological characteristics of snakes, this article proposes the SSCC algorithm, which adapts snake-like redundant manipulators to narrow environments such as pipes.
(ii) The SSCC algorithm has low computational complexity, since only one directive configuration needs to be searched.
(iii) The SSCC algorithm has good expansibility and adaptability for modular snake-like redundant manipulators with n universal joints.

Snake-like redundant manipulators
In this section, the subsections present in turn the mechanical structure, the forward kinematics, and the self-motion of the snake-like redundant manipulators. The formulation elaborates what the manipulators look like and how they are controlled.

Mechanical structure design
In this article, to be analogous to snakes in morphology, modular redundant manipulators with n universal joints are designed. The manipulators are n × U open chains. 22 Each module consists of a universal joint and a fixed-length rigid body link, as shown in Figure 1. Modularity allows the DoFs of the manipulators to be easily expanded by increasing the number of modules. The Denavit-Hartenberg (D-H) parameters of the snake-like redundant manipulators are shown in Table 1. By the D-H parameters, the homogeneous transformation matrix between adjacent frames is expressed as the standard composition

T_i = Rot_z(θ_i) Trans_z(d_i) Trans_x(a_i) Rot_x(α_i)

Forward kinematics
In forward kinematics, the pose of the end-effector is x = [x y z α β γ]ᵀ, in which the first three and the last three variables represent the position and the orientation of the end-effector, respectively. The pose x can be calculated as

x = f_fkine(q)

where f_fkine is the forward kinematics function and q = [q_1 q_2 … q_2n]ᵀ is the joint vector of the manipulator in the configuration space.
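As an illustrative sketch of this forward kinematics (assuming the standard D-H convention; the function names and parameter layout here are illustrative, not the authors' implementation), the end-effector pose can be computed by chaining the per-joint homogeneous transforms:

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Standard Denavit-Hartenberg homogeneous transform for one joint."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def fkine(q, dh_params):
    """Forward kinematics: chain the per-joint D-H transforms.

    q: joint angles; dh_params: per-joint (d, a, alpha) constants.
    Returns the 4x4 homogeneous pose of the end-effector.
    """
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(q, dh_params):
        T = T @ dh_transform(theta, d, a, alpha)
    return T
```

The position part of the pose is the last column of the returned matrix; the orientation angles can be extracted from the rotation block.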

Self-motion
The i-th column of the Jacobian matrix, j_i, can be calculated by the vector product method as

j_i = [z_i × p_iE; z_i]

where z_i is the axis vector of the i-th joint and p_iE is the vector from the i-th joint to the end-effector. The Jacobian matrix can then be represented as J = [j_1 j_2 … j_2n]. The pseudo-inverse of the Jacobian matrix is defined as J† = Jᵀ(JJᵀ)⁻¹. With the Jacobian matrix, the end-effector velocity vector ẋ = [vᵀ wᵀ]ᵀ can be expressed as a function of the joint velocity vector q̇:

ẋ = J q̇ (6)

where v and w are the linear and angular velocity vectors of the end-effector, respectively. For control purposes, it is common to solve equation (6) for q̇ by specifying the desired ẋ. Since the manipulators are redundant, the dimension of q̇ is higher than that of ẋ, so equation (6) can be regarded as a set of underdetermined equations, whose solution can be formalized as

q̇ = J†ẋ + (I − J†J)M (7)

where the term J†ẋ is the minimum-norm solution q̇_S of equation (6), and the term (I − J†J)M is the homogeneous solution for an arbitrary vector M. All the homogeneous solutions form the null space of J. In the null space, redundant manipulators can change their configurations without changing the pose of the end-effector, which is named the self-motion. The self-motion can be used in the gradient projection method 21 to realize the obstacle avoidance of redundant manipulators by choosing

M = k ∇H (8)

where k is a scale factor, H is the objective function representing the distance between the manipulator and obstacles, and ∇H is the gradient of H. Equation (7) then becomes

q̇ = J†ẋ + k(I − J†J)∇H (9)

By specifying ẋ and calculating ∇H, the control of redundant manipulators can be achieved by the iterative equation

q(t + Δt) = q(t) + q̇ Δt (10)

where q(t) is the current joint position vector and q(t + Δt) is the joint position vector after one iteration.
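One iteration of the gradient projection control law above can be sketched as follows (a minimal sketch with NumPy; the function name and argument layout are assumptions, not the authors' code):

```python
import numpy as np

def gradient_projection_step(q, J, x_dot, grad_H, k, dt):
    """One iteration of q(t+dt) = q(t) + q_dot*dt, where
    q_dot = J+ x_dot + k (I - J+ J) grad_H projects the
    obstacle-avoidance gradient into the null space of J."""
    J_pinv = np.linalg.pinv(J)           # pseudo-inverse J+
    n = q.shape[0]
    null_proj = np.eye(n) - J_pinv @ J   # null-space projector (I - J+ J)
    q_dot = J_pinv @ x_dot + k * null_proj @ grad_H
    return q + q_dot * dt
```

With ẋ = 0, the step is pure self-motion: the joint angles change, but the projected component produces no end-effector velocity, since J(I − J†J) = 0.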

Framework of SSCC
In this article, a path planning algorithm named SSCC is proposed. Through this algorithm, the modular redundant manipulators can reach into the complex pipe environment without collision, allowing the end-effector to achieve the goal pose. SSCC algorithm consists of two units, the Swinging Search and the Crawling Control. The Swinging Search generates collision-free configurations by applying RL to self-motion. The generated configurations are named directive configurations, in which the end-effector has reached the goal pose. In the Crawling Control, the controlled redundant manipulator crawls into the pipe along the directive configuration like a snake. The flowchart of SSCC is shown in Figure 2.
In Figure 2, s is the state of the environment fed back to the agent; ∇H is the action the agent decides to perform, which is the gradient in equation (9) of the objective function representing the distance between the obstacle and the manipulator; r is the reward obtained by the agent. These parameters are described in detail in the "State-action design" section.
Since the length of the rigid link in the module cannot be ignored, the manipulator will inevitably deviate from the directive configuration during the Crawling Process, which may also cause the manipulator to collide. The Crawling Process is deterministic, which means that the degree of deviation can be calculated based on the directive configuration. Therefore, we propose the maximum deviation distance to evaluate the directive configuration to train the agent in RL, so that the Swinging Search can generate the directive configuration with a smaller maximum deviation distance to reduce the possibility of collision during crawling.
Both the Swinging Search and the Crawling Control include collision detection. In the Swinging Search, collision detection is used to determine whether to terminate the search. In the Crawling Control, if a collision is detected, then the directive configuration is invalid, and the Swinging Search should regenerate a new directive configuration.

Crawling Control
The Crawling Process refers to the process in which the manipulator M is controlled to move forward along the directive configuration M(q_d). Through this control, the end-effector of the manipulator can reach the specified pose.
The flow of the Crawling Process is shown in the subsection "Crawling Process," which specifies how the manipulator is controlled given a certain directive configuration.
Then in the subsection "Evaluation of directive configurations," the evaluation method is proposed, in which the maximum deviation distance is designed to measure the deviation of the controlled manipulator from the directive configuration. For a specific configuration, the Crawling Process produces a unique control process. So the maximum deviation distance is the evaluation indicator of directive configurations. The smaller the maximum deviation distance, the better the directive configuration. This property can be used in the reward function of RL in Swinging Search, to help generate better directive configurations, so as to reduce the possibility of collisions in the Crawling Process.

Crawling Process
To illustrate the Crawling Process, several definitions are made as follows, and structures are shown in Figure 3.
i. The modules of M(q_d) that are completely or partially inside the pipe are called the inner directive modules.
ii. The inner directive module at the entrance of the pipe is called the critical module. The two ends of the critical module are denoted as p1 and p2.
iii. Similarly, the modules of the crawling manipulator M that are completely or partially inside the pipe are called the inner modules.
iv. The remaining modules of M are called the base modules. All the base modules form the base manipulator M_base.
In the Crawling Process, the base modules and the inner modules of M are controlled separately, as shown in Figure 4. By solving the inverse kinematics, the end of M_base is controlled to move from p1 to p2 within time T, while keeping the orientation of the foremost module the same as that of the critical module. At the same time, the universal joint angles of the inner modules are concurrently changed to the corresponding joint angles of their next inner directive module. In this way, each inner module repeats the motion of the foremost inner module, like the movement of a snake. In detail, suppose the number of modules to be inserted is N and the number of modules already inserted is i; then the angles of the j-th inner module should be adjusted to the angles of the (N − i + j − 1)-th inner directive module.
As the modules move forward, the base modules in the front will become the inner modules, and the control over them will be changed, too.
For the next round of insertion, the new M base should adjust the orientation of its end, and at the same time, the newly inserted module should keep overlapping the critical module. This can be achieved through pose interpolation of M base , while planning the first inner module to move to the last inner directive module. The Crawling Process algorithm is summarized in Algorithm 1.
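The index bookkeeping of the angle-shifting rule above can be sketched as follows (the function name, the 0-based list storage, and the per-module angle representation are illustrative assumptions, not the authors' implementation):

```python
def inner_joint_targets(q_d_inner, num_inner, i):
    """Target angles for the inner modules after i modules are inserted.

    q_d_inner: per-module joint angles of the N inner directive modules,
    indexed 1..N from the pipe entrance inward (stored 0-based here).
    Following the rule in the text, the j-th inner module (j = 1..i)
    takes the angles of the (N - i + j - 1)-th inner directive module,
    so each module repeats the motion of the module ahead of it.
    """
    N = num_inner
    return [q_d_inner[(N - i + j - 1) - 1] for j in range(1, i + 1)]
```

For example, with N = 5 directive modules and i = 2 modules inserted, the two inner modules target the 3rd and 4th directive modules, respectively.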

Evaluation of directive configurations
The crawling manipulator M will inevitably deviate from the directive configuration in the Crawling Process. Therefore, we introduce the maximum deviation distance D to evaluate the deviation. For a specific directive configuration, the Crawling Process correspondingly generates a planned motion. D is defined as the maximum distance between the inner modules and their corresponding inner directive modules at any moment in the planned motion.
Since each inner module repeats the motion of the foremost inner module, it is only necessary to analyze the motion of the foremost inner module to calculate the maximum deviation distance.
Sample e points uniformly on the first inner module and calculate the distance from each point to the (N − i − 1)-th and the (N − i)-th inner directive modules, as shown in Figure 5, where N is the number of inner directive modules and i is the number of modules inserted. The distance between the foremost inner module and its corresponding directive module is defined as the maximum of these distances:

d(t, i, j) = d(s_j(q(t)), M_{N−i}(q_d))

where s_j(q(t)) is the j-th sampling point in the configuration of M at time t, M_{N−i}(q_d) is the (N − i)-th inner directive module, and d is the distance function between a sampling point and the link of the inner directive module. The maximum deviation distance can then be calculated as

D = max_{t, i, j} d(t, i, j)

The algorithm of the entire Crawling Control is summarized in Algorithm 2.
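As an illustrative sketch of this evaluation (the sampling interface and data layout are assumptions, not the authors' implementation), D can be computed from point-to-segment distances, treating each directive link as a line segment:

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (a directive-module link)."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def max_deviation(sample_trajectories, directive_links):
    """Maximum deviation distance D: the largest distance between any
    sampled point on the foremost inner module and its corresponding
    directive link, over all time steps of the planned motion.

    sample_trajectories: iterable of (points, link_index) per time step,
    where points are the e sampled points on the foremost inner module
    and link_index selects the corresponding directive link (a, b).
    """
    D = 0.0
    for points, link_index in sample_trajectories:
        a, b = directive_links[link_index]
        for p in points:
            D = max(D, point_segment_distance(p, a, b))
    return D
```

Because the planned motion is deterministic for a given directive configuration, this evaluation can be run offline before any crawling is executed.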

Swinging Search
In this section, details of the Swinging Search are elaborated. By applying RL to self-motion, the Swinging Search generates collision-free directive configurations, in which the end-effector has reached the goal pose. These generated directive configurations are used as the input of the Crawling Control which is discussed above. In this way, the search can make use of the environment information, so as not to explore blindly. Applying the self-motion also reduces the range of configurations that need to be searched.
The first subsection "State-action design" shows how the parameters are set to allow RL to help search, and the second subsection "Learning by DDPG" shows the specific RL algorithm used here and the full flow of Swinging Search.

State-action design
The design of state and action has a great influence on the convergence of RL, so the designed state and action should reflect the relationship between the manipulator and the environment.
To search for directive configurations by self-motion, the q̇ derived from equation (10) is used to control the manipulator. Intuitively, we design M as the action:

a_t = M

And the state s_t is designed as

s_t = [q_1 q_2 … q_2n x y z]ᵀ

where q_i is the i-th joint angle of the manipulator and [x y z] is the coordinate vector of the target position that the end-effector needs to reach.
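As a minimal sketch of this state construction (the function name and array layout are illustrative assumptions), the state is simply the joint angles concatenated with the target position:

```python
import numpy as np

def make_state(q, target_xyz):
    """State s_t: the 2n universal-joint angles concatenated with the
    target position [x, y, z] the end-effector must reach."""
    return np.concatenate([np.asarray(q, dtype=float),
                           np.asarray(target_xyz, dtype=float)])
```

For a manipulator with n = 4 modules (2n = 8 joint axes), the state vector has 11 components.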
A well-designed reward function can make the RL algorithm converge quickly. Therefore, a reward function consisting of the penalties r′_t and r″_t is proposed. We use the number of collided modules as the penalty r′_t so that the agent obtains information about collisions. In this way, the agent tends to steer the manipulator away from obstacles during optimization, so that it can find directive configurations more quickly:

r′_t = Σ_{i=1}^{n} C_i

where r′_t represents the number of modules that collided at time t and C_i is the collision indicator of the i-th module: if the i-th module collides with an obstacle, C_i = −1, otherwise C_i = 0.
In addition, if a directive configuration is found, the Swinging Search also receives feedback from the Crawling Control, and the maximum deviation distance is used as a penalty r″_t:

r″_t = −D / l

where D is the maximum deviation distance from the Crawling Control and l is the length of a manipulator link. The total reward is then

r_t = r′_t + w r″_t

where w is a scalar hyperparameter set according to the number of steps in an episode.
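A minimal sketch of this reward (assuming the penalties combine additively with weight w, which is our reading of the text, not a confirmed detail of the authors' code):

```python
def reward(collisions, D=None, link_length=1.0, w=0.1):
    """Reward r_t = r'_t + w * r''_t, where both terms are penalties.

    collisions: per-module collision flags; each collided module adds -1.
    D: maximum deviation distance fed back by the Crawling Control,
    available only once a directive configuration has been found.
    """
    r_collision = -sum(1 for c in collisions if c)              # r'_t
    r_deviation = -(D / link_length) if D is not None else 0.0  # r''_t
    return r_collision + w * r_deviation
```

Both terms are non-positive, so the agent maximizes reward by avoiding collisions first and, secondarily, by preferring directive configurations with small deviation.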

Learning by DDPG
The snake-like redundant manipulators used in this article have many DoFs and a continuous action space, so an algorithm that can handle continuous actions is necessary. Lillicrap et al. presented the deep deterministic policy gradient (DDPG) 23 algorithm, which operates over continuous action spaces and is therefore a good choice for this task. The DDPG algorithm uses the Actor-Critic framework, with an Actor network for selecting actions and a Critic network for evaluating the policy. The Actor network is updated by the policy gradient

∇_θ J ≈ (1/m) Σ_i ∇_a Q(s_i, a | ω)|_{a=μ(s_i)} ∇_θ μ(s_i | θ)

where θ is the Actor network weight parameter, s is the state, a is the action, Q is the evaluation value, and m is the number of samples of empirical data. The Critic network updates its parameters with the mean square error loss

L(ω) = (1/m) Σ_i (y_i − Q(s_i, a_i | ω))², with y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ′) | ω′)

where ω is the Critic network weight parameter and γ is the reward discount factor. The DDPG algorithm replicates the Actor network and the Critic network as target networks so that the agent can learn the task strategy stably; their weight parameters are denoted θ′ and ω′, respectively. The target networks greatly enhance the stability of the learning process during training. The Actor target network is updated by

θ′ ← τθ + (1 − τ)θ′

where τ controls the update speed of the Actor target network weights θ′. The Critic target network parameter ω′ is updated in the same way:

ω′ ← τω + (1 − τ)ω′

In addition, the DDPG algorithm adds random noise to increase the exploration ability of the Actor network in the continuous action space, forming the behavior policy

μ′(s_t) = μ(s_t | θ) + N_t

where N_t is the random noise process. Combined with the crawling algorithm, SSCC uses the feasible solutions obtained by the RL algorithm to complete the exploring task, and each completed task provides empirical data to train the model. After many tasks, the RL model converges better and finds solutions faster.
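The soft target update used for both the Actor and Critic target networks can be sketched as follows (a minimal sketch over plain parameter arrays; DDPG implementations typically apply this per-tensor to network weights):

```python
import numpy as np

def soft_update(target_params, source_params, tau):
    """Polyak averaging used by DDPG for both target networks:
    theta' <- tau * theta + (1 - tau) * theta' (likewise for omega')."""
    return {name: tau * source_params[name] + (1.0 - tau) * target_params[name]
            for name in target_params}
```

With a small τ (e.g. 0.01), the target networks track the learned networks slowly, which stabilizes the bootstrapped Critic targets.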
The Swinging Search algorithm is summarized in Algorithm 3.

Experiments and analysis
The performance of the SSCC is studied through simulation experiments. This section presents the settings and the results of the experiments. The simulation is verified in two environments (the hollow pipe and the concentric pipe); in each environment, the manipulators are required to reach target points inside the pipe.

System description
The hardware used in all experiments in this article is a MacBook Pro laptop (Apple M1 CPU, 16 GB RAM). All experiments are implemented in PyCharm CE, with Python 3.9.7 and PyTorch 1.8.0. We use the Robotics Toolbox for Python (roboticstoolbox-python) 24 to model and solve the kinematics of the hyper-redundant manipulators.

Testing case in the hollow pipe
In the first experiment, the hollow pipe is set as the obstacle. Parameters of the hollow pipe environment are shown in Table 2. And two snake-like redundant manipulators with different DoFs and equal total lengths are designed to verify the SSCC method. Parameters of manipulators are shown in Table 3.
We generated 30 target positions randomly in this experiment, and the efficiency of the SSCC is analyzed using directive configurations with the two manipulators of different DoFs, as shown in Table 4. The criterion for success of the SSCC algorithm is defined as searching for a set of collision-free configurations within 30 s, with no collision occurring during the Crawling Process for at least one directive configuration. Experiments show that the success rate of the 24-DoF manipulator is higher than that of the 16-DoF manipulator. This result can be explained intuitively by a limiting argument: with the total length of the manipulator fixed, more joints mean a shorter average link length, and with infinitely many DoFs the manipulator would behave like a soft rope. Increasing the number of DoFs over a fixed length thus makes the manipulator more flexible, allowing it to steer better around obstacles and improving the success rate of the algorithm.
Due to the relatively single environment, the manipulator is less likely to collide during the Crawling Process. Most of the failed samples are caused by the search time exceeding the limit. In the early stage of training, the RL algorithm has not converged and may not be able to finish searching within a certain time.
For a more detailed elaboration, we randomly selected two samples from the two groups of experiments respectively as the explanation examples.
The target position is at [1.65, −0.11, 0.04]. The two directive configurations of the two manipulators are searched by the Swinging Search algorithm, where the inner directive configuration consists of three modules and five modules, respectively. These configurations are shown in Figure 6; the angles are not listed in detail due to the high redundancy of the manipulators.
Then, the Crawling Control algorithm is verified in the simulation environment. The manipulators do not collide with the hollow pipe as they enter it.
The joint angle curves of the inner modules are shown in Figure 7; the continuous curves without abrupt changes indicate the stability of the manipulator motion and verify the effectiveness of the control. The deviation distances are shown in Figure 8. The deviation distances approach 0 when the inner modules overlap the corresponding inner directive modules. The maximum deviation distance is then calculated as the feedback to the DDPG model. The movement processes of the snake-like redundant manipulators are shown in Figure 9.

Testing case in the concentric pipe
In the second experiment, path planning using SSCC in the concentric pipe is studied. The parameters of the concentric pipe are shown in Table 5. The 24-DoF manipulator controlled in this experiment is the same as the second manipulator used in the previous experiment.
As the environment becomes more complex, the time limit of the algorithm in this experiment is extended to 60 s. The main reason for failure is the same as in the previous experiment: most of the failed samples are caused by the search time exceeding the limit.
We generated 52 target positions randomly in the concentric pipe. The results are presented as shown in Table 6. The manipulator is requested to reach the target position between the two pipes without collision. The success rate is lower than that of the previous experiment in the pipe, but the SSCC can still complete most of the path planning tasks.
We randomly select a target point [1.53, 0.61, −0.11] to elaborate the SSCC method in this complex environment. The directive configuration is searched by the Swinging Search, and the inner directive configuration consists of four modules, as shown in Figure 10. The Crawling Process of the directive configuration in this complex environment is shown in Figure 11. Similarly, the continuity of the crawling is verified by the joint angle curves shown in Figure 12. The deviation distance is shown in Figure 13.
In summary, the high efficiency of the SSCC method is demonstrated. In the pipe environment, replacing the 16-DoF manipulator with a 24-DoF manipulator with shorter links improves search efficiency. In the concentric pipe environment, the hyper-redundant manipulator still achieves a high success rate for path planning.

Discussion
As the main advantage of the algorithm, the Crawling Control is inspired by snake crawling. It means the Swinging Search only needs to plan the path of the end-effector instead of planning the whole body at the same time. Moreover, the purpose of using RL in the Swinging Search is not to find the optimal solution but to speed up the search for collision-free configurations that serve as the input of the Crawling Control.
The SSCC still has some limitations that deserve detailed discussion. In the Crawling Process, the inner modules will deviate from the directive configuration, and the deviation distance is cumulative: the more modules extend into the pipe, the greater the deviation distance may be. This problem can be mitigated by using the maximum deviation distance as part of the reward function in the RL model. In addition, for a specific practical application, it is enough for the Swinging Search to plan a few feasible solutions before the RL model is fully trained, so that the planner can find solutions faster in future searches.
Another limitation of the SSCC algorithm is that it cannot efficiently plan the movement of the manipulator from one point in the pipe to another, since the algorithm only specifies the movement to get the manipulator into the pipe. One possible solution is to plan the manipulator to move out of the pipe, by specifying the initial configuration as the directive configuration and then using the reverse Crawling Control, which replays the joint angles of the Crawling Control process in reverse time order. The task is thereby transformed into one that the SSCC algorithm can solve. However, there is no guarantee that the manipulator will not collide when moving out of the pipe, because the directive configuration is now the initial configuration of the movement and cannot be re-specified if a collision occurs in the Crawling Control.

Conclusion
In this article, inspired by the morphology of snakes, we propose the SSCC algorithm for path planning. The snake-like redundant manipulator can be controlled using SSCC to explore complex environments. The snake-like redundant manipulator with n × U open chains is designed to enhance adaptability. The manipulator imitates the snake's crawling behavior, moving along the collision-free configuration in a special way. Furthermore, we utilize the maximum deviation distance to optimize the RL model, improving the success rate of planning collision-free configurations. Finally, a series of experiments validates that the manipulators can explore complex environments with high success rates using SSCC.

Acknowledgment
The authors thank Qi Liu for his help with this article.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.