Research on autonomous underwater vehicle wall following based on reinforcement learning and multi-sonar weighted round robin mode

When an autonomous underwater vehicle follows a wall, a common problem is interference between the sonars mounted on the vehicle. This article proposes a novel weighted polling work mode (also called “weighted round robin mode”) that can independently identify the environment, dynamically establish an environmental model, and switch the operating frequency of each sonar. The dynamic weighted polling mode solves the sonar interference problem, and dynamically switching the sonar operating frequency improves the efficiency of wall following. An interpolation algorithm based on velocity is used to time-register the data from ranging sonars operating at different frequencies, which solves the asynchrony among the sonars and lets the system output data at the rate of the high-frequency sonar. With a reinforcement learning algorithm, the autonomous underwater vehicle can follow the wall at a set distance using the distances obtained from the polling mode. Finally, a tank test verifies the effectiveness of the algorithm.


Introduction
Water conveyance tunnels have become an important transportation route in hydraulic engineering. 1 When detecting wall cracks in water conveyance tunnels, the ability to follow the wall at a set distance is the basic premise of the safe operation of an autonomous underwater vehicle (AUV). 2 Ranging sonar is widely used to measure the distance to obstacles: the transducer actively emits acoustic waves and obtains the obstacle distance by receiving the echo reflected by the obstacle. 3 Efficient access to stable and accurate distance data is an important performance index of a ranging sonar.
The water conveyance tunnel inspection AUV (AUV-T; Figure 1) in this study is equipped with eight ranging sonars. The ranging sonars are mounted at the top, bottom, left, and right positions of the bow and stern to measure the distance of the AUV from the surrounding walls. The sonar model is DYW-50/200-NB, with a range of 0.6-120 m at 200 kHz and a half-power angle of 7.5°. If the channel width is less than 0.6 m, the stability and accuracy of the sonars decrease rapidly due to the interference caused by the echo.
Usually, ranging sonars work in a mode in which several sonars emit acoustic waves simultaneously at a fixed frequency and acquire obstacle information, 4 but the sonar signals in different directions then interfere with each other. Experiments show that once the channel width falls below a certain threshold, the stability and accuracy of the ranging sonar data decrease rapidly as the channel narrows, and the data can even become erroneous. Using a polling mode avoids interference between the sonars, but it greatly reduces the rate at which data are acquired and thus reduces the obstacle avoidance efficiency of the AUV. Reducing the polling interval improves efficiency, but when the channel is narrow, a shorter polling interval again causes mutual interference between the sonars, and accurate data cannot be obtained.
On the other hand, the different sampling frequencies of the sensors under the dynamic weighted polling mode (DWPM) cause asynchrony problems. From an engineering point of view, existing time registration methods, such as least squares, extrapolation, and maximum entropy, have limitations. These methods use the sampling frequency of the low-frequency sensor as the standard, which reduces both the utilization of the measured data and the accuracy of the system, and would therefore negate the benefit of the weighted polling mode proposed in this article.
Recently, there has been considerable research on sonar interference and on the application of reinforcement learning to AUVs. Ohya et al. constructed two ultrasonic ranging systems to investigate the influence of the characteristics of the sensing system. 5 The characteristics of the two systems differ from each other. Li et al. proposed an ultrasonic ranging method using a discrete chaotic phase-modulated signal. 6 The chaotic phase-modulated signal showed sharp autocorrelation and flat cross-correlation. Kleeman presented a sonar system that can produce accurate measurements and on-the-fly single-cycle classification of planes, corners, and edges. 3 That work showed how to use double-pulse coding of transmitted pulses to simultaneously suppress interference and classify. Kleeman also proposed an approach to rejecting interference between sonar systems based on identifying a transmitter by sending a double pulse with known separation. 7 This capability was demonstrated experimentally. Browne and Kleeman proposed a sonar ring refreshing at 60 Hz for a 5.7-m range, 4 which results in lower latency and denser measurements. Huang et al. designed environment states and obstacle avoidance behaviors, 8 and then used reinforcement learning to select the state-action combinations. Simulation results showed that the AUV can meet the requirements of safe navigation. Liu et al. adopted reinforcement learning to control an AUV; 9 Q-learning, a back-propagation neural network, and an artificial potential field were integrated to implement avoidance planning. A simulation test verified the validity and feasibility of the motion planning.
It can be seen that current research on sonar is mostly limited to improving sonar performance and does not readily solve the mutual interference between sonars in a narrow channel, while the application of sonar information in AUVs is limited to obstacle avoidance, with essentially no research on wall following. In view of these shortcomings of the current sonar working modes, a DWPM for multi-sonar AUVs is proposed in this article. In this mode, the system can independently identify the complexity of the environment, dynamically establish the environment model, and switch the working frequency of the sonars. Polling avoids interference between the sonars and improves the accuracy of the sonar data. The system establishes a dynamic weighted frequency equation from the velocity component and the obstacle distance in each sonar direction and dynamically adjusts the working frequency, so that each sonar adaptively raises or lowers its working frequency according to the environment variables in its direction. When the speed is higher or the obstacle is closer, the obstacle distance in that direction is acquired more quickly. An interpolation algorithm based on velocity time-registers the data from ranging sonars operating at different frequencies, which solves the asynchrony among the ranging sonars, and the system outputs the obstacle distance at the rate of the high-frequency sonar. This approach avoids the asynchrony caused by the differing sampling frequencies of the sonars and allows efficient data output. When following the wall, the distance between the AUV and the wall obtained from the polling mode is taken as the input of the reinforcement learning algorithm. According to the output action command, the AUV performs the corresponding yaw movement to realize wall following.
The rest of the article is organized as follows. The steps of the multi-sonar DWPM are presented in the second section. The data fusion and time registration algorithm is proposed in the third section. The application of reinforcement learning to AUV wall following is shown in the fourth section. The experimental results and their quantitative analysis are given in the fifth section. Finally, the sixth section concludes and summarizes the article.

Dynamic weighted polling mode
The multi-sonar weighted polling mode steps presented in this article are shown in Figure 2.

Static safe distance
The static safe distance is the extent of the positioning error area in the sonar mounting direction. 10 Usually, the positioning error area of an AUV is an ellipse. As shown in Figure 3, the parameters of the ellipse include the standard deviation and variance of the ranging sonars, the spreading factor, and the AUV heading angle. In the corresponding equation, $\sigma_e$ is the standard deviation of the ranging sonar mounted on the left (or right) side, and $\sigma_n$ is the standard deviation of the ranging sonar mounted on the bow (or stern); $\sigma_e^2$ and $\sigma_n^2$ are the ranging variances, and $\sigma_{en}$ and $\sigma_{ne}$ are the covariances of the ranging sonars; $a$ is the semi-major axis of the positioning error ellipse, and $b$ is its semi-minor axis; the bow angle is obtained in real time from the attitude sensor; and $\sigma_0$ is the spreading factor, which is obtained empirically and is used to expand the error area. In the two-dimensional case, $\sigma_0 = 2.15$ is taken to correspond to a credibility of 95% and $\sigma_0 = 3.03$ to a credibility of 99%. The center of the ellipse is the current estimated position of the AUV.

Safety alert distance
The alert distance threshold equation consists of a static part and a velocity part. In the equation, $h_i$ is the safety alert distance threshold of the AUV in direction $i$ (bow, stern, port, or starboard; the same below), $d$ is the static safe distance, $a_i$ is the static distance correction factor, $v_i$ is the velocity, and $b_i$ is the speed-distance correction factor.

Dynamic sonar sampling frequency
The safe distance trigger factor in the dynamic sonar sampling frequency equation is determined by the relationship between the obstacle distance detected by the sonar and the safety alert distance. In the equation, $f_i$ is the sonar sampling frequency in direction $i$, $f_0$ is the basic sampling frequency of the sonar, $h$ is the safe distance trigger factor, $v_i$ is the velocity, $m_i$ is the speed-frequency correction factor, $s_i$ is the obstacle distance detected by the sonar, $n_i$ is the distance-frequency correction factor, and $g(x)$ is the external interface function.
When the safety alert distance threshold is not reached, the AUV polls at the basic sampling frequency, and the obstacle information does not trigger local obstacle avoidance planning. When the threshold is reached, the dynamic sampling frequency mode is triggered, and the system applies a different sampling frequency to the sonar in each direction depending on the speed and the obstacle distance in that direction. In addition, the equation includes a unified external interface function $g(x)$ for adjusting the sampling frequency under complex conditions. For example, if the supply voltage is low, the sampling frequency of all sonars can be reduced through the external interface function to save energy. $g(x)$ is set to 1 when no external adjustment is required. Through the interface function, the system can use the AUV state and sea condition information to make a secondary, finer adjustment of the sonar sampling frequency.
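The threshold and frequency rules described in the last two subsections can be illustrated with a short Python sketch. The functional forms used here are assumptions made only for illustration, since the text defines the symbols but the equations themselves are not reproduced: the threshold is taken as a static term plus a velocity term, and the frequency as the basic frequency plus a speed term and an inverse-distance term that are switched on by a 0/1 trigger factor. The names `DirectionParams`, `alert_threshold`, and `sampling_frequency`, the cap `f_max`, and all numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DirectionParams:
    """Per-direction correction factors (symbols follow the text)."""
    a: float  # static distance correction factor a_i
    b: float  # speed-distance correction factor b_i
    m: float  # speed-frequency correction factor m_i
    n: float  # distance-frequency correction factor n_i


def alert_threshold(d: float, v: float, p: DirectionParams) -> float:
    """ASSUMED form: static part plus velocity part, h_i = a_i*d + b_i*|v_i|."""
    return p.a * d + p.b * abs(v)


def sampling_frequency(f0: float, d: float, v: float, s: float,
                       p: DirectionParams, g: float = 1.0,
                       f_max: float = float("inf")) -> float:
    """ASSUMED form: f_i = g * min(f_max, f0 + trigger*(m_i*|v_i| + n_i/s_i)).

    s must be positive (the sonar has a nonzero minimum range).
    """
    # Trigger factor: 1 once the obstacle is inside the alert distance, else 0.
    trigger = 1.0 if s <= alert_threshold(d, v, p) else 0.0
    boosted = f0 + trigger * (p.m * abs(v) + p.n / s)
    # g is the external interface factor (1.0 when no adjustment is needed).
    return g * min(f_max, boosted)


params = DirectionParams(a=1.0, b=2.0, m=4.0, n=3.0)  # made-up values
far = sampling_frequency(f0=2.0, d=1.0, v=0.5, s=5.0, p=params)
near = sampling_frequency(f0=2.0, d=1.0, v=0.5, s=1.5, p=params)
print(far, near)
```

With these made-up factors, an obstacle outside the alert distance leaves the sonar at the basic frequency, while an obstacle inside it raises the frequency; passing `g=0.5` would halve every result, which mirrors the low-voltage case described above.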

Data fusion and time registration
There are three cases of weighted polling mode as shown in Figure 4.

Mode 1
When the distance in all directions is greater than the safety alert distance, the sonars are polled at the basic sampling frequency $f_a$. In this polling mode, the bow and port sonars emit sound waves first, and after the interval time $t_a$, the stern and starboard sonars emit sound waves.

Mode 2
When the sonars in some or all directions take the dynamic sampling frequency and the resulting sampling frequency is the same in every direction (denoted $f_b$), the sonars work in the polling mode. In this polling mode, the bow and port sonars emit sound waves first, and after the interval time $t_b$ ($t_b < t_a$), the stern and starboard sonars emit sound waves.
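Modes 1 and 2 share the same two-group pattern and differ only in the frequency and the stagger interval. A minimal Python sketch of that pattern follows; it assumes that each group fires once per period 1/f and that the second group is delayed by the stagger interval, which is one reading of the description rather than a confirmed detail. The function name `polling_schedule` and the numerical values are hypothetical.

```python
GROUP_A = ("bow", "port")
GROUP_B = ("stern", "starboard")


def polling_schedule(f: float, t_interval: float, cycles: int):
    """Return (time, group) firing events for a two-group staggered poll.

    ASSUMPTION: each group fires once per period 1/f, and group B fires
    t_interval seconds after group A within the same period.
    """
    period = 1.0 / f
    if not 0.0 < t_interval < period:
        raise ValueError("stagger interval must lie inside one polling period")
    events = []
    for k in range(cycles):
        events.append((k * period, GROUP_A))
        events.append((k * period + t_interval, GROUP_B))
    return events


# Mode 1 style: basic frequency with a longer stagger (made-up numbers).
mode1 = polling_schedule(f=2.0, t_interval=0.25, cycles=2)
# Mode 2 style: higher common frequency with a shorter stagger.
mode2 = polling_schedule(f=4.0, t_interval=0.125, cycles=2)
for t, group in mode1 + mode2:
    print(t, group)
```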

Mode 3
When the sonars in some or all directions take the dynamic sampling frequency and the sampling frequencies in the respective directions are not all the same, each sonar samples at its own frequency. To avoid interference between sonars facing opposite directions, the following algorithm is used.
Assume that the two sonars facing opposite directions are sonar 1 and sonar 2, with real-time sampling frequencies $f_1$ and $f_2$ ($f_1 \neq f_2$) and real-time sampling intervals $\Delta t_1 = 1/f_1$ and $\Delta t_2 = 1/f_2$. The smaller real-time sampling interval is $\Delta t_{\min} = \min(\Delta t_1, \Delta t_2)$; the upper limit of the sampling frequency of the ranging sonar is $f_{\max}$, and the lower limit is $f_{\min}$ (the basic sampling frequency); the lower limit of the sampling interval is $t_{\min}$, and the upper limit is $t_{\max}$; the total working times of the two sonars are $t_1$ and $t_2$, respectively; and $t$ is the total working time of the system. In the embedded operating system "VxWorks," the operating period of the obstacle avoidance system is set to $t_0$, where $t_0$ is a common divisor of all sonar sampling intervals, that is, $\Delta t_1 = p \cdot t_0$ ($p = 1, 2, 3, \ldots$) and $\Delta t_2 = q \cdot t_0$ ($q = 1, 2, 3, \ldots$). The delay of one $t_0$ cycle is implemented by a recursive call through the watchdog timer. When $t_1 + \Delta t_1 = t_2 + \Delta t_2$ (i.e. the two sonars would next sample at the same time), $\Delta t_{\min}$ is replaced by $1.5\,\Delta t_{\min}$, so as to avoid mutual interference between the sonars facing opposite directions.
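The collision rule for two opposed sonars can be sketched in Python as follows. This is a simplified illustration, not the VxWorks implementation: times are expressed in multiples of $t_0$, the stretch by 1.5 is assumed to apply only to the one cycle in which a coincidence is predicted, and the starting offset between the two sonars is an added assumption (it should be nonzero so the first pings do not coincide). The function name `schedule_opposed_pair` is hypothetical.

```python
import math


def schedule_opposed_pair(dt1, dt2, offset, horizon, stretch=1.5):
    """Firing times for two opposed sonars, in units of the base period t0.

    Sonar 1 first fires at 0 and sonar 2 at `offset`. Whenever the next
    firing times would coincide, the sonar with the shorter interval has
    that one interval stretched (ASSUMED to be a one-cycle adjustment).
    """
    fires1, fires2 = [0.0], [float(offset)]
    next1, next2 = fires1[-1] + dt1, fires2[-1] + dt2

    def resolve(n1, n2):
        # t1 + dt1 == t2 + dt2: delay the sonar with the smaller interval.
        if math.isclose(n1, n2, abs_tol=1e-9):
            if dt1 <= dt2:
                n1 = fires1[-1] + stretch * dt1
            else:
                n2 = fires2[-1] + stretch * dt2
        return n1, n2

    next1, next2 = resolve(next1, next2)
    while min(next1, next2) <= horizon:
        if next1 < next2:
            fires1.append(next1)
            next1 = fires1[-1] + dt1
        else:
            fires2.append(next2)
            next2 = fires2[-1] + dt2
        next1, next2 = resolve(next1, next2)
    return fires1, fires2


f1, f2 = schedule_opposed_pair(dt1=2, dt2=5, offset=1, horizon=12)
print("sonar 1:", f1)
print("sonar 2:", f2)
```

In this example the two sonars never ping at the same instant: each time sonar 1 would land on a firing time of sonar 2, its interval for that cycle becomes 3 instead of 2.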
In all three cases, asynchrony arises from the differing sampling frequencies of the sonars, and time registration is performed by a fusion algorithm based on velocity interpolation. Normally, the AUV moves slowly and its speed does not change suddenly. Therefore, the distance in the low-frequency signal can be interpolated from the corresponding velocity component and time interval in that direction, so that the system can output data at the rate of the high-frequency signal.
Assume that the frequency of the high-frequency signal is $f_h$ and its total operating time is $t_h$, that the frequency of the low-frequency signal is $f_l$ and its total operating time is $t_l$, and that the velocity component in the direction of the low-frequency sonar is $v_l$. The output distance is then obtained by velocity interpolation, where $s_l$ is the obstacle distance obtained by interpolating the low-frequency sensor and $s_l'$ is the obstacle distance measured by the low-frequency sensor at its last sampling moment. Whenever the high-frequency sensor samples, the interpolated value for the low-frequency sensor is also exported to the system. With the obstacle distance obtained by velocity interpolation, the system can output the obstacle distance in all directions at the rate of the high-frequency signal.
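A minimal Python sketch of the velocity-based interpolation is given below. It assumes the simplest dead-reckoning form, in which the last low-frequency measurement is corrected by the distance travelled since it was taken; the sign convention (positive velocity means motion toward the obstacle) and the function name `interpolate_low_freq` are assumptions, not details confirmed by the text.

```python
def interpolate_low_freq(s_last: float, v_l: float, t_h: float, t_l: float) -> float:
    """Estimate the low-frequency sonar distance at a high-frequency tick.

    ASSUMED form: s_l = s_last - v_l * (t_h - t_l), where
      s_last : last distance measured by the low-frequency sonar (m)
      v_l    : velocity component along that sonar's direction (m/s),
               positive when moving toward the obstacle (assumed convention)
      t_h    : time of the current high-frequency sample (s)
      t_l    : time of the last low-frequency sample (s)
    """
    return s_last - v_l * (t_h - t_l)


# Low-frequency sonar measured 5.0 m at t = 2.0 s; the AUV closes at 0.5 m/s.
# The high-frequency sonar ticks every 0.2 s, so two estimates are filled in.
for t_h in (2.2, 2.4):
    print(t_h, interpolate_low_freq(s_last=5.0, v_l=0.5, t_h=t_h, t_l=2.0))
```

Because the AUV is slow and its speed changes gradually, the velocity can be treated as constant over one low-frequency interval, which is what makes this linear correction reasonable.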

Application of reinforcement learning in AUV wall following
AUV wall following is achieved by adjusting the heading of the AUV while it inspects the walls of a water conveyance tunnel for cracks. The AUV sails in tunnels whose environment is unknown, so the desired heading cannot be set in advance. Instead, the AUV obtains the desired heading in real time through a reinforcement learning algorithm. The input of the algorithm is the distance of the AUV from the wall, which is obtained accurately from the DWPM, and the output is the appropriate desired heading.
Reinforcement learning 11 is combined with the artificial potential field 12 to achieve optimal control of the wall-following task. In this article, a BP neural network and the Q-learning algorithm are combined. [13][14][15] The output of each network corresponds to the Q value of an action, that is, $Q(x, a)$. The relation that defines the Q function holds only once the optimal policy has been obtained. In the learning phase, an error signal is formed in which $Q(x_{t+1}, a_t)$ is the Q value corresponding to the next state, and this error is minimized by adjusting the weights of the network. When Q-learning is realized by a BP neural network, the weights are adjusted according to this error.
The choice of action is evaluated by the value of the reinforcement function, and the external reinforcement value is determined by the potential field method. First, the resultant force on the AUV at time $t$ is calculated. As shown in Figure 5, $d_t$ is the distance from the AUV to the wall calculated by the DWPM at time $t$, and $d_0$ is the desired distance, that is, the following distance from the AUV to the wall. The resultant force on the AUV at time $t-1$ is expressed in the same way. The evaluation function of the following distance between the AUV and the wall is denoted $\Delta F(t)$. When $\Delta F(t) < 0$, the distance between the AUV and the wall is approaching the desired distance, and the action should be rewarded. When $\Delta F(t) > 0$, the distance is moving away from the desired distance, and the action should be punished. The reinforcement signal $r(t)$ is defined accordingly.
The wall-following behavior of the AUV is divided into nine actions: turning left with the maximum output, turning left 30°, turning left 20°, turning left 10°, going straight, turning right 10°, turning right 20°, turning right 30°, and turning right with the maximum output. During wall following, the AUV obtains the accurate distance from the wall according to the DWPM and selects the appropriate action using the reinforcement learning algorithm. Through suitable control strategies, [16][17][18] such as fuzzy dynamic surface control 19 and backstepping sliding mode control, [20][21][22] the AUV can achieve precise wall following in accordance with the control instructions.
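The reward logic and the Q-learning error described in this section can be sketched in Python. Every functional form here is an assumption chosen for illustration: the potential is taken as proportional to the absolute distance error, the evaluation function as the change in that potential between two steps, the reinforcement signal as +1, -1, or 0, and the error signal as the standard textbook Q-learning temporal-difference error with discount factor `gamma`. The BP network is omitted, and Q values are passed in as plain numbers. All names are hypothetical.

```python
# Nine discrete heading actions, as listed in the text.
ACTIONS = ["left_max", "left_30", "left_20", "left_10", "straight",
           "right_10", "right_20", "right_30", "right_max"]


def potential(d: float, d0: float, k: float = 1.0) -> float:
    """ASSUMED potential: grows with the distance error |d - d0|."""
    return k * abs(d - d0)


def reward(d_prev: float, d_now: float, d0: float) -> float:
    """ASSUMED reinforcement signal based on the sign of delta F."""
    delta_f = potential(d_now, d0) - potential(d_prev, d0)
    if delta_f < 0:    # moved closer to the desired distance: reward
        return 1.0
    if delta_f > 0:    # moved away from the desired distance: punish
        return -1.0
    return 0.0


def td_error(q_sa: float, q_next: list, r: float, gamma: float = 0.9) -> float:
    """Standard Q-learning error: r + gamma * max_a' Q(x', a') - Q(x, a)."""
    return r + gamma * max(q_next) - q_sa


# Desired following distance 2.0 m; the AUV moved from 3.0 m to 2.5 m.
r = reward(d_prev=3.0, d_now=2.5, d0=2.0)
q_next = [0.1] * len(ACTIONS)
q_next[ACTIONS.index("straight")] = 0.5
print(r, td_error(q_sa=0.2, q_next=q_next, r=r))
```

In a network-based implementation, this error would be back-propagated to update the weights of the output corresponding to the chosen action.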

Experiment
As shown in Figure 6, experiments are carried out in the pool to verify the effectiveness of the proposed DWPM of sonars. The sonars are fixed on a custom bracket facing up, down, left, and right, and the bracket is fixed on the x-y carriage. The carriage can realize precise motion control in the two-dimensional plane, so the motion of the AUV can be simulated by the x-y carriage.
As shown in Figure 7, the DWPM of the sonars is realized by software written in Visual C++ 6.0, which controls the working frequency of the sonars and acquires the data of multiple sonars.
Because the upper sonar is close to the water surface, the upper and lower sonars have slight interference, so the left and right sonar data are taken as an example for analysis. The data obtained from the dock test (about 5 m wide) under static conditions are shown in Figures 8 and 9. "L" indicates the distance obtained by the left sonar, and "R" indicates the distance obtained by the right sonar.
The conventional polling mode refers to the polling work after the same time interval between the left and right sonars. When the polling mode is not used, the interference between the sonars is relatively large. The standard deviation of the left and right sonar is 9.52 and 1.92, respectively. The standard deviation of the left and right sonar with conventional polling mode is 0.027 and 0.023, respectively, and the standard deviation of the left and right sonar with DWPM is 0.023 and 0.007, respectively. DWPM can obtain more data in the same time under the premise of ensuring the same stability and accuracy as the conventional polling mode.
In the pool (about 30 m wide), the x-y carriage moves in the X and Y directions simultaneously (the Y direction is along the sonar beam direction, and the X direction is perpendicular to it). The speed in the X direction is $v_x = 1.0$ m/s, and the sonar data obtained at different speeds in the Y direction are shown in Figure 10. The carriage speeds in the upper left, upper right, lower left, and lower right panels of Figure 10 are $v_y = 0.1, 0.5, 1, 1.5$ m/s, respectively.
When $v_y = 0.5$ m/s, the average speeds measured by the left and right sonars are 0.5122 and 0.4982 m/s, and the errors relative to the actual value (0.5 m/s) are 2.44% and 0.36%. When $v_y = 1$ m/s, the average speeds measured by the left and right sonars are 1.0358 and 1.0366 m/s, and the errors relative to the actual value (1 m/s) are 3.58% and 3.66%. When $v_y = 1.5$ m/s, the average speeds measured by the left and right sonars are 1.5077 and 1.4922 m/s, and the errors relative to the actual value (1.5 m/s) are 0.51% and 0.52%. It can be seen that sonars working in the multi-sonar DWPM can still obtain stable and accurate data while the vehicle is moving.
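The percentage errors above follow from the relative error between each measured average speed and the commanded carriage speed. The short Python check below recomputes them from the average speeds reported in the text; the function name `relative_error_percent` is hypothetical.

```python
def relative_error_percent(measured: float, actual: float) -> float:
    """Relative error |measured - actual| / actual, expressed in percent."""
    return abs(measured - actual) / actual * 100.0


# Commanded Y speed (m/s) -> (left, right) average speeds from the sonars.
cases = {
    0.5: (0.5122, 0.4982),
    1.0: (1.0358, 1.0366),
    1.5: (1.5077, 1.4922),
}

for actual, (left, right) in cases.items():
    err_left = relative_error_percent(left, actual)
    err_right = relative_error_percent(right, actual)
    print(f"v_y = {actual} m/s: left {err_left:.2f}%, right {err_right:.2f}%")
```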
As shown in Figure 11, the reinforcement learning algorithm is verified in the pool, and the effectiveness of following the wall is shown in Figure 12.
It can be seen that AUV can follow the wall stably through the motion instructions obtained by the reinforcement learning algorithm, and the following error is less than 0.5 m. The reinforcement learning algorithm proposed in this article is effective.

Conclusion
Through the multi-sonar weighted polling mode proposed in this article, we can obtain the wall distance information dynamically and efficiently and avoid the mutual interference between the sonar. Through the time registration algorithm based on the speed interpolation, we can avoid the asynchronous problems caused by sensor frequency inconsistency. In the process of wall following, the distance between AUV and wall obtained according to the DWPM is used as the input of reinforcement learning algorithm, and the corresponding yaw motion is executed according to the output action, so as to realize the wall following. Finally, the effectiveness of the algorithm proposed in this article is verified by the pool test.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.