Research on autonomous collision avoidance of merchant ship based on inverse reinforcement learning

To learn the optimal collision avoidance policy of merchant ships controlled by human experts, a finite-state Markov decision process model for ship collision avoidance is proposed based on an analysis of the collision avoidance mechanism, and an inverse reinforcement learning (IRL) method based on cross entropy and projection is proposed to obtain the optimal policy from expert demonstrations. Collision avoidance simulations in different ship encounters are conducted, and the results show that the policy obtained by the proposed IRL method closely reproduces the behavior of two kinds of human experts, which indicates that the proposed method can effectively learn the policy of human experts for ship collision avoidance.


Introduction
In robotics, various collision avoidance techniques have been widely tested in fields such as smart cars, military robots, entertainment robots, and service robots, in different environments. These collision avoidance methods are quite specific to individual scenarios. They can be broadly classified into two categories, that is, classical and reactive methods. 1 Early scholars mainly focused on classical methods, such as artificial potential field (APF), 2 cell decomposition, 3 roadmap planner, A* algorithm, 4,5 and so on. The major shortcomings of these classical methods are high computational cost and failure to respond to the uncertainty present in the environment, which leads to changing control instructions. In recent years, reactive methods have become the most popular tools for unmanned vehicle collision avoidance, including Q-learning, 6 artificial neural networks, genetic algorithm (GA), 7 particle swarm optimization (PSO), ant colony optimization, 8 some other evolutionary optimization algorithms, 9 and even model predictive control (MPC). 10 Especially for moving obstacles and multiple vehicles, MPC and sliding-mode control can achieve better robustness to disturbance. As reactive methods deal with the uncertainty present in dynamic environments much better than classical methods, most of the existing approaches for ship collision avoidance belong to reactive methods. 11 With the rapid development of intelligent ships, the problem of collision avoidance at sea has become more and more prominent. Scholars have carried out a lot of research on collision avoidance 12,13 of unmanned surface vehicles (USVs) in recent years, achieving good collision avoidance performance in relatively simple static environments. However, as the kinematics of merchant ships differ greatly from those of small USVs, the collision avoidance law for merchant ships is much more complex.
To obtain a proper collision avoidance action for a merchant ship in a specified encounter environment, the obvious solution is to establish a mapping from states to actions. For a single ship in dynamic environments, Q-value tables over state-action pairs have been put forward for simple discrete decision problems, such as optimal policy search and path planning. Based on the ship kinematics and Q-learning, Yoo and Kim 14 developed an automatic ship autopilot control program that steers from start points to end points among static obstacles, taking the currents into consideration. Chen et al. 15 treated the discretized ship rudder angle as the Q-learning action and the ship's position on a grid map as the state, and verified the effectiveness of Q-learning for collision avoidance path planning. Zheng et al. 16 established a Markov decision process (MDP) discrete state strategy optimization method based on multiweight apprenticeship learning, obtaining scheduling policies that perform close to experts' experience. Heuristic optimization-based algorithms, including GA 17 and PSO, 18 have clear and simple structures and are widely used in collision avoidance for intelligent unmanned vehicles. These algorithms usually search for collision-free paths according to the gradient descent direction of a set objective function. In addition, some hybrid methods have also been tested. Shen et al. 19 combined deep Q-learning and the A* algorithm to propose an intelligent collision avoidance method for unmanned vessels, considering the ship's characteristics and bumper areas. Human experience was introduced into the A* grid map to improve the search efficiency, which yields good collision avoidance performance in a complex environment.
Nowadays, machine learning and artificial intelligence are becoming important tools for solving real-time decision-making problems. With the recent development of deep reinforcement learning, scholars have also applied these methods to the control of unmanned ships. Based on deep Q-learning networks (DQN), Cheng and Zhang 20 proposed four kinds of objective functions that together constitute the reward function, and tested the resulting collision avoidance algorithm for vessels. Abbeel 21 put forward walking control policies for a quadruped robot using inverse reinforcement learning (IRL). With the application of deep learning in the deterministic policy gradient method, the decision-making actions of reinforcement learning can also be approximated as continuous actions using function approximators. Continuous action reinforcement learning methods, such as Deep Deterministic Policy Gradient (DDPG) and Asynchronous Advantage Actor-Critic (A3C), have been tested in control and decision-making problems. Xu et al. 22 used the DDPG method to learn collision avoidance behavior in the continuous state and action space, and obtained an effective collision avoidance strategy. Kim et al. 23 also applied the DDPG algorithm to obtain ship collision avoidance policies, using the relative motion parameters between the own ship and the target ships (other ships in the area except the own ship), and the distance between the own ship and the target track; this choice of state space reduces the complexity of the learning task. In the literature, 24 a constrained DQN is proposed to reduce the complexity of the action space by adding constraints based on some collision avoidance rules at sea, which improves the learning rate of DQN. Generally, machine learning methods have the advantage of strong learning ability but the disadvantage of requiring large numbers of training samples.
How to obtain and exploit the training samples with high efficiency and accuracy is the key issue in the application of machine learning in ship collision avoidance.
The International Regulations for Preventing Collisions at Sea (COLREGs) are the basic rules for ship collision avoidance at sea. To make decisions in different ship encounters for maritime safety, Li et al. 25 constructed a dynamic personifying intelligent decision-making structure for a vessel collision avoidance system, considering rules and human experience. Liu et al. 26 established the shortest path model to realize collision avoidance through path planning based on COLREGs. However, the practical collision avoidance of merchant ships has the following characteristics:
(1) Large size, large hysteresis and inertia. Since a merchant ship has the characteristics of large size, small redundancy space, and large hysteresis and inertia, it is difficult to generate a proper collision avoidance decision using conventional algorithms.
(2) Complexity and uncertainty of ship collision avoidance scenarios. The COLREGs do not specify all encounter scenarios and may even result in close-quarter situations in several encounter scenarios.
The above characteristics make the collision avoidance of merchant ships both specialized and complex. Therefore, the navigation experience of human experts is still of great significance for learning a collision avoidance policy. To make use of human experts' experience, the most challenging task is to obtain the reward function of the machine learning algorithm. Abbeel and Ng 27 proposed an appealing framework for apprenticeship learning. The reward function, while unknown to the apprentice, is assumed to be a linear combination of a set of state features, which can be observed directly. Although it may be difficult to define the reward function directly and correctly, it is usually much easier to specify the state features on which the reward function depends. With this setting in mind, Abbeel and Ng put forward an IRL algorithm to generate a policy that performs at least as well as a human expert with respect to the unknown reward function. Essentially, the IRL algorithm is an efficient method for mimicking the expert's behavior, and it has been widely tested on many kinds of robots.
To improve the safety and rationality of ship collision avoidance at sea, an IRL algorithm is proposed in this study to learn an approximate reward function and a collision avoidance policy that approaches the expert's demonstration operations. Two kinds of expert demonstration operations (safety and efficiency) are learned by the proposed IRL method in simulation tests, and the results indicate that the proposed IRL method can obtain a good collision avoidance policy that performs similarly to human experts.

Ship collision avoidance modeling
Collision avoidance of large merchant ships in open waters follows the principle of "using the rudder instead of the engine," that is, relying only on steering to realize collision avoidance. 28 Therefore, the ship is assumed to keep its service speed throughout the collision avoidance process in this article, and the rudder angle is the action in collision avoidance.
To ensure the accuracy of the collision avoidance model, the following four assumptions are made: (1) the speed of the ship is stable and constant, and the maneuverability of the ship is also stable; (2) the collision avoidance process can be simplified to three steering actions before the clearance, that is, two rudder commands to change the course and one rudder command to resume the voyage; (3) ship motions in three degrees of freedom are considered in the collision avoidance process, that is, surge, sway, and yaw; and (4) the ship is located in still water, and the impact of strong wind and waves is not considered.

Ship maneuverability model
In this study, the widely used KVLCC2 ship model 29 is selected as the object, and the ship maneuvering motion model is established to verify the training effect of machine learning on expert demonstration operation. Considering the accuracy and computational complexity, we establish the following nonlinear Nomoto model for KVLCC2

$$T\dot{r} + r + \alpha r^{3} = K\delta \qquad (1)$$

where $K$ and $T$ are the maneuverability indices of KVLCC2, $\alpha$ is the nonlinear coefficient, $\delta$ is the rudder angle, and $r$ and $\dot{r}$ are the heading rate and heading acceleration, respectively. The Nomoto model represents the relationship between the ship heading and the rudder, and is widely used in ship control. Assuming that $\eta = [x \; y \; \psi]^{T}$ and $\nu = [u \; v \; r]^{T}$ are the position and velocity vectors of the ship, the kinematic model of the ship is

$$\dot{\eta} = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix} \nu \qquad (2)$$

where $\psi$ is the heading angle. When the ship is sailing at the service speed, the surge speed $u \approx U$ and the sway speed $v \approx 0$, where $U$ is the service speed. Then, the final ship maneuverability model can be obtained based on equations (1) and (2).
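As a rough illustration of how such a maneuverability model can be stepped forward in time, the sketch below integrates the common first-order nonlinear Nomoto form $T\dot{r} + r + \alpha r^{3} = K\delta$ together with $\dot{x} = U\cos\psi$, $\dot{y} = U\sin\psi$, $\dot{\psi} = r$ (that is, $u \approx U$, $v \approx 0$) using explicit Euler steps. The function name and the values of `K`, `T`, `alpha`, `dt`, and `steps` are placeholders chosen for illustration only; they are not the identified KVLCC2 parameters.

```python
import math

def simulate_nomoto(delta_deg, K=0.05, T=50.0, alpha=10.0,
                    U=5.14, dt=0.5, steps=600):
    """Euler-integrate T*r_dot + r + alpha*r**3 = K*delta at constant speed U.

    K, T, alpha are illustrative placeholders, not KVLCC2 values.
    U is in m/s (5.14 m/s is roughly 10 knots); angles are in radians.
    Returns a list of (x, y, psi) tuples, one per time step.
    """
    delta = math.radians(delta_deg)
    x = y = psi = r = 0.0
    track = [(x, y, psi)]
    for _ in range(steps):
        # Heading-rate dynamics from the nonlinear Nomoto model.
        r_dot = (K * delta - r - alpha * r ** 3) / T
        r += r_dot * dt
        psi += r * dt
        # Kinematics with surge speed u ~ U and sway speed v ~ 0.
        x += U * math.cos(psi) * dt
        y += U * math.sin(psi) * dt
        track.append((x, y, psi))
    return track

straight = simulate_nomoto(0.0)    # zero rudder: the ship keeps its course
turning = simulate_nomoto(10.0)    # constant 10 degree rudder: steady turn
```

An explicit Euler step is used only because it keeps the sketch short; a smaller `dt` or a higher-order integrator would be the natural refinement.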

Modeling of ship collision avoidance process
As shown in Figure 1, the geodetic coordinate system is defined as $X$-$O$-$Y$, and the body-fixed coordinate system of the ship is defined as $x_0$-$o_0$-$y_0$. The heading angle $\psi$ is defined as the angle between the surge direction $u$ and the positive $x$-axis. The own ship's fixed coordinate system $x_1$-$o_1$-$y_1$ is established at the predicted collision point, and the relative position angle $q$ is defined as the angle between the course direction of the approaching ship and the positive $x_1$-axis. In terms of the mathematical expression of the state-action space, the ship encounter state, collision avoidance policy, and rudder actions would have to be expressed in a high-dimensional space. The state-action dimension therefore needs to be reduced as much as possible to avoid the curse of dimensionality in machine learning, and the ship collision avoidance process should be simplified to reduce the difficulty of learning.
Generally, collision encounters are detected by the perception system of a ship, and the responsibility for collision avoidance is determined based on COLREGs. If the own ship is the give-way ship and has the responsibility for collision avoidance, the decision result, that is, the rudder command, is applied at the point of the last-minute action to avoid collision and stabilize the course. The detailed process of a typical collision avoidance maneuver is shown in Figure 2.
Firstly, the first steering time is defined as the moment at which the own ship needs to steer to increase the heading rate and change the course for avoidance. Secondly, the second steering time is defined as the moment at which another rudder angle is applied in the opposite direction to rapidly reduce the heading rate once there is no collision risk. When the heading rate of the ship has been reduced to a certain extent, the rudder is returned to midships to keep the course; this is the third steering time. Finally, the own ship returns to the original path by trajectory tracking once the two ships have passed their closest positions, whose separation is the distance at the closest point of approach (DCPA).
Remark: DCPA is the minimum distance between the closest points of the own ship and the approaching ship in two-ship encounters, which is an important indicator of collision risk.
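For intuition, DCPA can be computed in closed form when both ships hold a constant velocity. The sketch below uses the textbook point-mass formula, which measures the distance between ship reference points rather than between hull outlines, so it is a simplification of the definition in the remark. The function name and argument layout are illustrative assumptions.

```python
import math

def dcpa_tcpa(own_pos, own_vel, tgt_pos, tgt_vel):
    """Return (DCPA, TCPA) for two point ships with constant velocities.

    Positions are (x, y) in metres, velocities are (vx, vy) in m/s.
    TCPA is clamped at zero, so a pair of ships that is already
    separating reports its current distance as DCPA.
    """
    rx, ry = tgt_pos[0] - own_pos[0], tgt_pos[1] - own_pos[1]
    vx, vy = tgt_vel[0] - own_vel[0], tgt_vel[1] - own_vel[1]
    v2 = vx * vx + vy * vy
    if v2 == 0.0:
        # No relative motion: the distance never changes.
        return math.hypot(rx, ry), 0.0
    tcpa = max(0.0, -(rx * vx + ry * vy) / v2)
    dcpa = math.hypot(rx + vx * tcpa, ry + vy * tcpa)
    return dcpa, tcpa

# Near head-on encounter with a 300 m lateral offset.
dcpa, tcpa = dcpa_tcpa((0.0, 0.0), (5.0, 0.0), (1000.0, 300.0), (-5.0, 0.0))
```

In this example the relative velocity is 10 m/s along the x-axis, so the ships are abeam after 100 s at a distance of 300 m.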
In summary, the actions to be decided include the first steering time, the first rudder angle, the second steering time, and the second rudder angle. The third steering time is decided automatically when the heading rate is reduced to a preset threshold.

Markov decision process of ship collision avoidance
A multistage MDP is shown in Figure 3 and can be described by the tuple $\{S, A, P, R\}$, where $S$ is the state, $A$ is the action, $P$ is the state-transition probability, and $R$ is the reward for the state action. The collision avoidance process can be described by a typical MDP.
Generally, the positions, speeds, and courses of the own ship and other ships can be used for the definition of the state $S$. With respect to the positions and courses, it is considered that the relative position and course of each target ship are limited. The circumference of the own ship is divided into seven sectors at 22.5°, 85°, 95°, 202.5°, 265°, 275°, and 337.5°, referring to the dividing method of collision avoidance responsibility in COLREGs, as shown in Figure 4(a). At the decision-making moment, the target ship is located in one of the seven sectors according to the relative position angle $q$, and the sector is coded as state $s_1$. With respect to the speed, the speed ratio of the own ship and the target ships is regarded as another state $s_2$, which is shown in Figure 4(b). Then, the state $S$ in the MDP consists of $s_1$ and $s_2$.
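The sector coding of $s_1$ amounts to a lookup of the relative position angle against the seven boundary angles. A minimal sketch follows; the sector numbering (index 0 for the sector that wraps through 0°, then increasing with the angle) is an assumption, since the numbering used in Figure 4(a) is not stated in the text.

```python
from bisect import bisect_right

# Sector boundaries in degrees, as listed in the text.
BOUNDS = [22.5, 85.0, 95.0, 202.5, 265.0, 275.0, 337.5]

def bearing_sector(q_deg):
    """Map a relative position angle to a sector index in 0..6.

    Index 0 is the sector [337.5, 360) U [0, 22.5); the numbering is
    illustrative and may differ from the coding in the original figure.
    """
    q = q_deg % 360.0
    # Angles at or beyond 337.5 degrees wrap around into sector 0.
    return bisect_right(BOUNDS, q) % len(BOUNDS)

codes = [bearing_sector(q) for q in (0.0, 90.0, 180.0, 270.0, 350.0)]
```

The speed-ratio state $s_2$ could be coded the same way once the ratio intervals of Figure 4(b) are known.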
$A = [a_1, a_2, \ldots, a_n]$ is the action space of collision avoidance; each action $a_j$ represents a possible action option for the current state $S$, which includes the rudder angle and the steering moment. To reduce the complexity of calculation, the rudder angle value is discretized in this study. After taking an action, the ship gets a reward $R: S \to \mathbb{R}$, where $R$ is a mapping function from the state $S$ to a real number. Assume that $\Pi$ denotes the set of rules for any possible selection of the action based on the state, and that a policy $\pi \in \Pi$ denotes a sequence of rules from the state to the action. Then, the goal of solving the MDP problem is to select a policy that maximizes the value function, that is, the discounted sum of rewards under this policy $\pi$ at the decision-making moment

$$V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\middle|\, \pi\right] \qquad (3)$$

where $V^{\pi}(s)$ is the state value function under the policy $\pi$, which represents the discounted sum of $R$, $\gamma$ is the discount factor that reduces the impact of future states on the current state, $E[\cdot]$ represents the expectation, and $P(s_t, a_t)$ represents the state-transition equations obtained from the established ship maneuverability model, which determine how $s_t$ evolves. Therefore, the optimization problem in the MDP is

$$\pi^{*} = \arg\max_{\pi \in \Pi} V^{\pi}(s) \qquad (4)$$

where $\pi^{*}$ is the optimal policy, which satisfies $V^{\pi^{*}}(s) \ge V^{\pi}(s)$ for all $\pi \in \Pi$, $s \in S$. Bellman 30 has proved the existence of the maximum value function $V^{\pi^{*}}(s)$, and it does not change with time in a certain environment.
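For a single simulated episode, the discounted sum of rewards $\sum_t \gamma^{t} R(s_t)$ that the value function averages over can be evaluated as below. This is a generic helper written for illustration, not code from the study; the function name is hypothetical.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25.
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging this quantity over many episodes generated under one policy gives a Monte Carlo estimate of that policy's value.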

Construction of the state features
The state features are the indicators of the MDP from which the final reward is constructed, and they are very important for reinforcement learning. In the collision avoidance of large ships at sea, the most significant indicators for characterizing the process of collision avoidance cover two major aspects: (1) DCPA, which is the distance between two ships at their closest point of approach, and (2) the maximum heading changes of the two ships. The former indicator represents the safety level of collision avoidance, while the latter represents the efficiency level.
Similarly, to reduce the complexity of the machine learning, finite-state features are set, as given in Table 1. Each state feature represents the proportion of collision avoidance samples that fall in a certain interval, relative to a large number of stochastic collision avoidance samples generated by simulation programs. In this way, all 27 state features of the ship collision avoidance process are defined.
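A feature of this kind is simply a normalized histogram. The sketch below computes interval proportions for one indicator; the bin edges are hypothetical, because the actual intervals are those of Table 1, which is not reproduced here.

```python
from bisect import bisect_right

def interval_proportions(values, edges):
    """Share of samples falling in each half-open interval [edges[i], edges[i+1]).

    Samples outside [edges[0], edges[-1]) are not assigned to any
    interval but still count in the denominator.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v < edges[-1]:
            counts[bisect_right(edges, v) - 1] += 1
    n = len(values)
    return [c / n for c in counts] if n else [0.0] * len(counts)

# Hypothetical DCPA samples (m) and hypothetical 100 m wide intervals.
dcpa_samples = [520.0, 580.0, 450.0, 700.0]
features = interval_proportions(dcpa_samples, [400, 500, 600, 700, 800])
```

Concatenating the proportions for DCPA and for the heading changes of the own ship and the target ship would give one feature vector per policy.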

Stochastic policy optimization based on cross entropy
Reinforcement learning for an MDP is a search for the optimal policy. The idea of the noise cross-entropy (CE) algorithm 31 is to randomize a deterministic optimization problem and solve it using rare event simulation and optimization techniques. The main steps of CE are as follows: (1) generating random data samples and (2) generating new samples with a certain distribution and optimizing the sample distribution.
Without loss of generality, the reward function $R$ of reinforcement learning can be represented by a linear combination function

$$R(s) = W \cdot \phi(s) = \sum_{i=1}^{n} \omega_i \phi_i(s) \qquad (5)$$

where $W = (\omega_1, \omega_2, \ldots, \omega_n)$ is the weight matrix for the state $s$ and $\phi$ is the state feature. For a random weight matrix $W = (\omega_1, \omega_2, \ldots, \omega_n)$, a set of random policies $\Pi = [\pi_1, \pi_2, \ldots, \pi_n]$ is generated by the CE algorithm. Then, the state features $\phi$ are obtained by executing the action $a_i$ mapped by each policy $\pi_i$ under the current state $s_i$. After that, the immediate reward can be calculated by equation (5), and the value function can be updated by equation (3).
In each iteration of CE, the policy set $\Pi_t$ is sampled from a Gaussian distribution in high-dimensional space, and its mean and variance are updated as follows

$$\mu_{t+1} = \frac{1}{|I_t|} \sum_{i \in I_t} \pi_i \qquad (6)$$

$$\sigma_{t+1}^{2} = \frac{1}{|I_t|} \sum_{i \in I_t} (\pi_i - \mu_{t+1})^{T} (\pi_i - \mu_{t+1}) \qquad (7)$$

where $I_t$ is the index set of the selected policies and $\beta$ is the sample selection ratio of the policy, that is, only the fraction $\beta$ of policies with the largest value function is taken in each iteration, and the sample mean and variance of these policies are used as the mean and variance of the random policies for the next iteration. The CE algorithm converges fast, but it easily falls into a suboptimal solution. To deal with this problem, Szita and Lörincz 32 introduce a noise component $Z$ in the variance and achieve better global optimization results

$$\sigma_{t+1}^{2} = \frac{1}{|I_t|} \sum_{i \in I_t} (\pi_i - \mu_{t+1})^{T} (\pi_i - \mu_{t+1}) + Z_{t+1} \qquad (8)$$

where $Z_{t+1} = C \cdot (t+1) + d$ and $C$, $d$ are constants. The calculation steps of the noise CE algorithm are given in Algorithm 1.
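The loop described above (sample, keep the best fraction, refit the Gaussian, add noise to the variance) can be sketched as follows. This is a generic noisy cross-entropy search with an independent Gaussian per dimension, written under assumptions: the function name, the initial variance, the sample count, and the toy objective are all illustrative, and `score` stands in for the value function of a sampled policy.

```python
import random

def noisy_cross_entropy(score, dim, n_samples=100, beta=0.1, iters=30,
                        C=0.2, d=1.0, init_var=100.0, seed=0):
    """Maximize `score` over vectors of length `dim` with a noisy CE search.

    beta  : fraction of samples kept as the elite set in each iteration
    C, d  : constants of the additive variance noise Z = C * (t + 1) + d
    Returns the final mean vector of the sampling distribution.
    """
    rng = random.Random(seed)
    mu = [0.0] * dim
    var = [init_var] * dim
    n_elite = max(1, int(beta * n_samples))
    for t in range(iters):
        # Draw candidates from the current per-dimension Gaussian.
        samples = [[rng.gauss(m, v ** 0.5) for m, v in zip(mu, var)]
                   for _ in range(n_samples)]
        samples.sort(key=score, reverse=True)
        elite = samples[:n_elite]
        # Refit mean and variance on the elite set, then add the noise term.
        mu = [sum(s[j] for s in elite) / n_elite for j in range(dim)]
        z = C * (t + 1) + d
        var = [sum((s[j] - mu[j]) ** 2 for s in elite) / n_elite + z
               for j in range(dim)]
    return mu

def toy_score(p):
    # Concave toy objective with its maximum at (3, -1).
    return -((p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2)

best_plain = noisy_cross_entropy(toy_score, dim=2, C=0.0, d=0.0)
best_noisy = noisy_cross_entropy(toy_score, dim=2)
```

With `C = d = 0` the routine reduces to plain CE, which contracts quickly on this toy problem; the noise term keeps the sampling variance from collapsing, at the cost of a less sharply concentrated final distribution.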

Expert policy approximation based on projection method
As a search process of reinforcement learning, the noise CE method needs to search for the optimal policy on the premise that the weight matrix of the reward function is defined. The projection method 27,33 is used in this study to obtain the approximated reward of the expert policy, using the weight matrix $W$ as the medium. Firstly, the state feature expectations $\mu_E$ of the expert demonstration samples are calculated, where $k$ is the expert demonstration sample size and $\phi(s_l)$ is the state feature of sample $s$ at time $l$. Then, the weight vector $W^{(0)}$ is initialized randomly, and an initial strategy $\pi^{(0)}$ is generated randomly. Based on $W^{(1)} = \mu_E - \ldots$ Moreover, the flowchart of the projection algorithm is shown in Figure 5.
The state feature expectation corresponding to the policy p Ã can become close to the state feature expectation of the expert demonstration sample based on the projection method.
In summary, the reward function is obtained by projection-based IRL, and the CE-based RL method is used to search the optimal policy. The reward function is updated by the difference between the expectations of state features of the current policy and the expert demonstration until the convergence condition is satisfied. The final flowchart of the policy search is shown in Figure 6.
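One iteration of the reward update can be sketched as below, assuming the standard projection step of Abbeel and Ng (reference 27): the running estimate $\bar{\mu}$ is projected onto the line through $\bar{\mu}$ and the feature expectation of the newest policy, and the new weight vector is the remaining gap to the expert. Whether the method used here matches this form in every detail is an assumption; all names are illustrative.

```python
import math

def projection_step(mu_expert, mu_bar, mu_new):
    """One projection update in apprenticeship learning (Abbeel-Ng form).

    mu_expert : feature expectation of the expert demonstrations
    mu_bar    : current projected estimate
    mu_new    : feature expectation of the most recently learned policy
    Returns (next mu_bar, weight vector w, margin t = ||w||).
    """
    diff = [a - b for a, b in zip(mu_new, mu_bar)]
    gap = [a - b for a, b in zip(mu_expert, mu_bar)]
    denom = sum(x * x for x in diff)
    # If the new policy adds no new direction, keep the old estimate.
    step = sum(x * y for x, y in zip(diff, gap)) / denom if denom else 0.0
    mu_bar_next = [b + step * x for b, x in zip(mu_bar, diff)]
    w = [a - b for a, b in zip(mu_expert, mu_bar_next)]
    t = math.sqrt(sum(x * x for x in w))
    return mu_bar_next, w, t

mu_bar, w, t = projection_step([1.0, 1.0], [0.0, 0.0], [2.0, 0.0])
```

In a full loop, `w` would define the reward passed to the CE policy search, the resulting policy would be simulated to obtain the next `mu_new`, and iteration would stop once `t` falls below a chosen tolerance.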

Acquisition of the expert demonstration samples
To conduct the IRL simulation experiments, large amounts of expert demonstration samples are acquired. A simulation software for ship collision avoidance operation based on the established ship maneuverability model is developed, as shown in Figure 7.
The software generates different encounter scenarios and judges the responsibility of collision avoidance according to the COLREGs. If the own ship has the responsibility of avoidance, the experts need to drag the horizontal slider to control the rudder angle of own ship to change the course. The software will automatically add the simulation results into training samples.

Validation of the proposed projection-based inverse reinforcement learning method
To reduce the randomness, fixed encounter scenarios are adopted. Over the range of 0° to 360°, as shown in Figure 4(a), the relative position angle of the target ship is sampled at an interval of 2.88°. The other ship's speed is set to 4, 6, 8, 12, 14, 18, 25, and 30 knots, respectively. The own ship's speed is set to 10 knots, that is, the service speed. In this way, 1000 encounter scenarios are designed and used in the simulation software to obtain demonstrations by experts; these typical encounter scenarios are called base scenarios. On the one hand, the movements of ships are so complex that it is impossible to enumerate all encounter scenarios, so the testing set scenarios in our research were classified according to the encounter situation judgment methods in COLREGs. On the other hand, as the directions and speeds of target ships are the major concerns of captains, the states of the MDP in our research consist only of $s_1$ and $s_2$, and the training set data were generated by software, which removes many random events. As a result, the state features can be determined by the initial states and policies.
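The count of 1000 base scenarios is consistent with a full grid of 125 relative position angles (360° / 2.88°) and 8 target speeds. The sketch below enumerates such a grid; that the scenarios were generated exactly this way is an inference from the stated numbers, not something the text spells out.

```python
ANGLE_STEP_DEG = 2.88
TARGET_SPEEDS_KN = [4, 6, 8, 12, 14, 18, 25, 30]
OWN_SPEED_KN = 10  # service speed of the own ship

# 360 / 2.88 = 125 relative position angles.
n_angles = round(360 / ANGLE_STEP_DEG)

# One scenario per (relative position angle, target speed) pair.
scenarios = [(round(i * ANGLE_STEP_DEG, 2), v)
             for i in range(n_angles)
             for v in TARGET_SPEEDS_KN]
```

Each tuple would still need an initial range and a target course before it defines a complete encounter; those values are not given in the text.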
In addition, the experts are divided into two categories, that is, the safety experts and the efficiency experts. The safety experts give priority to safety in collision avoidance: they usually use a larger rudder angle to achieve a larger DCPA and heading change, so as to keep the two ships as far apart as possible. The efficiency experts give priority to efficiency: they usually use a smaller rudder angle to shorten the ship's voyage on the premise that the two ships remain safe, so as to improve the economy.
Demonstrations of these two kinds of experts are obtained by four sailors and experts using the simulation software. The discount factor is set as $\gamma = 0.99$. The sample selection ratio of the CE algorithm is set as $\beta = 10\%$, and the noise factor is set as $Z_t = 0.2 \cdot t + 1$.
The obtained 1000 samples of different encounter scenarios are used for learning of each policy p in the IRL method. The software runs on a computer with I7-6700 (four-core, eight-thread) CPU and the features in IRL method converge in about 6000 s. The feature deviations between the demonstrations and the learned policy are shown in Figure 8.
It can be seen from Figure 8 that the feature deviations can converge to a good level within about eight iterations. Moreover, the comparison of the state features between the convergent policy and the expert demonstrations is shown in Figures 9 and 10.  In Figure 9, the maximum error between the state features of the safety expert demonstrations (the white bar) and those of the policy trained by IRL (the gray bar) is less than 5%, which indicates that the proposed IRL method can obtain a collision avoidance policy similar to the safety expert. In the aspect of the DCPA, the expectation of the sixth feature represents the sample proportion of DCPA between 500 m and 600 m, which is the largest expectation with respect to DCPA and shows that the collision avoidance samples given by the safety experts are mostly concentrated in this area. The sample distribution of DCPA between 600 m and 1200 m is more uniform than that between other ranges, which indicates that the searched policies achieve a larger distance between two ships during more collision avoidances, corresponding to larger DCPAs.
In the aspect of the heading change, the largest expectations are the 16th, 22nd, 21st, and 27th feature expectations. The 16th and 22nd feature expectations are the sample proportions in which the own ship and the target ships keep their course, respectively, corresponding to the course-keeping scenarios in which the target ship has the avoidance responsibility. The 21st and 27th feature expectations represent the sample proportions in which the maximum heading change of the own ship and the target ships, respectively, lies between 40° and 50°. It can be seen that the safety experts prefer to control the heading change between 40° and 50°.
In Figure 10, the maximum error between the state features of the efficiency expert demonstrations (the white bar) and those of the policy trained by IRL (the gray bar) is also less than 5%. In the aspect of the DCPA, the fifth feature expectation is the largest expectation with respect to DCPA, which indicates that the collision avoidance samples given by efficiency experts are more likely to be completed with a medium DCPA (about four to five times the length of the ship). In the aspect of the heading change, apart from the 16th and 22nd features for keeping the course, the largest expectations are the 20th and 26th feature expectations, which means that more samples have a heading change between 30° and 40°. This indicates that efficiency experts prefer smaller DCPAs and heading changes to achieve a higher efficiency level.

Simulation verification of random collision avoidance
To show the decision-making performance of the proposed IRL method in different collision avoidance scenarios more intuitively, 1000 base scenarios, including head-on, crossing, and overtaking encounters, are selected as typical ship encounter scenarios. The policies learned through IRL are used to control ships during collision avoidance, and the results are compared with the demonstrations of safety experts and efficiency experts in the same encounter scenario, as shown in Figures 11 to 16. In Figure 11, it can be seen that the rudder angle of the IRL policy is similar to that of the expert demonstrations, although the steering time is slightly different.
In addition, the DCPA and maximum heading angle changes of the expert demonstrations and learned policies are plotted as a box diagram, as shown in Figure 17. It can be seen that the DCPA values of the safety experts and the corresponding safety IRL policy are about 480 m, which are larger than those of the efficiency experts and the corresponding efficiency IRL policy (about 400 m), indicating that the safety experts and the safety IRL policy achieve safer avoidance results. In contrast, the efficiency experts and the efficiency IRL policy adopt smaller rudder changes to realize collision avoidance with smaller heading angle changes, which means that the own ship can return to the original route faster after collision avoidance is completed. Both the own ship and the target ship were controlled by the same policy in the simulation software. For example, the target ship in the safety experts' demonstrations is controlled by the safety IRL policy. For the same policy, the statistical average heading changes of the other ships are smaller than those of the own ship, since the speeds of the other ships in most samples are higher than that of the own ship and the experts tend to steer in advance with smaller rudder angles for high-speed ships.

Simulation comparisons of the proposed inverse reinforcement learning method and normal reinforcement learning method
To compare the performance of IRL and normal reinforcement learning applied in collision avoidance scenarios, normal reinforcement learning was also tested. It is difficult to define the reward functions in collision avoidance.
As the state feature of the MDP is a 27-dimensional vector, a linear combination reward function based on the state features can be defined as follows

$$R = W \cdot \phi$$

where $W = (\omega_1, \omega_2, \ldots, \omega_n)$ is the weight vector and $\phi$ is the state feature of the test samples generated by reinforcement learning. As the weight $W$ influences the reward function value directly, it can be defined according to the safety requirements for navigation. The feature expectations of reinforcement learning in the specific collision avoidance scenarios are shown in Figure 18. How to conduct collision avoidance according to COLREGs is the major challenge for reinforcement learning in our research. On the one hand, in some encounter scenarios, ships should obey COLREGs and turn in a prescribed direction rather than the other direction to avoid a potential collision. On the other hand, the responsibility for collision avoidance is so complex that, in some encounter scenarios, the own ship need not take avoiding action at all. As a result, it is difficult for reinforcement learning to generate proper policies. Therefore, we develop an expert system that judges whether an avoidance action is right based on COLREGs. If the avoidance action does not obey COLREGs, the expert system generates a determination factor that acts as a hard constraint. In Figure 18, the state features of reinforcement learning are similar to the state features of the efficiency expert, showing that the weight vector of the reward function favors efficiency over safety while still satisfying the safety requirements of COLREGs. Similarly, other optimization algorithms, such as A*, APF, and GA, also need hard constraints based on the expert system of COLREGs.
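One simple way to combine a linear reward with a rule-based hard constraint is to override the reward whenever the rule check fails. The sketch below shows that pattern only; how the expert system's determination factor actually enters the reward is not specified in the text, so the `obeys_colregs` flag and the penalty value are hypothetical.

```python
def constrained_reward(weights, features, obeys_colregs, penalty=-1.0e6):
    """Linear reward W . phi, replaced by a large penalty on a rule violation.

    obeys_colregs is assumed to come from a separate rule checker;
    the penalty magnitude is an arbitrary illustrative choice.
    """
    if not obeys_colregs:
        return penalty
    return sum(w * f for w, f in zip(weights, features))

r_ok = constrained_reward([0.5, 0.5], [1.0, 2.0], obeys_colregs=True)
r_bad = constrained_reward([0.5, 0.5], [1.0, 2.0], obeys_colregs=False)
```

A penalty that dominates every achievable linear reward makes rule-violating actions unattractive to any reward-maximizing search, which is the effect a hard constraint is meant to have.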
On the contrary, it is relatively easy for human to control a ship to avoid collision in most encounter scenarios, especially in some complex scenarios. Human experts' prior knowledge about collision avoidance is so valuable that it could improve the validity and practicability of algorithm. IRL is more suitable to collect collision avoidance policies.
In summary, the IRL algorithm proposed in this article can readily obtain the decision-making policies of human experts, so that the algorithm achieves collision avoidance performance similar to that of human operators.

Conclusions
An IRL method through CE-based policy optimization and projection-based policy approximation is proposed in this study to realize ship collision avoidance. The main work of this article is summarized as follows:
1. The ship maneuverability model is established, and the expert demonstration operation software is developed to obtain collision avoidance samples through the expert operation.
2. The distributions of DCPA and maximum heading angle change are taken as the state features, and the collision avoidance policy of expert demonstrations is obtained by the proposed IRL method. The learned policy performs similarly to the expert demonstrations, which indicates that the proposed IRL method is suitable for collision avoidance policy training of merchant ships.
However, the practicability of the IRL method also depends on the reasonableness of the expert demonstrations. Therefore, it is necessary to follow the operating habits of captains of real merchant ships and to collect real operation data extensively to expand the samples for IRL, so as to find a reasonable trade-off between safety and efficiency for autonomous collision avoidance. Further research on the proposed IRL method will focus on the collection of real ship navigation data.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.