Car-following method based on inverse reinforcement learning for autonomous vehicle decision-making

Although much has been achieved in the field of automated driving, several problems remain to be solved. One of them is the difficulty of designing a car-following decision-making system for complex traffic conditions. In recent years, reinforcement learning has shown potential for solving sequential decision optimization problems. In this article, we establish the reward function R for each driver's data using an inverse reinforcement learning algorithm, visualize R, and then analyze driving characteristics and following strategies. Finally, we demonstrate the efficiency of the proposed method through simulation in a highway environment.


Introduction
Intelligent driving is a technical high point in industrial technology and is studied by various countries and major technology companies. Car following is one of the most significant and common conditions in manual driving, assisted driving, and unmanned driving.1 With the rapid growth of urban traffic, car following has become the primary condition encountered by drivers.2,3 Car-following models have been studied extensively since the 1950s,4 and current research spans fields such as vehicle engineering, traffic safety, big data and artificial intelligence, psychology, and cognition. Research on car-following behavior has gradually extended from specific operations such as acceleration and deceleration to perception, psychology, and physiology, and its methodology has broadened from early mathematical modeling to logistics, planning, transportation, cognitive science, neuroscience, data science, machine learning, and artificial intelligence.5,6 In 1950, Reuschel studied the car-following behavior of drivers from an operations research perspective,7 and Pipes formulated the first car-following problem in 1953.8,9 Existing car-following models (algorithms) can be divided into two categories. The first category is the explanatory car-following model: the model predetermines some physical quantities of the car-following process, describes them with a parameterized expression, and then determines the unknown parameters from statistics or experience. This type of model often requires assumptions about, and explanations of, the car-following process. The second category is the nonexplanatory car-following model, in which the car-following behavior of drivers is captured by a learning algorithm, that is, by fitting or induction over a large amount of data.
Explanatory models include the linear car-following model, distance inverse model, nonlinear car-following model, memory function model, expected distance model, and physiological-psychological model. In 1958 and 1959, Chandler10 and Herman11,12 respectively proposed linear car-following models. In 1959, Gazis et al. presented a distance inverse car-following model,13 and two years later they further proposed a nonlinear car-following model.14 In 1967, May and Keller fitted the nonlinear model to actual vehicle data under highway and tunnel conditions.15,16 In 1993, Ozaki divided driver motion into four stages: acceleration start, deceleration start, acceleration maximum, and deceleration maximum. When these stages were fitted separately, the reaction time depended strongly on the stage of the driving action; in particular, the reaction times differed considerably between the acceleration and deceleration stages. Ozaki suggested that a possible reason for this difference is the driver's use of the front car's taillight during deceleration.17 Lee introduced a memory function model in 1966, arguing that drivers respond to the integral of the relative speed of the front vehicle rather than to its instantaneous value, and he analyzed its stability. In 1972, Darroch and Rothery used spectral analysis to estimate the shape of the memory function from experimental data; they found that a Dirac delta function approximates the experimental data well, which in fact corresponds to the linear following model.18 In 1961, Helly suggested that drivers' driving strategy minimizes not only the relative speed but also the difference between the real and expected vehicle distances. In 1982, Gabard et al. used Helly's model in SITRA-B, a microscopic traffic flow model.19
In 1974 and 1988, Wiedemann and Leutzbach identified two unreasonable assumptions of the traditional car-following model: (1) in previous car-following models, the testing vehicle keeps following even at a large distance from the front vehicle; (2) previous models assumed that drivers had perfect perception and reaction, even to very small external stimuli. They therefore introduced a perceptual threshold defining the minimum environmental stimulus to which drivers can react. Evidently, the perceptual threshold increases monotonically with the distance to the front car; they also found that it differs between the acceleration and deceleration phases.20 Explanatory models can generally guarantee the safety of the car-following process, but they struggle to describe the highly nonlinear car-following behavior accurately, and they cannot adapt to different drivers or different conditions.21,22 With the development of artificial intelligence research, a variety of machine learning methods have become prominent. These methods have outstanding advantages in dealing with nonlinear problems,2,7 examples being convolutional neural networks, reinforcement learning (RL), and inverse reinforcement learning (IRL). A considerable number of researchers have begun to focus on car-following models based on machine learning.2,7 Richard S. Sutton proposed the temporal-difference (TD) learning method.23 Bradtke and Andrew G. Barto established two algorithms, least-squares TD (LSTD) and recursive least-squares TD (RLSTD), based on the theory of linear least-squares function approximation.24 Michail G. Lagoudakis and Ronald Parr proposed least-squares policy iteration (LSPI), combining value function approximation with linear architectures and approximate policy iteration. Xu et al. proposed a kernel-based least-squares policy iteration (KLSPI).25 Wei Xia et al. proposed a new control strategy for self-driving vehicles using a deep RL model, combining learning from the experience of a professional driver with a Q-learning algorithm using filtered experience replay.26 Pyeatt and Howe applied RL to learning racing behaviors in the Robot Auto Racing Simulator, the precursor of The Open Racing Car Simulator (TORCS) platform.27,28 Daniele et al. used a tabular Q-learning model to learn overtaking strategies on TORCS.29 Riedmiller proposed a neural RL method, neural fitted Q-iteration (NFQ), to generate control strategies for the pole-balancing and mountain-car tasks with few interactions.30 Zheng et al. established a 14-degree-of-freedom (DOF) dynamic model of an autonomous vehicle and used RL to build a decision-making system for autonomous driving.31 The nonexplanatory model, represented by the artificial neural network (ANN), fully captures the high nonlinearity of car-following behavior and has been shown by some researchers to be stable and safe under certain conditions (such as ramp and sinusoidal inputs). However, such models treat the driver as a "black box" and are not interpretable: theoretically analyzing whether the model is stable, or whether it has "bad spots," is difficult. Moreover, such models "clone" rather than "learn" driving strategies and thus have difficulty reflecting adaptability. When the working conditions or drivers change, only the weights between nodes change, and a direct relationship between the weights and driving behaviors is difficult to establish, so further analysis is also difficult. The lack of flexibility is another major problem with ANN-based car-following models: a network trained on one data set may not transfer well to another. This work therefore proposes a
learning algorithm with a certain degree of interpretability for the car-following model and establishes an anthropomorphic following model. RL and its associated IRL in machine learning provide a novel idea. Analyzing the car-following behavior of drivers with such a model, and proposing an intelligent following algorithm, have great value in many fields, such as road safety, driver-assistance systems, and intelligent driving. The explanatory model is too simple to accurately describe the highly nonlinear car-following behavior and cannot adapt to different drivers and working conditions. Although the nonexplanatory model represented by the ANN can fit complicated nonlinear relations, such models are not interpretable; because the driver is treated as a "black box," theoretical analysis of stability, or the establishment of a relationship between driving behaviors and the neural network structure for further analysis, becomes difficult. Therefore, to implement an anthropomorphic car-following model, a machine learning-based method is used to optimize the auto-following algorithm, which is of great value for research. The contributions of this article are briefly described as follows: (1) a learning car-following algorithm with a certain degree of interpretability, obtained by combining IRL with car-following data from a driving simulator; (2) an IRL algorithm designed to learn the reward function R of drivers from driving simulator data; (3) visualization of the reward function R of different drivers under different conditions; and (4) analysis of the similarities and differences, and optimization of the IRL algorithm. The remainder of this article is organized as follows: the second part covers reinforcement learning and inverse reinforcement learning; the third part presents the design of the IRL algorithm; the fourth part reports the experiment and analysis based on the simulation platform; and the last part gives the conclusion and future work.

Reinforcement learning
RL is a vital branch of machine learning, and a typical RL task is usually described as a Markov decision process. The machine (or agent) operates in an environment E with a state space S, where each state is a description of the environment that the agent can perceive. The actions the agent can perform constitute the action space A; a ∈ A is an action that the agent can take. After an action is taken, the state transition probability P transfers the environment from the current state to another state with a certain probability. At the same time, as the state transitions, a reward r is fed back from the environment to the agent according to the underlying reward function R.32 An RL task thus corresponds to the tetrad E = ⟨S, A, P, R⟩, where P: S × A × S → ℝ represents the state transition probability and R: S × A × S → ℝ represents the reward function. As shown in Figure 1, the agent observes state s and then performs action a; s transfers to the next state according to the transition probability P, and an instant reward r is obtained simultaneously. In RL, the agent continuously interacts with the environment and updates its strategy to learn a policy a = π(s).
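As a toy illustration of the tetrad E = ⟨S, A, P, R⟩ and the interaction loop, the following sketch defines a tiny two-state environment. The state names, transition probabilities, and rewards are all invented for illustration, and Python is used here (and in the later sketches) in place of the MATLAB environment mentioned below:

```python
import random

# Hypothetical two-state MDP: P maps (state, action) to a list of
# (next_state, probability) pairs; R maps (s, a, s') to an instant reward.
P = {
    ("near", "brake"):      [("far", 0.9), ("near", 0.1)],
    ("near", "accelerate"): [("near", 1.0)],
    ("far", "brake"):       [("far", 1.0)],
    ("far", "accelerate"):  [("near", 0.8), ("far", 0.2)],
}
R = lambda s, a, s2: 1.0 if s2 == "far" else -1.0

def step(s, a, rng=random):
    """Sample s' from P(. | s, a) and return (s', r), one agent-environment
    interaction of Figure 1."""
    outcomes = P[(s, a)]
    x, acc = rng.random(), 0.0
    for s2, p in outcomes:
        acc += p
        if x <= acc:
            return s2, R(s, a, s2)
    return outcomes[-1][0], R(s, a, outcomes[-1][0])
```

A policy a = π(s) would simply be a mapping from each state to an action; repeatedly calling `step` with it generates the state-reward sequence the agent learns from.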
The relative merit of a strategy depends on the cumulative reward of long-term execution rather than on the instant reward of a single action. Consequently, the RL task is to maximize the long-term cumulative reward generated by the policy. This work uses the γ-discounted cumulative reward to estimate the long-term cumulative reward, as shown in equation (1):

E[ Σ_{t=0}^{∞} γ^t r_t ]    (1)

where γ is the discount rate, a positive number less than 1, which represents the degree of emphasis the agent places on future rewards: the greater the value of γ, the more attention is paid to future rewards. r_t is the instant reward at step t, and E denotes the expectation over all random variables.
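The γ-discounted cumulative reward of equation (1) can be sketched directly for a finite reward sequence (the function name is ours):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum gamma^t * r_t over a finite reward sequence: equation (1),
    truncated to the available steps."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The larger gamma is, the more weight a future reward receives:
low = discounted_return([0.0, 0.0, 1.0], gamma=0.1)   # 0.1**2 = 0.01
high = discounted_return([0.0, 0.0, 1.0], gamma=0.9)  # 0.9**2 = 0.81
```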

Inverse reinforcement learning
The reward function R plays a crucial role in a deterministic RL task: the setting of R directly determines which strategy the agent will adopt. However, for many RL tasks, the reward function R cannot be predetermined, or the suitable state (strategy) for an agent is unknown. For the car-following decisions studied in this work, explicit R values are difficult to determine for different driver models, and which state (strategy) is good or bad is unclear; even when the relative merits of a strategy are known, specific reward function values are difficult to quantify. IRL starts from sample data provided by experts and inversely derives the reward function from them.33 The basic idea is to denote the strategy implied by the sample data by π* and any other strategy by π. The reward function is expressed as a linear function of the state s, that is, R(s) = ω^T s. Given the coefficient vector ω, the cumulative reward for a strategy π is shown in equation (2):

V(π) = E[ Σ_{t=0}^{∞} γ^t ω^T s_t | π ] = ω^T E[ Σ_{t=0}^{∞} γ^t s_t | π ]    (2)

The goal of IRL is to compute ω*: the coefficients are chosen so that the strategy of the example data π* outperforms every other strategy π, with the difference between the optimal sample and the other samples maximized. The objective function is shown in equation (3):

ω* = argmax_ω Σ_i [ V(π*) − V(π_i) ]    (3)
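Equation (2) factors into ω^T times a discounted feature sum, so the margin V(π*) − V(π) between an expert trajectory and an alternative one can be computed as below. This is a minimal sketch under the linear-reward assumption R(s) = ω^T s; the function names are ours:

```python
import numpy as np

def traj_features(states, gamma=0.9):
    """Discounted feature sum: sum_t gamma^t * s_t over one trajectory of
    state vectors (equation (2) without the leading omega^T)."""
    return sum((gamma ** t) * np.asarray(s, dtype=float)
               for t, s in enumerate(states))

def margin(omega, expert_states, other_states, gamma=0.9):
    """V(pi*) - V(pi) for the linear reward R(s) = omega^T s; IRL seeks the
    omega maximizing this margin over all alternatives (equation (3))."""
    return float(np.dot(omega, traj_features(expert_states, gamma)
                               - traj_features(other_states, gamma)))
```

With ω = (0, 1) (reward grows with the second state component), an expert trajectory holding a larger second component yields a positive margin over one holding a smaller value.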

Design of IRL algorithm
The main purpose of IRL is to obtain the reward function R of drivers. In this work, an IRL algorithm based on the max-margin algorithm is proposed.34 Following equations (2) and (3), the algorithm is divided into the following steps: 1. Determine the state space and apply the kernel function transformation. 2. Obtain the components R_ij of R and calculate the cumulative reward V_ij of each component. 3. Determine the weights of the R_ij and finally solve for R.

Determine the state space and transform kernel function
The physical quantities that reflect the car-following process are as follows: the car-following distance d (the leading distance between the front vehicle and the testing vehicle, in m), the velocity of the testing vehicle v (in km/h), the velocity of the front vehicle v_front, the acceleration of the testing vehicle a, and the acceleration of the front vehicle a_front. For data visualization, this work chooses the velocity of the testing vehicle v and the car-following distance d to form a two-dimensional state space, which constitutes the S in the quadruple E = ⟨S, A, P, R⟩. The 2-D features are transformed by a Gaussian radial kernel function and mapped into a high-dimensional feature space to capture the strong nonlinearity of the car-following process. After data preprocessing, the range of the testing vehicle velocity v is limited to (0, 130 km/h], and the range of the car-following distance d is (0, 350 m]. Over this range, the velocity v is divided into equal intervals (v_1, v_2, …, v_15) of width 9 km/h, as shown in equation (4):

v_i = 9i km/h, i = 1, 2, …, 15    (4)

The car-following distance d is likewise divided into equal intervals (d_1, d_2, …, d_36), as shown in equation (5):

d_j = (350/36) j m, j = 1, 2, …, 36    (5)

Given that the scales of the vehicle speed v and distance d are inconsistent, these data must be normalized. The state vector s = (25/9 · v, d) and the kernel vector s_ij = (25/9 · v_i, d_j) are defined, and the Gaussian radial kernel function is shown in equation (6):

K(s, s_ij) = exp(−‖s − s_ij‖² / (2σ²))    (6)

If σ² is selected properly, the space spanned by the kernel functions is neither overfitted nor underfitted. After experimental verification, σ² = 5 is chosen to keep the kernel functions in a relative equilibrium between underfitting and overfitting.
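The discretization and kernel mapping of equations (4) to (6) can be sketched as follows. The center formulas follow the reconstruction above and are therefore an assumption, not the paper's exact values:

```python
import numpy as np

# Assumed kernel centers: 15 speed centers 9 km/h apart and 36 distance
# centers over (0, 350 m], giving 15 * 36 = 540 kernels in total.
V_CENTERS = 9.0 * np.arange(1, 16)              # v_1 .. v_15 (km/h)
D_CENTERS = (350.0 / 36.0) * np.arange(1, 37)   # d_1 .. d_36 (m)
SIGMA2 = 5.0                                    # sigma^2 chosen in the paper

def state_vec(v, d):
    """Normalized state vector s = (25/9 * v, d), so that both coordinates
    span a comparable numeric range."""
    return np.array([25.0 / 9.0 * v, d])

def kernel(v, d, v_i, d_j, sigma2=SIGMA2):
    """Gaussian radial kernel K(s, s_ij) of equation (6)."""
    diff = state_vec(v, d) - state_vec(v_i, d_j)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma2)))

def features(v, d):
    """540-dimensional kernel feature vector for one raw state (v, d)."""
    return np.array([kernel(v, d, vi, dj)
                     for vi in V_CENTERS for dj in D_CENTERS])
```

Each raw state (v, d) thus becomes a 540-dimensional vector of kernel responses, one per center s_ij.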

Calculate reward function and cumulative reward
The required reward function R can be written as a linear combination of the 540 kernel functions built on the state vector, as shown in equation (7):

R(v, d) = Σ_{i=1}^{15} Σ_{j=1}^{36} θ_ij R_ij(v, d), with R_ij(v, d) = K(s, s_ij)    (7)

Next, the value of each component R_ij(v, d) of R(v, d) at each state s should be calculated; that is, for the driver's state sequence {s_1, s_2, …, s_N}, R_ij(s_l), 1 ≤ l ≤ N, is computed. The solving process is illustrated in Figure 2.
R_ij(s) is an instant reward; the cumulative reward V_ij(s) of each state measures the long-term reward obtainable from state s. Equation (8) shows the calculation of the cumulative reward:

V_ij(s_t) = Σ_{k=0}^{N−t} γ^k R_ij(s_{t+k})    (8)

where γ is the discount factor, which reflects the driver's discount rate. In this work, γ = 0.9 is selected.
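Given the instant-reward sequence of one component along a trajectory, the cumulative rewards of equation (8) for every state can be computed in a single backward pass (a sketch; the function name is ours):

```python
def component_returns(r_seq, gamma=0.9):
    """Turn the instant rewards R_ij(s_1..s_N) of one kernel component along
    a driver trajectory into cumulative rewards V_ij(s_t), per equation (8).
    Computed backwards so the whole trajectory costs O(N)."""
    V = [0.0] * len(r_seq)
    acc = 0.0
    for t in range(len(r_seq) - 1, -1, -1):
        acc = r_seq[t] + gamma * acc   # V(s_t) = r_t + gamma * V(s_{t+1})
        V[t] = acc
    return V
```

Running this once per component over the recorded trajectory yields the V_ij values needed by the weight-determination step.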

Determine the weights of the reward function
After calculating R_ij(s), the next step is to determine the parameters θ_ij. Under the driving strategy π*, the long-term cumulative reward V(s|π*) should be superior to the reward under any other strategy V(s|π), as shown in equation (9):

V(s|π*) ≥ V(s|π^(i)), for all i, where V(s|π) = Σ_{i,j} θ_ij V_ij(s|π)    (9)

where θ is the vector spanned by the θ_ij, s*_t represents the car-following state sequence under the optimal strategy π*, s^(i)_t represents the car-following state sequence under the other strategies, a*_t is the action sequence under the optimal strategy π*, and a^(i)_t is the action sequence under the other strategies. The correspondence between these variables is shown in Figure 3.
Equation (9) can be transformed into an optimization problem under inequality constraints, as shown in equation (10):

max_θ Σ_i [ V(s|π*) − V(s|π^(i)) ]  subject to  V(s|π*) ≥ V(s|π^(i)), ∀i    (10)

Furthermore, if only one alternative strategy is considered for each state, equation (10) can be simplified to equation (11):

max_θ Σ_{t=1}^{N} [ V(s*_t|π*) − V(s*_t|π) ]  subject to  V(s*_t|π*) ≥ V(s*_t|π), t = 1, …, N    (11)

The number of samples is sufficient (N ≈ 60000 for each test); therefore, equation (11) is used in the calculation. For each optimal car-following state, one of the other car-following actions is randomly selected for the solution, which on average achieves the effect of traversing the other strategies. The selection of the alternative car-following action is based on the statistics of the testing vehicle's acceleration, which provide the range [a_min, a_max]. This interval is divided into 10 points to form the action set, so the alternative strategy is randomly selected from the nine nonoptimal car-following actions. The parameters θ_ij are solved by the Lagrange multiplier method or the "linprog" function in MATLAB, which addresses the linear program in the standard form shown in equation (12):

min_θ c^T θ  subject to  Aθ ≤ b    (12)
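Using SciPy's `linprog` in place of the MATLAB function of the same name, the weight determination of equations (11) and (12) can be sketched as follows. The array shapes and the box bounds on θ are our assumptions, added to keep the linear program bounded:

```python
import numpy as np
from scipy.optimize import linprog

def solve_weights(V_opt, V_alt):
    """Max-margin weight fit: V_opt[k, m] and V_alt[k, m] hold the cumulative
    reward of kernel component m along sample k under the optimal and the
    randomly chosen alternative action, respectively. Maximize the summed
    margin subject to every per-sample margin being non-negative."""
    diff = V_opt - V_alt                      # (K, M) per-component margins
    c = -diff.sum(axis=0)                     # linprog minimizes, so negate
    res = linprog(c,
                  A_ub=-diff,                 # -diff @ theta <= 0 <=> margin >= 0
                  b_ub=np.zeros(diff.shape[0]),
                  bounds=[(-1.0, 1.0)] * diff.shape[1],
                  method="highs")
    return res.x
```

With the margins pointing along each component, the solver pushes the corresponding weights to their upper bound, which is the expected max-margin behavior.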

Experiment setup
Hardware and software. The working conditions of the vehicle and the road environment must be precisely controlled to study drivers' car-following behavior under given conditions. Therefore, this work is based on the dynamic driving simulation test bench of Tsinghua University. The test bench is shown in Figure 4, and its system components are shown in Figure 5.
The hardware of the simulator consists of five parts: the simulation cockpit, the external visual environment simulation system, the vehicle motion simulation system, the sound environment simulation system, and the operation tactile sensation simulation system.35 The software of the driving simulator consists of six parts: the system control module, the environment control and scene creation module, the simulation calculation module, the input and output module, the graphics calculation and rendering module, and the actuator control module.36,37
Environment modeling. This work concerns the car-following behavior of drivers in a single lane (no lane change, no overtaking, and no traffic lights); therefore, a freeway is selected as the road scene. A two-lane road 200 km in length is designed. The road includes a fast lane, a slow lane, and an emergency lane with widths of 3.75, 3.75, and 2.5 m, respectively. The road model is shown in Figure 6.
Vehicle model. The most common vehicle type is chosen as the front car (with a BMW 3 Series as the template), and the dynamics model of the vehicle is generated by CarSim. In addition, the brake lights turn red when the vehicle decelerates, consistent with the real situation. The road scene during the actual testing process is shown in Figure 7. Three computer screens correspond to the three projection screens (middle, left, and right) of the external virtual environment system; these screens constitute the drivers' front view.
To intuitively demonstrate the effectiveness and generalization ability of the IRL algorithm, two subjects who have each been driving for more than 5 years are selected as drivers A and B. Under two different operating conditions, the New European Driving Cycle (NEDC) and Japan's 10-15 mode, car-following experiments are carried out on the driving simulator; the car-following data of drivers A and B are recorded, and the reward functions R of drivers A and B are visualized for the NEDC and Japan 10-15 conditions. For each working condition, two randomly selected trials are performed to verify the reproducibility of the results.

Experiment result
The entire space (v, d) is traversed, and the test driver's reward function is shown in Figure 8. The reward information of the drivers can be obtained by analyzing the shape of the reward function surface in Figure 8: a greater surface height over any point of the (v, d) plane indicates a greater instant reward value at that point.
The 3-D graphics are transformed into 2-D plans, as shown in Figures 9 to 12. Figure 9 shows the reward function of driver A under the NEDC condition (two randomly selected trials), and Figure 10 shows the reward function of driver A under the Japan 10-15 condition. Similarly, Figure 11 shows the reward function of driver B under the NEDC condition (two randomly selected trials), and Figure 12 shows the reward function of driver B under the Japan 10-15 condition.

Experiment analysis
A comparison of the results presented in Figures 9 to 12 shows that, as the speed increases, the distances corresponding to the peaks of the two reward functions increase. At a constant speed, the distance corresponding to the peak of driver A's reward function is large, whereas that of driver B is small. Therefore, the reward function of A generally lies close to the distance axis, and that of B lies close to the speed axis. This finding indicates that the car-following distance in A's driving strategy is large, whereas that in B's is small. In addition, the gradient of A's reward function is small, and that of B is large; the reward function of driver A is shown in Figure 13, and that of driver B in Figure 14. This indicates that A is less sensitive to changes in vehicle distance and speed, whereas B is more sensitive to such changes.

Figure 2. Solving process of R_ij(s_l).

Figure 3. Relationship between the variables of the max-margin algorithm based on IRL. IRL: inverse reinforcement learning.

Figure 7. Road scene of the actual test.

Figure 8. Diagram of the reward function.

Figure 9. Driver A (female): diagram of the reward function under the NEDC condition. (a) Result of randomly selected trial 1. (b) Result of randomly selected trial 2. NEDC: New European Driving Cycle.

Figure 10. Driver A (female): diagram of the reward function under the Japan 10-15 condition. (a) Result of randomly selected trial 1. (b) Result of randomly selected trial 2.

Figure 11. Driver B (male): diagram of the reward function under the NEDC condition. (a) Result of randomly selected trial 1. (b) Result of randomly selected trial 2. NEDC: New European Driving Cycle.

Figure 12. Driver B (male): diagram of the reward function under the Japan 10-15 condition. (a) Result of randomly selected trial 1. (b) Result of randomly selected trial 2.

Figure 13. Reward function of driver A.

Figure 14. Reward function of driver B.