Connected and automated vehicle control at unsignalized intersection based on deep reinforcement learning in vehicle-to-infrastructure environment

To reduce the number of vehicle collisions and the average travel time when connected and automated vehicles pass through an unsignalized intersection, this article presents an improved Double Dueling Deep Q Network method with a Convolutional Neural Network and Long Short-Term Memory. The method uses a multi-step reward and penalty scheme, together with a positive and negative reward experience replay buffer, to alleviate the sparse reward problem. The proposed method is validated in a simulation environment with different traffic flows and market penetration rates under mixed traffic of automated vehicles and human-driven vehicles. The results show that, compared with traditional signal control methods, the proposed method improves the convergence and stability of the algorithm, reduces the number of collisions, and reduces the average travel time under different traffic conditions.


Introduction
Traffic congestion and safety have long been the focus of traffic management authorities and drivers. With the continuous increase of car ownership around the world, traffic congestion becomes more frequent, and the number of traffic accidents increases significantly. Therefore, to effectively alleviate congestion and reduce the occurrence of vehicle accidents is an urgent problem to be solved.
Vehicle-to-infrastructure (V2I) technology can effectively improve vehicle safety, reduce unnecessary stops, 1 and reduce fuel consumption and exhaust emissions of vehicles at intersections. 2 5G networks can greatly improve the efficiency of data collection, storage, and processing in vehicle infrastructure integration, giving this technology broader application scenarios.
An automated vehicle (AV) can drive safely and efficiently by receiving control commands from controllers in real time through wireless communication, which supports the application of automated driving technology at intersections. It is expected that by 2035, AVs may account for about 25% of the new car market. Research shows that online controlled transportation can significantly reduce traffic accidents and improve road traffic efficiency. 3 However, as AVs are introduced, there will be a period of mixed traffic flow of AVs and human-driven vehicles (HVs), which will then gradually evolve into an environment consisting entirely of AVs. Therefore, it is important to study the intersection control problem under mixed traffic flow of AVs and HVs.
For the control of unsignalized intersection, the current research mainly focuses on designing vehicle control algorithm to control the vehicle movement process. 4 In the study of micro-control at unsignalized intersections, the method of optimizing the model or setting safety constraints in the simulation model to improve the safety of vehicles is adopted by most of the studies, while there are few studies on the method of directly coordinating vehicle driving behavior in real time to avoid collision.
Reinforcement learning has the advantage of autonomous learning by agents, and is appropriate for solving decision-making problems with high-dimensional state spaces and high-dimensional action spaces. 5 Over the years, deep reinforcement learning (DRL) has evolved into a more mature research framework. In traffic control, 6 automatic driving, 7 and other fields, DRL has obtained some meaningful research results.
AV control at unsignalized intersections refers to the optimization of the driving strategies of AVs using micro-control methods, so as to minimize the collisions and delay of vehicles within the control range of unsignalized intersections. Therefore, focusing on the control of unsignalized intersections, this article proposes an improved Double Dueling Deep Q Network method, that is, the 3DQN method, to reduce the number of vehicle collisions and improve traffic efficiency.
Thus, considering the mixed situation of AVs and HVs, this article designs a connected and automated vehicle (CAV) control method based on DRL for unsignalized intersections with V2I technology. The proposed approach provides a new control method for CAVs driving through an unsignalized intersection. The main contributions of this article are as follows:
1. Aiming at the micro-control problem of left-turning CAVs at an unsignalized intersection, a control method based on an improved 3DQN method is presented. The control method covers the left-turning CAV approaching, inside, and leaving the intersection.
2. In the proposed 3DQN method, which combines the double deep Q network and the dueling deep Q network, important features are extracted with a convolutional neural network, and an advantageous strategy is obtained from historical and current features with a Long Short-Term Memory (LSTM) network. A new multi-step reward and penalty method is proposed to address the scarcity of experience of vehicle collisions and vehicle passing. At the same time, the sparse reward problem is alleviated using a positive and negative reward experience replay buffer (PNRERB).
3. A central control method and a multi-agent control method for unsignalized intersections are designed based on the improved 3DQN method. When constructing the DRL model, the state of the model is processed to reflect the particularity of the problem. In addition, this article uses the micro-simulation software VISSIM to build a virtual environment and uses it as the agent-learning environment. The left-turn and straight-through vehicles in the simulation environment are used as the learning agents, so that the agents can learn independently in the virtual environment.
Finally, to verify the effectiveness of agent training in different environments, this article analyzes the effect of unsignalized intersections for different traffic flow levels at different market penetration rates of 0, 0.2, 0.4, 0.6, and 0.8.
The remainder of this article is organized as follows. The section ''Literature review'' summarizes the current state of research on control and DRL for CAVs at unsignalized intersections. The section ''Problem formulation'' describes the control problem in V2I environments and the assumptions made in this article, along with the state space, action space, and reward function used for DRL. The section ''Improved double dueling deep Q network 3DQN'' discusses the proposed improved 3DQN method with Convolutional Neural Network and LSTM (3DQN-CNN-LSTM), including the multi-step reward and penalty method and the positive and negative sampling experience pools. The section ''Experiment'' describes the simulation experiments conducted and analyzes the experimental results. The section ''Conclusion'' concludes the article and identifies future research directions.

Unsignalized intersection control of CAV
The aim of intelligent CAV control at unsignalized intersections is to reduce vehicle collisions and delay by optimizing the driving strategy of CAVs. To achieve safe and efficient passage of vehicles through unsignalized intersections, researchers mainly design vehicle control algorithms that control the movement of vehicles. A simulation experiment was designed by Shao. 8 In an environment with possible conflicts, by optimizing the driving strategy of vehicles and studying the influence of sight distance conditions and warning conditions on drivers' collision avoidance, a collision avoidance model for uncontrolled intersections was designed. The results indicate that the model can improve vehicle safety, but the more complicated situation of mixed straight and left-turn driving is not considered. A distributed and collective intelligence framework is proposed by Kalantari et al. 9 to provide navigation for vehicles at intersections. Intersection simulations show that the proposed framework can reduce collisions and vehicle travel time at the same time. However, the paper did not analyze the optimization effect of the control methods under different market penetration rates. To solve the problem of vehicle collision at unsignalized intersections, an improved reinforcement learning method was proposed by Isele et al. 10 Simulation experiments verify that the proposed method can effectively ensure the safety of vehicles at unsignalized intersections. However, that research considers only the case of fully autonomous vehicles.
A two-car collision judgment model based on time to collision (TTC) is established by Duan et al. 11 to reduce collisions within the range of unsignalized intersection. The current state of the vehicle can be adjusted by the proposed model through estimating the possible collision area in advance. With the results of simulation, it is shown that the model can reduce the collisions effectively.
Wu et al. 12 proposed a multi-agent learning method, decentralized coordination learning of autonomous intersection management (DCL-AIM), to minimize intersection delay subject to collision-free constraints, and verified it by simulation experiments. The experimental results show that the proposed method can reduce vehicle delay effectively, with safety constraints included to avoid vehicle collisions. In the paper by Li et al., 13 two unsignalized intersection control methods, namely a priority-based method and a discrete forward rolling optimal control (DFROC) method, are proposed by coordinating the driving behavior of vehicles within the control range of the intersection. Both methods ensure the safe driving of vehicles and improve traffic efficiency, and simulation shows that both can reduce vehicle delay.
In addition, to reduce vehicle collision at unsignalized intersections, a lot of methods have been proposed. Xu et al. 14 proposed a distributed no collision cooperation method, which coordinated the driving strategies of multiple vehicles at the same time to realize the vehicle no collision passing. Through numerical simulation experiments, it was found that the proposed method can realize vehicles passing through the intersection without any collision. Moreau et al. 15 proposed a Bayes curve optimization method. The problem of avoiding collision with obstacles was transformed into an optimization problem. Lagrange method and gradient method were used to solve the problem. It was shown that the method can reduce the occurrence of vehicle unsafe behavior. However, for unsignalized intersections, in most of the studies, avoiding vehicle collision was considered as the optimization objective, and collision avoidance constraints were set as the constraints in the model. In addition, in the current research, the vehicle collision optimization problem was considered in the case of fully CAVs. However, the mixed driving of AV and HV was considered rarely.
In conclusion, the weaknesses of current research are as follows:
1. Most current research on the unsignalized intersection control problem considers, in simulation experiments, only the situation in which all vehicles are CAVs, while the situation with mixed AVs and HVs is seldom considered. However, the latter situation might be the main traffic state in the future.
2. Vehicle efficiency, which is described as the total throughput of vehicles at an intersection within a given period of time, can be regarded as a key evaluation indicator for unsignalized intersection control. However, much of the research takes fuel consumption and exhaust emissions as the optimization objectives, while vehicle efficiency is considered relatively rarely. In addition, most researchers improve vehicle safety by setting safety constraints in the optimization model or simulation model, while work that directly coordinates vehicle driving behavior in real time to achieve collision avoidance is relatively lacking.
In view of the above problems, aiming at the control problem of unsignalized intersection, this article proposes a reinforcement learning model to study the control problem of unsignalized intersection. Therefore, the methods of DRL will be reviewed below.

DRL
Reinforcement learning 16 is favored by researchers because of the advantages of agent autonomous learning. To solve practical problems more effectively, the advantages of deep learning, such as priority in solving the multidimensional state space and multidimensional action space decision problem, are fully utilized. The application of reinforcement learning has been developed to a new stage, which is known as DRL. Through years of study by researchers, a relatively mature framework of DRL has been proposed, with research findings in many fields, such as robotics, 17 traffic control, 18 automatic driving, 19 and so on.
In view of different characteristics of actual problems, various improvements in DRL have been proposed by researchers to solve practical problems more efficiently. In addition, when solving practical problems, it is necessary to consider the specific situation and choose the most appropriate method. DRL methods can be divided into four categories: (1) Value function-based DRL (VF-DRL) method, (2) Policy gradient-based DRL (PG-DRL) method, (3) Value function and policy gradient-based DRL (VFPG-DRL) method, and (4) Multi-agent DRL (MADRL) method. Since this article is mainly based on DQN, a classical value function-based reinforcement learning method, the VF-DRL method will be highlighted among the four methods, while the other methods will be briefly described.
Q-learning is one of the VF-DRL methods. 20 The agent interacts with the environment, stores the historical data in a Q-table, and updates the Q-table to learn the optimal strategy. Q-learning can also be applied to practical problems. For instance, to optimize the traffic flow of highways, Walraven et al. 21 described the problem as a Markov decision process and used Q-learning to explore the optimal strategy, that is, the maximum driving speed on the expressway. The effect of the method was verified through simulation, and the results show that the strategy learned by Q-learning can greatly reduce traffic congestion under high traffic demand. However, one weakness of Q-learning is that a Q-table over states and actions must be maintained, so the size of the state and action spaces that can be handled is limited by computer memory, which may also affect the calculation speed of the algorithm.
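The Q-table update described above can be illustrated with a minimal sketch of the standard tabular Q-learning rule. The function name, the state and action labels, and the values of the learning rate `alpha` and discount factor `gamma` are assumptions chosen for illustration; they are not taken from the cited studies.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    alpha (learning rate) and gamma (discount factor) are illustrative defaults.
    """
    # Greedy estimate of the value of the next state.
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the stored value a fraction alpha toward the temporal-difference target.
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Hypothetical two-action example; unseen table entries default to 0.0.
q = defaultdict(float)
actions = ("accelerate", "decelerate")
new_value = q_learning_update(q, "s0", "accelerate", 1.0, "s1", actions)
```

Because every visited state-action pair needs its own table entry, the table grows with the product of the state and action counts, which is the memory limitation noted above.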
To further develop reinforcement learning, in 2013 the Google DeepMind team combined deep learning with reinforcement learning and proposed a DRL structure named deep Q network (DQN). 22 In 2015, an improved DQN algorithm was published in Nature. 23 DQN has also been applied successfully to other practical problems. To solve the obstacle avoidance problem of underactuated unmanned ships under unknown environmental disturbances, Cheng and Zhang 24 designed an obstacle avoidance algorithm with a DQN structure and found that the reinforcement learning-based method avoids obstacles more accurately. The DQN method nevertheless has weaknesses in application, such as overestimation of action values. Van Hasselt et al. 25 therefore proposed an improved method (double DQN) that alleviates the overestimation problem. Wang et al. 26 introduced the concept of the advantage function into DQN and proposed the dueling DQN. Schaul et al. 27 designed a prioritized DQN that samples important experiences with priority. After these studies, the DQN family of methods has gradually been applied to practical problems. Based on actual driving data, Zhang et al. 28 applied the double DQN method to control vehicle speed; compared with the traditional DQN, the accuracy of the value estimates and the quality of the strategy are greatly improved. Zeng et al. 29 used real-time GPS data and replaced the neural network in DQN with a recurrent neural network; the resulting deep recurrent Q network (DRQN) was applied to the traffic light control problem at intersections, and the simulation shows that DRQN performs better than DQN.
Based on the above studies, it is clear that there is an excellent performance of VF-DRL method on discrete action space problems. However, many of the practical problems are the problems with continuous action and state space, which are high dimensional space problems. Therefore, VF-DRL has some limitations in solving more extensive practical problems.
In addition to the VF-DRL method, PG-DRL, VFPG-DRL, and MADRL methods also exist. These methods may outperform VF-DRL in some specific scenarios. The PG-DRL method can be used to solve continuous action space problems, and its convergence speed is relatively fast. However, it tends to converge to a local optimum, and its policy is difficult to evaluate. VFPG-DRL, a DRL method based on both value function and policy gradient, combines the advantages of the VF-DRL and PG-DRL methods. However, it has problems such as low sampling efficiency and overestimation by the evaluation network. Therefore, in practical applications, the algorithm needs to be adjusted and improved according to the actual problem. In the MADRL method, each agent obtains state and reward information through interactions among the agents and the environment, and adjusts its strategy based on the acquired information, thus learning the optimal strategy.

Problem formulation
Through the V2I environment, the roadside unit (RSU) obtains the information of the vehicles within the control range in real time and transmits it to the control center. Based on this vehicle information, the control center calculates the current driving strategy of each CAV to avoid collisions with other vehicles and improve traffic efficiency. In Figure 1, the yellow dots denote the collision points where vehicles may collide. In addition, to describe the state of vehicles within the control range more accurately, this article discretizes the road network of the unsignalized intersection at intervals of length M; Figure 1(b) shows the discretized road network.
In Figure 1, Agents 1, 2, 3, and 4 are used to obtain optimal driving strategies for vehicles in the straight and left-turn directions. For example, Agent 1 controls the vehicles going straight from Lane 2 to Lane 17 and Lane 18, and the vehicles turning left from Lane 1 to Lane 20. In Figure 1(a), left-turning and through vehicles are more likely to collide at the intersection, that is, they account for more collision points. To focus on this complex traffic environment, right turns are not considered in this article. However, possible collisions with right-turning vehicles from other directions can also be handled: as long as through and right-turning vehicles share the same lane and the proposed model is trained accordingly, the situation including right-turning vehicles is covered.
This article is based on the following basic assumptions:
1. Within the control range of the unsignalized intersection, vehicles are not allowed to overtake or change lanes.
2. Within the control range of the unsignalized intersection, a vehicle that receives control information fully obeys the control command of the controller.
3. Within the control range of the unsignalized intersection, the communication delay between vehicles and the roadside, and between vehicles, is acceptable, and there is no packet loss.

Problem description
State space. In this article, the position and speed of vehicles within the control range of the unsignalized intersection are considered in the state space, and the state is determined according to the discretized road network diagram in Figure 1. In equations (1) to (3), $c$ is the identifier of the direction controlled by the agent, which is used to distinguish the controlled vehicles from other CAVs and HVs. $S_t$ denotes the state of the agent at time $t$, which is a matrix composed of the positions and speeds of all vehicles in the control area; $L_t$ represents the position matrix of all vehicles in the control area at time $t$. $l^i_{c,t}$ represents the $i$th lattice of the direction controlled by the agent at time $t$, where $i \in I$ and $I$ represents the total number of lattices of the control direction. $l^i_{j,t}$ represents the $i$th lattice of the $j$th possible collision direction at time $t$. Comparing the two, $l^i_{c,t}$ refers to the direction currently controlled by the agent, while $l^i_{j,t}$ refers to a direction that conflicts with the direction controlled by the agent. $l^i_{c,t} \in \{-1, 0, 1\}$ with $c \in \{1, 2, 3, 4\}$, so $l^i_{1,t} \in \{-1, 0, 1\}$, $l^i_{2,t} \in \{-1, 0, 1\}$, $l^i_{3,t} \in \{-1, 0, 1\}$, and $l^i_{4,t} \in \{-1, 0, 1\}$. Here, 0 means that no vehicle occupies the lattice, -1 means that an uncontrolled vehicle occupies the lattice, and 1 means that a controlled vehicle occupies the lattice. $V_t$ represents the velocity matrix of all vehicles in the control area at time $t$. $v^i_{c,t}$ represents the velocity value of the $i$th lattice in the direction controlled by the agent at time $t$, and $v^i_{j,t}$ represents the velocity value of the $i$th lattice in the $j$th possible conflict direction at time $t$. The relationship between $v^i_{c,t}$ and $v^i_{j,t}$ is the same as that between $l^i_{c,t}$ and $l^i_{j,t}$, with $v^i_{c,t} \in \{0, v_{c,t}\}$ and $c \in \{1, 2, 3, 4\}$.
Action space. This article aims to reduce the number of vehicle collisions within the control range of the unsignalized intersection and to improve the traffic efficiency of vehicles by controlling their acceleration and deceleration behavior. According to equation (4), $a_t$ denotes the action at time $t$. There are two actions, acceleration and deceleration. Although these are two discrete actions, the resulting acceleration and deceleration are not fixed values: VISSIM itself models behavior similar to actual driving, including gradual acceleration and gradual deceleration.
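The lattice encoding of the state can be illustrated with a short sketch, assuming NumPy is available. It builds a position grid with entries in {-1, 0, 1} and a matching speed grid, one row per direction. The function name, the tuple layout of `vehicles`, and the choice to stack the two grids along a leading axis are assumptions made for illustration; the exact matrix layout of equations (1) to (3) is not reproduced here.

```python
import numpy as np


def encode_state(vehicles, num_directions, num_cells, cell_length):
    """Return an array of shape (2, num_directions, num_cells).

    vehicles: iterable of (direction, distance_m, speed, is_controlled) tuples,
    a hypothetical input format. Index 0 of the result is the position grid
    (1 = controlled vehicle, -1 = uncontrolled vehicle, 0 = empty lattice) and
    index 1 is the speed grid (0 where the lattice is empty).
    """
    position = np.zeros((num_directions, num_cells), dtype=np.float32)
    velocity = np.zeros((num_directions, num_cells), dtype=np.float32)
    for direction, distance, speed, is_controlled in vehicles:
        # Map the distance along the lane to a lattice index of length cell_length.
        cell = min(int(distance // cell_length), num_cells - 1)
        position[direction, cell] = 1.0 if is_controlled else -1.0
        velocity[direction, cell] = speed
    return np.stack([position, velocity])


# The two discrete actions of the action space.
ACTIONS = ("accelerate", "decelerate")

# Hypothetical example: one controlled vehicle and one uncontrolled vehicle.
state = encode_state(
    [(0, 12.0, 8.0, True), (1, 3.0, 5.0, False)],
    num_directions=2, num_cells=6, cell_length=5.0,
)
```

A grid of this kind has a fixed size regardless of how many vehicles are present, which makes it a convenient input for a convolutional network.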
Reward function. The objective of this article is to reduce the number of vehicle collisions in the intersection and improve the traffic efficiency of vehicles at the same time. At the unsignalized intersection, the first problem to be solved is vehicle collision. Therefore, the penalty for a vehicle collision is set as the maximum penalty, as shown in equation (5a), and a vehicle passing the intersection safely is given a large reward, as shown in equation (5b). In other cases, to ensure that the vehicle passes the intersection as soon as possible, a fixed reward is given when the vehicle drives at normal speed, while a penalty of the same magnitude is given when the vehicle speed is low, as shown in equations (5c) and (5d). In equation (5), $r_t$ is the immediate reward value of the control vehicle at time $t$, Col means that the control vehicle has collided, Pass means that the control vehicle has successfully passed the intersection, and $d$ is the predefined minimum speed value, which is set to avoid long waiting times. $v_t$ indicates the speed of the vehicle at time $t$, and $u_1$, $u_2$, and $c_1$ are parameters with values of 8, 40, and 0.2, respectively, chosen through several iterations during the experiments. The principle of the reward function is as follows. To reduce the number of collisions of control vehicles at unsignalized intersections, a large penalty is given if the control vehicle collides, and its magnitude is related to the collision speed. When the vehicle passes through the unsignalized intersection successfully, a large positive reward is given, which is also related to the speed of the vehicle passing the detection point.
In other cases, to keep vehicles from lingering within the control range of the unsignalized intersection, a small penalty is given when the speed of the control vehicle is below the threshold, and a small reward is given otherwise. With this reward function, the agent learns to avoid collisions and to pass through the intersection as quickly as possible.
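The four-case structure of the reward can be sketched as follows. This is only an outline under stated assumptions: the exact expressions of equation (5), including how the collision penalty and the passing reward depend on speed and how $u_1$, $u_2$, and $c_1$ enter, are not reproduced here, so the magnitudes are passed in as placeholder arguments and the numbers in the example are hypothetical.

```python
def immediate_reward(collided, passed, speed, min_speed,
                     collision_penalty, pass_reward, step_reward):
    """Return the immediate reward for one control step.

    collision_penalty, pass_reward, and step_reward are positive placeholder
    magnitudes supplied by the caller; the speed-dependent terms described in
    the text are deliberately left out of this sketch.
    """
    if collided:                 # case (5a): largest penalty
        return -collision_penalty
    if passed:                   # case (5b): large reward
        return pass_reward
    if speed < min_speed:        # slow driving: small penalty
        return -step_reward
    return step_reward           # normal driving: small reward of equal size


# Hypothetical magnitudes, chosen only to show the ordering of the cases.
example = immediate_reward(
    collided=False, passed=False, speed=0.5, min_speed=1.0,
    collision_penalty=10.0, pass_reward=5.0, step_reward=0.1,
)
```

Checking the collision case first reflects the priority stated in the text: safety outranks throughput whenever both conditions could apply in the same step.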

Improved double dueling deep Q network 3DQN
3DQN-CNN-LSTM
Figure 2 shows the structure of the improved double dueling deep Q network (3DQN-CNN-LSTM) model proposed in this article, which differs from the structure of the double dueling deep Q network (3DQN) in the work by Gong et al. 30 In this article, a CNN is used to extract important features, and an LSTM network is used to select the best strategy by combining memory information with current information. In addition, the positive and negative reward experience replay buffer method 26 and the multi-step reward and penalty method 15 are used to speed up convergence.
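The two ingredients that 3DQN combines can be illustrated in isolation, assuming NumPy is available. The sketch uses the standard formulations from the dueling DQN and double DQN literature: the dueling head adds the state value to mean-centered advantages, and the double DQN target lets the online network choose the next action while the target network evaluates it. The function names and the value of `gamma` are assumptions, and the CNN and LSTM feature extractors of Figure 2 are not modeled.

```python
import numpy as np


def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    advantages = np.asarray(advantages, dtype=np.float64)
    return state_value + advantages - advantages.mean()


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Compute the double DQN target for a single transition.

    next_q_online: Q-values of the next state from the online network.
    next_q_target: Q-values of the next state from the target network.
    """
    if done:
        return float(reward)
    # The online network selects the action, the target network evaluates it.
    best_action = int(np.argmax(next_q_online))
    return float(reward + gamma * next_q_target[best_action])


q_values = dueling_q_values(2.0, [1.0, -1.0])
target = double_dqn_target(1.0, False, [0.5, 2.0], [10.0, 4.0], gamma=0.5)
```

In the example, a plain DQN target would take the maximum of the target network's own values (10.0), whereas the double DQN target uses the value of the action preferred by the online network (4.0), which is how the overestimation is reduced.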

Multi-step reward and penalty
Because collisions and successful passages occur rarely, the agent collects little experience of either, which lowers the learning efficiency of the algorithm. Therefore, this article proposes a multi-step reward and penalty method based on the error correction method for failure experience proposed by Zhang et al. 28 The method mainly increases the amount of experience of collisions or possible collisions, and of vehicles passing successfully or being helped to pass. The main idea is that if the vehicle collides or successfully passes the intersection at time $t$, the immediate reward values at times $t-1$, $t-2$, ..., $t-d$ are adjusted, where $d$ is the number of steps back from $t$, $d \in \{1, ..., D\}$, and $D$ is the predefined maximum step. Specifically, if there is a collision at time $t$, a penalty is applied to the experiences in $[t-d, t]$; if the vehicle passes the intersection successfully at time $t$, an accumulated reward is applied to the experiences in $[t-d, t]$. The corresponding reward and penalty are calculated according to equation (6), in which the immediate reward at $t-d$ is $r_{t-d}$ and the discount factors are $v_1$ and $v_2$. However, in some cases the agent decelerates or brakes to prevent a possible collision, and this behavior should not be punished. Therefore, the algorithm checks whether the agent has decelerated or braked. If the agent took actions to avoid the collision, equation (6) is not used to calculate the reward value from time $t-d$ to time $t$. Otherwise, if the vehicle did not take any beneficial action to avoid the collision, equation (6) is used to calculate the reward value from time $t-d$ to time $t$.
According to equation (6), the reward value at time $t-d$, that is, $r_{t-d}$, is derived from the reward value at time $t$, that is, $r_t$. If the vehicle collides or successfully passes the intersection, $r_{t-d}$ is calculated by equation (6). Otherwise, $r_{t-d}$ is calculated directly by equation (5).
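The backward adjustment can be sketched as follows. Since equation (6) is not reproduced here, the update rule is an assumption: the terminal reward is added to each of the previous `max_steps` rewards, scaled by a geometric discount, and steps at which the agent decelerated or braked are exempt from a collision penalty. The single `discount` argument stands in for the two discount factors mentioned in the text, and all names and numbers are hypothetical.

```python
def propagate_terminal_reward(rewards, avoided, terminal_reward,
                              max_steps, discount):
    """Return a copy of rewards with the terminal reward propagated backward.

    rewards: immediate rewards of one episode; the last entry is the terminal
             step (collision or successful pass).
    avoided: booleans, True where the agent decelerated or braked.
    The additive, geometrically discounted form is an assumed stand-in for
    equation (6).
    """
    adjusted = list(rewards)
    last = len(adjusted) - 1
    for d in range(1, max_steps + 1):
        index = last - d
        if index < 0:
            break
        # Evasive actions before a collision are not punished.
        if terminal_reward < 0 and avoided[index]:
            continue
        adjusted[index] += (discount ** d) * terminal_reward
    return adjusted


# Hypothetical episode ending in a collision with reward -10 at the last step.
adjusted = propagate_terminal_reward(
    rewards=[0.2, 0.2, 0.2, -10.0],
    avoided=[False, True, False, False],
    terminal_reward=-10.0, max_steps=2, discount=0.5,
)
```

In the example, the step directly before the collision receives half of the penalty, while the step two back keeps its original reward because the agent braked there.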

Positive and negative sampling experience pool
At present, a large number of researchers have applied Deep Deterministic Policy Gradient (DDPG) to solve many problems with continuous action spaces. 31 However, some problems remain in the application of the DDPG algorithm. For example, when sampling from the experience replay buffer, historical experiences are selected at random. Therefore, it is difficult to balance the proportion of positive reward and negative reward experiences, which leads to poor stability of the algorithm.
In the original DDPG, 32 the experience replay buffer (ERB) mixes positive and negative experiences. To solve the problem of vehicle delay caused by stopping within the control range of the unsignalized intersection, this article divides the historical experience into positive reward experience and negative reward experience and stores the two kinds separately in the PNRERB. As in the original DDPG method, the sizes of the two buffers are initialized first, and once a buffer is full, old experiences are replaced with new ones.
In addition, mini-batch learning is used to train the DRL network in this article. The agent first interacts with the environment to collect a certain amount of historical experience, and experiences are then drawn from the buffers to train the neural network. To improve the sampling performance of the algorithm, the experience pool is divided into two pools: the positive reward experience pool, which holds experiences with a reward value greater than 0, and the negative reward experience pool, which holds experiences with a reward value less than or equal to 0. The sampling ratio between the positive and negative experience pools is set to 3:1.
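A minimal sketch of such a two-pool buffer is given below. The split at a reward of 0 and the 3:1 sampling ratio follow the description above; the class name, the first-in-first-out replacement via `deque`, and the fallback used when one pool holds too few experiences are assumptions made for illustration.

```python
import random
from collections import deque


class PositiveNegativeReplayBuffer:
    """Two-pool replay buffer that samples positive and negative experiences 3:1."""

    def __init__(self, capacity, positive_ratio=0.75, seed=None):
        # A deque with maxlen discards the oldest entry once a pool is full.
        self.positive = deque(maxlen=capacity)
        self.negative = deque(maxlen=capacity)
        self.positive_ratio = positive_ratio
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        # Reward > 0 goes to the positive pool, reward <= 0 to the negative pool.
        pool = self.positive if reward > 0 else self.negative
        pool.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Take 3/4 of the batch from the positive pool when enough is stored,
        # and fill the remainder from the negative pool (assumed fallback).
        n_pos = min(round(batch_size * self.positive_ratio), len(self.positive))
        n_neg = min(batch_size - n_pos, len(self.negative))
        batch = (self.rng.sample(list(self.positive), n_pos)
                 + self.rng.sample(list(self.negative), n_neg))
        self.rng.shuffle(batch)
        return batch


# Hypothetical usage with ten positive and ten negative experiences.
buffer = PositiveNegativeReplayBuffer(capacity=100, seed=0)
for i in range(10):
    buffer.add(i, 0, 1.0, i + 1, False)
    buffer.add(i, 1, -1.0, i + 1, False)
batch = buffer.sample(8)
```

Fixing the ratio per batch keeps rare positive experiences from being drowned out by the much more frequent negative ones under uniform sampling.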

Central control method and multi-agent control method based on DRL
At present, intersection control can be divided into central control and distributed control. 5 Based on the idea of these two control methods, this article proposes a central control method and a multi-agent control method for unsignalized intersection based on improved DRL. In these methods, the implementation of multi-agent control method is based on the distributed control of multi-agent reinforcement learning method. The structure diagram and pseudo code of these two control methods are introduced in the following section.
Central control method. The central control method sets up a single agent in the central control system, and this agent controls all the CAVs in the intersection control area, as shown in Figure 3. Table 1 gives the pseudo code of the central control method.
Multi-agent control method. The multi-agent control method sets up multiple control agents in the central control system, each of which controls all the CAVs in its own control area of the intersection.
In this article, the CAVs at the four entrances (east, south, west, and north) are regarded as the control vehicles of the four agents. Each agent controls the whole process from a CAV entering the control area to the vehicle leaving the control area. Figure 4 shows the interaction structure of multi-agent control with the environment. From Figure 4, it can be seen that multiple agents interact with the VISSIM environment separately. Taking the four agents in this article as an example, each agent obtains the state information of its own control direction within the control range of the intersection from the VISSIM environment, and combines it with the states of the other three directions to form the joint state. Each agent selects the action for the CAVs in its current control direction according to the joint state. Finally, by sending its action to the VISSIM environment, each agent receives the immediate reward for the current state-action pair and the state at the next moment. Table 2 shows the pseudo code for the multi-agent control method.
The main difference between Algorithms 1 and 2 lies in Lines 14 to 16 of the pseudo code. It can be seen from Tables 1 and 2 that in Line 14 of the central control method, the central control agent selects the corresponding action according to the state at time t. In Line 14 of the multi-agent control method, N agents select their corresponding actions according to the state at time t. In the multi-agent control method, multiple agents work in a distributed way. For agent i, if there is a connected vehicle in its control area, agent i is activated and selects the action at time t according to the state at time t. Otherwise, the input of agent i is empty.
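The action-selection step that distinguishes the two methods can be sketched as follows. The function names, the dictionary-based bookkeeping, and the use of `None` for an inactive agent are assumptions for illustration; the pseudo code in Tables 1 and 2 is not reproduced here.

```python
def select_action_central(policy, joint_state):
    """Central control: one agent maps the joint state to the action."""
    return policy(joint_state)


def select_actions_multi_agent(policies, joint_state, has_connected_vehicle):
    """Multi-agent control: each active agent selects its own action.

    policies: dict mapping an agent id to a callable policy (hypothetical).
    has_connected_vehicle: dict mapping an agent id to True when a connected
    vehicle is present in that agent's control area.
    An agent without a connected vehicle is not activated and yields None.
    """
    actions = {}
    for agent_id, policy in policies.items():
        if has_connected_vehicle.get(agent_id, False):
            actions[agent_id] = policy(joint_state)
        else:
            actions[agent_id] = None
    return actions


# Hypothetical fixed policies standing in for trained networks.
policies = {1: lambda state: "accelerate", 2: lambda state: "decelerate"}
actions = select_actions_multi_agent(
    policies, joint_state=None, has_connected_vehicle={1: True, 2: False},
)
```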

Experiment
The optimization effect of the different DRL methods is verified through a simulation experiment of vehicle control at an unsignalized intersection, which is presented in this section. First, the simulation platform and parameters of the unsignalized intersection are described. Second, the control methods based on DRL are discussed; specifically, the structure and parameter settings of each method are explained. Third, the experimental scheme in this article is outlined. Finally, the results of the DRL control methods are analyzed.

Simulation platform and parameter setting
A virtual road network is built in the experiment under the VISSIM simulation environment. The road network structure is shown in Figure 5. The cycle length at the signal-controlled intersection is 80 s, the green time of east-west and north-south straight line is 37 s, the yellow time is 3 s, the green time of east-west and north-south straight line is 22 s and the green time of left-turning is 8 s, and the yellow time is 3 s.
The differences between AV and HV are explained from the following two aspects: (1) In terms of definition, AV refers to the intelligent connected vehicle and HV refers to the traditional human-driven vehicle. AV simulates the behavior of an actual intelligent vehicle, and HV simulates the behavior of an actual human-driven vehicle; (2) In terms of driving strategy, the optimal driving strategy of AVs is obtained from the DRL algorithm proposed in this article, while the driving strategy of HVs is obtained from the built-in model of the VISSIM simulation system. HVs follow the driving strategy of the built-in VISSIM model and are not controlled by the DRL agent, which is used only to control the driving behavior of AVs.
The simulation interval in VISSIM is 1 s, that is, 1 frame per simulation step. This article adopts an interactive simulation between VISSIM and Python, with the DRL algorithm implemented in the Python environment. At time t, the DRL algorithm obtains the simulation data from VISSIM, calculates the optimal driving strategy, and then provides this strategy to VISSIM. The connected vehicle (CV) in VISSIM receives the driving strategy and the simulation continues at the next time step.
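The observe, decide, apply, and step cycle described above can be sketched as follows. The class `StubSimulator` is a hypothetical stand-in for the simulator side and does not reproduce the VISSIM interface; the single-vehicle speed state and the acceleration action are likewise assumptions made only to keep the sketch runnable.

```python
from typing import Callable, Dict, List, Tuple


class StubSimulator:
    """Hypothetical stand-in for the simulator; not the VISSIM interface."""

    def __init__(self, horizon: int) -> None:
        self.t = 0              # current simulation time in seconds
        self.horizon = horizon  # maximum simulation time in seconds
        self.speed = 10.0       # speed of one controlled vehicle (assumed, m/s)

    def observe(self) -> Dict[str, float]:
        return {"time": float(self.t), "speed": self.speed}

    def apply(self, acceleration: float) -> None:
        # Apply the commanded acceleration for one 1 s step; speed cannot be negative.
        self.speed = max(0.0, self.speed + acceleration)

    def step(self) -> None:
        self.t += 1  # one simulation step corresponds to 1 s

    def done(self) -> bool:
        return self.t >= self.horizon


def run_episode(
    sim: StubSimulator, policy: Callable[[Dict[str, float]], float]
) -> List[Tuple[float, float, float]]:
    """Alternate between reading the simulator state and sending back an action."""
    log: List[Tuple[float, float, float]] = []
    while not sim.done():
        obs = sim.observe()   # data obtained from the simulator at time t
        action = policy(obs)  # driving strategy computed on the Python side
        sim.apply(action)     # strategy provided back to the simulator
        sim.step()            # simulation continues at the next time step
        log.append((obs["time"], action, sim.speed))
    return log
```

A trained Q-network would take the place of `policy`; here any callable that maps an observation to an action can be used to exercise the loop.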
The results of the two classes of methods are compared in this section. One class is based on DQN and contains the Deep Q Network method (DQN-NN), 26 the Deep Q Network method based on CNN (DQN-CNN), 27 the Double Deep Q Network method based on CNN (Double-DQN-CNN), and the Dueling Deep Q Network method based on CNN (Dueling-DQN-CNN). The other class is based on 3DQN and contains the 3DQN-CNN, multi-3DQN-CNN, 3DQN-CNN-LSTM, and multi-3DQN-CNN-LSTM methods.

Table 1 (excerpt). Pseudo code for the central control method.
6: if no CAV exists in the current road network
7: if the current time reaches the maximum cycle time T, then end if
8: if the entrance detector detects the entry of a vehicle, then add the vehicle number
9: if the exit detector detects a vehicle leaving, then remove the vehicle number
10: if a CAV exists in the road network, then go to Line 11; else run the VISSIM simulation for a single step, then go to Line 5
11: Store the current state $s_t$ of each CAV
12: while True
13: $a_t = \begin{cases} a_t \text{ (chosen randomly)}, & p_{random} < \sigma \\ \arg\max_a Q^{*}(s_t, a; \theta), & p_{random} > \sigma \end{cases}$
14: Execute $a_t$, calculate $r_t$ and $s_{t+1}$ based on equation (5)
15: if the historical experience of the controlled vehicles is $d$
16: if a vehicle collision happens or the vehicle passes the intersection at time step $t$, calculate $r_{t-d}$ according to equation (5), store $r_{t-d}$ in $R^{+}$ and $R^{-}$
17: else store $r_{t-d}$ in the experience pool
18: Select a batch of historical memory from the memory pool randomly
19: Set $y_j = \begin{cases} r_j, & \text{end of episode} \\ r_j + \gamma Q'(s', \arg\max_a Q(s', a)), & \text{not the end of episode} \end{cases}$
20: Set $L = \left(r + \gamma Q'(s', \arg\max_a Q(s', a)) - Q(s, a)\right)^2$ and update the gradient $\nabla L = E_{s, a, r, s'}\left[\left(r + \gamma Q'(s', \arg\max_a Q(s', a)) - Q(s, a)\right) \nabla_{\theta} Q(s, a)\right]$

In order to verify the optimization effect of the proposed method on the vehicle performance indices at an actual intersection, the mixed driving of straight and left-turning vehicles at the unsignalized intersection is considered in this article. The experimental scheme is given in Table 3.
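Lines 19 and 20 of the pseudo code for the central control method set a Double DQN target and a squared error loss. The following minimal sketch shows this computation for a single transition, assuming that the Q-values of the next state from the online network and from the target network are already available as plain lists; the function names and arguments are hypothetical.

```python
from typing import Sequence


def double_dqn_target(
    reward: float,
    next_q_online: Sequence[float],
    next_q_target: Sequence[float],
    gamma: float,
    done: bool,
) -> float:
    """Return y = r at the end of an episode, else r + gamma * Q'(s', argmax_a Q(s', a))."""
    if done:
        return reward
    # The online network Q chooses the action; the target network Q' evaluates it.
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


def squared_td_error(q_sa: float, target: float) -> float:
    """Loss L = (y - Q(s, a))^2 for one transition."""
    return (target - q_sa) ** 2
```

Decoupling action selection (online network) from action evaluation (target network) is the defining step of Double DQN; with a single network both roles would use the same Q-values.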
In the scheme, there are straight vehicles and left turning vehicles driving in the unsignalized intersection. Table 3 shows the distribution of traffic flow at each entrance lane of the scheme. When the vehicles in the unsignalized intersection are straight vehicles and left turning vehicles, the total flow is 800 vehicles/h, 1680 vehicles/h, and 2560 vehicles/h, respectively. The first column in Table 3 is the total flow. The second to ninth columns show the flow of each inlet. According to Figure 1, Inlets 1, 4, 7, and 10 are left turn lanes, and Inlets 2, 5, 8, and 11 are straight lanes.
According to Figure 1, Inlets 3, 6, 9, and 12 are lanes reserved for right-turning vehicles. Meanwhile, Lanes 13 to 20 are outlets, and these lanes are controlled by the agents in the corresponding directions. In the simulated environment of the intersection, there are three inlets but only two outlets in each direction.

Table 2 (excerpt). Pseudo code for the multi-agent control method.
6: if no CAV exists in the current road network
7: if the current time reaches the maximum cycle time T, then end if
8: if the entrance detector detects the entry of a vehicle, then add the vehicle number
9: if the exit detector detects a vehicle leaving, then remove the vehicle number
10: if a CAV exists in the network, go to Line 11; else run the VISSIM simulation for a single step, and go to Line 5
11: if a CAV exists in the network
12: Obtain the current state $s_t^i$ of each CAV
13: while True
14: for agent = 1, N, do
15: Obtain the state $s_t^i$ of the $i$th agent, $i \in [1, N]$
16: if the random probability is less than $\sigma$, choose action $a_t^i$ randomly; else select the action $a_t^i = \arg\max_a Q^{*}(s_t^i, a; \theta_t^i)$
17: Execute the actions $[a_t^1 \ldots a_t^i]$, obtain the rewards $[r_t^1 \ldots r_t^i]$ and the states $[s_{t+1}^1 \ldots s_{t+1}^i]$
18: if the historical experience of the control agent is $d$ and the CAV collides or passes the intersection safely at time $t$
19: Calculate the reward value based on equation (6); store the positive reward and the negative reward into $R_i^{+}$ and $R_i^{-}$
20: else store the reward $r_{t-d}$ of the $t-d$ experience into the experience pool
21: Select historical experience randomly from the memory space
22: for agent = 1, N, do (with separate cases for the end of episode and not the end of episode)
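The delayed reward assignment and the separate positive and negative replay buffers in the pseudo code can be sketched as follows. The reward formulas of equations (5) and (6) are not reproduced here, so the sketch simply adds the outcome reward to the stored step reward, which is an assumption; the class name `SplitReplay`, the buffer capacity, and uniform sampling over all three stores are also assumptions.

```python
import random
from collections import deque
from typing import Deque, List, Optional, Tuple

Transition = Tuple[str, int, float, str]  # (state, action, reward, next_state)


class SplitReplay:
    """Holds experience for d steps, then files it by the outcome observed at time t."""

    def __init__(self, d: int, capacity: int = 10000) -> None:
        self.d = d
        self.history: Deque[list] = deque()                      # awaiting delayed reward
        self.positive: Deque[Transition] = deque(maxlen=capacity)  # R+
        self.negative: Deque[Transition] = deque(maxlen=capacity)  # R-
        self.pool: Deque[Transition] = deque(maxlen=capacity)      # ordinary experience pool

    def add(self, state: str, action: int, reward: float, next_state: str,
            outcome_reward: Optional[float] = None) -> None:
        """outcome_reward is None for an ordinary step, negative for a collision,
        and positive for a safe passage (placeholder values, not equation (5))."""
        self.history.append([state, action, reward, next_state])
        if len(self.history) <= self.d:
            return  # fewer than d steps of history so far
        old = self.history.popleft()  # experience from time step t - d
        if outcome_reward is None:
            self.pool.append(tuple(old))
            return
        old[2] += outcome_reward  # assumed form of the delayed reward r_{t-d}
        target = self.positive if outcome_reward > 0 else self.negative
        target.append(tuple(old))

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        """Draw a random batch from all three stores."""
        merged = list(self.positive) + list(self.negative) + list(self.pool)
        return rng.sample(merged, min(batch_size, len(merged)))
```

Keeping rare collision and passage outcomes in their own buffers means they are not overwritten by the much more frequent ordinary steps, which is one way to address the sparse reward problem that the article mentions.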

Experimental result
In this article, the success rate of vehicles passing through the intersection 12 is taken as the evaluation index to test the effect of the DRL-based methods. The calculation formula of the vehicle success rate is given in equation (7), where R denotes the success rate, CV denotes the number of vehicle collisions within a certain simulation time, and TF denotes the total traffic issued within that time.
In the calculation formula of the average travel time (ATT), $B_{Average\_Travel\_Time}$ denotes the ATT gain, $ATT_s$ denotes the ATT of the vehicle under the signal control method, and $ATT_{RL}$ denotes the ATT of the vehicle under the DRL method. The two key evaluation metrics in this article are the average vehicle travel time and the total throughput of vehicles. Equation (8) shows the correlation between these two metrics. Equation (9) represents the ATT of all the vehicles, from entering the control area to leaving the control area of the intersection. Here, ATT represents the average travel time, $TT_k$ represents the travel time of the $k$th vehicle, and $n$ indicates the total throughput of the intersection control area during the total simulation time.

Table 3. Traffic flow at each entrance lane (vehicles/h).
Total flow    Flow of the eight entrance lanes (EL), in column order
800           100  100  100  100  100  100  100  100
1680          120  300  120  300  120  300  120  300
2560          140  500  140  500  140  500  140  500
EL: entrance lane.

In the following section, the success rate, ATT, and vehicle trajectory results of vehicles passing through the intersection under this scheme are analyzed in turn.
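As an illustration of the three evaluation quantities, the following minimal sketch computes them from the symbols defined above. The exact forms of equations (7) to (9) are not reproduced in this section, so the formulas used here are assumptions consistent with the definitions: R = (TF - CV) / TF, ATT as the mean of the travel times $TT_k$ over the throughput $n$, and the ATT gain as (ATT_s - ATT_RL) / ATT_s. The function names are hypothetical.

```python
from typing import Sequence


def success_rate(collisions: int, total_traffic: int) -> float:
    """Assumed form of the success rate: share of issued vehicles without a collision, in percent."""
    if total_traffic <= 0:
        raise ValueError("total_traffic must be positive")
    return 100.0 * (total_traffic - collisions) / total_traffic


def average_travel_time(travel_times: Sequence[float]) -> float:
    """Mean travel time over the n vehicles that left the control area."""
    if not travel_times:
        raise ValueError("at least one travel time is required")
    return sum(travel_times) / len(travel_times)


def att_gain(att_signal: float, att_rl: float) -> float:
    """Assumed form of the ATT gain: relative reduction versus signal control, in percent."""
    if att_signal <= 0:
        raise ValueError("att_signal must be positive")
    return 100.0 * (att_signal - att_rl) / att_signal
```

Under this assumed gain formula, an ATT under signal control that is three times the ATT under the DRL method corresponds to a gain of about 67%.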
The success rate. Figure 6 shows the success rate of vehicles passing through the intersection under the DQN-based and 3DQN-based methods when the total flow of the intersection is 800 vehicles/h. It can be seen from Figure 6 that, under the five penetration rates (20%, 40%, 60%, 80%, and 100%), the success rate of the DQN-based and 3DQN-based methods is greater than that at a penetration rate of 0%, and that at a penetration rate of 100% the success rate of the 3DQN-based methods is higher than that of the DQN-based methods. This is also shown at the intersection of left turn and straight traffic. In addition, among the 3DQN-based methods, under the same penetration rate, the success rates of the 3DQN-CNN-LSTM and 3DQN-CNN methods are higher than those of the multi-3DQN-CNN-LSTM and multi-3DQN-CNN methods, respectively, indicating that the optimization effect of the central control method is better than that of the multi-agent control method.
The percentage of success under the two classes of methods when the traffic flow is 800 vehicles/h, 1680 vehicles/h, and 2560 vehicles/h is shown in Table 4. The values highlighted in gray in Table 4 indicate the optimal values under the same penetration rate, and MPR is the market penetration rate. According to Table 4, under the three traffic flows, the percentage of success is more than 70% for both the DQN-based and 3DQN-based methods. Under the same traffic flow, the percentage of success of both classes rises gradually with the increase of penetration rate, whereas under the same penetration rate it decreases with the increase of traffic flow. Therefore, under all three traffic flows, the percentage of success can be optimized effectively by the DQN-based and 3DQN-based methods: the higher the penetration rate, the better the optimization effect, and the greater the traffic flow, the worse the optimization effect. Within the same traffic flow and penetration rate, the percentage of success of the 3DQN-based methods is higher than that of the DQN-based methods. Based on the DQN method, the percentage of success of DQN-NN method, DQN-CNN method, Double-DQN-CNN [...] by 3DQN-CNN-LSTM method). In addition, under the same penetration rate and traffic flow, among the 3DQN-based methods, the percentages of success of the 3DQN-CNN-LSTM and 3DQN-CNN methods are higher than those of the multi-3DQN-CNN-LSTM and multi-3DQN-CNN methods, respectively. In summary, the percentage of success of 3DQN-CNN-LSTM is the highest, which means that the central control method performs better than the multi-agent control method in optimizing the percentage of success. Therefore, the 3DQN-CNN-LSTM method proposed in this article is the best for optimizing the percentage of success.
Table 4 (excerpt). Percentage of success (%) by MPR for methods M1 to M8.
MPR     M1   M2   M3   M4   M5   M6   M7   M8
20%     79   80   80   80   82   81   83   82
40%     81   83   82   83   84   83   87   85
60%     82   83   84   84   86   85   91   89
80%     82   85   87   88   91   89   96   91
100%    84   88   89   89   93   93   99

ATT. Table 5 shows the ATT of vehicles and the number of vehicles successfully passing through the intersection obtained by the DQN-based and 3DQN-based methods when the total flow is 2560 vehicles/h. The values in bold font in Table 5 represent the optimal values under the same penetration rate. It can be seen from Table 5 that the ATT obtained by the DQN-based methods is less than that obtained by the 3DQN-based methods, but the number of vehicles successfully passing through the intersection obtained by the 3DQN-based methods is greater than that obtained by the DQN-based methods. The signal control method has the highest total throughput at 0% penetration. However, its ATT is about 3 times that of the other methods. It can be considered that the signal control method has the highest safety, but its combined efficiency is not as good as that of the reinforcement learning methods. When the penetration rate is less than or equal to 100%, the number of vehicles successfully passing through the intersection obtained by the DQN-NN method increases very little, perhaps because the DQN-NN method cannot correctly judge the driving strategies of other vehicles, resulting in wrong driving decisions. When the penetration rate is 100%, the number of vehicles successfully passing through the intersection under the 3DQN-CNN-LSTM method is more than that of the signal control method. Similarly, the ATT is shorter than that of the signal control method, and the ATT gain can reach 69%. At the same time, under the same penetration rate, considering both the ATT and the number of vehicles successfully passing through the intersection, the effect of the 3DQN-CNN-LSTM method is better than that of the DQN method, 3DQN-CNN method, multi-3DQN-CNN method, and multi-3DQN-CNN-LSTM method.
This shows that even when the traffic flow is large, the 3DQN-CNN-LSTM method can still ensure the safe and rapid passage of vehicles. It also shows that the optimization effect of the central control method is better than that of the multi-agent control method. Figure 7 shows the ATT of vehicles obtained by the DQN-based and 3DQN-based methods with a total flow of 2560 vehicles/h. It can be seen from Figure 7 that (1) the ATT under the signal control method is nearly 3 times that of the DQN-based and 3DQN-based methods when the penetration rate is larger than 0%; (2) the ATT under the 3DQN-CNN and multi-3DQN-CNN methods is higher than that of the DQN method, 3DQN-CNN-LSTM method, and multi-3DQN-CNN-LSTM method.
To further verify the experimental results, different permeability environments with traffic flow of 800 vehicles/h and 1680 vehicles/h are also investigated in this article, and the experimental results are shown in Tables 6 and 7, respectively. According to Table 6, the total throughput under the 3DQN-CNN-LSTM method gradually increases with the increase of penetration rate, but is always lower than the total throughput under signal control method. According to Table 7, the total throughput of the 3DQN-CNN-LSTM method is higher than the total throughput under signal control at a penetration rate of 100%. In addition, the ATT required for the model with DRL method is consistently smaller than that of the signal-controlled approach. Therefore, it can be concluded that the 3DQN-CNN-LSTM method proposed in this article will outperform the signal control method under each evaluation indicator in the case of high penetration rate and high throughput.
To sum up, the DQN-based and 3DQN-based methods can effectively reduce the number of vehicle collisions at intersections under different penetration rates and improve the success rate of vehicles passing through intersections. At the same time, compared with the signal control method, the ATT of the DQN-based and 3DQN-based methods is shorter. Specifically:
(1) Under different traffic flows, with the increase of the market penetration of CAV, the success rate of vehicles passing through the intersection under the DQN-based and 3DQN-based methods increases gradually. However, when the total flow at the intersection is slightly larger (greater than 1200 vehicles/h), the increase under the 3DQN-based methods is higher than that under the DQN-based methods. In addition, among the 3DQN-based methods, the 3DQN-CNN-LSTM method performs better under any intersection flow. When the penetration rate is 100%, the success rate of the 3DQN-CNN-LSTM method can reach 99%.
(2) Compared with the signal control method, the ATT of the DQN-based and 3DQN-based methods is shorter. When the penetration rate is less than or equal to 80%, the number of vehicles successfully passing through the intersection under the DQN-based and 3DQN-based methods is less than that of the signal control method. When the penetration rate is 100%, the number of vehicles successfully passing through the intersection under the 3DQN-based methods is close to that of the signal control method, and the number of vehicles under the 3DQN-CNN-LSTM method is higher than that of the signal control method under some total flows (such as 2560 vehicles/h).
(3) In terms of both the convergence of the algorithm and the reduction of the number of vehicle collisions, the optimization effects of the 3DQN-CNN and 3DQN-CNN-LSTM methods are better than those of the multi-3DQN-CNN and multi-3DQN-CNN-LSTM methods, respectively, which shows that the central control method proposed in this article is better than the multi-agent control method.
In summary, the number of collisions and the travel time of vehicles can be effectively reduced by the DQN-based and 3DQN-based methods. The evaluation indicators in this article are mainly the ATT and the total throughput, and the strengths and weaknesses of the model are analyzed mainly on the basis of these two indicators. The signal control method has the highest total throughput when the penetration rate is 0%. However, its ATT is about 3 times that of the other methods. Therefore, we believe that although the signal control method has the highest safety, its overall efficiency is inferior to that of the 3DQN-CNN-LSTM method proposed in this article.
The goal of this article is to optimize the traffic efficiency on the premise of avoiding vehicle collisions at the intersection. For this reason, a greater penalty is given if a vehicle collides. To ensure that there is no collision, a vehicle may reduce its speed, thereby reducing the traffic efficiency. When the penetration rate is lower than 100%, an HV may not deliberately reduce its speed on the simulated road to avoid other vehicles, whereas a CAV may accelerate or decelerate to reduce collisions. As a result, some traffic efficiency may be lost in order to ensure safety. However, when the penetration rate is set to 100%, all vehicles are connected vehicles and all of them take collision-avoiding actions, so the traffic efficiency is further guaranteed in a relatively safe environment.
The success rate and ATT under complex traffic environment. The right turn movement is not considered in the above discussion because this article focuses on unsignalized intersection control with the objective of reducing vehicle collisions and delay as much as possible under the mixed traffic of HV and AV. In the two-way four-lane unsignalized intersection environment tested in this article, the setting of collision points in the road network diagram of the unsignalized intersection is shown in Figure 1. It can be seen from Figure 1 that collisions between motor vehicles mainly occur between the left turn and through traffic flows, and the right turn traffic flow has little impact on the traffic flow. In addition, this article does not consider mixed traffic flow in the presence of pedestrians and non-motor vehicles. Therefore, right turning vehicles may be ignored. However, if mixed traffic flow with both pedestrians and non-motor vehicles is considered, the right turn traffic flow must be taken into account. In this part, the success rate of vehicles passing through the intersection and the ATT of vehicles under different traffic flows are also analyzed and discussed, comprehensively considering the through, left turn, and right turn traffic flows. Figure 8 shows the success rate of vehicles passing through the intersection under the comprehensive consideration of straight, left turn, and right turn traffic flows. In Figure 8 [...] the best results. When the penetration rate is 100%, the success rate can reach 99%, while the 3DQN-based methods, such as 3DQN-CNN and multi-3DQN-CNN-LSTM, are better than DQN-NN, DQN-CNN, double-DQN-CNN, and dueling-DQN-CNN, that is, they have a higher success rate. Figure 9 shows the ATT of vehicles under the comprehensive consideration of straight, left turn, and right turn traffic flows.
Table 8 shows the ATT of vehicles and the number of vehicles successfully passing through the intersection under the comprehensive consideration of straight, left turn, and right turn traffic flows. In Figure 9 and Table 8, the total flow is 2000 vehicles/h. Different methods, namely DQN-NN, DQN-CNN, double-DQN-CNN, dueling-DQN-CNN, 3DQN-CNN, multi-3DQN-CNN, 3DQN-CNN-LSTM, and multi-3DQN-CNN-LSTM, are tested. It can be seen from Figure 9 and Table 8 that the ATT obtained by 3DQN-CNN and multi-3DQN-CNN is longer, indicating that although 3DQN-CNN and multi-3DQN-CNN can increase vehicle safety, they also increase vehicle travel time. Compared with the 3DQN-CNN-LSTM and multi-3DQN-CNN-LSTM methods, the 3DQN-CNN and multi-3DQN-CNN methods not only have a longer ATT but also have a relatively small number of vehicles successfully passing through the intersection.
Vehicle trajectory. The above analysis shows that the 3DQN-CNN-LSTM method performs best, so it is adopted as the control method and its vehicle trajectories are analyzed in this section. When the penetration rate is 0%, the trajectory is the vehicle trajectory obtained under the signal control method. The CAV trajectory is represented by a red line and the HV trajectory is represented by a blue line. The trajectories in this section are those of vehicles passing through the intersection. The spatiotemporal trajectory of vehicles in Lane 5 at a total traffic flow of 1680 vehicles/h is shown in Figure 10. According to Figure 10(b)-(f), the number of vehicle trajectories increases with the increase of penetration rate. Compared with Figure 10(a), the vehicle trajectories in Figure 10(b)-(f) are smoother, which means that the 3DQN-CNN-LSTM method optimizes the vehicle trajectory well at the unsignalized intersection with through and left-turning vehicles.
As can be seen from Figure 10, compared with the signal control method, the vehicle trajectories optimized by the 3DQN-CNN-LSTM method show less stopping and waiting, and most of them are smoother. Therefore, the 3DQN-CNN-LSTM method can effectively alleviate vehicle queuing. Under the same traffic flow, the number of vehicle trajectories gradually increases with the increase of the penetration of CAV, indicating that the number of vehicle collisions decreases as the penetration of CAV increases.

Conclusion
In order to reduce vehicle collisions and improve traffic efficiency, the unsignalized intersection is studied and methods based on DQN and 3DQN are designed to solve this problem. The results show that the percentage of success of the DQN-based and 3DQN-based methods increases with the penetration rate of CAV under the same traffic flow. In addition, under the same traffic flow and penetration rate, the optimization effect of the 3DQN-based methods is better than that of the DQN-based methods, and the success rate is up to 99%. At the same time, compared with the signal control method, the ATT of vehicles passing through the intersection under the DQN-based and 3DQN-based methods is greatly reduced, with an ATT gain in the range of 18%-72%. Furthermore, the 3DQN-CNN and 3DQN-CNN-LSTM methods perform better than the multi-3DQN-CNN and multi-3DQN-CNN-LSTM methods in terms of vehicle collisions, which shows that the optimization effect of the central control method is better than that of the multi-agent control method.
An improved DRL method for vehicle control at urban road intersections has been proposed in this article. Although the proposed methods achieve some results for vehicle control problems, there is still room for improvement. For example, the only vehicle type studied in this article is the car, and the coexistence of other vehicle types has not been considered. In the future, more vehicle types and effective multi-agent deep reinforcement learning (MADRL) methods will be considered.