A simulation work for generating a novel dataset to detect distributed denial of service attacks on Vehicular Ad hoc NETwork systems

Vehicular Ad hoc NETwork is a promising technology providing important facilities for modern transportation systems. It has garnered much interest from researchers studying the mitigation of attacks, including distributed denial of service attacks. Machine learning techniques, whose performance relies mainly on the quality of the datasets used, play a role in detecting many attacks with a high level of accuracy. We conducted a comprehensive literature review and found many limitations in the datasets available for distributed denial of service attacks on Vehicular Ad hoc NETwork, including the following: unavailability of online versions, absence of distributed denial of service traffic, lack of representativeness of Vehicular Ad hoc NETwork, and no information regarding the network configurations. Therefore, in this article, we proposed a novel simulation technique to generate a valid dataset called the Vehicular Ad hoc NETwork distributed denial of service dataset, which is dedicated to Vehicular Ad hoc NETworks. The dataset holds information on distributed denial of service attack traffic considering Vehicular Ad hoc NETwork architecture, traffic density, attack intensity, and node mobility. Well-known simulation tools such as SUMO, OMNeT++, Veins, and INET were used to ensure that all the properties of Vehicular Ad hoc NETwork were captured. We then compared the dataset with those of several studies to demonstrate its novelty and evaluated it using several machine learning models. The studied models achieved accuracy above 99.5% on this dataset, except for the support-vector machine, which achieved 97.3%.


Introduction
Globally, car accidents represent a high proportion of road traffic deaths, as shown by a 2018 World Health Organization 1 report on road safety, which reported a road traffic death rate of 18.2 per 100,000 population worldwide and about 26.6 per 100,000 population in some regions. Thus, determining ways to use technology to improve traffic safety is driven by the pressing need to save people's lives. In this context, several studies have proposed employing new technologies to reduce the traffic accident rate. Among these technologies are Vehicular Ad hoc NETworks (VANETs), a very popular and important technology due to its link to and impact on the safety of people and communities. The system is a combination of several wireless and sensor technologies including intelligent transport systems (ITS), mobile ad hoc networks (MANETs), and Internet of Things (IoT) applications. 2 A VANET is comprised of nodes that use wireless networking technologies as a means of communication. The vehicle nodes communicate with each other and with the road side units (RSUs) through a communication unit in each vehicle called the on-board unit (OBU), which in turn is connected to the application unit (AU) to provide an application interface. 3 VANET supports both safety and non-safety applications. The main goal of its safety applications is to minimize accidents and improve driving safety by alerting drivers regarding collision avoidance, road sign notifications, and alarms for incident management. By contrast, non-safety applications are divided into two subsections: traffic coordination and infotainment applications. Traffic coordination leverages vehicular communications to broadcast traffic information between vehicles on the road; this optimizes traffic flow and improves driver experience. 
Infotainment applications aim to provide drivers with contextual information such as pertinent advertisements and parking assistance, in addition to entertainment during their journey. 4 However, safety and non-safety applications are not completely separated from each other; hence, all aspects should be considered when designing VANET applications. 5 In VANET, routing is a challenging factor owing to the unique characteristics of the network, especially the rapid mobility of nodes that causes quick variations in topology. In addition, the varying density and speed of the vehicles on the road may lead to either overhead or poor connectivity due to sparse distribution. The current VANET routing protocols are categorized into topology-, position-, cluster-, broadcast-, and geocast-based protocols. 6 The unique characteristics of VANET expose it to many threats that may compromise and corrupt the whole VANET system. 7 Threats may originate owing to vulnerabilities such as those related to communication protocols, 8 energy flow and authentication, 9 integration, 10 and information privacy and integrity. 11 Among these attacks is the distributed denial of service (DDoS) attack, which aims to deny network availability by flooding either the vehicle, the infrastructure, or both, with spurious messages. 12 There are two possible scenarios when targeting VANET with DDoS attacks: from vehicle to vehicle and from vehicle to infrastructure. 13 The first scenario occurs when a number of attacker vehicles target a vehicle by flooding it with fake messages, so that the victim vehicle is unable to send or receive legitimate requests. The second scenario occurs when a number of attacker vehicles target the RSU, so that it becomes unavailable to legitimate nodes. Figure 1 illustrates a vehicle to infrastructure DDoS attack, where the red vehicles represent attackers targeting RSU2 by flooding it with fake messages, eventually disabling it.
In this article, we explored several studies related to DDoS attack detection in VANET systems to highlight the existing limitations of the datasets used as a basis for training the prediction models. Based on the limitations and gaps found, we proposed and conducted a simulation framework to generate a suitable dataset that fits the VANET architecture considering a common routing protocol in VANET, namely, ad hoc on-demand distance vector (AODV). Moreover, the generated dataset is used to evaluate numerous machine learning (ML) models for the purpose of detecting DDoS attacks targeting VANET nodes.
The organization of this study is as follows: we introduce the literature review in section ''Literature review'' with a focus on the studies that involve the usage of datasets in evaluating their DDoS attack detection techniques. The simulation work is presented in section ''Simulation work for generating the dataset.'' The process of generating and adopting the dataset is presented in section ''Dataset specifications.'' Finally, section ''Conclusion'' summarizes the main points as a conclusion, and section ''Future work'' presents the future work.

Literature review
Several studies have been proposed to mitigate the effects of DDoS attacks targeting VANET or ad hoc networks. Some of the studies used statistical methods to detect the attack and others relied on ML techniques to detect and classify the attack traffic. Others still introduced frameworks on generating datasets for evaluating network attack detection techniques. Figure 2 shows the hierarchy and taxonomy of the explored studies in this article. We categorized the studies into six classes based on network environment (VANET or non-VANET), the inclusion of DDoS attack, the presence of a dataset (dataset-based or statistical method), and the evaluation method (ML-based or generation framework).
In Kolandaisamy et al., 13 multivariant stream analysis (MVSA) was proposed as a method to detect and prevent DDoS attacks targeting VANET. Similarly, Haydari and Yilmaz 14 used a statistical anomaly detection technique to detect the attack by applying an online discrepancy test (ODIT) at the detection phase for both low- and high-rate DDoS attacks and then blocking the traffic generated by the attackers based on their locations.
The study presented in Shabbir et al. 15 proposed a threshold-based framework for detecting and preventing attacks by utilizing communication time as a communication characteristic to be compared with a specific threshold to decide about alerting the other nodes to avoid further communication with the attacker nodes. This type of study was excluded as it did not involve datasets for training the proposed models.
Mirsadeghi et al. 16 introduced a cryptography method based on certificates issued by a trusted authority. To have a trusted clustered vehicular network, they proposed estimating a trust degree for each node considering the trust between vehicles and RSUs. Then, based on the estimated trust degree and other mobility measures, the appropriate cluster head is selected which in turn checks the trust degree of abnormal nodes. Any abnormal nodes will be in the blacklist of the certification authority and thus unable to communicate with either other vehicles or control units.
Bhushan and Gupta 17 discussed the features of software-defined networking (SDN) and proposed a novel flow table sharing technique to mitigate DDoS attacks that target the network by overloading the flow table, which usually has a limited size. For the proposed approach, they modeled the flow table space as an M/G/S/C queuing model and then applied several rule-based methods to detect and prevent DDoS on SDNs by utilizing the flow table status for all the switches and the blacklist database that holds Internet protocol (IP) addresses of attack sources.
Kolandaisamy et al. 18 proposed an analysis model that is capable of detecting DDoS attacks on a VANET environment with less time needed to identify the attack compared with other techniques discussed in their study-related work. The main idea is to calculate different measures through different stages as follows. Based on the clustering score of the incoming packets, a stream position analysis was used to calculate specific determined features for the nodes including the volume of the communicated data, payload, and message rates. Then, these computed features were used to calculate the conflict field, conflict data, and attack signature sample rate, which are finally used in a statistically based model to decide on the legitimacy of a node.
Kolandaisamy et al. 19 proposed using an analytical approach that utilizes the measures gathered by packet marking based on an adapted stream region scheme. The proposed approach involves extracting the neighbor log file, calculating the node's value, deciding on the source region, identifying the routes for each region, and computing the circulation rate. Finally, the identification of a DDoS attack is based on the deviation of the circulation from the current rate. Similarly, Bensalah et al. 20 proposed a statistical method for detecting and controlling malicious nodes in a VANET using a variable control chart as a model to monitor the quality of the communication for each node. Then, based on the taken measurements, a node is considered a malicious node when its statistical quality violates the control limit.
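To make the control-chart idea concrete, the following Python sketch computes Shewhart-style limits from a baseline of per-node measurements and flags any node whose later observation falls outside them. It is a generic illustration under stated assumptions, not the procedure of Bensalah et al.: the monitored metric (packets per second), the three-sigma width, the function names, and all sample values are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits as mean +/- k sample standard deviations."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - k * spread, centre + k * spread


def flag_nodes(observations, limits):
    """Return the sorted ids of nodes whose observation violates the limits."""
    lower, upper = limits
    return sorted(node for node, value in observations.items()
                  if value < lower or value > upper)


# Hypothetical baseline packet rates (pps) collected from legitimate nodes.
baseline_rates = [0.8, 1.0, 0.9, 1.1, 1.0, 0.7, 1.2, 0.9]
limits = control_limits(baseline_rates)

# Hypothetical later observations; car3 sends far more than the baseline.
observed = {"car1": 1.0, "car2": 0.9, "car3": 12.0}
suspects = flag_nodes(observed, limits)
print(suspects)
```

A purely statistical rule of this kind needs no training dataset, which is why studies of this type fall outside the dataset-based comparison in this review.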

ML-based studies for DDoS attacks on VANET
ML techniques are promising techniques for providing accurate detection and prediction mechanisms used in many areas and domains including VANETs. 45 There are several ML studies applied to detect numerous malicious behaviors on VANETs. Since our focus in this study is on the dataset that fits the VANET environment, we explored several studies that implemented ML techniques to detect DDoS attacks on VANET systems. [21][22][23][24][25] Moreover, we have highlighted the datasets, simulation tools, and ML techniques used in such studies.
Singh et al. 21 conducted an analysis of the impact of DDoS attacks on vehicle to infrastructure (V2I) communication under an SDN architecture. They simulated a Software-Defined Vehicular Network (SDVN) using Mininet-WiFi and used the scikit-learn library for the ML classifiers. Eight supervised classifiers were used: gradient boost, random forest (RF), logistic regression (LR), nearest neighbors, decision tree (DT), Support-Vector Machine (SVM), naïve Bayes (NB), and neural network. The gradient boost classifier gave the best accuracy among the models used. However, the study did not provide details of the generated dataset or the simulation scenarios such as the number of attacker nodes and the attack rate.
Aneja et al. 22 introduced a hybrid Intrusion Detection System (IDS) to detect the RREQ Flooding attack in VANET environment where they used SUMO, MOVE, and NS-2 tools in their conducted simulation experiments. They combined Artificial Neural Networks (ANNs) with a Genetic Algorithm (GA) model as a detection model where ANN performs the classification and GA tunes the selected input features. The dataset generated in Aneja et al. 22 has a detailed description of the simulation tools used along with the related steps and parameters. However, it has certain limitations: the dataset itself is not available for other researchers, the network configuration was not presented in the study, and there is no report on the dataset features.
The dataset generated in Karagiannis and Argyriou 23 is not available online, the study did not show a clear procedure on processing the data, and there is no report on either the simulation tools or the network configuration parameters. Similarly, the dataset introduced in Belenko et al. 24 is not available online and has no description of the configuration of the network and simulation environments. Thus, the generated datasets introduced in Singh et al., 21 Aneja et al., 22 Karagiannis and Argyriou, 23 and Belenko et al. 24 have been excluded from this study owing to the unavailability of both the dataset and the configuration of simulation work used to generate the datasets.
The study in Zeng et al. 25 proposed a deep learning technique as an IDS that can perform feature extraction and classification of different attacks on VANET including DDoS, wormhole, and Sybil attacks. The dataset used to train the detection models was generated using an NS-3 simulator considering only the raw packets and logs as the output from the simulator. In addition, the ISCX IDS dataset 46 was used to regenerate and extract samples for different types of attacks such as DDoS attacks. The main observed limitations of the generated dataset are related to the generalization of the configuration parameters related to VANETs as well as the lack of enough scenarios on the experimental models. Moreover, the ISCX IDS dataset is a traditional dataset for IDSs and is not designed to capture the structure and features of VANETs.

ML-based studies for malicious attacks on VANET
Many studies have presented different ML techniques trained on datasets for detecting different types of attacks; however, they have not included or considered DDoS attacks. [26][27][28][29][30][31][32][33][34] Ghaleb et al. 26 used ANN model to detect malicious traffic in VANET. They trained ANN on a next generation simulation (NGSIM) dataset using MATLAB tools, and the results showed an accuracy of 99%. They used real traffic along with injected dynamic noises to generate a dataset that had many attacks. However, there were no DDoS attacks; moreover, the dataset did not give any details on the network configuration.
Another study in Li et al. 27 presented the usage of SVM to detect nodes with suspicious behavior in a VANET environment by considering several input parameters such as the movement speed and transmission range. The dataset was generated by the GloMoSim simulation framework. However, the study did not report the dataset specification, nor is the dataset available online.
Grover et al. 28 conducted an experimental work using the NCTUns-5.0 simulator to generate a dataset that was used to train and evaluate several ML techniques on detecting malicious nodes in VANET where Weka had been used to evaluate the specified classifiers. Although it presented the simulation work to generate the dataset, it did not explain the procedure. Moreover, it did not involve a DDoS attack in the generated dataset.
Ali Alheeti et al. 29 proposed a smart security framework to protect the external communication system of autonomous and semi-autonomous vehicles by detecting gray hole and rushing attacks in real time. They simulated the environment using SUMO, MOVE, and NS-2, which generates a trace file from which a set of features is extracted to differentiate legitimate from malicious behavior. Two ML classifiers were applied, SVM and FeedForward Neural Networks (FFNNs), and the results showed that the FFNN model had a lower false negative rate than SVM. However, SVM had a shorter detection time than FFNN. To summarize, a detailed procedure for generating a dataset was presented, but without involving DDoS attack traffic. Moreover, there was insufficient information regarding network configuration parameters.
Aloqaily et al. 30 proposed a framework called D2H-IDS as an IDS in vehicle nodes connected through a cloud network. The effectiveness of this solution was validated through simulations where they generated normal traces using NS-3 and NSL-KDD dataset 47 for generating several attacks including a DoS attack. The features were selected by applying a Deep Belief Network (DBN), and a DT was used for the classification of attacks where results showed high accuracy and low false rates. However, generally, the datasets proposed and generated in Li et al., 27 Grover et al., 28 Ali Alheeti et al., 29 and Aloqaily et al. 30 did not fulfill the required criteria due to the unavailability of the online dataset, not reporting the network configurations, and not considering DDoS attacks as a part of the generated dataset.
Singh et al. 31 generated a synthetic dataset using an NS-3 simulator and mobility traces produced by an SUMO traffic simulator. In the network simulator, they used about 40 vehicles assuming movements of fixed speed during the simulation time. The simulation was designed to generate the traffic holding features of a wormhole attack. Then, the data were preprocessed and used as a dataset for training both K-nearest neighbor (KNN) and SVM models for detecting wormhole attacks. However, they presented no details on the procedure and methodology of simulating the environment and generating the dataset. Moreover, the dataset can only be used for wormhole attacks.
A study Singh et al. 32 proposed using SVM and LR as ML techniques to detect false position data generated by malicious VANET nodes, known as a false position attack. The evaluation of the detection model was conducted on the VeReMi dataset, 41 and the results showed a high accuracy of about 97%. However, the dataset does not involve traces for DDoS attacks.
The VeReMi dataset has been used in Gyawali and Qian 33 to validate different ML techniques (LR, KNN, DT, bagging, and RF) for detecting misbehavior attacks including both false alert generation and position falsification attacks. Similarly, a study presented in So et al. 34 proposed ML-based techniques (KNN and SVM) for detecting misbehavior attacks on VANETs using the VeReMi dataset for training the proposed models. However, the evaluation and prediction of DDoS attacks were not a part of their studies; moreover, the dataset does not involve traces for such attacks.

ML-based studies for DDoS attacks on non-VANET
Several studies have been proposed for detecting DDoS attacks using either existing datasets or simulations to generate their own datasets for the purpose of training and validating different ML classifiers. [35][36][37][38][39][40] The main limitation of these studies is that the datasets used do not capture the characteristics of VANETs or the environment.
Kim et al. 35 used the KDD CUP 1999 intrusion detection dataset to train SVM for detecting several attacks including DDoS attacks, and the results showed an effective classification of different attacks with an accuracy of about 85%. Although the KDD CUP 1999 used in Kim et al. 35 is available online and contains DDoS attacks, it was not designed for VANETs.
Yu et al. 36 proposed a framework to detect DDoS attacks on SDVN environments by implementing three different detection models including a trigger detection model for inbound packets, a flow table feature-based detection model utilizing the features of OpenFlow protocol, and an attack detection model based on SVM. They used a combination of real and generated network traffic to generate a dataset considering different types of DDoS attacks using the Scapy and hping3 tools, and simulation results showed an accuracy of greater than 97%. However, the dataset used is not available online, and the simulation was conducted on virtual machines, which do not reflect VANET characteristics as nodes are mobile in VANET and at different speeds resulting in quick changes to the topology.
Luong et al. 37 presented a simulation work to generate a training dataset to be used with KNN classifiers for detecting flooding attacks on MANETs by considering the frequency of route request packets. Similarly, a study presented in Reddy and Thilagam 38 conducted a simulation work using Network Simulator (NS-2) for ad hoc networks to evaluate a proposed DDoS attack's mitigation technique that relies on the usage of NB classifier. Gao et al. 39 proposed an IDS for DDoS attacks using RF classifier utilizing big data technologies such as Spark and Hadoop distributed file system for implementing the proposed approach. They used both NSL-KDD 47 and UNSW-NB15 48 datasets for evaluating the proposed method for detecting DDoS attacks. However, the datasets used in Luong et al. 37 and Reddy and Thilagam 38 were generated by considering only general ad hoc network characteristics, and are not available online. Similarly, the NSL-KDD and UNSW-NB15 datasets used in Gao et al. 39 include traffic for DDoS attacks, but are not properly representative of VANET as they were designed for general network traffic and thus do not capture the characteristics of a VANET environment.
Ali Alheeti and McDonald-Maier 40 proposed a hybrid IDS for detecting malicious intrusion attacks such as DDoS and network scanning attacks on autonomous vehicles. They used a multi-layer perceptron (MLP) with fuzzy logic techniques trained on the Kyoto dataset, 49 showing an accuracy of 99%. Although the Kyoto dataset holds traffic and features from real network communications, it does not have traffic obtained from VANET communication systems, which use different types of protocols and a specific style of communication.

Framework-based studies for generating datasets
The last group of studies explored here presents frameworks for generating datasets that can be used for training and evaluating intrusion detection techniques against several types of attacks. Some of these studies were dedicated to VANET 41,42 and others were not. 43,44
Lyamin et al. 42 proposed a heuristic approach derived from data mining methods for real-time detection of radio jamming DoS attacks in a VANET communication environment. To train the proposed detection model, they conducted a simulation experiment using MATLAB to generate a dataset holding cooperative awareness message (CAM) transmissions in the IEEE 802.11p protocol. However, the training model focused only on the CAM transmission, leading to short training sequences of 100 s. Moreover, details on the configuration of the simulated environment and the obtained traces are not available online for further studies. A recently published dataset on VANET environments presented in Van der Heijden et al. 41 held many misbehaviors and attacks but did not include DDoS attacks, which are the focus of this study.
A study presented in Damasevicius et al. 43 proposed a dataset called LITNET-2020 generated using LITNET NetFlow topology and holding different attacking scenarios including DoS, DDoS, worms, land, and fragmentation attacks. They considered data flow for different protocols including IPv6, transmission control protocol (TCP), user datagram protocol (UDP), and Internet control message protocol (ICMP). The dataset is available online and the study provided a description about the dataset features and the network configuration parameters. However, the proposed dataset is not dedicated for VANETs as it does not capture the properties of VANETs nor the VANET protocols.
A framework was proposed in Al-Hadhrami and Hussain 44 for dataset generation that can be used to train and validate IDS models on IoT networks. The dataset is called IoT-DDoS and involves different types of traffic including normal traffic, flooding attacks, selective forwarding attacks, and blackhole attacks considering different protocols such as the RPL routing protocol, ICMPv6, IEEE 802.15.4, 6LoWPAN, and UDP. However, the framework does not take into account the specifications and protocols of VANETs.
In summary, we believe that there are limited existing solutions for detecting DDoS attacks in VANETs. The limitations are due to the lack of real or synthetic DDoS datasets designed or generated for VANET environments. Furthermore, applying traditional network solutions to VANETs without considering the VANETs' unique characteristics may lead to inaccurate results. Even though some studies have generated datasets considering the VANET environment, some did not describe the features available in their datasets and others did not describe the methods they followed to generate the datasets. Thus, it is difficult for other researchers to use these datasets or to validate and compare their results with such solutions. Consequently, we believe that these datasets cannot be used for further studies owing to one or more of the following reasons: (1) the dataset is not available online, (2) the dataset does not contain a DDoS attack, (3) the dataset is not designed for VANET environments, and (4) information regarding the network configuration is unavailable. Table 1 compares the explored datasets and studies based on the fulfillment of these four criteria. Moreover, the table presents other aspects of the evaluated studies.
To address the limitations of the existing approaches discussed here, we aim to generate a novel dataset for DDoS attack detection in VANET environments, called VDDD, which can be used by researchers working in this field. To make our dataset reproducible by others, we present in detail the procedure we followed for generating it, from the selection of the simulation tools to the generation of the complete dataset in the VANET environment. Moreover, we analyze and evaluate the quality and performance of the generated dataset by applying commonly used ML techniques.
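The four criteria listed above can be read as a simple filter over candidate datasets. The following Python sketch encodes them as boolean fields and reports which criteria a dataset fails. The class and field names are hypothetical, and the two example rows record only what this review states about LITNET-2020 and about the dataset of Aneja et al.; they are not a reproduction of Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetProfile:
    """One candidate dataset scored against the four review criteria."""
    name: str
    available_online: bool
    has_ddos_traffic: bool
    designed_for_vanet: bool
    reports_network_config: bool

    def unmet_criteria(self):
        """Return the labels of the criteria this dataset does not fulfil."""
        checks = {
            "available online": self.available_online,
            "contains DDoS traffic": self.has_ddos_traffic,
            "designed for VANET": self.designed_for_vanet,
            "network configuration reported": self.reports_network_config,
        }
        return [label for label, ok in checks.items() if not ok]


# Values taken from the descriptions in this literature review.
litnet = DatasetProfile("LITNET-2020", True, True, False, True)
aneja = DatasetProfile("Aneja et al.", False, True, True, False)

for profile in (litnet, aneja):
    print(profile.name, "->", profile.unmet_criteria())
```

A dataset is usable for the purpose of this study only when `unmet_criteria()` returns an empty list.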

Simulation work for generating the dataset
When simulating a VANET environment with many vehicles conceivably broadcasting several messages per second, the selection of simulation tools becomes a crucial task. Some important parameters should be considered, such as user-friendliness, scalability, and the ability to couple network communication simulators with road traffic simulators. We implemented our model using a combination of four frameworks: OMNeT++, 50 SUMO, 51 INET, 52 and Veins. 53 The simulation tools and versions used in this study are shown in Table 2. To create a realistic testbed, we simulated the traffic on King Fahad Highway, which is located in the Eastern Province of the Kingdom of Saudi Arabia and connects the cities of Dammam and Al-Khobar, as shown in Figure 3.
OMNeT++ records simulation results as scalar values, vector values, and histograms. In addition, it supports exporting the results to different formats including Python files, SQLite files, and others. The INET Framework is an open-source model library for OMNeT++ that supports the implementation of many transport layer protocols such as TCP, UDP, and stream control transmission protocol (SCTP). In addition, it supports wired and wireless interfaces such as Ethernet and IEEE 802.11, as well as many other protocols and components.
Veins (vehicles in network simulation) is an open-source framework for running vehicular network simulations. It allows dynamic interaction between OMNeT++ and SUMO through TraCI (traffic control interface). Veins was selected for its distinctive features, such as support for realistic maps, realistic traffic, and the use of different protocols including the routing protocols provided by INET. Finally, SUMO is an open-source traffic generator that creates mobility scenarios on real road maps based on user-specified parameters. SUMO provides TraCI as an application programming interface (API), which allows accessing and synchronizing values retrieved from the SUMO simulated scenarios by managing TCP connections under a client/server architecture.
The aforementioned simulation tools were selected by evaluating various simulation tools in two stages. The first stage was exploring the simulation tools in the literature review and their features, as well as how researchers use these tools to simulate their networks. In the second stage, we selected tools based on the following criteria: model development, customization, ability to generate different types of network traffic, scalability, integration with other tools, and capability to record and analyze the generated events. The combination of the selected frameworks satisfies all the requirements for simulating VANET, including normal traffic and UDP flooding attacks, as well as recording all the traffic events within the configured environment.

Overview of proposed work
The main idea of the proposed work is to simulate a VANET environment considering both normal and DDoS traffic for the purpose of generating a synthetic dataset based on several simulated scenarios to be used as an input for ML methods. The proposed work involves three main stages as shown in Figure 4. Starting from the bottom, the first stage is to generate realistic network mobility traffic using SUMO. The second stage is to import the SUMO mobility traffic into OMNeT++ to generate the network traffic (normal and DDoS) utilizing both Veins and INET. The final stage is to collect and prepare the dataset that will be used for evaluating and studying the performance of several ML algorithms.
We started by using SUMO to prepare the network and generate the traffic. The first step is to export the simulation area, the King Fahad Highway, from OpenStreetMap (OSM). 54 Then, the OSM file is processed with SUMO's netconvert utility, which transforms the geo-coordinates of the OSM map into metric coordinates; these metric coordinates are used by SUMO in the next step. The following command performs this task:
netconvert --osm-files *.osm -o *.net.xml
Besides the network file, we need to consider the obstacles found within the scenario such as buildings and parks. OSM files have the advantage of providing such information in addition to other information like streets, lanes, junctions, and the maximum speed for each street. We used the polyconvert utility to generate a poly file, which can be used in Veins to identify all the obstacles, using the following command:
polyconvert --net-file *.net.xml --osm-files *.osm --type-file *.xml -o *.poly.xml
After generating the obstacles file, the SUMO network is established and we proceed to generate the network traffic. There are two options to generate traffic for the vehicles in SUMO. The first one is to generate a random trip and the second one is to design a custom trip on a specific route. 
In this study, we selected the second option and assumed that all vehicles considered in our scenarios cross the King Fahad Highway from Dammam to Al-Khobar. To simulate traffic in our network, we generated four files.
In OMNeT++, we imported two frameworks, INET and Veins, as shown in Figure 4. Veins uses protocols and applications provided by INET to simulate both normal and attack traffic. INET provides and supports different transport layer protocols like TCP, UDP, and SCTP. Moreover, it provides several routing protocols that can be used within the simulation. In this work, we used UDP as it is widely used in VANET owing to its ability to transport data rapidly compared to TCP. 55 Two types of UDP applications were used in this work: the UDP Basic App and the UDP sink. The UDP Basic App sends UDP packets to the given IP addresses at each time interval, where the IP address could be that of a Wireless Local Area Network (WLAN) or of a node. The UDP sink App binds a UDP socket to a given local port and prints information about the received packets such as the source, destination, and length of the packet.
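The two SUMO preprocessing commands described in this section (netconvert followed by polyconvert) can be scripted so that the same map preparation is repeatable. The following Python sketch only builds the argument lists; the file names are hypothetical placeholders, and `--type-file` is the polyconvert option that takes the type map. Running the commands requires a local SUMO installation, so the actual invocation is left as a comment.

```python
import shlex


def netconvert_cmd(osm_file, net_file):
    """Build the netconvert call that turns an OSM export into a SUMO network."""
    return ["netconvert", "--osm-files", osm_file, "-o", net_file]


def polyconvert_cmd(net_file, osm_file, type_file, poly_file):
    """Build the polyconvert call that extracts obstacles (buildings, parks)."""
    return [
        "polyconvert",
        "--net-file", net_file,
        "--osm-files", osm_file,
        "--type-file", type_file,
        "-o", poly_file,
    ]


# Hypothetical file names; the article does not list the actual ones.
steps = [
    netconvert_cmd("highway.osm", "highway.net.xml"),
    polyconvert_cmd("highway.net.xml", "highway.osm",
                    "typemap.xml", "highway.poly.xml"),
]

for cmd in steps:
    # To execute for real: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Keeping the network step before the obstacle step matters because polyconvert reads the network file that netconvert produces.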
In the Veins subproject, we edited and built the simulation in three steps to meet our scenario's parameters, as shown in Table 3. The first step is to replace the square files (square.net.xml, square.poly.xml, and square.rou.xml) with the SUMO files that we generated previously and that have the same extensions. Figure 5 shows the contents created as a result of this first step. The second step is to edit the "scenario.ned" file to meet our scenario's parameters and other required network configuration, such as the lifecycle controller, which manages general operations such as shutdown, restart, suspend, and crash. Figure 6 shows the design of "scenario.ned." Furthermore, we added the AODV routing protocol to the vehicle node "car.ned" and connected it to the network layer, as shown in Figure 7. The final step is to simulate both the normal and DDoS traffic through "omnetpp.ini."

OMNeT++ provides all the requirements to simulate different security attacks, and several researchers have used it to simulate different types of DDoS attacks in traditional networks. [56][57][58] In this work, we generated normal and DDoS traffic for VANET scenarios. For the normal traffic, each node (vehicle or RSU) broadcasts UDP packets with a transmission interval of 1-5 s. By contrast, the DDoS traffic is defined by two key attributes: the attack intensity and the number of attacker nodes. The attack intensity is between 10 and 50 packets per second (pps), and the number of attackers is either 2 or 6, depending on the designed scenario. Figures 8 and 9 show the configuration parameters for the normal UDP traffic and the DDoS attack traffic, respectively. The parameters include the IP addresses, port numbers, the multicasting group, start time, end time, and the traffic rate.
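To make the difference between the two traffic profiles concrete, the sketch below generates send timestamps for one normal node and one attacker. It is an illustration only: the function names, the 10 s window, and the use of a uniform distribution for the normal sending gap are assumptions (the text gives only the 1-5 s range), and it does not reproduce the OMNeT++ configuration.

```python
import random


def attack_send_times(start, stop, pps):
    """Fixed-rate sender: one packet every 1/pps seconds in [start, stop)."""
    count = int((stop - start) * pps)
    return [start + i / pps for i in range(count)]


def normal_send_times(start, stop, rng, low=1.0, high=5.0):
    """Normal sender: gaps drawn uniformly from [low, high] seconds (assumed)."""
    times = []
    t = start + rng.uniform(low, high)
    while t < stop:
        times.append(t)
        t += rng.uniform(low, high)
    return times


rng = random.Random(42)   # fixed seed so the sketch is repeatable
window = (0.0, 10.0)      # hypothetical 10 s observation window
normal = normal_send_times(*window, rng)
low_attack = attack_send_times(*window, pps=10)
high_attack = attack_send_times(*window, pps=50)
print(len(normal), len(low_attack), len(high_attack))
```

Over the same window, a single attacker at 10 pps emits 100 packets and at 50 pps emits 500, whereas a normal node emits at most a handful, which is the contrast the detection features rely on.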

Scenarios
In this study, the implemented topology includes three RSUs and N vehicles along an 18 km highway, where we considered a low traffic density of 20 nodes (N = 20) and a high traffic density of 60 nodes (N = 60). For each density, we used two levels of attack rate, 10 and 50 pps, resulting in four different scenarios. In addition, we configured one of the RSUs as the victim unit targeted by the attack traffic. Normal and attack traffic was generated using OMNeT++, where each node broadcasts requests to all reachable nodes. All nodes send normal packets at random, with a transmission interval between 1 and 5 s. The attack traffic was generated by specific vehicles targeting the victim RSU at the two rates (10 and 50 pps).
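The four scenarios are the cross product of the two traffic densities and the two attack rates. The short sketch below enumerates them; the dictionary keys and the scenario numbering are assumptions made for illustration and do not necessarily match the ordering used in the tables.

```python
from itertools import product

VEHICLE_COUNTS = (20, 60)      # low and high traffic density (N)
ATTACK_RATES_PPS = (10, 50)    # low and high attack intensity
RSU_COUNT = 3                  # one of the RSUs acts as the victim

scenarios = [
    {"id": i, "vehicles": n, "rsus": RSU_COUNT, "attack_pps": rate}
    for i, (n, rate) in enumerate(
        product(VEHICLE_COUNTS, ATTACK_RATES_PPS), start=1
    )
]

for scenario in scenarios:
    print(scenario)
```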
We designed these scenarios with their related parameters based on recent studies, which simulated a VANET environment to either study the impact of some attacks or to generate a dataset for VANET environments. Table 4 shows the simulation parameters used by several studies from which we have adapted our parameters shown in Table 5.

Dataset specifications
Generally, evaluating an ML-based intrusion detection model depends on more than the classification accuracy, as other dimensions should also be considered, such as the characteristics of the simulation area, the dataset used, and how the normal and attack traffic is generated.
In this section, we describe the procedure used to create VDDD. The following subsections illustrate the steps for generating a synthetic dataset in a VANET environment. We start with the data collection and data preparation steps, then proceed to the data pre-processing step, and finally present the dataset's feature selection step.

Data collection
In this stage, we collect our data from OMNeT++ for further analysis. Two files are mainly required to generate the dataset, which are the trace file (log file) and simulation results (vector file). Figure 10 illustrates the workflow we followed starting with collecting the raw data and proceeding until we obtained a complete and informative dataset.
The log file holds the message transmission events that take place among modules during the simulation. The information recorded in this file includes the event number, time, source and destination, packet name, source and destination port, and packet length. The vector file records data values as time series, that is, each value is stored with a timestamp, which is necessary to calculate the features in the upcoming steps. These data values are recorded and captured based on several categories or features. Moreover, OMNeT++ provides several analysis and validation tools that can be used to validate the accuracy of such data vectors. For example, Figure 11 shows a vector plot of all the transmission rates observed during the simulation.

Data preparation
After the raw data are collected in the previous step, they are prepared and processed so that they can be used for evaluating ML techniques. As shown in Figure 12, the raw data go through several stages until we obtain an informative dataset in a format suitable for reading and analysis. These stages involve processing the log file obtained from the log viewer, processing the vector file generated by OMNeT++, merging the log and vector files using Python functions in Jupyter Notebook, and labeling traffic instances using queries in SQLite. The purpose of processing the log file is to keep only the important information and to clean the data by removing redundant information, thus making it ready to be merged with the vector data. Figure 13 shows the final version of the log file.
The vector file was exported from OMNeT++ to a SQLite database browser. 60 The exported vector file contains 12 tables, some of which hold general information such as the simulation run information. We focused on only three tables, as shown in Figure 14. The vector table contains information about all modules along with statistics such as Min, Max, Sum, and others. Figure 15 shows part of the vector table. The preparation step for this file involves correcting errors in the vector data table, such as the data types of some fields, and validating the data values exported from OMNeT++.
To obtain an informative dataset, it is necessary to merge the log file with the vector file. To achieve that, we wrote Python functions and used Jupyter Notebook 61 to merge the two files by visiting each event in the log file and, for each event, determining the current, previous, and next event times of the corresponding node in order to obtain the values of the 16 selected features in the interval between the previous and next times. Figure 16 shows the workflow of the functions that extract the feature values from both the log and vector files. The main idea is to run queries on the data files that accumulate the values related to each feature over the time interval between the previous and next events. As shown in Figure 16, we developed several functions and queries to handle different features according to their nature. The functions run queries to extract the instances related to a specific node based on the event's timestamp, previous time, and next time, and then cumulatively calculate the feature values.
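The windowing idea described above can be sketched with an in-memory SQLite database. Everything in this example is hypothetical: the table names, column names, sample rows, and the single accumulated value stand in for the real log and vector schemas, and falling back to the event's own time when no previous or next event exists is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE log_events (event_no INTEGER, sim_time REAL, node TEXT);
    CREATE TABLE vector_data (node TEXT, sim_time REAL, value REAL);
""")
conn.executemany("INSERT INTO log_events VALUES (?, ?, ?)", [
    (1, 1.0, "car0"), (2, 2.0, "car1"), (3, 3.0, "car0"), (4, 6.0, "car0"),
])
conn.executemany("INSERT INTO vector_data VALUES (?, ?, ?)", [
    ("car0", 0.5, 100.0), ("car0", 1.5, 200.0),
    ("car0", 4.0, 300.0), ("car1", 2.0, 50.0),
])


def event_window(conn, node, t):
    """Return (previous, next) event times of the same node around time t."""
    prev_t = conn.execute(
        "SELECT MAX(sim_time) FROM log_events WHERE node = ? AND sim_time < ?",
        (node, t),
    ).fetchone()[0]
    next_t = conn.execute(
        "SELECT MIN(sim_time) FROM log_events WHERE node = ? AND sim_time > ?",
        (node, t),
    ).fetchone()[0]
    # Assumption: use the event time itself when a neighbour is missing.
    return (t if prev_t is None else prev_t, t if next_t is None else next_t)


def accumulate_feature(conn, node, t):
    """Sum the node's vector values recorded inside the event's window."""
    lo, hi = event_window(conn, node, t)
    total, samples = conn.execute(
        "SELECT COALESCE(SUM(value), 0.0), COUNT(*) FROM vector_data "
        "WHERE node = ? AND sim_time BETWEEN ? AND ?",
        (node, lo, hi),
    ).fetchone()
    return {"window": (lo, hi), "total": total, "samples": samples}


print(accumulate_feature(conn, "car0", 3.0))
```

For the car0 event at 3.0 s, the window runs from 1.0 s to 6.0 s, so the two vector samples at 1.5 s and 4.0 s are accumulated and the sample at 0.5 s is excluded.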
Labeling the dataset is an indispensable stage of data pre-processing. From the previous steps, we have full information about each traffic item or event. In this stage, we labeled all traffic as either normal or DDoS based on the attack details, such as source IP, destination IP, attack times, and duration. The labeling was done by applying queries in the SQLite DB browser. Table 6 shows the number of instances and their class labels in each dataset.
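A labeling query of this kind can be sketched as a single SQL UPDATE that marks records sent by a known attacker to the victim during the attack window. The table layout, IP addresses, and time window below are hypothetical placeholders, not values from the simulation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE traffic ("
    "src_ip TEXT, dst_ip TEXT, sim_time REAL, label TEXT DEFAULT 'normal')"
)
conn.executemany(
    "INSERT INTO traffic (src_ip, dst_ip, sim_time) VALUES (?, ?, ?)",
    [
        ("10.0.0.5", "10.0.0.100", 12.0),  # attacker -> victim, inside window
        ("10.0.0.5", "10.0.0.100", 70.0),  # attacker -> victim, after the attack
        ("10.0.0.7", "10.0.0.100", 15.0),  # benign node -> victim
        ("10.0.0.5", "10.0.0.8", 20.0),    # attacker -> another node
    ],
)

ATTACKERS = ["10.0.0.5"]               # hypothetical attacker addresses
VICTIM = "10.0.0.100"                  # hypothetical victim RSU address
ATTACK_START, ATTACK_END = 10.0, 60.0  # hypothetical attack window (s)

placeholders = ",".join("?" for _ in ATTACKERS)
conn.execute(
    f"UPDATE traffic SET label = 'DDoS' "
    f"WHERE src_ip IN ({placeholders}) AND dst_ip = ? "
    f"AND sim_time BETWEEN ? AND ?",
    (*ATTACKERS, VICTIM, ATTACK_START, ATTACK_END),
)

labels = [row[0] for row in conn.execute("SELECT label FROM traffic ORDER BY rowid")]
print(labels)
```

Only the first record satisfies all three conditions (attacker source, victim destination, and time inside the attack window), so it is the only one relabeled.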

Data pre-processing
In the ML field, raw data often contain erroneous or missing values, so data pre-processing is required before applying any classifier. The pre-processing stage in our proposed architecture involves three steps: data normalization, feature selection, and balancing. We used Weka for data pre-processing, and the following subsections give a detailed explanation of each step.

Data normalization. Data normalization is the process of rescaling the dataset attributes to lie in one particular range, for example, between 0 and 1 or between -1 and 1, according to equation (1). Normalizing the data often makes the dataset ready for any classifier. To improve the accuracy results, we applied normalization to our dataset using Weka.
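As an illustration of rescaling to a fixed range, the sketch below applies min-max normalization, x' = (x - min) / (max - min), optionally stretched to another target range. Treating the normalization as min-max is an assumption made for this example, and the function name and sample packet lengths are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map it to the lower bound.
        return [new_min for _ in values]
    return [
        new_min + (v - lo) / (hi - lo) * (new_max - new_min)
        for v in values
    ]


packet_lengths = [64, 128, 512, 1024]  # hypothetical attribute values
unit_range = min_max_normalize(packet_lengths)
symmetric_range = min_max_normalize(packet_lengths, -1.0, 1.0)
print(unit_range)
print(symmetric_range)
```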
Feature selection. Feature selection is a data reduction method; choosing the right features significantly influences performance, as it reduces the training time and improves the accuracy. Conversely, keeping irrelevant or partially relevant features can negatively affect performance. Various feature selection techniques are available, such as correlation-based feature selection (CFS), information gain (IG)-based feature selection, and gain ratio (GR) feature selection. 49 A brief description of each of these techniques follows.

CFS. CFS is a popular technique that estimates the correlation between a subset of attributes and their corresponding classes, as well as the inter-correlations among the features. It measures the relevance of a group of features: a high correlation between the features and the classes indicates that the group is more relevant, whereas a high inter-correlation indicates that the group is less relevant. 62,63 The CFS measure is given in equation (2):

$$M_s = \frac{K\,\overline{r_{cf}}}{\sqrt{K + K(K-1)\,\overline{r_{ff}}}} \qquad (2)$$

where $M_s$ is the heuristic merit of a subset containing $K$ features, $\overline{r_{cf}}$ is the mean correlation between the features and the classes, and $\overline{r_{ff}}$ is the average correlation among the features themselves. After calculating CFS, we selected only those attributes that have a high positive or negative correlation, that is, a value close to -1 or 1. We discarded the low-correlation attributes whose values were close to zero.
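The trade-off that the CFS heuristic captures, rewarding correlation with the class and penalizing redundancy among features, can be checked numerically. The sketch below uses the standard CFS merit form from the literature; the function name and the correlation values are hypothetical and are not taken from the dataset.

```python
from math import sqrt


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a k-feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    return (k * mean_feature_class_corr) / sqrt(
        k + k * (k - 1) * mean_feature_feature_corr
    )


# Same class correlation, different redundancy among the four features.
low_redundancy = cfs_merit(4, 0.6, 0.2)
high_redundancy = cfs_merit(4, 0.6, 0.8)
print(round(low_redundancy, 4), round(high_redundancy, 4))
```

With the class correlation held fixed, the subset whose features are less correlated with each other receives the higher merit.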
IG-based feature selection. IG, which is based on entropy, is another popular feature selection technique; it measures how much information each feature contributes about the class. The value varies from 0 to 1, where highly informative features receive the highest values and 0 means that the feature carries no information about, or has no impact on, the classes. 64 The IG measure is given in equation (3):

$$IG(X, Y) = H(X) - H(X \mid Y) \qquad (3)$$

where $X$ and $Y$ are random variables, $H(X)$ is the entropy of $X$, and $H(X \mid Y)$ is the entropy of $X$ given $Y$.
GR feature selection. The GR is the ratio of IG to the intrinsic information, obtained by dividing IG by the entropy of X, as shown in equation (4):

$$GR = \frac{IG(X, Y)}{H(X)} \qquad (4)$$

When X completely predicts Y, GR = 1; when there is no relation between Y and X, GR = 0. In contrast to IG, the GR favors variables with fewer distinct values. 64

Usually, when supervision information is available, the significance of a feature is assessed via its correlation with the class labels. 65 Based on that, we used CFS, which is supported by Weka. Tables 7-10 show the attribute rankings obtained by Weka for each dataset. Based on the ranked attributes, we set the cutoff at 0.2. Thus, an attribute with a rank value equal to or greater than 0.2 is considered an important feature and is included in the evaluation process; otherwise, it is discarded.
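The entropy-based measures and the 0.2 cutoff can be illustrated on a toy example. This is a minimal sketch under stated assumptions: IG is computed as H(X) - H(X|Y) and GR as IG / H(X), the feature and label values are invented, and the cutoff is applied to these toy scores rather than to the Weka rankings in Tables 7-10.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(X) in bits of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y): entropy of x within each group of y, weighted by group size."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / len(x) * entropy(g) for g in groups.values())


def info_gain(x, y):
    return entropy(x) - conditional_entropy(x, y)


def gain_ratio(x, y):
    hx = entropy(x)
    return info_gain(x, y) / hx if hx > 0 else 0.0


labels = ["ddos", "ddos", "normal", "normal"]       # toy class labels
features = {
    "packet_rate": ["high", "high", "low", "low"],  # tracks the class exactly
    "node_parity": ["odd", "even", "odd", "even"],  # unrelated to the class
}
scores = {name: gain_ratio(col, labels) for name, col in features.items()}

CUTOFF = 0.2
selected = [name for name, score in scores.items() if score >= CUTOFF]
print(scores, selected)
```

The feature that tracks the class exactly scores 1, the unrelated feature scores 0, and only the former passes the cutoff.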
Balancing. In our scenarios, we simulated both normal and attack traffic, and according to a real-world  67 However, with these balancing techniques, there was no significant increase in the accuracy of the evaluated models. Thus, we consider applying only the SMOTE sampling method as it gives the best accuracy among the metaclassifiers used in this study.

Dataset description
Each record of the dataset has 29 features, including one class attribute that marks the record as either DDoS or normal. The features in bold in Table 11 are the ones chosen after applying CFS. Note that some unsuitable features, such as IP addresses, protocol type, and times (the first 12 features in Table 11), were excluded from the initial feature set to ensure that the classification model does not rely on particular acquisition biases. Overall, 10 features were selected from the original 29 for the next stage. Table 11 lists all features alongside their descriptions and an example of each.

Dataset evaluation
One of the main objectives of this work is to generate VDDD from a VANET environment and to share it with other researchers. To evaluate the validity and quality of our generated dataset, VDDD, we followed the 11 criteria proposed by the Canadian Institute for Cybersecurity as a framework for evaluating datasets. 68 VDDD fulfilled nine of the eleven criteria; the two that it did not satisfy are heterogeneity and attack diversity. Nevertheless, VDDD contains different attack scenarios that vary in attack rate and attack source. Table 12 indicates whether VDDD achieves each criterion.

One of the important points to consider is the feature set. In this study, we provided all the features available from the simulation experiments and then applied feature selection techniques to select the feature set that gives the best accuracy. Table 13 summarizes the features recently studied and used in detecting DDoS attacks on VANETs and other environments using ML techniques.
To evaluate the validity of VDDD, we examined the performance and accuracy of the selected features with different ML techniques, including J48, SVM, RF, KNN, ANN, and NB, which are commonly used for DDoS attack detection, as shown in Section "Literature review." Here, we used the VDDD generated for the fourth scenario discussed earlier in Section "Scenarios." The experimental results were evaluated in terms of accuracy, precision, recall, and F1-score. The accuracy is the percentage of correctly classified instances in the test dataset. The precision is the ratio of instances correctly classified as positive to all instances predicted as positive. The recall is the ratio of instances correctly classified as positive to all instances that are actually positive. Finally, the F1-score, also called the F-score, combines recall and precision into a single measure. The experiments were conducted using the Weka tool with the classifier parameters shown in Table 14 for all the applied ML classifiers.
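The four metrics defined above follow directly from the confusion-matrix counts. The sketch below computes them for a made-up set of counts; the numbers are hypothetical and are not taken from Table 16, and treating DDoS as the positive class is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts with DDoS as the positive class.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because normal traffic dominates in this toy example, accuracy stays high even though recall is noticeably lower, which is why the four metrics are reported together.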
The results of the classification on VDDD presented in Table 15 show that all the applied classifiers achieved high detection accuracies generally greater than 99% except for SVM, which shows an accuracy of 97%. The RF classifier achieved the highest accuracy at 99.7% compared to other applied ML classifiers. Table 16 presents the confusion matrices for each classifier. Based on these statistics, Figure 17 presents the false rates for classifiers, where SVM shows a higher false rate compared to other classifiers.
To calculate the computing time of our proposed approach, we excluded the time taken to generate the whole dataset. The focus was on the other computational aspects, namely, the classifier building time and the feature weight calculation time. 7 We conducted several experiments on VDDD, starting with the feature selection method and then executing all the studied ML methods, to obtain the average computation time for both building the classifier and ranking and selecting the features. We took the average of these measures over seven runs on a computer running Windows 10 with a 2.6 GHz CPU and 8 GB of RAM. The results are as follows: the average time taken for feature selection using the IG ranking method was about 0.23 s. The average time taken to build the ANN model was the longest, at about 2.54 s, followed by SVM at 1.8 s; the remaining models had comparable times of no more than 0.12 s.

Moreover, we present a receiver operating characteristic (ROC) curve, 69 shown in Figure 18, which reflects the quality of the decisions made by the classifiers. ROC is one of the best measures for evaluating the performance of classification models across threshold settings, as it reflects the model's ability to separate the classes. On VDDD, the experimental results showed that all models have an area under the ROC curve of around 99%, except for the SVM model, which showed 97%. These results indicate the effectiveness of all models in predicting classes with very low false rates.
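The area under the ROC curve can be read as the probability that a randomly chosen attack instance receives a higher score than a randomly chosen normal instance. The sketch below computes it from that pairwise definition; the labels and scores are hypothetical, and this is not how Weka computes the reported values internally.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores: 1 = DDoS, 0 = normal.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(y_true, y_score)
print(round(auc, 4))
```

In this toy example, one attack instance is scored below one normal instance, so eight of the nine pairs are ordered correctly.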

Conclusion
An insecure VANET can lead to serious accidents, physical disability, and even death. Accordingly, the security concerns regarding VANET require the attention of researchers and developers, with special consideration of the unique characteristics of the network. In addition, VANET must be capable of accurately detecting and preventing possible threats that might occur on the network. In this article, we explored several aspects of the VANET system, including its architecture and characteristics, and presented a literature review of recent studies on securing VANETs against DDoS attacks. Owing to the lack of available DDoS attack datasets that fit the VANET environment, we simulated a VANET environment involving a real highway scenario using several tools, including OMNeT++, INET, Veins, and SUMO. The simulated scenarios were used to build a dataset for detecting DDoS attacks in a VANET environment. The dataset records were processed following the data preparation and pre-processing principles identified in the literature review. The proposed dataset, VDDD, fulfills the majority of the requirements for a valid dataset, as it overcomes the issues with existing VANET datasets, such as ignoring VANET characteristics, dissimilar network configurations, and unavailability to the public. Several ML models were trained on the generated dataset, and all achieved high accuracy in detecting DDoS attack traffic.

Future work
As future work, the study can be extended by considering other aspects when simulating VANET, such as context and weather conditions. Moreover, the generated dataset, VDDD, currently contains only UDP flooding attacks. It can be made more generic by adding more types of DDoS attacks, as well as other types of attacks. More ML techniques can be trained on the dataset to evaluate VDDD further. Generating more attack traffic to balance legitimate and malicious traffic is another possible extension of this work.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.