Asynchronous event feature generation and tracking based on gradient descriptor for event cameras

Recently, the event camera has become a popular and promising vision sensor in research on simultaneous localization and mapping and computer vision owing to its advantages: low latency, high dynamic range, and high temporal resolution. As a basic component of feature-based SLAM systems, feature tracking with event cameras remains an open question. In this article, we present a novel asynchronous event feature generation and tracking algorithm operating directly on event-streams to fully utilize the natural asynchronism of event cameras. The proposed algorithm consists of an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit. The event-corner detection unit employs a fast and asynchronous corner detector to extract event-corners from event-streams. For the descriptor construction unit, we propose a novel asynchronous gradient descriptor inspired by the scale-invariant feature transform descriptor, which enables quantitative measurement of the similarity between event feature pairs. The construction of the gradient descriptor can be decomposed into three stages: speed-invariant time surface maintenance and extraction, principal orientation calculation, and descriptor generation. The event feature tracking unit combines the constructed gradient descriptor with an event feature matching method to achieve asynchronous feature tracking. We implement the proposed algorithm in C++ and evaluate it on a public event dataset. The experimental results show that our proposed method improves tracking accuracy and real-time performance compared with the state-of-the-art asynchronous event-corner tracker, with no compromise on feature tracking lifetime.


Introduction
Over the past several years, simultaneous localization and mapping (SLAM) has been widely studied and developed for augmented and virtual reality, self-driving cars, and unmanned aerial vehicles. 1 The combination of deep learning and SLAM 2,3 has also become a hot research topic at present. However, due to the complexity of real environments, existing visual SLAM systems using a single vision sensor still face many problems, such as tracking failure. To enhance the robustness of SLAM systems, many researchers fuse data from two or more sensors, such as cameras, Lidar, GPS, IMU, and so on. 4,5 However, these systems still face many challenges in difficult scenes, such as high-speed motion and high dynamic range. Recently, bioinspired vision sensors 6,7 have aroused many researchers' interest and have become a hot research topic in robotics and computer vision. Event cameras respond to local pixel-level brightness changes, transmitting asynchronous events only when brightness changes are detected rather than frames at a fixed time interval, which is intrinsically different from standard cameras. Each event is a tuple [x, y, t, p], where (x, y) is the coordinate in the imaging plane, t is the triggered timestamp, and p is the sign of the brightness change. The advantages of event cameras include low latency, low power consumption, high dynamic range, high temporal resolution, and no motion blur. Therefore, event cameras have the potential to help SLAM systems overcome their limitations in challenging environments. For example, owing to their natural sensitivity to dynamic objects, event cameras can be used for object detection and tracking, 8 and they have great potential for improving the performance of the SLAM pipeline in dynamic environments, 9 where we might otherwise have to filter dynamic moving objects from the raw image data for better SLAM performance. 10,11 Unfortunately, the asynchronous events from event cameras are intrinsically different from intensity images, so standard computer vision methods cannot be directly applied to event cameras. 12 Researchers have to explore new methods to bring event cameras' potential into full play. Until now, many research efforts have focused on event cameras in multiple directions, such as SLAM, 9,13,14 segmentation, 15,16 reconstruction of visual information, 17,18,19 and control for unmanned aerial vehicles. 20,21 More related research can be found in the survey articles 12,22 and the list of event-based vision resources (https://github.com/uzh-rpg/event-based_vision_resources).
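For concreteness, a minimal sketch of the event tuple [x, y, t, p] as a data structure follows; the field names and types are our own illustrative assumptions, not those of a specific camera driver or library.

```cpp
// A minimal representation of the event tuple [x, y, t, p] described above.
// Field names and types are illustrative assumptions, not a specific API.
#include <cstdint>

struct Event {
    uint16_t x;   // pixel column on the imaging plane
    uint16_t y;   // pixel row on the imaging plane
    double   t;   // timestamp (seconds) at which the event was triggered
    bool     p;   // polarity: true = brightness increase, false = decrease
};
```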
As one of the basic approaches in SLAM, feature-based SLAM methods extract features from intensity frames, and each feature is associated with a descriptor, such as scale-invariant feature transform (SIFT), speeded-up robust features (SURF), oriented FAST and rotated BRIEF (ORB), and so on. The extracted descriptor preserves the information of the local area around the feature point and enables a quantitative comparison with other feature points. 5 Data association is then performed to associate similar features and complete the feature tracking task. To the best of our knowledge, there is still no visual SLAM system using an asynchronous feature tracking method designed for event cameras. Driven by the demand for an efficient asynchronous feature tracking method for a subsequent SLAM system based on event cameras, we propose an asynchronous event feature generation and tracking algorithm working directly on event-streams. The proposed algorithm consists of an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit. The results of asynchronous event feature tracking are shown in Figure 1. The main contributions of this article can be summarized as follows:

- We propose an asynchronous event feature generation and tracking algorithm that works directly on asynchronous event-streams. The proposed algorithm includes an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit.
- We introduce a novel asynchronous event feature gradient descriptor. The descriptor is constructed by speed-invariant time surface (SITS) 24 maintenance and extraction, principal orientation calculation, and descriptor generation. It represents the distribution of the local gradient information around event-corners and enables quantitative measurements of similarity between event feature pairs.
- We implement our proposed algorithm in C++ and evaluate it on a public dataset. 23 The experimental results show that our proposed method improves tracking accuracy and real-time performance compared with the state-of-the-art asynchronous event-corner tracker, with no compromise on feature tracking lifetime.
The rest of the article is organized as follows. The related works are reviewed in the next section. Then, we give an overview of the presented algorithm, followed by the introduction of the proposed gradient descriptor. Later, the details of the event feature tracking method are outlined, and the following section presents the experimental results and the corresponding analysis. Finally, conclusions are drawn and future work is discussed.

Figure 1. Our event feature generation and tracking algorithm works directly on asynchronous event-streams based on our proposed gradient descriptor. The figure shows the event feature tracking results in the spatiotemporal space with our proposed algorithm on the shapes scene of the public event camera dataset. 23 Different colors indicate different tracked event features.

Related works
In computer vision, a feature may be a specific structure, such as an interest point, edge, block, or object, which differs from its immediate neighborhood in the image. Feature-based tracking is widely applied in visual odometry, SLAM, and augmented reality. A feature-based tracking method generally consists of feature detection, feature description, feature matching, and feature tracking. In the whole process, feature description is one of the most significant steps for tracking.

Feature descriptors for standard images
As one of the most widely used features, SIFT 25 is invariant to image scale and rotation, and robust to changes in illumination and affine distortion. The generation of the SIFT feature descriptor has four stages: scale-space extrema detection (based on the difference-of-Gaussian pyramid), keypoint localization, orientation assignment, and keypoint description. After the first two stages, keypoints are selected, including their locations and scales. In the orientation assignment step, one or more orientations are assigned to each keypoint based on the local image gradient information in the local patch region around the keypoint location. Thus, every keypoint is assigned a location, scale, and orientation. Finally, a multidimensional descriptor is calculated for each keypoint at the selected scale based on the local patch region around its location. The SURF descriptor 26 was proposed based on an idea similar to SIFT. SURF is faster than SIFT, and it is also scale and rotation invariant. As a binary descriptor, the binary robust independent elementary feature (BRIEF) descriptor 27 allows very fast Hamming distance matching, but it is neither scale invariant nor rotation invariant. Another binary descriptor, called ORB, 28 combines the oriented features from accelerated segment test (FAST) detector 29 and a rotated BRIEF descriptor. ORB is rotation invariant but not scale invariant. Compared to BRIEF and ORB, SIFT and SURF need significantly more computational effort. However, BRIEF and ORB use binary strings as feature descriptors, which results in higher mismatch rates.

Event-based corner detection
In recent years, many asynchronous event-corner detection and tracking methods 30,31,32,33,34 have been proposed for event-driven data. In detail, Vasco et al. 31 applied an adaptation of the original image-based Harris corner detector 35 to event-based data, while Mueggler et al. 32 presented a FAST-like event-based corner detector, inspired by the image-based FAST corner detection method, that is faster than the method proposed by Vasco et al. 31 Li et al. 33 studied a fast and asynchronous event-based corner detection method, called FA-Harris, with a corner candidate selection and refinement strategy. Alzugaray and Chli 36 proposed a faster asynchronous event-corner detection method inspired by the method of Mueggler et al. 32 and a simple asynchronous event-corner tracker. The tracker utilizes a directed graph to record the tracks of event-corners. Alzugaray and Chli 37 then improved the asynchronous event-corner tracking algorithm by introducing a normalization descriptor for the extracted event-corners. The FA-Harris detector achieves better accuracy with moderate computational cost compared with the other aforementioned corner detection methods.
All the above event-corner detection methods operate directly on asynchronous event-streams using the surface of active events (SAE) 38 (also called the time surface 39). The time surface maps the position of the latest event to its timestamp; in other words, it keeps the absolute timestamps of the latest events triggered on the imaging plane. Manderscheid et al. 24 proposed the SITS, which is invariant to the motion speed of the camera or scene objects. SITS keeps relative timestamps instead of absolute ones. They utilized SITS to detect event-corners from event-streams by training a random forest.

Event-based feature tracking
Some event-based feature tracking methods work on event frames (synthesized from a fixed number of events or from events in a fixed temporal window) or on the absolute intensity information of images. In the study of Tedaldi et al., 40 Harris corners and Canny edge features were first extracted on intensity images and then tracked on asynchronous event-streams. Kueng et al. 41 presented an event-based visual odometry method to track the six-degree-of-freedom motion of the camera, and the proposed method is also based on corners and edges. Zhu et al. 42 accumulated events in a temporal window to synthesize event frames; based on these integrated event frames, they applied the original Harris corner detector and then tracked the detected corners with an expectation-maximization scheme. Afterward, Zhu et al. 43 further introduced inertial measurements into the system and proposed an event-based visual-inertial odometry method. Gehrig et al. 44 detected Harris corners on intensity frames and tracked them on event-streams. Li et al. 45 proposed a feature tracking method using events, intensity frames, and IMU data: they first extracted Harris and Canny features on intensity frames and then tracked the feature templates using an expectation-maximization iterative closest point strategy. Besides, Alzugaray and Chli 46 addressed a method to track generic patch features event-by-event without requiring event-corner detection or descriptors.
To fully utilize the natural asynchronism of event cameras, we propose a novel asynchronous event feature generation and tracking algorithm inspired by frame-based feature tracking techniques. The algorithm can work directly on event-streams without the requirement for intensity frames, artificially synthesized event frames, or other prior knowledge of scenes or camera motion. The proposed algorithm is based on a novel asynchronous event feature gradient descriptor inspired by the frame-based SIFT feature descriptor. The gradient descriptor represents the distribution of the local gradient information for event-corners, and it is used for feature matching during the asynchronous tracking process.

Overview
Inspired by standard computer vision tasks, we propose an asynchronous event feature generation and tracking algorithm in this article. As shown in Figure 2, the algorithm includes an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit.
The event-corner detection unit is based on a fast and asynchronous event-corner detection method, 33 called FA-Harris. It detects event-corners directly on event-streams without using intensity images and mainly consists of five steps: event filtering, global SAE maintenance, local SAE extraction, corner candidate selection, and corner candidate refinement. In the proposed event feature generation and tracking algorithm, the event-corner detection unit utilizes the FA-Harris detector to extract event-corners from event-streams; the event filter included in the FA-Harris detector is not used here because we found it does not contribute to the performance of the tracking method.
After detecting event-corners, we design a novel asynchronous event feature gradient descriptor for each event-corner based on the SITS. 24 The gradient descriptor is constructed by SITS maintenance and extraction, principal orientation calculation, and descriptor generation. The descriptor represents the distribution of the local gradient information for event-corners and enables quantitative measurements of similarity between event feature pairs in the subsequent event feature tracking unit. By introducing the gradient descriptor, we can define the event feature as a tuple [x, y, t, d], where (x, y, t) is the spatiotemporal coordinate of the event feature and d is its gradient descriptor. The details of the proposed gradient descriptor are introduced in the following section.
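As a sketch, the event feature tuple [x, y, t, d] could be represented as follows; the descriptor length N = c² × n₂ = 2² × 8 = 32 follows from parameter choices made later in the descriptor generation stage, and the field names are our own.

```cpp
// A sketch of the event feature tuple [x, y, t, d]. The descriptor length
// N = c^2 * n2 = 2^2 * 8 = 32 follows from parameters chosen later in the
// descriptor generation stage.
#include <cstdint>
#include <vector>

struct EventFeature {
    uint16_t x;              // spatial coordinate on the imaging plane
    uint16_t y;
    double   t;              // timestamp of the underlying event-corner
    std::vector<double> d;   // gradient descriptor (N = 32 entries)
};
```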
Finally, the generated event features are tracked by the event feature tracking unit, which combines the constructed descriptor with an event feature matching method to achieve asynchronous tracking. The proposed gradient descriptor provides the similarity measurements between event feature pairs. The event feature matching method is implemented based on a directed graph composed of multiple structured track trees.

Gradient descriptor
This section introduces our proposed gradient descriptor. The construction of the gradient descriptor can be divided into three stages: SITS maintenance and extraction, principal orientation calculation, and descriptor generation, mainly inspired by the last two stages of the frame-based SIFT descriptor. We first select the event-corners (keypoints) from the incoming events, including their locations, timestamps, and polarities, using the FA-Harris detector. Unlike the keypoints in the SIFT method, our event-corners do not contain scale information. For the frame-based SIFT descriptor, the first stage is scale-space extrema detection based on the difference-of-Gaussian pyramid, which precedes the keypoint localization step. For simplicity, we do not utilize a scale space, which would be a direction for future research. As mentioned above, keypoint localization in our method is achieved using the FA-Harris detector for event cameras rather than by localizing keypoints in a scale space. We apply the SITS method 24 to provide the temporal and gradient information of events. The global SITS structure is maintained based on the incoming events. It has the same size, width × height, as the imaging plane, where each position is associated with the corresponding pixel position in the imaging plane and is used to provide the gradient information for each event-corner. For each incoming event-corner, the gradient descriptor is generated based on the local patch region extracted from the global SITS structure around its location. The construction process of the gradient descriptor is asynchronous; in other words, the algorithm generates a gradient descriptor once a new event-corner arrives.

Figure 2. The overview of the proposed asynchronous event feature generation and tracking algorithm. The algorithm consists of an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit. The input of the algorithm is the event-streams, and the output is the tracks of the tracked event features. The event-corner detection unit is based on a fast and asynchronous event-corner detection method. It extracts the event-corners through global SAE maintenance, local SAE extraction, corner candidate selection, and corner candidate refinement. The gradient descriptor is constructed by SITS maintenance and extraction, principal orientation calculation, and descriptor generation. The tracking unit combines the constructed descriptor with an event feature matching method to achieve asynchronous feature tracking. SAE: surface of active events; SITS: speed-invariant time surface.

Speed-invariant time surface maintenance and extraction
Since there is no concept of intensity images for event cameras and a single event alone provides no gradient information for descriptor construction, we choose the SITS 24 (updated asynchronously with every incoming event) to store the temporal information of events and provide the gradient information, rather than using the intensity image as in the frame-based SIFT descriptor. 25 According to Manderscheid et al., 24 SITS is invariant to the motion speed of the camera or objects in the environment, which contributes to the speed-invariant property of event features. To distinguish event-corners in event-streams, SITS keeps relative values for timestamps rather than absolute ones. The method maintains one SITS structure for each event polarity, storing a single value for each pixel location. Specifically, all values in the SITS are initialized to 0. When a new event arrives, the values within the l × l window centered at the event pixel position (x, y) that are larger than the value at (x, y) are reduced by 1, and the value at (x, y) is then set to l². Following Manderscheid et al., 24 l is set to 11.
In our proposed algorithm, we maintain one global SITS structure for each polarity, as in the method of Manderscheid et al. 24 For each incoming event, the global SITS structure corresponding to the polarity of the event is updated. When a new event-corner arrives, we extract the local patch P of size (2⌊R⌋ + 2) × (2⌊R⌋ + 2) around the pixel position of the event-corner on the global SITS structure corresponding to the polarity of the event-corner. As shown in Figure 3, R is determined by the radius r of the sampling region for the gradient descriptor. The extracted local patch P is used for descriptor generation.
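A minimal sketch of the SITS update rule and patch extraction described above, under the stated parameters (l = 11). One such surface would be kept per event polarity; the container choice, bounds handling, and exact patch alignment are our assumptions, not the authors' implementation.

```cpp
// Speed-invariant time surface: values are relative, not absolute timestamps.
#include <algorithm>
#include <vector>

class SpeedInvariantTimeSurface {
public:
    SpeedInvariantTimeSurface(int width, int height)
        : width_(width), height_(height), values_(width * height, 0) {}

    // Update the surface for a new event at pixel (ex, ey).
    void update(int ex, int ey) {
        const int l = 11;                 // window side length from the article
        const int h = l / 2;              // half window
        const int center = values_[ey * width_ + ex];
        for (int y = std::max(0, ey - h); y <= std::min(height_ - 1, ey + h); ++y)
            for (int x = std::max(0, ex - h); x <= std::min(width_ - 1, ex + h); ++x)
                if (values_[y * width_ + x] > center)
                    --values_[y * width_ + x];   // decrement values above center
        values_[ey * width_ + ex] = l * l;       // set the event's pixel to l^2
    }

    // Extract the (2R + 2)-sided local patch P around an event-corner, with R
    // an integer here; out-of-bounds pixels are filled with 0.
    std::vector<int> extractPatch(int cx, int cy, int R) const {
        const int side = 2 * R + 2;
        std::vector<int> patch(side * side, 0);
        for (int dy = 0; dy < side; ++dy)
            for (int dx = 0; dx < side; ++dx) {
                int x = cx - R + dx, y = cy - R + dy;
                if (x >= 0 && x < width_ && y >= 0 && y < height_)
                    patch[dy * side + dx] = values_[y * width_ + x];
            }
        return patch;
    }

private:
    int width_, height_;
    std::vector<int> values_;
};
```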

Principal orientation calculation
On the local patch P extracted from the global SITS structure, we calculate the gradient magnitude m(x, y) and orientation θ(x, y) for every position as

$$m(x, y) = \sqrt{d_x^2 + d_y^2}, \qquad \theta(x, y) = \arctan\!\left(\frac{d_y}{d_x}\right)$$

where $d_x = P(x+1, y) - P(x-1, y)$ is the differential in the x direction of the local patch (the same as the x direction of the imaging plane), $d_y = P(x, y+1) - P(x, y-1)$ is the differential in the y direction of the local patch (the same as the y direction of the imaging plane), and $x, y = 1, 2, \ldots, 2\lfloor R \rfloor$. A gradient histogram with $n_1 = 36$ bins (following Lowe 25) is then generated for the local patch. The histogram is essentially a vector with 36 bins corresponding to the angles 0°, 10°, ..., 350°. The magnitude values are weighted using a Gaussian weighting function w with spread σ = 1 pixel before being added to the gradient histogram, so that gradient magnitudes near the event-corner receive greater weight. For all magnitudes $m_{xy}$ and orientations $\theta_{xy} \in [0°, 360°)$, where $x, y = 1, 2, \ldots, 2\lfloor R \rfloor$, the generation of the gradient histogram $H = (H_0, H_1, \ldots, H_k, \ldots, H_{n_1-1})$, where $k = 0, 1, \ldots, n_1 - 1$, can be formalized as

$$H_k = \sum_{(x, y):\, \lfloor \theta_{xy}/10° \rfloor = k} w_{xy}\, m_{xy}, \qquad w_{xy} = e^{-\frac{x^2 + y^2}{2\sigma^2}}$$

where x and y in $w_{xy}$ denote offsets from the patch center. The orientation corresponding to the peak value in the gradient histogram represents the gradient orientation of the local patch, and it is regarded as the principal orientation of the local patch. Since the orientation obtained from the gradient histogram is essentially an interval of 10°, we apply parabolic interpolation to obtain a specific orientation; more precisely, the selected orientation and the orientations adjacent to it are used for the parabolic interpolation.
To enhance the robustness of feature matching, we choose the orientation corresponding to the maximum value in the histogram, together with any orientations whose values are greater than 80% (as in reference 25) of the maximum. In this way, one or more orientations are assigned to each event-corner on the local patch. Orientations are circular data rather than linear data, and the circular mean 47 provides a more intuitive estimate of the "center" of the distribution for such data than the arithmetic mean. So, to obtain the final principal orientation from the orientations assigned to each event-corner, we compute the circular mean of these orientations rather than the arithmetic mean. Specifically, for the orientations $\theta_1, \theta_2, \ldots, \theta_n$, the circular mean is calculated as

$$\bar{\theta} = \operatorname{atan2}\!\left(\frac{1}{n}\sum_{i=1}^{n} \sin\theta_i,\; \frac{1}{n}\sum_{i=1}^{n} \cos\theta_i\right)$$

The circular mean is regarded as the principal orientation of the local patch.
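The following sketch puts the two steps above together for a patch P stored row-major with side length `side`; the parameters n₁ = 36, σ = 1, and the 80% threshold are from the article, while the parabolic interpolation of the peak is omitted for brevity (bin centers are used instead).

```cpp
// Principal-orientation computation on an extracted SITS patch.
#include <algorithm>
#include <cmath>
#include <vector>

double principalOrientation(const std::vector<int>& P, int side) {
    const int n1 = 36;                        // 36 bins of 10 degrees each
    const double PI = 3.14159265358979323846;
    const double sigma = 1.0;                 // Gaussian spread (pixels)
    const double c = 0.5 * (side - 1);        // patch center
    std::vector<double> H(n1, 0.0);
    for (int y = 1; y < side - 1; ++y)
        for (int x = 1; x < side - 1; ++x) {
            double dx = P[y * side + (x + 1)] - P[y * side + (x - 1)];
            double dy = P[(y + 1) * side + x] - P[(y - 1) * side + x];
            double m = std::sqrt(dx * dx + dy * dy);
            double theta = std::atan2(dy, dx) * 180.0 / PI;
            if (theta < 0.0) theta += 360.0;              // map to [0, 360)
            double u = x - c, v = y - c;                  // offset from center
            double w = std::exp(-(u * u + v * v) / (2.0 * sigma * sigma));
            H[static_cast<int>(theta / 10.0) % n1] += w * m;
        }
    // Keep the peak bin and every bin above 80% of the peak, then take the
    // circular mean of the selected bin-center orientations.
    double peak = *std::max_element(H.begin(), H.end());
    double sumSin = 0.0, sumCos = 0.0;
    for (int k = 0; k < n1; ++k)
        if (H[k] >= 0.8 * peak) {
            double a = (10.0 * k + 5.0) * PI / 180.0;     // bin center angle
            sumSin += std::sin(a);
            sumCos += std::cos(a);
        }
    double mean = std::atan2(sumSin, sumCos) * 180.0 / PI;
    return mean < 0.0 ? mean + 360.0 : mean;              // degrees in [0, 360)
}
```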

Descriptor generation
In this stage, we reorient the local patch to its principal orientation to generate the gradient descriptor vector; that is, we rotate the x direction of the local patch (the same as the x direction of the imaging plane) to coincide with its principal orientation. The neighborhood of size 2r × 2r around the event-corner position (the center of the local patch) is taken as the sampling region for descriptor generation. As shown in Figure 3, the sampling region is divided into c × c cells. We set c to 2 here. In the literature, 25 the author suggested setting c to 4; however, we found that the number of features tracked by the algorithm in that case is not sufficient. We therefore set c to 2 to guarantee that a sufficient number of features can be tracked without much impact on the other performance aspects of the algorithm. We calculate the orientation and magnitude of the gradient for every pixel position on the local patch and then determine the reoriented pixel position as

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\bar{\theta} & \sin\bar{\theta} \\ -\sin\bar{\theta} & \cos\bar{\theta} \end{pmatrix} \begin{pmatrix} x - x_c \\ y - y_c \end{pmatrix}$$

where $(x_c, y_c)$ is the center of the local patch, $\bar{\theta}$ is the principal orientation, and $x, y = 0, 1, \ldots, 2\lfloor R \rfloor + 1$. After the reorientation, only pixels falling into the sampling region contribute to descriptor generation. Since the coordinate values of the reoriented pixels are not integers, we compute their contribution to each adjacent cell using trilinear interpolation, as in SIFT. 25 Then, a gradient histogram with $n_2 = 8$ bins (as in the literature 25) is computed for each cell based on the reoriented local patch. We thus generate an N-dimensional descriptor vector, where $N = c^2 \times n_2 = 32$. The descriptor vector is normalized using the L2 norm, and finally we obtain the gradient descriptor for an event-corner.
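A simplified sketch of this stage follows, with c = 2 cells per side and n₂ = 8 orientation bins (N = 32). For brevity we assign each rotated sample to its nearest cell rather than using the trilinear interpolation of SIFT, so this is an approximation of the scheme described above, not the authors' exact implementation.

```cpp
// Simplified descriptor generation on an extracted SITS patch.
#include <cmath>
#include <vector>

std::vector<double> buildDescriptor(const std::vector<int>& P, int side,
                                    double principalDeg, int r) {
    const int c = 2, n2 = 8;                  // 2 x 2 cells, 8 bins per cell
    const double PI = 3.14159265358979323846;
    const double ct = std::cos(-principalDeg * PI / 180.0);
    const double st = std::sin(-principalDeg * PI / 180.0);
    const double cc = 0.5 * (side - 1);       // patch center
    std::vector<double> d(c * c * n2, 0.0);
    for (int y = 1; y < side - 1; ++y)
        for (int x = 1; x < side - 1; ++x) {
            double gx = P[y * side + (x + 1)] - P[y * side + (x - 1)];
            double gy = P[(y + 1) * side + x] - P[(y - 1) * side + x];
            double m = std::sqrt(gx * gx + gy * gy);
            double theta = std::atan2(gy, gx) * 180.0 / PI - principalDeg;
            while (theta < 0.0) theta += 360.0;          // relative orientation
            // Rotate the sample position into the reoriented frame.
            double u = ct * (x - cc) - st * (y - cc);
            double v = st * (x - cc) + ct * (y - cc);
            if (u < -r || u >= r || v < -r || v >= r) continue;  // outside 2r x 2r
            int ci = static_cast<int>((u + r) / r);      // cell column: 0 or 1
            int cj = static_cast<int>((v + r) / r);      // cell row: 0 or 1
            int bin = static_cast<int>(theta / 45.0) % n2;
            d[(cj * c + ci) * n2 + bin] += m;            // nearest-cell vote
        }
    double norm = 0.0;
    for (double e : d) norm += e * e;
    norm = std::sqrt(norm);
    if (norm > 0.0) for (double& e : d) e /= norm;       // L2 normalization
    return d;
}
```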
After obtaining the gradient descriptors for event-corners, we need to compute the descriptor distance between event feature pairs to measure their similarity. For two gradient descriptor vectors $d_1$ and $d_2$, their distance D quantifies how dissimilar the two event features are.
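The original equation for D is not recoverable here; assuming it is the Euclidean (L2) distance between the two descriptor vectors, a common choice for SIFT-style descriptors, a sketch would be:

```cpp
// Descriptor distance, assumed here to be the Euclidean (L2) distance
// between two descriptor vectors of equal length.
#include <cmath>
#include <vector>

double descriptorDistance(const std::vector<double>& d1,
                          const std::vector<double>& d2) {
    double sum = 0.0;
    for (std::size_t i = 0; i < d1.size(); ++i) {
        double diff = d1[i] - d2[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}
```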

Event feature tracking
In the event feature tracking unit, we combine the constructed gradient descriptor with an event feature matching method to achieve asynchronous event feature tracking. For each incoming event-corner, we assign a gradient descriptor based on the above-mentioned construction method. Our proposed gradient descriptor represents the distribution of the local gradient information within the event-corner neighborhood on the time surface. We define the event feature as a tuple [x, y, t, d], where (x, y, t) is the spatiotemporal coordinate of the event feature and d is its gradient descriptor. The gradient descriptor provides a quantitative measurement of similarity between event feature pairs for feature matching. The event feature matching method 37 used in our algorithm is based on a directed graph. The implementation details of the event feature matching method are summarized in Figure 4. For each new incoming event feature, the algorithm generates a new vertex v_new. The global vertex memory of size width × height saves the latest event features corresponding to pixels within the temporal window Δt; the latest event features within Δt for each pixel position are saved in a vertex queue. The directed graph is composed of multiple structured track trees, and each tree represents a set of multiple possible tracks for the same event feature. An address is assigned to every event feature; the address is a tuple encoding the path to the event feature in the directed graph. A vertex is therefore associated with an individual event feature and encodes the information about the event feature and its address. Each tree node encodes a vertex, the depth of the node, its state (active or inactive), and its children nodes. An edge between two nodes represents the association between an event feature pair. Each graph node keeps a hypothetical track tree and encodes information about the tree, including the tree depth, the reference tree node (the reference vertex), and a pointer to the tree.
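The structures implied by this description could be sketched as follows, reusing the EventFeature struct from earlier; the names, the Address layout, and the ownership scheme are our illustrative assumptions.

```cpp
// Data structures of the track graph: one vertex per event feature, tree
// nodes forming hypothetical tracks, and one graph node per track tree.
#include <memory>
#include <vector>

struct Address {                    // path to a feature in the directed graph
    int treeId = -1;                // which track tree
    int nodeId = -1;                // which node within that tree
};

struct Vertex {                     // one vertex per event feature
    EventFeature feature;           // [x, y, t, d] as defined earlier
    Address address;
};

struct TreeNode {
    Vertex vertex;
    int depth = 0;                  // depth of this node in the track tree
    bool active = true;             // active or inactive state
    std::vector<std::unique_ptr<TreeNode>> children;
};

struct GraphNode {                  // one hypothetical track tree
    int depth = 0;                  // depth of the tree
    TreeNode* reference = nullptr;  // the reference vertex v_ref
    std::unique_ptr<TreeNode> root; // pointer to the tree
};
```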
Every newly generated vertex can either be assigned to an existing tree or become the root of a new tree. Once the depth of a tree increases, the proposed algorithm performs a reference-updating operation.

Tree assignment
Considering the descriptor distance introduced above, the vertex closest to the new vertex in the spatiotemporal window w × w × Δt is selected as the matching vertex v_match, provided the descriptor distance between v_new and v_match is smaller than the threshold d_max; otherwise, v_new becomes the root of a new tree. The tree that v_match belongs to is regarded as the matching tree. Among the vertices of the matching tree, the newest vertex within the spatial window w × w is regarded as the parent vertex of v_new, and v_new is associated with the parent vertex by adding a new edge from the parent vertex to v_new.
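Built on the structures and distance function sketched above, the assignment rule could look like this; gathering the candidate vertices from the w × w × Δt window (e.g. via the global vertex memory) and the address bookkeeping are assumed to be handled by the caller.

```cpp
// Tree assignment for a new vertex v_new given its window candidates.
#include <memory>
#include <vector>

void assignToTree(std::vector<GraphNode>& trees, const Vertex& vNew,
                  const std::vector<TreeNode*>& candidates, double dMax) {
    // 1. The candidate closest in descriptor distance, below d_max, matches.
    TreeNode* match = nullptr;
    double best = dMax;
    for (TreeNode* n : candidates) {
        double dist = descriptorDistance(vNew.feature.d, n->vertex.feature.d);
        if (dist < best) { best = dist; match = n; }
    }
    if (match == nullptr) {           // no match: v_new roots a new tree
        GraphNode tree;
        tree.root = std::make_unique<TreeNode>();
        tree.root->vertex = vNew;
        tree.reference = tree.root.get();
        trees.push_back(std::move(tree));
        return;
    }
    // 2. The newest candidate vertex of the matching tree becomes the parent.
    TreeNode* parent = match;
    for (TreeNode* n : candidates)
        if (n->vertex.address.treeId == match->vertex.address.treeId &&
            n->vertex.feature.t > parent->vertex.feature.t)
            parent = n;
    // 3. Add the edge parent -> v_new.
    auto child = std::make_unique<TreeNode>();
    child->vertex = vNew;
    child->depth = parent->depth + 1;
    parent->children.push_back(std::move(child));
}
```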

Reference updating
A reference vertex v_ref for each tree maintains at most r_max vertices from itself to the deepest vertex in the same tree. A vertex in the tree whose depth is smaller than that of v_ref is regarded as inactive, and a vertex whose depth is larger is regarded as active. A child vertex whose descriptor distance to v_ref is larger than the threshold d_min is regarded as a weak vertex; otherwise, the child vertex is regarded as a strong vertex. The newest strong child vertex becomes the new v_ref as well as the new parent of the other strong children vertices. If there is no strong child vertex, the weak child vertex with the smallest distance becomes the new v_ref, and the other weak vertices, together with their children trees, are disconnected from the tree to generate new trees; these disconnected weak vertices become the roots of new trees.
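A condensed sketch of this rule follows; detached subtrees are returned through a separate output container for the caller to merge, and re-parenting the other strong children under the new v_ref is omitted for brevity, so this is an approximation of the behavior described above.

```cpp
// Reference updating: classify children of v_ref as strong or weak.
#include <algorithm>
#include <limits>
#include <memory>
#include <vector>

void updateReference(GraphNode& tree, double dMin,
                     std::vector<GraphNode>& newTrees) {
    TreeNode* ref = tree.reference;
    if (ref == nullptr || ref->children.empty()) return;

    // Pass 1: the newest strong child (distance <= d_min) becomes v_ref.
    TreeNode* newRef = nullptr;
    for (auto& c : ref->children)
        if (descriptorDistance(c->vertex.feature.d, ref->vertex.feature.d) <= dMin)
            if (newRef == nullptr || c->vertex.feature.t > newRef->vertex.feature.t)
                newRef = c.get();

    // Pass 2: no strong child -- take the weak child with the smallest
    // distance; the other weak subtrees are detached as new trees.
    if (newRef == nullptr) {
        double best = std::numeric_limits<double>::max();
        for (auto& c : ref->children) {
            double dist = descriptorDistance(c->vertex.feature.d,
                                             ref->vertex.feature.d);
            if (dist < best) { best = dist; newRef = c.get(); }
        }
        for (auto& c : ref->children)
            if (c.get() != newRef) {
                GraphNode t;
                t.root = std::move(c);         // disconnected weak subtree...
                t.reference = t.root.get();    // ...roots a new tree
                newTrees.push_back(std::move(t));
            }
        auto isNull = [](const std::unique_ptr<TreeNode>& p) { return !p; };
        ref->children.erase(std::remove_if(ref->children.begin(),
                                           ref->children.end(), isNull),
                            ref->children.end());
    }
    tree.reference = newRef;
}
```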

Track refinement
To obtain smoother tracks, the event feature tracks are smoothed using a simple interpolation operation: for each vertex, its pixel coordinate is interpolated using its s predecessors and s successors in the same track. Only event feature tracks containing at least m refined vertices are kept, which filters out short and noisy tracks.
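A sketch of this refinement step follows, with the interpolation interpreted as a simple moving average over the s predecessors and s successors; the exact interpolation scheme of the article may differ.

```cpp
// Track refinement: smooth vertex positions and drop short tracks.
#include <vector>

struct TrackPoint { double x, y, t; };

std::vector<TrackPoint> refineTrack(const std::vector<TrackPoint>& track,
                                    int s, int m) {
    std::vector<TrackPoint> refined;
    for (int i = s; i + s < static_cast<int>(track.size()); ++i) {
        TrackPoint p{0.0, 0.0, track[i].t};
        for (int j = i - s; j <= i + s; ++j) {  // s predecessors, self, s successors
            p.x += track[j].x;
            p.y += track[j].y;
        }
        p.x /= (2 * s + 1);
        p.y /= (2 * s + 1);
        refined.push_back(p);
    }
    if (static_cast<int>(refined.size()) < m)
        refined.clear();                        // filter out short, noisy tracks
    return refined;
}
```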

Experiments
This section presents the experimental results of our proposed algorithm, including accuracy and computational performance evaluations for feature tracking. We compare our proposed method with the tracking method 42 (referred to as the EOF tracker), the ACE tracker, 37 and the tracking method of reference 46 (referred to as the AMH tracker). The public event camera dataset, 23 recorded using a DAVIS240C with a spatial resolution of 240 × 180 pixels, is adopted for the comparison. The data consist of events, intensity images, IMU measurements, and ground truth from a motion-capture system. In the experiments, we choose multiple scenes with different textural complexity from the dataset, including shapes, dynamic, poster, and boxes, as well as poster and boxes scenes with a high dynamic range of illumination. Only the first 10 s of each sequence are used for evaluation.
To perform the comparison, we implement the normalization descriptor and the ACE tracker 37 in C++, using the same parameters and values as presented in that article. All methods are implemented in C++ and evaluated on a laptop equipped with an Intel i7-7700HQ CPU at 2.80 GHz and 16 GB of RAM. For the event feature tracking unit, we employ the spatiotemporal window w × w × Δt, where w = 4 and Δt = 0.5 s. We set the gradient descriptor thresholds to d_max = 100 for matching vertex selection and d_min = 50 for distinguishing strong and weak vertices. r_max is set to 8, and s is set to 14. m is set to 50, 30, 12, and 12 for shapes, dynamic, poster, and boxes, respectively. Note that the suitable values for the above parameters were chosen by trial and error.
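For reference, the evaluation parameters can be gathered in one place as below; the struct and field names are ours, while the values are those stated above.

```cpp
// Parameter values used in the evaluation (names are illustrative).
struct TrackerParams {
    int    w    = 4;      // spatial window side (w x w)
    double dt   = 0.5;    // temporal window Dt in seconds
    double dMax = 100.0;  // matching-vertex descriptor-distance threshold
    double dMin = 50.0;   // strong/weak vertex threshold
    int    rMax = 8;      // max vertices from v_ref to the deepest vertex
    int    s    = 14;     // smoothing half-window for track refinement
    int    m    = 50;     // min refined vertices (50/30/12/12 per scene)
};
```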

Tracking performance
We use the event-based feature tracking evaluation code 48 for the tracking performance analysis. The ground truth feature tracks are collected using a KLT-based feature tracking method on frames. The positions of the initial features on frames are interpolated from the event-based features close in time to the frames. The initial features are tracked using the KLT tracking method until they are lost, and the tracker updates the tracked features for each frame. The asynchronous event feature tracks produced by our proposed algorithm on the shapes and dynamic scenes are depicted in Figure 5; the figures show the different feature tracks obtained with our method, in different colors, over the last 0.5 s. Table 1 summarizes the average pixel error and average feature lifetime of the event feature tracks on several scenes with different textural complexity. In our experimental evaluation, if the pixel error for a tracked feature exceeds 5 pixels, the tracked feature is regarded as invalid, and only valid tracked features are considered. The best results are shown in bold in the table, and the results indicate that our proposed method achieves better accuracy than the EOF tracker and the ACE tracker. Besides, we report the average tracking error from reference 46 (the authors did not explicitly report the tracking lifetime numerically); compared with it, our tracking method also performs significantly better, except in one case, shapes_translation. When considering the average feature tracking lifetime, our method also works well on scenes with simple or moderate texture, such as shapes and dynamic. The results on the hdr_poster and hdr_boxes scenes demonstrate that our method performs well when faced with high dynamic range. However, on the poster scene, the EOF tracker performs well compared with both the ACE tracker and our method. Figure 6 shows the average tracking pixel error and the percentage of surviving tracked features over time for the ACE tracker and our proposed method on four different scenes. The results demonstrate that our proposed method achieves better tracking accuracy. Moreover, the band around the central line is narrower with our method, which indicates that our method is more robust. However, the ACE tracker performs better on scenes with complex texture when considering the feature tracking lifetime.

Computational performance
In this section, we compare the computational performance of our proposed event feature tracking method with the ACE tracker.
The ACE tracker uses the normalization descriptor as a quantitative measurement of similarity between event-corners. The normalization descriptor is built by simply sorting the events' timestamps in a local patch; the sorted timestamps are normalized into the range [0, 1]. The descriptor distance is measured by calculating the amount of overlap between two normalization descriptors. Table 2 presents the real-time performance of our proposed descriptor construction and event feature matching method. We report the total time, the time spent on descriptor construction and on event feature matching, and their ratios to the total time. As given in Table 2, the descriptor construction ratio of our method is larger than that of ACE. This is because our method needs gradient computation and Gaussian weighting, while the normalization descriptor used in ACE only needs a sorting operation and a normalization operation. However, the matching time of our method is much shorter than that of ACE, which contributes to better real-time performance in terms of the total time for feature generation and tracking.
We also report the real-time factor for real-time performance analysis, which relates the total time spent processing the events of each scene to the duration of the sequence (10 s for each scene). For this metric, a smaller value indicates better real-time performance, and values below 1 indicate faster-than-real-time operation. According to Table 2, our method achieves better real-time performance on all scenes compared with the ACE method. However, there are still some scenes, such as dynamic_rotation, poster_rotation, and hdr_poster, whose real-time factors are slightly greater than 1. This means that our method still has potential for improving its computational performance, especially by refining the matching method to reduce the processing time for use in real-time applications. Table 3 gives the event processing ability of the ACE tracker and our proposed event feature generation and tracking algorithm; the table includes the mean rate of events, the mean rate of event-corners, and the mean time for a single feature matching. According to Table 3, our proposed method achieves a faster event rate, corner rate, and per-feature matching speed than the ACE tracker, which contributes to the improvement of the real-time performance of our method.

Conclusion
In this article, we present a novel asynchronous event feature generation and tracking algorithm operating directly on event-streams for event cameras. The algorithm consists of an event-corner detection unit, a descriptor construction unit, and an event feature tracking unit. An asynchronous gradient descriptor is developed for the quantitative measurement of similarity between event feature pairs, and it is constructed through SITS maintenance and extraction, principal orientation calculation, and descriptor generation. The experimental evaluation demonstrates that our proposed algorithm performs better in terms of tracking accuracy and real-time performance compared with the state-of-the-art asynchronous event-corner tracker, with no compromise on feature tracking lifetime. In the future, there is still work to do, such as improving the tracking accuracy and lifetime performance, for example by adding a scale-invariant property to the descriptor, so that the feature tracking algorithm can fulfill the demands of a visual odometry pipeline and even a SLAM system. Learning-based methods for event feature generation may also be a promising direction for further research.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.