Feature points selection with flocks of features constraint for visual simultaneous localization and mapping

Simultaneous localization and mapping is a crucial problem for mobile robots: it estimates the surrounding environment (the map) and, at the same time, computes the robot location in it. Most researchers working on simultaneous localization and mapping focus on localization accuracy. In visual simultaneous localization and mapping, localization means calculating the robot’s position relative to the landmarks, which correspond to the feature points in images. Therefore, feature points are important to localization accuracy and should be selected carefully. This article proposes a feature point selection method to improve the localization accuracy. First, theoretical and numerical analyses are conducted to demonstrate the importance of the distribution of feature points. Then, an algorithm using flocks of features is proposed to select feature points. Experimental results show that the proposed flocks of features selector, implemented in visual simultaneous localization and mapping, enhances the accuracy of both localization and mapping, verifying the necessity of feature point selection.


Introduction
Simultaneous localization and mapping (SLAM) is one of the key technologies in robotics. SLAM addresses the problem of building a map of the environment surrounding a robot while simultaneously estimating the position of the robot, as shown in Figure 1. Here, a ''map'' may be just a set of feature points in the environment, called ''landmarks.'' Most researchers working on visual SLAM pay attention to indoor environments [1-4], while some works deal with airborne applications [5,6]. Recently, SLAM has also been applied in underwater scenarios [7,8]. In terms of the sensors used to perceive the surroundings, SLAM can be classified into sonar-based, laser-based, and vision-based approaches, with auxiliary sensors such as an inertial measurement unit (IMU), compass, infrared, and depth sensors [9-12]. Thanks to the development of image processing and stereo vision, visual SLAM [1-3] has developed rapidly and has been applied in a wide range of fields, such as augmented reality [1], computer games [4], and humanoid robots [13]. In a visual SLAM system, one or more cameras can be used as sensors [14]; this article focuses on the single-camera case, which is called mono-SLAM [1]. Figure 2 shows the output of visual SLAM, including the feature map and the location of the camera.
Researchers working on SLAM have tried to increase the localization accuracy through different approaches. Most of them focus on the probabilistic framework [15] of SLAM, for example, the extended Kalman filter (EKF) [3,16,17], particle filter [18], information filter [19], and expectation maximization [20]. Auxiliary sensors [9] are often applied to obtain more information and more accurate results. Since the localization is calculated relative to the set of feature points, feature initialization [3,21] is crucial to localization accuracy.
The aim of this article is to enhance the localization accuracy by optimizing the selection of feature points. The selection process can be divided into two steps. First, feature points are detected in the image with feature detection methods [22], such as GoodFeaturesToTrack [23], Features from Accelerated Segment Test (FAST) [24], Scale-Invariant Feature Transform (SIFT) [25], and Speeded-Up Robust Features (SURF) [26], which are used to choose stable features for localization. Desai and Lee [27] have developed a novel descriptor called the synthetic basis descriptor that provides accurate feature matching for real-time vision applications. Second, a useful subset of the detected feature points is selected. This article addresses the second step. Based on the EKF framework, a feature point selection method called the flocks of features (FoF) selector is applied, aiming to enhance the localization accuracy in visual SLAM. Liu et al. [28] have successfully applied FoF in hand tracking to select feature points in the hand area and have achieved good tracking results.
The rest of the article is organized as follows. The framework of visual SLAM is briefly introduced in section ''Framework of EKF-based visual SLAM.'' Section ''Analysis of the distribution of feature points'' analyzes the significance of the distribution of feature points in improving the localization accuracy. The proposed method using FoF to select feature points for visual SLAM is presented in section ''Feature points selection using FoF.'' Section ''Experiments and discussions'' shows experimental results, and the final section gives the conclusions.

Framework of EKF-based visual SLAM
The framework of visual SLAM is briefly introduced in this section. Details of the MATLAB (version 8.6.0.267246, R2015b) implementation can be found in the literature [16]. Figure 3 shows the flowchart of EKF-based visual SLAM. The core part is the EKF framework, consisting of the prediction and update steps marked by the red dotted box.

Overview of visual SLAM
The main portion of visual SLAM is the EKF loop. After the initial state $x_0$ and its covariance $P_0$ are provided, visual SLAM works frame by frame on the images perceived by the camera.
Key point detection is applied to find the natural landmarks. From the prediction step, the positions of the feature points in the current frame can be estimated, so the corresponding feature points can be searched for near the estimated positions; this is called active feature searching. If the number of features in a frame is less than a given threshold, new feature points must be found, and this is where the proposed FoF method is applied. The new positions of the visible feature points are used as the observations for the update step. As a result, localization and map construction are accomplished.

The EKF
In the EKF [29], the new measurement $z_k$ and the state vector of the previous frame $x_{k-1}$ are used as input to estimate the current state vector $x_k$. Since it is a probabilistic method, the covariance matrix $P_k$ of the state vector $x_k$ is calculated to represent the uncertainty of the estimated $x_k$. The map in visual SLAM can be viewed as a state vector $x$ and a covariance matrix $P$. The state vector consists of the camera state $x_v$ and the positions of the feature points $y_i$ $(i = 1, \ldots, n)$, and $P$ is a square matrix of the same dimension as $x$:

$$x = \begin{pmatrix} x_v^T & y_1^T & \cdots & y_n^T \end{pmatrix}^T \tag{1}$$

Generally, the camera's state vector $x_v$ is expressed as a 13-dimensional vector [1]

$$x_v = \begin{pmatrix} r^W & q^{WR} & v^W & \omega^R \end{pmatrix}^T \tag{2}$$

where $r^W$ is the 3-D position of the focal point of the camera, $q^{WR}$ is a quaternion that represents the camera's orientation, and $v^W$ and $\omega^R$ are the 3-D velocity and angular velocity vectors, respectively. In the prediction step, the predicted state vector and its covariance matrix are obtained from the non-linear stochastic difference equation

$$x_k = f(x_{k-1}, w) \tag{3}$$

where $w$ is the process noise. The predicted state vector $x_k^-$ and its covariance matrix $P_k^-$ are acquired by

$$x_k^- = f(x_{k-1}, 0) \tag{4}$$

$$P_k^- = A_k P_{k-1} A_k^T + W Q W^T \tag{5}$$

where $Q$ is the process noise covariance, and $A_k$ and $W$ are the Jacobian matrices of partial derivatives of $f$ with respect to $x_{k-1}$ and $w$, respectively. In the update step, the observation $z_k = h(x_k, v)$ is used to update the state vector and its covariance, where $v$ is Gaussian noise. In visual SLAM, the image coordinates $(u_i, v_i)$ of the feature points are viewed as observations through the standard pinhole model.
The current state vector and its covariance are updated as follows:

$$x_k = x_k^- + K_k \left( z_k - h(x_k^-, 0) \right) \tag{6}$$

$$P_k = (I - K_k H_k) P_k^- \tag{7}$$

where $K_k$ is the Kalman gain and $H_k$ is the Jacobian matrix of partial derivatives of $h$ with respect to $x$. Equations (4) to (7) are the basic equations of the EKF that provide the solution for the current state.
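The prediction and update steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the MATLAB implementation referred to in the text: the function names, the toy constant-velocity model, and the measurement noise covariance `R_meas` (which the text does not name) are all hypothetical, and the Kalman gain uses the standard textbook form.

```python
import numpy as np

def ekf_predict(x, P, f, A, W, Q):
    """Prediction step: propagate the state and its covariance."""
    x_pred = f(x)                        # x_k^- = f(x_{k-1}, 0)
    P_pred = A @ P @ A.T + W @ Q @ W.T   # P_k^- = A P A^T + W Q W^T
    return x_pred, P_pred

def ekf_update(x_pred, P_pred, z, h, H, R_meas):
    """Update step: correct the prediction with the observation z."""
    S = H @ P_pred @ H.T + R_meas        # innovation covariance (R_meas is an assumed name)
    K = P_pred @ H.T @ np.linalg.inv(S)  # standard Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# Toy model (assumption): state = [position, velocity], only position is observed.
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])
f = lambda x: A @ x
W = np.eye(2)
Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])
h = lambda x: H @ x
R_meas = np.array([[0.25]])

x = np.array([0.0, 1.0])
P = np.eye(2)
for z in [1.1, 1.9, 3.05]:               # made-up position measurements
    x, P = ekf_predict(x, P, f, A, W, Q)
    x, P = ekf_update(x, P, np.array([z]), h, H, R_meas)

print("state estimate:", x)
print("position variance:", P[0, 0])
```

In visual SLAM the same two functions would operate on the full state vector (camera state plus feature positions), with `h` given by the pinhole projection of each visible feature point.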

Analysis of the distribution of feature points
The number of feature points that the SLAM system handles in each frame should be kept within a certain range, since too few features decrease the localization accuracy and too many features are time-consuming to process. If there is no feature point in the first frame, or the number of feature points perceived by the camera decreases after a move, new feature points need to be detected in the image. In this case, the key problem is how to improve the performance of the system by selecting a suitable subset of feature points, that is, by controlling the distribution of feature points.
The existing visual SLAM systems [1,3,30] simply select feature points at random. In this section, both theoretical and numerical analyses are presented to show how the distribution of feature points influences the localization accuracy.

Theoretical analysis
Assume a set of feature points $p_i$ $(i = 1, \ldots, n)$ is visible in the current frame, where $p_i = (x_i, y_i, 1)^T$ is expressed in projective coordinates and $(x_i, y_i)$ is the image coordinate of $p_i$. To theoretically analyze the influence of the distribution of $\{p_i\}_{i=1}^{n}$ on localization accuracy, 3-D reconstruction theory [31] is applied.
Let $\{p_i'\}_{i=1}^{n}$, with $p_i' = (x_i', y_i', 1)^T$, be the feature points in another frame from another view corresponding to $\{p_i\}_{i=1}^{n}$, and let the rotation matrix $R$ and the translation vector $t$ be the movement parameters of the camera between the two frames. The problem then becomes how the distribution of $\{p_i\}_{i=1}^{n}$ (and $\{p_i'\}_{i=1}^{n}$) influences $R$ and $t$. According to the epipolar theory in multiple-view geometry [31],

$$p_i'^T E \, p_i = 0$$

where $E = [t]_\times R$ is called the essential matrix. A set of $n$ linear equations is obtained and can be rewritten as follows:

$$A e = (a_1, a_2, \ldots, a_n)^T e = 0$$

where $a_i = (x_i' x_i,\ y_i' x_i,\ x_i,\ x_i' y_i,\ y_i' y_i,\ y_i,\ x_i',\ y_i',\ 1)^T$ still stands for the $i$-th feature point. Matrix $A$ collects the feature points and reflects their distribution, and $e$ is a 9-D vector made up of the entries of $E$ in column order, standing for the movement of the camera. As a result, the problem is how the matrix $A$ influences the vector $e$ under the constraint $Ae = 0$. Generally, the rank of matrix $A$ is equal to 9 because of the noise in the feature point coordinates. Thus, an exact nonzero solution of $Ae = 0$ does not exist. In this case, a least-squares solution is regarded as the best solution for $e$. The eight-point algorithm [32] states that the solution for $e$ is the eigenvector corresponding to the smallest eigenvalue of the matrix $A^T A$. Different distributions of feature points generate different $A$, resulting in different errors in $e$. For example, if some feature points are too close together, the corresponding vectors $a_i$ in $A$ are nearly identical and effectively act as one vector, so they add little independent information and the error of $e$ is larger. An alternative geometric interpretation is shown in Figure 4. The measurements of the centralized distribution provide less information than those of the scattered distribution because their information overlaps. The feature points should therefore be distributed as uniformly as possible in the image. The influence of the distribution on localization results is analyzed numerically in the next subsection.
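The effect described above can be illustrated with a small NumPy experiment. This is only a sketch under stated assumptions: the synthetic scene, the camera motion, and the helper names (`build_A`, `solve_e`, `project_pair`, `skew`) are hypothetical, and the data are noise-free. It builds $A$ from point correspondences, takes the least-squares solution as the right singular vector for the smallest singular value (equivalent to the eigenvector of $A^T A$ with the smallest eigenvalue), and compares the second-smallest singular value of $A$ for scattered and clustered points. A smaller value means the solution is less well constrained once noise is present.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

def project_pair(X, R, t):
    """Project 3-D points into both views (normalized projective coordinates)."""
    X2 = X @ R.T + t
    return X / X[:, 2:3], X2 / X2[:, 2:3]

def build_A(p, p_prime):
    """One row a_i^T per correspondence, matching e = vec(E) in column order."""
    return np.array([np.kron(pi, qi) for pi, qi in zip(p, p_prime)])

def solve_e(A):
    """Least-squares solution of A e = 0 with ||e|| = 1, plus the singular values."""
    _, s, vt = np.linalg.svd(A)
    return vt[-1], s

# Hypothetical camera motion between the two frames.
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.5, 0.0, 0.1])
e_true = (skew(t) @ R).flatten(order="F")
e_true /= np.linalg.norm(e_true)

# The same random offsets are scaled to obtain a scattered and a clustered point set.
rng = np.random.default_rng(0)
offsets = rng.uniform(-1.0, 1.0, size=(20, 3))
center = np.array([0.0, 0.0, 5.0])

results = {}
for name, spread in [("scattered", 1.0), ("clustered", 0.05)]:
    p, pp = project_pair(center + spread * offsets, R, t)
    e, s = solve_e(build_A(p, pp))
    results[name] = (e, s)
    print(name, "second-smallest singular value of A:", s[-2])
```

With clustered points the rows of $A$ become nearly identical, so $A$ approaches rank 1 and the direction of $e$ is only weakly determined.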

Numerical analysis
To illustrate how the distribution of feature points affects the accuracy of SLAM, simulations based on Sola's MATLAB program [21] are implemented. The simulation takes place in a 2-D plane, where a camera moves in a circle, perceives 2-D landmarks, and localizes itself simultaneously. Five simulations with different distributions of feature points are designed; see Figure 5. In the first and second simulations, landmarks are distributed uniformly and randomly in a circle, respectively. In the third, fourth, and fifth simulations, landmarks are distributed in one 1/8 arc, two 1/16 arcs, and four 1/32 arcs, respectively.
While the simulated SLAM is running, the localization error $|x_g - x|$, that is, the distance between the ground truth $x_g$ and the estimated position $x$ of the camera, is recorded. Here, the ground truth is the real position of the camera. The errors of the five simulations are compared in Figure 6. The five error curves are separated into two plots so that each can be compared conveniently with the error curve of the third simulation. The indexes in the legends of the plots are the same as the indexes of the simulations.
In Figure 6(a), the first distribution has the smallest errors because the features are distributed uniformly, while the third distribution performs worse than the first two because of its irregularity. In Figure 6(b), the third distribution still performs worst because the features are so centralized in just one segment that they cannot provide sufficient information for localization. It can be concluded from the error curves that if the landmarks are uniformly distributed, the SLAM system works with smaller errors. In other words, the more centralized the landmarks are, the worse the accuracy will be. Overall, the distribution of feature points has a great impact on the accuracy of SLAM systems.
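The five landmark layouts can be generated with a few lines of Python. This is a rough sketch, not Sola's program: it assumes that landmarks lie on the circle itself, that the arcs are spaced evenly around it, and that 32 landmarks are used (the actual number is not stated). The `largest_gap` helper is a hypothetical, purely illustrative measure of how centralized a layout is.

```python
import math
import random

def arc_layout(n_landmarks, n_arcs, arc_fraction, radius=1.0):
    """Spread landmarks evenly inside n_arcs arcs, each covering arc_fraction of the circle."""
    per_arc = n_landmarks // n_arcs
    points = []
    for a in range(n_arcs):
        start = 2 * math.pi * a / n_arcs          # arcs are spaced evenly (assumption)
        for k in range(per_arc):
            theta = start + 2 * math.pi * arc_fraction * k / per_arc
            points.append((radius * math.cos(theta), radius * math.sin(theta)))
    return points

def random_layout(n_landmarks, radius=1.0, seed=0):
    """Landmarks at random angles on the circle."""
    rng = random.Random(seed)
    angles = [rng.uniform(0.0, 2 * math.pi) for _ in range(n_landmarks)]
    return [(radius * math.cos(a), radius * math.sin(a)) for a in angles]

def largest_gap(points):
    """Largest angular gap (radians) between neighboring landmarks."""
    angles = sorted(math.atan2(y, x) % (2 * math.pi) for x, y in points)
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(angles[0] + 2 * math.pi - angles[-1])  # wrap-around gap
    return max(gaps)

N = 32  # assumed number of landmarks
layouts = {
    1: arc_layout(N, 1, 1.0),      # uniform
    2: random_layout(N),           # random
    3: arc_layout(N, 1, 1 / 8),    # one 1/8 arc
    4: arc_layout(N, 2, 1 / 16),   # two 1/16 arcs
    5: arc_layout(N, 4, 1 / 32),   # four 1/32 arcs
}
for index, points in layouts.items():
    print(index, len(points), round(largest_gap(points), 3))
```

Note that layouts 3 to 5 cover the same total fraction of the circle (1/8); they differ only in how that coverage is split up.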

Feature points selection using FoF
The analysis in the previous section reveals that it is worthwhile to apply rules that restrict the distribution of feature points. The locations of the feature points are expected to satisfy the following conditions: (a) the locations should not be too concentrated, (b) their center of gravity should be close to the center of the image, and (c) feature points should not lie near the edge of the image, since great distortion exists in that region. Considering the aforementioned analysis, a widely used bionic algorithm called FoF [33] is suitable and can be adopted as the restriction. In this section, the principle of FoF is stated, and then the proposed method using FoF is introduced.

Flocks of features
A biological phenomenon named flock behavior [33] states that the members $m_i$ of the flock $F = \{m_i\}_{i=1}^{N_f}$ should be neither too condensed nor too scattered:

$$\begin{aligned} d_{\min} &< |m_i - m_j|, \quad \forall i, j \in \{1, 2, \ldots, N_f\},\ i \neq j \\ d_{\max} &> |m_i - \bar{m}|, \quad \forall i \in \{1, 2, \ldots, N_f\} \\ \bar{m} &= \mathrm{median}(F) \ \text{or} \ \mathrm{centroid}(F) \end{aligned} \tag{14}$$

where $N_f$ is the number of members and $\bar{m}$ is the center of the flock. $d_{\min}$ is the minimum tolerable distance between members of the flock, and $d_{\max}$ is the maximum tolerable distance between the center $\bar{m}$ and the members of the flock.
The well-known Boids algorithm [33] in computer graphics is widely used to simulate flock behavior. With the members of the flock called boids, the flock behavior is maintained by the following rules: (a) separation: boids try to keep a distance from other boids; (b) cohesion: boids try to fly toward the center of their neighbors; and (c) alignment: boids try to match velocity with nearby boids.
FoF has been successfully applied in visual tracking, especially in tracking articulated objects. Kölsch and Turk [34] and Liu et al. [28] proposed the FoF tracker for hand tracking using boids and obtained good results. Feature points in the hand area are treated as boids, and proper feature points can be obtained under the restriction of flock behavior.
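The flock condition of equation (14) can be checked directly. The following minimal sketch assumes 2-D points and uses the centroid as the flock center; the function names and the example values are hypothetical.

```python
import math

def centroid(flock):
    """Centroid of a list of 2-D points."""
    n = len(flock)
    return (sum(x for x, _ in flock) / n, sum(y for _, y in flock) / n)

def satisfies_flock_constraint(flock, d_min, d_max):
    """True if no two members are closer than d_min and none is farther than d_max from the center."""
    center = centroid(flock)
    for i, m_i in enumerate(flock):
        if math.dist(m_i, center) >= d_max:      # too scattered
            return False
        for m_j in flock[i + 1:]:
            if math.dist(m_i, m_j) <= d_min:     # too condensed
                return False
    return True

flock = [(0, 0), (10, 0), (0, 10), (10, 10)]
print(satisfies_flock_constraint(flock, d_min=5, d_max=8))                 # True
print(satisfies_flock_constraint(flock + [(10.5, 10)], d_min=5, d_max=8))  # False
```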

Proposed algorithm for feature points selection
Feature points are matched in the current frame and serve as the observations that update the state of the camera and of all the feature points. As the camera has moved, some feature points will be out of view and fail to be matched. If the number of matched feature points is less than a given threshold minNum, the FoF selector is employed to add new feature points.
The framework of the FoF selector is shown in Figure 7. According to the result of the prediction step, feature points are matched in the current frame. If the number of successfully matched feature points is less than minNum, the feature selection algorithm is activated to find more feature points. Generally, many feature points are detected, but only a few of them are selected as landmarks under the FoF restriction.
The core module of the basic FoF selector algorithm is summarized in algorithm 1. All the feature points are treated as boids. The successfully matched features are the existing boids, and the newly detected features are the candidate boids. In algorithm 1, $W$ is the weighted map indicating the weights of these features, and $S_p$ is the center of all the boids, calculated in line 4. For every candidate $p$ among the candidate boids, the positive driving force $f_p$ and the negative driving force $f_n$ are computed in lines 7 to 14 of algorithm 1. Here, $f_p$ is the positive driving force pointing from $p$ to the center $S_p$, driving this point toward the center. The negative driving force $f_n$ appears when two boids are too close, acting like the repulsive force between two magnets. For any other boid $q$, if the distance between the two boids $p$ and $q$ is less than a given threshold $d_{\min}$, the repulsive force $f(p, q)$ is added to $f_n$. Here, $f(p, q)$ is inversely proportional to the distance between $p$ and $q$. Finally, the drift of each candidate is obtained as

$$\mathrm{drift} = \alpha f_p + \beta f_n$$

where $\alpha$ and $\beta$ denote the corresponding weighting parameters. If the length of the drift is less than a given threshold $T$, the candidate is chosen as a new feature point, because boids that drift too much are regarded as ''bad'' boids in the flock.
In this way, new and more uniformly distributed feature points are obtained.
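The selection rule described in this subsection can be sketched as follows. This is a simplified illustration rather than the article's algorithm 1: it ignores the weighted map $W$ and uses an unweighted centroid as $S_p$, it adds each accepted candidate to the flock before the next candidate is tested, and it skips candidates that coincide with an existing boid. The function name, the parameter names `alpha` and `beta`, and all numeric values are assumptions.

```python
import math

def fof_select(existing, candidates, d_min, alpha, beta, T):
    """Return the candidates whose drift ||alpha*f_p + beta*f_n|| is below T."""
    flock = list(existing)
    all_boids = flock + list(candidates)
    # Center S_p of all boids (unweighted centroid; the weighted map W is omitted here).
    cx = sum(x for x, _ in all_boids) / len(all_boids)
    cy = sum(y for _, y in all_boids) / len(all_boids)
    selected = []
    for p in candidates:
        f_p = (cx - p[0], cy - p[1])             # positive driving force: S_p - p
        f_n = [0.0, 0.0]                         # negative (repulsive) driving force
        duplicate = False
        for q in flock:
            d = math.dist(p, q)
            if d == 0.0:
                duplicate = True                 # same location as an existing boid
                break
            if d < d_min:
                # Push p away from q with magnitude 1/d (unit direction scaled by 1/d).
                f_n[0] += (p[0] - q[0]) / d ** 2
                f_n[1] += (p[1] - q[1]) / d ** 2
        if duplicate:
            continue
        drift = (alpha * f_p[0] + beta * f_n[0], alpha * f_p[1] + beta * f_n[1])
        if math.hypot(*drift) < T:
            selected.append(p)
            flock.append(p)                      # accepted boids repel later candidates
    return selected

existing = [(100, 100), (220, 140)]                     # matched features (pixels)
candidates = [(160, 120), (104, 100), (10, 10), (200, 60)]
chosen = fof_select(existing, candidates, d_min=20, alpha=0.01, beta=10, T=1.0)
print(chosen)
```

In this toy run the candidate at (104, 100) is rejected because it sits only 4 pixels from an existing boid, and the candidate at (10, 10) is rejected because it lies far from the flock center near the image corner.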

Experiments and discussions
Algorithm 1. Core module of the FoF selector (recoverable steps). For every candidate $p$: compute the positive driving force $f_p \leftarrow S_p - p$; for each $q \in F_{pre}$, if the distance between $p$ and $q$ is less than $d_{\min}$, add the repulsive force $f(p, q)$ to $f_n$; end for; if $\| \alpha f_p + \beta f_n \|_2 < T$, then select $p$ as a new feature point.

To evaluate the localization accuracy of the proposed feature selection strategy, experiments are conducted in both simulated and real-world indoor environments. The proposed FoF selector is compared with the original random method in the process of feature point selection. Here, the original random method selects the subset of feature points randomly. A single handheld camera is used as the visual sensor, moving around and sensing the environment. The camera used in the experiment is a Logitech web camera (see the right part of Figure 1) with a resolution of 320 × 240 pixels.
The original system runs at about one frame per second on a DELL computer with a dual-core processor at 3.1 GHz. Two groups of experiments are conducted. First, a long sequence of 1500 frames is input to SLAM for map building and camera localization, which takes about 25 min. Then, 10 short sequences are used for a more detailed analysis.

Experiment on a long sequence
The camera moves around the lab arbitrarily and captures a long sequence of images. The images are fed frame by frame into the SLAM systems using the FoF selector and the original random method, respectively. The outputs of the systems are the trajectory of the camera and the positions of the landmarks. Figure 8 presents the output for six frames from the two SLAM systems; the same frames are shown for both systems to allow a direct comparison. Feature points are detected and labeled in the left half of each subfigure, and the positions of the feature points and the trajectory of the camera are drawn in a 3-D coordinate system in the right half. As presented in Figure 8, features are more uniformly distributed in the first and third rows than in the second and fourth rows, which suggests that the SLAM system using the FoF selector performs better than the one using the original random method. Meanwhile, in most frames there are fewer blue circles in the first and third rows than in the second and fourth rows, which indicates that feature points are matched more successfully using FoF. In the 3-D coordinate system, the size of the red ellipse covering each feature point indicates the uncertainty of its estimated position, which is related to the estimated error covariance. The red ellipses are smaller in the first and third rows, meaning that the 3-D positions of the feature points are more accurate and converge more quickly with the FoF selector. Figure 9 shows the landmarks and the whole camera trajectories output by the SLAM systems with the FoF selector and the random method, respectively. The red plus signs are the landmarks, and the red ellipse surrounding each plus sign represents its error covariance. The landmarks are distributed more uniformly with the FoF selector than with the random method. Moreover, more landmarks are detected with the FoF selector. As a result, errors are spread more uniformly, which restrains the drift to a certain extent.

Experiments on short sequences
To measure the localization accuracy, 10 short sequences of 100 frames each are recorded. In addition, the motion trajectories of the camera are known to the system. As shown in Figure 10, the camera performs uniform linear motion or uniform circular motion so that the ground truth of the camera trajectory is available. For each kind of motion, five sequences are recorded to test the localization accuracy of the SLAM systems. Figure 11 shows the results of the FoF and random methods against the ground truth. In each plot, the red dotted curve and the blue curve represent the camera trajectory using FoF and the random method, respectively, and the blue dotted curve represents the ground truth. Every trajectory starts from the left end point of the curve and ends at the right end point. In most sequences (e.g. L3, L5, R4, and R5), the trajectories estimated using the FoF selector are closer to the ground truth than those using the random method, which verifies the better localization accuracy of our approach. However, in some sequences (e.g. L2 and R1), the advantage is not distinct because the feature points selected by the random method also happen to be well distributed.
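The comparison against the ground truth can also be summarized numerically with the localization error defined in the numerical analysis, namely the distance between the ground-truth and estimated camera positions. The sketch below is only illustrative: the function names are hypothetical and the trajectory values are made-up toy numbers, not results from the experiments.

```python
import math

def localization_errors(estimated, ground_truth):
    """Per-frame distance between estimated and ground-truth camera positions."""
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]

def mean_error(estimated, ground_truth):
    """Mean localization error over a sequence."""
    errors = localization_errors(estimated, ground_truth)
    return sum(errors) / len(errors)

# Toy example (made-up values): three frames of uniform linear motion along x.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimate = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.4), (2.0, 0.0, 1.0)]

print(localization_errors(estimate, ground_truth))
print(mean_error(estimate, ground_truth))
```

Computing this value for the FoF and random trajectories of each sequence would give one number per method and sequence to accompany the plots.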

Conclusions
In this article, both theoretical and numerical analyses are made to emphasize the significance of the distribution of feature points in improving the localization accuracy of visual SLAM. Based on these analyses, an FoF selector with the flock restriction is introduced to select feature points when new landmarks need to be added to the map. Experimental results demonstrate that the map is more uniformly distributed and better localization results are obtained when the FoF selector is used. In addition, the more feature points are detected, the more effectively the FoF selector performs. In future work, novel efficient descriptors, such as Tree Basis Sparse-coding Inspired Similarity (BASIS) [22] and Synthetic Basis (SYBA) [27], will be employed, and more sensory data will be combined to improve the localization accuracy for real-world applications, for example, advanced robotics, wearable computing, and augmented reality.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Figure 11. Camera trajectories of both methods compared to the ground truth. The blue dotted curve is the ground truth of the camera trajectory. The red dotted curve and the blue curve represent the camera trajectory obtained by FoF and random method, respectively. All the curves start from the left end points and end on the right. FoF: flocks of features.