A visual simultaneous localization and mapping approach based on scene segmentation and incremental optimization

Existing visual simultaneous localization and mapping (V-SLAM) algorithms are usually sensitive to environments with sparse landmarks and to large view transformations of camera motion; as the matching rate of feature points decreases, they are liable to generate large pose errors that lead to tracking failures. To address these problems, this article proposes an improved V-SLAM method based on scene segmentation and an incremental optimization strategy. In the front end, this article proposes a scene segmentation algorithm that considers the camera motion direction and angle. By segmenting the trajectory and adding the camera motion direction to the tracking thread, an effective prediction model of camera motion is realized for scenes with sparse landmarks and large view transformations. In the back end, this article proposes an incremental optimization method that combines segmentation information with an optimization of the tracking prediction model. By incrementally adding state parameters and reusing previously computed results, high-precision estimates of the camera trajectory and feature points are obtained at a satisfactory computing speed. The performance of our algorithm is evaluated on two well-known datasets: TUM RGB-D and NYUDv2 RGB-D. The experimental results demonstrate that our method improves computational efficiency over state-of-the-art V-SLAM systems by 10.2% on the desktop platform and by 22.4% on the embedded platform. Meanwhile, the robustness of our method is better than that of ORB-SLAM2 on the TUM RGB-D dataset.


Introduction
In recent years, visual simultaneous localization and mapping (V-SLAM) technology has developed significantly. As a low-cost solution, V-SLAM places relatively low requirements on its hardware system. [1][2][3][4][5][6] It therefore has preliminary applications in many fields, such as the robot industry, virtual reality, and autonomous driving. 7,8 However, state-of-the-art V-SLAM systems are usually sensitive to sparse landmarks in the environment and to large view transformations of camera motion. As the matching rate of feature points decreases, the tracking thread often drifts heavily or even causes failures in loop detection and back-end optimization, [9][10][11] which may finally result in instability or even collapse of the whole V-SLAM system. 11 To address these problems, this article proposes an improved V-SLAM method based on scene segmentation and incremental optimization. Unlike mainstream V-SLAM systems that use all feature-point information in a keyframe, this article selects the landmarks near the current line-of-sight vector, matched against the last motion direction of the camera, so as to improve positioning accuracy. In the front-end module, by adding scene segmentation and the camera motion direction to the tracking thread and segmenting the camera trajectory, effective prediction of camera motion is realized in road segments with sparse landmarks and large view transformations. In the back-end optimization module, this article proposes an incremental optimization method that combines the camera motion direction and angle. By incrementally updating the state parameters and reusing previously computed results, high-precision results for the camera trajectory and feature points are obtained at a satisfactory computing speed.
The experimental results demonstrate that our method improves computational efficiency over state-of-the-art V-SLAM systems by 10.2% on the desktop platform and by 22.4% on the embedded platform. Meanwhile, the robustness of our method is better than that of ORB-SLAM2 3,4 on the TUM RGB-D dataset. 1

Literature review
Existing mainstream V-SLAM systems can be roughly divided into three categories: filter-based methods, keyframe bundle adjustment (BA)-based methods, and direct methods.
Regarding filter-based V-SLAM systems, Davison et al. 12 proposed one of the most representative SLAM systems, MonoSLAM, in which the front end tracks sparse map feature points and determines their poses in real time, and the back end uses an extended Kalman filter to update their mean and variance. To apply filter-based SLAM in large scenes, Gu and Liu 13 proposed a robust visual-inertial odometry in which the direct intensity error is used as the visual metric in the extended Kalman filter update; however, such algorithms are only suitable for environments with rich visual feature points. In general, filter-based V-SLAM systems perform poorly in large-scale and highly dynamic scenes, which are likely to cause cumulative errors. 14 The first V-SLAM system based on keyframe BA, parallel tracking and mapping (PTAM), was proposed by Klein and Murray. 15 PTAM supports keyframes and back-end nonlinear optimization, using a multithread strategy to track and map the scene; however, its main issue is that its efficiency can be strongly affected in a medium-range scenario with thousands of points. Currently, ORB-SLAM2, proposed by Mur-Artal et al., 3,4 is one of the best keyframe-BA-based V-SLAM systems; building on PTAM, it uses the ORB feature to track the surrounding environment and establishes loop detection to eliminate cumulative errors. Similar to filter-based V-SLAM systems, keyframe-BA-based V-SLAM depends on the richness of environmental features; therefore, tracking failures and mapping accuracy errors often occur in scenes with sparse landmarks and large view transformations.
Instead of using keypoints, large-scale direct SLAM (LSD-SLAM), the most widely known direct-method V-SLAM, operates directly on image intensities for both tracking and mapping and is able to recover a semidense map of the scene. 16 Engel et al. 17 proposed direct sparse odometry (DSO) in 2017, which improves the scale of dense map restoration and is more robust than LSD-SLAM. The key problem of such V-SLAM systems is that they are usually sensitive to outdoor illumination changes and dynamic interference, so the robustness of localization and mapping is not strong enough. 13 It is generally believed that large view transformations of the camera and sparse landmarks are two big technical challenges for V-SLAM. Morel and Yu 18 proposed the affine scale-invariant feature transform (ASIFT) algorithm in 2009. ASIFT uses affine sampling to simulate angle deformation, which gives the algorithm resistance to angle and scale transformations; however, it cannot provide real-time performance. Based on ASIFT, the affine SURF (ASURF), affine ORB (AORB), and AFREAK variants replace the SIFT algorithm with the SURF, ORB, and FREAK algorithms, respectively, which improves the operation speed. 19 Han 20 proposed a V-SLAM method that integrates 3D structure information, using the 3D point-cloud structure to render virtual keyframes and thereby expanding the scene coverage of the matching view angle; as a result, the recall rate of loop detection is improved in changing scenes.
In view of situations with sparse landmarks in the environment and large view transformations of camera motion, Pumarola et al. 21 proposed a SLAM algorithm combining semantic information and line features, which samples straight lines by matching 3D points. However, the accuracy of the algorithm is not ideal, and it has high computational complexity since some of those 3D points are obtained by interpolation. 22 Zhang et al. 23 proposed an RGB-D V-SLAM method based on line features, which uses the Mahalanobis distance to measure the distance from a 3D point to a 3D straight line and uses the random sample consensus (RANSAC) algorithm together with maximum likelihood estimation to estimate the parameters of 3D lines, thereby guaranteeing the accuracy of the coordinates extracted from 3D line segments. Nerurkar et al. 24 proposed C-KLAM based on keyframe constraints. In that algorithm, during back-end optimization, the sparse structure of the correlation information matrix is maintained through marginalization, and the information matrix of nonkeyframes is projected onto adjacent keyframes to realize the optimization of camera poses and feature points. This method has strong robustness, but it relies on an IMU or other inertial sensors to provide accurate camera pose output. Kaess et al. 25 proposed iSAM2 based on an incremental direct inference method with strong robustness to landmark points, which uses a Bayes tree to solve the nonlinear optimization equation in an incremental way. Zhang et al. 26 proposed a tracking method for discontinuous features through extracting discontinuous SIFT features, in which K-means and matching matrices are used to cluster the SIFT features so that feature points are segmented and fused. This method is suitable for large-scale scenes; however, its real-time performance is poor due to the large computing cost. Liu et al. 27 proposed an RGB-D SLAM system using an incremental bundle adjustment algorithm, which implements local and global BA in a unified optimization framework; this saves a lot of computing time in the back-end optimization of discontinuous segments and improves the processing speed. The disadvantage is that, being based on the direct method, the RGB-D SLAM system is sensitive to illumination changes and suffers from feature-point drift when there are few landmarks.
Overall, V-SLAM technology has developed rapidly in recent years. However, when it is used in road segments with sparse landmarks or large view transformations, the number of matched feature points may drop sharply, which often leads to large drift or tracking failure in the V-SLAM system. These problems limit the further application of V-SLAM.
To address the aforementioned problems, this study proposes an improved V-SLAM method based on scene segmentation and an incremental optimization strategy. The main contributions of this article are as follows:
1. In the front end, this article proposes a scene segmentation algorithm combined with the camera motion direction and angle. By segmenting the camera trajectory and adding the camera motion direction to the tracking thread, an effective prediction model of camera motion in scenes with sparse landmarks and large view transformations is realized.
2. In the back end, this article proposes an incremental optimization method combining segmentation information and a prediction-model optimization for the unstable tracking state. By incrementally adding the state parameters and reusing previously computed results, high-precision results for the camera trajectory and feature points are obtained at a satisfactory computing speed.

Framework
The framework of the algorithm in this article is shown in Figure 1. Similar to ORB-SLAM2, our algorithm includes three modules: a tracking thread, a local mapping thread, and a loop detection thread. As shown in Figure 1, in the front end, a scene segmentation decision method is adopted, and the information of the camera motion direction and angle is then used to establish a prediction model that can track sparse landmarks more effectively. Consequently, more robust keyframes are selected by the algorithm. In the back end, combined with the incremental optimization method, a prediction model for the unstable tracking state and an optimization method for V-SLAM are adopted. By incrementally adding the state parameters to be optimized and reusing the existing calculation results, high-precision results for the camera trajectory and feature points are obtained at a satisfactory computing speed.

Visual simultaneous localization and mapping tracking with scene segmentation and prediction model
The framework in this section is shown in Figure 2. By segmenting the camera trajectory and adding the camera motion direction and angle, this algorithm establishes a prediction model for road segments with sparse landmarks. The specific steps of the V-SLAM front-end algorithm are as follows.

Tracking with landmarks and feature points.
Firstly, the uniform motion model 24 of the visual camera is established as

    x_{i+1} = F(x_i, u_i) + w_i,    (1)

where F is the motion equation of the camera between two frames, and x_i and x_{i+1} denote the camera poses at time steps i and i+1, respectively. u_i is the measured control input, which in this article is related to the camera motion direction, and w_i is zero-mean white Gaussian noise with covariance Q_i. At time step i, the camera observation model is

    z_{ij} = H(x_i, f_j) + v_{ij},    (2)

where z_{ij} denotes the camera's observation of a landmark, f_j is the landmark's coordinate, H is a general nonlinear measurement function, and v_{ij} is zero-mean white Gaussian measurement noise with covariance Q'_{ij}. Suppose that in the road segment Seg there are M landmark positions f_j, j = 1, ..., M, observed from N camera poses x_i, i = 0, 1, ..., N. Then the estimation problem is to find the poses and landmarks that best explain the observations Z = {Z_0, Z_1, ..., Z_N}, where Z_i denotes the set of observations of the landmarks at camera pose x_i. Combined with the above model, under the Gaussian and independent-noise assumptions, the overall error model can be established as the nonlinear least-squares problem

    {x, f}* = argmin { Σ_i ||x_{i+1} − F(x_i, u_i)||²_{Q_i} + Σ_{i,j} ||z_{ij} − H(x_i, f_j)||²_{Q'_{ij}} },

where || · ||²_Q denotes the squared Mahalanobis distance with covariance Q.

Scene segmentation based on camera motion direction and angle.
In the tracking thread of existing V-SLAM systems, large view transformations and sparse road landmarks often lead to errors in the tracking thread and, finally, give rise to drift in the optimization iteration direction. 28 Therefore, some researchers have put forward strategies based on scene-segmentation optimization, 27 which use different clustering methods based on IMU and pose information to realize scene segmentation.
It can not only reduce the error accumulation in the tracking thread but also improve the accuracy of BA optimization. Inspired by the above ideas, this article proposes a scene segmentation method combined with camera motion direction and angle.
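To make the motion and observation models of equations (1) and (2) concrete, the following minimal sketch simulates them in 2-D with an additive motion function and a range-bearing observation. All function names, noise levels, and the toy landmark layout are illustrative assumptions, not part of the paper.

```python
import math
import random

random.seed(0)

def motion_model(x, u, q_std=0.01):
    """x_{i+1} = F(x_i, u_i) + w_i (2-D toy version).

    x: camera position (px, py); u: commanded displacement along the
    current motion direction; w_i ~ N(0, q_std^2) is white Gaussian noise.
    """
    wx, wy = random.gauss(0, q_std), random.gauss(0, q_std)
    return (x[0] + u[0] + wx, x[1] + u[1] + wy)

def observation_model(x, f, r_std=0.005):
    """Range-bearing observation z_ij = H(x_i, f_j) + v_ij of landmark f."""
    dx, dy = f[0] - x[0], f[1] - x[1]
    rng = math.hypot(dx, dy) + random.gauss(0, r_std)
    brg = math.atan2(dy, dx) + random.gauss(0, r_std)
    return (rng, brg)

# Simulate N poses observing M landmarks, as in the segment model.
landmarks = [(1.0, 0.0), (2.0, 1.0), (3.0, -1.0)]
poses = [(0.0, 0.0)]
for i in range(5):
    poses.append(motion_model(poses[-1], (0.5, 0.0)))
observations = {(i, j): observation_model(x, f)
                for i, x in enumerate(poses)
                for j, f in enumerate(landmarks)}
```

The dictionary `observations` then plays the role of the sets Z_i in the least-squares error model.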
The stable tracking algorithm in this article is basically the same as in ORB-SLAM2, which uses the uniform motion model to predict the camera pose. Our algorithm differs from ORB-SLAM2 when the number of matched feature points decreases sharply. In this case, ORB-SLAM2 computes the bag-of-words (BoW) vectors of the current keyframes and finds matched features based on the BoW dictionary. Instead, our algorithm does not use all the feature-point information in the keyframes but uses the last camera motion direction and angle as constraints and, at the same time, selects the feature points near the current line-of-sight vector to optimize the camera pose. The specific steps are as follows.
Firstly, the camera positions x_i and x_{i+1} at time steps i and i+1 are projected onto the image plane to obtain x'_i = K x_i and x'_{i+1} = K x_{i+1}, where K is the camera intrinsic matrix. Meanwhile, the number of tracked feature points n_feature and the keyframe generation frequency f_frame are determined in the tracking thread. By comparing them with the weighted thresholds (n̄_feature, f̄_frame), which are the average number of tracked feature points and the average keyframe generation frequency, respectively, segment computation is triggered to prevent the accumulation of local errors from affecting the optimization of the localization and mapping results. The direction vectors x'_i x'_{i+1} and x'_j x'_{j+1} whose included angles lie within (angle − π/4, angle + π/4) are selected as the main directions of scene segmentation, and the scene is divided into two segments, where angle is the average of the vector angles within the confidence range. At the same time, the RANSAC algorithm is used to fit the remaining location points into the two scene segments. The specific process is given in Algorithm 1.
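The segmentation step above can be sketched as follows. This is a simplified illustration of Algorithm 1: it keeps only the direction-angle test against the (angle − π/4, angle + π/4) window and omits the feature-count and keyframe-frequency gating as well as the RANSAC fitting of stragglers; all names are hypothetical.

```python
import math

def heading(p, q):
    """Direction angle of the displacement vector p -> q."""
    return math.atan2(q[1] - p[1], q[0] - p[0])

def angle_diff(a, b):
    """Smallest signed difference between two angles, in (-pi, pi]."""
    return (a - b + math.pi) % (2 * math.pi) - math.pi

def segment_by_direction(traj):
    """Split a trajectory into two segments by dominant motion direction.

    The circular mean of the frame-to-frame headings serves as the main
    direction 'angle'; frames whose heading lies within +-pi/4 of it form
    segment A, the rest form segment B.
    """
    headings = [heading(traj[i], traj[i + 1]) for i in range(len(traj) - 1)]
    mean = math.atan2(sum(math.sin(h) for h in headings),
                      sum(math.cos(h) for h in headings))
    seg_a, seg_b = [], []
    for i, h in enumerate(headings):
        (seg_a if abs(angle_diff(h, mean)) <= math.pi / 4 else seg_b).append(i)
    return seg_a, seg_b

# Straight run followed by a sharp 90-degree turn:
traj = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
a, b = segment_by_direction(traj)  # a -> [0, 1, 2], b -> [3, 4]
```

On this toy trajectory the three frames moving along the x-axis fall into one segment and the two frames after the turn into the other, which is exactly the cut the main-direction window is meant to produce.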

Prediction model combined with segmentation information.
Based on the scene segmentation rule of "strict in and wide out," after the scene is segmented, a reliable and efficient camera motion prediction model is needed for discontinuous segments with sparse or even missing landmarks, so as to select feature points as accurately as possible and maintain the tracking state until stable road segments with dense landmarks are entered.
After entering such road segments, it is very important to select the appropriate landmarks. Firstly, based on the direction constraint of the camera's main motion angle in the segment, the landmarks with a small distance d(f^k_i, x_j x_{j+1}) to the main direction vector x_j x_{j+1} are selected. After that, when tracking these landmarks, the prediction model is

    x_c = exp(S_c),    S_c = (1 + λ) S_{i,i+1},    λ ∈ (−0.1, 0.1),

where x_c is the predicted increment of the camera pose x_i in the main motion direction, S_c is the Lie algebra representation of the predicted increment, S_i and S_{i+1} are the Lie algebra forms of the absolute camera poses at time steps i and i+1, respectively, and S_{i,i+1} is the relative pose between time steps i and i+1. Through this prediction of the camera motion, the predicted value f̂^k_{i+1} ≈ S_{i,i+1} f^k_i of landmark f^k_i at the next time step can be obtained. Then, the error objective function of the prediction model can be established as

    E = Σ_k ( ||f̂^k_{i+1} − f^k_{i+1}||² + d(f^k_i, x_j x_{j+1})² ),

where f̂^k_{i+1} denotes the calculated value of landmark k at time step i+1, f^k_{i+1} denotes the actual observed value of landmark k at time step i+1, and d(f^k_i, x_j x_{j+1})² is the squared Euclidean distance between the selected landmark and the main direction of motion. By minimizing the error objective function, we can effectively select landmarks and maintain the tracking state in segments with large view transformations and sparse landmarks. The process does not end until scene segments with a normal tracking state are entered.
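A minimal sketch of the landmark-selection and prediction step, assuming a 2-D setting: landmarks are kept when their perpendicular distance to the main motion direction x_j x_{j+1} falls below a threshold (the threshold value is an assumption), and a landmark's next position is predicted by applying the relative pose. All names are hypothetical.

```python
import math

def point_to_line_distance(f, xj, xj1):
    """Euclidean distance d(f, x_j x_{j+1}) from landmark f to the line
    spanned by the main motion direction x_j -> x_{j+1} (2-D sketch)."""
    dx, dy = xj1[0] - xj[0], xj1[1] - xj[1]
    norm = math.hypot(dx, dy)
    # perpendicular distance via the 2-D cross product
    return abs(dx * (f[1] - xj[1]) - dy * (f[0] - xj[0])) / norm

def select_landmarks(landmarks, xj, xj1, max_dist=0.5):
    """Keep landmarks close to the main motion direction, as the
    prediction model does before tracking resumes."""
    return [f for f in landmarks
            if point_to_line_distance(f, xj, xj1) <= max_dist]

def predict_landmark(f, rel_pose):
    """Predicted position of a landmark under the relative pose S_{i,i+1},
    modeled here as a 2-D rotation theta plus translation (tx, ty)."""
    theta, tx, ty = rel_pose
    c, s = math.cos(theta), math.sin(theta)
    return (c * f[0] - s * f[1] + tx, s * f[0] + c * f[1] + ty)

xj, xj1 = (0.0, 0.0), (1.0, 0.0)
landmarks = [(2.0, 0.1), (3.0, 0.4), (2.5, 2.0)]
near = select_landmarks(landmarks, xj, xj1)        # drops the off-axis landmark
pred = predict_landmark(near[0], (0.0, 0.1, 0.0))  # pure translation
```

The residual between `pred` and the actually observed landmark position is exactly the first term of the error objective function above.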

Incremental BA with fusion of segmentation information
In the back-end optimization process, ORB-SLAM2 uses a covisibility graph, an essential graph, and a spanning tree. When keyframes are inserted in the local mapping thread, edges between keyframes sharing a certain number of covisible landmarks are established, and the spanning tree saves all nodes (i.e. keyframes) as a subset of the covisibility graph. After that, the loop closing thread computes the similarity over the covisibility graph and optimizes the camera poses according to the essential graph. This algorithm heavily relies on covisible landmarks to establish edge connections. At the same time, the setting and deletion strategies for redundant keyframes also increase the computational complexity. The reduction of covisible landmarks in scenes with large view transformations, together with the frequent addition of a large number of invalid keyframes, often causes graph maintenance failure and tracking failure.
In this article, we still follow the tracking and optimization framework of ORB-SLAM2: (1) optimize the pose Euclidean transformation matrix in local BA (tracking thread) and (2) optimize the pose similarity transformation matrix and the global pose in global BA. Different from ORB-SLAM2, however, we use a factor graph to represent the feature points and camera poses of the keyframes. We then use a Bayes tree to run incremental optimization, which enables us to add the camera motion direction and angle factors to the optimization process and consequently reduces both the dependence on covisible landmarks and the computational complexity.
According to iSAM and iSAM2, 25 the optimization process in SLAM can be expressed as a dynamic Bayesian network, and the motion and observation equations express the conditional probabilities between the state variables. As shown in equations (1) and (2), we add the observation equation and the camera motion direction to the Bayes tree as factor graphs. In the Bayes tree, the nodes are grouped into cliques, and the k-th clique is denoted C_k. We define the conditional probability density p(F_k | S_k) for each node. Here, the separator S_k is the intersection of C_k and its parent clique P_k, i.e. S_k = C_k ∩ P_k; the frontal variables F_k are the remaining variables in C_k, i.e. F_k = C_k \ S_k; and (u_i, z_{ij}) ∈ C_k. Therefore, the joint probability density of the Bayes tree can be expressed as

    p(X) = Π_k p(F_k | S_k).

An example of a factor graph and Bayes tree for this article is shown in Figure 3, where f_i is a landmark, x_i is a camera pose, and the segmented part Seg_i containing x_j x_{j+1} is based on the segmentation of the camera motion direction. Next, we introduce how to perform incremental optimization based on the matrix formed by the factor graph.
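The clique factorization p(X) = Π_k p(F_k | S_k) can be illustrated with a toy structure. The clique contents and log-density values below are placeholders, not values from the paper; the point is only that the joint log-density is the sum of the per-clique conditional log-densities.

```python
# A toy Bayes tree: each clique C_k stores its frontal variables F_k, a
# separator S_k = C_k ∩ P_k shared with its parent, and (a placeholder
# value for) the conditional log-density log p(F_k | S_k).
cliques = [
    {"frontal": ["x2", "x3"], "separator": [],           "logp": -1.2},  # root
    {"frontal": ["x1"],       "separator": ["x2"],       "logp": -0.7},
    {"frontal": ["f1"],       "separator": ["x1", "x2"], "logp": -0.4},
]

def joint_log_density(cliques):
    """log p(X) = sum_k log p(F_k | S_k)."""
    return sum(c["logp"] for c in cliques)

def check_tree(cliques):
    """Every separator variable must appear as a frontal variable of some
    clique (a necessary Bayes-tree property; ancestry is not checked in
    this sketch)."""
    frontals = {v for c in cliques for v in c["frontal"]}
    return all(set(c["separator"]) <= frontals for c in cliques)
```

In a real implementation each clique would store a conditional Gaussian rather than a fixed number, but the factorized bookkeeping is the same.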
The incremental optimization algorithm is often used with an IMU, 26 so that the camera can obtain reliable pose information and the accuracy of the optimization algorithm is improved. In this article, without an IMU, we use an incremental BA method fused with segmentation information to optimize the camera pose, the feature points, and the prediction model in the unstable state. Compared with the traditional method, it improves the optimization speed and the robustness of the mapping thread.
In the case of incremental updating optimization, to optimize the new parameters we update the most recent matrix so as to reuse the results of the existing measurements. A nonlinear error optimization objective function can be obtained using the classical linearization method

    Δ* = argmin_Δ ||AΔ − b||²,

where Δ is the state update vector, including the camera poses, landmark positions, and the predicted landmarks and predicted main motion direction of the prediction model; A and b represent the parameters of a sparse linear system with Δ as its solution. Through Householder QR decomposition, the objective function is rewritten as

    ||AΔ − b||² = ||RΔ − d||² + c,

where R is the upper-triangular factor, d is the correspondingly transformed right-hand side, and c is the sum of squares of the residuals of the least-squares problem. When new vectors need to be reoptimized and updated, the previous decomposition can be modified through the classical QR update. [A′ | b′] is the result of adding a new state increment (aᵀ, β) to [A | b]: the classical QR update appends the row aᵀ to the former factor R and the new scalar element β to the right-hand vector d. In this article, the state quantity aᵀ is sparse in the unstable tracking state, being related only to the motion direction, angle, and selected landmarks of the scene segments. Then we have

    [R_a | d_a] = [ R  d ; aᵀ  β ].

After obtaining the state-quantity update, we use Givens rotations to eliminate the newly added row elements on the left side and incrementally return the new linear system to upper-triangular form. The specific method is shown in Figure 4. A Givens rotation matrix has the form

    G = [ cos φ  sin φ ; −sin φ  cos φ ].

Therefore, the matrix of the QR decomposition is multiplied on the left by several Givens rotation matrices, one for each nonzero element of the row added by the increment, so that an upper-triangular matrix containing the updated factor R′ is generated starting from the leftmost nonzero element. Similarly, using Givens rotation matrices to incrementally update the right-hand state vector d_a, we get d′.
Then, the equivalent objective function after the incremental update is

    ||R′Δ − d′||² + c′,

where c′ is the sum of residual squares after the incremental update.
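The QR row update described above can be sketched in a few lines: a new measurement row (aᵀ | β) is appended to [R | d] and eliminated with one Givens rotation per column, restoring the upper-triangular form without refactoring A. This dense toy version is illustrative only; in the paper's setting the new row is sparse, so only the columns of the states it touches need to be rotated.

```python
import math

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] applied to (a, b) zeroes b."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r

def qr_row_update(R, d, a, beta):
    """Append the new row (a^T | beta) to the factor [R | d] and restore the
    upper-triangular form with Givens rotations (dense toy version)."""
    n = len(R)
    R = [row[:] for row in R]          # work on copies
    a, d = a[:], d[:]
    for k in range(n):
        c, s = givens(R[k][k], a[k])
        for j in range(k, n):          # rotate row k of R against the new row
            rkj, aj = R[k][j], a[j]
            R[k][j] = c * rkj + s * aj
            a[j] = -s * rkj + c * aj
        d[k], beta = c * d[k] + s * beta, -s * d[k] + c * beta
    return R, d

R = [[2.0, 1.0], [0.0, 1.5]]
d = [1.0, 0.5]
Rp, dp = qr_row_update(R, d, [1.0, 0.5], 0.2)
```

A quick sanity check on the result: the updated factor satisfies R′ᵀR′ = RᵀR + a aᵀ, which is exactly the information-matrix update that adding one measurement row should produce.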
In fact, there is still some work to do for the incremental optimization based on the Bayes tree in GTSAM. Our work in this part is the same as in iSAM2, so it is not repeated here. Through the incremental updating of the error objective function, incremental optimization results for the camera pose and feature points can be obtained from the optimization of the prediction model in the unstable state, improving the optimization speed and the robustness of the mapping thread.

Experiments
Considering that ORB-SLAM2 uses the graph-optimization-based g2o library and that OpenVSLAM 29 reproduces it with high readability, we implemented the aforementioned algorithm and developed a V-SLAM system based on ORB-SLAM2 and OpenVSLAM. Its user interface is shown in Figure 5, where Figure 5(a) shows the mapping and tracking threads, and Figure 5(b) shows the mapping results.

Test configuration
To evaluate the performance of our algorithm, we chose the TUM RGB-D dataset and the NYUDv2 RGB-D dataset for the comparative experiments. In these datasets, the scenes have sparse landmarks and large view transformations. The hardware platforms include a desktop platform and an embedded development platform; their configurations are given in Table 1.

Experimental results
We select state-of-the-art V-SLAM systems, namely RKD-SLAM, 27 ORB-SLAM2, 4 ElasticFusion, 30 OpenVSLAM, 29 and LSD-SLAM, 16 for comparison with our algorithm and system.
Experimental results on desktop platform. As given in Table 2, the experimental results for the relative pose error (RPE) on the TUM RGB-D and NYUDv2 RGB-D datasets are compared. The average accuracy of our method is 51.7% higher than that of ORB-SLAM2 on the sequences fr1_floor and fr2_large_with_loop, which feature severe landmark loss and large view transformations of the camera, and 59.2% higher than that of RKD-SLAM on NYUDv2 lab1. As shown in Figure 6, for the sequence fr2_xyz, our method processes a single frame in 28 ms on average, which is 82.7% faster than ORB-SLAM2, while the absolute trajectory accuracy is basically unchanged. In experiments on a large number of TUM RGB-D sequences, the average keyframe processing efficiency of this algorithm is 10.2% higher than that of ORB-SLAM2 and its improved algorithms. Figure 7(a) shows the scene segmentation result of RKD-SLAM on the sequence NYUDv2 lab1, and Figure 7(b) shows the scene segmentation result of the method in this article, where the black track is the ground truth and the colored track is the segmentation result. The absolute trajectory accuracy of this algorithm is 62% higher than that of RKD-SLAM on this sequence. As shown in Figure 8, the left figure shows the results of ORB-SLAM2 and OpenVSLAM on the TUM sequences, and the right figure shows the results of the method in this article on the sequences fr1_desk, fr1_floor, fr2_xyz, and fr3_nostructure_notexture_near_withloop.
Experimental results on embedded development platform. To evaluate the performance of the proposed algorithm in the embedded development platform, we conducted a comparative experiment based on HUAWEI Kirin970.
The configurations of the embedded platform and the camera are presented in Table 3. The V-SLAM system based on the Kirin 970 is shown in Figure 9. Our algorithm improves computational efficiency by 22.4% compared with LSD-SLAM and ORB-SLAM2, which is essential for the application of V-SLAM on embedded development platforms.
As given in Table 4, the experimental results for the RPE and running time of the algorithm in this article are listed together with those of ORB-SLAM2, LSD-SLAM, and OpenVSLAM on the TUM dataset. Experiments show that the average running speed of the method in this article is 19.5% higher than that of ORB-SLAM2 and 25.3% higher than that of RKD-SLAM.
At the same time, as shown in Figure 10, on the fr1_floor, fr2_large_with_loop, and fr3_nostructure_notexture sequences, our algorithm shows better tracking stability than both ORB-SLAM2 and LSD-SLAM. In an actual test, we chose a 25 × 25 m² indoor scene consisting mainly of empty rooms and corridors, in which the number of feature points in each image frame is no more than 300. It is a three-story building with a height of 10 m, which produces large view transformations of camera motion while going up and down stairs. The closed-loop error of our algorithm is 0.5 m, and the average trajectory error is 0.83 m; for comparison, the closed-loop error of ORB-SLAM2 is 1.8 m, and its average trajectory error is 2.1 m. As shown in Figure 11, our algorithm has better positioning accuracy and loop-closure accuracy than ORB-SLAM2, and its tracking performance is more stable.

Conclusion
Considering that existing V-SLAM systems are usually sensitive to sparse landmarks and large view transformations, this article proposes an improved V-SLAM method based on scene segmentation and an incremental optimization strategy. The experimental results demonstrate that our method improves computational efficiency over state-of-the-art V-SLAM systems by 10.2% on the desktop platform and by 22.4% on the embedded platform. Meanwhile, the robustness of our method is better than that of ORB-SLAM2 on the TUM RGB-D dataset. In general, our improved algorithm provides strong support for the effective application of V-SLAM on embedded platforms.
In future research, we plan to tightly integrate GNSS/BeiDou information to further improve SLAM accuracy in large-scale scenarios and in scenarios with missing loop closures.