A visual SLAM method based on point-line fusion in weak-matching scenes

Visual simultaneous localization and mapping (SLAM) is one of the most active research areas in robotics. Traditional point feature-based approaches face many challenges, such as insufficient point features, motion jitter, and low localization accuracy in low-texture scenes, which reduce the performance of the algorithms. In this article, we propose an RGB-D SLAM system to handle these situations, named Point-Line Fusion (PLF)-SLAM. We utilize both points and line segments throughout our work. Specifically, we present a new line segment extraction method to solve the overlap and branch problems of line segments, and then propose a more rigorous screening mechanism in the line matching stage. Instead of minimizing the reprojection error of points only, we introduce a reprojection error based on both points and lines to obtain a more accurate tracking pose. In addition, we present a solution to handle jitter frames, which greatly improves the tracking success rate and the availability of the system. We thoroughly evaluate our system on the Technische Universität München (TUM) RGB-D benchmark and compare it with ORB-SLAM2, a state-of-the-art solution. The experiments show that our system has better accuracy and robustness than ORB-SLAM2.


Introduction
Simultaneous localization and mapping (SLAM) is an extensively researched topic in robotics 1 and has been widely applied in service robots, autonomous driving, unmanned aerial vehicles (UAVs), virtual reality, and other fields. [2][3][4] SLAM systems mainly comprise laser-based and vision-based methods, according to the sensor type. In recent years, theoretical research on laser-based SLAM has achieved extraordinary results in localization, 5 mapping, autonomous navigation, and independent exploration. 6 Meanwhile, since visual information has the advantages of rich content, low cost, and intuitive effects, visual simultaneous localization and mapping (vSLAM) has gradually become a hot research field. 7 Visual odometry is the process of determining the position and orientation of a robot by analyzing the associated camera images, and it has been used in a wide variety of robotic applications. According to the underlying principle, there are mainly feature-based methods and direct methods. An example of the latter is the large-scale direct monocular (LSD)-SLAM proposed by Engel et al., 8 which builds large-scale semi-dense maps using direct methods instead of bundle adjustment over features. It achieves semi-dense scene reconstruction on a standard central processing unit (CPU), without graphics processing unit (GPU) acceleration, and guarantees stability and real-time performance, but it still relies on the feature point method for loop detection. Since the direct method is extremely sensitive to illumination and its strong assumption of grayscale invariance is difficult to satisfy, research on feature-based SLAM has become popular. In recent years, optimization-based 9,10 approaches have emerged continuously due to their superior accuracy per computational unit compared with filtering-based approaches.
The first real-time application was the visual odometry work of Mouragnon et al., 11 followed by the ground-breaking SLAM work of Klein and Murray 12 known as parallel tracking and mapping.
Many algorithms can be performed with only a monocular camera. However, depth information cannot be directly observed, which results in a lack of scale. In addition, multi-view or filtering techniques are required to produce an initial map, and the pure rotation problem cannot be effectively solved. In recent years, various low-priced depth cameras have been launched, such as Microsoft's Kinect, 13 Intel's RealSense, 14 and Asus's Xtion, which make up for the defects of monocular SLAM. 15 It is notable that point-based SLAM systems often fail to work properly, and may even fail completely, in low-texture or motion-jitter scenes where it is difficult to find enough keypoint features. However, in addition to point features, scenes contain planar elements that are rich in linear shapes and object-based 16 shapes, especially the former, even in low-texture scenes such as white walls and corridor edges.
In this work, we propose a point-line fusion method to deal with weak-matching objects in Red Green Blue-Depth (RGB-D) SLAM, where line features are joined in the feature extraction part. To solve the overlap and branch problems of the LSD 17 algorithm, we design an adaptive line extraction method that enhances the reliability of the line features. The camera pose is optimized by minimizing the point-line reprojection error. In addition, we set up an optimization mechanism for motion-jitter sequences, which effectively improves the tracking success rate and, indirectly, the tracking accuracy. Compared with point-based SLAM, the accuracy and robustness of PLF-SLAM are greatly improved.
The rest of this article is organized as follows. The second section discusses the related work, the third section gives the details of our proposal, the fourth section details the experimental results, and the fifth section presents the conclusions and the future work.

Related work
The KinectFusion algorithm proposed by Newcombe et al. 18 merges the depth information measured by the sensor and uses the iterative closest point (ICP) 19 algorithm to calculate the camera pose. Due to its lack of loop detection, it is often applied to small workspaces. Endres et al. 20 propose using the ICP algorithm to compute frame-to-frame motion, with an optimization-based method in the back end. Kerl et al. 21 propose a dense vSLAM method that minimizes both the photometric and depth errors of all pixels, which makes better use of the information available in the image compared with feature-based methods. The ElasticFusion algorithm proposed by Whelan et al. 22 performs dense 3-D reconstruction based on depth cameras, incorporates relocalization, and is suitable for room-sized scenes. Mur-Artal and Tardós propose Oriented FAST and Rotated BRIEF (ORB)-SLAM2 23 on the basis of their previous work, 24 and the whole system is built around ORB features, 25 including the visual odometry and the visual vocabulary for loop closing. Although features such as the Scale Invariant Feature Transform (SIFT) 26 and Speeded Up Robust Features (SURF) 27 are better in quality than ORB, their time consumption makes it difficult to meet the real-time requirements of vSLAM without GPU acceleration. Compared with Harris corners, 28 ORB has good rotation invariance and uses an image pyramid to achieve scale invariance. The system also adds a loop closing mechanism, which effectively reduces the error accumulated along the loop.
Gomez-Ojeda et al. 29 propose a vSLAM method based on a stereo camera that combines points and line segments to work in a wider range of scenes, especially in scenes where point features are scarce or not evenly distributed. Pumarola et al. 30 propose a visual SLAM based on a monocular camera, which improves the robustness of the system by processing points and lines simultaneously. In addition, it can complete the initialization of the map by processing the lines in three consecutive frames.
The LSD algorithm proposed by Grompone von Gioi et al. can obtain detection results with sub-pixel precision in linear time and can be applied to any digital image without parameter tuning. Its extraction speed is far faster than that of the Hough transform, and it has strong robustness. Compared with the Mean-Standard Deviation Line Descriptor (MSLD), 31 the line band descriptor (LBD) algorithm proposed by Zhang et al., 32 which introduces weight coefficients, is faster and more robust.

PLF-based vSLAM
Our work is based on ORB-SLAM2, and the algorithm consists of three threads that run in parallel: tracking, local mapping, and loop closing. In the tracking thread, point features and line features are used jointly, with reasonable weights, to estimate and optimize the camera pose. Figure 1 shows the flow chart of the tracking thread.

Adaptive line segment extraction
The time needed for feature extraction occupies a large part of the whole algorithm, so an efficient feature extraction method is a prerequisite for the real-time performance of a SLAM system. LSD 17 is a linear-time line segment detector, which is much faster than the Hough transform 33 method and requires no parameter tuning. However, compared with extracting ORB features, extracting line features by the LSD method takes two to three times longer, which greatly affects the real-time performance of the algorithm, and some of the extracted lines are redundant and useless, resulting in lower reliability. We therefore propose a new line extraction method based on LSD and define a line segment response value function of sp and ep, the two end points of the extracted line segment, and of L and W, the length and width of the image frame, respectively. By setting an adaptive threshold, line segments with a response value less than ε_r are eliminated, which increases the extraction speed. The line segments detected by the LSD method often have problems such as overlapping or sub-line segments, as shown in Figure 2.
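As an illustration, the response-based filtering step can be sketched as follows. The paper's exact response function is not reproduced above, so this sketch assumes a response equal to the segment length normalized by the image diagonal; the threshold `eps_r` is likewise illustrative:

```python
import numpy as np

def response(segment, img_w, img_h):
    """Hypothetical response value: segment length normalized by the
    image diagonal (an assumption, not the paper's exact formula)."""
    (x1, y1), (x2, y2) = segment
    length = np.hypot(x2 - x1, y2 - y1)
    return length / np.hypot(img_w, img_h)

def filter_segments(segments, img_w, img_h, eps_r=0.02):
    """Keep only segments whose response value reaches the threshold eps_r."""
    return [s for s in segments if response(s, img_w, img_h) >= eps_r]
```

Discarding short, low-response segments before matching both speeds up extraction and removes the least reliable candidates.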
Among them, Figure 2(a) is the original sample, Figure 2(b) shows the line segments extracted by the original algorithm, and Figure 2(c) is an enlarged view of Figure 2(b). The line segment on the floor is mistakenly detected as the overlap of two or more line segments, and this error occurs in most frames, which affects the reliability of matching. Therefore, we propose an improved method for this flaw and merge the overlapping line segments under appropriate conditions. First, the above situation is abstracted into a geometric problem, as shown in Figure 3.
In Figure 3, d_1 is the distance from the midpoint of line l_1 to line l_2. We define the conditions for merging the line segments as follows. The included angle of the two line segments should be smaller than ε_1, i.e. |Dir(l_1) − Dir(l_2)| < ε_1, where Dir(l) is the direction angle function of the line segment and ε_1 is an empirical angle threshold; in addition, d_1 should be smaller than ε_2, the minimum distance threshold from the midpoint of line l_1 to line l_2. After satisfying formula (2), for case (a), the abscissa or ordinate of one end point of l_1 must lie between the two end points of l_2, i.e.
where x and y are the abscissa and ordinate of the end point of the line segment, respectively. For case (b), the shortest distance from end point b to c must be less than the empirical threshold d_min, i.e. dis(b, c) < d_min. If the extracted line segments satisfy the above conditions, line segment merging is performed. We use the least squares method to merge line segments, generating a 2-D point set from the given line segments.
Suppose the merged line l_12 is approximated by the function y_12 = f_12(x), chosen so that the sum of squared deviations is minimized. Let the fitting formula be y = ax + b. For the 2-D point set {(x_i, y_i)}, the sum of squared deviations is then S(a, b) = Σ_i (y_i − ax_i − b)^2. Calculating the partial derivatives of S with respect to a and b and setting them to zero, we obtain a pair of simultaneous linear equations. Finally, these equations are solved to obtain the linear function of the merged line segment.
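The merging step above can be sketched as follows. The sketch assumes both segments are roughly horizontal (for near-vertical segments the roles of x and y would be swapped) and samples points along each segment to build the 2-D point set before the least squares fit:

```python
import numpy as np

def merge_segments(l1, l2, n_samples=20):
    """Merge two overlapping segments by least-squares line fitting.

    Sample points along both segments, fit y = a*x + b by minimizing the
    sum of squared deviations, then clip the fitted line to the extreme
    x-range of the two segments.
    """
    pts = []
    for (x1, y1), (x2, y2) in (l1, l2):
        t = np.linspace(0.0, 1.0, n_samples)
        pts.append(np.column_stack((x1 + t * (x2 - x1), y1 + t * (y2 - y1))))
    pts = np.vstack(pts)
    # polyfit(deg=1) is the closed-form solution of the normal equations
    a, b = np.polyfit(pts[:, 0], pts[:, 1], 1)
    x_min, x_max = pts[:, 0].min(), pts[:, 0].max()
    return (x_min, a * x_min + b), (x_max, a * x_max + b)
```

Fitting through samples from both segments, rather than just the four end points, weights longer segments more heavily in the merged result.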

Line matching method
Line features can be matched by comparing the similarity of their LBD descriptors. Common feature matching methods include the brute force match (BFM) algorithm and the fast library for approximate nearest neighbors (FLANN) algorithm. FLANN can easily lead to matching failure when tracking in low-texture scenes, so we use BFM in our work.
Compared with point features, the BFM algorithm has a higher error rate when matching line features. Therefore, we filter the line matches produced by the BFM algorithm and reject the inaccurate ones. All matching pairs must satisfy the following conditions (Figure 4): (a) |Dir(l_1) − Dir(l_2)| < ε_{l_1,l_2}. Since the motion between two adjacent frames is small, the direction angle of an extracted line segment does not change much; when the difference between the direction angles of the matched line segments is greater than ε_{l_1,l_2}, the match is regarded as a mismatch. In addition, line similarity is measured by the Hamming distance between LBD descriptors, and we obtain the minimum descriptor distance by BFM; on this basis, when the minimum descriptor distance exceeds a certain range, the matched line is discarded.
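A minimal sketch of this filtering stage is given below. It assumes matches are given as (index in frame 1, index in frame 2, descriptor distance) triples, and the thresholds `max_angle` and `max_hamming` are illustrative rather than the paper's values:

```python
import numpy as np

def angle(seg):
    """Direction angle of a segment given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return np.arctan2(y2 - y1, x2 - x1)

def filter_matches(matches, segs1, segs2, max_angle=0.2, max_hamming=40):
    """Reject line matches whose direction angles differ too much or whose
    descriptor (Hamming) distance is too large."""
    good = []
    for i, j, dist in matches:
        da = abs(angle(segs1[i]) - angle(segs2[j])) % np.pi
        da = min(da, np.pi - da)  # a segment's direction is ambiguous by pi
        if da < max_angle and dist < max_hamming:
            good.append((i, j, dist))
    return good
```

The modulo-pi wrap accounts for the fact that a detector may return a segment's end points in either order between frames.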

Point-line reprojection error
The optimization of the camera pose first requires an error model of the points and lines, whose variables are the camera pose and the spatial coordinates of the features. The camera pose is calculated by minimizing the point-line reprojection error. In our work, the distance between the projections of the two spatial end points and the detected line segment in the image is used as the reprojection error.
As shown in Figure 5, let P, Q ∈ R^3 be the 3-D end points of the line segment and p_d, q_d ∈ R^2 be their detected 2-D end points. For normalization, let p̃_d, q̃_d ∈ R^3 be the corresponding homogeneous coordinates, from which the normalized line coefficient l of the detected segment is computed. We then define the point-line error E_pl between P and the detected 2-D line segment l as the distance from the projection of P to l:

E_pl(θ, P, l) = l^T π(θ, P)    (9)

where π(θ, P) represents the projection of P in the image and θ represents the rotation matrix R and the translation vector t of the camera pose. The line reprojection error E_l is then defined as the sum of the squared point-line errors of the two end points:

E_l(θ, P, Q, l) = E_pl^2(θ, P, l) + E_pl^2(θ, Q, l)    (10)

Since there are accumulated errors in the camera pose obtained by visual odometry, and the noise cannot be eliminated, the projections of the 3-D points and lines on the image necessarily differ from the points and lines detected in the image. We therefore define a comprehensive error that sums the reprojection errors over all features, where i indexes the line features, j indexes the point features, and the weight λ_i of each line term is dynamically adjusted according to the response value of the corresponding line feature. In our work, this comprehensive point-line error model is minimized to find the optimal camera pose. The pseudo-code of the whole algorithm is provided in Algorithm 1.
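Equations (9) and (10) can be sketched as follows, under the convention that the line coefficients are normalized by their first two components (so that l^T x̃ is a signed pixel distance); the camera intrinsics K and the pose (R, t) are assumed given:

```python
import numpy as np

def line_coefficients(p, q):
    """Normalized 2-D line through the homogeneous end points p~, q~."""
    l = np.cross(np.append(p, 1.0), np.append(q, 1.0))
    return l / np.linalg.norm(l[:2])  # l . x~ is now a point-to-line distance

def point_line_error(l, K, R, t, P):
    """E_pl: distance from the projection of 3-D point P to detected line l."""
    x = K @ (R @ P + t)
    x = x / x[2]  # homogeneous image point (x, y, 1)
    return float(l @ x)

def line_reproj_error(l, K, R, t, P, Q):
    """E_l = E_pl(P)^2 + E_pl(Q)^2, as in equation (10)."""
    return point_line_error(l, K, R, t, P) ** 2 + point_line_error(l, K, R, t, Q) ** 2
```

In the full system this residual would be fed to a nonlinear least squares optimizer together with the point reprojection terms.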

Motion jittering
In the tracking process, the camera often moves fast or shakes, which results in blurred images, as shown in Figure 6. This easily causes feature matching to fail, and finally the tracking is lost. Tracking loss requires relocalization for adjustment; if relocalization also fails, tracking fails completely. The main reason for tracking loss is that the image texture is not obvious, or the feature difference between frames is too large due to motion jittering, which makes correct matching impossible. There are two solutions to this situation: (a) given the blur kernel, apply deconvolution to the jittered image to make the blurred image clearer; however, estimating the blur kernel is a major difficulty. (b) Apply a Gaussian filter to the images captured by the camera, which reduces the difference between adjacent images and effectively improves the success rate of feature matching, thereby reducing the probability of tracking loss.
In this article, we propose a method to select motion-blurred images automatically. During tracking, we pre-match the features of the current image and the next image to estimate the tracking success rate in advance. When the matching degree of the two images is lower than a given threshold, Gaussian blur is used to reduce image noise and detail, which reduces the difference between the two images and improves the subsequent matching success rate.
When the size of the blur kernel increases within a certain range, the image blur degree also increases, so the number of feature matches increases as well. Beyond this range, however, the difference between adjacent images becomes larger than that between the original images, resulting in matching failure. We propose the improved method in Algorithm 2.

Algorithm 1. Line segment filtering, merging, and tracking (excerpt)
7: if R_l(line_i) < low_set then
8:   line.erase(i)
9: end if
10: end for
11: for i = 1 to N_lines do
12:   for j = i + 1 to N_lines do
13:     if |Dir(l_i) − Dir(l_j)| > ε_1 or d_1 > ε_2 then
14:       continue
15:     end if
16:     if sp_j < x_i < ep_j or sp_j < y_i < ep_j then
17:       LineMerge(line_i, line_j)
18:     end if
19:     if min dis(point_i, point_j) < d_min then
20:       LineMerge(line_i, line_j)
21:     end if
22:   end for
23: end for
24: while isTracking() do
25:   for each frame do

Through the above improvements, the tracking success rate is obviously improved, and the error of pose estimation is indirectly reduced, which improves the robustness of the SLAM system.

Experiments and results
We evaluate the performance of our method on the Technische Universität München (TUM) 34 dataset, which provides several image sequences with accurate ground truth obtained by an external motion capture system. These image sequences include specific scenes such as normal motion, fast motion, low texture, and no structure. We perform our experiments on a computer with an Intel Core i5 (@2.6 GHz) processor and 8 GB RAM, without GPU acceleration, to reduce the running cost of SLAM.

Tracking accuracy
The evaluation of SLAM system performance mainly includes localization accuracy and mapping accuracy, and the latter mainly depends on the former. Therefore, we mainly focus on the localization accuracy of the algorithm in this article. To verify the effectiveness and superiority of the proposed method, we compare it with ORB-SLAM2.
In this article, we select different types of image sequences, carry out multiple runs of the same experiment on each sequence, and take the average value to avoid chance results. Finally, we calculate the root mean square error (RMSE) between the estimated pose and the true pose and average these RMSE values for an effective comparison. A variety of image sequences such as fr1/360 and fr1/desk are selected for comparison with ORB-SLAM2. The results are presented in Table 1, which shows the absolute trajectory error of the different methods on the above dataset sequences. From the experimental results, the RMSE obtained by our method is lower than that of ORB-SLAM2 to different degrees on most sequences. As shown in Figure 7(a) and (b), the tracking results of PLF-SLAM are significantly better than those of ORB-SLAM2 on the fr1/room sequence, which benefits from the adaptive line segment extraction method proposed in this article. When the number of point features is insufficient, line features can be extracted simultaneously to participate in tracking, which makes up for the lack of point features.
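For reference, the RMSE of the absolute trajectory error can be sketched as below, assuming the estimated and ground-truth trajectories are already time-associated and aligned (the TUM benchmark tools handle the association and SE(3) alignment):

```python
import numpy as np

def ate_rmse(est, gt):
    """RMSE of the absolute trajectory error between two aligned,
    time-associated trajectories given as (N, 3) arrays of positions."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    errs = np.linalg.norm(est - gt, axis=1)  # per-pose translational error
    return float(np.sqrt(np.mean(errs ** 2)))
```

Averaging this statistic over multiple runs of the same sequence, as described above, smooths out run-to-run variation in the tracker.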
In addition, we also compare our method with ORB-SLAM2 extended with the original LSD method. It can be seen from the experimental data in the table that, when the adaptive line segment extraction method of this article is used, most of the trajectory errors are significantly reduced and the pose estimation is more accurate. The reason is that the line features extracted by the original LSD method suffer from problems such as overlap and segmentation, so the line features extracted from adjacent images differ greatly, resulting in poor matching. In our work, we handle the overlap and segmentation and impose more stringent matching conditions, which makes the line matching more reliable and lays a good foundation for the back-end optimization.

Algorithm 2. Motion Jittering Process
1: Set the matching success empirical value ρ, which is derived from multiple feature matching experiments;
2: Extract the ORB point features of the current frame and of the next frame at the same time;
3: Pre-match the features between the current frame and the next frame to obtain the pre-match number;
4: When the number of matches is lower than ρ, smooth the current frame with a 3×3 blur kernel and the next frame with a 5×5 blur kernel at the same time, then continue the tracking task;
5: When the number of matches is higher than ρ, keep the original images unchanged, then continue the tracking task.

Motion jitter evaluation
As discussed above, when the camera shakes quickly, the captured images often suffer from motion blur that affects feature matching, making it difficult to compute the camera pose between the current frame and the next frame during tracking. When that happens, the relocalization method must be performed to recover the camera pose; otherwise, the entire tracking process fails.
When testing the above TUM sequences, we found different degrees of motion jitter in multiple sequences. In the fr1/desk sequence, ORB-SLAM2 typically loses tracking between the 170th and 171st images. The reason is that the previous image is blurred by the camera's fast jitter, and the difference between the current frame and the next frame is too large, so there are not enough matching pairs, resulting in tracking failure. Although the camera pose can be restored by relocalization, the previous trajectory is lost, which inevitably increases the overall trajectory error; the same holds for the fr2/large_loop sequence.
We found the two images of the sequence that caused the tracking loss and then extracted and matched their features separately, finding few valid matches. For example, there are only 76 matches between the jitter frames in the fr1/desk sequence, which is hard to meet the needs of tracking. After our method is applied, the number of matches grows to 117 pairs, and the matching success rate is obviously improved. This is due to the optimization of the jitter images: when the matches are sufficient, the tracking success rate is greatly improved.

Figure 8 shows the matching of two images in the fr1/desk sequence, where (a) is the matching before processing and (b) is the matching after processing. It is easy to see that the feature matching after processing by this method is better. Figure 9(a) shows the comparison between the estimated trajectory and the actual trajectory in ORB-SLAM2; the tracking in the red rectangle has been lost. Figure 9(b) shows the comparison between the estimated trajectory and the actual trajectory in PLF-SLAM; the tracking effect of our method on this sequence is obviously better than the former.

Table 2 shows the comparison of the two methods in tracking success rate, covering multiple groups of image sequences with good and bad tracking. From the table, we can see that our method obtains good tracking results for most image sequences. Compared with ORB-SLAM2, the tracking success rate is improved to some extent, which proves that our method improves the robustness of the SLAM system.

Conclusion
In this article, a point-line fusion SLAM method based on a depth camera is presented. Aiming at the unreliability of point-feature-based SLAM in low-texture environments, a new line segment extraction and matching method is proposed, which solves the overlap and branch problems of the original LSD method and improves the reliability of line matching and the robustness of the system. For the motion jitter problem often encountered in tracking, a method of autonomously selecting and optimizing motion jitter frames is proposed, which greatly improves the tracking success rate and requires no additional sensor support. We have carried out extensive experimental evaluations on the TUM dataset, and most of the results are greatly improved compared with ORB-SLAM2. We have specifically verified the image sequences with motion blur, and the experimental results confirm the rationality and effectiveness of the method. In the future, we will optimize the tracking process and improve the real-time performance of the proposed method by detecting the current tracking environment to decide when to include line feature estimation. In addition, we will combine the depth sensor to build a dense point cloud map in real time, and further build an octree-based grid map to enhance the intuitiveness and reusability of the map for subsequent navigation and other work.