A real-time semantic visual SLAM approach with points and objects

Visual simultaneous localization and mapping (SLAM) is important for the self-localization and environment perception of service robots, where semantic SLAM can provide more accurate localization results and a map with abundant semantic information. In this article, we propose a real-time PO-SLAM approach that combines point and object measurements. In addition to the point-point association of ORB-SLAM2, we consider point-object association based on object segmentation, as well as object-object association, where object segmentation is performed by combining object detection with a depth histogram. Moreover, besides the constraint of feature points belonging to an object, a semantic constraint of relative position invariance among objects is introduced. Accordingly, two semantic loss functions with point and object information are designed and added to the bundle adjustment optimization. The effectiveness of the proposed approach is verified by experiments.


Introduction
Simultaneous localization and mapping (SLAM) has become a very popular research direction in recent years; it requires constructing and updating an environment map while simultaneously tracking an agent's position. 1,2 SLAM has a variety of applications such as autonomous driving, mobile robots, and virtual reality. Visual SLAM in particular has received extensive attention due to the large amount of information, wide range of application scenarios, and low cost of visual sensors. 3,4 Compared with monocular and stereo cameras, RGB-Depth (RGB-D) cameras are widely used in indoor environments because they can directly provide the depth and color measurements of the scene. In this article, we focus on RGB-D SLAM.
For traditional visual SLAM, the feature-based approach [5][6][7] and the direct method 8,9 are the mainstream solutions, where low-level point information plays an important role. The former associates points in successive frames according to the local appearance near every feature point, while the latter tracks points on the basis of a constant-brightness assumption. 10 However, these methods suffer from illumination and viewpoint changes. 11,12 Different viewpoints and illuminations can alter the local appearance and brightness of the same point, causing tracking failures due to incorrect data association and thus decreasing the localization accuracy of visual SLAM. On the other hand, traditional visual SLAM mainly focuses on low-level geometric information, which can result in weak interaction with complex surrounding environments. 13 With the development of deep learning, great progress has been made in object detection and object segmentation, whose high-level semantic information adapts better to viewpoint and illumination changes. The purpose of object detection is to infer the locations and class labels of objects, where the location of an object is represented in the form of a bounding box. Deep-learning-based object detection can be classified into approaches with and without region proposals. The former is a two-stage process: first generate a series of candidate regions, and then extract the features of the candidate regions for classification and boundary regression; popular methods include regions with convolutional neural network features (R-CNN), Fast R-CNN, and Faster R-CNN. Approaches without region proposals use the global information of the image directly; you only look once (YOLO), 14 YOLO9000, 15 YOLOv3, 16 and the single-shot multibox detector 17 are representative methods.
Different from the bounding box of object detection, object segmentation predicts the class labels pixel by pixel, and it is related to semantic segmentation 18,19 and instance segmentation. 20 A possible problem of object segmentation is its computational cost, which makes it hard to integrate into a real-time SLAM system.
Driven by deep-learning-based object detection and segmentation, researchers have investigated semantic visual SLAM that incorporates object detection or segmentation. Semantics can not only help SLAM achieve better localization 11,21-23 but also build a richer map. To improve localization accuracy, semantic constraints are added. Lianos et al. constructed a semantic error function that utilizes semantic segmentation to promote point-point association. 11 An et al. evaluated the importance of each semantic category based on semantic segmentation for better visual features and the removal of outliers in the matching process, 21 thereby improving the accuracy and robustness of localization. Besides semantic constraints, pose optimization of objects is also considered. A 3-D cuboid object detection approach was proposed 22 and combined with Oriented FAST and Rotated BRIEF (ORB) feature points to build semantic error functions for static and dynamic environments, respectively; on this basis, the poses of points, 3-D cuboids, and cameras are jointly optimized. Similarly, Li et al. utilized 3-D object detection with viewpoint classification as well as feature points for constructing semantic constraints, 23 which is suitable for both static and dynamic conditions.
It shall be noted that existing semantic SLAM approaches mainly concern camera-landmark, camera-camera, and cross-type landmark constraints, where a landmark can be either a point or an object. Constraints between landmarks of the same kind are seldom considered. In fact, the relative distance and orientation between two static object landmarks are invariant, yet this invariance may be violated if only the aforementioned constraints are employed; introducing relative constraints among objects into the SLAM optimization process is therefore beneficial for localization. In this article, we propose a real-time visual Point-Object SLAM (PO-SLAM) approach on the basis of RGB-D ORB-SLAM2, which incorporates an object-object constraint in the bundle adjustment (BA) optimization process. To ensure the real-time performance of the system while considering the instantiation of the objects, YOLOv3 16 is adopted and combined with a rough geometric segmentation based on depth histograms to obtain the contours of objects, which improves the association quality. Moreover, the object-object constraint is reflected by the relative position invariance of objects, which is converted to the length and orientation invariance of the line segment connecting every two objects in each frame. This provides additional information for pose optimization.
In the following, we will describe the proposed PO-SLAM approach combining points and objects in detail. Then, the experiments are presented, and finally, we conclude the article.

The proposed semantic PO-SLAM with points and objects
The framework of the proposed semantic PO-SLAM is shown in Figure 1, where point features, point-point association, and the point-point constraint are directly used according to ORB-SLAM2. 7 In the feature extraction module, object features are extracted from the color image provided by the RGB-D camera using YOLOv3. 16 Considering that object detection cannot accurately express the contours of objects, we utilize the depth image to geometrically segment the detected objects based on depth histograms. Then, combined with the point features, point-object association is executed to obtain the feature points on each detected object. After extracting the features of every frame, we track the features between the current frame and the previous frame; besides the point-point association, object-object interframe association is also executed. On this basis, the extracted point and object features as well as the association results are involved in the BA optimization process. With the help of the loop closing of ORB-SLAM2, SLAM is finally implemented. In the following, we address the PO-SLAM in detail.

Feature extraction
Low-level point features are combined with high-level semantic object features in our SLAM. The reader may refer to the study by Mur-Artal and Tardós 7 for point feature extraction; in this section, we focus on the extraction of object features.
Object features extraction. Object features, including the number, categories, and positions of objects, are favorable for the data association of SLAM due to the reliability of high-level features. In this article, YOLOv3 16 is utilized to detect the objects in each frame, where the deep network is trained on the MS COCO data set, which includes 80 categories of common objects. Object detection yields the bounding boxes, labels, and label confidences of the objects. Note that we only retain detections with a confidence of more than 70%.
Geometric segmentation. For object detection, the resulting bounding box surrounding an object cannot fit the actual boundary of the object completely, and some background information is inevitably contained. In this case, it is not easy to judge whether a feature point is on an object, which affects the determination of the object's position. Also, despite its good segmentation performance, instance segmentation based on deep learning is more time-consuming. A fast segmentation solution that extracts the foreground in the bounding box of an object is therefore required. Herein, a geometric segmentation based on the depth histogram is presented.
In a detection bounding box, there are only two types of pixels: background and foreground. They can be differentiated using depth information, which reflects the distance between an object and the camera, so a depth threshold separating the foreground from the background needs to be determined. Given the depth values of foreground and background, we utilize the Otsu threshold segmentation method, 24 which segments the depth values by maximizing the inter-class variance of the two parts. Otsu automatically determines the threshold but is sensitive to noise. In the depth map provided by the RGB-D camera, some pixels have a depth value of 0, which may be caused by pixels outside the depth range or by missed detection. Such zero-depth pixels should first be filtered out before calculating the depth threshold.
To obtain the geometric segmentation of an object, the depth image of the current frame is cropped according to the predicted bounding box, yielding the depth submap $d_{h\times w}$. After $d_{h\times w}$ is filtered, its values are scaled to $[0, 255]$, from which the threshold is acquired using the Otsu method. On this basis, the foreground and background corresponding to the object are separated. Here, $(b_l, b_b)$ and $(b_r, b_t)$ are the left-bottom and upper-right coordinates of the bounding box, respectively, with $h = b_b - b_t$ and $w = b_r - b_l$; $D_{th}$ refers to the depth threshold, and the segmentation mask is labelled as $M$. Figure 2 illustrates the segmentation result. Take the teddy bear in the original image from the TUM data set 25 (see Figure 2(a)) as an example. Figure 2(b) provides the detection result, and the depth histogram of the pixels in the bounding box is presented in Figure 2(c). One can see that the depth values are divided into two parts by the yellow dashed line corresponding to the depth threshold $D_{th}$; the parts to the left and right of the dashed line correspond to the depth values of the foreground and background, respectively. With the segmentation, the extracted foreground of the object in Figure 2(b) is given in Figure 2(d).
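The pipeline above (crop the depth submap, drop zero-depth pixels, scale to [0, 255], apply Otsu, map the threshold back to depth units) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the box convention `(b_l, b_t, b_r, b_b)`, and the min-max scaling are our own assumptions.

```python
import numpy as np

def otsu_threshold(values):
    """Otsu's method on integer values in [0, 255]: return the threshold
    that maximizes the inter-class variance of the two sides."""
    hist = np.bincount(values, minlength=256).astype(np.float64)
    total = hist.sum()
    cum_count = np.cumsum(hist)
    cum_mean = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, -1.0
    for t in range(256):
        n0, n1 = cum_count[t], total - cum_count[t]
        if n0 == 0 or n1 == 0:
            continue
        m0 = cum_mean[t] / n0                       # mean of the lower class
        m1 = (cum_mean[-1] - cum_mean[t]) / n1      # mean of the upper class
        var_between = (n0 / total) * (n1 / total) * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def segment_object(depth, box):
    """Foreground mask for one detection.
    depth: HxW depth image with 0 marking invalid pixels;
    box: (b_l, b_t, b_r, b_b) pixel bounds of the detection."""
    b_l, b_t, b_r, b_b = box
    sub = depth[b_t:b_b, b_l:b_r].astype(np.float64)
    valid = sub > 0                                 # drop zero-depth pixels first
    d = sub[valid]
    span = max(np.ptp(d), 1e-9)
    scaled = np.round((d - d.min()) / span * 255).astype(np.int64)
    t = otsu_threshold(scaled)
    d_th = d.min() + t / 255.0 * span               # threshold back in depth units
    mask = np.zeros_like(sub, dtype=bool)
    mask[valid] = sub[valid] <= d_th                # nearer side = foreground
    return mask, d_th
```

On a bimodal depth submap (a near object against a far wall), the returned mask keeps only the near side of the threshold.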

Data association
As a reflection of the common view between frames, data association is important in solving the camera poses and landmark positions of SLAM. In addition to the interframe association of point features used in ORB-SLAM2, 7 we also take into account the association between point features and object features within each frame, as well as the interframe association of object features.
Point-object association. As mentioned above, for each detected bounding box in each frame, the foreground image is separated using the depth information, and the feature points located in the foreground area are taken as the feature points corresponding to the object. The association of points and objects is used to calculate the point-object error in the subsequent BA optimization. Figure 3 gives an illustration of the association results for a selected image in fr2/desk of the TUM RGB-D data set. 25 The bounding boxes of different classes of objects are represented by different colors, and the color of the feature points belonging to an object is consistent with that of its bounding box. Notice that multiple object instances of the same class can be distinguished by the positions of their bounding boxes, and the green points, which do not belong to any detected object, are considered background. Points that fall within the bounding box of an object and whose colors match the color of the bounding box are regarded as the feature points associated with that object.
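The assignment rule (a point belongs to an object if it lies inside the bounding box and on the foreground mask) can be sketched as below. The dictionary layout and the first-match policy for overlapping boxes are illustrative assumptions.

```python
def associate_points_to_objects(keypoints, detections):
    """Assign each keypoint to the first object whose foreground mask covers it.

    keypoints:  list of (u, v) integer pixel coordinates.
    detections: list of dicts {'box': (b_l, b_t, b_r, b_b),
                'mask': 2-D bool array covering the box region}.
    Returns one object index per keypoint (-1 = background point).
    """
    labels = []
    for (u, v) in keypoints:
        owner = -1
        for j, det in enumerate(detections):
            b_l, b_t, b_r, b_b = det['box']
            inside = b_l <= u < b_r and b_t <= v < b_b
            if inside and det['mask'][v - b_t][u - b_l]:
                owner = j
                break  # first matching detection wins
        labels.append(owner)
    return labels
```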
Object-object association. Object-object association between two frames is similar to standard object tracking. Since the categories of the objects in each frame are known, we only need to consider the object categories that appear in both frames. First, the center $(u_c, v_c)$ of an object in the previous color image is unprojected to the world coordinate system using its depth $d_c$ and the camera pose $T_{cw,pre}$ of the previous frame. Then the 3-D position $P_c$ of the object center is projected onto the current image using the camera pose $T_{cw,cur}$ of the current frame:

$$P_c = \pi^{-1}\left((u_c, v_c), d_c, T_{cw,pre}\right), \qquad (u_c, v_c)_{proj} = \pi\left(T_{cw,cur}, P_c\right)$$

where $\pi$ and $\pi^{-1}$ represent the projection from 3-D space to the 2-D image and the unprojection from the 2-D image to 3-D space, respectively, and $(u_c, v_c)_{proj}$ refers to the projection of $P_c$ onto the current frame. After the projection of the object center on the current frame is acquired, we check the relationship between the projection and the bounding boxes in the current frame under the constraint of the same object label. If the distance between the projection and the center of a bounding box in the current frame is less than a given threshold, the corresponding two objects are considered a successful match.
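The unproject-project-match cycle can be sketched with a standard pinhole model as follows. The convention that $T_{cw}$ maps world to camera follows the text, while the data layout and the 30-pixel default threshold are our own assumptions.

```python
import numpy as np

def unproject(uv, depth, K, T_cw):
    """Back-project pixel (u, v) with depth d to a 3-D world point."""
    u, v = uv
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    p_cam = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth, 1.0])
    return (np.linalg.inv(T_cw) @ p_cam)[:3]

def project(P_w, K, T_cw):
    """Project a 3-D world point into pixel coordinates."""
    p_cam = (T_cw @ np.append(P_w, 1.0))[:3]
    return np.array([K[0, 0] * p_cam[0] / p_cam[2] + K[0, 2],
                     K[1, 1] * p_cam[1] / p_cam[2] + K[1, 2]])

def match_objects(prev_objs, curr_objs, K, T_prev, T_curr, max_dist=30.0):
    """prev_objs/curr_objs: lists of dicts {'label', 'center': (u, v), 'depth'}.
    Returns (i_prev, j_curr) pairs whose projected centers agree."""
    matches = []
    for i, po in enumerate(prev_objs):
        P_c = unproject(po['center'], po['depth'], K, T_prev)
        uv_proj = project(P_c, K, T_curr)
        best, best_d = None, max_dist
        for j, co in enumerate(curr_objs):
            if co['label'] != po['label']:   # same-label constraint
                continue
            d = np.linalg.norm(uv_proj - np.asarray(co['center']))
            if d < best_d:
                best, best_d = j, d
        if best is not None:
            matches.append((i, best))
    return matches
```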

Bundle adjustment
Combining point and object features, constraints with geometric and semantic relationships are constructed to optimize the camera poses and 3-D point positions. The sets of image frames, 3-D point positions, and objects in the world coordinate system are denoted as $I = \{I_k\}$, $P = \{P_i\}$, and $O = \{O_j\}$, respectively, where $k$, $i$, and $j$ are the corresponding indexes. A 3-D point either lies on an object or belongs to the background; we label the position of the $i$th point on the $j$th object as $^jP_i$. An object is represented by the points inside it together with its position, $O_j = \{\{^jP_i\}, C_j\}$, where $C_j$ is the 3-D position of the $j$th object.
From each frame, we observe the measurements corresponding to the 3-D points and objects. We use $o = \{o_{kj}\}$ and $z = \{z_{ki}\}$ to stand for the observations of the $j$th object $O_j$ and the $i$th point in the $k$th frame, with

$$o_{kj} = \left(c_{kj}, l_{kj}\right)$$

where $c_{kj}$ and $l_{kj}$ are the observations of the object position and class label, respectively. We denote by $^jz_{ki}$ the observation of the $i$th point on the $j$th object in the $k$th frame.
BA formulation. Our semantic optimization process can be described as the following problem: given the observations $\{z_{ki}\}$ of points and the observations $\{o_{kj}\}$ of objects $\{O_j\}$ in the $k$th frame, find the optimized camera pose $T^*_{cw,k}$ and the positions $\{P^*_i\}$ of the points, where $T_{cw,k} \in SE(3)$ converts 3-D points from the world coordinate system to the camera coordinate system, and $P_i \in \mathbb{R}^3$. In the BA process, optimization is performed by minimizing the errors between the predicted and measured values, which is a nonlinear least-squares problem. Our measurement errors consist of the point-point error, point-object error, and object-object error, and the optimization function can be formulated as

$$\{T^*_{cw,k}, P^*_i\} = \arg\min_{T_{cw,k},\, P_i} \sum_{k,i} \left\|e_{pp}(T_{cw,k}, P_i, z_{ki})\right\|^2 + \sum_{k,i,j} \left\|e_{po}(T_{cw,k}, {}^jP_i, o_{kj})\right\|^2 + \sum_{k,j_1,j_2} \left\|e_{oo}(T_{cw,k}, C_{j_1}, C_{j_2}, o_{kj_1}, o_{kj_2})\right\|^2 \quad (3)$$

where $e_{pp}(\cdot)$, $e_{po}(\cdot)$, and $e_{oo}(\cdot)$ represent the error between the point projected on the image by the camera pose and the observed point for $P_i$, the error between the projected point and the 2-D bounding box, and the error between two objects, respectively. In this article, the Levenberg-Marquardt method is adopted to solve this problem.
Error functions. Point-point error. With the ORB features, the point-point error (i.e. the re-projection error) is given by 7

$$e_{pp} = z_{ki} - \pi\left(T_{cw,k}, P_i\right)$$

Point-object error. Based on the point-object data association, we obtain the points $\{^jP_i\}$ that belong to an object $O_j$. Theoretically, after these points are projected into the current frame, they should fall into the corresponding 2-D bounding box of the object $O_j$, but that is not always the case. Our point-object error for the point $^jP_i$ is

$$e_{po} = \left(err_u, err_v\right), \quad err_u = \max(b_l - u_{proj}, 0) + \max(u_{proj} - b_r, 0), \quad err_v = \max(b_t - v_{proj}, 0) + \max(v_{proj} - b_b, 0)$$

where $(u_{proj}, v_{proj})$ is the projected pixel coordinate of $^jP_i$, and $err_u$ and $err_v$ are the u-axis and v-axis errors between the projected point and the 2-D bounding box. It shall be noted that when the projection point is inside the detected bounding box, the cost is always zero, and thus this constraint is relatively coarse; only when the projection point falls outside the detection box does the penalty take effect.
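The zero-inside-the-box behavior can be sketched with a clamped penalty as below; the max-based form is our own assumption consistent with the description (zero inside the box, growing linearly once the projection leaves it), not necessarily the authors' exact formula.

```python
import numpy as np

def point_object_error(uv_proj, box):
    """Point-object residual: zero while the projected point lies inside the
    detection bounding box, linear once it leaves it.
    box = (b_l, b_t, b_r, b_b) in pixels."""
    u, v = uv_proj
    b_l, b_t, b_r, b_b = box
    err_u = max(b_l - u, 0.0) + max(u - b_r, 0.0)   # horizontal overshoot
    err_v = max(b_t - v, 0.0) + max(v - b_b, 0.0)   # vertical overshoot
    return np.array([err_u, err_v])
```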
Object-object error. Through the point-object data association, we acquire the feature points belonging to each object as well as the corresponding 3-D points. The coordinate centroid of these 3-D points is used as the 3-D position of the object, and the coordinate centroid of the corresponding ORB feature points in the image is used as the observation of the object position:

$$C_j = \frac{1}{N}\sum_{i=1}^{N} {}^jP_i, \qquad c_{kj} = \frac{1}{N}\sum_{i=1}^{N} {}^jz_{ki}$$

where $N$ is the number of points anchored to the object $O_j$. The relative position between two objects is constrained by distance and orientation. To formulate this, we connect the positions of the two objects into an abstract line segment, so that the distance and direction constraints become the invariance of the length and direction of that segment. We define $c_{kj_1}$ and $c_{kj_2}$ as the observations of the positions of objects $O_{j_1}$ and $O_{j_2}$ in the $k$th frame, respectively; correspondingly, $C_{j_1}$ and $C_{j_2}$ represent their 3-D positions. Following Hartley and Zisserman, 26 we define $c^h_{kj_1}$ and $c^h_{kj_2}$ as the homogeneous coordinates of $c_{kj_1}$ and $c_{kj_2}$ for the parameterized representation of the line segment. The line through $c_{kj_1}$ and $c_{kj_2}$ can then be expressed as

$$l = c^h_{kj_1} \times c^h_{kj_2}$$

According to the direction invariance constraint, the projection points of objects $O_{j_1}$ and $O_{j_2}$ should be located on the line $l$, so the direction error can be written as

$$e^{dir}_{oo} = \left|l^T p^h_1\right| + \left|l^T p^h_2\right|$$

The length invariance of the line segment indicates that the distance between the projected points should equal that between $c_{kj_1}$ and $c_{kj_2}$. The distance error is then

$$e^{dis}_{oo} = D\left(p_1, p_2\right) - D\left(c_{kj_1}, c_{kj_2}\right)$$

where

$$p_1 = \pi\left(T_{cw,k}, C_{j_1}\right), \qquad p_2 = \pi\left(T_{cw,k}, C_{j_2}\right)$$

$p_1$ and $p_2$ represent the 2-D pixel coordinates of the projections of $C_{j_1}$ and $C_{j_2}$ in the image, $p^h_1$ and $p^h_2$ are their homogeneous coordinates, and $D(\cdot)$ refers to the Euclidean distance between two pixels. $e^{dir}_{oo}$ and $e^{dis}_{oo}$ constitute the object-object error function.
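The two residuals can be sketched as follows. The line is the homogeneous cross product of the two observed centers; normalizing $l$ by its first two components so that $l^T p^h$ becomes an actual point-line distance is our own choice, made for illustration.

```python
import numpy as np

def homogeneous(p):
    """Lift a 2-D pixel to homogeneous coordinates (u, v, 1)."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def object_object_error(c1_obs, c2_obs, p1_proj, p2_proj):
    """c1_obs, c2_obs: observed 2-D object centers in the frame;
    p1_proj, p2_proj: projections of the 3-D object positions under the
    current pose estimate. Returns (direction error, distance error)."""
    # line through the two observed centers: homogeneous cross product
    l = np.cross(homogeneous(c1_obs), homogeneous(c2_obs))
    l = l / np.linalg.norm(l[:2])        # so l.p is a point-line distance
    # direction error: both projections should lie on the observed line
    e_dir = abs(l @ homogeneous(p1_proj)) + abs(l @ homogeneous(p2_proj))
    # distance error: the segment length should be preserved
    e_dis = abs(np.linalg.norm(np.subtract(p1_proj, p2_proj)) -
                np.linalg.norm(np.subtract(c1_obs, c2_obs)))
    return e_dir, e_dis
```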

Experiments and results
In this section, we will evaluate the localization performance of our approach and conduct the comparison with ORB-SLAM2.

Experimental setup
We adopt the TUM RGB-D SLAM data set and benchmark 25,27 to test and validate the approach. The TUM data set consists of different types of sequences, which provide color and depth images with a resolution of 640 × 480 captured by a Microsoft Kinect sensor. YOLOv3 scales the original images to 416 × 416. Considering the objects of interest, such as book, keyboard, mouse, TV monitor, cup, cell phone, remote, bottle, teddy bear, and potted plant, 10 sequences related to office environments are selected.
We adopt the following evaluation metrics 27 : the root-mean-square absolute trajectory error (ATE) and the mean relative pose error (RPE), where ATE quantifies the difference between the points of the estimated trajectory and their ground truths, whereas RPE assesses the local accuracy of the estimated poses over a fixed interval. All experiments are repeated five times, and the median of the five results is taken as the final result. To clearly demonstrate the improvement of our method, the relative improvement of ATE with respect to ORB-SLAM2 is also reported, following Lianos et al. 11
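For concreteness, the two headline numbers (RMSE ATE over corresponding trajectory points, and the percentage improvement over a baseline) can be computed as below. This sketch assumes the trajectories are already time-associated and aligned, which the TUM benchmark tools normally handle.

```python
import numpy as np

def ate_rmse(est, gt):
    """Root-mean-square absolute trajectory error between two aligned
    trajectories given as N x 3 arrays of corresponding positions."""
    diff = np.asarray(est, dtype=float) - np.asarray(gt, dtype=float)
    return np.sqrt((diff ** 2).sum(axis=1).mean())

def relative_improvement(ate_baseline, ate_ours):
    """Percentage reduction of the error with respect to the baseline."""
    return (ate_baseline - ate_ours) / ate_baseline * 100.0
```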

Experiment on the TUM RGB-D data set 25
Tables 1 and 2 give the comparison of our PO-SLAM and ORB-SLAM2 over 10 sequences. To better assess our approach, we also consider two other methods, PO-SLAM1 and PO-SLAM2, corresponding to PO-SLAM without the point-object error in (3) and PO-SLAM without the object-object error in (3), respectively. Note that the first seven sequences describe static scenes, whereas the last three are related to dynamic scenes. As can be seen in Tables 1 and 2, our PO-SLAM achieves an improvement of up to 10.46% in ATE and up to 10.95% in RPE compared with ORB-SLAM2. Overall, our three methods perform better than ORB-SLAM2 in both ATE and RPE for most of the sequences, and PO-SLAM performs best. Figure 4 depicts the comparison of the trajectories obtained by PO-SLAM and ORB-SLAM2 on four sequences against the ground truth; our trajectories are closer to the ground truth than those of ORB-SLAM2. Note that all ORB features extracted by ORB-SLAM2 are used in our point-point error. From Tables 1 and 2, our method also demonstrates better adaptability to dynamic environments. Figure 5 illustrates a performance comparison of PO-SLAM and ORB-SLAM2 on the fr3/walking_xyz dynamic sequence. 25 Clearly, ORB-SLAM2 fails to track at frames 696 and 768, while PO-SLAM remains in SLAM mode with enough matching points to the previous frame.
The average running time per frame of PO-SLAM is shown in Figure 6 for the 10 sequences of the TUM RGB-D data set. The average time is 71.47 ms, i.e. a speed of about 14 fps, which meets the real-time requirement.

Conclusions
In this article, we propose a semantic visual SLAM approach that combines 2-D object detection and ORB feature points, with additional semantic constraints introduced into the BA optimization process. An object segmentation approach combining object detection with the depth histogram of the 2-D bounding box is used to associate feature points with their corresponding objects. Besides, the correlation between any two detected objects within the field of view of each frame is also introduced. Experimental results on the TUM RGB-D data set indicate that our approach improves accuracy and robustness compared with ORB-SLAM2.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.