Smart three-dimensional processing of unconstrained cave scans using small unmanned aerial systems and red, green, and blue-depth cameras

This article focuses on a novel three-dimensional reconstruction system that maps large archeological caves using data collected by a small unmanned aircraft system with red, green, and blue-depth cameras. Cave sites often contain the best-preserved material in the archeological record. Yet few sites are fully mapped. Large caves environment usually contains complex geometric structures and objects, which must be scanned with long overlapped camera trajectories for better coverage. Due to the error in camera tracking of such scanning, reconstruction results often contain flaws and mismatches. To solve this problem, we propose a framework for surface loop closure, where loops are detected with a compute unified device architecture accelerated point cloud registration algorithm. After a loop is detected, a novel surface loop filtering method is proposed for robust loop optimization. This loop filtering method is robust to different scan patterns and can cope with tracking failure recovery so that there is more flexibility for unmanned aerial vehicles to fly and record data. We run experiments on public data sets and our cave data set for analysis and robustness tests. Experiments show that our system produces improved results on baseline methods.


Introduction
It has long been recognized that cave sites often contain the best-preserved material in the archeological record. Cave archeology has developed its methodologies for mapping and recording sites, yet few sites are mapped to true threedimensional (3D) models because it is a slow and tedious process for archeologists to record and bookkeep caves. They need to incrementally set up baseline along the cave and then measure the distance from the baseline to cave walls or objects of interest and mark walls or objects in a two-dimensional (2D) map by hand. 1 This slow process has a major negative impact on cultural relic preservation. Typically, archeological teams will visit a site and begin to record it in 1 year, but when they come back to finish data collection, it has been looted, artifacts stolen, architecture destroyed, and the archeological record disturbed. Therefore, archeologists need a faster, more efficient method of surveying and recording the sites.
To accelerate cave mapping, a system that can automate the process needs to be developed. In this system, we first focus on building globally consistent and accurate 3D models using red, green, and blue-depth (RGB-D) data recorded with unmanned aerial vehicles (UAVs). This falls into the research area of visual simultaneous localization and mapping (vSLAM). There are still many challenges in making a well-working vSLAM system that emphasizes surface estimate accuracy. 2,3 Major difficulties are from three sources. First, the camera tracking error accumulation problem can be considered unavoidable due to the incremental nature of any vSLAM system. Second, it is hard to find all the important loops. Intuitively, the more loops in data, the more information can be used to recover precise camera trajectory and the 3D model. But, in practice, when running existing vSLAM systems on data sets with loopy motion at different scales, mismatches can always be found when scanned more than once. This means that loops are not successfully detected, indicating that the recovered camera trajectory is not precise enough. Third, an effective and optimal way of optimizing loop closure in a dense vSLAM system is not yet attempted, due to perhaps the cost of reintegrating dense models.
To get better camera tracking and dense mapping accuracy, researchers tried different ways. In Lefloch et al., 4 curvature information was added into the frame-to-model iterative closest point (ICP). To detect those loops, most vSLAM systems [5][6][7] use bag of words (BoW), 8 but it is well known that it is not very reliable under lighting conditions or viewing angle changes. Since the BoW only matches sparse features from images but cannot fully utilize all camera observation data and spatial information. On the other hand, vSLAM systems tend to add the loops very conservatively to reduce the severe influence of the false loops, thus many important loops may not be connected. Even after loops are successfully detected, there is still another problem in the dense vSLAM system: how to correct reconstructed surface optimally. Since most dense vSLAM systems 9,6,10 use a frame-to-model fusion process, which makes it difficult to quantify, isolate, and remove the influence of past camera data, and it is also computational expensive for a full camera data sequence refusion. Whelan et al. 10 suggested forming a deformation graph across the reconstructed dense model to deform its surface to connect the loop. When the loop area is large, the model may not be deformed optimally, since past camera observations are not reused to manipulate the 3D model. They assumed that the scenes are elastic, but in reality, they are mostly rigid.
Motivated by the fact that humans can notice mismatches in 3D models very easily by looking at the spatial displacement of surfaces. We propose to resolve mismatches directly by closing surface loops to get a consistent 3D model and a precise camera trajectory estimate in the vSLAM system. After surface loops are detected, instead of optimizing surface directly to propagate correction introduced by surface loop, in this article, the surface loop correction is done through sparse feature bundle adjustment (BA), so that all the past camera poses can be corrected based on their observations. By running extensive experiments on different data sets, we observed that combining sparse features with surface loop closure can produce better results. Not only 3D models get improved, but also camera trajectories estimate becomes more accurate. This is because our framework can detect loops in the dense surface domain and optimize loops in the sparse feature domain. Note that our framework can detect surface loops, yet other means of detecting loops can still be utilized.
In the following, we summarize the key contributions of our method: 1. We propose a novel 3D reconstruction system that corrects surface loops with sparse feature-based BA. We demonstrate that this novel system can give much-improved camera tracking and dense modeling results. 2. We propose a fast 3D surface-based loop detection method, which is based on a new compute unified device architecture (CUDA)-accelerated point cloud registration algorithm. Experiments show that it is fast and accurate. 3. We propose a novel objective function for surface loop filtering with a sparse feature-based optimization graph. This graph is more robust to different scan patterns and can cope with tracking failure and recovery so that there is more flexibility for UAVs to fly and record data. In addition to the flexibility, experiments show that it performs better than stateof-art methods when only a limited number of loops are detected.

Related work in visual SLAM
Visual SLAM has been studied actively by researchers from different fields, such as robotics, computer vision, and computer graphics. They solve this problem with their emphases and preferences, which lead to diverse visual SLAM systems. Sparse feature-based SLAM systems are well developed because sparse features can be used to downsample data from sensor reading (e.g. images) to sparse data representation as image keypoints and feature vectors, which means less computation since data from different frames are matched solely based on feature vectors of their keypoints. Extended Kalman filter or particle filter-based filtering approaches 11,12 can take keypoints as visual landmarks and solve vSLAM as a data filtering process. A drawback of this approach is that the filter cannot be reoptimized again based on all previous data. Then, maximum a posteriori (MAP)-based approaches are used to optimize all observed camera data in a batch setting, 7 which utilizes BA from Structure from Motion, 13,14 and to get better accuracy. 7 To run BA for loop closure, the loops need to be detected. In the sparse image feature setting, BoW-based loop closure is widely used. But it gives a high portion of false loops, which can severely degrade the performance of a vSLAM system, so a very strict loop filtering is often used, 7 where many loops are rejected. This causes a big problem when there are many loopy motions in camera movement.
Another line of vSLAM research focuses on surface reconstruction. With the parallel processing power of graphics processing unit (GPU), Newcombee et al. proposed KinectFusion 9 which performs real-time dense 3D camera tracking and model fusion. It has a volumetric scene representation, which can be rendered to a depth map at a given camera pose. Tacking is done through a frame-to-model projective ICP, which is parallelized on GPU for real-time performance. Finally, new camera data are fused into the volumetric model using a running average. KinectFusion can be considered of fusing very local loops together using the model it maintains as a proxy, but it does not close large loops. To close large loops, it is important to detect loops and find relative poses between loop areas. In BoW, image keypoints and features are used for both loop detection and relative pose generation, but in dense 3D systems, there is no such comparably reliable point cloud feature. Whelan et al. use BoW in a dense SLAM system called Kintinuous, 6 which is an extended KinectFusion system. Later, to better solve the loop detection and optimization problem, ElasticFusion 10 proposed to use ICP to find relative poses of potential loops, which are proposed by two sources of information: spatial prior and appearancebased place recognition. Our work shares similarities with Whelan et al., 10 but we propose a different approach and underlying algorithms, instead of using projective ICP, which highly depends on initialization. We propose a GPU-based global point cloud registration method to detect loops with other prior information from sparse feature alignment. After surface loop was detected, in Kintinous, a pose graph of keyframes was utilized, while the authors mentioned that mesh deformation was required to get smooth 3D models, which indicates loop correction is not done optimally. In Elas-ticFusion, the pose graph is replaced by a deformation graph distributed inside the dense model. This deformation graph does not have a backing physical meaning, because most of the scenes scanned are not elastic. In our framework, we utilize BA to have a MAP correction of all past keyframes, which is theoretically optimal. BundleFusion 14 used BA to optimize loops, but they do not close surface loops. Instead, they close sparse feature loops and only use the dense surface for feature correspondence search and tracking.

Related work in loop detection
Handcrafted image features are widely used in loop closure detection. Cummins and Newman propose Fast appearance based mapping (FAB-MAP), 15 which is a probabilistic framework for navigation using only appearance data. Angeli et al. propose an online method to run visual word-based loop detection within the framework of an online image retrieval task. 16,17 Williams et al. 18 describe a relocalization module, in which relocalization is performed by a randomized lists classifier to establish landmark correspondences in the image and then random sample consensus (RANSAC) to determine the pose robustly from these correspondences. Another widely used approach, distributed BoW, 8 is proposed by Galvez-Lopez and Tardos, which use BoW for visual place recognition with features from accelerated segment test (FAST) keypoint detector and Binary Robust Independent Elementary Features (BRIEF) with a binary vocabulary tree.
There are new attempts to make further improvements, including modifications on the BoW 19-21 and detection by image sequences. [22][23][24] There are also explorations to detect loops with more than a single image. 25,23,26 Some works try to improve loop detection precision with loop verification threshold learning in RANSAC based on geometric. 27 Some others try to make improvements for special environment. 28,29 After, convolutional neural networks (CNNs) dominate computer vision-related tasks6. There are research works that try to build loop detection methods with CNN features for similarity search. [30][31][32]26 Another line of work builds special CNN architectures for loop detection problem. 33,33 Overview of the proposed system The method proposed aims at producing accurate 3D models by detecting and optimizing surface loops in sparse feature-based visual SLAM systems. By adding surface loop closure into a vSLAM system, we can get globally consistent and optimal 3D models. With our proposed algorithms, we build a novel system with five components: (1) tracking, (2) surface model fusion, (3) fast surface loop detection, (4) surface loop filtering, and (5) loop optimization.

Tracking
We employ the tracking part from ORB-SLAM2, 7 which is a very wellimplemented sparse feature-based SLAM system. Inside this tracking module, the Oriented FAST and Rotated BRIEF (ORB) features are extracted for keypoint matching. Then frames are tracked against keyframes with motion estimate and then refined with a local sparse map. Keyframes are generated when tracking is weak, or the local BA thread is free. Local BA is used to correct the reprojection error of feature correspondences among covisible keyframes in a background thread. This tracking module provides camera poses for each frame and a covisibility graph across keyframes.

Surface model fusion
We fuse surface models on a GPU using surfels as a map representation similar to the literatures. 34,10 Each surfel has a position p, normal vector n, radius r and confidence c. We fuse keyframes within every k frames (k ¼ 50 for all experiments) into a surface fragment. These surface fragments are generated for two reasons. One is to integrate out raw RGB-D data noise. Another one is to reduce the number of 3D pieces, so that loop detection computation is accelerated. After each scene fragment is generated, for the later optimization process, it is linked to the keyframe whose timestamp is closest to the first frame within the range.

Fast surface loop detection
The proposed surface loop detection is done by point cloud registration on surface fragments. To detect surface loops effectively, the covisibility graph from tracking is utilized to prefilter covisible surface fragment pairs that are already connected. Thus, a majority of unnecessary computation can be avoided. To detect surface loops efficiently, we propose using CUDA to accelerate point cloud registration. Details of this acceleration are described in the next section.

Surface loop filtering
After loop candidates are detected, they need to go through a novel loop verification algorithm, so that false loops would not diverge the subsequent optimization process. This is in the surface loop filtering section.

Loop optimization
After surface loops are detected and verified, the loop pairs are used to connect pose graph vertices and also trigger more image loop detection, which again uses spatial prior and ORB feature matching. Then, the pose graph is optimized to give a coarse pose correction and then a full BA is performed to get MAP optimally fine-tuned camera trajectory estimate. Details are in the loop optimization section.

Fast surface loop detection
To formulate surface loop detection formally, we denote a surface fragment as a set of points P with their normal. Then the surface fragment-based loop detection problem can be solved by point cloud registration methods. In a 3D reconstruction system, it is desired to get results quickly. In Elas-ticFusion, Whelan et al. resort to projective ICP, which can be performed very quickly on GPU. But, when surface mismatches are big, the initialization-dependent nature of ICP makes it difficult to converge to the right solution. So we turn to global point cloud registration in our framework, which does not depend on initial alignment at all. Our algorithm is based on RANSAC and inspired by Choi et al., 35 which uses a method that is modified from Point Cloud Library. 36 The major differences are that we accelerated the most timeconsuming parts using GPU programming with an efficient nearest neighbor (NN) search method. Traditionally, RAN-SAC is formulated as an iterative process with proved convergence. 37 But different iterations and different hypotheses can be considered to be totally independent of each other. This means different hypotheses can be mapped to different processing cores to be tested in parallel.
As shown in Algorithm 1, point clouds are downsampled to the resolution of the typical precision of RGB-D sensors to reduce unnecessary computation. Fast point feature histograms (FPFH) features are extracted for each point in P and Q for point correspondence pairs generation. To make the NN search more efficient, we precached all the NNs of P in Q using FPFH feature distance. Then, for each hypothesis, four points are randomly sampled from P, and their correspondences are found through the precached NNs. After that, a pre-rejection step, which rejects hypotheses whose point pairs cannot make a similar polygon, is performed. t is a similarity threshold and set to 0:9 in all our experiments. Then, hypotheses testing, which is the most time-consuming step, tests both inlier ratio and fitness score on GPU. We test each hypothesis on a thread block with efficient parallel reductions. For the NN search during hypotheses testing, we utilized a 3D grid to replace the k-d tree to fit the special need of GPUs, since a GPU will slow down when different threads go to different code branches during the k-d tree search. We propose using a 3D grid for the NN search, given that the point clouds to be searched can be stored in a limited volume. This guarantees that we can use a grid with a limited size for the NN search without jeopardizing search accuracy. When a point is stored into the search grid, the indices ði x ; i y ; i z Þ of its cell are calculated by the following equation where l c is the cell edge length of the NN 3D grid. The ðx c ; y c ; z c Þ is the coordinate of the center of target point cloud. It is subtracted so that the translation of the point cloud does not affect searching. When a point wants to query its NN, the search is accomplished through a table looking up, which has a time complexity of Oð1Þ, given cell edge length the same as point cloud downsample resolution. It is faster than the k-d tree which has a OðlognÞ time complexity. We observed speedup by only replacing the k-d tree with a 3D search grid in CPU only code. More importantly, there is no branching during searching, so it fits much better on a GPU than the k-d tree.

Surface loop detection evaluation
We run experiments to compare speed, recall, and precision performance with the baseline method: point cloud registration part of Choi-Zhou-Koltun (CZK) on redwood pairwise registration evaluation data set by Choi et al. We report results in Table 1. Recall and precision for CZK are from reports of article authors, but speed is measured on our machine. We use the same hyper-parameter values as in the published code of CZK: 0.05 m as point cloud downsampling leaf size, 0.1 m as normal estimate radius, 0.25 m as FPFH feature estimate radius, 4,000000 as hypotheses count, and 0.075 m as maximum point correspondence distance. We use an Intel i7-6850 K (https://ark.intel. com/content/www/us/en/ark/products/94188/intel-core-i7-6850k-processor-15m-cache-up-to-3-80-ghz.html) clocked at 3.6 GHz and an NVIDIA Titan X Pascal (https://www. nvidia.com/en-us/geforce/products/10series/titan-x-pascal/) for our evaluation.
Observations. Our global point cloud registration can finish in around 20 ms, which is around 366 times faster than CZK as in Table 1. With this speed, it can run at 50 Hz, which means we can process more loop candidates. Our recall performance is not compromised for speed. Instead, it is even better than all these two methods. From Table 1, the low precision is an issue that needs to be solved, so an effective loop filtering method is proposed and described in the following section.

Surface loop filtering
The results out of the surface loop detection algorithm in the previous section have a low precision problem. Choi et al. proposed to use a line process-based optimization to solve it. 35 But that method requires scanning to be tracked fully successfully from the beginning to the end, and each surface fragment cannot be empty. However, during a long scanning session, it is almost impossible to guarantee that RGB-D cameras always have surfaces observed within their effective range. A failure to maintain that will lead to an empty surface fragment. Thus, the graph vertices are potentially divided into more than one subgraph, which leads to erroneous optimization results. A similar problem will occur when tracking failure is present in the tracking part, which will break the optimization graph as well.

20.0%
The best performance is marked in bold.

Algorithm 1. Global Point Cloud Registration on GPU
To address this problem, we propose to minimize (2) instead, which has a supporting optimization graph that is always fully connected whenever sparse features are matched from either tracking or failure recovery The objective function optimizes over a set of point clouds X ¼ fX v 2 E 3 g estimated from sparse feature key points, camera trajectory T ¼ fT cw 2 SEð3Þg, and a set of switch variables S ¼ fs ij 2 Rg. The first term in eq. (2) builds a least-square optimization graph for BA from sparse keypoint observations, where X v is the 3D coordinate of a point visible in the cth camera.x c v is the 2D pixel observation coordinate of the 3D point X v . K is the camera intrinsics matrix. S is the covariance matrix associated to the scale of the keypoint. The second term is surface loop connections with switchable constraints 38 with s ij as a switch variable for surface loop connection that connects keyframes i and j. Let P i be the surface fragment referred by keyframe i, K ij is the set of NN correspondence pairs between T À1 iw P i and T À1 jw P j that are within distance e ¼ 0.05 m, which is typical noise level of RGB-D sensor.
In eq. (3), Cðs ij Þ is a switch function and we use a linear function Cðs ij Þ ¼ s ij as suggested in Sunderhauf and Protzel. 38 ⍀ ij is information of switchable prior constraints. It controls the influence of the loop candidate tested. Let T ij be the transformation that aligns all the p j to related p i in K ij , that is To accelerate computation of eq. (5), we follow an approximation proposed in Choi et al. 35 The local parameterization of T iw T À1 jw T À1 ij is represented with a six-dimensional vector x ¼ v; t ð Þ T ¼ a; b; g; x; y; z ð Þ T , which consists of three rotational angles a; b; g and three translation components x; y; z. Since its rotation is small under the assumption that the registration result is good and camera pose estimates are not far away from the correct solution, the approximations sinðqÞ % q and cosðqÞ % 1 are utilized for all a; b; g. Then Àb a 1 z Plug eq. (6) into eq. (5) X where where where x pi ; y pi ; z pi are the three components of p i . Then our eq. (3) becomes After optimizing eq. (2), keyframes i and j should be connected as a loop if the optimized value of switch variable s ij is greater than a threshold.  Table 2. Another set of experiments only consider point cloud pairs that are not covisible in ORB-SLAM2 tracking results as loop detections, which is denoted as Only non-covisible pairs in Table 2. A pair of scene fragments is covisible if there are any covisible frames between two sets of frames contained in the two scene fragments. This only non-covisible set of loops is more close to practical use cases because the covisible pairs can be ignored for computational speed consideration and have been well connected in SLAM systems, already.

Surface loop filtering evaluation
Observations. When only non-covisible pairs are presented, the proposed method outperforms CZK in average precision, while recall is only 2% less. For office 1 sequence, our method gives better results in both precision and recall, while CZK output only 62.5% precision. Such a low precision usually causes serious problems because too many false loops go in the loop optimization step. For the other three sequences, our method gives competitive results. When all pairs are considered, CZK performs so well that our method is closely under it. The difference is mainly in recall, while precision difference is very small. Considering precision is more important than recall for loop optimization, we would like to note that this minor difference rarely impacts on the final loop optimization result.
In addition to the cases where our method performs better, note that our method is less strict to use in practical scenarios, especially on a fast-moving UAV platform. Because our method does not require maintaining RGB-D cameras facing surfaces all the time, which is required by CZK. This requirement difference is inherently implied by underlying optimization pose graphs.

Loop optimization
After a surface loop passing verification, map optimization is followed to reduce mismatches and errors. We employ pose graph optimization and new data association and finally run a full BA to get a MAP-based optimization to correct the camera trajectory estimate and thus improve the 3D model.
When a surface loop pair fi; jg passes loop verification, we then try to find data association in the sparse feature domain. This is done by first retrieve all the covisible keyframes for both keyframes i and j noted as F i and F j . Then collect all the local sparse map points fX p g i and fX p g j that are observed by F i and F j , respectively. Then data association between fX p g i and fX p g j are constructed by both distances in image feature space and Euclidean space. After that, we run RANSAC to filter out outliers in the matches and also improve the transformation T ij . If this process converges with enough inlier matches, we add a loop connection for keyframes i and j. And update keyframe connections by merging the observations of matched sparse map points. These merged map points will help improve during the final BA process. If it does not converge, we still add a loop connection with the T ij . We do this because in some cases, even though the areas related to keyframes i and j do not have enough sparse map points for a good matching result, a loop connection can lead to a better initial start point for BA.
Finally, a full BA is performed for all the keyframes and observed map points by optimizing the following equation

Evaluation of the full 3D reconstruction system
In addition to experiments in the fast surface loop detection section and surface loop filtering section, extensive experiments are performed to evaluate the proposed full 3D reconstruction system on multiple data sets: our Maya cave data set, augmented Imperial College London and National University of Ireland Maynooth (ICL-NUIM), 35 Technical University of Munich (TUM) RGB-D data set, 39 Scene understanding 3D (SUN3D) data set, 40 and some other public data sequences. Comparisons are made with other online and offline methods. There are many SLAM algorithms and implementations. Here, we choose baseline methods in a way that they can best show the characteristics

Maya cave data set
The Maya cave data set is a data set collected by a team of archeologists using an RGB-D sensor Kinect V1 in Maya caves at Las Cuevas, Belize. Light-emitting diode lights are used to light up the environment as in Figure 1. In this data set, caves are scanned with a loopy motion for more loop optimization.

Experiment design and baseline methods.
We evaluate modules of both tracking and loop closure from different approaches on the data we collected. For RGB-D data, there are two major different camera tracking methods, which are sparse feature-based and dense frame-to-model approaches. We choose the ORB-SLAM2 as the implementation for the sparse feature tracking and the ElasticFusion for the dense frame-to-model implementation. In experiments for tracking, both implementations have their loop closure disabled so that the difference can reflect tracking performance. In tracking of ElasticFusion, there is one important parameter that controls the weight of RGB in tracking. We use the default 10 for RGB-D tracking, and a number greater than 100 can disable RGB completely so that the tracking is totally based on the depth. For loop closure, there are three different types. ElasticFusion is using Ferns-ICP-based approach. ORB-SLAM2 utilized BoW. Our point cloud registration-based surface loop detection approach is the third type compared.
Observations on tracking. From Figures 2 to 5, we can see that RGB-D tracking of ElasticFusion does not work well on cave data, especially compared with when it uses depth only. We think this is due to the moving light source. RGB-D tracking calculates a transformation matrix partially by minimizing the intensity difference of two aligned images, which assumes the lighting condition of scenes is static.  The ElasticFusion depth tracking working quite well in most sequences excepts on the chamber-floor-walking data sequence. We can see in Figure 2(b) that half of the floor data get rotated around 90 clockwise. It is almost impossible for a dense direct tracking to recover from the error; due to that, there are no strong correspondences. The tracking of ORB-SLAM2 performed very well in all the data sequences, and it provides the possibility of globally optimize the map. The robustness of feature-based tracking implemented by ORB-SLAM2 is the reason that we use it as our tracking module.

Observations on loop detection and optimization.
When we compare the performance of loop closure, our surfacefocused method performs the best. It connects important loops in all four sequences. The difficult data are the  chamber-floor-walking one, as shown in Figure 2. Neither ElasticFusion nor ORB-SLAM2 tracks the camera trajectory correctly. Even after their loop closure, mismatches are still significant. Our method shows its robustness by reconstructing consistent 3D models on all data.

Augmented ICL-NUIM data set
We use the augmented ICL-NUIM data set 35 to quantitatively analyze the performance of our system. This data set is a synthetic data set with ground truth surface models and camera trajectories. It has two indoor scenes: a living room and an office, and four RGB-D sequences, two sequences for each scene.  Tables 3 and 4.
Since different systems use different ways to fuse 3D models, for a fair comparison, we fuse 3D models using Elas-ticFusion using the same parameters with a camera trajectory estimate from each system. We use truncating depth distance of 4 m and 10 as the surfel confidence threshold for fusion.
Observations. From Tables 3 and 4, our system can give best results on all data sequence in terms of both trajectory and surface estimation accuracy. To give a more informative  comparison, we report an error map of the reconstructed model in Figure 6 on Living room 1 data sequence. We can see our results that have the lowest error across the whole model. Our method performs better because more loops get detected for improving the final loop optimization results. In ORB-SLAM2, the loop detector has trouble detecting some important loops because its consistency check can hardly be satisfied due to very local view overlappings. On the other hand, our surface loop detector does not have this problem. Compared to CZK, the performance gain comes from the advantage of sparse feature-based BA optimization, which can produce MAP results, thus more accurate camera trajectories and better reconstructed 3D models.

SUN3D data set
The SUN3D data set 40 is a large-scale RGB-D database that captures many places. It contains many data sequences. And there are eight sequences (http://sun3d.cs.princeton. edu/listNow.html) that are labeled with object annotations and widely used for evaluating SLAM and 3D reconstruction systems. Since there is no ground truth available, we follow this practice and run experiments on these sequences. We show qualitative results in the form of screenshots of reconstructed 3D models in Figure 7.
For the sequences in Figure 7, we highlight the mismatches in tracking results so that readers can better compare them with our results. For sequences harvard_c5, harvard_c6, and harvard_c8, there are no loops detected on top of tracking. So we do not include screenshots for them.
Observations. SUN3D data sequences are scanned with very loopy motion in some areas but only once for some other scene parts, thus is considered more difficult. Even though our method does not produce perfect reconstructed 3D models, it dramatically removed some significant mismatches. Also, our results are on par or better than other methods. Interested readers may compare with the results of CZK on this webpage (http://redwood-data.org/indoor/ models.html).

TUM RGB-D data set
The TUM data set 39 is an RGB-D data set that is commonly used to evaluate SLAM systems. This data set has 39 RGB-D sequences recorded in office and industrial environments with a large variety of camera motions and scenes. Along with RGB-D sequences, ground truth camera trajectories that are recorded with a motion capture system are also available. Following common practice, we run experiments only on sequences that are commonly used for SLAM evaluation. 7 Observations. From the results listed in Table 5, it shows that for data sequence fr1/desk, fr1/room, fr2/desk, and fr3/ office, our method makes improvement on tracking and achieves competitive results comparing full ORB-   SLAM2. Among these sequences, fr1/room has high absolute trajectory error (ATE) in tracking, and our surface loop significantly reduced the error. For the small performance difference, we think that it is due to the implementation difference of the tracking part and that the difference only has only marginal effects on reconstructed 3D models when ATE error is less than 0.02 as presented in Table 4 and Figure 6. For sequences fr1/desk2 and fr2/xyz, the tracking has provided good enough results that most of the frames are connected in the covisibility graph such that there are no loops to be detected. The sequence fr3/nst is a scan of a flat wall with rich texture but no geometry changes, so our method cannot detect surface loops in it; thus, no improvement is made. The TUM data set indicates that even though our method is designed for larger-scale environments with rich geometry changes, it can produce competitive results for some small-scale environments.

Other public real-world scenes
We also run experiments on public real-world data sequences for qualitative analysis and a robustness test. We run our system on Copyroom and Lounge data sequences from Zhou et al. and DysonLab data set from Whelan et al. We report our qualitative results in Figure 8.
Since there are no ground truth models available, we only report screenshots of reconstructed models from our system. Due to space limitations, we do not include the results of other systems. Interested readers can refer to the authors' release. Visually, our system produces results at least matching the state-of-the-art methods.

Conclusion
This article presents a novel 3D reconstruction system that maps both large archeological caves and general indoor environments with RGB-D cameras. The proposed system produces accurate 3D models by detecting and optimizing surface loops in sparse feature-based visual SLAM systems. By adding surface loop closure into a vSLAM system, globally consistent and optimal 3D models are generated accurately. The proposed system consists of five components: (1) sparse feature-based camera tracking from ORB-SLAM2, (2) surface model fusion powered by Surfels, (3) a novel fast surface loop detection algorithm, (4) a novel surface loop filtering method, and (5) loop optimization based on sparse feature-based BA. For fast surface loop detection, we propose and implement a CUDA-based global point cloud registration algorithm that is parallelized to run on GPUs for faster RANSAC hypotheses testing. The proposed point cloud registration algorithm can finish in around 20 ms on NVI-DIA Titan X Pascal without any precision and recall performance losses. The accelerated computational speed makes it possible to be used as a loop detection method.
In the surface loop filtering component part, a novel objective function is proposed to remove false-positive loops from entering the loop optimization step. The proposed objective function formulates a least-square pose graph with a BA term as the supporting backbone graph and robust least square terms with switchable constraints for surface loop detections. Due to the inherent difference in underlying optimization pose graphs, compared with its  closely related work: CZK, our method is less strict to use in practical scenarios, especially on a fast-moving UAV platform. Because our method does not require maintaining RGB-D cameras facing surfaces all the time, but CZK requires it. In addition to the flexibility, the proposed method is benchmarked against CZK on the augmented ICL-NUIM data set in terms of filtered loop detection precision and recall. Experiments show that the proposed method performs better (with þ37:5% precision and equally better recall) than CZK when only a limited number of loops are detected and provides competitive performance on all other scenarios. Moreover, experiments are conducted to evaluate the performance of the full novel system on multiple data sets, including our Maya cave data set, augmented ICL-NUIM, TUM RGB-D data set, SUN3D data set, and some other public data sequences. The results are evaluated with ATE or trajectory RMSE when ground truth camera trajectories are available, surface reconstruction error when ground truth 3D models are accessible, and visual comparisons when no ground truth is available. The results show that sparse feature-based camera tracking performs the best in cave environments. The results also show that the proposed system produces the most reliable and accurate 3D reconstruction performance when surface loops are detected, filtered, and optimized on a sparse feature-based objective function. Other than in cave environments, experiments on other data sets show that the proposed system produces results on-par or better than baseline methods.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by CITRIS Seed Grant entitled "Drones for Cave Archeology and 3D Mapping".