A three-dimensional mapping and virtual reality-based human–robot interaction for collaborative space exploration

Human–robot interaction is a vital part of human–robot collaborative space exploration, which bridges the high-level decision and path planning intelligence of human and the accurate sensing and modelling ability of the robot. However, most conventional human–robot interaction approaches rely on video streams for the operator to understand the robot’s surrounding, which lacks situational awareness and force the operator to be stressed and fatigued. This research aims to improve efficiency and promote the natural level of interaction for human–robot collaboration. We present a human–robot interaction method based on real-time mapping and online virtual reality visualization, which is implemented and verified for rescue robotics. At the robot side, a dense point cloud map is built in real-time by LiDAR-IMU tightly fusion; the resulting map is further transformed into three-dimensional normal distributions transform representation. Wireless communication is employed to transmit the three-dimensional normal distributions transform map to the remote control station in an incremental manner. At the remote control station, the received map is rendered in virtual reality using parameterized ellipsoid cells. The operator controls the robot with three modes. In complex areas, the operator can use interactive devices to give low-level motion commands. In the less unstructured region, the operator can specify a path or even a target point. Afterwards, the robot follows the path or navigates to the target point autonomously. In other words, these two modes rely more on the robot’s autonomy. By virtue of virtual reality visualization, the operator can have a more comprehensive understanding of the space to be explored. In this case, the high-level decision and path planning intelligence of human and the accurate sensing and modelling ability of the robot can be well integrated as a whole. Although the method is proposed for rescue robots, it can also be used in other out-of-sight teleoperation-based human–robot collaboration systems, including but not limited to manufacturing, space, undersea, surgery, agriculture and military operations.


Introduction
As a kind of service robot, rescue robots are demanded to search and save victims in various disasters, for example, mining accidents, terrorist attacks, earthquakes, city conflagration and explosions, bringing the benefits of reduced personnel requirements, less fatigue and access to unreachable areas. With these specific requirements, urban search and rescue robotics have become a challenging and hot research domain, which has attracted increasing researchers in the past decade. To promote the research and provide a public test bed, RoboCup 1 launched the Rescue Robot League to evaluate advanced robotic capabilities for emergency responders using annual competitions since 1999. 2 Another well-known event is that from 2013 to 2015, DARPA sponsored and hosted the Robotic Challenge (DRC) to promote innovation in human-supervised robotic technology for disaster-response operations. 3 Actually, rescue robots have already been employed in real disasters, especially in the United States and Japan. To date, rescue robots have participated in at least 28 disasters in six countries since the first deployment to the 9/11 World Trade Center collapse (see Murphy et al. 4 ).
As known, rescue robots usually face unstructured complex environments, which require good locomotion abilities. When they entered the field, rescue robots act as the extension of human beings for both perception and operation. Therefore, they are usually rigged with multiple sensors and effectors, which enable them to explore the space, for example, map the surrounding, search victims and determine their life status and location. In other words, rescue robots have to accomplish complex tasks in cluttered environments. However, they cannot do the job fully autonomously because their decision-making level is inferior to that of human beings. As a compromise, the robot is usually out-of-sight teleoperated during the missions.
However, operating a robot in search and rescue scenarios is very challenging. Doroodgar et al. 5 reported that the operator could become stressed and fatigued very quickly due to a loss of situational awareness. To our experience, there are three main problems in the traditional teleoperation 1. The operator can only have a third-person perspective through liquid crystal display, which shows the video stream of a camera or multiple cameras. 2. The network infrastructures have been probably destroyed, making the bandwidth restricted and unreliable, resulting in communication delay and data loss, which strengthens the challenge of teleoperation. 3. Giving control commands based on joystick, keyboard and mouse is inefficient.
In this article, we present a human-robot interaction (HRI) method based on real-time mapping and online virtual reality (VR) visualization for collaborative humanrobot space exploration. At the robot side, a dense point cloud map is built in real time by light detection and ranging-inertial measurement unit (LiDAR-IMU) tight fusion; the resulting map is further transformed into threedimensional normal distributions transform (3D-NDT) representation. Wireless communication is used to transmit the 3D-NDT map to the remote control station in an incremental manner. At the remote control station, the received map is rendered in VR using parameterized ellipsoid cells. The human operator can use interactive devices to control the robot to explore the space of interest collaboratively. By virtue of VR visualization, the operator can have a more comprehensive understanding of the space to be explored. In this case, the high-level decision and path planning intelligence of human and the accurate sensing and modelling ability of the robot can be well integrated as a whole. Although the method is proposed for rescue robots, it can be used in other out-of-sight human-robot collaboration systems, including but not limited to manufacturing, space, undersea, surgery, agriculture and military operations.
The contributions of this article are three-fold 1. Proposed an HRI method based on real-time robotic mapping and online VR visualization, which is implemented and evaluated employing our rescue robot. 2. Proposed to transmit the 3D-NDT map to the remote control station in an incremental manner, which significantly reduced the communication burden. 3. Proposed to visualize the map in VR using parameterized ellipsoid cells.
The work presented in this article is partially based on previously published results, 6 with the main additions being a detailed explanation of the system and the behavior-level HRI modes. Rest of this article is structured as follows. The second section draws related work and discusses how our work is unique compared to them. The third section pictures the framework of the proposed realtime mapping and online VR visualization-based HRI method. The fourth section details the LiDAR-IMU fusion mapping method. The fifth section presents the online VR method based on parameterized ellipsoid cells. The sixth section lists three HRI modes according to the unstructured level of the robot's underfoot region. Finally, the seventh section concludes this article and discusses future work.

Related work
HRI is a fundamental issue in human-robot collaboration systems. Therefore, a vast number of researches have been conducted on this topic to improve collaborative efficiency. 7-13 Birk et al. proposed to divide the robot remote control operation into three levels, that is, the task-, behaviour-and action levels, with regard to the commands given to the robots. 14 Among the three levels, commands in the task level are the most abstract, allowing the operator to concentrate on decision and planning issues. Every coin has two sides, to interact in the task level demands the robot to be highly autonomous. To interact in the action level only requires the robot to have expected locomotion ability at the cost that the operators have to undertake the highest cognitive and computational payload, which results in easy fatigue.
Yanco and Drury 15 methodically analysed how the HRI interface influenced the team performance during the association for the advancement of artificial intelligence (AAAI) Rescue Robot Competition. Based on the result, several HRI design principles are introduced, for example, showing the interface on a single rather than multiple monitors, avoiding tiny video windows and window occlusions and designing the interface for potential users rather than developers. In 2015, they investigated the HRI approaches being employed in eight teams of the DRC trials, where each team's interaction methods have been compared against their performance. 16 The DRC finals added difficulty that is not present in the trials, for example, tasks should be completed in succession during a run with harsher degraded communications, the robots should be completely untethered throughout task performance and with surprise tasks that changed between competition days. Therefore, they made another analysis upon 20 teams in the DRC final. 17 In this report, each team's HRI methods were distilled into a set of characteristics pertaining to the robot, operator strategies, control methods and sensor fusion. Each task was decomposed into subtasks that were classified according to the complexity of the mobility and/or manipulation actions. Performance metrics were calculated based on the number of task attempts, execution time and critical incidents. The metrics were then correlated to each team's HRI methods. The results suggested to balance the cognitive and computational burden of the human operator and the deployed robot; it is better to use a common coordinate frame for visualizing data from different kind of sensors to make it easier for situation awareness and so on.
Other than the approaches mentioned above, we present an HRI method for collaborative space exploration, based on real-time mapping and online VR visualization. The method is implemented and tested using our tracked rescue robot under the mission of exploring a rescue arena. By leveraging VR for map visualization, the operator can have a first perspective of the surrounding, which provides the basis for decision and path planning. Meanwhile, the operator can generate control commands for the robot by the VR's interactive devices. As a result, the human-robot team can efficiently explore the space of interest with the proposed HRI method.

A 3D mapping and VR-based HRI framework
In this section, we will present a novel HRI method, which is built upon real-time robotic mapping and online VR visualization, considering being applied to human and rescue robot collaborative space exploration. There are two main entities, that is, the robot and the remote control station where wireless communication is employed for data exchange, as shown in Figure 1. At the robot side, within Thread R2, a 3D dense point cloud map is built with LOAM 18 by fusing the data from a multi-line LiDAR and a MEMS IMU; the point cloud map is then transformed into 3D-NDT 19 representation for data compression; to further reduce the demand on communication bandwidth, the 3D-NDT map is transmitted to the remote control station in an incremental manner. Within Thread R1, the robot pose is estimated also by LiDAR-IMU fusion, which is transmitted to the remote control station in real-time for state monitoring and robot control. At the remote control station, within Thread S1, parameterized ellipsoid cells are employed to render the received map in VR for the operator. By viewing the VR map and the robot inside, the operator can make tactical decisions, plan timely paths or control robotic motions according to situational awareness and the exploration mission. Within Thread S2, the robot pose is updated in real-time for visualization and the interactive VR controllers are employed for command generation in different abstract levels according to the robot underfoot area. In this case, a human-robot collaborative space exploration team is formed, which provides the operator with a strong sense of presence.

The rescue robot and arena
Our custom designed tracked robot has been used to conduct experiments in this article, as shown in Figure 2. A Velodyne VLP-16 multi-line LiDAR and an Xsens MTi-100 IMU served as the primary exteroceptive and proprioceptive sensor, respectively. As can be seen, the IMU is located just below the LiDAR. Four coordinate systems are defined, which are all right-handed Cartesian system and three have been drawn in the figure. The first is the LiDAR coordinate system, with its origin at the geometric centre of the LiDAR sensor, y-axis points forwards, and z-axis points upwards; the second is the IMU coordinate system, with its origin at the geometric centre of the IMU sensor, x-axis points forwards, and z-axis points upwards; the third is the robot body coordinate system with its origin at the top of the main body of the robot, x-axis points forwards and z-axis points upwards. The fourth is the world coordinate system, which coincides with the third at the initial pose. Furthermore, all computation at the robot side is done on an industrial control computer, which runs Ubuntu 14.04 LTS. The robot software is constructed based on robot operating system (ROS). 20 The rescue arena shown in Figure 3 serves as the experimental environment, which is a part of standard RoboCup rescue arena. The arena is built to resemble a scenario that can be found in an urban disaster environment, for example, an earthquake or a fire. In this article, the rescue robot will be teleoperated to explore the arena using the proposed framework.

Real-time point cloud map construction
The problem of employing a mobile robot to map an unknown environment at the same time to estimate its pose is called simultaneous localization and mapping (SLAM). [21][22][23] Cameras 24,25 and laser range finders (LRFs) 26 are commonly used for SLAM in robotic systems. However, cameras turn out to be problematic in disaster mapping scenarios: First, severe illumination changes remain a monumental problem for cameras. Second, to gather range information using stereo vision, the scenario must feature abundant texture information. Furthermore, the range accuracy depends on the distance from the cameras to the object in relation to the stereo baseline (i.e. the distance between the two cameras). Third, an alternative to stereo vision is to use a single camera, which does not suffer from such baseline-related problems. However, the inherent disadvantage with this approach is that building a realistically scaled map requires the identification of objects with known extents to determine the global scale factor. Therefore, LiDAR is chosen as the primary sensor in this article.
Iterative closest point (ICP) and its variants 27 have been most widely used for 3D mapping. However, ICP still has two well-known problems. Firstly, the initial transformation guess is vitally essential as the convergence will deviate the optimal global solution without proper initialization.  Secondly, as an iterative optimization method its real-time performance cannot be guaranteed.
To deal with the first problem, a conventional method is to leverage pose estimation sensors such as photoelectric encoder and IMU. However, photoelectric encoders are not reliable in complex uneven surfaces like post-disaster environments, where side sliding and body shaking would likely happen during the motion. As a result, a MEMS IMU has been fused with the LiDAR in this article. For the second problem, Zhang and Singh 18 proposed to parallelize the motion estimation and mapping into two threads. As a result, the motion estimation is updated in real-time, and the mapping thread runs times slower. In this article, lidar odometry and mapping (LOAM) is employed for pose estimation and point cloud map building by fusing the data from 3D LiDAR and IMU. A typical point cloud map is shown in Figure 4, where colour indicates the height of each point.

Online VR visualization of the environmental map
In this section, we will present how to visualize the resulting 3D robotic map and the robot with regard to its estimated pose in VR online, which gives the first perspective of the robot working environment for the operator.

Map representation
In this article, the mapping algorithm is set to deliver local maps incrementally at the frequency of 1 Hz. Considering the data rate of the employed LiDAR, the points scanned is about 40,000 points per second. To reduce the map's memory requirements, each local map is down-sampled using the voxel-grid filter. However, the number of points in the map still increases too fast as the robot moves around. Besides the memory requirement, a large number of points also poses a big challenge for map visualization in VR, even with modern hardware and rendering techniques. Furthermore, the required bandwidth to transmit the map from the robot to the remote control station cannot be satisfied based on wireless communication. Therefore, the downsampled point cloud map should be further compressed for being visualized in VR; at the same time, the compression method should preserve the environmental structure information.
According to state of the art, OctoMap 28 and NDT map 19,29 are two widely applied compressing representation for point clouds. In 3D-NDT, space is uniformly divided into cube cells; then, a 3D normal distribution is fitted to the shape of the point set in each cell. In other words, NDT approximates the shape of the object in each voxel using an ellipsoid, whose principal semi-axes varies according to the 3D normal distribution. Octomap models each voxel as a probability of whether it is occupied, that is, it does not consider the shape of the object in each voxel. As a result, NDT gives a better visual presentation of using the same size voxel compared to Octomap, which can be observed in Figure 5. For this reason, 3D-NDT has been chosen for map compression and transmission. NDT representation of the point cloud map in Figure 4 is shown in Figure 6.

Online map visualization in VR
Data and command transmission between the remote control station and the robot is realized based on user datagram protocol (UDP), where the data are encoded using javascript object notation (JSON). 30 As the map expands with the robot's movement, it is inefficient to transmit all NDT cells at each frame since only a small part of cells changed during successive frames. As a result, we incrementally transmit the NDT map, which is then rendered in VR.
Assuming the current point cloud map is M t , and it will be grown to M tþk after k new frames of the point cloud is registered along the robot's movement. Comparing M tþk against M t , if the distance between a pointp i;tþk from M tþk andp j;t from M t is less than a certain threshold q, we consider that p i;tþk has been rendered in VR, and remove it. Afterwards, all remaining points in M tþk are used to generate a 3D-NDT map. Considering the environment scale and the speed of our rescue robot, q and k are set to 0.02 m and 5.
At the remote control station, parameterized ellipsoids based on Unity3D are used to render the received NDT cells, which is very efficient for online visualization using the same scale cells as the NDT. Figure 7 illustrates typical VR visualization results, where Figure 7(a) to (c) is from different exploration stages. The cells have been coloured according to height for a better view.

Interacting with the robot in VR
The HTC Vive VR system has been integrated into the remote control station in this study, as shown in Figure 8. In this section, we present how the operator can use its interactive devices to switch perspective and control the robot. Notably, three HRI modes will be proposed, one is action-level HRI, that is, using the HTC controller to give the robot velocity commands. The other two are behaviorlevel HRI, that is, using the HTC controller to specify a path or a target point for the robot, these two modes rely more on the robot's autonomy.

Perspective switching
There are two methods to change the viewpoint in the VR's scene.
1. The viewpoint will be changed according to the motion of the operator, as the operator's pose is real-time tracked in the HTC VR system. 2. Using the interactive handle to change the viewpoint.
To keep smoothness is one key issue during perspective switching, to avoid sickness caused by daze. In practice, the movement of the operator is always continuous, that is, without any mutation, which will not cause dizziness. In this study, the events triggered by six buttons of the left handle, that is, w, y, v, x, { and | are used to set the target viewing point, including its spatial location and orientation. The route from the current point to the target point is then smoothed using linear interpolation, where the camera's speed is constrained. Finally, the operator can smoothly tour in the VR scene and effectively avoid the dizziness.

Action-level HRI
The operator needs to know the robot's pose in real time, for monitoring its state, controlling its motion and planning  exploration strategies. Therefore, the robot pose estimation should be transmitted to the remote control station with which the robot model is rendered in the VR scene together with the map. However, it is impracticable to pack the robot pose and the map into one socket since they have different transmission requirements. Real time is essential for pose transmission, as delay may cause a crash; the map can be transmitted in a relatively low frequency, as the map during a short time does not change much. In this HRI mode, the operator views the robot's pose and the map in VR, and then give velocity commands based on his/her understanding of the live situation. Particularly, the HTC Vive's right handle events, such as w, y, v and x, are used to generate the forward, backward, left and right velocity control commands for the robot. Because the real-time requirement of the transmission of the control commands to the robot is the same as that of the robot's pose updating, we share the same thread for transmitting the control commands and updating the robot's pose encoded by JSON. The robot receives the control commands from the console, then parses and executes it. The system implementation is shown in Figure 9. A video showing the demonstration has been uploaded as Online Supplementary Material.

Specified path-based behavior-level HRI
To improve the interaction efficiency, especially to reduce the burden on humans, the operator can quickly plan a path for the robot instead of giving action commands. In this interaction mode, the operator manually draws a safety path in VR using button { of the right handle, which is then followed by the rescue robot. Considering operability and convenience, a virtual ray emitter is added for the right handle, as shown in Figure 10. As a result, there will be an intersection point between the ray and the map, which is located on the ground and is on the path. However, sometimes there is no intersection between the ray and the map since the map is discretized. In other words, a set of disconnected segments will be obtained after drawing a path on the map. To deal with this problem, the segments are used to fit a continuous path based on linear interpolation. An example is shown in Figure 10.

Target point-based behavior-level HRI
Although the disaster environment is highly unstructured, it usually contains some region that the ground there has no severe fluctuation, for example, is flat or with smooth  slopes. In such situations, the HRI efficiency can be further improved, based on the target point-based behaviourlevel HRI. This HRI method requires less human resources while relying more on the robot's autonomy. In this case, the operator first gives the robot a target point. Afterwards, the robot has to find a safe path to the target point in the map and then navigate to it fully autonomously. To assign the target point, the operator can point to it using a ray emitted from the right handle, which will intersect with the map at the target point, an example of which is shown in Figure 11.
The target point is then transferred to the rescue robot. We only give a brief introduction on path planning here as it is not the research focus. As known, to search a path in 3D is computationally expensive, which does not fit the onboard computer. Therefore, the map is projected to the ground, and a typical example is shown in Figure 12(a). This simplification is reasonable considering that there is no severe fluctuation on the ground. Afterwards, a cost map is constructed based on the 2D projection, which is shown in Figure 12(b). In the figure, objects are coloured in yellow, blue means dangerous area, red and purple stand for the less dangerous area and grey indicates safe area. Then, the Dijkstra algorithm is used to search the shortest safety path to the target point, which has been illustrated in Figure 12(c).

Conclusion
A novel HRI method is proposed for collaborative humanrobot space exploration, which is built upon real-time robotic mapping and online VR visualization. The method is then implemented and tested in the application of exploring a rescue arena, where our rescue robot and HTC Vive is employed as the experimental platform. At the robot side, a dense point cloud map is built in real-time by LiDAR-IMU tightly fusion; the resulting map is further transformed into 3D-NDT representation for data compression. Wireless communication is employed to transmit the 3D-NDT map to the remote control station in an incremental manner. At the remote control station, the received map is rendered in VR using parameterized ellipsoid cells. Finally, the operator can interact with the robot in the VR's scene using the HTC Vive controllers. As VR visualization is leveraged, the proposed method can provide a strong sense of immersion, where the operator feels comfortable during remote operation, and can effectively fuse the human and robot intelligence during exploring space of interest. Furthermore, the method can be widely applied to various robot systems, where the robot has the ability of mapping and localization, and communication between the robot and operator is available, for example, space robots, assembly robots, surgery robots.
However, the proposed method still suffers the following limitations. First, colour and texture information is missing in the map built with LiDAR-IMU fusion, which is required for a better understanding of the map. Second, minor structure details are blurred and lost when using parameterized ellipsoid cells for VR rendering. Third, the HRI level is limited to the action-and behavior levels. As a result, our future work will focus on fusing colour and texture information into the mapping algorithm, improving   Figure 11 to the ground; (b) a cost map constructed for (a), where objects are coloured in yellow, blue means dangerous area, red and purple stands for the less dangerous area and grey indicates safe area; (c) a safe path from the current point to the target point searched by the Dijkstra algorithm, the map has been trimmed for the purpose of path visualization.
the VR rendering quality using more shape elements and finding ways to interact with the robot at the task level.