Robust People Tracking Using an Adaptive Sensor Fusion between a Laser Scanner and Video Camera

Robust detection and tracking in a smart environment have numerous valuable applications. In this paper, an adaptive sensor fusion method that automatically compensates for the bias between a laser scanner and a video camera is proposed for tracking multiple people. The proposed system comprises five components: blob extraction, object tracking, scan data clustering, cluster selection, and bias updating. Based on the position of an object in an image, the proposed system determines the candidate scan region. Then, the laser scan data in the candidate region of the object is grouped into several clusters. The cluster with the maximum probability of being the object is selected using a discriminant function. Finally, the horizontal bias between the laser scanner and video camera is updated based on the selected cluster information. To evaluate the performance of the proposed system, we present an error analysis and two applications. The results confirm that the proposed system can be used for real-time tracking and for an interactive virtual environment.


Introduction
Robust detection and tracking in a smart environment have numerous valuable applications, such as recognizing human behavior for intelligent surveillance, monitoring, and analysis. To monitor a person's location, identity, and behavior, robust detection and tracking are necessary in various environments.
In the area of people motion tracking, a one-sided view from a single video image is commonly used to verify people's actions. In diverse and sophisticated environments, however, a single video image poses numerous problems. In ubiquitous environments, single-sensor systems face unfavorable conditions for perceiving the motion of people, such as illumination variation and shadows. To overcome these problems, many methods fuse several sensors, such as infrared cameras, laser range finders, and video cameras facing multiple directions [1,2]. However, as more sensors are added, calibration becomes a very important issue for the reliable detection and tracking of multiple objects [3].
To calibrate multiple sensors, extrinsic calibration between the camera and the laser scanner was proposed in [3][4][5][6]. This approach uses a technique in which both sensors observe a planar checkerboard target at unknown orientations. Even if the calibration is reliable, it is inconvenient to perform while people are being tracked. Furthermore, a checkerboard is always required to calibrate the system.
To overcome these problems, we present an adaptive sensor fusion method between the laser scanner and video camera. The proposed approach does not need a checkerboard. When the configuration of a system consisting of a laser scanner and a video camera changes, a traditional system requires recalibration. The proposed system, which maintains a background model, compensates for horizontal variations between the sensors automatically.
The paper is organized as follows. In the next section, we briefly introduce the proposed system. In Section 3, we present a video processing method for tracking multiple objects. In Section 4, we present a detailed method to calibrate the laser scanner and video camera. Then, we present results with real environmental data in Section 5. Finally, we conclude in Section 6.

System Overview
The proposed system consists of video processing and adaptive sensor fusion, as shown in Figure 1. To measure the position of objects in an image, we adopt Mixture of Gaussians (MoGs) [7] based blob extraction and inference-graph-based object tracking. Then, laser scan data in the candidate region obtained from the object tracking result is grouped into several clusters. The cluster with the maximum probability of being the object is selected using a discriminant function. Finally, the horizontal bias between the laser scanner and video camera is compensated based on the selected cluster information.

Video Processing
To merge the laser scanner and video camera, the proposed system first performs video processing. To measure the position of objects in an image, we developed an enhanced view-based multiple-object tracking system based on previous research [8,9]. In this section, we briefly introduce this multiple-object tracking system.

Blob Extraction.
For segmentation, MoGs is widely used because it adapts to various environmental changes such as illumination. However, the main problem of MoGs is that moving objects are learned as background when they stop, so it fails to segment stopped objects as foreground. In the proposed system, a modified MoGs [8] is used to segment an input image into foreground and background while keeping moving objects segmented as foreground even when they stop. After conventional MoGs is performed, the following four steps are executed to handle stopped objects. First, the Gaussians are sorted in descending order of their weight. Second, the pixels moving from the first Gaussian component to the second Gaussian component are identified as pixels belonging to stopped objects and are put into an augmented mask. Third, the pixels in the augmented mask are added to the segmentation mask from the conventional MoGs. Fourth, the pixels are removed from the augmented mask when the stopped objects start to move. This modified MoGs guarantees that objects are still identified as foreground even after they stop moving. To remove shadows and highlights, we adjust the intensity of a shadow pixel by comparing it with the background model, as shown in Figure 2(d).
After shadows and highlights are removed, the foreground pixels captured at frame $t$ are clustered into a set of blobs $B^t = \{\, b_i^t \mid i \text{ is an integer and } 0 \le i \,\}$, where a blob $b_i^t$ is the $i$th set of connected foreground pixels.
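The grouping of foreground pixels into blobs can be illustrated with a minimal connected-component sketch. It assumes 4-connectivity and a binary mask given as a list of rows; the function name `extract_blobs` and this connectivity choice are illustrative assumptions, not details stated in the paper.

```python
from collections import deque


def extract_blobs(mask):
    """Group truthy (foreground) pixels into 4-connected blobs.

    `mask` is a list of equally long rows. Returns a list of blobs in
    row-major discovery order; each blob is a sorted list of (row, col).
    The 4-connectivity rule is an assumption made for this sketch.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            blob = []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(sorted(blob))
    return blobs
```

Each returned blob corresponds to one set of connected foreground pixels at a single frame; a production system would more likely use an optimized image-processing library for this step.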

Object Tracking.
A blob represents an object in the ideal case. In a real environment, however, one object can produce several blobs (fragmentation), and one blob can contain multiple objects (grouping). To deal with these problems, Choi et al. [9] adopt an online multiple-object tracking framework. Figure 3 shows the overall procedure of this framework. First, blob association events between the blob set of the current frame, $B^t$, and that of the previous frame, $B^{t-1}$, are detected and used to update the blob inference graph, in which each vertex is labeled as a fragment, an object, or a group. Finally, the objects are localized using the blob inference graph.

Adaptive Sensor Fusion
In this paper, we propose an adaptive method that compensates for the horizontal bias between the laser scanner and video camera. The proposed method merges the two sensors adaptively in three steps. First, laser scan data in the candidate region of the object is grouped into several clusters. Second, the cluster with the maximum probability of being the object is selected using a discriminant function. Finally, the horizontal bias between the laser scanner and video camera is updated based on the selected cluster information.

Sensor Data Clustering.
To match an object in an image to laser scan data, we determine a candidate region in the laser scan data based on the object position in the image. This candidate region $(\theta_l, \theta_r)$ is given by (1). In (1), $\theta_l$ and $\theta_r$ are the left and right angle boundaries, respectively. The remaining terms are the left and right pixel positions of the object, the image width, the horizontal bias between the laser scanner and video camera, the field-of-view angle (fov), which is determined by the video camera, and a scale factor that expands the candidate region.
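As a rough illustration of how a candidate region could be derived from an object's pixel extent, the sketch below assumes a linear pixel-to-angle mapping across the camera's field of view, an additive bias, and a scale factor that widens the region symmetrically about its center. This mapping and the names `candidate_region`, `x_left`, `x_right`, and `scale` are assumptions made for illustration; they are not taken from (1).

```python
def candidate_region(x_left, x_right, image_width, fov_deg, bias_deg, scale=1.0):
    """Return hypothetical (left, right) angle boundaries in degrees.

    Assumes pixel column x maps linearly to the angle
    x / image_width * fov_deg + bias_deg, which is an illustrative choice.
    """
    if not 0 <= x_left <= x_right <= image_width:
        raise ValueError("expected 0 <= x_left <= x_right <= image_width")

    def to_angle(x):
        return x / image_width * fov_deg + bias_deg

    angle_left = to_angle(x_left)
    angle_right = to_angle(x_right)
    center = (angle_left + angle_right) / 2.0
    # The scale factor expands (or shrinks) the region about its center.
    half_width = scale * (angle_right - angle_left) / 2.0
    return center - half_width, center + half_width
```

For instance, with the 768-pixel-wide image and 140-degree field of view reported in the experiments, an object spanning columns 192 to 384 with zero bias and `scale=1.0` would map to the region from 35 to 70 degrees under these assumptions.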
The archived laser scan data in the candidate region is grouped into several clusters by the nearest-neighbor clustering algorithm, using the threshold condition in (2). In (2), $d_{MV}(\theta)$ refers to the distance measured by the laser scanner at angle $\theta$, $\theta_1$ and $\theta_2$ are discrete angle values (0, 1, 2, ..., 180), and the neighborhood distance threshold is set to 2.5 feet (an average person's stride length [10]). Using this threshold, data points within 2.5 feet of each other are assigned to the same cluster.
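A minimal sketch of threshold-based clustering over the candidate region follows. It assumes that adjacent discrete angles belong to the same cluster when their measured ranges differ by less than the threshold; whether (2) compares range readings or Euclidean point distances is not recoverable from the text, so this criterion and the name `cluster_scan` are assumptions.

```python
def cluster_scan(ranges, angle_lo, angle_hi, threshold=2.5):
    """Split the angles angle_lo..angle_hi (inclusive) into clusters.

    `ranges[angle]` is the measured distance (in feet) at an integer angle.
    A new cluster starts whenever two consecutive readings differ by at
    least `threshold`. The range-difference criterion is an assumption.
    """
    clusters = []
    current = []
    for angle in range(angle_lo, angle_hi + 1):
        if current and abs(ranges[angle] - ranges[current[-1]]) >= threshold:
            clusters.append(current)
            current = []
        current.append(angle)
    if current:
        clusters.append(current)
    return clusters
```

Each cluster is returned as a list of consecutive angle indices, so the angular extent of a cluster is simply its first and last element.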

The Cluster Selection Method.
To select the cluster that has the greatest probability of being the object, the proposed system adopts the discriminant function in (3) and (4). Equation (3) involves the number of clusters and the $i$th cluster in the candidate region. A discriminant function $\Delta$ is used to compare the clusters. In (4), $d^{MAX}_{BG}$ denotes the maximum distance from the background model of the laser scan data, $d^{BG}_{MV}(\theta)$ denotes the difference between the background model and the measured value at $\theta$, and the remaining terms are the angular width and the center angle $\theta_c$ of the candidate region $(\theta_l, \theta_r)$. These are given by (5) and (6), in which $d_{BG}(\theta)$ denotes the distance from the background model of the laser scan data at $\theta$. The background

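Figure 5: The target bias calculated from the candidate region and the selected cluster (labels: candidate region, cluster of object, target bias, center of candidate).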
In (9), one learning rate applies in the case of foreground and another applies in the case of background. Equation (9) is applied to the whole range of laser scan data. According to the discriminant function, the cluster volume, the difference from the background model, and the difference from the center of the candidate region are considered as selection criteria. Using these criteria, the proposed system can select the appropriate cluster even when candidate regions overlap. An example of overlapping is shown in Figure 4.
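The following sketch illustrates the two ideas in this paragraph under loudly stated assumptions. The background update is written as an exponential moving average with separate foreground and background learning rates, and the cluster score is a plain sum of three normalized terms matching the three named criteria. Neither form is the paper's equation (9) or its discriminant function, and all function and parameter names are hypothetical.

```python
def update_background(background, measured, foreground_angles,
                      rate_fg=0.01, rate_bg=0.1):
    """Blend measured ranges into the background model, angle by angle.

    Angles inside `foreground_angles` use `rate_fg`; all others use
    `rate_bg`. The moving-average form is an assumption, not equation (9).
    """
    updated = []
    for angle, (bg, mv) in enumerate(zip(background, measured)):
        rate = rate_fg if angle in foreground_angles else rate_bg
        updated.append((1.0 - rate) * bg + rate * mv)
    return updated


def cluster_score(cluster, background, measured, region, max_bg_distance):
    """Hypothetical score: relative width + background difference + centrality."""
    left, right = region
    span = right - left
    # Criterion 1: angular extent of the cluster relative to the region.
    width = (cluster[-1] - cluster[0] + 1) / (span + 1)
    # Criterion 2: mean deviation from the background model, normalized.
    diff = sum(abs(background[a] - measured[a]) for a in cluster)
    diff /= len(cluster) * max_bg_distance
    # Criterion 3: closeness of the cluster middle to the region center.
    center = (left + right) / 2.0
    middle = (cluster[0] + cluster[-1]) / 2.0
    centrality = 1.0 - abs(middle - center) / (span / 2.0) if span else 1.0
    return width + diff + centrality


def select_cluster(clusters, background, measured, region, max_bg_distance):
    """Return the cluster with the highest hypothetical score."""
    return max(
        clusters,
        key=lambda c: cluster_score(c, background, measured, region, max_bg_distance),
    )
```

Using a small foreground rate keeps a standing person from being absorbed into the background model too quickly, which mirrors the motivation for distinguishing the two cases in the text.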

Updating Rule.
After an appropriate cluster is selected by the discriminant function, the proposed system updates the bias adaptively. Figure 5 shows the target bias, which is calculated from the candidate region and the selected cluster.
To diminish the target bias, the proposed system updates the bias using the following updating rule:

$$\text{bias} = \text{bias} + \alpha \,\bigl(\text{MidAng}(C) - \theta_c\bigr). \tag{10}$$

In (10), bias refers to the horizontal bias angle between the field-of-view angle and the matched cluster, $\alpha$ denotes the learning rate, MidAng($C$) is the median angle of the matched cluster $C$, and $\theta_c$ is the center angle of the candidate region. The median angle MidAng is calculated from the matched cluster. Using (10) as an updating rule, the proposed approach compensates for the horizontal bias adaptively.
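The updating rule in (10) can be sketched in a few lines. The sketch assumes that MidAng is the statistical median of the cluster's angles, since the defining equation is not available here; the names `mid_angle`, `update_bias`, and `region_center` are illustrative.

```python
def mid_angle(cluster_angles):
    """Median of the cluster's angles (assumed interpretation of MidAng)."""
    ordered = sorted(cluster_angles)
    n = len(ordered)
    if n == 0:
        raise ValueError("cluster must contain at least one angle")
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def update_bias(bias, cluster_angles, region_center, learning_rate):
    """One step of rule (10): move the bias toward the observed offset."""
    target_bias = mid_angle(cluster_angles) - region_center
    return bias + learning_rate * target_bias
```

A cluster spanning 60 to 64 degrees with a candidate center at 59 degrees has a target bias of 3 degrees, so a learning rate of 0.5 would raise the bias by 1.5 degrees in one step.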

Experimental Results
To evaluate the proposed system, we arranged the video camera and the laser scanner vertically. A Samsung SDC-415A is used as the video camera. It supports a resolution of 768 × 494 and covers a 140-degree field of view. A SICK LMS100 is used as the laser scanner. It supports a 50 Hz scan rate over a 270-degree range with 0.25-degree angular resolution. Its sensing range is 18 meters, with an error of about 20 mm. The experimental results consist of two parts: an error analysis of the proposed approach and applications of the proposed system.

Error Analysis.
To evaluate the proposed approach, we performed an error analysis of the bias compensation. To measure the compensation error, we installed a marker at a fixed position where it can be detected by both the video camera and the laser scanner. After rotating the video camera three degrees horizontally, we measured the relative bias from the original angle after convergence. The overall results are shown in Table 1. In this experiment, measurements were taken five times and the results were averaged.
As shown in Table 1, the average error of the proposed approach is about 0.085 degrees. The convergence of the bias over time is shown in Figure 6, where LR refers to the learning rate. With a learning rate of 0.1, the saturation time is about 40 frames. With a learning rate of 0.5, however, the saturation time is less than 10 frames. These results indicate that the proposed approach is suitable for real-time systems.
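As a rough plausibility check on these saturation times, suppose the candidate-region center shifts one-for-one with the bias and the cluster measurement is noise-free, so that the target bias in each frame equals the remaining error. Under these idealized assumptions, which are ours and not stated in the paper, the updating rule shrinks the error geometrically. Writing $\alpha$ for the learning rate and $e_k$ for the remaining error after $k$ frames, with $e_0 = 3^\circ$ from the camera rotation:

```latex
\begin{align*}
e_{k+1} &= (1-\alpha)\,e_k
  \quad\Longrightarrow\quad e_k = (1-\alpha)^k\, e_0, \\
\alpha = 0.1:&\quad e_{40} = 3^\circ \times 0.9^{40} \approx 0.044^\circ, \\
\alpha = 0.5:&\quad e_{10} = 3^\circ \times 0.5^{10} \approx 0.003^\circ, \\
e_k \le 0.085^\circ &\iff k \ge \frac{\ln(0.085/3)}{\ln(1-\alpha)}
  \approx
  \begin{cases}
    33.8 & (\alpha = 0.1),\\
    5.1  & (\alpha = 0.5).
  \end{cases}
\end{align*}
```

In this idealized model the error falls below the reported average of 0.085 degrees after 34 frames at a learning rate of 0.1 and after 6 frames at 0.5, which is in line with the observed saturation times of about 40 frames and fewer than 10 frames.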

Application.
In this subsection, we show two applications of the proposed system. The first application is people tracking. The second application is a virtual pet system using augmented reality. Figure 7 shows an example of multiple people tracking. The upper two images illustrate the case in which the candidate regions do not overlap, and the lower two images illustrate the overlapping case. In both conditions, the proposed system can track the appropriate people using the discriminant function.
In the second application, we employed the proposed system in a virtual pet system based on augmented reality. As shown in Figure 8, only a person and a doghouse exist in the real environment. Using the calibrated object position and the human action recognition result, the user can interact with a virtual pet called Cho-Rong-I.
The ground of the real environment is calibrated in the initial stage so that the bottom-center coordinate of each object can be converted into a coordinate on the ground.
We designed several complex scenarios to probe the performance of the proposed system: Cho-Rong-I follows the owner in Figures 8(a) and 8(b) and pretends to die when the owner pretends to shoot in Figure 8(c). Figures 8(d) to 8(f) show Cho-Rong-I passing under the owner's legs. Because the proposed system obtains a precise position, the augmented dog, Cho-Rong-I, reacts to the person at the virtual region matching the real coordinates.

Conclusion
In this paper, we proposed an adaptive sensor fusion method that compensates for the horizontal bias between a laser scanner and a video camera for tracking people in a real-time system. The usual checkerboard-based method used in previous research is inconvenient while people are being tracked because it requires manual calibration. The proposed system overcomes this problem with an automatic adaptive sensor fusion method for real-time people tracking. In this application field, the accuracy of compensation is a significant factor for a real-time system. To match data between the video camera and the laser scanner, the proposed algorithm merges the two sensors during tracking by using the position of the candidate region and the cluster of the laser scan data. To evaluate the performance of the proposed system, we employed it for tracking two people and then applied it in an augmented reality environment where one person can interact with a virtual pet. These results show that the proposed system can be successfully employed to obtain people's positions by automatic sensor fusion. To enhance the proposed system, variations along the other axes should also be considered. We are currently improving our system to account for uncertainty in the sensor positions.

Figure 2: (a) An input image. (b) The result of foreground extraction. (c) Detected shadow (marked in red) and highlight (marked in blue). (d) The result of foreground extraction: an example of a figure without shadow or highlight.

Figure 4: Example case of overlapping candidate regions.

Figure 7: People tracking with adaptive calibration.

Figure 8: Demonstration of the virtual pet system.

Table 1: The relative bias after convergence.