Object tracking algorithm for unmanned surface vehicle based on improved mean-shift method

The unmanned surface vehicle has the characteristics of high maneuverability and flexibility. Object detection and tracking skills are required to improve the ability of unmanned surface vehicle to avoid collisions and detect targets on the surface of the water. Mean-shift algorithm is a classic target tracking algorithm, but it may fail when pixel interference and occlusion occur. This article proposes a tracking algorithm for unmanned surface vehicle based on an improved mean-shift optimization algorithm. The method uses the self-organizing feature map spatial topology to reduce the interference of the background pixels on the target object and predicts the center position of the object when the target is heavily occluded according to the extended Kalman filter. First, a self-organizing feature map model is built to classify pixels in a rectangular frame and the background pixels are extracted. Then, the method optimizes the extended Kalman filter solution process to complete the prediction and correction of the target center position and introduces a similarity function to determine the target occlusion. Finally, numerical analyses based on a ship model sailing experiment are performed with the help of OpenCV library. The experimental results validated that the proposed method significantly reduces the cumulative error in the tracking process and effectively predicts the position of the target between continuous frames when temporary occlusion occurs. The research can be used for target detection and autonomous navigation of unmanned surface vehicle.


Introduction
Environmental awareness is essential for unmanned surface vehicle (USV) to acquire the autonomous ability in changing environments. 1 Target detection and object tracking skills are its foundation. Due to the interference factors such as background interference, object occlusion, and noise clutter, object tracking on water is a challenging task. 2,3 Object tracking plays a vital role in the computer vision field. The tracking can be divided into region tracking, feature tracking, model tracking, and deformation template tracking. In 2000, to track objects, Comaniciu et al. 4 applied the mean-shift algorithm, which is simple and practical, and fully effective when the target moves slowly. However, the target may lose when the method is applied on water due to the environmental interference and the cumulative error of the algorithm. 5,6 In order to promote the deficiency of the mean-shift method, several improved algorithms have been suggested. Collins 7 introduced a method based on the difference of Gaussian mean-shift kernel to effectively track one object through the scale space. The tracking effect was satisfactory but the time complexity was high when the object moves at low speed. Danelljan et al. 8 incorporated a factorized convolution operator in the discriminative correlation filter, the operations of reducing the number of model parameters and training the compact model generated by the sample distribution could also significantly reduce the complexity of memory and time, but the target will easily be lost if it was severely occluded. Ignoring the effects of sea clutter interference, Mohanty et al. proposed a method of determining the background pixel position based on the maximum gradient value in the column direction after the image information was proposed and averaged in the row direction. The algorithm had extreme requirements on the environment, and it was liable to fail when there was a large background interference. Dong and Desouza 9 proposed a new adaptive learning algorithm that uses multiple feature subspaces to handle background pixel mutations caused by light changes, but its accuracy was not very penetrating. For the occlusion problem in multiple environments, Possegger et al. 10 enhanced the model's recognition accuracy and matching efficiency by adding basic color features to the target appearance model, accurately distinguished the targets, and improved the robustness of the algorithm. Nevertheless, the algorithm was difficult to implement when objects were occluded. The scale and orientation adaptive mean-shift tracking (SOAMST) algorithm is another improved algorithm based on mean-shift, which enhances the adaptive ability of mean-shift when the target is deformed. When the value in the weighted image is small, the size of the candidate object area will be close to that of the actual target, the scale and size of the target in motion can be obtained by calculating the gray matrix related to the weight image. SOAMST is a relatively new algorithm at present.
To overcome the shortcomings of mean-shift algorithm tracking, many novel methods use Kalman filter (KF) to predict and correct target position based on the framework of mean-shift algorithm. Li 11 constructed a KF system model and used the initial value of the mean-shift algorithm at the target center of the KF prediction. The results of the tracking algorithm were used as adaptive KF measurement values. The KF estimation parameters were fitted by the Bhattacharyya coefficient. The algorithm was practical, but it still failed when object occlusion occurs. Gu 12 used a Gaussian mixture weighted KF in the tracking process to enhance the performance of the algorithm and avoid unnecessary maneuver detection, but the algorithm lacked stability. The KF can effectively improve the tracking accuracy, but it requires the dynamic characteristics of the system to be linear in basic parameters, which is a strict condition.
In addition, with the development of neural networks, they are commonly used in image segmentation, feature point extraction, and other technologies. Self-organizing feature map (SOFM) is a competitive neural network, with a two-layer neural network consisting of an input layer and an output layer. In the network, the output layer is spatial topological structure. 13 SOFM has been widely used in image processing and classification owning to the unique structure. Its outstanding performance in image classification can effectively address the problem of pixel interference in tracking algorithms, and its nearest neighbor rule can effectively complete the pixel classification.
As for the problems that background pixel interference and target loss while occlusion occurred, this article proposes an improved mean-shift optimization algorithm based on SOFM and extended Kalman filter (EKF), which can improve the tracking accuracy while ensuring the antiocclusion ability. This algorithm classifies the images in the rectangular frame according to the spatial topology of the SOFM, uses the nearest neighbor to select the most likely target image, removes the off-center image, and reduces the influence of the background pixels. Then, the EKF is differentiated and integrated into the mean-shift algorithm. After detecting the occlusion of the object, the new center position of the target is obtained based on the dynamic prediction ability of the EKF. The rest of the article is arranged as follows: the second section introduces the basic principles of mean-shift algorithm, the third section introduces SOFM-based target detection algorithms, the fourth section introduces the target positioning process based on improved EKF, the fifth section describes the experimental results, and the sixth section summarizes the conclusions.

Mean-shift tracking algorithm
Mean-shift tracking algorithm is a nonparametric feature space analysis method built on kernel density estimation. As color is the most widely used feature in computer vision problems, mean-shift uses the color histogram as its matching feature for tracking objects. The target model is constructed based on the color characteristics of the current frame. If the target still exists in the next frame area during the tracking process, it will be defined as the region proposal. Traditionally, mean-shift tracking algorithm calculates the color histogram of the target model and the candidate model separately and compares the similarity of the two models according to the Bhattacharyya coefficient to complete the feature module matching. During the process of object tracking, the initial object is selected manually in the initial frame, and a rectangular window with a tracked object is drawn by manual annotation. 14,15 Target model During the selection of rectangular box in the initial frame, it is assumed that n pixels are contained, every pixel position is represented by fz i g i¼1;2;...;n , and the color space of the selected area is evenly divided to find a feature histogram composed of m equal intervals. The probability density q u ðu ¼ 1; 2; . . . ; mÞ of the target model is expressed as follows where z Ã i indicates the normalized pixel position with object center as origin, calculated by the following equation ðx 0 ; y 0 Þ is the target center coordinate, K is the Epanechikov kernel function matrix, bðz i Þ is the histogram index function, which is the red, green, blue (RGB) value corresponding to each pixel in the target frame, u is a certain eigenvalue (u ¼ 1, . . . , m), the function d is used to determine whether the gray value at the pixel z i in the target area belongs to the uth cell in the histogram, if yes, it is 1, otherwise it is 0. C is the normalization coefficient, which is obtained from the following equation Candidate model At the tth frame, the target center position f 0 of the (t À 1)th frame is used as the center of the search window to obtain the center position coordinate f of the candidate target, and the region proposal histogram of the current frame is calculated. The pixels in this area are represented by fz i g i¼1;2;...;n . The probability density of the candidate model can be solved by the following equation where h is the size of the kernel window and determines the weight distribution.

Similarity measure
Bhattacharyya coefficient is the similarity function, which is defined as The similarity function r ranges from 0 to 1, the larger value indicates that the two models are more similar.

Mean-shift iteration
During the iteration of each frame, the center position of the candidate model is continuously updated until the Bhattacharyya coefficient reaches the maximum value. As for calculating the histogram distribution of the search window under the weight of the kernel function, it uses Hillclimbing algorithm for the density histogram of the data in the current frame and iteratively calculates the data histogram multiple times to make mean-shift find the target area with the greatest similarity along the direction of the data density distribution gradient.

Target detection based on SOFM
In most cases, background information will inevitably appear in the rectangular frame that is manually selected. Therefore, when the traditional mean-shift algorithm uses the color histogram of the target as the target's feature representation, it is easy to cause background interference and target occlusion. The SOFM is introduced into the traditional mean-shift target tracking algorithm to detect target pixels using the time continuity of SOFM and the constraints of the spatial topology.
By setting up SOFM, the clustering of various image slices in the rectangular box is carried out to complete the separation of the background interference pixels. After the image is being divided into several image pieces, these input data are mainly classified into three characteristics, namely, the image film (p 1 ) which does not contain the target pixels at all, the image chip (p 2 )which contains the target pixels, and the image chip (p 3 ) which contains both the target pixels and the background pixels. The overall process is as follows.

Argument initialization
A set of weights is set by random numbers, the weights w k ¼ [w k1 (t), w k2 (t), w k3 (t), . . . , w kt (t)], and they are normalized. Set the learning-rate-time function as T 1 , the neighbor-radius-time constant as T 2 , and the number of neurons in the competitive layer as m.

Enter the input of the vector
The target box image is divided into several image slices, and the image feature vector X ¼ ðx 1 ; x 2 ; x 3 ; . . . ; x m Þ T is entered into the input layer of the network. And the input vector is normalized. In the mapping layer, the European distance of the weight vector and input vector of various neurons is calculated, and the two are made internal products. The neuron with the largest inner product is marked as the winning neuron. Since the sample vectors and weights are normalized, the neuron with the smallest Euclidean distance is the one with the largest inner product. The distance between the jth neuron and the input vector of the mapping layer is as follows where w ij is the weight between ith neurons in the input layer and jth neurons in the mapping layer.

Weight learning
Update weights of the winning neuron topology area with the Kohonen rule w ij ðt þ 1Þ ¼ w ij ðtÞ þ hðx i ðtÞ À w ij ðtÞÞ; j 2 N ðtÞ where h is a constant between 0 and 1, gradually reduced to 0 over time; N(t) is the neighborhood of winning neuron.

Judgment of algorithm convergence
The algorithm determines whether the number of iterations has reached the preset maximum. If the maximum number of iterations has not been reached, it proceeds to the second step, otherwise, the algorithm ends. Image slice is a collection of regional image pixels with structure information, which uses SOFM to classify the image slices in rectangular boxes and maintain its spatial topology. The image slice that is most likely to be targeted is determined using the proximity selection mechanism, and the image piece that deviates from the center of the target will be removed. Target center image slices and images containing targets and targetless images are weighted to determine the center position of the next target frame: where P i is the center coordinates of the ith image and w i is its weight. The algorithm flow is shown in Figure 1.

Object localization based on EKF
Object tracking is a dynamically changing process, whose eigenvalues will change differently over time. When the state at (t þ 1)th frame is not a linear function of the state at tth frame, we need to use the local characteristics of the nonlinear function to locally linearize the correlation process and evolve the KF into EKF, which can achieve the optimal estimation of the target center position. The dynamic characteristics of the object are selected through EKF as input parameters, and the eigenvalue at the next moment is determined according to the change of the input.
The system state equation is as follows The system observation equation is as follows State transition matrix is and observation matrix is where X t is the system random noise vector, Z t is the system random observation vector, f(Á) is the system prediction function, h(Á) is the system observation function, u t is set to 0 generally, and q 1 is the estimation error and q 2 is the observation error, both of which are zero-mean vectors with Gaussian distribution. The covariance matrices of system state noise and observation noise are Q t and R t , respectively. 16,17 EKF predicts the state of the next moment according to the state of the previous moment, then combines the difference between the observed value and the predicted value to modify the predicted value by the filter gain matrix. In a word, the essence of EKF is to give the dynamic system state value at the current moment through the optimal estimated value at the previous moment and the actual observed value. The EKF process is shown in Figure 2.  During the updated process, EKF needs to solve the Jacobian matrix at each moment, which increases the time complexity of the algorithm. This article uses the differential linearized EKF method. 18 We solve the Taylor series expansion of the function y ¼ f ðxÞ in the neighborhood of x, and keep only the first term of the expansion In high-dimensional nonlinear systems, the form of the Jacobian matrix is generally used to express the derivative calculation of equation (11). In order to ensure the real-time performance of the tracking algorithm, equation (12) is used instead of the Jacobian matrix: Finally, the real observation Z is obtained by the mean-shift algorithm to represent x 0 , as given in the following equation During mean-shift algorithm object tracking, an initial position needs to be set before each iteration calculation. Considering the dynamic system estimation capability of EKF, the starting position of the iteration can be provided by EKF prediction, which reduces the running time of mean-shift algorithm and improves the system processing capability. Due to the temporal continuity of the video image, the position and speed of the target between continuous frames will not change unexpectedly, 19 so the movement of objects between two continuous frames can be regarded as constant. The object area state vector is X t ¼ ðx t ; y t ; v xt ; v yt ; w t ; h t ; Dwt; DhtÞ T , where x t ; y t ; v xt ; v yt ; w t ; h t ; Dwt; Dht are the coordinates of the target center on the horizontal and vertical axes, and speed, length, width, and changes on length and width, respectively. Kalman prediction vector is where T is the total time of all frames of the video and N is the number of frames. Because the target objects in two continuous frames move at a uniform speed, the equation of state between the frames can be regarded as a linear equation. This article uses the mean value of state transitions to predict time t. The state transition matrix is The observation function is defined as The state noise matrix is When the target object is not blocked, the tracking steps are as follows: 1. Using the optimal estimation of the object center position of the previous frame X tÀ1jtÀ1 to solve the prediction of the position x tjt of the next frame. 2. Iteratively calculating the object center position in the current frame as the EKF observation according to the position of the previous frame Z t . 3. Using Z t to modify the predicted value, then calculating the nearest estimate position x tjt for the current frame. 4. Continue iterating with x tjt as the starting point until the target object is tracked, it is regarded as the center position of the target in the current frame, and the estimation ends.
When the object is occluded, the mean-shift algorithm cannot calculate the correct center position, which gives birth to the condition that EKF is not able to obtain the true observation value. A threshold G t is presented to determine whether the target is out of occlusion. Similarly, two continuous frames are selected for comparison. The Bhattacharyya coefficient is introduced to determine whether occlusion occurs during object tracking. In this article, P t and P tÀ2 are defined as the Bhattacharyya coefficients of the current frame and the previous frame, respectively, and the threshold G t determines the occlusion degree in the tth frame image where G t is set to 0.3. When the object is out of occlusion, the mean-shift algorithm starts to iterate, and the EKF is provided with valid observations from the target's true center position. Combined with the description in this section, the specific algorithm flowchart is shown in Figure 3.

Experiment design and result analysis
The SOFM algorithm can be regarded as a process of classifying image pieces in a rectangular frame. According to the object size in the actual test image, the network dimension is set to [3 5], the initial training step covering the input space is 100, the initial neighborhood size is 3, the layer topology function uses the hextop function, and the distance function between neurons is linkdist function. 20  The experiment program is developed with Cþþ based on Visual Studio 2017 and the OpenCV 3.4.6 computer vision function library was used. 21 As Figure 4 shows, the program running result of a sailing USV video well reflects the space topology structure of the SOFM algorithm. As described in the third section, the testing picture is divided into 3 Â 3 slices, and they are classified according to the features p1, p2, and p3. Images 2 and 3, which include the object pixel, and images 1 and 4, which do not include the object pixel at all are labeled, respectively. As shown in the classification results, two neighboring image slices will be allocated to two neurons adjacent to each other in space. Using the spatial structure of the SOFM algorithm, the adjacent image slices are analyzed as an entirety to deal with the influence of background interference.
Under the same development environment, an outdoor USV sailing video is used to conduct an object occlusion test experiment. In order to guarantee the reliability of the experiment, the improved algorithm is compared with the SOAMST algorithm. In the comparison experiments, the red rectangular frame is the manually selected rectangular frame in the improved algorithm, while the black rectangular frame is the manually selected rectangular frame in the SOAMST algorithm. In the6 video, the floating box on the water surface will cause a complete occlusion to the USV, and the background interference and reflection interference challenge the algorithm too. The tracking results are shown in Figure 5. As the first three figures show, in the condition of sailing with slow speed, tracking effects of both two algorithms are ideal, however, the SOAMST algorithm gives rise to some fluctuations. while the improved algorithm has less accumulated errors due to the application of SOFM. When the USV is occluded, as showed in the next three pictures, the SOAMST algorithm loses the target, while the improved  algorithm accurately estimated the target position due to the prediction ability of the EK. After the disappearance of occlusion, the improved algorithm still finds the target correctly.
In Figure 6, the blue continuous curve describes the pixel distance error of the SOAMST algorithm during the experience, and the orange dotted curve represents the error of the optimized algorithm. Due to the cumulative error of the algorithm and the influence of occlusions, the tracking result of SOAMST algorithm becomes worse and worse over time. In addition, the algorithm starts EKF position prediction when the object is occluded, which effectively shortens the overall running time.

Conclusion
This article analyzes the advantages and disadvantages of the mean-shift algorithm and proposes optimization ideas for the cumulative error and background interference problems. First, the spatial topological features of the SOFM classification algorithm is used to eliminate image slices with background interference and to reduce background interference factors. The EKF's dynamic system estimation capabilities is used to accurately predict the position of the next frame. By incorporating EKF into the traditional mean-shift algorithm, a new target center position can be and efficiently obtained, and the accurate position of the target can be found. The algorithm is proved in the experiments that the background pixel interference can be effectively narrowed. When the occlusion time is short, the algorithm has an excellent anti-occlusion effect. In further research, the similarity function of occlusion determination needs to be optimized to improve the universal applicability of the algorithm.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.