Soft labeling with quasi-Gaussian structure for training samples of deep classification trackers

Deep classification tracking classifies candidate samples as target or background using a classifier that is generally trained with a binary label. However, the binary label only distinguishes samples of different classes and ignores the distinction among samples belonging to the same class, which weakens both classification and localization ability. To address this problem, this article proposes a soft labeling with quasi-Gaussian structure to replace the binary labeling; it distinguishes samples of different classes and samples within the same class simultaneously. As with the binary label, the signs of the labels for target and background samples are set to plus and minus, respectively, to distinguish samples of different classes. Further, to exploit the difference among samples in the same class, the label values of samples in the same class are designed as a monotonically decreasing quasi-Gaussian function of Intersection over Union (IoU). Consequently, the corresponding response function is a two-piece, monotonically increasing quasi-Gaussian combination function of IoU. Owing to this response function, a deep classification tracker trained with the proposed soft labeling achieves better classification and localization performance. To validate this, the proposed soft labeling is integrated into the pipeline of the deep classification tracker SiamFC. Experimental results on the OTB-2015 and VOT benchmarks show that our variant improves significantly on the baseline tracker while maintaining real-time tracking speed, and achieves accuracy comparable to recent state-of-the-art trackers.


Introduction
Visual object tracking, one of the most important tasks in many robot applications, has been widely used in fields such as intelligent manufacturing, human-computer interaction, video surveillance, and robotics. [1][2][3] It is an indispensable part of robots, 4,5 serving as the "eye" through which robots communicate with the world, as shown in Figure 1.
Visual object tracking mainly comprises five components: motion model, feature extractor, observation model, model updater, and ensemble post-processor, of which the feature extractor has the greatest impact on tracking performance. 6 Owing to the powerful representation ability of deep networks, deep object tracking [7][8][9][10][11][12] has become a research hotspot and the state of the art in visual tracking. Generally, deep object tracking can be divided into two main categories: deep regression tracking 7,8,[13][14][15] and deep classification tracking. [9][10][11]16 Deep regression tracking outputs a response map through a regressor that learns a mapping between input deep features and a soft label. Deep classification tracking treats object tracking as a two-class (target versus background) problem based on deep features, classifying samples as target or background through a classifier that is usually trained with binary labeling. With recent developments, deep classification tracking has become able to run in real time while maintaining competitive tracking performance.
Unlike the Gaussian soft label of deep regression tracking, the label used in deep classification tracking is the binary label $\{-1,+1\}$ or $\{(1,0),(0,1)\}$. Samples whose Intersection over Union (IoU) values are greater than a threshold are considered target samples, and their labels are set to $+1$ or $(1,0)$. The others are considered background samples, and their labels are set to $-1$ or $(0,1)$. Although such a binary label can distinguish samples of different classes, it overlooks the difference among samples in the same class. This drawback makes it difficult for the response map of deep classification tracking to reflect the target location accurately. As shown in Figure 2(a), the deep classification tracker SiamFC 11 trained with the binary labeling can discriminate between target and background samples, but the maximum of its response map does not correspond accurately to the target position, which results in target drift. As tracking proceeds, the drift accumulates and affects subsequent frames. Moreover, neglecting the difference among samples within the same class during training reduces the classification ability of the tracker. As shown in Figure 2(b), SiamFC trained with the binary labeling misjudges target and background samples because of this neglected information.
How can a labeling be designed to solve the above problem? IoU characterizes the overlap rate between a sample and the target, and can therefore represent, to some extent, the probability that the sample is the target. Inspired by this, this article uses IoU values as the design criterion and proposes a novel soft labeling with quasi-Gaussian structure, in place of the binary labeling, to distinguish samples of different classes and samples within the same class simultaneously. As a result, deep classification tracking trained with the proposed soft labeling shows better classification and localization ability, as shown in Figure 2.
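The binary labeling rule described above can be sketched as follows. This is a minimal illustration rather than the SiamFC implementation: the `(x, y, width, height)` box format, the helper names `iou` and `binary_label`, and the threshold value 0.5 are all assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def binary_label(iou_value, threshold=0.5):
    """Return +1 for a target sample and -1 for a background sample.

    The threshold 0.5 is a hypothetical value chosen for illustration.
    """
    return 1 if iou_value > threshold else -1


target = (10, 10, 20, 20)
candidates = [(10, 10, 20, 20), (12, 10, 20, 20), (15, 10, 20, 20), (40, 40, 20, 20)]
ious = [iou(c, target) for c in candidates]
labels = [binary_label(v) for v in ious]

for c, v, y in zip(candidates, ious, labels):
    print(f"box={c} IoU={v:.3f} label={y:+d}")
```

In this example the first three candidates overlap the target to clearly different degrees, yet all receive the same label, which is the loss of within-class information discussed above.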
In the rest of this article, related work is introduced in the second section. The third section describes the proposed soft labeling with quasi-Gaussian structure and applies it to the deep classification tracker SiamFC. Then we compare and analyze the variant with its baseline tracker and the state-of-the-art trackers on popular tracking benchmarks: OTB-2015 and VOT, in the fourth section. Lastly, we conclude this article in the fifth section.
The main contributions of this article are summarized below:

Related work
In 2012, AlexNet 17 won the ILSVRC-2012 18 competition and demonstrated the powerful representational capability of deep features. Since then, deep object tracking 7,9,15,16,19 has emerged and marked a leap forward for the field of visual object tracking. Deep object tracking replaces handcrafted features 20,21 with more powerful deep features as the representation and achieves markedly better performance than traditional object tracking. [22][23][24][25][26][27] According to their nature, deep object trackers can be classified into two main categories: deep regression tracking and deep classification tracking.

Deep regression tracking
Deep regression tracking outputs a response map through a regressor that learns a mapping between input deep features and a soft label. According to the mapping method, deep regression trackers can be divided mainly into DCF-based deep regression trackers, 7,8,13,15,28 deep regression trackers based on convolutional regression networks, 14,29,30 and deep regression trackers based on Siamese networks. 31 DCF-based deep regression trackers directly adopt VGG-M, 32 a convolutional neural network pre-trained on a multi-class classification dataset, as the feature extractor, and then output the response map through an online-learned regressor that regresses all circularly shifted versions of the input image to a Gaussian soft label. Deep regression trackers based on convolutional regression networks pre-train the networks end-to-end on a tracking dataset to establish a mapping between the input image and the Gaussian soft label, and then fine-tune the networks online so that they act as feature extractor and regressor simultaneously. Despite their top performance, DCF-based deep regression trackers and those based on convolutional regression networks cannot run in real time. In contrast to these two types, deep regression trackers based on Siamese networks use Siamese networks pre-trained offline on a tracking dataset as feature extractor and regressor simultaneously, and no longer fine-tune the networks during tracking, so that they can run in real time. Although these trackers achieve a high frame rate (100 Fps), their tracking performance is not ideal. Overall, existing deep regression tracking cannot achieve a good balance between accuracy and robustness on the one hand and real-time performance on the other.

Deep classification tracking
Deep classification tracking treats object tracking as a two-class (target versus background) problem. It classifies samples as target or background through a classifier that is usually trained with a binary label. Deep classification tracking mainly includes SVM-based deep classification trackers, 16 deep classification trackers based on multi-domain convolutional neural networks, 9,10,33,34 and deep classification trackers based on Siamese networks. 11,12,35-38 SVM-based deep classification trackers directly adopt R-CNN, 39 a convolutional neural network pre-trained on a multi-class classification dataset, as the feature extractor, and classify samples as target or background with a binary SVM classifier. Unlike SVM-based deep classification trackers, which can hardly benefit from end-to-end training, deep classification trackers based on multi-domain convolutional neural networks use these networks as feature extractor and binary classifier simultaneously, which makes end-to-end training possible. However, to acquire information about the specific target and scenario, they need to fine-tune the network online, which makes real-time operation difficult. Instead of online fine-tuning, deep classification trackers based on Siamese networks obtain this specific information through the Siamese networks themselves: they map the target and the samples into the same embedding space and then classify samples as target or background by similarity comparison. The early Siamese-network-based deep classification tracker SINT 35 has excellent tracking performance, but it is still far from real time because of its fully connected layers and online update.
Distinct from SINT, SiamFC 11 adopts a fully convolutional Siamese network and no longer updates the network online, so that its speed (86.5 Fps) ranked first among the deep classification trackers of its time while still guaranteeing a certain tracking accuracy. Consequently, more and more recent deep classification trackers 12, [36][37][38] build on SiamFC to achieve high frame rates while maintaining tracking accuracy. In general, with its development, deep classification tracking has become able to strike a good balance between tracking performance and real-time operation, and has achieved state-of-the-art results. However, we note that the binary labeling for deep classification trackers distinguishes samples of different classes but ignores the difference among samples within the same class. Neglecting the difference among target samples makes it difficult for their response values to reflect the target position accurately and causes target drift. Moreover, because this information is neglected during training, the classification ability of deep classification tracking weakens and misjudgments arise. To address these problems, this article proposes a soft labeling with quasi-Gaussian structure, in place of the binary labeling, to enhance the classification and localization ability of deep classification trackers. Compared with the binary labeling, the soft labeling with quasi-Gaussian structure adds information about the difference among samples within the same class to the training phase while still accounting for the difference among samples of different classes.

Soft labeling with quasi-Gaussian structure for deep classification tracking
We first describe the problems of the binary labeling and then propose a soft labeling with quasi-Gaussian structure for deep classification tracking. Lastly, we integrate the soft labeling into the pipeline of the deep classification tracker SiamFC to validate it.

Problems in the binary labeling for deep classification tracking
There are two kinds of binary labels for deep classification tracking, namely $\{-1,+1\}$ and $\{(1,0),(0,1)\}$. Deep classification trackers that output only the positive score of each sample 11,12,[35][36][37][38] generally adopt the $\{-1,+1\}$ binary label, while those that output a 2-D binary classification score 9,10,33,34 adopt the $\{(1,0),(0,1)\}$ binary label, as shown in Figure 3(a) and (b). Moreover, as Figure 3(c) shows, these two kinds of binary labels are essentially the same. For simplicity, we use the $\{-1,+1\}$ binary label to describe the problem. The logistic loss function corresponding to the $\{-1,+1\}$ binary label is expressed as
$$L = \log\left(1 + e^{-y_i v_i}\right) \tag{1}$$
where $y_i$ and $v_i$ denote the label value and the response value of the sample $x_i$, respectively. Denoting $y_i \cdot v_i$ as $t_i$, the logistic loss function $L$ can be expressed as
$$L = \log\left(1 + e^{-t_i}\right) \tag{2}$$
Theoretical derivation and experiment (see Appendix 1 for details) indicate that $t_i$ approximately converges to a constant $c$. Hence the response values $v_i$ of target samples and background samples converge to $c$ and $-c$, respectively. As shown in Figure 4, although the response value can distinguish target samples from background samples, samples belonging to the same class cannot be distinguished from one another. This disadvantage results in target drift and weakens the classification ability in the tracking phase.

Soft labeling with quasi-Gaussian structure for deep classification tracking
To overcome the drawbacks of the binary labeling, we propose a soft labeling with quasi-Gaussian structure, in place of the binary labeling, to enhance the classification and localization ability of deep classification tracking. The proposed soft labeling takes into account the difference among samples of the same class and of different classes simultaneously. As with the binary label, the signs of the labels for positive and negative samples are set to plus and minus, respectively, to distinguish samples of different classes. Further, to exploit the difference among samples in the same class, the label values of different samples belonging to the same class are no longer identical but depend on their IoU values.
As analyzed above, $t_i = y_i v_i$ converges to a constant $c$, so the response values $v_i$ are inversely proportional to the label values $y_i$. For the response value of a sample to further represent its probability of being the target, and thereby distinguish samples in the same class, the label value should be designed to be inversely correlated with that probability. IoU characterizes the overlap rate between a sample and the target, and can represent this probability to some extent. Therefore, as shown in equation (3), the proposed soft labeling is designed as a two-piece continuous quasi-Gaussian combination function of IoU that distinguishes samples of different classes and of the same class simultaneously. In equation (3), $\theta$, which lies between 0 and 1, is the IoU threshold dividing positive and negative samples; $i$ is the sample index; $p$ and $n$ denote positive and negative samples; $\mu$ and $\sigma$ are the mean and standard deviation of the Gaussian distribution; and $\alpha$ and $\lambda$ are the scale and bias factors. In addition, the constraints shown in equation (4) should be satisfied. The first makes the label value of samples in the same class inversely correlated with the target probability, that is, with IoU. The others keep the absolute values of the labels below 1 to enlarge the difference between positive and negative samples.
The function curve of the soft labeling with quasi-Gaussian structure and its corresponding response function curve are shown in Figure 5. As with the binary labeling, the response values of target and background samples are always positive and negative, respectively, so that the difference between them is large enough to distinguish them well. Unlike the binary labeling, however, the difference between the response values of target and background samples becomes more significant, which enhances the classification ability of deep classification trackers. More importantly, the response values of samples belonging to the same class are no longer identical but positively correlated with their IoU values, which makes the target localization more accurate.
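The qualitative behavior described above can be illustrated with a small sketch. It is not the authors' equation (3), whose exact form and parameter values are not reproduced here; the functional form, the threshold `THETA`, and every scale, mean, standard deviation, and bias value below are hypothetical choices that merely satisfy the stated properties (sign by class, label magnitude below 1, label decreasing in IoU within each class). The response is taken as $v = c / y$, following the convergence of $t = y v$ to $c$, with $c = 7$ borrowed from Appendix 1.

```python
import math

THETA = 0.5  # hypothetical IoU threshold separating target and background
C = 7.0      # constant that t = y * v approaches (value reported in Appendix 1)


def quasi_gaussian(iou, scale, mean, std, bias):
    """Scaled and shifted Gaussian bump; all parameters are illustrative."""
    return scale * math.exp(-((iou - mean) ** 2) / (2.0 * std ** 2)) + bias


def soft_label(iou):
    """Hypothetical two-piece soft label: positive for targets, negative otherwise.

    The magnitude peaks at THETA, so within each class the signed label
    decreases as IoU grows, and the magnitude always stays below 1.
    """
    magnitude = quasi_gaussian(iou, scale=0.5, mean=THETA, std=0.3, bias=0.3)
    return magnitude if iou > THETA else -magnitude


def response(iou):
    """Response implied by t = y * v approaching C, i.e. v = C / y."""
    return C / soft_label(iou)


ious = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
labels = [soft_label(v) for v in ious]
responses = [response(v) for v in ious]

for v, y, r in zip(ious, labels, responses):
    print(f"IoU={v:.1f} label={y:+.3f} response={r:+.2f}")
```

Under these assumed parameters the response rises with IoU inside each class and jumps from negative to positive at the threshold, and because the label magnitudes are below 1, the gap between the two classes is wider than the $\pm c$ produced by the binary label.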
Figure 6 shows a diagram of deep classification trackers trained with the soft labeling. Unlike trackers trained with the binary labeling (Figure 4), these trackers can exploit the difference among samples of the same class and of different classes simultaneously in the training phase. In the tracking phase, the sample with the maximum IoU value is preferentially regarded as the target, so that the tracker can locate the target more accurately. The tracker thus gains better classification and localization ability. As for tracking speed, we only replace the binary label with the proposed soft labeling in the offline training phase, which does not affect the amount of computation in the online tracking phase. Therefore, the soft labeling can significantly improve tracking accuracy without affecting tracking speed.
To verify the effectiveness of the proposed soft labeling with quasi-Gaussian structure, we apply it to SiamFC and denote the variant SiamFC-label. Since in SiamFC the IoU value of a sample is negatively correlated with the center distance between the sample and the search region, we model this relationship as $I_i = -\beta R_i + 1$, where $R_i$ and $\beta > 0$ denote the center distance and the negative correlation coefficient, respectively. The soft labeling of SiamFC-label is then expressed as equation (5), and it is visualized in Figure 7.

SiamFC trained with the soft labeling
As Figure 7(c) and (d) show, SiamFC-label has two advantages over SiamFC: (1) the response values of samples belonging to the same class are no longer identical but positively correlated with their IoU values; (2) the difference among samples of different classes is more significant. Owing to these two advantages, SiamFC-label can locate the target more accurately and shows better classification ability in the online tracking phase, as Figure 2 shows. Moreover, SiamFC-label differs from SiamFC only in the parameter values of the pre-trained network, so the amount of computation in the online tracking phase is unaffected. Therefore, SiamFC-label can deliver significantly improved tracking accuracy while retaining high real-time performance.

Experiments
To evaluate the effectiveness of the soft labeling with quasi-Gaussian structure, we compare SiamFC-label with the baseline tracker and with state-of-the-art trackers on the OTB-2015 40 and VOT 41 benchmark datasets. In this section, we first introduce the implementation details. Next, we compare the variant SiamFC-label with the baseline tracker on these benchmarks. Then, we evaluate the proposed method on the OTB-2015 and VOT benchmarks in comparison with state-of-the-art trackers. Lastly, we present an attribute-based performance analysis to further illustrate how the proposed soft labeling improves the localization precision and classification ability of deep classification trackers.

Implementation details
In this article, the experiments are conducted on the popular OTB-2015 and VOT-2016 benchmarks. The OTB-2015 benchmark contains 100 challenging sequences covering various tracking scenarios and challenges. It provides two evaluation metrics: overlap success rate and distance precision (DP). The overlap success plot shows the fraction of bounding boxes whose IoU score is larger than a given threshold, and the area under the curve (AUC) of this plot is used to rank the trackers. The DP plot shows the DP for different distance thresholds; usually, the DP at 20 pixels is used to rank the trackers. On the OTB-2015 benchmark, all trackers are evaluated with one-pass evaluation (OPE). The VOT-2016 benchmark is the fourth VOT challenge and includes 60 sequences. The expected average overlap (EAO), accuracy, robustness, average overlap (AO), and equivalent filter operations (EFO) are used to evaluate trackers on VOT-2016. The main metric, EAO, reflects the overall performance of a tracker.
Our tracker is implemented in Matlab using MatConvNet. 42 SiamFC with three scales is selected as the baseline tracker, since this version runs faster than the five-scale version with only slightly lower performance. We set the parameters of the soft labeling with quasi-Gaussian structure in equation (5) as listed in Table 1, in accordance with the constraints in equation (4), which makes the difference between target and background samples more significant than with the binary label. The other parameters, such as the stride, the center distance, and the negative correlation coefficient, are set to the same values as in SiamFC 11 for the comparisons with the baseline trackers. We randomly sample from the ILSVRC15 18 dataset to train the parameters of the Siamese network by minimizing the loss with SGD, using the deep learning toolbox MatConvNet. Our machine is equipped with a single NVIDIA GeForce 1080Ti and an
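The two OTB-2015 metrics described above can be sketched as follows. This is a simplified illustration, not the benchmark toolkit: the function names, the 21 evenly spaced overlap thresholds, the approximation of AUC by the mean success rate over those thresholds, and the per-frame values in the example are all assumptions.

```python
def success_rates(ious, num_thresholds=21):
    """Fraction of frames whose IoU exceeds each threshold in [0, 1]."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(1 for v in ious if v > t) / len(ious) for t in thresholds]
    return thresholds, rates


def success_auc(ious, num_thresholds=21):
    """Area under the success plot, approximated by the mean success rate."""
    _, rates = success_rates(ious, num_thresholds)
    return sum(rates) / len(rates)


def distance_precision(center_errors, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold (pixels)."""
    return sum(1 for e in center_errors if e <= threshold) / len(center_errors)


# Hypothetical per-frame results for a five-frame sequence.
frame_ious = [0.92, 0.71, 0.48, 0.33, 0.0]
frame_center_errors = [2.0, 5.0, 12.0, 25.0, 60.0]

auc = success_auc(frame_ious)
dp20 = distance_precision(frame_center_errors)
print(f"success AUC = {auc:.3f}, DP@20px = {dp20:.3f}")
```

Lowering the `threshold` argument of `distance_precision` (for example to 2, 4, 6, 8, or 10 pixels) gives the stricter localization criteria used later in the attribute-based analysis.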

Comparisons with baseline trackers
For a more comprehensive evaluation of the proposed soft labeling with quasi-Gaussian structure, we compare SiamFC-label with its baseline tracker on the OTB-2015 and VOT-2016 benchmarks. Note that SiamFC 11 provides two tracking models, denoted SiamFC-color and SiamFC-colorgray in this article. The difference between them is that SiamFC-colorgray converts 25% of the training pairs to grayscale in order to handle grayscale videos. We replace the binary labeling of these two trackers with the proposed soft labeling in the training phase and denote the variants SiamFC-label-color and SiamFC-label-colorgray, respectively.
For SiamFC-label-color, only the label differs from SiamFC-color; all other hyper-parameters are the same. The experimental results in Figure 8 indicate that SiamFC-label-color improves on SiamFC-color by 1.8% and 1.9% overall in terms of the precision and success metrics on the OTB-2015 benchmark. Moreover, SiamFC-label-color performs better than SiamFC-colorgray on both metrics, even without the trick for handling grayscale videos.
To maximize the improvement brought by the proposed soft labeling, we adjust the hyper-parameters and adapt the trick for handling grayscale videos for SiamFC-label-colorgray. (1) Hyper-parameters: As described in the "Soft labeling with quasi-Gaussian structure for deep classification tracking" section, the soft labeling makes the difference between the response values of different classes more significant, which is more conducive to classifying samples but slows convergence. Thus, compared with the 50 training epochs of SiamFC, 11 we train for two more epochs, 52 in total. The learning rate of the first 50 epochs is decayed geometrically after each epoch from $10^{-2}$ to $10^{-5}$, consistent with SiamFC, 11 while the learning rates of the last two epochs are 9.3260e-06 and 8.1113e-06, respectively. (2) The trick for handling grayscale videos: Instead of the trick in SiamFC, 11 we adopt the trick from SiamFC-tri 38 of re-training a dedicated gray network with all pairs converted to grayscale. For this gray network, we only convert all pairs to grayscale, while the other training hyper-parameters are the same as for the color network. As shown in Figure 8, compared with SiamFC-color, SiamFC-label-colorgray achieves improvements of 3.5% and 2.7% on the precision and success metrics, respectively. Further, SiamFC-label-colorgray achieves overall improvements of 2% and 0.8% on the precision and success metrics, respectively, in comparison with SiamFC-colorgray. In addition, we take SiamFC-label-color as the representative of SiamFC-label for comparison with the baseline tracker SiamFC on the VOT-2016 benchmark. As shown in Table 2 and Figure 9, compared to the baseline tracker, SiamFC-label(-color) performs more favorably

Comparisons with state-of-the-art trackers
We compare the trackers SiamFC-label-color and SiamFC-label-colorgray with state-of-the-art trackers using OPE with the DP and overlap success metrics of the OTB-2015 benchmark. The compared trackers mainly include LCT, 43 KCF, 44 SRDCF, 45 SAMF, 46 DSST, 47 MEEM, 48 and CFNet. 36 As shown in Figure 10, SiamFC-label-colorgray and SiamFC-label-color achieve the first and fourth best DP (79.1% and 77.4%), respectively, and the second and third best performance on the success metric (59.0% and 58.2%). Although SiamFC-label-colorgray and SiamFC-label-color rank slightly lower than SRDCF on the success metric, their speeds (85.7 Fps and 86.3 Fps) are much higher than that of SRDCF (5 Fps), as shown in Table 3. Further experiments on the VOT-2016 benchmark are performed against state-of-the-art trackers, which mainly include MDNet_N, 9 DPT, 49 SiamFC, 11 deepMKCF, 50 DAT, 51 KCF, 44 SAMF, 46 and DSST. 47 As shown in Figure 9 and Table 2, SiamFC-label(-color) performs comparably with the state-of-the-art trackers in terms of EAO, ranking second on the VOT-2016 benchmark. In particular, SiamFC-label(-color) achieves the best accuracy among all the compared trackers.

Attribute-based performance analysis
An extensive performance analysis of localization precision and classification ability is presented to further illustrate the effectiveness of the proposed soft labeling. As with the experiments in the "Comparisons with state-of-the-art trackers" section, we select SiamFC-color as the baseline tracker and compare SiamFC-label-color with SiamFC-color and SiamFC-colorgray on the OTB-2015 dataset to rule out other interfering factors.
Locating ability improvement: We select 2, 4, 6, 8, and 10 pixels instead of 20 pixels as thresholds for the precision metric, and then compare the overall performance of SiamFC-label-color, SiamFC-color, and SiamFC-colorgray. Figure 11 presents the percentage improvement in localization precision of SiamFC-label-color over SiamFC-color and SiamFC-colorgray at the different thresholds. The experimental results indicate that the smaller the threshold (i.e. the stricter the localization precision requirement), the larger the improvement achieved by SiamFC-label-color. This further illustrates that the proposed soft labeling can enhance the localization ability of deep classification trackers.
More specifically, experiments on the Car4 sequence are presented in Figure 12 to demonstrate the improvement in localization ability brought by the soft labeling. Note that the location error of SiamFC-label-color is, overall, lower than that of SiamFC-color and SiamFC-colorgray. This indicates that SiamFC-label-color locates the target more accurately than SiamFC-color and SiamFC-colorgray on this sequence.
Classification ability improvement: Besides the localization ability, the proposed quasi-Gaussian combination soft label can also enhance the classification ability, because important information about the difference among samples in the same class is added in the training phase. Qualitative results on four sequences are presented in Figure 13, where SiamFC-color and SiamFC-colorgray both fail to track targets that undergo large appearance changes, whereas SiamFC-label-color locates them robustly.

Conclusions
In this article, we revisit the binary labeling for deep classification trackers and identify its problems through theoretical and experimental analysis.
To solve these problems, we propose a soft labeling with quasi-Gaussian structure, in place of the binary labeling, to enhance the classification and localization ability of deep classification tracking; it takes into account the difference among samples of the same class and of different classes simultaneously. To verify its effectiveness, we apply it to the deep classification tracker SiamFC and compare the variant with its baseline tracker and with state-of-the-art trackers on the OTB-2015 and VOT benchmark datasets. We also present an extensive attribute-based performance analysis to further illustrate the validity of the proposed soft labeling. Beyond SiamFC, the proposed soft labeling with quasi-Gaussian structure is also applicable to other deep classification tracking algorithms, which we leave as future work. Moreover, in various real-world applications such as robots and unmanned surface vessels (USVs), the proposed method has the potential to achieve more precise and robust tracking performance.

Appendix 1
The first derivative of the logistic loss $L$ with respect to $t_i$ is always less than 0, that is, $\frac{\partial L}{\partial t_i} = -\frac{1}{1 + e^{t_i}} < 0$, so the loss function decreases monotonically with respect to $t_i$. The second derivative with respect to $t_i$ is always greater than 0, that is, $\frac{\partial^2 L}{\partial t_i^2} = \frac{e^{t_i}}{(1 + e^{t_i})^2} > 0$, so the first derivative increases monotonically with respect to $t_i$.
The gradient descent update is expressed as
$$t_i^{n+1} = t_i^{n} - \alpha \frac{\partial L}{\partial t_i^{n}} \tag{1B}$$
where $\alpha > 0$ denotes the learning rate. Since the first derivative with respect to $t_i$ is always less than 0,
$$t_i^{n+1} > t_i^{n}$$
Further, since the first derivative with respect to $t_i$ is monotonically increasing in $t_i$ and always less than 0,
$$\frac{\partial L}{\partial t_i^{n}} < \frac{\partial L}{\partial t_i^{n+1}} < 0$$
Thus, the absolute values of the first derivative at $t_i^{n+1}$ and $t_i^{n}$ satisfy
$$0 < \left|\frac{\partial L}{\partial t_i^{n+1}}\right| < \left|\frac{\partial L}{\partial t_i^{n}}\right| \tag{1E}$$
Substituting equation (1B) into equation (1E) gives
$$0 < \left|t_i^{n+2} - t_i^{n+1}\right| < \left|t_i^{n+1} - t_i^{n}\right| \tag{1F}$$
Equation (1F) indicates that $t_i$ gradually converges to a constant $c$ until $\left|t_i^{n+1} - t_i^{n}\right| < \xi$, where $\xi$ denotes an infinitesimal quantity.
To further validate this theoretical derivation, we conduct experiments on the convergence of gradient descent for different initial values. As Figure 1A shows, $t_i$ converges to a constant $c = 7$ for different initial values when $\xi$ is taken as $10^{-3}$.
Figure 1A. Experiment on the convergence of gradient descent for different initial values. $t_i$ converges to a constant $c = 7$ for different initial values when $\left|t_i^{n+1} - t_i^{n}\right| < \xi$, where $\xi = 10^{-3}$. Each color denotes a convergence case with a different initial value.
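The convergence experiment can be approximated with a short sketch. The learning rate is not stated in the text; the value 1.0 used here is an assumption, as are the function name `descend` and the chosen initial values. With that learning rate, the stopping rule $|t^{n+1} - t^{n}| < \xi$ with $\xi = 10^{-3}$ is first met when $1/(1 + e^{t}) < 10^{-3}$, that is, just above $t = \ln 999 \approx 6.9$, which is consistent with the reported $c \approx 7$.

```python
import math


def descend(t0, learning_rate=1.0, xi=1e-3, max_iters=100000):
    """Gradient descent on L(t) = log(1 + exp(-t)) with the stopping rule |dt| < xi.

    learning_rate=1.0 is an assumed value; the source does not state it.
    Returns the final t and the list of absolute step sizes.
    """
    t = t0
    steps = []
    for _ in range(max_iters):
        grad = -1.0 / (1.0 + math.exp(t))  # dL/dt, always negative
        t_next = t - learning_rate * grad  # update of equation (1B)
        steps.append(abs(t_next - t))
        t = t_next
        if steps[-1] < xi:
            break
    return t, steps


# Hypothetical initial values, all below the stopping point.
results = {t0: descend(t0) for t0 in (-5.0, 0.0, 3.0)}

for t0, (t_final, steps) in results.items():
    print(f"t0={t0:+.1f} -> t={t_final:.3f} after {len(steps)} steps")
```

The step sizes shrink monotonically, as stated by equation (1F), and every run stops near the same value. Note that an initial value already above the stopping point would halt immediately, so the common limit is only observed for starts below it.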