Visual tracking with multilevel feature, similarity attention, color constraint, and global redetection

Visual tracking is fundamental in computer vision tasks. Siamese-based trackers have shown surprising effectiveness in recent years. However, two points have been neglected. Firstly, few of them focus on fusing image-level and semantic-level features in neural networks, which usually results in tracking failure when differentiating the target from other distractors of the same class. Secondly, the robustness of previous redetection schemes is limited by simply expanding the search region. To address these two issues, we propose a novel multilevel feature-weighted Siamese region proposal network tracker, which employs a feature fusion module to construct a discriminative feature embedding and a similarity-based attention module to suppress distractors in the search region. Furthermore, a color-based constraint module is presented to further suppress distractors of the same class as the target. Finally, a well-designed global redetection scheme is built to handle long-term tracking tasks. The proposed tracker achieves state-of-the-art performance on a series of popular benchmarks, including object tracking benchmark 2013 (0.699 in success score), object tracking benchmark 2015 (0.700 in success score), visual object tracking 2017 (0.470 in expected average overlap score), and visual object tracking 2018 (0.485 in expected average overlap score).


Introduction
Visual tracking is the task of tracking a target through an image sequence, given the target's bounding box in the first frame as the template. With the progress of computer vision, this task has received close attention in recent years and is widely used in intelligent applications such as robotics, autonomous cars, and video surveillance. In the development of visual tracking, many excellent works have been proposed. Most of them address challenging problems such as target blur, rotation, deformation, and illumination variation. Furthermore, other works in long-term (LT) visual tracking focus on overcoming target disappearance in a video.
Recently, owing to their trade-off between accuracy and speed, trackers based on the Siamese network 1-15 are considered the mainstream methods. Siamese-based trackers extract features of the template and a search frame with two convolution structures sharing weights and generate a similarity map between the template and this frame. Moreover, some improved Siamese-based trackers can predict target deformation 5-7,9,16-24 or even directly output target segmentation, 15 which has led to significant achievements. However, two key issues remain. Firstly, existing Siamese-based methods perform poorly in distinguishing the target from distractors of the same class as the target. The reason is that conventional Siamese-based trackers utilize a backbone pretrained for the classification task, so the high-level features fail to represent the difference between the target and other objects of the same class. Based on these features, the obtained similarity map is also unreliable for distinguishing the target from others. As shown in Figure 1, the previous method has high responses on the distractors, that is, the referee and athletes marked in red, where the response indicates the similarity between these candidates and the template. Secondly, previous redetection schemes simply expand the search region, which limits the performance of LT tracking.
In this article, we argue that since the high-level cues of the target and a same-class object are similar, low-level cues play the key role in this case. In contrast, high-level cues are essential to distinguish the target from objects of other classes with a similar appearance. Thus, features of different levels are useful under different conditions. For this reason, we propose a multilevel feature-weighted Siamese region proposal network (MFW-SiamRPN). In contrast to previous works, a feature fusion module (FFM) is designed to fuse features of all levels into a unified representation, which is used to encode the similarity between the template and each sliding window in the search region. Then, we design a similarity attention module. It utilizes channel attention to suppress channels that fail to distinguish the template from distractors and spatial attention to suppress the similarities between the template and distractors. Besides, to make full use of the color information lost in the deep network, a color-based constraint module (CCM) is proposed to constrain the network's output. Finally, to track a target in LT scenarios, a simple but efficient global redetection scheme is proposed to detect when the target is out of view and to redetect it.
In general, the main contributions can be summarized as follows. To distinguish the target from other distractors, we propose the MFW-SiamRPN, which applies multilevel cues to the template and search region and suppresses distractors with a similarity attention module. A CCM, which makes full use of color information, is designed to constrain the network output. A well-designed global redetection scheme is proposed, which ensures the robustness of the proposed tracker and outperforms other Siamese-based trackers in LT tracking. Our method outperforms existing state-of-the-art methods on the popular benchmarks: object tracking benchmark (OTB)-2013, 25 OTB-2015, 26 visual object tracking (VOT)-2017, 27 VOT-2018, 28 and VOT-2018-LT. 29 Moreover, abundant ablation studies verify our point of view and the effectiveness of each module.

Review of Siamese-based trackers
Recently, trackers based on the Siamese network have drawn great attention for their remarkable performance in both accuracy and efficiency. These methods employ two identical networks initialized on the classification task (e.g. AlexNet, 30 Visual Geometry Group (VGG), 31 ResNet, 28 and MobileNet 32 ) to extract features from the template and search region and locate the target by matching the template. Generic Object Tracking Using Regression Networks (GOTURN) 33 directly concatenates the features of the template and search region and employs several fully connected layers to locate the target. Siamese Instance Search for Tracking (SINT) 34 constructs vectors of the target and several candidates by region-of-interest pooling and then outputs the candidate most similar to the template. Fully convolutional (FC) Siamese networks (SiamFC) 1 propose a cross-correlation layer to obtain the similarity between the target and the search region, improving accuracy and efficiency. Based on SiamFC, RasNet 3 proposes a residual attention module, which reduces the noise in deep features. As shown in Figure 2(a) and (b), SiamRPN 5,24 and SiamRPN++ 7 use a well-designed region proposal network (RPN) 35 to predict the position and deformation of the target simultaneously and outperform other Siamese-based trackers. However, since the backbones used in these methods are all pretrained on the classification task, the inherent difference between the two tasks, namely classification and tracking, makes the deep features fail to represent the difference between the target and other objects of the same class. We propose MFW-SiamRPN and the CCM to overcome this issue. As shown in Figure 2, different from the others, all levels of features are fused to represent the template and the search region.

LT tracking and redetection scheme
In visual tracking, the target is often occluded or moves out of view. To address this issue, LT tracking redetects the target when it returns to the camera's view. There are mainly two types of methods for LT tracking. Firstly, some methods redetect the target by expanding the search region in the frame. DaSiamRPN 6 and SiamRPN++ 7 establish a reliable confidence score through specially designed training samples but simply enlarge the search region when the target is lost, which limits LT tracking performance. Secondly, a few methods construct a redetection scheme. Skimming-Perusal Long-term Tracking (SPLT) 14 builds a global redetection scheme by training an additional network called the skim module and surpasses methods that merely enlarge the search region. However, the additional skimming model requires feature maps of the whole search region and downsamples them to a vector, which brings a high Graphics Processing Unit (GPU) cost and a risk of low accuracy. The redetection task can be divided into two steps: object detection and target identification. Several simple but effective object proposal methods can locate objects, 36-40 and an excellent Siamese-based tracker can handle target identification. 18 In the section "Redetection scheme for LT tracking," we apply object proposals to locate candidate objects and combine our CCM with the Siamese network to identify the target.

Method
Our method is described in the following order. In the section "Multilevel feature-weighted SiameseRPN," considering that low-level features play a vital role in distinguishing the target from other objects, the MFW-SiamRPN is presented, which weights features of all levels and then suppresses the similarity between false positives and the target. In the section "Color-based constraint module," to take color information into account, a CCM is presented to constrain the network's output. In the section "Redetection scheme for LT tracking," a redetection scheme for LT tracking is proposed to address the case in which the target disappears or is fully occluded.

Multilevel feature-weighted SiameseRPN
Overview. We adopt SiamRPN 5 as the basic tracker. Figure 3 shows the structure of our network. Given the first frame I_1 and the current frame I_n, the whole network takes a template z ∈ R^{3×2w_T×2h_T} from frame I_1 and a search region x from frame I_n as inputs, where w_T and h_T are the width and height of the target. The template is centered on the target, has red, green, and blue (RGB) channels, and has a resolution of 2w_T × 2h_T. Our network aims to locate the target in the search region x. In the following paragraphs, we first introduce the entire network and then present the proposed modules in detail. Mathematically, let F(·) denote the fusion of backbone features (see the section "Multilevel feature-weighted SiameseRPN") and φ(·) denote the feature extraction of the backbone, that is, a ResNet-50 pretrained on ImageNet. 41 The similarity between template z and search region x can be formulated as

f(z, x) = F(φ(z)) ⋆ F(φ(x))

where ⋆ denotes the depth-wise cross-correlation operation. 7 Finally, following ref. 5, we generate many anchor boxes in the search region x as candidates for the target, and the similarity f(z, x) is used to predict the probability that each box is the target by two convolution branches, namely the classification (CLS) branch and the regression (REG) branch. The CLS branch predicts the similarity between each anchor box and the target (denoted by N_A in the section "Color-based constraint module"). The REG branch predicts the deformation and displacement [d_x, d_y, d_w, d_h] of each anchor box. Mathematically, the two branches can be formulated as

CLS(z, x) = C_1(f(z, x)),  REG(z, x) = C_1(f(z, x))

where C_1 denotes a 1 × 1 convolution operation, which adjusts the channels of f(z, x) to 2 (for CLS) and 4 (for REG). By using a SoftMax operation, CLS(z, x) is transformed into a similarity map of the template and search region at the corresponding positions.
During training, we use the SoftMax loss function to supervise the CLS branch, while the REG branch is optimized with the Smooth L1 loss function; the total training loss is the sum of the two. Formally, for the CLS branch, the prediction is denoted as G ∈ R^{2×W_s×H_s} and the label as Ĝ ∈ R^{1×W_s×H_s}; their resolutions are the same as those of the similarity f(z, x). The SoftMax operation is used to compute the probability that pixel (i, j) belongs to the target,

P(i, j) = e^{G(1,i,j)} / (e^{G(0,i,j)} + e^{G(1,i,j)})

and the SoftMax loss can be formulated as

L_cls = −(1 / (W_s H_s)) Σ_{i,j} [Ĝ(i, j) log P(i, j) + (1 − Ĝ(i, j)) log(1 − P(i, j))]

For the REG branch, the prediction is denoted as y and the corresponding label as ŷ; the Smooth L1 loss value at each corresponding position is

L_reg(y, ŷ) = 0.5 (y − ŷ)² if |y − ŷ| < 1, otherwise |y − ŷ| − 0.5

(Figure 3 caption: An FFM is used to aggregate features of all levels into a unified feature map, which is further used to calculate the similarity between the template region and the search region. Subsequently, two attention modules are introduced to improve the similarity map. Finally, the predictor locates the target in the search region. FFM: feature fusion module.)

Feature fusion module. Based on the motivation that both low-level and high-level features are crucial to distinguish the target from others, an FFM is proposed to capture features from all levels. In detail, the stage-n side-output of the backbone is denoted Res n. The features of the first two stages, namely Res 1 and Res 2, are down-sampled to one-eighth of the input image size by two parallel pooling layers, namely average pooling and max pooling. The average pooling captures the relation between channels, and the max pooling gathers another important cue about distinctive object features. 42 Then, these down-sampled features are concatenated with Res 3, Res 4, and Res 5. Finally, a bottleneck structure, namely a sequence of two 1 × 1 convolutions and a 3 × 3 convolution, nonlinearly compresses all the features and reduces the channel number to 512. Formally, the FFM can be formulated as

F(·) = relu(C_1(relu(C_3(relu(C_1(Cat(MP(Res 1), AP(Res 1), MP(Res 2), AP(Res 2), Res 3, Res 4, Res 5)))))))

where MP(·) denotes the max pooling operation, AP(·) the average pooling operation, Cat(·) the concatenation operation, C_1(·) the 1 × 1 convolution operation, C_3(·) the 3 × 3 convolution operation, and relu(·) the Rectified Linear Unit (ReLU) activation.
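The two training losses above can be sketched in a few lines of numpy; this is a straightforward transcription of the standard SoftMax and Smooth L1 formulas, not the authors' training code.

```python
import numpy as np

def smooth_l1(pred, target):
    """Element-wise Smooth L1: 0.5*d^2 where |d| < 1, else |d| - 0.5."""
    d = np.abs(pred - target)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5)

def cls_softmax_loss(logits, labels):
    """Per-pixel binary SoftMax loss for the CLS branch.
    logits: (2, H, W) background/foreground scores; labels: (H, W) in {0, 1}."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # numerically stable
    p_fg = e[1] / e.sum(axis=0)
    eps = 1e-12
    return -np.mean(labels * np.log(p_fg + eps)
                    + (1 - labels) * np.log(1 - p_fg + eps))
```

The Smooth L1 branch keeps gradients bounded for large regression errors, which is why it is preferred over plain L2 for the anchor offsets.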
Since low-level features contain little semantic information, they can be used to distinguish two objects of the same class. In contrast, high-level features can be used to separate objects of different classes. Hence, multilevel features are all necessary to distinguish the target from other objects, and the fused features represent all the objects better than those in previous works.
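The data flow of the FFM can be sketched as pooling plus concatenation. The pooling strides and channel counts below are illustrative only, and the learned 1×1-3×3-1×1 bottleneck that compresses the result to 512 channels is deliberately omitted.

```python
import numpy as np

def pool(x, k, mode="max"):
    """Non-overlapping k x k pooling over a (C, H, W) feature map."""
    C, H, W = x.shape
    x = x[:, :H - H % k, :W - W % k].reshape(C, H // k, k, W // k, k)
    return x.max(axis=(2, 4)) if mode == "max" else x.mean(axis=(2, 4))

def ffm_concat(res1, res2, res3, res4, res5, k1=4, k2=2):
    """Down-sample the two shallow side-outputs with parallel max/avg pooling
    to the spatial size of the deep features, then concatenate all levels
    along the channel axis (the learned bottleneck is omitted here)."""
    parts = [pool(res1, k1, "max"), pool(res1, k1, "avg"),
             pool(res2, k2, "max"), pool(res2, k2, "avg"),
             res3, res4, res5]
    return np.concatenate(parts, axis=0)
```

Note that both pooled variants of each shallow stage are kept, so the fused map carries the max-pooled "distinctive" cue and the average-pooled "contextual" cue side by side.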
Attention based on similarity. As shown in Figure 3, the similarity f(z, x) ∈ R^{C×H_s×W_s} is obtained by the depth-wise cross-correlation 5 of the two fused features. To further enhance the difference between the template and other distractors, we apply attention to the similarity. Specifically, two types of attention are captured sequentially, namely channel attention and spatial attention. These two submodules are motivated by the Convolutional Block Attention Module (CBAM). 42 Firstly, the channel attention module captures the inter-channel relationship of the similarity f(z, x). As shown in Figure 4, it uses average pooling to aggregate spatial information and max pooling to capture the discriminative cue in similarity matching. Then, a shared convolution block, constructed from three 1 × 1 convolutions, fuses both pooled features. Finally, a sigmoid function generates the channel weights A_c(f(z, x)) ∈ R^{C×1×1}. Formally, for an input feature X, the channel weights can be formulated as

A_c(X) = Sig(Conv(AP(X)) + Conv(MP(X)))

where Sig(·) denotes the sigmoid function and Conv(·) the shared convolution block. The spatial attention module A_s(·) captures the inter-spatial relationship of the similarity. Specifically, the feature map is first passed through a channel-wise max-pooling layer and a channel-wise average-pooling layer, producing two one-channel feature maps. For an input feature X with C channels, the channel-wise average and max pooling can be formulated as

X_cavgpool(i, j) = mean(X(0:C, i, j)),  X_cmaxpool(i, j) = max(X(0:C, i, j))

Then, these two one-channel feature maps are concatenated and passed through a 3 × 3 convolution and a sigmoid to generate the spatial attention

A_s(X) = Sig(C_3(Cat(X_cavgpool, X_cmaxpool)))

Finally, the output feature map can be formulated as

f̂(z, x) = A_s(A_c(f(z, x)) ⊙ f(z, x)) ⊙ (A_c(f(z, x)) ⊙ f(z, x))

where ⊙ denotes element-wise multiplication and f̂(z, x) ∈ R^{C×H_s×W_s} denotes the feature after attention.
For the channel attention, since each channel's discriminative ability in f(z,x) is different, the channel attention gives high weights to the channels of similarity f(z,x) with intense discrimination. Thus, compared to directly applying the original similarity to match the target, the weighted similarity better distinguishes the target and other distractors. For the spatial attention, since the pooled features contain each pixel's discriminative channel, the spatial attention module weights all pixels, which enhances the response of the target and suppresses that of the distractors. Thus, our similarity attention enhances the difference between target and distractors in feature representation.

Color-based constraint module
Due to the normalization operations in a convolutional neural network (CNN), the range of normalized color varies across different search regions and templates, which limits the expression of color information. To make full use of color information, we propose the CCM, which takes the boxes predicted by the network as inputs and computes color-based similarities between these boxes and the target. In detail, the CCM contains three steps.
Firstly, assuming that the template z consists of two disjoint regions, the rectangular target region of size w_T × h_T (denoted T) and the rest of the template z, denoted as the background area B, this module constructs a color histogram for each region. Specifically, we represent one pixel of the RGB template by a triplet i ∈ R³. Then, the RGB values of all pixels are reduced by a factor of 16: i′ = ⌊i/16⌋ ∈ R³. Finally, since each value of i′ is an integer from 0 to 15, we count the occurrences of each triplet in region T to construct the normalized color histogram H_T ∈ R^{16×16×16}. In the same way, the histogram H_B ∈ R^{16×16×16} is constructed for region B.
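The histogram construction above is a direct quantization-and-count; a minimal sketch:

```python
import numpy as np

def color_histogram(region):
    """Normalized 16x16x16 RGB histogram of a (H, W, 3) uint8 region.
    Each channel value is integer-divided by 16, so bins cover 0..15."""
    bins = (region // 16).reshape(-1, 3)
    hist = np.zeros((16, 16, 16))
    for r, g, b in bins:
        hist[r, g, b] += 1
    return hist / bins.shape[0]
```

In practice H_T would be computed over the target rectangle and H_B over the rest of the template crop, giving two distributions that can be compared per pixel.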
Secondly, according to the color histograms H_T and H_B, the probability that a pixel i in search region x belongs to the target is obtained by the Bayes rule (see the additional materials for the detailed formula), where I denotes the class of the target. Based on Eq. (11), the probabilities of all pixels in x form a probability map S, as shown in Figure 5(c). Thirdly, we rerank the boxes given by the MFW-SiamRPN. Specifically, let A = (x_A, y_A, h_A, w_A) denote a bounding box with the same representation as in other articles. The target probability of box A is the mean probability of all pixels inside A:

TS_A = (1/|A|) Σ_{i∈A} S(i)

where S(i) denotes the value of pixel i on the probability map S. Then, we combine this color-based probability TS_A with the score given by the network, where N_A is the similarity score predicted by the CLS branch of our MFW-SiamRPN, D_xy is the distance between box A and the center of region x, P_A is the deformation penalty of bounding box A as in ref. 7, and α is a constant balancing the Siamese network and the CCM. By utilizing Eq. (13), a more reliable target prediction is obtained than by the network alone, as shown in Figure 5(d).
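The per-pixel Bayes step and the box score TS_A can be sketched as below. The paper's exact Bayes formula is in its additional materials; this sketch uses the common discriminative form P(target | i) = H_T(i) / (H_T(i) + H_B(i)) found in color-histogram trackers, which should be treated as an assumption.

```python
import numpy as np

def probability_map(search, h_t, h_b, eps=1e-8):
    """Per-pixel target probability from target/background histograms.
    search: (H, W, 3) integer RGB image; h_t, h_b: 16x16x16 histograms.
    Assumed form: P = H_T / (H_T + H_B), an assumption standing in for the
    detailed Bayes formula given in the paper's additional materials."""
    bins = search // 16
    r, g, b = bins[..., 0], bins[..., 1], bins[..., 2]
    pt, pb = h_t[r, g, b], h_b[r, g, b]
    return pt / (pt + pb + eps)

def box_target_score(prob_map, box):
    """TS_A: mean probability of all pixels inside box (x, y, w, h)."""
    x, y, w, h = box
    return prob_map[y:y + h, x:x + w].mean()
```

Because the map is computed once per frame, scoring many candidate boxes is cheap: each TS_A is just a mean over a rectangle.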

Redetection scheme for LT tracking
To extend our network to LT tracking, we propose a redetection scheme, as shown in Figure 6. Since the target score SC given by the CCM is reliable, the redetection scheme is activated if SC < 0.7 × SC_2, where SC_2 denotes the target score in the second frame of the input video. The redetection scheme is composed of three steps: probability map generation, object proposal, and target selection. Firstly, the whole image is taken as input, and a target probability map of this image is generated by Eq. (13) in the CCM. Secondly, we employ an object proposal method, namely Edge Boxes, 39 to obtain various bounding boxes with the same size as the target in the template or in the last frame. For clarity, we denote a bounding box as A and its objectness score predicted by Edge Boxes as OS_A. The boxes with the top 200 objectness scores are retained for the third step. Thirdly, all the bounding boxes {A_i | 1 ≤ i ≤ 200} are reranked by the color-based probability. In more detail, using the target probability map from the first step and Eq. (13), the target probability TS_{A_i} is obtained. Then, the score for LT tracking SL_{A_i} is formulated to rerank all the boxes, where OS_A is the objectness score of bounding box A, TS_A is the CCM score from the section "Color-based constraint module," β is a constant balancing the object proposal module and the CCM, D_xy is the distance between the bounding box and the center of the image, and w_I and h_I are the width and height of the image, respectively. The five bounding boxes with the highest SL_A scores are kept. These candidates have a high probability of containing objects and are similar to the target in color. Finally, five search regions centered on these candidates are fed into our Siamese network to predict the target's location and size.
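The reranking step can be sketched as follows. The exact form of the SL_A score is not reproduced in this text, so the linear combination of OS_A, TS_A, and a center-distance penalty below is an illustrative assumption built from the quantities the scheme is described to use.

```python
import numpy as np

def rerank_proposals(boxes, obj_scores, ts_scores, img_w, img_h,
                     beta=0.5, top_k=5):
    """Rerank object proposals for redetection.
    boxes: list of (x, y, w, h); obj_scores: Edge Boxes objectness OS_A;
    ts_scores: color-based CCM scores TS_A. The weighting below is an
    assumed stand-in for the paper's SL_A formula."""
    boxes = np.asarray(boxes, dtype=float)
    cx = boxes[:, 0] + boxes[:, 2] / 2
    cy = boxes[:, 1] + boxes[:, 3] / 2
    # normalized distance from the image center, mild penalty
    d = np.hypot((cx - img_w / 2) / img_w, (cy - img_h / 2) / img_h)
    sl = (beta * np.asarray(obj_scores)
          + (1 - beta) * np.asarray(ts_scores)
          - 0.1 * d)
    order = np.argsort(-sl)[:top_k]
    return boxes[order], sl[order]
```

The top-k survivors then seed the search regions that the Siamese network verifies, so the color term acts as a cheap filter before the expensive matching.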
The whole redetection scheme for LT tasks is shown in Figure 6. When the target disappears from the search region, it obtains a low SC score, and the tracker activates the global redetection scheme. The object proposal method helps our tracker locate candidate objects, and the CCM helps it filter out objects that are not similar to the target in color. Consequently, the Siamese network predicts the bounding box of the target precisely.

Implementation details

Template patches are resized to 127 × 127 pixels. The search regions are cropped to twice the size of the template patches in the same way. The proposed Siamese network is trained for 20 epochs with a batch size of 20 and an initial learning rate of 10⁻² using the stochastic gradient descent (SGD) solver. The momentum and weight decay are set to 0.1 and 0.5, respectively. The parameters of the backbone are frozen in the first 10 epochs and unlocked in the last 10 epochs to adapt to tracking tasks.

Evaluation on OTB data set

OTB2013 25 consists of 50 labeled videos and brings challenging scenes such as rotation, blur, and occlusion. OTB2015 26 extends OTB2013 to 100 labeled sequences and contains more complicated scenes. The precision score and the success score are used to evaluate performance.
The precision score is the ratio of frames in which the distance between the predicted center and the ground-truth center is within 20 pixels. The success score is the Area Under Curve (AUC) of the success plot. The x-axis of the success plot is the IoU threshold between the predicted and ground-truth boxes (varying from 0 to 1), whereas the y-axis shows the proportion of successful frames, that is, frames satisfying the IoU threshold. One-pass evaluation is employed.
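The two OTB metrics are straightforward to compute from per-frame predictions; a minimal sketch:

```python
import numpy as np

def center_precision(pred_centers, gt_centers, thresh=20.0):
    """Fraction of frames whose predicted center is within `thresh` pixels
    of the ground-truth center (the OTB precision score at 20 px)."""
    d = np.linalg.norm(np.asarray(pred_centers, float)
                       - np.asarray(gt_centers, float), axis=1)
    return float(np.mean(d <= thresh))

def iou(a, b):
    """IoU of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def success_auc(pred_boxes, gt_boxes, n_thresh=101):
    """Area under the success plot: mean success rate over IoU thresholds
    sampled uniformly in [0, 1]."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    ts = np.linspace(0, 1, n_thresh)
    return float(np.mean([(ious >= t).mean() for t in ts]))
```

Averaging the success rate over thresholds (rather than picking a single threshold such as 0.5) is what makes the AUC score robust to the choice of overlap cutoff.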
Our tracker is compared with several state-of-the-art Siamese-based trackers, including SiamFC, 1 parallel tracking and verifying (PTAV), 4 RasNet, 3 Correlation Filter based Network (CFNet), 2 SiamRPN, 5 DaSiamRPN, 6 SiamRPN++, 7 SiamDW, 8 Siamese Cascaded Region Proposal Network (C-RPN), 9 Discriminative Model Prediction (DIMP), 10 GradNet, 11 and Meta-Learning Tracker (MLT). 12 Table 1 presents the precision and success scores of the Siamese-based trackers mentioned earlier. As presented in Table 1, our tracker outperforms these state-of-the-art Siamese-based trackers on both benchmarks. Specifically, it achieves the best precision (0.930 on OTB2013 and 0.915 on OTB2015). For the success score, our tracker also performs favorably against the other evaluated methods (0.699 on OTB2013 and 0.700 on OTB2015). Compared with the AlexNet-based methods, our tracker benefits from the Res50 backbone and outperforms DaSiamRPN by 3.5% in precision on OTB2015. Note that our method achieves only a small improvement over SiamRPN++, which is due to the many gray-scale videos in OTB2015, where no color cue can be utilized. Besides, compared with other methods whose backbones are deep CNNs, our tracker benefits from the FFM, the attention module, and the CCM. The Res50 backbone extracts features that help distinguish the target from the background. Low-level cues utilized by the FFM help our tracker distinguish the target from distractors of the same class, while high-level cues help it distinguish the target from distractors with a similar appearance. The CCM replenishes the color information lost in the normalization operation. The performance of our tracker proves the effectiveness of these modules.
FC: fully convolutional; OTB: object tracking benchmark; RPN: region proposal network; PTAV: parallel tracking and verifying.

Evaluation on VOT data set

VOT2017 27 and VOT2018 28 each consist of 60 sequences and show more difficult scenes than the OTB benchmarks. We use the widely adopted expected average overlap (EAO) score on the baseline benchmark as the evaluation standard. EAO estimates the average overlap a tracker is expected to attain on a large collection of short-term sequences. Besides, we also report accuracy (the average overlap during successful tracking) and robustness (the number of failures). As shown in Table 2, our tracker is compared with state-of-the-art tracking algorithms, including SiamFC, SiamRPN, DaSiamRPN, SiamRPN++, SiamDW, C-RPN, GradNet, DIMP, and UpdateNet. 13 Our approach shows the best EAO score (0.470 on VOT2017 and 0.485 on VOT2018). SiamRPN++ and DIMP have the same backbone as our tracker, and these three Res50-based trackers have similar accuracy scores. Moreover, DIMP shows the fewest tracking failures (0.153), followed by our tracker (0.201). As for the most important EAO score on VOT2018, our tracker surpasses DIMP by 4.5% and SiamRPN++ by 7.1%, which sufficiently proves the effectiveness of the proposed method. FC: fully convolutional; VOT: visual object tracking; A: accuracy; EAO: expected average overlap (larger is better for both); R: robustness score (smaller is better); RPN: region proposal network.

Evaluation on the VOT2018-LT data set
We choose the VOT2018-LT 29 data set, which contains many complete-disappearance scenes, to show the superiority of our redetection scheme. This data set includes 146,847 frames in 35 sequences of various objects. Each sequence contains at least 1000 images and an average of 12 LT object disappearances. The main evaluation protocol of the VOT2018-LT data set is the F1 score, defined by

F1(τ) = 2 Pr(τ) Re(τ) / (Pr(τ) + Re(τ))

where τ is a given confidence threshold and Pr(τ) and Re(τ) are the precision and recall under that threshold, respectively. The threshold varies from 0 to 1 to find the maximum F1 score. Accordingly, we choose trackers specifically optimized for LT tracking for comparison, including SiamFC, DaSiamRPNLT, SiamRPN++LT, PTAVplus, 4 and SPLT. 14 The evaluation results are listed in Table 3.
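The threshold sweep for the maximum F1 score can be sketched as follows. This is a simplified frame-level version assuming a binary per-frame correctness flag; the actual VOT-LT protocol works with overlap values rather than binary flags.

```python
import numpy as np

def max_f1(confidences, correct):
    """Sweep the confidence threshold tau and return the maximum
    F1(tau) = 2 * Pr(tau) * Re(tau) / (Pr(tau) + Re(tau)).
    `correct` marks frames where the prediction is counted as a hit
    (a simplifying assumption; VOT-LT uses overlap values).
    Pr is computed over reported frames, Re over all frames."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=bool)
    best = 0.0
    for tau in np.linspace(0, 1, 101):
        reported = conf >= tau
        if not reported.any():
            continue
        pr = corr[reported].mean()
        re = (corr & reported).sum() / len(corr)
        if pr + re > 0:
            best = max(best, 2 * pr * re / (pr + re))
    return best
```

Sweeping τ rewards trackers that output a meaningful confidence: reporting nothing during true disappearances raises precision without sacrificing recall.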
Compared with trackers that lack a redetection scheme, our tracker has an advantage in the F1 score, since those trackers only enlarge the search region without an appropriate redetection mechanism. Owing to its strong backbone and redetection scheme, our tracker also surpasses SPLT, which builds its own redetection scheme, in the F1 score (an improvement of 2.5%).

Ablation studies
In this section, we first present the positive contribution of the MFW-SiamRPN and the CCM in short-term tracking tasks. Then, an analysis of LT tasks is presented to show the contribution of the redetection scheme in Table 4.

Analysis of FFM.
To investigate the influence of our FFM on the tracking results, the features of the first two layers are trained with a single-output structure. The comparison among rows iv-viii shows that, used independently, the features of Res1 and Res2 perform poorly on both the OTB2015 and VOT2018 benchmarks. Rows ix and xii show that introducing features from Res1 and Res2 improves performance on VOT2018 by 0.6% but reduces performance on OTB2015 by 0.6%, owing to inefficient feature fusion. The performance gap between independent operation (iv-viii) and collaborative operation (ix, xii) of multilevel features confirms our conclusion that all features of the backbone should be taken into account. From ix, xii, and xiii, we find that simply averaging the outputs of multiple layers is not an appropriate solution. The FFM gains an improvement of 1.1% over the averaging baseline, which fully proves its effectiveness.
Analysis of attention structure. From Table 4, we can also deduce the advantages of our attention structure. Rows xiii and xiv show that, with our attention structure, the feature-fused Res50 tracker achieves a competitive performance (0.461 EAO), 4.7% better than SiamRPN++ and 3.0% better than the fused baseline. Moreover, to investigate the impact of different attention types, additional attention evaluations are reported in Table 5. They show that swapping the order of channel attention and spatial attention causes a slight drop of 0.4% in the success score on OTB2015, in contrast to an increase of 0.1% in EAO on VOT2018. On the VOT2018 benchmark, channel attention and spatial attention bring increases of 0.9% and 2.4%, respectively.

Analysis of color-based constraint module. Rows xiv and xv in Table 4 prove the advantages of our CCM. Our tracker gains an improvement of 2.4% on the VOT2018 benchmark. However, our CCM does not work on gray-scale sequences and brings little improvement on OTB2015. Furthermore, we infer that the absence of color information is a general problem for all Siamese networks. Rows ii, x, and xv in Table 4 confirm this speculation: our CCM brings an improvement of 3.4% for DaSiamRPN and 0.7% for SiamRPN++.
Analysis of redetection module. The comparison of rows xv and xvi in Table 4 shows that the MFW-SiamRPN tracker with our global redetection scheme gains an improvement of 2.6% on the VOT2018-LT benchmark. Moreover, any tracker lacking a global redetection process can benefit from our redetection scheme on LT benchmarks. The comparison of rows x and xi in Table 4 shows that SiamRPN++ with our global redetection scheme gains a 0.2% improvement, while rows ii and iii show that it brings an improvement of 1.4% for SiamRPN.

Conclusion
In this article, we present a novel MFW-SiamRPN to track the template through a video. Our contributions are motivated by two intuitions. Firstly, too much attention to high-level features leads to ignorance of the capability of low-level features to differentiate same-class objects. For this reason, the FFM is designed to aggregate features from the lowest to the highest level. Then, similarity attention is proposed to select more discriminative features and suppress the similarity between distractors and the target, and the CCM is proposed to make full use of the color information lost in the deep network. As a result, our method tracks the target stably even when same-class objects interfere. Secondly, to track the target robustly in an LT video, a global redetection scheme is proposed to perceive the target's condition and redetect the target. Numerous experiments on several data sets demonstrate the effectiveness of our method.

Author contribution
Song Guiling and Zhang Jingyi contributed equally to this work; the contributions of the other authors are in the order of the author sequence.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.