SCIENTIA SINICA Informationis, Volume 50, Issue 1: 128 (2020) https://doi.org/10.1360/N112018-00232

## Learning temporal-spatial consistency correlation filter for visual tracking

• Accepted: Feb 15, 2019
• Published: Jan 8, 2020

### Abstract

Discriminative correlation filter-based tracking approaches apply a circular shift operator to the tracked target object (the only accurate positive sample) to generate training data. Under the resulting periodic-extension hypothesis on potential samples, model training and detection can be accomplished efficiently via the FFT. However, real background information is never modeled during the learning process. The background-aware correlation filter (BACF) tracking algorithm uses a binary cropping matrix to acquire real positive and negative samples through dense sampling and thereby models the target's appearance. However, BACF does not exploit temporal and spatial consistency information, so when the target undergoes an abrupt appearance change, the learned correlation filter drifts to the background. To solve this problem, we introduce temporal and spatial consistency constraints into the baseline BACF framework and propose a temporal-spatial consistency correlation filter (TSCF) tracking algorithm, which enables the correlation filter to adapt to appearance mutations between successive frames. The temporal consistency constraint smooths the multi-channel correlation filter along the time series, while the spatial consistency constraint smooths it over the spatial distribution, making the energy distribution of the learned correlation filter more uniform. The TSCF model has a closed-form solution, and the conjugate gradient method is used to approximate the optimal solution of the resulting linear system. Using circulant-matrix properties, the optimization can be transformed into the Fourier domain and solved quickly, which effectively reduces the cost of large matrix computations. On the TB100 public database, TSCF increases distance precision by 5.5% and raises the AUC by 4.3% compared with the baseline BACF algorithm, reaching a distance precision of 0.879 and an AUC of 0.663 using only hand-crafted features. The proposed TSCF algorithm remains robust and effective under challenging conditions such as short-term occlusion, out-of-plane rotation, and in-plane rotation.
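The Fourier-domain machinery the abstract relies on (circular shifts make the training data circulant, so ridge-regression filter learning diagonalizes under the FFT) can be illustrated with a minimal single-channel sketch. This is not the authors' TSCF implementation: the `mu`/`h_prev` temporal term below is a hypothetical simplification of the paper's temporal consistency constraint, and the function names are illustrative; the full multi-channel, spatially constrained TSCF system is solved with conjugate gradients rather than the per-frequency division used here.

```python
import numpy as np

def train_filter(x, y, h_prev=None, lam=1e-2, mu=0.0):
    """Ridge-regression correlation filter, solved per frequency bin.

    Minimizes ||x (*) h - y||^2 + lam*||h||^2 + mu*||h - h_prev||^2,
    where (*) is circular convolution; the circulant structure induced
    by circular shifts diagonalizes the problem under the FFT.
    """
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    num = np.conj(X) * Y
    den = np.conj(X) * X + lam
    if h_prev is not None:  # simplified temporal consistency term
        num = num + mu * np.fft.fft2(h_prev)
        den = den + mu
    return np.real(np.fft.ifft2(num / den))

def detect(h, z):
    """Dense correlation response of filter h on search patch z via the FFT.

    The argmax of the response estimates the target's translation
    (a circular shift of the training patch).
    """
    return np.real(np.fft.ifft2(np.fft.fft2(z) * np.fft.fft2(h)))
```

Training on a patch and detecting on a circularly shifted copy moves the response peak by the same shift, which is the translation-estimation principle this family of trackers builds on; BACF additionally crops the filter with a binary matrix so shifted samples carry real background content.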


• Figure 1

(Color online) Example tracking results of four different methods on three video sequences (Box, Bird2, and BlurOwl) from the TB100 dataset

• Figure 2

(Color online) Comparison between the BACF and TSCF algorithms on the sequences Box and Bird2. (a) Comparison of classifier response peaks on sequence Box; (b) comparison of the correlation filters learned on sequence Bird2

• Figure 3

(Color online) Comparison between the TSCF and BACF algorithms, where red denotes TSCF, blue represents BACF, orange represents BACFPCG-S, pink represents BACFPCG-M, and green refers to TCF. (a) Basketball (IV, OCC, DEF, IPR, BC); (b) Jogging-2 (OCC, DEF, IPR); (c) Diving (SV, DEF, OPR); (d) Skating2-1 (SV, OCC, DEF, FM, IPR); (e) Box (IV, SV, OCC, MB, OPR, IPR, OV, BC, LR); (f) KiteSurf (IV, OCC, OPR, IPR)

• Figure 4

(Color online) Comparison between the TSCF and BACF algorithms, where red denotes TSCF, blue represents BACF, orange represents BACFPCG-S, pink represents BACFPCG-M, and green refers to TCF. (a) Biker (SV, OCC, MB, FM, IPR, OV, LR); (b) ClifBar (SV, OCC, MB, FM, OPR, OV, BC); (c) Bird2 (OCC, DEF, FM, OPR, IPR); (d) DragonBaby (SV, OCC, MB, FM, OPR, IPR, OV); (e) BlurOwl (SV, MB, FM, OPR); (f) Freeman3 (SV, OPR, IPR, LR)

• Figure 5

(Color online) The AUC over 100 video sequences and 11 video attributes on the TB100 database

• Figure 6

(Color online) The distance precision over 100 video sequences and 11 video attributes on the TB100 database

• Figure 7

(Color online) Comparison of the distance precision and AUC of the TSCF algorithm on the TB100 database

• Figure 8

(Color online) Comparison of the distance precision and AUC of the TSCF algorithm on the TC128 database

• Table 1   Comparison of the AUC of 11 state-of-the-art algorithms and our TSCF algorithm across 11 video attributes on the TB50 database. $^{\#}$ and $^*$ denote the first and second best results, respectively.

| Attribute | Total videos | KCF | SAMF | Staple | CFAT | fDSST | SRDCF | CACF | MKCFup | ECOHC | STRCF | BACF | TSCF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TB50 | 50 | 0.399 | 0.392 | 0.517 | 0.529 | 0.504 | 0.540 | 0.542 | 0.546 | 0.601 | 0.606$^*$ | 0.574 | 0.621$^{\#}$ |
| BC | 20 | 0.433 | 0.386 | 0.513 | 0.496 | 0.561 | 0.533 | 0.517 | 0.562 | 0.600 | 0.614$^*$ | 0.561 | 0.620$^{\#}$ |
| DEF | 19 | 0.391 | 0.354 | 0.533 | 0.444 | 0.460 | 0.455 | 0.538 | 0.503 | 0.555 | 0.556$^*$ | 0.551 | 0.572$^{\#}$ |
| FM | 22 | 0.389 | 0.282 | 0.495 | 0.533 | 0.554 | 0.573 | 0.541 | 0.530 | 0.611$^{\#}$ | 0.588 | 0.576 | 0.596$^*$ |
| IV | 20 | 0.410 | 0.411 | 0.511 | 0.485 | 0.534 | 0.521 | 0.530 | 0.550 | 0.579 | 0.594$^{\#}$ | 0.560 | 0.583$^*$ |
| IPR | 29 | 0.368 | 0.376 | 0.466 | 0.511 | 0.474 | 0.474 | 0.502 | 0.512 | 0.563 | 0.575$^{\#}$ | 0.546 | 0.573$^*$ |
| LR | 8 | 0.267 | 0.432 | 0.403 | 0.460 | 0.437 | 0.526 | 0.460 | 0.527 | 0.562$^*$ | 0.559 | 0.518 | 0.585$^{\#}$ |
| MB | 19 | 0.393 | 0.303 | 0.489 | 0.553 | 0.527 | 0.549 | 0.528 | 0.496 | 0.605$^{\#}$ | 0.586 | 0.539 | 0.590$^*$ |
| OCC | 27 | 0.371 | 0.447 | 0.521 | 0.528 | 0.481 | 0.506 | 0.536 | 0.548 | 0.589$^*$ | 0.589$^*$ | 0.557 | 0.603$^{\#}$ |
| OV | 11 | 0.277 | 0.302 | 0.463 | 0.439 | 0.454 | 0.465 | 0.488 | 0.473 | 0.549$^*$ | 0.537 | 0.508 | 0.561$^{\#}$ |
| OPR | 29 | 0.361 | 0.408 | 0.475 | 0.489 | 0.466 | 0.472 | 0.479 | 0.526 | 0.583 | 0.594$^{\#}$ | 0.545 | 0.590$^*$ |
| SV | 34 | 0.344 | 0.370 | 0.470 | 0.502 | 0.491 | 0.509 | 0.493 | 0.523 | 0.593$^*$ | 0.590 | 0.528 | 0.600$^{\#}$ |
| FPS | $-$ | 238$^{\#}$ | 25 | 69 | 5 | 94 | 10 | 43 | 150$^*$ | 55 | 23 | 34 | 3 |
• Table 2   Comparison of the distance precision of 11 state-of-the-art algorithms and our TSCF algorithm across 11 video attributes on the TB50 database. $^{\#}$ and $^*$ denote the first and second best results, respectively.

| Attribute | Total videos | KCF | SAMF | Staple | CFAT | fDSST | SRDCF | CACF | MKCFup | ECOHC | STRCF | BACF | TSCF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TB50 | 50 | 0.589 | 0.561 | 0.687 | 0.710 | 0.684 | 0.723 | 0.730 | 0.730 | 0.821$^*$ | 0.815 | 0.768 | 0.842$^{\#}$ |
| BC | 20 | 0.632 | 0.511 | 0.664 | 0.636 | 0.773 | 0.692 | 0.677 | 0.737 | 0.807 | 0.815$^*$ | 0.745 | 0.824$^{\#}$ |
| DEF | 19 | 0.579 | 0.520 | 0.733 | 0.641 | 0.645 | 0.663 | 0.75 | 0.692 | 0.796 | 0.798$^*$ | 0.750 | 0.801$^{\#}$ |
| FM | 22 | 0.555 | 0.408 | 0.660 | 0.702 | 0.733 | 0.767 | 0.733 | 0.692 | 0.819$^{\#}$ | 0.768 | 0.778 | 0.801$^*$ |
| IV | 20 | 0.631 | 0.564 | 0.681 | 0.634 | 0.737 | 0.706 | 0.723 | 0.730 | 0.779$^*$ | 0.783$^{\#}$ | 0.748 | 0.775 |
| IPR | 29 | 0.549 | 0.518 | 0.635 | 0.695 | 0.655 | 0.637 | 0.704 | 0.679 | 0.782$^*$ | 0.772 | 0.737 | 0.786$^{\#}$ |
| LR | 8 | 0.587 | 0.748 | 0.667 | 0.762 | 0.645 | 0.764 | 0.807 | 0.756 | 0.882$^{\#}$ | 0.823$^*$ | 0.770 | 0.882$^{\#}$ |
| MB | 19 | 0.548 | 0.408 | 0.657 | 0.700 | 0.701 | 0.740 | 0.712 | 0.653 | 0.808$^*$ | 0.767 | 0.724 | 0.809$^{\#}$ |
| OCC | 27 | 0.556 | 0.639 | 0.723 | 0.72 | 0.682 | 0.697 | 0.744 | 0.734 | 0.843$^{\#}$ | 0.808$^*$ | 0.757 | 0.843$^{\#}$ |
| OV | 11 | 0.364 | 0.446 | 0.658 | 0.576 | 0.613 | 0.623 | 0.686 | 0.636 | 0.774$^*$ | 0.726 | 0.724 | 0.795$^{\#}$ |
| OPR | 29 | 0.553 | 0.580 | 0.663 | 0.663 | 0.662 | 0.651 | 0.702 | 0.713 | 0.834$^{\#}$ | 0.810 | 0.737 | 0.821$^*$ |
| SV | 34 | 0.563 | 0.558 | 0.653 | 0.701 | 0.681 | 0.684 | 0.697 | 0.690 | 0.818$^*$ | 0.806 | 0.716 | 0.820$^{\#}$ |
| FPS | $-$ | 238$^{\#}$ | 25 | 69 | 5 | 94 | 10 | 43 | 150$^*$ | 55 | 23 | 34 | 3 |
• Table 3   Comparison of AUC for algorithm validation on the TB100 database

| Algorithm | TB100 | BC | DEF | FM | IV | IPR | LR | MB | OCC | OV | OPR | SV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TSCF | 0.664 | 0.641 | 0.611 | 0.641 | 0.653 | 0.617 | 0.605 | 0.648 | 0.644 | 0.604 | 0.643 | 0.643 |
| TCF | 0.660 | 0.642 | 0.607 | 0.639 | 0.656 | 0.607 | 0.607 | 0.654 | 0.632 | 0.602 | 0.633 | 0.635 |
| BACFPCG-M | 0.634 | 0.609 | 0.600 | 0.617 | 0.627 | 0.586 | 0.605 | 0.640 | 0.606 | 0.584 | 0.604 | 0.626 |
| BACFPCG-S | 0.625 | 0.605 | 0.596 | 0.586 | 0.632 | 0.558 | 0.576 | 0.612 | 0.595 | 0.530 | 0.597 | 0.602 |
| BACFPCG | 0.621 | 0.625 | 0.594 | 0.614 | 0.632 | 0.583 | 0.514 | 0.585 | 0.586 | 0.552 | 0.594 | 0.579 |
• Table 4   Comparison of distance precision for algorithm validation on the TB100 database

| Algorithm | TB100 | BC | DEF | FM | IV | IPR | LR | MB | OCC | OV | OPR | SV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TSCF | 0.879 | 0.849 | 0.828 | 0.841 | 0.831 | 0.844 | 0.895 | 0.841 | 0.864 | 0.819 | 0.878 | 0.860 |
| TCF | 0.863 | 0.831 | 0.824 | 0.814 | 0.818 | 0.817 | 0.887 | 0.824 | 0.835 | 0.820 | 0.852 | 0.834 |
| BACFPCG-M | 0.855 | 0.816 | 0.824 | 0.812 | 0.812 | 0.810 | 0.938 | 0.830 | 0.825 | 0.806 | 0.840 | 0.839 |
| BACFPCG-S | 0.822 | 0.789 | 0.810 | 0.739 | 0.801 | 0.750 | 0.848 | 0.766 | 0.777 | 0.678 | 0.804 | 0.791 |
| BACFPCG | 0.824 | 0.830 | 0.799 | 0.826 | 0.817 | 0.795 | 0.795 | 0.767 | 0.761 | 0.765 | 0.805 | 0.780 |
• Table 5   Comparison of 9 state-of-the-art deep-learning-based visual object tracking algorithms and our TSCF algorithm on the TB100 database. $^{\#}$ and $^*$ denote the first and second best results, respectively.

| Metric | DRT | Sa-Siam | SiamRPN | FlowTrack | StructSiam | DasiamRPN | DSLT | MemTrack | ACT | TSCF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AUC | 0.699$^{\#}$ | 0.610 | 0.637 | 0.655 | 0.621 | 0.617 | 0.660 | 0.642 | 0.643 | 0.664$^*$ |
| DP | 0.923$^{\#}$ | 0.823 | 0.851 | 0.881 | 0.851 | 0.880 | 0.909$^*$ | 0.849 | 0.859 | 0.879 |

Copyright 2020 Science China Press Co., Ltd. All rights reserved.