
SCIENTIA SINICA Informationis, Volume 48, Issue 7: 841-855 (2018) https://doi.org/10.1360/N112017-00286

Pedestrian tracking framework based on deep convolution network and SIFT

  • Received: Dec 27, 2017
  • Accepted: Mar 18, 2018
  • Published: Jul 17, 2018

Abstract

Numerous object-tracking and multiple-pedestrian-tracking algorithms have been proposed in the field of computer vision, but these algorithms do not adequately handle the case in which a pedestrian is partially or fully occluded by another object or person. To achieve efficient pedestrian tracking under various occlusion conditions, this paper presents a pedestrian-tracking framework based on deep learning. First, a pedestrian detector trained with the Faster R-CNN object-detection algorithm serves as the search mechanism of the tracker; compared with traditional gradient-descent-based search, it narrows the search range and improves both accuracy and efficiency. Second, the color histogram and the scale-invariant feature transform (SIFT) are combined as the appearance model of the target. During target matching, a fully convolutional network (FCN) trained for pedestrians, following the FCN semantic-segmentation approach, extracts the pedestrian region in the target model and removes background noise. Finally, extensive experiments on the OTB benchmark demonstrate that the proposed method outperforms other state-of-the-art trackers in terms of precision and success rate.
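The abstract describes the matching step as a combination of a global color-histogram similarity and local SIFT keypoint matching, with the SIFT test used as a fallback (see Algorithm 1 at the end of this article). The following is a minimal sketch of such a combined comparison in Python with OpenCV, which the pseudocode at the end of this article also assumes; the function names, histogram configuration, ratio value, and thresholds are illustrative assumptions, not the authors' exact implementation.

# Illustrative sketch only: global color-histogram similarity plus a
# SIFT ratio-test match count, as described in the abstract. Thresholds,
# weights, and function names are assumptions, not the paper's code.
import cv2

def color_hist_similarity(img_a, img_b, bins=32):
    # Correlation between HSV color histograms of two image patches.
    hists = []
    for img in (img_a, img_b):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        hists.append(hist)
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)

def sift_match_count(img_a, img_b, ratio=0.5):
    # Number of SIFT matches passing Lowe's ratio test (cf. Figure 3).
    sift = cv2.SIFT_create()
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    _, des_a = sift.detectAndCompute(gray_a, None)
    _, des_b = sift.detectAndCompute(gray_b, None)
    if des_a is None or des_b is None or len(des_b) < 2:
        return 0
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    return sum(1 for pair in matches
               if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance)

def is_same_pedestrian(target, candidate, hist_thresh=0.7, match_thresh=10):
    # Histogram first; fall back to SIFT matching when the histogram fails,
    # mirroring the two-stage decision in Algorithm 1.
    if color_hist_similarity(target, candidate) > hist_thresh:
        return True
    return sift_match_count(target, candidate) > match_thresh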


Funded by

National Natural Science Foundation of China (Grant No. 61473013)


References

[1] Yilmaz A, Javed O, Shah M. Object tracking: a survey. ACM Comput Surv, 2006, 38: 81-93

[2] Yao R, Shi Q F, Shen C H, et al. Part-based visual tracking with online latent structural learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013. 2363--2370

[3] Adam A, Rivlin E, Shimshoni I. Robust fragments-based tracking using the integral histogram. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, 2006. 798--805

[4] Ross D A, Lim J, Lin R S. Incremental learning for robust visual tracking. Int J Comput Vision, 2008, 77: 125-141

[5] Zhuang B H, Lu H C, Xiao Z Y. Visual tracking via discriminative sparse similarity map. IEEE Trans Image Process, 2014, 23: 1872-1881

[6] Comaniciu D, Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intel, 2002, 24: 603-619

[7] Comaniciu D, Ramesh V, Meer P. Real-time tracking of non-rigid objects using mean shift. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, 2002

[8] Jin X B, Du J J, Bao J. Data-driven tracking based on Kalman filter. Appl Mech Mater, 2012, 226-228: 2476-2479

[9] Ren S Q, He K M, Girshick R. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intel, 2017, 39: 1137-1149

[10] Lowe D G. Distinctive image features from scale-invariant keypoints. Int J Comput Vision, 2004, 60: 91-110

[11] Bouman C A. Similarity-based retrieval of images using color histograms. In: Proceedings of Storage and Retrieval for Image and Video Databases, San Jose, 1998

[12] Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intel, 2017, 39: 640-651

[13] Zhang K H, Zhang L, Yang M H. Real-time compressive tracking. In: Proceedings of European Conference on Computer Vision, Florence, 2012

[14] Zhang K H, Zhang L, Yang M H, et al. Fast tracking via spatio-temporal context learning. ArXiv preprint, 2013

[15] Henriques J F, Caseiro R, Martins P, et al. Exploiting the circulant structure of tracking-by-detection with kernels. In: Proceedings of European Conference on Computer Vision, Florence, 2012

[16] Henriques J F, Caseiro R, Martins P. High-speed tracking with kernelized correlation filters. IEEE Trans Pattern Anal Mach Intel, 2015, 37: 583-596

[17] Zhang B, Li Z, Cao X. Output constraint transfer for kernelized correlation filter in tracking. IEEE Trans Syst Man Cybern Syst, 2017, 47: 693-703

[18] Girshick R. Fast R-CNN. In: Proceedings of IEEE International Conference on Computer Vision, Santiago, 2015. 1440--1448

[19] Ma J, Shao W, Ye H. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans Multimedia, 2018

[20] He K M, Zhang X Y, Ren S Q. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intel, 2015, 37: 1904-1916

[21] Li X, Hu W M, Shen C H. A survey of appearance models in visual object tracking. ACM Trans Intel Syst Technol, 2013, 4: 1-48

[22] Lucas B D, Kanade T. An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence, Vancouver, 1981

[23] Baker S, Matthews I. Lucas-Kanade 20 years on: a unifying framework. Int J Comput Vision, 2004, 56: 221-255

[24] Hager G D, Belhumeur P N. Efficient region tracking with parametric models of geometry and illumination. IEEE Trans Pattern Anal Mach Intel, 1998, 20: 1025-1039

[25] Matthews L, Ishikawa T, Baker S. The template update problem. IEEE Trans Pattern Anal Mach Intel, 2004, 26: 810-815

[26] Alt N, Hinterstoisser S, Navab N. Rapid selection of reliable templates for visual tracking. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, 2010

[27] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Diego, 2005. 886--893

[28] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of International Conference on Learning Representations, San Diego, 2015

[29] Wu Y, Lim J, Yang M H. Online object tracking: a benchmark. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013. 2411--2418

[30] Jia Y Q, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, 2014

[31] Deng J, Dong W, Socher R, et al. ImageNet: a large-scale hierarchical image database. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Miami, 2009. 248--255

[32] Wu Y, Lim J, Yang M H. Online object tracking: a benchmark. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013. 2411--2418

[33] Nam H, Han B. Learning multi-domain convolutional neural networks for visual tracking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 4293--4302

  • Figure 1

    (Color online) The architecture of the Faster R-CNN based pedestrian detection system

  • Figure 2

    (Color online) Comparison of color histograms of two images. (a) Different images of the same pedestrian; (b) images of different pedestrians

  • Figure 3

    (Color online) The effect of SIFT feature matching for the same person at different locations in the video under different ratio thresholds. (a) SIFT feature point distribution of the same person; (b) ratio = 0.3; (c) ratio = 0.4; (d) ratio = 0.5; (e) ratio = 0.6; (f) ratio = 0.8

  • Figure 4

    (Color online) The reduction in background matching points

  • Figure 5

    (Color online) The structure of the fully convolutional network

  • Figure 6

    (Color online) Comparison between the proposed framework and other tracking algorithms based on center location error (CLE). CLE plots for (a) simple scene, (b) blurbody (no occlusion), (c) jogging 1 (occlusion), (d) jogging 2 (occlusion), (e) full occlusion, (f) partial occlusion, (g) pedestrian occlusion, (h) multi-pedestrian occlusion

  • Figure 7

    (Color online) Precision plots for the 8 attributes of the online tracking benchmark and our occlusion videos. Precision plots of OPE for (a) simple scene, (b) blurbody (no occlusion), (c) jogging 1 (occlusion), (d) jogging 2 (occlusion), (e) full occlusion, (f) partial occlusion, (g) pedestrian occlusion, (h) multi-pedestrian occlusion

  • Figure 8

    (Color online) Success plots for the 8 attributes of the online tracking benchmark and our occlusion videos. Success plots of OPE for (a) simple scene, (b) blurbody (no occlusion), (c) jogging 1 (occlusion), (d) jogging 2 (occlusion), (e) full occlusion, (f) multi-pedestrian occlusion, (g) pedestrian occlusion, (h) partial occlusion

  • Figure 9

    (Color online) Illustration of some key frames

  • Table 1  The degree of differentiation between the two methods
    Method Experiment (1) Experiment (2) Experiment (3) Experiment (4)
    Traditional [28] (%) 41.8 33.3 35.4 28.5
    Proposed (%) 43.5 40.1 47.3 36.7
  • Table 2  Video sequence attribute description
    Video sequence Occlusion Illumination variations Scale variations Motion blur Low resolution
    1 N N N N N
    2 N N Y Y Y
    3 Y N N N Y
    4 Y N N N Y
    5 Y N N N N
    6 Y Y Y N Y
    7 Y N Y N N
    8 Y Y Y N N
  •   
    Algorithm 1  Pedestrian tracking framework based on deep convolutional network and SIFT
    1: cap = cv2.VideoCapture(video_path);
    2: while cap.isOpened():
    3:     ok, frame = cap.read();
    4:     if not ok:
    5:         break;
    6:     Get all pedestrians in frame using Faster R-CNN (score > thresh);
    7:     Place all pedestrian coordinates into an array bboxs;
    8:     if len(bboxs) == 0:
    9:         return;
    10:    if the index of frame == 1:
    11:        Get the target image based on the initial bbox;
    12:        Calculate the RGB threshold;
    13:    for bbox in bboxs:
    14:        Obtain the target image;
    15:        Obtain the candidate image based on bbox;
    16:        Compare the similarity of the two images;
    17:        Place all similarities into an array degree;
    18:    Find the maximum similarity in array degree (max_degree);
    19:    if max_degree > RGB threshold:
    20:        cv2.rectangle(frame, bbox(max));
    21:        Update the new target image;
    22:    else:
    23:        for bbox in bboxs:
    24:            Compare the feature matches of the two images based on SIFT;
    25:            Place all feature matches into an array matches;
    26:        Find the maximum matches in array matches (max_match);
    27:        if max_match > match threshold:
    28:            cv2.rectangle(frame, bbox(max));
    29:            Update the new target image;
    30:        else:
    31:            There is no target in the frame.
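    Algorithm 1 compares the target patch with each candidate directly; the abstract additionally describes removing background pixels from the target model with an FCN segmentation mask before this comparison. The following is a minimal sketch of how such a mask could be plugged into the histogram computation; segment_pedestrian is a hypothetical placeholder for the trained FCN and is not part of the paper's published code.

    # Illustrative sketch: masking out background pixels before building the
    # color-histogram target model. segment_pedestrian is a hypothetical
    # stand-in for the trained pedestrian FCN described in the paper.
    import cv2

    def segment_pedestrian(patch):
        # Placeholder for the FCN: it should return a uint8 mask of the same
        # size as patch, with 255 on pedestrian pixels and 0 on background.
        raise NotImplementedError("run the trained pedestrian FCN here")

    def masked_color_hist(patch, bins=32):
        # Color histogram restricted to pedestrian pixels, so that background
        # clutter does not pollute the appearance model.
        mask = segment_pedestrian(patch)
        hsv = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], mask, [bins, bins], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        return hist

    # The masked histograms of the target and of each candidate can then be
    # compared as in Algorithm 1, e.g. cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL).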
