SCIENCE CHINA Information Sciences, Volume 63 , Issue 4 : 140307(2020) https://doi.org/10.1007/s11432-019-2784-4

## Deep feature extraction and motion representation for satellite video scene classification

• Accepted Feb 9, 2020
• Published Mar 9, 2020

### Abstract

Satellite video scene classification (SVSC) is an advanced topic in the remote sensing field that aims to determine the scene category of a satellite video. SVSC is an important and fundamental step in satellite video analysis and understanding, as it provides priors for the presence of objects and dynamic events. In this paper, a two-stage framework is proposed to extract spatial features and motion features for SVSC. The first stage extracts spatial features from satellite videos. Representative frames are first selected based on blur detection and the spatial activity of the videos. A fine-tuned visual geometry group network (VGG-Net) is then transferred to extract spatial features from the spatial content of these frames. The second stage builds a motion representation for satellite videos. First, the motion representation of moving targets is built from the second temporal principal component obtained by principal component analysis (PCA). Second, features from the first fully connected layer of VGG-Net are used as a high-level spatial representation of the moving targets. Third, a small long short-term memory (LSTM) network is designed to encode temporal information. The features from the two stages characterize the spatial and temporal patterns of satellite scenes, respectively, and are finally fused for SVSC. A satellite video dataset is built for video scene classification, comprising 7209 video segments and covering 8 scene categories. The videos are from the Jilin-1 satellites and Urthecast. The experimental results show the efficiency of the proposed framework for SVSC.
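The representative-frame selection in the first stage can be illustrated with a minimal sketch. It assumes grayscale frames stored as NumPy arrays and substitutes two simple proxies that are not taken from the paper: the variance of a discrete Laplacian stands in for the blur detector, and the standard deviation of the gradient magnitude stands in for spatial activity. The function names and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np

def sharpness(frame):
    """Variance of a 4-neighbour Laplacian; low values suggest blur (assumed proxy)."""
    f = np.asarray(frame, dtype=np.float64)
    lap = (f[:-2, 1:-1] + f[2:, 1:-1] + f[1:-1, :-2] + f[1:-1, 2:]
           - 4.0 * f[1:-1, 1:-1])
    return float(lap.var())

def spatial_activity(frame):
    """Standard deviation of the gradient magnitude (assumed proxy)."""
    gy, gx = np.gradient(np.asarray(frame, dtype=np.float64))
    return float(np.hypot(gx, gy).std())

def select_representative_frame(frames, keep_fraction=0.5):
    """Index of the most spatially active frame among the sharpest ones."""
    sharp = np.array([sharpness(f) for f in frames])
    activity = np.array([spatial_activity(f) for f in frames])
    # Keep the sharpest `keep_fraction` of the frames (at least one).
    n_keep = max(1, int(round(keep_fraction * len(frames))))
    candidates = np.argsort(sharp)[::-1][:n_keep]
    # Among the sharp candidates, pick the frame with the highest activity.
    return int(candidates[np.argmax(activity[candidates])])
```

Filtering on sharpness first and ranking on activity second keeps a blurred but high-contrast frame from being chosen; the actual blur metric and selection rule used by the authors may differ.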

### Acknowledgment

This work was supported by National Natural Science Foundation of Key International Cooperation (Grant No. 61720106002), Key Research and Development Project of Ministry of Science and Technology (Grant No. 2017YFC1405100), National Natural Science Foundation of China (Grant No. 61901141), and Fundamental Research Funds for the Central Universities (Grant No. HIT.HSRIF.2020010). The authors would like to thank the IEEE GRSS Image Analysis and Data Fusion Technical Committee for providing Urthecast satellite videos.

### References

[1] Yan C, Xie H, Chen J. A Fast Uyghur Text Detector for Complex Background Images. IEEE Trans Multimedia, 2018, 20: 3389-3398

[2] Wang Q, Huang Y, Jia W. FACLSTM: ConvLSTM with focused attention for scene text recognition. Sci China Inf Sci, 2020, 63: 120103

[3] Zhao J, Guo W, Zhang Z. A coupled convolutional neural network for small and densely clustered ship detection in SAR images. Sci China Inf Sci, 2019, 62: 042301

[4] Marszalek M, Laptev I, Schmid C. Actions in context. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, 2009. 2929-2936

[5] Yan C, Tu Y, Wang X. STAT: Spatial-Temporal Attention Mechanism for Video Captioning. IEEE Trans Multimedia, 2020, 22: 229-241

[6] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), New York, 2006. 2169-2178

[7] Sánchez J, Perronnin F, Mensink T. Image Classification with the Fisher Vector: Theory and Practice. Int J Comput Vis, 2013, 105: 222-245

[8] Cheriyadat A M. Unsupervised Feature Learning for Aerial Scene Classification. IEEE Trans Geosci Remote Sens, 2014, 52: 439-451

[9] Yan C, Li L, Zhang C. Cross-Modality Bridging and Knowledge Transferring for Image Understanding. IEEE Trans Multimedia, 2019, 21: 2675-2685

[10] Othman E, Bazi Y, Alajlan N. Using convolutional features and a sparse autoencoder for land-use scene classification. Int J Remote Sens, 2016, 37: 2149-2167

[11] Otavio A B P, Nogueira K, dos Santos J A. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, 2015. 44-51

[12] Hu F, Xia G S, Hu J. Transferring Deep Convolutional Neural Networks for the Scene Classification of High-Resolution Remote Sensing Imagery. Remote Sens, 2015, 7: 14680-14707

[13] Chaib S, Liu H, Gu Y. Deep Feature Fusion for VHR Remote Sensing Scene Classification. IEEE Trans Geosci Remote Sens, 2017, 55: 4775-4784

[14] Li E, Xia J, Du P. Integrating Multilayer Features of Convolutional Neural Networks for Remote Sensing Scene Classification. IEEE Trans Geosci Remote Sens, 2017, 55: 5653-5665

[15] He N, Fang L, Li S. Remote Sensing Scene Classification Using Multilayer Stacked Covariance Pooling. IEEE Trans Geosci Remote Sens, 2018, 56: 6899-6910

[16] Yi S, Pavlovic V. Spatio-temporal context modeling for BoW-based video classification. In: Proceedings of IEEE International Conference on Computer Vision Workshops (ICCVW), Sydney, 2013. 779-786

[17] Zhao G, Ahonen T, Matas J. Rotation-Invariant Image and Video Description With Local Binary Pattern Features. IEEE Trans Image Process, 2012, 21: 1465-1477

[18] Scovanner P, Ali S, Shah M. A 3-dimensional sift descriptor and its application to action recognition. In: Proceedings of the 15th ACM International Conference on Multimedia (ACMMM), Augsburg, 2007. 357-360

[19] Derpanis K G, Lecce M, Daniilidis K, et al. Dynamic scene understanding: the role of orientation features in space and time in scene classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, 2012. 1306-1313

[20] Wang H, Ullah M M, Kläser A, et al. Evaluation of local spatio-temporal features for action recognition. In: Proceedings of British Machine Vision Conference (BMVC), London, 2009. 1-11

[21] Wang H, Kläser A, Schmid C. Dense Trajectories and Motion Boundary Descriptors for Action Recognition. Int J Comput Vis, 2013, 103: 60-79

[22] Wang H, Schmid C. Action recognition with improved trajectories. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), Sydney, 2013. 3551-3558

[23] Karpathy A, Toderici G, Shetty S, et al. Large scale video classification with convolutional neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, 2014. 1725-1732

[24] Hara K, Kataoka H, Satoh Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, 2018. 6546-6555

[25] Hara K, Kataoka H, Satoh Y. Learning spatio-temporal features with 3D residual networks for action recognition. In: Proceedings of IEEE International Conference on Computer Vision Workshop (ICCVW), Venice, 2017. 3154-3160

[26] Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), Santiago, 2015. 4489-4497

[27] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of International Conference on Neural Information Processing Systems (NeurIPS), Quebec, 2014. 568-576

[28] Donahue J, Hendricks L A, Rohrbach M. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 677-691

[29] Srivastava N, Mansimov E, Salakhutdinov R. Unsupervised learning of video representations using LSTMs. In: Proceedings of International Conference on Machine Learning (ICML), Lille, 2015. 843-852

[30] Ng J Y, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: deep networks for video classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, 2015. 4694-4702

[31] Zhu L, Xu Z, Yang Y. Bidirectional multirate reconstruction for temporal modeling in videos. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017. 1339-1348

[32] Feichtenhofer C, Pinz A, Wildes R P. Temporal residual networks for dynamic scene recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017. 7435-7444

[33] Simonyan K, Zisserman A. Very deep convolutional networks for large scale image recognition. In: Proceedings of International Conference on Learning Representations (ICLR), San Diego, 2015. 1-14

[34] Liu T, Zhang H-J, Qi F. A novel video key-frame-extraction algorithm based on perceived motion energy model. IEEE Trans Circuits Syst Video Technol, 2003, 13: 1006-1013

[35] Sze K-W, Lam K-M, Qiu G. A new key frame representation for video segment retrieval. IEEE Trans Circuits Syst Video Technol, 2005, 15: 1148-1155

[36] Dufaux F. Key frame selection to represent a video. In: Proceedings of International Conference on Image Processing (ICIP), Vancouver, 2000. 275-278

[37] Crete F, Dolmiere T, Ladret P, et al. The blur effect: perception and estimation with a new no-reference perceptual blur metric. In: Proceedings of SPIE, 2007. 64920I

[38] Sahouria E, Zakhor A. Content analysis of video using principal components. IEEE Trans Circuits Syst Video Technol, 1999, 9: 1290-1298

[39] Xia G S, Hu J, Hu F. AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification. IEEE Trans Geosci Remote Sens, 2017, 55: 3965-3981

[40] Tuia D, Moser G, Le Saux B. 2016 IEEE GRSS Data Fusion Contest: Very high temporal resolution from space Technical Committees. IEEE Geosci Remote Sens Mag, 2016, 4: 46-48

[41] Farneback G. Two-frame motion estimation based on polynomial expansion. In: Proceedings of the 13th Scandinavian Conference on Image Analysis (SCIA), 2003. 363-370

[42] KaewTraKulPong P, Bowden R. An improved adaptive background mixture model for real-time tracking with shadow detection. In: Proceedings of the 2nd European Workshop on Advanced Video Based Surveillance System, Boston, 2002. 135-144

• Figure 1

(Color online) Various occlusion phenomena in satellite videos. (a) Moving targets are occluded by the shadow of high buildings in a parking lot scene; (b) a highway scene is occluded by dense trees; (c) moving targets are occluded when they pass through the bottom of the overpass; (d) a highway scene is occluded by thin clouds.

• Figure 2

(Color online) Low imaging quality in satellite videos. (a) Overexposure in a bridge scene; (b) a reference frame of (a) without overexposure; (c) low contrast in a highway scene.

• Figure 3

(Color online) The architecture of the proposed framework for satellite video scene classification (SVSC). Spatial features are extracted by the VGG-Net based on representative frames. Motion representation is built based on temporal principal components, followed by VGG-Net and LSTM for temporal feature encoding. Then spatial features and temporal features are fused. Finally, a softmax classifier is used for SVSC.

• Figure 4

(Color online) Temporal principal components of consecutive video frames. (a) and (b) a pair of consecutive video frames for a highway satellite video scene; (c) moving targets in frame (a) are outlined with red circles; (d) the first temporal principal component; (e) the second temporal principal component.
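A minimal sketch of how temporal principal components of consecutive frames can be computed, assuming grayscale frames stacked in a NumPy array and treating each pixel as one observation with one variable per frame. The function name is hypothetical and this is not the authors' implementation. When the background is nearly static, the first component in this sketch is dominated by content shared across frames, while the second is dominated by frame-to-frame change such as moving targets.

```python
import numpy as np

def temporal_principal_components(frames, k=2):
    """Return the first k temporal principal component images.

    frames: array of shape (T, H, W) holding T consecutive grayscale frames.
    Returns (components, variances): components has shape (k, H, W) and
    variances holds the corresponding eigenvalues in descending order.
    """
    frames = np.asarray(frames, dtype=np.float64)
    t, h, w = frames.shape
    x = frames.reshape(t, -1).T               # (H*W, T): pixels as observations
    x = x - x.mean(axis=0, keepdims=True)     # remove each frame's mean
    cov = x.T @ x / (x.shape[0] - 1)          # (T, T) temporal covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    comps = x @ eigvecs[:, order]             # project pixels onto components
    return comps.T.reshape(k, h, w), eigvals[order]
```

Because the covariance matrix is only T by T, the decomposition stays cheap even for large frames; the sign of each component image is arbitrary, so its absolute value is the safer quantity to inspect.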

• Figure 5

(Color online) The spatial feature extraction network based on the VGG-Net architecture, including 5 blocks of convolutional layers followed by pooling layers and 3 fully connected layers.

• Figure 6

(Color online) A diagram of an LSTM memory unit.
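For orientation, one update of a standard LSTM memory unit can be sketched in NumPy as follows. This assumes the common formulation with input, forget, and output gates and no peephole connections, which may differ from the variant used in the paper; the stacked weight layout and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step.

    x: (D,) input; h_prev, c_prev: (H,) previous hidden and cell states.
    W: (4H, D), U: (4H, H), b: (4H,) stacked in the assumed gate order
    input, forget, output, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell content
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

def encode_sequence(xs, W, U, b):
    """Run the LSTM over a sequence and return the final hidden state."""
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    for x in xs:
        h, c = lstm_step(x, h, c, W, U, b)
    return h
```

In a pipeline like the one described here, `xs` would be the sequence of per-frame feature vectors, and the final hidden state would serve as the temporal encoding.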

• Figure 7

(Color online) Samples of satellite video scenes from the 8 scene classes: airplane, runway, bridge, harbor, highway, intersection, overpass, and parking lot. Four samples are shown per scene class, each represented by its first frame.

• Figure 8

Normalized confusion matrices for SVSC using a specific video frame. (a) The first frame with CNN+SVM; (b) the last frame with CNN+SVM; (c) the representative frame with CNN+SVM; (d) the first frame with fine-tuned VGG-Net; (e) the last frame with fine-tuned VGG-Net; and (f) the representative frame with fine-tuned VGG-Net. The rows and columns of each matrix denote the predicted and actual labels, respectively.

• Figure 9

Normalized confusion matrices for SVSC exploiting spatio-temporal information. (a) C3D [26]; (b) two-stream CNN [27]; (c) T-ResNet [32]; (d) our proposed method. The rows and columns of each matrix denote the predicted and actual labels, respectively.
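A minimal sketch of a confusion matrix laid out as in these captions, with rows for predicted labels and columns for actual labels. It assumes that normalization is done per actual class, so each column sums to 1 and the diagonal holds per-class accuracy; the captions do not state the normalization direction, and the function name is hypothetical.

```python
import numpy as np

def normalized_confusion_matrix(actual, predicted, n_classes):
    """Rows index predicted labels, columns index actual labels."""
    actual = np.asarray(actual, dtype=int)
    predicted = np.asarray(predicted, dtype=int)
    counts = np.zeros((n_classes, n_classes), dtype=np.float64)
    # Accumulate one count per sample at (predicted, actual).
    np.add.at(counts, (predicted, actual), 1.0)
    col_totals = counts.sum(axis=0, keepdims=True)
    # Normalize each column; leave columns of unseen classes at zero.
    return np.divide(counts, col_totals,
                     out=np.zeros_like(counts), where=col_totals > 0)
```

With this layout, reading down a column shows how the samples of one actual class are distributed over the predicted classes.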

• Table 1   Illustration of the built satellite video dataset

Columns marked "incl." give sample statistics including overlapping areas; columns marked "excl." give sample statistics excluding overlapping areas.

| Sample | Train (incl.) | Test (incl.) | Total (incl.) | Test proportion (incl.) | Train (excl.) | Test (excl.) | Total (excl.) | Test proportion (excl.) |
|---|---|---|---|---|---|---|---|---|
| Harbor | 862 | 294 | 1156 | 0.25 | 121 | 60 | 181 | 0.33 |
| Intersection | 734 | 184 | 918 | 0.2 | 98 | 18 | 116 | 0.16 |
| Highway | 726 | 184 | 910 | 0.2 | 142 | 70 | 212 | 0.33 |
| Bridge | 388 | 167 | 555 | 0.3 | 60 | 35 | 95 | 0.37 |
| Overpass | 589 | 162 | 751 | 0.22 | 119 | 54 | 173 | 0.31 |
| Airplane | 788 | 200 | 988 | 0.2 | 89 | 37 | 126 | 0.29 |
| Runway | 809 | 171 | 980 | 0.17 | 93 | 53 | 146 | 0.36 |
| Parking lot | 734 | 217 | 951 | 0.23 | 84 | 44 | 128 | 0.34 |
| Total number | 5630 | 1579 | 7209 | 0.22 | 806 | 371 | 1177 | 0.32 |

• Table 2   Results of SVSC with specific video frame (%)

| Class | CNN+SVM, 1st frame | CNN+SVM, last frame | CNN+SVM, representative frame | Fine-tuned VGG-Net, 1st frame | Fine-tuned VGG-Net, last frame | Fine-tuned VGG-Net, representative frame |
|---|---|---|---|---|---|---|
| Harbor | 95.92 | 98.98 | 98.98 | 94.56 | 87.07 | 75.17 |
| Intersection | 15.76 | 22.83 | 19.57 | 33.70 | 54.89 | 68.48 |
| Highway | 35.33 | 23.91 | 30.98 | 15.76 | 33.70 | 23.91 |
| Bridge | 0.00 | 0.00 | 0.00 | 0.60 | 7.19 | 1.80 |
| Overpass | 96.91 | 87.04 | 96.30 | 100.00 | 100.00 | 100.00 |
| Airplane | 80.50 | 78.00 | 82.00 | 90.50 | 74.50 | 98.00 |
| Runway | 81.29 | 95.91 | 94.15 | 87.13 | 66.67 | 78.95 |
| Parking lot | 59.45 | 66.36 | 56.68 | 58.53 | 92.63 | 88.02 |
| AA | 58.14 | 59.13 | 59.83 | 60.10 | 64.58 | 66.79 |
| OA | 60.92 | 62.19 | 62.57 | 62.63 | 66.94 | 68.27 |
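The AA and OA rows can be related to the per-class rows with a short calculation. The sketch below assumes AA is the unweighted mean of the per-class accuracies and OA is the mean weighted by the number of test samples per class, taking the test counts from the "including overlapping areas" half of Table 1; the variable and function names are hypothetical.

```python
import numpy as np

# Test-set sizes per class, from Table 1 (including overlapping areas).
test_counts = {
    "Harbor": 294, "Intersection": 184, "Highway": 184, "Bridge": 167,
    "Overpass": 162, "Airplane": 200, "Runway": 171, "Parking lot": 217,
}

# Per-class accuracy (%) from Table 2, CNN+SVM on the first frame.
class_acc = {
    "Harbor": 95.92, "Intersection": 15.76, "Highway": 35.33, "Bridge": 0.00,
    "Overpass": 96.91, "Airplane": 80.50, "Runway": 81.29, "Parking lot": 59.45,
}

def average_accuracy(acc):
    """AA: unweighted mean of the per-class accuracies."""
    return float(np.mean(list(acc.values())))

def overall_accuracy(acc, counts):
    """OA: per-class accuracies weighted by the number of test samples."""
    classes = list(acc)
    a = np.array([acc[c] for c in classes], dtype=np.float64)
    n = np.array([counts[c] for c in classes], dtype=np.float64)
    return float((a * n).sum() / n.sum())
```

Under these assumptions the first CNN+SVM column reproduces the reported AA of 58.14 and OA of 60.92 to within rounding, which also shows why classes with many test samples, such as Harbor, pull OA above AA.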

• Table 3   Results of SVSC by exploiting spatio-temporal information (%)

| Class | C3D [26] | Two-stream CNN [27] | T-ResNet [32] | Proposed method |
|---|---|---|---|---|
| Harbor | 90.48 | 85.03 | 99.32 | 98.30 |
| Intersection | 57.07 | 42.39 | 40.22 | 80.98 |
| Highway | 5.44 | 30.43 | 5.43 | 32.07 |
| Bridge | 1.20 | 0.00 | 4.79 | 0.00 |
| Overpass | 100.00 | 91.36 | 100.00 | 0.94 |
| Airplane | 100.00 | 60.00 | 79.50 | 94.44 |
| Runway | 89.47 | 64.33 | 73.10 | 69.59 |
| Parking lot | 7.37 | 10.60 | 59.45 | 96.77 |
| AA | 55.77 | 48.02 | 57.83 | 70.83 |
| OA | 57.25 | 49.72 | 60.67 | 73.97 |

• Table 4   Results of SVSC considering moving targets (%)
 Our proposed method Movement track length Frame difference Se-TPC Frame difference GMM OA 84.61 86.76 83.85 77.33

Copyright 2020 CHINA SCIENCE PUBLISHING & MEDIA LTD. All rights reserved.