SCIENTIA SINICA Informationis, Volume 50, Issue 6: 877-891 (2020) https://doi.org/10.1360/SSI-2020-0014

## Large-scale video semantic recognition based on consistency of segment-level and video-level predictions

• Accepted: Mar 13, 2020
• Published: Jun 10, 2020

### Abstract

Segment-level video semantic recognition, an important task in video analysis, attempts to identify the semantic concepts in short video clips. Labeling video segments is difficult because the number of segments is extremely large and no network tags are available; consequently, only a portion of the video segments are labeled. Determining how to improve the accuracy of semantic recognition for video segments with limited semantic labels is therefore a key challenge in video semantic recognition. This paper proposes a video semantic recognition algorithm based on the consistency of video-level and segment-level predictions. The proposed algorithm introduces the constraint that the semantics of a complete video and of its segments must be consistent, and applies this constraint to filter segment-level prediction results and improve recognition accuracy. The proposed algorithm achieved 82.62% mean average precision on the segment-level semantic recognition task of the large-scale YouTube-8M dataset and ranked second in the 3rd YouTube-8M challenge.


• Figure 1

(Color online) The diagram of consistency of segment-level and video-level predictions

• Figure 2

(Color online) The structure of video semantic recognition models

• Figure 3

(Color online) Diagram of the prediction process. Dotted frames denote video-level predictions, solid frames denote segment-level predictions; correct categories are checked, and categories discarded by our algorithm are crossed out

• Table 1   Video semantic recognition datasets of the past ten years

| Dataset | Year | #Videos | #Labels | Duration | Video-segment labels |
|---|---|---|---|---|---|
| Hollywood2 | 2009 | 3669 | 12 | Long | × |
| HMDB51 | 2011 | 7000 | 51 | Short | × |
| UCF101 | 2012 | 13320 | 101 | Short | × |
| Sports-1M | 2014 | 1133158 | 487 | Short | × |
| FCVID | 2015 | 91233 | 239 | Long | × |
| ActivityNet | 2015 | 28000 | 203 | Long | ✓ |
| YouTube-8M | 2016 | 6100000 | 3862 | Long | ✓ |
| Charades | 2016 | 9848 | 157 | Short | × |
| Kinetics | 2017 | 650000 | 700 | Short | × |
| AVA | 2017 | 1620000 | 80 | Long | × |
| Something Something | 2017 | 220847 | 174 | Short | × |
| Moments in Time | 2017 | 1000000 | 339 | Short | × |
•

Algorithm 1 Model prediction with consistency of video-level and segment-level predictions

Require: videos ${\rm Video}_{i=0}^{I}$, video-level semantic classes $C_v$, segment-level semantic classes $C_s$, candidate-list generation algorithm algo;

for all ${\rm Video}_i$

split ${\rm Video}_i$ into segments ${\rm Segment}_{i,j=0}^{J}$;

${\rm prediction}_i^v$ = VideoPredictionModels(${\rm Video}_i$);

for all ${\rm Segment}_{ij}$

${\rm prediction}_{ij}^s$ = SegmentPredictionModels(${\rm Segment}_{ij}$);

end for

end for

GenerateCandidates(algo)

for each ${\rm Segment}_{ij}$

${\rm candidate}_{ij}^s=\{{\rm prediction}_{ij}^s\cap {\rm candidate}_i^v\}$;

predict the semantic classes $c_s \in C_s$ contained in ${\rm Segment}_{ij}$ in descending order of the scores in ${\rm candidate}_{ij}^s$;

end for

compute the average precision (AP) of every class $c_s \in C_s$;

compute the mean average precision (MAP) over the segment-level classes $C_s$;

function GenerateCandidates(algo)

for each ${\rm Video}_i$

if algo = score ranking constraint then

${\rm candidate}_i^v=\{{\rm prediction}_i^v \mid {\rm prediction}_i^v.{\rm score}\ge \text{sorted}({\rm prediction}_i^v.{\rm score})[k-1]\}$ ($k$ is a hyperparameter);

end if

if algo = score threshold constraint then

${\rm candidate}_i^v=\{{\rm prediction}_i^v \mid {\rm prediction}_i^v.{\rm score}\ge {\rm threshold}\}$ (${\rm threshold}$ is a hyperparameter);

end if

if algo = per-class prediction count constraint then

allow at most $m$ videos to be predicted as each video-level class $c_v \in C_v$; if $m$ is exceeded, remove the lowest-scoring prediction ($m$ is a hyperparameter);

predict the classes $c_v \in C_v$ contained in ${\rm Video}_i$ in descending order of the scores in ${\rm prediction}_i^v$, recording the score of ${\rm Video}_i$ for each predicted class;

update $\text{thresholds}_{c_v}$, the minimum score among the kept predictions of every class $c_v \in C_v$;

${\rm candidate}_i^v=\{{\rm prediction}_i^v \mid {\rm prediction}_i^v.{\rm score}\ge \text{thresholds}_{c_v}\}$;

end if

end for

end function
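As a concrete illustration, the candidate generation and consistency filtering described in Algorithm 1 can be sketched in Python. This is a minimal sketch, not the authors' code: the function names, dictionary layout, and default hyperparameter values are illustrative assumptions.

```python
def generate_candidates(video_preds, algo, k=100, threshold=25e-5, m=2000):
    """Build per-video candidate label sets from video-level predictions.

    video_preds: dict video_id -> {label: score} (hypothetical layout).
    algo: 'rank' (keep top-k scoring labels per video),
          'threshold' (keep labels above a fixed score),
          'per_class' (keep each label only for its m highest-scoring videos).
    """
    if algo == 'rank':
        return {vid: {l for l, _ in sorted(p.items(), key=lambda x: -x[1])[:k]}
                for vid, p in video_preds.items()}
    if algo == 'threshold':
        return {vid: {l for l, s in p.items() if s >= threshold}
                for vid, p in video_preds.items()}
    if algo == 'per_class':
        # Collect all (score, video) pairs per class, keep the top m per class.
        per_class = {}
        for vid, p in video_preds.items():
            for label, score in p.items():
                per_class.setdefault(label, []).append((score, vid))
        keep = {label: {vid for _, vid in sorted(pairs, reverse=True)[:m]}
                for label, pairs in per_class.items()}
        return {vid: {l for l in p if vid in keep[l]}
                for vid, p in video_preds.items()}
    raise ValueError(f'unknown algo: {algo}')


def filter_segments(segment_preds, candidates):
    """Discard segment predictions whose label is not a video-level candidate.

    segment_preds: dict (video_id, segment_id) -> {label: score}.
    """
    return {(vid, seg): {l: s for l, s in p.items() if l in candidates[vid]}
            for (vid, seg), p in segment_preds.items()}
```

Filtering a segment's predictions through the video-level candidate set is exactly the intersection step ${\rm candidate}_{ij}^s=\{{\rm prediction}_{ij}^s\cap {\rm candidate}_i^v\}$; a label survives only if the whole video also supports it.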

• Table 2   Segment MAP comparison for evaluating our approaches. The last four columns report video-segment MAP under cumulative settings

| Model | Video GAP | Base | Fine-tune | Fine-tune + consistency | Fine-tune + consistency + all data |
|---|---|---|---|---|---|
| Mix-NeXtVLAD* | 0.88433 | 0.7373 | 0.78638 | 0.81127 | 0.81548 |
| Mix-EFNL-NeXtVLAD* | 0.88288 | 0.65238 | 0.77147 | 0.80857 | 0.81212 |
| Mix-LFNL-NeXtVLAD | 0.88142 | 0.69911 | 0.77579 | 0.80625 | 0.8093 |
| Mix-SoftDBOF* | 0.88071 | 0.74305 | 0.80582 | 0.81237 | 0.81421 |
| Mix-GatedDBOF* | 0.8802 | 0.73679 | 0.79963 | 0.81122 | 0.81327 |
| Mix-NetFV | 0.88251 | 0.73049 | 0.77949 | 0.80967 | 0.81235 |
| Mix-GRU | 0.87659 | 0.68541 | 0.77332 | 0.80436 | 0.8058 |
| Mix-ResNetLike* | 0.86499 | 0.71616 | 0.7835 | 0.8061 | 0.80928 |
| Mix-ResNetLike-Identity* | 0.86284 | 0.71958 | 0.78614 | 0.80796 | 0.81034 |
| Mix-ResNetLike-Max* | 0.86288 | 0.72541 | 0.78558 | 0.80825 | 0.81100 |
| Ensemble of models (*) | 0.88932 | – | – | – | 0.8262 |
• Table 3   Results of different consistency constraint algorithms
 Constraint algorithm Video-segment MAP No constraint 0.80419 Score ranking constraint: top 100 0.82250 Score threshold constraint: $>$25E$-5$ 0.82326 Prediction numbers per class constraint: 2000 0.82620
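The video-segment MAP metric reported in these tables can be sketched as follows. This is a minimal illustration of per-class average precision averaged over classes, not the official YouTube-8M evaluation code; the function names and data layout are assumptions.

```python
def average_precision(scores, labels):
    """AP for one class: prediction scores and binary ground-truth labels
    over all segments, i.e. the mean of precision@rank at each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(hits, 1)  # 0.0 when a class has no positives


def mean_average_precision(per_class):
    """per_class: list of (scores, labels) pairs, one per semantic class."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)
```

Because AP is computed independently per class and then averaged, discarding low-confidence segment predictions for a class (as the consistency constraints do) can raise that class's AP without affecting the others.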
• Table 4   Results of top 5 teams in the YouTube-8M challenge
 Team Video-segment MAP Layer6 AI 0.83292 BigVid (our team) 0.82620 RLin 0.82551 Bestfitting 0.81707 Last Top GB Model 0.80459


Copyright 2020 CHINA SCIENCE PUBLISHING & MEDIA LTD. All rights reserved.