
SCIENTIA SINICA Informationis, Volume 50, Issue 6: 862-876 (2020) https://doi.org/10.1360/SSI-2019-0292

Cross-modal video moment retrieval based on visual-textual relationship alignment

  • Received: Dec 31, 2019
  • Accepted: Apr 22, 2020
  • Published: Jun 10, 2020

Abstract

In recent years, the rapid growth of video resources has created demand for fine-grained retrieval of video moments, such as finding highlight moments in sports events or locating specific content for video re-creation. In this context, research on cross-modal video moment retrieval, which aims to output the video moment that matches an input query text, is gradually emerging. Existing solutions primarily focus on global or local feature representations of the query text and video moments, but they ignore the matching semantic relations contained in them. For example, given the query text "a person is playing basketball", existing retrieval systems may incorrectly return a video moment of "a person holding a basketball" because they do not consider the semantic relationship "a person playing basketball". Therefore, this paper proposes a cross-modal relationship alignment framework, referred to as CrossGraphAlign, for cross-modal video moment retrieval. The proposed framework constructs a textual relationship graph and a visual relationship graph to model the semantic relations in the query text and the video moment, respectively, and then evaluates the similarity between textual and visual relations through cross-modally aligned graph convolutional networks to build a more accurate video moment retrieval system. Experimental results on the publicly available cross-modal video retrieval datasets TACoS and ActivityNet Captions demonstrate that the proposed method can effectively exploit semantic relationships to improve the recall rate of cross-modal video moment retrieval.
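As a complement to the description above, the following is a minimal PyTorch sketch of the kind of relationship-graph alignment the abstract describes: a textual relationship graph and a visual relationship graph are each encoded by a small graph convolutional network, textual nodes attend over visual nodes, and the aligned features yield a similarity score for a candidate moment. The module and parameter names (GraphConv, CrossGraphSimilarity, the 300-d text and 2048-d visual feature sizes) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' exact model) of GCN encoding plus
# attention-based alignment between a textual and a visual relationship graph.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """One GCN layer: H' = ReLU(A_hat @ H @ W), with A_hat a row-normalized adjacency."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats, adj):
        adj_hat = adj + torch.eye(adj.size(0))               # add self-loops
        deg = adj_hat.sum(dim=1, keepdim=True).clamp(min=1.0)
        return F.relu(self.linear((adj_hat / deg) @ feats))  # row-normalize, then transform


class CrossGraphSimilarity(nn.Module):
    """Scores how well a visual relationship graph matches a textual one."""
    def __init__(self, text_dim=300, vis_dim=2048, hidden=256):
        super().__init__()
        self.text_gcn = GraphConv(text_dim, hidden)
        self.vis_gcn = GraphConv(vis_dim, hidden)

    def forward(self, text_feats, text_adj, vis_feats, vis_adj):
        t = self.text_gcn(text_feats, text_adj)              # (n_text_nodes, hidden)
        v = self.vis_gcn(vis_feats, vis_adj)                 # (n_vis_nodes, hidden)
        # Cross attention: each textual node attends over visual nodes.
        attn = F.softmax(t @ v.t() / t.size(1) ** 0.5, dim=1)
        v_aligned = attn @ v                                 # (n_text_nodes, hidden)
        # Similarity = mean cosine similarity between text nodes and aligned visual nodes.
        return F.cosine_similarity(t, v_aligned, dim=1).mean()


# Usage: rank candidate moments of a video by similarity to the query graph, e.g.
# scores = [model(q_feats, q_adj, m_feats, m_adj) for (m_feats, m_adj) in candidate_moments]
```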


Funded by

National Key R&D Program of China (2018YFB1004300)

National Natural Science Foundation of China (61703386, U1605251)


References

[1] Zhang H J, Wu J, Zhong D. An integrated system for content-based video retrieval and browsing. Pattern Recognition, 1997, 30: 643-658

[2] Jiang Y G, Ngo C-W, Yang J. Towards optimal bag-of-features for object categorization and semantic video retrieval. In: Proceedings of the ACM International Conference on Image and Video Retrieval, Amsterdam, 2007. 494--501

[3] Snoek C G, Worring M. Concept-based video retrieval. Foundations and Trends in Information Retrieval, 2009, 2(4): 215-322

[4] Liu Y, Albanie S, Nagrani A, et al. Use what you have: video retrieval using representations from collaborative experts. In: Proceedings of the British Machine Vision Conference, 2019

[5] Gao J Y, Sun C, Yang Z H, et al. TALL: temporal activity localization via language query. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, 2017. 5277--5285

[6] Hendricks L A, Wang O, Shechtman E, et al. Localizing moments in video with natural language. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, 2017. 5803--5812

[7] Xu H J, He K, Plummer B A, et al. Multilevel language and vision integration for text-to-clip retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, 2019. 9062--9069

[8] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, 2012. 1106--1114

[9] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 770--778

[10] LeCun Y, Bengio Y, Hinton G E. Deep learning. Nature, 2015, 521(7553): 436-444

[11] Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of Advances in Neural Information Processing Systems, Quebec, 2015. 91--99

[12] Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 4489--4497

[13] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. In: Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, 2013. 3111--3119

[14] Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, 2019. 4171--4186

[15] Zhou Z H. Abductive learning: towards bridging machine learning and logical reasoning. Sci China Inf Sci, 2019, 62: 076101

[16] Lv G Y, Xu T, Chen E H, et al. Reading the videos: temporal labeling for crowdsourced time-sync videos based on semantic embedding. In: Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, 2016. 3000--3006

[17] Zhou P, Xu T, Yin Z. Character-oriented video summarization with visual and textual cues. IEEE Transactions on Multimedia, 2019

[18] Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: towards good practices for deep action recognition. In: Proceedings of the European Conference on Computer Vision, Amsterdam, 2016. 20--36

[19] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of Advances in Neural Information Processing Systems, Montreal, 2014. 568--576

[20] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9: 1735-1780

[21] Chen J Y, Chen X P, Ma L, et al. Temporally grounding natural sentence in video. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Brussels, 2018. 162--171

[22] Liu M, Wang X, Nie L, et al. Attentive moment retrieval in videos. In: Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, 2018. 15--24

[23] Yuan Y T, Mei T, Zhu W W. To find where you talk: temporal sentence localization in video with attention based location regression. In: Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, 2019. 9159--9166

[24] Pennington J, Socher R, Manning C D. GloVe: global vectors for word representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, 2014. 1532--1543

[25] Ge R Z, Gao J Y, Chen K, et al. MAC: mining activity concepts for language-based temporal localization. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa Village, 2019. 245--253

[26] Jiang B, Huang X, Yang C, et al. Cross-modal video moment retrieval with spatial and language-temporal attention. In: Proceedings of the International Conference on Multimedia Retrieval, Ottawa, 2019. 217--225

[27] Wang W, Huang Y, Wang L. Language-driven temporal activity localization: a semantic matching reinforcement learning model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 334--343

[28] Regneri M, Rohrbach M, Wetzel D. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 2013, 1: 25-36

[29] Heilbron F C, Escorcia V, Ghanem B, et al. ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 961--970

[30] Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 6077--6086

[31] Yang X, Tang K, Zhang H, et al. Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 10685--10694

[32] Johnson J, Krishna R, Stark M, et al. Image retrieval using scene graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 3668--3678

[33] Xu D F, Zhu Y K, Choy C B, et al. Scene graph generation by iterative message passing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 3097--3106

[34] Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks. In: Proceedings of the International Conference on Learning Representations, Toulon, 2017

[35] Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, 2018. 7444--7452

[36] Krishna R, Hata K, Ren F, et al. Dense-captioning events in videos. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, 2017. 706--715

[37] Wang S J, Wang R P, Yao Z W, et al. Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Colorado, 2020

[38] Lee K-H, Chen X, Hua G, et al. Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision, Munich, 2018. 212--228

[39] Lin D, Fidler S, Kong C, et al. Visual semantic search: retrieving videos via complex textual queries. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014. 2657--2664

[40] Alayrac J-B, Bojanowski P, Agrawal N, et al. Unsupervised learning from narrated instruction videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 4575--4583

[41] Shou Z, Wang D, Chang S F. Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 1049--1058

[42] Ma S, Sigal L, Sclaroff S. Learning activity progression in LSTMs for activity detection and early detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 1942--1950

[43] Singh B, Marks T K, Jones M, et al. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 1961--1970

[44] Zhao Y, Xiong Y J, Wang L M, et al. Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, 2017. 2933--2942

[45] Chao Y-W, Vijayanarasimhan S, Seybold B, et al. Rethinking the faster R-CNN architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 1130--1139

[46] Lin T W, Zhao X, Shou Z. Single shot temporal action detection. In: Proceedings of the ACM Multimedia Conference, Mountain View, 2017. 988--996

[47] Dai B, Zhang Y Q, Lin D H. Detecting visual relationships with deep relational networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 3298--3308

[48] Yang J W, Lu J S, Lee S, et al. Graph R-CNN for scene graph generation. In: Proceedings of the European Conference on Computer Vision, Munich, 2018. 690--706

[49] Zellers R, Yatskar M, Thomson S, et al. Neural motifs: scene graph parsing with global context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 5831--5840

[50] Schuster S, Krishna R, Chang A, et al. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In: Proceedings of the 4th Workshop on Vision and Language, Lisbon, 2015. 70--80

[51] Han R J, Ning Q, Peng N Y. Joint event and temporal relation extraction with shared representations and structured prediction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, Hong Kong, 2019. 434--444

[52] Rohrbach M, Regneri M, Andriluka M, et al. Script data for attribute-based recognition of composite activities. In: Proceedings of the European Conference on Computer Vision, Florence, 2012. 144--157

[53] Zhang D, Dai X Y, Wang X, et al. MAN: moment alignment network for natural language moment retrieval via iterative graph adjustment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 1247--1257

[54] Zhang S Y, Peng H W, Fu J L, et al. Learning 2D temporal adjacent networks for moment localization with natural language. In: Proceedings of the AAAI Conference on Artificial Intelligence, New York, 2020

[55] Liu M, Wang X, Nie L, et al. Cross-modal moment localization in videos. In: Proceedings of the ACM Multimedia Conference, Seoul, 2018. 843--851

[56] Lin T-Y, Maire M, Belongie S J, et al. Microsoft COCO: common objects in context. In: Proceedings of the European Conference on Computer Vision, Zurich, 2014. 740--755

[57] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115: 211-252

  • Figure 1

    (Color online) An example of cross-modal video moment retrieval

  • Figure 2

    (Color online) An illustration of CrossGraphAlign for cross-modal relationship alignment. The algorithm first constructs a textual relationship graph and a visual relationship graph from the language query and the video, respectively. Then, the visual-textual graph alignment module estimates the video moments most similar to the language query

  • Figure 3

    Building the textual relationship graph. The query text is first parsed into a dependency tree, a scene graph is then built on the parse, and finally the word-to-vector method is used to construct the textual relationship graph (a minimal illustrative sketch follows the tables below)

  • Figure 4

    (Color online) Building the visual relationship graph. Faster R-CNN is modified to extract object features and relationship features

  • Figure 5

    (Color online) Aligning the visual-textual relationship graphs. GCN and attention are utilized to compute the visual-textual relationship features

  • Figure 6

    (Color online) Visualization results from our CrossGraphAlign. Even when the relationship graph is not constructed successfully, CrossGraphAlign can still reasonably predict the similarity between the language query and the video moment. (a) Example 1: when the relationship graph is constructed successfully, CrossGraphAlign performs visual-textual matching correctly. (b) Example 2: when the relationship graph is not constructed successfully, CrossGraphAlign can still reasonably predict the similarity

  • Table 1   Performance of CrossGraphAlign on the TACoS dataset
    Method | R@1, IoU=0.1 | R@1, IoU=0.3 | R@1, IoU=0.5 | R@5, IoU=0.1 | R@5, IoU=0.3 | R@5, IoU=0.5
    MCN [6] | 14.4 | – | 5.9 | 37.4 | – | 10.3
    CTRL [5] | 24.3 | 18.3 | 13.3 | 48.7 | 36.7 | 25.4
    CMIN [54] | 32.5 | 24.6 | 18.1 | 62.1 | 38.5 | 27.0
    SLTA [26] | 23.1 | 17.1 | 11.9 | 46.5 | 32.9 | 20.9
    ABLR [23] | 34.7 | 19.5 | 9.4 | – | – | –
    QSPN [7] | 25.3 | 20.2 | 15.2 | 53.2 | 36.7 | 25.3
    TGN | 41.9 | 21.8 | 18.9 | 53.4 | 39.1 | 31.0
    2D-Tan [52] | 47.6 | 37.3 | 25.3 | 70.3 | 57.8 | 45.0
    Ours | 51.9 | 39.8 | 26.4 | 74.5 | 60.0 | 47.2
  • Table 2   Performance of CrossGraphAlign on the ActivityNet Captions dataset
    Method | R@1, IoU=0.3 | R@1, IoU=0.5 | R@1, IoU=0.7 | R@5, IoU=0.3 | R@5, IoU=0.5 | R@5, IoU=0.7
    MCN [6] | 39.4 | 21.4 | 6.4 | 68.1 | 53.2 | 29.7
    CTRL [5] | 47.4 | 29.0 | 10.3 | 75.3 | 59.2 | 37.5
    QSPN [7] | 52.1 | 33.3 | 13.4 | 77.7 | 62.4 | 40.8
    2D-Tan [52] | 59.5 | 44.5 | 26.5 | 85.5 | 77.1 | 62.0
    Ours | 62.7 | 47.2 | 27.9 | 88.1 | 79.1 | 64.2
  • Table 3   Ablation study of the GCN update information and the attention mechanism. The speed of our framework is evaluated on an NVIDIA TITAN Xp GPU
    Model in GCN (node update / edge update / attention) | R@1, IoU=0.1 | R@1, IoU=0.3 | R@1, IoU=0.5 | R@5, IoU=0.1 | R@5, IoU=0.3 | R@5, IoU=0.5 | Speed (ms)
    52.3 | 40.1 | 26.5 | 74.9 | 60.3 | 47.4 | 752
    50.1 | 38.4 | 25.1 | 72.2 | 58.4 | 45.7 | 568
    × | 48.1 | 37.5 | 25.9 | 71.7 | 58.7 | 45.3 | 503
    × | 51.9 | 39.8 | 26.4 | 74.5 | 60.0 | 47.2 | 352
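To make the pipeline described in the Figure 3 caption concrete, below is a minimal sketch (not the authors' code) of how a textual relationship graph can be assembled once a scene-graph or dependency parser (e.g. the parser of [50]) has produced (subject, relation, object) triples for the query: nodes are the entity words, edges connect related entities, and node and edge features are looked up from a word-vector table. The helper name build_textual_graph and the word_vectors mapping are hypothetical.

```python
# Minimal sketch of textual-relationship-graph construction from parsed triples.
# `word_vectors` is a hypothetical mapping from words to 300-d embeddings (e.g. GloVe).
import numpy as np


def build_textual_graph(triples, word_vectors, dim=300):
    """triples: list of (subject, relation, object) strings parsed from the query."""
    nodes = sorted({w for s, _, o in triples for w in (s, o)})
    index = {w: i for i, w in enumerate(nodes)}

    node_feats = np.stack([word_vectors.get(w, np.zeros(dim)) for w in nodes])
    adj = np.zeros((len(nodes), len(nodes)), dtype=np.float32)
    edge_feats = {}                                   # (i, j) -> relation embedding

    for subj, rel, obj in triples:
        i, j = index[subj], index[obj]
        adj[i, j] = adj[j, i] = 1.0                   # undirected connection for the GCN
        edge_feats[(i, j)] = word_vectors.get(rel, np.zeros(dim))
    return node_feats, adj, edge_feats


# Example for the query "a person is playing basketball":
# node_feats, adj, edge_feats = build_textual_graph([("person", "play", "basketball")], word_vectors)
```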
