
SCIENTIA SINICA Informationis, Volume 47, Issue 8: 1008 (2017) https://doi.org/10.1360/N112016-00288

Visual question answering based on spatial DCTHash dynamic parameter network

  • Received Dec 16, 2016
  • Accepted Mar 6, 2017
  • Published Jun 26, 2017

Abstract

As research on deep learning and multimodal fusion continues to advance, question-answering systems have evolved from purely textual input to systems that also incorporate visual information. Visual question answering (VQA) has become an interdisciplinary topic bridging natural language processing and computer vision, and it is receiving considerable attention. The dynamic parameter prediction network proposed by Noh et al. [14] effectively combines question and visual information; however, when that network hashes its weights, the hash codes are assigned to random locations, ignoring the spatial distribution of image content. To overcome this shortcoming, this paper proposes a new spatial-DCTHash dynamic parameter prediction network for multimodal fusion that predicts the answers to visual questions. A conv7 feature map is extracted in a fully convolutional manner to retain spatial visual information, and question-dependent convolution kernels that preserve the visual spatial structure are then generated to predict the answer. The proposed model was compared with commonly used algorithms on two public datasets, COCOqa and MSCOCO-VQA. The experimental results demonstrate that the model is competitive and achieves relatively better performance.
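To make the idea above concrete, here is a minimal sketch (not the authors' implementation) of how question-dependent convolution kernels could be generated with spatially structured hashing. It assumes the question network yields a small shared candidate weight vector, and it places hashed candidate entries only into low-frequency 2-D DCT coefficients before an inverse DCT synthesizes each kernel slice. All names (`predict_candidate`, `spatial_dcthash_kernel`), the dimensions, and the low-frequency cutoff `keep` are illustrative assumptions, and a seeded RNG stands in for a deterministic hash function.

```python
import numpy as np

def dct_basis(k):
    """Orthonormal k x k DCT-II basis matrix (rows are basis vectors)."""
    n = np.arange(k)
    B = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * k))
    B[0, :] /= np.sqrt(k)          # DC row scaling for orthonormality
    B[1:, :] *= np.sqrt(2.0 / k)   # AC rows scaling
    return B

def predict_candidate(q, m, rng):
    """Stand-in for the question network of DPPnet [14]: maps a question
    embedding q to a small shared candidate weight vector of length m.
    In the real model this mapping is learned end to end."""
    W = rng.standard_normal((m, q.shape[0])) / np.sqrt(q.shape[0])
    return np.tanh(W @ q)

def spatial_dcthash_kernel(cand, c_out, c_in, k, keep=3, seed=0):
    """Expand the candidate vector into question-dependent conv kernels.
    Plain weight hashing [15] scatters candidate entries over random
    positions; here, hashed entries fill only low-frequency DCT
    coefficients (u + v < keep), and the inverse 2-D DCT turns each
    coefficient grid into a spatially smooth k x k kernel slice."""
    rng = np.random.default_rng(seed)   # stands in for a hash function
    B = dct_basis(k)
    C = np.zeros((c_out, c_in, k, k))
    low = [(u, v) for u in range(k) for v in range(k) if u + v < keep]
    for o in range(c_out):
        for i in range(c_in):
            for (u, v) in low:
                h = rng.integers(0, cand.size)   # bucket "hash"
                s = rng.choice([-1.0, 1.0])      # sign "hash"
                C[o, i, u, v] = s * cand[h]
    # inverse 2-D DCT per slice: K = B^T C B
    return np.einsum('ab,oibc,cd->oiad', B.T, C, B)

rng = np.random.default_rng(0)
q = rng.standard_normal(300)                 # stand-in question embedding
cand = predict_candidate(q, m=256, rng=rng)  # shared candidate weights
K = spatial_dcthash_kernel(cand, c_out=8, c_in=512, k=3)
print(K.shape)                               # (8, 512, 3, 3)
```

Because only low-frequency coefficients are populated, colliding hash buckets still yield smoothly varying kernels rather than isolated outlier taps, which is one plausible way spatial structure could be preserved relative to purely random hashing.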


Funded by

National Natural Science Foundation of China (61365002, 61462045, 61462042, 61662030)

Science and Technology Project of the Education Department of Jiangxi Province (GJJ150350)


References

[1] Yates A, Cafarella M, Banko M, et al. TextRunner: open information extraction on the web. In: Proceedings of Human Language Technologies: the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Rochester, 2007. 25--26.

[2] Malinowski M, Fritz M. A multi-world approach to question answering about real-world scenes based on uncertain input. In: Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, 2014. 1682--1690.

[3] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the Annual Conference on Neural Information Processing Systems, Lake Tahoe, 2012. 1097--1105.

[4] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations, San Diego, 2015. 1--14.

[5] Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 1--9.

[6] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 770--778.

[7] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735--1780.

[8] Chung J Y, Gulcehre C, Cho K, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. In: Proceedings of the Annual Conference on Neural Information Processing Systems Deep Learning Workshop, Montreal, 2014. 1--9.

[9] Malinowski M, Rohrbach M, Fritz M. Ask your neurons: a neural-based approach to answering questions about images. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1--9.

[10] Donahue J, Hendricks L A, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 2625--2634.

[11] Shih K J, Singh S, Hoiem D. Where to look: focus regions for visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 4613--4621.

[12] Jiang A, Wang F, Porikli F, et al. Compositional memory for visual question answering. arXiv preprint.

[13] Xu K, Ba J L, Kiros R, et al. Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the International Conference on Machine Learning, Lille, 2015. 2048--2057.

[14] Noh H, Seo P H, Han B. Image question answering using convolutional neural network with dynamic parameter prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 30--38.

[15] Chen W L, Wilson J T, Tyree S, et al. Compressing neural networks with the hashing trick. In: Proceedings of the International Conference on Machine Learning, Lille, 2015. 2285--2294.

[16] Chen W L, Wilson J T, Tyree S, et al. Compressing convolutional neural networks. arXiv preprint.

[17] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations, Arizona, 2013.

[18] Kiros R, Zhu Y K, Salakhutdinov R, et al. Skip-thought vectors. In: Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, 2015. 3294--3302.

[19] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 3431--3440.

[20] Ren M, Kiros R, Zemel R. Exploring models and data for image question answering. In: Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, 2015. 2953--2961.

[21] Antol S, Agrawal A, Lu J, et al. VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 2425--2433.

[22] Wu Z B, Palmer M. Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, 1994. 133--138.

[23] Ma L, Lu Z D, Li H. Learning to answer questions from image using convolutional neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, 2016. 3567--3573.

  • Figure 1

    (Color online) Illustrative examples of visual question answering

  • Figure 2

    Deep feature extraction networks for the question and the image: (a) RNN for the question; (b) VGGNet for the image

  • Figure 3

    (Color online) Illustration of hashed weight matrix generation

  • Figure 4

    (Color online) Spatial-DCTHash-based dynamic parameter network

  • Figure 5

    (Color online) Convolutional network for the "conv7" feature

  • Figure 6

    (Color online) Flowchart of spatial-DCTHash-based convolution kernel generation

  • Figure 7

    (Color online) Typical examples of VQA prediction results

  • Table 1   Performance comparisons on the COCOqa dataset (%); the WUPS measure is sketched after the table
    Model WUPS@0.9 WUPS@0.0 Acc
    IMG+BOW [20] 66.78 88.99 55.92
    2VIS+BLSTM [20] 65.34 88.64 55.09
    Ensemble [20] 67.90 89.52 57.84
    ConvQA [23] 65.36 88.58 54.95
    DPPnet (CNN-FIXED) [14] 69.61 90.38 59.52
    Spatial-DCTHash (CNN-FIXED) 69.95 90.47 60.01
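    The WUPS@0.9 and WUPS@0.0 columns in Table 1 report the WUPS score of Malinowski and Fritz [2], which is built on Wu-Palmer (WUP) similarity [22]; WUPS@0.0 is the more lenient threshold. A minimal sketch for the single-word-answer setting of COCOqa, assuming NLTK with the WordNet corpus installed; the function names `wup` and `wups` are illustrative.

```python
from nltk.corpus import wordnet as wn

def wup(a, b, thresh):
    """Best Wu-Palmer similarity over all synset pairs of words a and b,
    down-weighted by 0.1 when it falls below the threshold, following
    the WUPS measure of Malinowski and Fritz [2]."""
    sims = [s1.wup_similarity(s2) or 0.0
            for s1 in wn.synsets(a) for s2 in wn.synsets(b)]
    s = max(sims, default=1.0 if a == b else 0.0)
    return s if s >= thresh else 0.1 * s

def wups(pred, truth, thresh=0.9):
    """Corpus-level WUPS@thresh (%) for single-word answers; thresh=0.0
    corresponds to the lenient WUPS@0.0 column."""
    return 100.0 * sum(wup(a, t, thresh) for a, t in zip(pred, truth)) / len(pred)

# e.g. wups(["cat"], ["kitten"], 0.9) is much lower than with thresh=0.0
```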
  • Table 2   Performance comparisons on the open-ended task of the MSCOCO-VQA val dataset (%)
    Val set Yes/no Number Other All
    DPPnet (CNN-FIXED) [14] 81.05 33.49 41.04 55.07
    Spatial-DCTHash (CNN-FIXED) 80.76 33.85 41.46 55.21
  • Table 3   Performance comparisons on the open-ended task of the MSCOCO-VQA test-dev dataset (%)
    Test-dev set Yes/no Number Other All
    Question [21] 75.66 36.70 27.14 48.09
    Image [21] 64.01 0.42 3.77 28.13
    Q+I [21] 75.55 33.67 37.37 52.64
    LSTM Q [21] 78.20 35.68 26.59 48.76
    LSTM Q+I [21] 78.94 35.24 36.42 53.74
    DPPnet (CNN-FIXED) [14] 80.48 37.20 40.90 56.74
    DPPnet [14] 80.71 37.24 41.69 57.22
    Spatial-DCTHash (CNN-FIXED) 80.54 36.81 42.52 57.51
  • Table 4   Performance comparisons on the open-ended task of the MSCOCO-VQA test-standard dataset (%)
    Test-standard set Yes/no Number Other All
    Human [21] 95.77 83.39 72.67 83.30
    LSTM Q+I [21] -- -- -- 54.06
    DPPnet [14] 80.28 36.92 42.24 57.36
    Spatial-DCTHash (CNN-FIXED) 80.20 35.29 42.94 57.50
  • Table 5   Comparisons of accuracy (%) on different question types on the MSCOCO-VQA val dataset
    Question type Spatial-DCTHash DPPnet Question type Spatial-DCTHash DPPnet
    are there 82.29 83.15 what is the man 50.37 49.5
    what brand 36.2 34.73 which 41.65 38.3
    what room is 82.22 84.11 are these 76.29 77.07
    what color is 51.58 48.73 what are 47.03 47.59
    is 79.96 80.41 what is the 36.77 36.67
    are they 78.15 78.05 where are the 29.82 29.57
    what number is 3.33 2.03 is this a 79.34 79.67
    what sport is 83.91 84.78 can you 76.8 76.37
    are 75.52 75.68 what time 19.32 20.43
    is the 76.21 76.41 what are the 37.37 38.31
    what is the person 51.04 49.14 are there any 74.9 73.73
    how many 39.74 39.59 what color are the 50.63 50.14
    does this 78.73 79.83 why 16.1 15.48
    is there a 89.27 89.34 what is this 51.35 51.02
    is he 80.08 80.59 how many people are in 35.61 34.54
    what 36.7 36.07 do you 81.09 82.72
    does the 78.32 79.6 is this 77.82 78.63
    is the person 76.59 75.89 why is the 16.92 19.19
    where is the 25.69 26.83 what is the color of the 62.51 61.41
    what animal is 62.25 60.95 what is 29.1 29.25
    how 23.07 22.75 could 90.21 90.53
    what is the woman 42.56 41.45 is that a 74.54 73.03
    none of the above 54 54.02 what is in the 35.46 33.78
    who is 24.38 25.61 what does the 22.2 20.73
    is the woman 78.86 77.64 what kind of 45.78 45.87
    are the 75.69 76.03 is it 81.47 83.27
    how many people are 39.92 38.55 is the man 79.61 79.15
    what is on the 34.16 33.51 what is the name 7.07 7.21
    has 78.39 79.45 is there 83.16 84.41
    was 82.82 82.67 what color is the 53.55 51.44
    what type of 44.71 45.56 what color 39.23 36.47
    is this an 80.18 80 is this person 75.25 75.21
    do 75.7 74.62
