With the continuing development of deep learning and multimodal fusion, question-answering systems have evolved from purely textual systems to ones that incorporate visual information. Visual question answering (VQA) has become an interdisciplinary topic at the intersection of natural language processing and computer vision, and it is attracting considerable attention. The dynamic parameter prediction network proposed by Noh et al. combines question and visual information effectively. However, when that network hashes its weights, the locations of the hash codes are chosen randomly, ignoring the spatial distribution of image content. To overcome this shortcoming, this paper proposes a new spatial DCTHash-based dynamic parameter prediction network for multimodal fusion that predicts the answers to visual questions. A conv7 feature map is extracted in a fully convolutional manner to retain spatial visual information, and question-dependent convolution kernels that preserve the visual spatial structure are then generated to predict the answer. The proposed model is compared with commonly used algorithms on two public datasets, COCO-QA and MSCOCO-VQA; the experimental results demonstrate that it is competitive and achieves comparatively better performance.
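To make the weight-hashing contrast in the abstract concrete, below is a minimal NumPy sketch. Everything in it is an illustrative assumption rather than the paper's implementation: `q_feat` stands for the question-derived vector of candidate weights, `random_hash_kernel` mimics the HashedNets-style random index assignment that the abstract criticizes, and `spatial_dct_hash_kernel` shows one plausible way to index weights by a 2-D DCT frequency ordering so the generated kernels keep a fixed spatial layout.

```python
import numpy as np

def random_hash_kernel(q_feat, shape, seed=0):
    """HashedNets-style baseline: every weight slot draws its value from
    the candidate vector q_feat at a pseudo-random index, so spatially
    adjacent weights receive unrelated hash codes."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, q_feat.size, size=shape)   # random hash locations
    sign = rng.choice([-1.0, 1.0], size=shape)       # sign hash, as in HashedNets
    return sign * q_feat[idx]

def spatial_dct_hash_kernel(q_feat, shape):
    """Spatially structured alternative (a hypothetical simplification of
    Spatial-DCTHash): each (x, y) offset in the k x k window is indexed by
    its 2-D DCT frequency rank, so the kernel inherits a fixed
    low-to-high-frequency spatial layout instead of a random one."""
    c_out, c_in, k, _ = shape
    u, v = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
    order = np.argsort((u + v).ravel(), kind="stable")   # low frequencies first
    rank = np.empty_like(order)
    rank[order] = np.arange(k * k)                       # frequency rank per position
    freq_rank = rank.reshape(k, k)
    # Offset each (out, in) channel pair into a different stretch of q_feat
    # so the channels do not all reuse the same k*k candidate weights.
    chan = np.arange(c_out)[:, None] * c_in + np.arange(c_in)[None, :]
    idx = (chan[:, :, None, None] * k * k + freq_rank[None, None]) % q_feat.size
    return q_feat[idx]

# Toy usage: a 512-d question embedding predicts an 8x4x3x3 kernel bank,
# e.g. for convolving a conv7-style spatial feature map.
q_feat = np.random.default_rng(1).standard_normal(512)
shape = (8, 4, 3, 3)
print(random_hash_kernel(q_feat, shape).shape)       # (8, 4, 3, 3)
print(spatial_dct_hash_kernel(q_feat, shape).shape)  # (8, 4, 3, 3)
```

The only design point this sketch makes is that the index assignment in the second function is deterministic per spatial offset, so weights at the same position across channels share frequency-ordered slots rather than arbitrary ones; the paper's actual Spatial-DCTHash assignment rule and its conv7-based prediction pipeline are not reproduced here.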
National Natural Science Foundation of China (Grant Nos. 61365002, 61462045, 61462042, 61662030)
Science and Technology Project of the Jiangxi Provincial Department of Education (Grant No. GJJ150350)
[1] Yates A, Cafarella M, Banko M, et al. TextRunner: open information extraction on the web. In: Proceedings of Human Language Technologies: the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Rochester, 2007. 25--26
[2] Malinowski M, Fritz M. A multi-world approach to question answering about real-world scenes based on uncertain input. In: Proceedings of the Annual Conference on Neural Information Processing Systems, Quebec, 2014. 1682--1690
[3] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the Annual Conference on Neural Information Processing Systems, California, 2012. 1097--1105
[4] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations, San Diego, 2015. 1--14
[5] Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 1--9
[6] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 770--778
[7] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735--1780
[8] Chung J Y, Gulcehre C, Cho K, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. In: Proceedings of the Annual Conference on Neural Information Processing Systems Deep Learning Workshop, Quebec, 2014. 1--9
[9] Malinowski M, Rohrbach M, Fritz M. Ask your neurons: a neural-based approach to answering questions about images. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1--9
[10] Donahue J, Hendricks L A, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 2625--2634
[11] Shih K J, Singh S, Hoiem D. Where to look: focus regions for visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 4613--4621
[12] Jiang A, Wang F, Porikli F, et al. Compositional memory for visual question answering. arXiv preprint, 2015
[13] Xu K, Ba J L, Kiros R, et al. Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the International Conference on Machine Learning, Lille, 2015. 2048--2057
[14] Noh H, Seo P H, Han B. Image question answering using convolutional neural network with dynamic parameter prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 30--38
[15] Chen W L, Wilson J T, Tyree S, et al. Compressing neural networks with the hashing trick. In: Proceedings of the International Conference on Machine Learning, Lille, 2015. 2285--2294
[16] Chen W L, Wilson J T, Tyree S, et al. Compressing convolutional neural networks. arXiv preprint, 2015
[17] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations, Arizona, 2013
[18] Kiros R, Zhu Y K, Salakhutdinov R, et al. Skip-thought vectors. In: Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, 2015. 3294--3302
[19] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 3431--3440
[20] Ren M, Kiros R, Zemel R. Exploring models and data for image question answering. In: Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, 2015. 2953--2961
[21] Antol S, Agrawal A, Lu J, et al. VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 2425--2433
[22] Wu Z B, Palmer M. Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, Las Cruces, 1994. 133--138
[23] Ma L, Lu Z D, Li H. Learning to answer questions from image using convolutional neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, 2016. 3567--3573
Figure 1
(Color online) Illustrative samples of visual question answering
Figure 2
Deep information extraction networks for the question and the image, respectively. (a) RNN for the question; (b) VGGNet for the image
Figure 3
(Color online) Illustration of the generation of the hashed weight matrix
Figure 4
(Color online) Spatial-DCTHash-based dynamic parameter prediction network
Figure 5
(Color online) Convolutional network for the “conv7” feature
Figure 6
(Color online) Flowchart of the generation of the spatial-DCTHash-based convolution kernel
Figure 7
(Color online) Typical examples of VQA prediction results
Table 1
Results on the COCO-QA dataset (%)
Method | WUPS@0.9 | WUPS@0.0 | Acc
IMG+BOW | 66.78 | 88.99 | 55.92
2VIS+BLSTM | 65.34 | 88.64 | 55.09
Ensemble | 67.90 | 89.52 | 57.84
ConvQA | 65.36 | 88.58 | 54.95
DPPnet [CNN-FIXED] | 69.61 | 90.38 | 59.52
Spatial-DCTHash [CNN-FIXED] | | |
Table 2
Results on the MSCOCO-VQA validation set (%)
Method | Yes/no | Number | Other | All
DPPnet [CNN-FIXED] | 81.05 | 33.49 | 41.04 | 55.07
Spatial-DCTHash [CNN-FIXED] | 80.76 | 33.85 | 41.46 | 55.21
Table 3
Results on the MSCOCO-VQA test-dev set (%)
Method | Yes/no | Number | Other | All
Question | 75.66 | 36.70 | 27.14 | 48.09
Image | 64.01 | 0.42 | 3.77 | 28.13
Q+I | 75.55 | 33.67 | 37.37 | 52.64
LSTM Q | 78.20 | 35.68 | 26.59 | 48.76
LSTM Q+I | 78.94 | 35.24 | 36.42 | 53.74
DPPnet [CNN-FIXED] | 80.48 | 37.20 | 40.90 | 56.74
DPPnet | 80.71 | 37.24 | 41.69 | 57.22
Spatial-DCTHash [CNN-FIXED] | 80.54 | 36.81 | |
Table 4
Results on the MSCOCO-VQA test-standard set (%)
Method | Yes/no | Number | Other | All
Human | 95.77 | 83.39 | 72.67 | 83.30
LSTM Q+I | – | – | – | 54.06
DPPnet | 80.28 | 36.92 | 42.24 | 57.36
Spatial-DCTHash [CNN-FIXED] | 80.20 | 35.29 | 42.94 | 57.50
Table 5
Accuracy for each question type on MSCOCO-VQA (%)
Question type | Spatial-DCTHash | DPPnet | Question type | Spatial-DCTHash | DPPnet
are there | 82.29 | 83.15 | what is the man | 50.37 | 49.5 |
what brand | 36.2 | 34.73 | which | 41.65 | 38.3 |
what room is | 82.22 | 84.11 | are these | 76.29 | 77.07 |
what color is | 51.58 | 48.73 | what are | 47.03 | 47.59 |
is | 79.96 | 80.41 | what is the | 36.77 | 36.67 |
are they | 78.15 | 78.05 | where are the | 29.82 | 29.57 |
what number is | 3.33 | 2.03 | is this a | 79.34 | 79.67 |
what sport is | 83.91 | 84.78 | can you | 76.8 | 76.37 |
are | 75.52 | 75.68 | what time | 19.32 | 20.43 |
is the | 76.21 | 76.41 | what are the | 37.37 | 38.31 |
what is the person | 51.04 | 49.14 | are there any | 74.9 | 73.73 |
how many | 39.74 | 39.59 | what color are the | 50.63 | 50.14 |
does this | 78.73 | 79.83 | why | 16.1 | 15.48 |
is there a | 89.27 | 89.34 | what is this | 51.35 | 51.02 |
is he | 80.08 | 80.59 | how many people are in | 35.61 | 34.54 |
what | 36.7 | 36.07 | do you | 81.09 | 82.72 |
does the | 78.32 | 79.6 | is this | 77.82 | 78.63 |
is the person | 76.59 | 75.89 | why is the | 16.92 | 19.19 |
where is the | 25.69 | 26.83 | what is the color of the | 62.51 | 61.41 |
what animal is | 62.25 | 60.95 | what is | 29.1 | 29.25 |
how | 23.07 | 22.75 | could | 90.21 | 90.53 |
what is the woman | 42.56 | 41.45 | is that a | 74.54 | 73.03 |
none of the above | 54 | 54.02 | what is in the | 35.46 | 33.78 |
who is | 24.38 | 25.61 | what does the | 22.2 | 20.73 |
is the woman | 78.86 | 77.64 | what kind of | 45.78 | 45.87 |
are the | 75.69 | 76.03 | is it | 81.47 | 83.27 |
how many people are | 39.92 | 38.55 | is the man | 79.61 | 79.15 |
what is on the | 34.16 | 33.51 | what is the name | 7.07 | 7.21 |
has | 78.39 | 79.45 | is there | 83.16 | 84.41 |
was | 82.82 | 82.67 | what color is the | 53.55 | 51.44 |
what type of | 44.71 | 45.56 | what color | 39.23 | 36.47 |
is this an | 80.18 | 80 | is this person | 75.25 | 75.21 |
do | 75.7 | 74.62 |