
SCIENCE CHINA Information Sciences, Volume 60, Issue 9: 092104(2017) https://doi.org/10.1007/s11432-015-0902-2

A novel cross-modal hashing algorithm based on multimodal deep learning

  • Received: Apr 29, 2016
  • Accepted: Aug 15, 2016
  • Published: Mar 21, 2017

Abstract

With the growing popularity of multimodal data on the Web, cross-modal retrieval over large-scale multimedia databases has become an important research topic. Hashing-based cross-modal retrieval methods assume that there is a latent space shared by the features of all modalities. To model the relationship among heterogeneous data, most existing methods embed the data into a joint abstraction space via linear projections. However, these approaches are sensitive to noise in the data and cannot exploit unlabeled data or multimodal data with missing values, both of which are common in real-world applications. To address these challenges, we propose a novel multimodal deep-learning-based hash (MDLH) algorithm. In particular, MDLH uses a deep neural network to encode heterogeneous features into a compact common representation and learns the hash functions on top of this common representation. The parameters of the whole model are fine-tuned in a supervised training stage. Experiments on two standard datasets show that the method achieves more effective results than competing methods in cross-modal retrieval.
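The pipeline described above — modality-specific deep encoders mapping heterogeneous features into a shared space, with binary hash codes taken from that space and retrieval done by Hamming distance — can be illustrated with a minimal sketch. This is not the authors' implementation: the network sizes, random features, and tanh/sign binarization are illustrative assumptions, and the supervised fine-tuning stage is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(sizes, rng):
    """Random weights for a small MLP; stands in for a trained encoder."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def encode(x, weights):
    """Forward pass into the common representation (tanh keeps outputs in [-1, 1])."""
    h = x
    for W, b in weights:
        h = np.tanh(h @ W + b)
    return h

code_bits = 16
img_net = make_mlp([128, 64, code_bits], rng)  # image branch (e.g. visual features)
txt_net = make_mlp([50, 64, code_bits], rng)   # text branch (e.g. topic features)

img_feat = rng.normal(size=(4, 128))  # 4 database images
txt_feat = rng.normal(size=(4, 50))   # 4 text queries

# Hash codes: sign of the shared-space embedding, one code per item.
img_codes = np.sign(encode(img_feat, img_net))
txt_codes = np.sign(encode(txt_feat, txt_net))

# Cross-modal retrieval: rank database images for each text query by
# Hamming distance, computed from the inner product of {-1, +1} codes.
hamming = (code_bits - txt_codes @ img_codes.T) / 2
ranking = np.argsort(hamming, axis=1)  # nearest images per query
```

In the full method the two encoders and the hashing layer would be trained jointly so that semantically similar image/text pairs receive nearby codes; here the weights are random purely to show the data flow.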


Funded by

Fundamental Research Funds for the Central Universities of China(N140404012)

National Natural Science Foundation of China(61402091)

National Natural Science Foundation of China(61370074)


Acknowledgments

This work was supported by National Natural Science Foundation of China (Grant Nos. 61402091, 61370074), and Fundamental Research Funds for the Central Universities of China (Grant No. N140404012).


