SCIENCE CHINA Information Sciences, Volume 60, Issue 1: 012201 (2017) https://doi.org/10.1007/s11432-016-0158-2

A framework for the fusion of visual and tactile modalities for improving robot perception

  • Received: Mar 8, 2016
  • Accepted: Jun 30, 2016
  • Published: Nov 22, 2016


Robots should ideally perceive objects through human-like multi-modal sensing such as vision, touch, smell, and hearing. However, each sensing modality represents features differently, and the feature extraction methods also differ between modalities. Some features, such as visual features, capture spatial properties and are static, whereas others, such as tactile features, capture temporal patterns and are dynamic. This makes it difficult to fuse the data at the feature level for robot perception. In this study, we propose a framework for fusing visual and tactile features. It comprises feature extraction, feature vector normalization and generation based on bag-of-system (BoS), coding by robust multi-modal joint sparse representation (RM-JSR), and classification, and thereby addresses the fusion of diverse modal data at the feature level. Finally, comparative experiments are carried out to demonstrate the performance of the framework.
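To make the coding and classification stage more concrete, the following is a minimal sketch of plain joint sparse representation classification over two modalities. It is not the RM-JSR method of this paper: it omits the robustness term and the BoS feature generation, and it assumes each modality has already been reduced to a fixed-length feature vector. The function names, the regularization weight `lam`, and the proximal-gradient solver are illustrative assumptions.

```python
import numpy as np

def row_soft_threshold(X, tau):
    """Proximal operator of tau * (sum of row-wise l2 norms)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return X * scale

def joint_sparse_codes(dicts, feats, lam=0.05, n_iter=500):
    """Solve min_X sum_m 0.5*||y_m - D_m x_m||^2 + lam * sum_i ||X[i, :]||_2.

    dicts[m] has shape (d_m, n_atoms); column i of every dictionary comes
    from the same training sample, so a shared row support couples the
    modalities. Solved with proximal gradient descent (an assumed solver).
    """
    n_atoms = dicts[0].shape[1]
    X = np.zeros((n_atoms, len(dicts)))
    # Step size from the largest Lipschitz constant among the modalities.
    step = 1.0 / max(np.linalg.norm(D, 2) ** 2 for D in dicts)
    for _ in range(n_iter):
        grad = np.column_stack(
            [D.T @ (D @ X[:, m] - y) for m, (D, y) in enumerate(zip(dicts, feats))]
        )
        X = row_soft_threshold(X - step * grad, step * lam)
    return X

def classify(dicts, feats, labels, lam=0.05):
    """Return the class whose atoms give the smallest total residual."""
    labels = np.asarray(labels)
    X = joint_sparse_codes(dicts, feats, lam)
    best_class, best_residual = None, np.inf
    for c in np.unique(labels):
        mask = labels == c
        residual = sum(
            np.linalg.norm(y - D[:, mask] @ X[mask, m]) ** 2
            for m, (D, y) in enumerate(zip(dicts, feats))
        )
        if residual < best_residual:
            best_class, best_residual = c, residual
    return best_class
```

Here `dicts` would hold one dictionary of training features per modality (for example, visual and tactile), `feats` the corresponding features of the test object, and `labels` the class of each training sample. The row-wise penalty encourages both modalities to select the same training samples, which is what fuses them at the feature level in this simplified setting.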


This work was supported by the National Natural Science Foundation of China (Grant Nos. 613278050, 61210013, 91420302, 91520201) and the Academy of Military Medical Sciences (AMMS) Innovation Foundation (Grant No. 2015CXJJ020).


