
SCIENTIA SINICA Informationis, Volume 48, Issue 4: 433-448 (2018) https://doi.org/10.1360/N112017-00211

Intelligence methods of multi-modal information fusion in human-computer interaction

More info
  • Received: Oct 30, 2017
  • Accepted: Mar 1, 2018
  • Published: Apr 13, 2018

Abstract

We first introduce the concepts of single-modal information processing and multi-modal information fusion in cognitive science, and review several classical multi-modal information fusion models together with their computer implementations. When the information from each channel is available and its features can be represented in a unified, synchronized form, multi-modal information fusion can be cast as a classification or regression problem. In practical human-computer interaction systems, the performance of multi-modal fusion depends largely on the accuracy of single-modal recognition and on the design of the interactive system. We present a practical example of a multi-modal information fusion system and discuss its performance in human-computer interaction. Finally, we discuss promising development trends for multi-modal human-computer interaction techniques and systems.
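To make the "fusion as classification" view concrete, the sketch below shows early (feature-level) fusion in Python: per-modality feature vectors for synchronized samples are concatenated into one representation and handed to an ordinary classifier. This is a minimal illustration under stated assumptions, not the paper's system; the arrays, dimensions, and labels are hypothetical placeholders standing in for real single-modal features.

```python
# Minimal sketch of early (feature-level) multi-modal fusion cast as a
# classification problem. Assumes each modality has already been reduced
# to a fixed-length feature vector per synchronized sample; all data
# below is randomly generated placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200                                   # synchronized samples
audio_feats = rng.normal(size=(n, 13))    # e.g. MFCC-like audio features
video_feats = rng.normal(size=(n, 32))    # e.g. facial-expression features
labels = rng.integers(0, 2, size=n)       # e.g. binary user-intent label

# Early fusion: concatenate the modality features into one vector,
# then treat fusion as an ordinary supervised classification task.
fused = np.concatenate([audio_feats, video_feats], axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(fused, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("fused-feature accuracy:", clf.score(X_te, y_te))
```

A late-fusion variant would instead train one classifier per modality and combine their output scores, which trades richer cross-modal feature interactions for robustness when one channel is missing or noisy.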


Funded by

National Key R&D Program of China (2017YFB1002804)

National Natural Science Foundation of China (61332017, 61425017)



  • Table 1   Accuracy of users' dialogue intention a)

                                   Correct ASR    Correct ASR and intentions    Inaccurate ASR with correct intentions
    Number of correct feedback     511            443                           62
    Accuracy                       0.819          0.867                         0.549

    a)
