SCIENTIA SINICA Informationis, Volume 46, Issue 8: 969-981 (2016) https://doi.org/10.1360/N112016-00072

Towards real world perception and interaction

More info
  • Received: Mar 30, 2016
  • Accepted: May 30, 2016


Perception and interaction are the most essential parts of an intelligent machine. They are crucial, and arguably the only, channels through which such a machine can learn from the real world. Over the past two decades, closed-world research on perception and interaction has made significant progress. With the current rapid development of service robots and unmanned vehicles, perception and interaction now face challenges posed by the real world. This paper briefly reviews the history of computer perception and interaction and lists eight problems in real-world perception and interaction that, if solved, will elevate the perception and interaction capabilities of intelligent machines from specialist level to human level in the real world.
