
SCIENCE CHINA Information Sciences, Volume 63, Issue 1: 112101 (2020). https://doi.org/10.1007/s11432-019-1517-3

A flexible technique to select objects via convolutional neural network in VR space

  • Received: Apr 8, 2019
  • Accepted: Aug 1, 2019
  • Published: Dec 23, 2019

Abstract

Most studies on selection techniques for projection-based VR systems require users to wear complex or expensive input devices, and more convenient selection techniques are lacking. In this paper, we propose a flexible 3D selection technique for a large-display, projection-based virtual environment. We present a body tracking method that uses a convolutional neural network (CNN) to estimate the 3D skeletons of multiple users, and propose a region-based selection method that selects virtual objects effectively using only the users' tracked fingertips. Additionally, a multi-user merge method is introduced to realign users' actions and perception when multiple users observe a single stereoscopic display. Compared with state-of-the-art CNN-based pose estimation methods, the proposed body tracking method achieves considerable estimation accuracy while guaranteeing real-time performance. We also evaluate our selection technique against three prevalent selection techniques and test its performance in a multi-user scenario. The results show that our technique significantly increases efficiency and effectiveness, and provides comparable stability to support multi-user interaction.
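As a rough illustration of the region-based selection geometry also sketched in Figure 3, the following is a minimal sketch, not the paper's implementation: it assumes a flat display plane and tracked 3D positions of the user's eye and fingertip in one coordinate frame, and the names project_through_eye, is_selected, and the rectangular object region are illustrative assumptions. The fingertip is projected from the eye onto the display plane, and an object is treated as selected when this "shadow" point falls inside the object's projected region.

```python
import numpy as np

def project_through_eye(eye, fingertip, plane_point, plane_normal):
    """Intersect the ray from the eye through the fingertip with the display plane.

    All arguments are 3D numpy arrays in the same tracking coordinate frame.
    Returns the fingertip's "shadow" point on the plane, or None if the ray
    does not hit the plane in front of the eye.
    """
    direction = fingertip - eye
    denom = np.dot(plane_normal, direction)
    if abs(denom) < 1e-9:           # ray parallel to the display plane
        return None
    t = np.dot(plane_normal, plane_point - eye) / denom
    if t <= 0:                      # plane lies behind the eye
        return None
    return eye + t * direction

def is_selected(eye, fingertip, object_region, plane_point, plane_normal):
    """Region-based test: the object counts as selected when the fingertip's
    projected point falls inside the object's projected region, given here as
    an axis-aligned rectangle (xmin, ymin, xmax, ymax) on a plane z = const."""
    shadow = project_through_eye(eye, fingertip, plane_point, plane_normal)
    if shadow is None:
        return False
    xmin, ymin, xmax, ymax = object_region
    return xmin <= shadow[0] <= xmax and ymin <= shadow[1] <= ymax

# Example: display plane z = 0, eye 2 m in front of it, fingertip between them.
eye = np.array([0.0, 1.6, 2.0])
fingertip = np.array([0.1, 1.4, 1.5])
print(is_selected(eye, fingertip, (0.0, 0.5, 0.8, 1.3),
                  plane_point=np.array([0.0, 0.0, 0.0]),
                  plane_normal=np.array([0.0, 0.0, 1.0])))   # -> True
```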


References

[1] Cruz-Neira C, Sandin D J, DeFanti T A. The CAVE: audio visual experience automatic virtual environment. Commun ACM, 1992, 35: 64-72

[2] Rademacher P, Bishop G. Multiple-center-of-projection images. In: Proceedings of ACM SIGGRAPH, Orlando, 1998. 199--206

[3] Simon A, Smith R C, Pawlicki R R. Omnistereo for panoramic virtual environment display systems. In: Proceedings of Virtual Reality, Chicago, 2004. 67--279

[4] van de Pol R, Ribarsky W, Hodges L, et al. Interaction techniques on the virtual workbench. In: Proceedings of Eurographics Virtual Environments, Vienna, 1999. 157--168

[5] Banerjee A, Burstyn J, Girouard A. MultiPoint: comparing laser and manual pointing as remote input in large display interactions. Int J Human-Comput Studies, 2012, 70: 690-702

[6] Myers B A, Bhatnagar R, Nichols J, et al. Interacting at a distance: measuring the performance of laser pointers and other devices. In: Proceedings of SIGCHI Conference on Human Factors in Computing Systems, Minneapolis, 2002. 33--40

[7] Polacek O, Klima M, Sporka A J, et al. A comparative study on distant free-hand pointing. In: Proceedings of European Conference on Interactive TV and Video, Berlin, 2012. 139--142

[8] Nancel M, Wagner J, Pietriga E, et al. Mid-air pan-and-zoom on wall-sized displays. In: Proceedings of SIGCHI Conference on Human Factors in Computing Systems, Vancouver, 2011. 177--186

[9] Brown M A, Stuerzlinger W. Exploring the throughput potential of in-air pointing. In: Proceedings of International Conference on Human-Computer Interaction, Toronto, 2016. 13--24

[10] Ortega M, Nigay L. AirMouse: finger gesture for 2D and 3D interaction. In: Proceedings of IFIP TC 13 International Conference on Human-Computer Interaction, Uppsala, 2009. 214--227

[11] Vogel D, Balakrishnan R. Distant freehand pointing and clicking on very large, high resolution displays. In: Proceedings of ACM Symposium on User Interface Software and Technology, Seattle, 2005. 33--42

[12] Kim K, Choi H. Depth-based real-time hand tracking with occlusion handling using Kalman filter and DAM-shift. In: Proceedings of Asian Conference on Computer Vision, Singapore, 2014. 218--226

[13] Zohra F T, Rahman M W, Gavrilova M. Occlusion detection and localization from Kinect depth images. In: Proceedings of International Conference on Cyberworlds, Chongqing, 2016. 189--196

[14] Wu C J, Quigley A, Harris-Birtill D. Out of sight: a toolkit for tracking occluded human joint positions. Pers Ubiquit Comput, 2017, 21: 125-135

[15] Wei S E, Ramakrishna V, Kanade T, et al. Convolutional pose machines. In: Proceedings of Computer Vision and Pattern Recognition, Las Vegas, 2016. 4724--4732

[16] Cao Z, Simon T, Wei S E, et al. Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of Computer Vision and Pattern Recognition, 2017. 7291--7299

[17] Insafutdinov E, Pishchulin L, Andres B, et al. DeeperCut: a deeper, stronger, and faster multi-person pose estimation model. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 34--50

[18] Iqbal U, Gall J. Multi-person pose estimation with local joint-to-person associations. In: Proceedings of European Conference on Computer Vision Workshops, Crowd Understanding, 2016. 627--642

[19] Fang H S, Xie S Q, Tai Y W, et al. RMPE: regional multi-person pose estimation. In: Proceedings of International Conference on Computer Vision, 2017. 2334--2343

[20] Bolas M, McDowall I, Corr D. New research and explorations into multiuser immersive display systems. IEEE Comput Grap Appl, 2004, 24: 18-21

[21] Simon A. Usability of multiviewpoint images for spatial interaction in projection-based display systems. IEEE Trans Visual Comput Graphics, 2007, 13: 26-33

[22] Matulic F, Vogel D. Multiray: multi-finger raycasting for large displays. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Montreal, 2018. 1--13

[23] Ramanan D, Forsyth D A, Zisserman A. Strike a pose: tracking people by finding stylized poses. In: Proceedings of Computer Vision and Pattern Recognition, Washington, 2005. 271--278

[24] Jain A. Articulated people detection and pose estimation: reshaping the future. In: Proceedings of Computer Vision and Pattern Recognition, Washington, 2012. 3178--3185

[25] Pishchulin L, Insafutdinov E, Tang S Y, et al. DeepCut: joint subset partition and labeling for multi person pose estimation. In: Proceedings of Computer Vision and Pattern Recognition, 2016. 4929--4937

[26] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of Computer Vision and Pattern Recognition, Las Vegas, 2016. 770--778

[27] Liang J, Green M. JDCAD: a highly interactive 3D modeling system. Comput Graphics, 1994, 18: 499-506

[28] de Haan G, Koutek M, Post F H. IntenSelect: using dynamic object rating for assisting 3D object selection. In: Proceedings of Eurographics Conference on Virtual Environments, Aalborg, 2005. 201--209

[29] Steed A, Parker C. 3D selection strategies for head tracked and non-head tracked operation of spatially immersive displays. In: Proceedings of the 8th International Immersive Projection Technology Workshop, 2004. 13--14

[30] Grossman T, Balakrishnan R. The bubble cursor: enhancing target acquisition by dynamic resizing of the cursor's activation area. In: Proceedings of Conference on Human Factors in Computing Systems, Portland, 2005. 281--290

[31] Vanacken L, Grossman T, Coninx K. Exploring the effects of environment density and target visibility on object selection in 3D virtual environments. In: Proceedings of 3D User Interfaces, Charlotte, 2007. 115--122

[32] Frees S, Kessler G D, Kay E. PRISM interaction for enhancing control in immersive virtual environments. ACM Trans Comput-Hum Interact, 2007, 14: 2-es

[33] Kopper R, Bowman D A, Silva M G. A human motor behavior model for distal pointing tasks. Int J Human-Comput Studies, 2010, 68: 603-615

[34] Forlines C, Balakrishnan R, Beardsley P, et al. Zoom-and-pick: facilitating visual zooming and precision pointing with interactive handheld projectors. In: Proceedings of ACM Symposium on User Interface Software and Technology, Seattle, 2005. 73--82

[35] Kopper R, Bacim F, Bowman D A. Rapid and accurate 3D selection by progressive refinement. In: Proceedings of 3D User Interfaces, Washington, 2011. 67--74

[36] Shen Y J, Hao Z H, Wang P F, et al. A novel human detection approach based on depth map via Kinect. In: Proceedings of Computer Vision and Pattern Recognition Workshops, Portland, 2013. 535--541

[37] Kuang H, Cai S Q, Ma X L, et al. An effective skeleton extraction method based on Kinect depth image. In: Proceedings of International Conference on Measuring Technology and Mechatronics Automation, Changsha, 2018. 187--190

[38] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of International Conference on Learning Representations, San Diego, 2015. 1--14

[39] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, 2012. 1097--1105

[40] Rosenberg L B. The effect of interocular distance upon operator performance using stereoscopic displays to perform virtual depth tasks. In: Proceedings of Virtual Reality International Symposium, Washington, 1993. 27--32

[41] Andriluka M, Pishchulin L, Gehler P, et al. 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of Computer Vision and Pattern Recognition, Washington, 2014. 3686--3693

[42] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: common objects in context. In: Proceedings of European Conference on Computer Vision, Zurich, 2014. 740--755

[43] Argelaguet F, Andujar C. A survey of 3D object selection techniques for virtual environments. Comput Graphics, 2013, 37: 121-136

[44] Kulik A, Kunert A, Beck S. C1x6. ACM Trans Graph, 2011, 30: 1

  • Figure 1

    (Color online) Illustration of our technique (once an object is selected, its contour is marked): (a) the user points at the desired object with a fingertip; (b) the object is selected when the fingertip occludes part of the object.

  • Figure 2

    (Color online) Architecture of the two-stage body tracking method. Using a CNN with multiple sub-stages and two branches, the method estimates the full-body pose of each user in the depth image (a minimal code sketch of such a two-branch network is given after the figure list).

  • Figure 3

    (Color online) The process of projecting a fingertip region onto the projection plane. The projected region of the fingertip can be considered the "shadow" of the fingertip cast from the eye position.

  • Figure 4

    (Color online) The progressive refinement in selection. (a) Several selectable objects are detected; (b) the objects are zoomed in for convenience of selection.

  • Figure 5

    (Color online) Multi-user interaction in the virtual environment is shown in (a). The individual views of the two users are shown in (b) and (c), respectively.

  • Figure 6

    (Color online) Implementation of Ray Cast, Cone Cast and SQUAD in the virtual environment. (a) Ray Cast technique; (b) Cone Cast technique; (c) SQUAD and the Quad-menu.

  • Figure 7

    (Color online) The four different scenarios used in study one. (a) The scene of Scenario 1; (b) the scene of Scenario 2; (c) the scene of Scenario 3; (d) the scene of Scenario 4.

  • Figure 8

    (Color online) The mean completion time (a) and error number (b) for the four techniques in each scenario.

  • Figure 9

    (Color online) Illustration of multi-user collaboration (a) in the virtual assembly application. (b) The mean number of incorrect joint connections of each method for each scenario.

  • Figure 10

    (Color online) Comparison between the trajectories of the user's head tracked by our body tracking method and by Kinect SDK. (a) The values of $X$ coordinate; (b) the values of $Y$ coordinate; (c) the values of $Z$ coordinate.

  • Figure 11

    (Color online) Our method succeeds under mutual occlusion, as in (b) and (d), while the estimates of the Kinect SDK are erroneous, as in (a) and (c).
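To make the two-branch, multi-sub-stage architecture of Figure 2 more concrete, here is a minimal PyTorch sketch; the class names (SubStage, TwoBranchPoseNet), layer sizes, joint count, and number of association channels are illustrative assumptions and do not reproduce the paper's exact network. Each sub-stage consumes the backbone features concatenated with the previous stage's outputs and refines joint confidence maps (one branch) and limb association maps (the other branch).

```python
import torch
import torch.nn as nn

class SubStage(nn.Module):
    """One refinement sub-stage: predicts refined joint heatmaps and limb
    association maps from its input feature tensor."""
    def __init__(self, in_channels, n_joints, n_assoc):
        super().__init__()
        trunk, ch = [], in_channels
        for _ in range(3):                      # a few wide convolutions per sub-stage
            trunk += [nn.Conv2d(ch, 128, kernel_size=7, padding=3), nn.ReLU(inplace=True)]
            ch = 128
        self.trunk = nn.Sequential(*trunk)
        self.heatmap_head = nn.Conv2d(128, n_joints, kernel_size=1)  # branch 1: joint confidence maps
        self.assoc_head = nn.Conv2d(128, n_assoc, kernel_size=1)     # branch 2: limb association maps

    def forward(self, x):
        f = self.trunk(x)
        return self.heatmap_head(f), self.assoc_head(f)

class TwoBranchPoseNet(nn.Module):
    """Backbone plus several two-branch refinement sub-stages over a depth image."""
    def __init__(self, n_joints=15, n_assoc=28, n_stages=3, feat_channels=128):
        super().__init__()
        self.backbone = nn.Sequential(           # shallow feature extractor (illustrative)
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        stages = [SubStage(feat_channels, n_joints, n_assoc)]
        for _ in range(n_stages - 1):
            stages.append(SubStage(feat_channels + n_joints + n_assoc, n_joints, n_assoc))
        self.stages = nn.ModuleList(stages)

    def forward(self, depth_image):
        feat = self.backbone(depth_image)
        heatmaps, assoc = self.stages[0](feat)
        outputs = [(heatmaps, assoc)]
        for stage in self.stages[1:]:
            heatmaps, assoc = stage(torch.cat([feat, heatmaps, assoc], dim=1))
            outputs.append((heatmaps, assoc))
        return outputs                            # per-stage outputs for intermediate supervision
```

Returning the per-stage outputs allows intermediate supervision during training, which is the usual motivation for stacking refinement sub-stages in this family of pose networks.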

  • Table 1   Comparison of mAP (%) and time (s/frame) on the full testing set of the MPII multi-person dataset$^{\rm a)}$
    Method Head Shoulder Elbow Wrist Hip Knee Ankle Total Time
    DeeperCut [17] 78.4 72.5 60.2 51.0 57.2 52.0 45.4 59.5 485
    Iqbal et al. [18] 58.4 53.9 44.5 35.0 42.2 36.7 31.1 43.1 10
    CMU-Pose [16] 91.2 87.6 77.7 66.8 75.4 68.9 61.7 75.6 1.24
    RMPE [19] 88.4 86.5 78.6 70.4 74.4 73.0 65.8 76.7 1.5
    Ours 91.7 87.9 78.3 68.7 75.2 74.1 64.3 77.2 1.05

    a)

  • Table 2   Comparison on the testing subset test-dev of the COCO dataset$^{\rm a)b)}$ (the AP columns are explained in the note after the tables)
    Method AP (%) AP$^{50}$ (%) AP$^{75}$ (%) AP$^{\rm M}$ (%) AP$^{\rm L}$ (%) Time (s/frame)
    CMU-Pose [16] 61.8 84.9 67.5 57.1 68.2 0.1
    RMPE [19] 61.8 83.7 69.8 58.6 67.6 2.5
    Ours 63.3 85.3 68.9 57.8 68.8 0.08
  • Table 3   Comparison of mAP (%) and time (s/frame) on the full testing set of the MPII multi-person dataset$^{\rm a)}$
    Method Head Shoulder Elbow Wrist Hip Knee Ankle Finger Total Time
    CMU-Pose [16] 90.3 86.5 74.7 64.2 74.3 70.2 62.3 77.4 74.9 0.56
    Ours 89.8 87.4 76.2 65.7 73.8 73.1 63.5 79.7 76.1 0.13

    a)
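For readers unfamiliar with the column headers in Table 2: results on COCO test-dev are conventionally reported with the COCO keypoint metrics, where AP is averaged over object keypoint similarity (OKS) thresholds from 0.50 to 0.95 in steps of 0.05, AP$^{50}$ and AP$^{75}$ are the values at the single thresholds 0.50 and 0.75, and AP$^{\rm M}$ and AP$^{\rm L}$ restrict the evaluation to medium-sized and large persons. The OKS between a predicted and a ground-truth pose is

$$\mathrm{OKS}=\frac{\sum_i \exp\bigl(-d_i^2/(2s^2k_i^2)\bigr)\,\delta(v_i>0)}{\sum_i \delta(v_i>0)},$$

where $d_i$ is the distance between the $i$-th predicted and ground-truth keypoints, $v_i$ is the ground-truth visibility flag, $s$ is the person scale, and $k_i$ is a per-keypoint falloff constant.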

Copyright 2020 CHINA SCIENCE PUBLISHING & MEDIA LTD. All rights reserved.