
SCIENTIA SINICA Informationis, Volume 46, Issue 7: 811-818(2016) https://doi.org/10.1360/N112015-00285

The inherent ambiguity in scene depth learning from single images

Lei HE1, Qiulei DONG1,2,3,*, Zhanyi HU1,2,3
  • Received: Jan 6, 2016
  • Accepted: Mar 22, 2016

Abstract

Scene depth inference from a single image is currently an important problem in machine learning; its underlying rationale is that humans can perceive depth from single images. However, human single-image depth perception is the product of long evolution, and its underlying brain mechanism remains far from well understood, so it seems imprudent to compute depth based merely on this fact. In fact, the 3D-to-2D imaging process must satisfy strict projective geometric constraints, and without prior knowledge of the camera's intrinsic parameters, an ambiguity exists between the scene depth and the camera's focal length. We argue that because the camera's intrinsics are not accounted for, the single-image depth learning approaches currently reported in the literature invariably suffer from a crucial theoretical deficiency. The ambiguity between depth and focal length is also verified on real images. We believe that, in order to improve the accuracy of learning scene depth from single images, the camera's intrinsics should be taken into account; at the very least, the focal length should be used as an additional input in both the learning and inference phases.
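The depth/focal-length ambiguity stated in the abstract can be illustrated with a minimal numerical sketch of the pinhole camera model, under which a 3D point (X, Y, Z) projects to the pixel (u, v) = (f·X/Z + cx, f·Y/Z + cy). Scaling both the depth Z and the focal length f by the same factor k leaves the pixel unchanged, so a single image alone cannot separate the two. The specific values (f, cx, cy, k) below are arbitrary and purely illustrative, not taken from the paper:

```python
def project(point, f, cx=320.0, cy=240.0):
    """Pinhole projection of a 3D point (X, Y, Z) to a pixel (u, v)."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)

p = (1.0, 0.5, 4.0)   # a 3D point at depth Z = 4
k = 2.5               # an arbitrary common scale factor

u1 = project(p, f=500.0)                           # original focal length
u2 = project((p[0], p[1], k * p[2]), f=k * 500.0)  # depth and focal both scaled by k

print(u1, u2)  # identical pixels: depth and focal length are entangled
```

Since u1 == u2, an image-only learner observing this pixel cannot decide whether the scene point lies at depth Z under focal length f, or at depth kZ under focal length kf, which is exactly why the abstract argues the focal length should be supplied as an input.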


Funded by

Strategic Priority Research Program of the Chinese Academy of Sciences (XDB02070002)

National Natural Science Foundation of China (61333015)

National Natural Science Foundation of China (61421004)

Natural Science Foundation of Beijing (7142152)


