SCIENCE CHINA Information Sciences, Volume 61, Issue 2: 023101(2018) https://doi.org/10.1007/s11432-017-9234-7

## Learning stratified 3D reconstruction

• Accepted Aug 8, 2017
• Published Dec 26, 2017

### Abstract

Stratified 3D reconstruction, a layer-by-layer reconstruction that is upgraded from projective to affine and then to the final metric reconstruction, is a well-known 3D reconstruction method in computer vision. It is also a key supporting technology for various well-known applications, such as streetview, smart3D, and oblique photogrammetry. Generally speaking, the existing computer vision methods in the literature can be roughly classified into geometry-based approaches for spatial vision and learning-based approaches for object vision. Although deep learning has demonstrated tremendous success in object vision in recent years, learning 3D scene reconstruction from multiple images is still rare, if not nonexistent, apart from work on depth learning from single images. This study explores the feasibility of learning stratified 3D reconstruction from putative point correspondences across images, and assesses whether it can be as robust to matching outliers as the traditional geometry-based methods are. To this end, a special parsimonious neural network is designed for the learning. Our results show that it is indeed possible to learn a stratified 3D reconstruction from noisy image point correspondences. The learnt reconstruction results appear satisfactory, although they are still not on a par with the state of the art in the structure-from-motion community, largely because the network lacks an explicit robust outlier detector such as random sample consensus (RANSAC). To the best of our knowledge, our study is the first attempt in the literature to learn 3D scene reconstruction from multiple images. Our results also show that how to implicitly or explicitly integrate an outlier detector into learning methods is a key problem to solve in order to learn 3D scene structures comparable to those obtained by the current geometry-based state of the art. Otherwise, any significant advancement in learning 3D structures from multiple images seems difficult, if not impossible. Besides, we even speculate that deep learning might be, by nature, unsuitable for learning 3D structure from multiple images, or more generally, for solving spatial vision problems.

### Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61333015, 61375042, 61421004, 61573359, 61772444).

• Figure 1

Architecture of our proposed parsimonious network. The inputs are the 2D coordinates of point correspondences across $N$ images, and matching confidences are used for the weight computation of various projective reconstructions. The outputs are 3D stratified reconstructions, including projective reconstruction, affine reconstruction, metric reconstruction, and the final Euclidean reconstruction.

• Figure 2

(Color online) Implementation of the three local convolution layers in the proposed network. The inputs to the first layer are point correspondences of size $N\times2\times1$. The kernel sizes of the three layers are $3\times2$, $1\times1$, and $3\times1$, respectively. The channel numbers of the three layers are $64$, $64$, and $4$, respectively. The dimensions of outputs at the three layers are $(N-2)\times1\times64$, $(N-2)\times1\times64$, and $(N-4)\times1\times4$, respectively. Kernels presented in different colors have different weights.
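The output dimensions quoted in this caption are consistent with stride-1 convolutions without padding. The following minimal sketch reproduces that shape arithmetic. The stride and padding settings are assumptions inferred from the quoted sizes, and the function names are hypothetical. Because the layers are local (kernels at different positions have different weights), only the shapes are modeled here, not the weights.

```python
def valid_conv_shape(in_shape, kernel, channels):
    """Output (height, width, channels) of a stride-1, unpadded convolution."""
    height, width, _ = in_shape
    k_h, k_w = kernel
    return (height - k_h + 1, width - k_w + 1, channels)


def local_conv_shapes(n_images):
    """Shapes after each of the three layers for an N x 2 x 1 input."""
    shape = (n_images, 2, 1)
    # (kernel size, output channels) as listed in the caption.
    layers = [((3, 2), 64), ((1, 1), 64), ((3, 1), 4)]
    shapes = []
    for kernel, channels in layers:
        shape = valid_conv_shape(shape, kernel, channels)
        shapes.append(shape)
    return shapes


# N = 30 views gives (N-2, 1, 64), (N-2, 1, 64), (N-4, 1, 4).
print(local_conv_shapes(30))
```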

• Figure 3

(Color online) Architecture of PtoA, which upgrades the projective reconstruction to the affine reconstruction. The numbers in brackets are the number of neurons in each layer. The infinite plane vector is learned implicitly in the FC layer.
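For orientation, the classical geometric counterpart of this upgrade (as given in Hartley and Zisserman's *Multiple View Geometry in Computer Vision*) is a single 3D homography built from the plane at infinity. The symbols below are illustrative assumptions: $\boldsymbol{X}_P$ and $\boldsymbol{X}_A$ denote homogeneous points in the projective and affine reconstructions, and $\boldsymbol{\pi}_\infty$ is the plane at infinity expressed in the projective frame, scaled so that its last coordinate is 1. Whether the FC layer of PtoA uses exactly this parameterization is not stated in the caption.

$$
\boldsymbol{X}_A = \boldsymbol{H}_{PA}\,\boldsymbol{X}_P,
\qquad
\boldsymbol{H}_{PA} =
\begin{bmatrix}
\boldsymbol{I}_{3} & \boldsymbol{0} \\
\boldsymbol{p}^{\top} & 1
\end{bmatrix},
\qquad
\boldsymbol{\pi}_\infty =
\begin{bmatrix}
\boldsymbol{p} \\ 1
\end{bmatrix}.
$$

Since planes transform as $\boldsymbol{\pi}' = \boldsymbol{H}_{PA}^{-\top}\boldsymbol{\pi}$, this choice sends $\boldsymbol{\pi}_\infty$ to its canonical position $(0, 0, 0, 1)^{\top}$, which is what makes the upgraded reconstruction affine.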

• Figure 4

(Color online) Architecture of AtoM, which upgrades the affine reconstruction to the metric reconstruction. The numbers in brackets are the number of neurons in each layer. The intrinsic matrix $\boldsymbol{K}_1$ is learned implicitly in the FC layer.

• Figure 5

(Color online) Architecture of MtoE. The network transforms the metric reconstruction to the true Euclidean reconstruction. The numbers in brackets are the number of neurons in each layer. MtoE learns a similarity transformation that includes three operations: rotation, scaling, and translation.
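A similarity transformation of the kind named in this caption can be written as $\boldsymbol{X}_E = s\boldsymbol{R}\boldsymbol{X}_M + \boldsymbol{t}$. The sketch below applies one with NumPy. The function names, the choice of a rotation about the z-axis, and all numeric values are hypothetical and serve only to illustrate the three operations. It does not reproduce how MtoE learns them.

```python
import numpy as np


def rotation_z(theta):
    """3x3 rotation by angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def apply_similarity(points, scale, rotation, translation):
    """Map (M, 3) points X to scale * rotation @ X + translation, row by row."""
    return scale * points @ rotation.T + translation


# Hypothetical metric points and similarity parameters.
points_metric = np.array([[0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
R = rotation_z(np.pi / 2)
t = np.array([1.0, 2.0, 3.0])
points_euclid = apply_similarity(points_metric, 2.0, R, t)
```

A similarity preserves angles and multiplies every distance by the same scale, so the unit edge between the first two points has length 2 after the transformation.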

• Figure 6

(Color online) 3D points and camera locations in the simulation experiment. There are 30 cameras; their optical centers are evenly distributed along the circumference of a circle, and each optical axis is oriented towards the center of the cube.
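A camera layout like the one in this figure can be generated with a few lines of NumPy. The sketch below is a minimal illustration, not the authors' simulation code: the circle radius, camera height, focal length, principal point, and the convention that the image y-axis points down are all assumed values, and the function names are hypothetical.

```python
import numpy as np


def look_at_rotation(center, target, up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation whose z-axis points from center to target."""
    z = np.asarray(target, float) - np.asarray(center, float)
    z /= np.linalg.norm(z)
    x = np.cross(z, np.asarray(up, float))  # image x-axis, orthogonal to up
    x /= np.linalg.norm(x)
    y = np.cross(z, x)                      # image y-axis, pointing down
    return np.stack([x, y, z])


def circular_cameras(n_cameras=30, radius=5.0, height=1.0,
                     target=(0.0, 0.0, 0.0), focal=1000.0,
                     principal_point=(640.0, 480.0)):
    """Optical centers on a circle and 3x4 projections P = K [R | -R C]."""
    K = np.array([[focal, 0.0, principal_point[0]],
                  [0.0, focal, principal_point[1]],
                  [0.0, 0.0, 1.0]])
    angles = 2.0 * np.pi * np.arange(n_cameras) / n_cameras
    centers = np.stack([radius * np.cos(angles),
                        radius * np.sin(angles),
                        np.full(n_cameras, height)], axis=1)
    projections = []
    for c in centers:
        R = look_at_rotation(c, target)
        projections.append(K @ np.hstack([R, (-R @ c)[:, None]]))
    return centers, np.stack(projections)


centers, projections = circular_cameras()
```

Because every optical axis passes through the target, the target projects onto the principal point in each of the 30 views, which is a quick sanity check for the setup.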

• Figure 7

(Color online) Stratified 3D reconstruction results on the simulation test data. (a) Ground truth; (b) projective reconstruction; (c) affine reconstruction; (d) metric reconstruction; (e) true Euclidean reconstruction.

• Figure 8

(Color online) Images in the multi-view-stereo datasets. (a) Three images from Herz-Jesu-P8; (b) three images from Fountain-P11; (c) three images from Castle-P30.

• Figure 9

(Color online) Stratified 3D reconstruction results on the multi-view-stereo benchmark datasets. The results, from top to bottom, are the ground truth, projective reconstruction, affine reconstruction, metric reconstruction, and true Euclidean reconstruction. The scenes, from left to right, are Herz-Jesu-P8, Fountain-P11, and Castle-P30.

• Figure 10

(Color online) Reprojection errors of the Herz-Jesu-P8 3D reconstruction results. (a) Histogram of reprojection error occurrences: percentage of data with an error of $n$ pixels; (b) cumulative distribution of reprojection errors: percentage of pixels with an error smaller than $n$ pixels.

• Figure 11

(Color online) Reprojection errors of the Fountain-P11 3D reconstruction results. (a) and (b) Histograms of reprojection error occurrences: percentage of data with an error of $n$ pixels; (c) and (d) cumulative distributions of reprojection errors: percentage of pixels with an error smaller than $n$ pixels.

• Figure 12

(Color online) Reprojection errors of the Castle-P30 3D reconstruction results. (a) Histogram of reprojection error occurrences: percentage of data with an error of $n$ pixels; (b) cumulative distribution of reprojection errors: percentage of pixels with an error smaller than $n$ pixels.
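The two kinds of plots described in these reprojection-error captions can be computed as sketched below. This is a hedged illustration rather than the evaluation code behind the figures: the function names, the 1-pixel bin width, the cut-off `max_px`, and the toy camera and points are all assumptions.

```python
import numpy as np


def reprojection_errors(P, points_3d, observed_px):
    """Pixel distance between observations and projections of (M, 3) points by a 3x4 camera P."""
    homog = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    proj = homog @ P.T
    projected_px = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(projected_px - observed_px, axis=1)


def error_distributions(errors, max_px=5):
    """Percentage of errors in each 1-pixel bin, and the running (cumulative) percentage."""
    edges = np.arange(max_px + 1)
    counts, _ = np.histogram(errors, bins=edges)
    hist_pct = 100.0 * counts / len(errors)
    return hist_pct, np.cumsum(hist_pct)


# Toy example: canonical camera P = [I | 0] and three hypothetical points.
P = np.hstack([np.eye(3), np.zeros((3, 1))])
points_3d = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 2.0, 2.0]])
observed_px = np.array([[0.0, 0.5], [1.0, 0.0], [0.0, 4.0]])
errors = reprojection_errors(P, points_3d, observed_px)
hist_pct, cum_pct = error_distributions(errors)
```

Errors larger than `max_px` fall outside the histogram range, so in that case the cumulative curve stops short of 100%.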

• Table 1   Distribution parameters of uniform noise

| Uniform noise | Distribution of parameter |
| --- | --- |
| Feature location errors | $\pm$U(1,2) |
| Outliers | $\pm$U(100,500) |

• Table 2   RMSE of the true Euclidean reconstruction for the following three cases: noise-free data, data with feature location errors, and data with outliers

| Case | Proportion | RMSE (m) |
| --- | --- | --- |
| Noise-free data | - | 0.008 |
| Feature location errors | 30/30 | 0.028 |
| Outliers | 1/30 | 0.008 |
| Outliers | 2/30 | 0.009 |
| Outliers | 4/30 | 0.010 |
| Outliers | 6/30 | 0.012 |
| Outliers | 8/30 | 0.016 |
| Outliers | 10/30 | 0.019 |

• Table 3   Comparison of reprojection error (pixels) for three cases: noise-free data, data with feature location errors, and data with outliers

| Case | Proportion | Our model | OpenMVG |
| --- | --- | --- | --- |
| Noise-free data | - | 0.85 | 0.55 |
| Feature location errors | 30/30 | 1.88 | 1.64 |
| Outliers | 1/30 | 0.85 | 0.50 |
| Outliers | 2/30 | 0.86 | 0.55 |
| Outliers | 4/30 | 0.88 | 0.55 |
| Outliers | 6/30 | 0.93 | 0.57 |
| Outliers | 8/30 | 1.02 | 0.58 |
| Outliers | 10/30 | 1.12 | 0.59 |

Copyright 2020 Science China Press Co., Ltd. All rights reserved.