1State Key Lab of CAD & CG, Zhejiang University
2SenseTime Research
* denotes equal contribution
We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, we propose to directly reconstruct local surfaces, represented as sparse TSDF volumes, for each video fragment sequentially with a neural network. A learning-based TSDF fusion module based on gated recurrent units is used to guide the network to fuse features from previous fragments. This design allows the network to capture the local smoothness prior and global shape prior of 3D surfaces when sequentially reconstructing them, resulting in accurate, coherent, and real-time surface reconstruction. Experiments on the ScanNet and 7-Scenes datasets show that our system outperforms state-of-the-art methods in terms of both accuracy and speed. To the best of our knowledge, this is the first learning-based system that is able to reconstruct dense coherent 3D geometry in real time.
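To make the GRU-based fusion idea concrete, the following is a minimal sketch of a convolutional GRU cell that incrementally fuses each fragment's feature volume into a persistent hidden state defined in the world frame. It assumes dense feature volumes and standard 3D convolutions purely for self-containment; the released NeuralRecon code operates on sparse voxel volumes, and the names here (ConvGRUFusion, hidden, fragment_features) are illustrative, not the actual implementation.

```python
# A minimal, hedged sketch of GRU-based feature fusion over 3D volumes.
# Dense Conv3d is used only to keep the example runnable; the real system
# uses sparse 3D convolutions over sparse TSDF volumes.
import torch
import torch.nn as nn


class ConvGRUFusion(nn.Module):
    """Convolutional GRU cell: gates decide how much of the new fragment's
    features to write into the global hidden-state volume."""

    def __init__(self, channels: int):
        super().__init__()
        self.update_gate = nn.Conv3d(2 * channels, channels, 3, padding=1)
        self.reset_gate = nn.Conv3d(2 * channels, channels, 3, padding=1)
        self.candidate = nn.Conv3d(2 * channels, channels, 3, padding=1)

    def forward(self, hidden: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # hidden: global state (B, C, D, H, W); x: current fragment features.
        hx = torch.cat([hidden, x], dim=1)
        z = torch.sigmoid(self.update_gate(hx))   # how much to update
        r = torch.sigmoid(self.reset_gate(hx))    # how much history to keep
        h_tilde = torch.tanh(self.candidate(torch.cat([r * hidden, x], dim=1)))
        return (1 - z) * hidden + z * h_tilde


# Toy usage: two consecutive fragments are fused into the same hidden state,
# so the second fragment's prediction can draw on the first one's context.
fusion = ConvGRUFusion(channels=8)
hidden = torch.zeros(1, 8, 16, 16, 16)
for fragment_features in (torch.randn(1, 8, 16, 16, 16) for _ in range(2)):
    hidden = fusion(hidden, fragment_features)
print(hidden.shape)  # torch.Size([1, 8, 16, 16, 16])
```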
Zoom in by scrolling. You can toggle the “Single Sided” option in the Model Inspector (press the I key) to enable back-face culling and see through walls. Select “Matcap” to inspect the geometry without textures.
The data is captured around the working area with an iPhone, and the camera poses are obtained from ARKit. The model used here is trained only on ScanNet, which indicates that NeuralRecon generalizes well to new domains. The gradual refinement of the reconstruction quality over time (through GRU Fusion) can also be observed.
The pretrained model of NeuralRecon can generalize reasonably well to outdoor scenes, which are completely out of the domain of the training dataset ScanNet.
NeuralRecon can handle regions with homogeneous texture (e.g., white walls and tables), thanks to the learned surface priors.
NeuralRecon predicts the TSDF with a three-level coarse-to-fine approach that gradually increases the density of sparse voxels. Key-frame images in the local fragment are first passed through the image backbone to extract multi-level features. These image features are then back-projected along each ray and aggregated into a 3D feature volume $\mathbf{F}_t^l$, where $l$ denotes the level index. At the first level ($l=1$), a dense TSDF volume $\mathbf{S}_t^{1}$ is predicted. At the second and third levels, the upsampled $\mathbf{S}_t^{l-1}$ from the previous level is concatenated with $\mathbf{F}_t^l$ and used as the input to the GRU Fusion and MLP modules. A feature volume defined in the world frame is maintained at each level as the global hidden state of the GRU. At the last level, the output $\mathbf{S}_t^l$ is used to replace the corresponding voxels in the global TSDF volume $\mathbf{S}_t^{g}$, yielding the final reconstruction at time $t$.
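The back-projection step that builds the per-fragment feature volume $\mathbf{F}_t^l$ can be sketched as follows. This is a simplified, dense-volume illustration under assumed shapes and conventions (3x4 world-to-pixel projection matrices, averaging features over the views that see each voxel); it is not the released sparse implementation, and the names back_project, proj, volume_origin, and grid_dim are hypothetical.

```python
# A simplified, dense-volume sketch of back-projecting multi-view image
# features into a 3D feature volume by averaging over visible views.
# Shapes, names, and conventions are assumptions made for illustration.
import torch
import torch.nn.functional as F


def back_project(features, proj, volume_origin, voxel_size, grid_dim):
    """features: (V, C, H, W) per-view feature maps.
    proj: (V, 3, 4) world-to-pixel projection matrices (K @ [R|t]).
    grid_dim: (D, H, W) number of voxels along z, y, x.
    Returns a (C, D, H, W) feature volume."""
    n_views, c, h, w = features.shape
    device = features.device
    d_dim, h_dim, w_dim = grid_dim

    # Voxel-center coordinates in the world frame, shape (3, N).
    zs, ys, xs = torch.meshgrid(
        torch.arange(d_dim, device=device),
        torch.arange(h_dim, device=device),
        torch.arange(w_dim, device=device),
        indexing="ij",
    )
    coords = torch.stack([xs, ys, zs], dim=0).reshape(3, -1).float()
    world = coords * voxel_size + volume_origin.view(3, 1)
    world_h = torch.cat([world, torch.ones(1, world.shape[1], device=device)], dim=0)

    volume = torch.zeros(c, world.shape[1], device=device)
    count = torch.zeros(1, world.shape[1], device=device)
    for v in range(n_views):
        cam = proj[v] @ world_h                 # (3, N) homogeneous pixel coords
        z = cam[2].clamp(min=1e-6)
        u = (cam[0] / z) / (w - 1) * 2 - 1      # normalize to [-1, 1] for grid_sample
        v_ = (cam[1] / z) / (h - 1) * 2 - 1
        grid = torch.stack([u, v_], dim=-1).view(1, 1, -1, 2)
        sampled = F.grid_sample(features[v:v + 1], grid, align_corners=True)
        valid = ((u.abs() <= 1) & (v_.abs() <= 1) & (cam[2] > 0)).float()
        volume += sampled.view(c, -1) * valid
        count += valid
    volume = volume / count.clamp(min=1)        # average over views that see the voxel
    return volume.view(c, d_dim, h_dim, w_dim)


# Toy usage: three key-frame views sharing one (assumed) projection matrix.
feat = torch.randn(3, 8, 60, 80)
K = torch.tensor([[100., 0., 40., 0.],
                  [0., 100., 30., 0.],
                  [0., 0., 1., 0.]])
vol = back_project(feat, K.expand(3, 3, 4), torch.tensor([-0.3, -0.3, 1.0]),
                   0.04, (16, 16, 16))
print(vol.shape)  # torch.Size([8, 16, 16, 16])
```

At each level, a volume like the one returned above would be concatenated with the upsampled coarser prediction, fused into the global hidden state, and decoded by an MLP into TSDF values; the actual system restricts all of this to sparse voxels near the predicted surface for real-time performance.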
Only the inference time on key-frames is measured. Back-face culling is enabled during rendering. Ground truth is captured using the LiDAR sensor on an iPad Pro.
B5-Scene 1:
B5-Scene 2:
@article{sun2021neucon,
  title={{NeuralRecon}: Real-Time Coherent {3D} Reconstruction from Monocular Video},
  author={Sun, Jiaming and Xie, Yiming and Chen, Linghao and Zhou, Xiaowei and Bao, Hujun},
  journal={CVPR},
  year={2021}
}
We would like to specially thank Reviewer 3 for the insightful and constructive comments, and Sida Peng, Siyu Zhang, and Qi Fang for proofreading.
You are also welcome to check out our CVPR 2021 work on Transformer-based feature matching (LoFTR) and human reconstruction (NeuralBody and Mirrored-Human).