1State Key Lab of CAD & CG, Zhejiang University
2SenseTime Research
* denotes equal contribution
We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, we propose to directly reconstruct local surfaces, represented as sparse TSDF volumes, for each video fragment sequentially with a neural network. A learning-based TSDF fusion module based on gated recurrent units is used to guide the network to fuse features from previous fragments. This design allows the network to capture the local smoothness prior and global shape prior of 3D surfaces when sequentially reconstructing them, resulting in accurate, coherent, and real-time surface reconstruction. Experiments on the ScanNet and 7-Scenes datasets show that our system outperforms state-of-the-art methods in terms of both accuracy and speed. To the best of our knowledge, this is the first learning-based system that is able to reconstruct dense coherent 3D geometry in real time.
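To make the GRU-based fusion idea concrete, the following is a minimal sketch of a convolutional GRU cell that incrementally fuses each fragment's feature volume into a persistent hidden state defined in the world frame. It assumes dense feature volumes and standard 3D convolutions purely for self-containment; the released NeuralRecon code operates on sparse voxel volumes, and the names here (ConvGRUFusion, hidden, fragment_features) are illustrative, not the actual implementation.

```python
# A minimal, hedged sketch of GRU-based feature fusion over 3D volumes.
# Dense Conv3d is used only to keep the example runnable; the real system
# uses sparse 3D convolutions over sparse TSDF volumes.
import torch
import torch.nn as nn


class ConvGRUFusion(nn.Module):
    """Convolutional GRU cell: gates decide how much of the new fragment's
    features to write into the global hidden-state volume."""

    def __init__(self, channels: int):
        super().__init__()
        self.update_gate = nn.Conv3d(2 * channels, channels, 3, padding=1)
        self.reset_gate = nn.Conv3d(2 * channels, channels, 3, padding=1)
        self.candidate = nn.Conv3d(2 * channels, channels, 3, padding=1)

    def forward(self, hidden: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # hidden: global state (B, C, D, H, W); x: current fragment features.
        hx = torch.cat([hidden, x], dim=1)
        z = torch.sigmoid(self.update_gate(hx))   # how much to update
        r = torch.sigmoid(self.reset_gate(hx))    # how much history to keep
        h_tilde = torch.tanh(self.candidate(torch.cat([r * hidden, x], dim=1)))
        return (1 - z) * hidden + z * h_tilde


# Toy usage: two consecutive fragments are fused into the same hidden state,
# so the second fragment's prediction can draw on the first one's context.
fusion = ConvGRUFusion(channels=8)
hidden = torch.zeros(1, 8, 16, 16, 16)
for fragment_features in (torch.randn(1, 8, 16, 16, 16) for _ in range(2)):
    hidden = fusion(hidden, fragment_features)
print(hidden.shape)  # torch.Size([1, 8, 16, 16, 16])
```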
Zoom in by scrolling. You can toggle the “Single Sided” option in the Model Inspector (press the I key) to enable back-face culling and see through walls. Select “Matcap” to inspect the geometry without textures.
The data is captured around the working area with an iPhone, and the camera poses are obtained from ARKit. The model used here is trained only on ScanNet, which indicates that NeuralRecon generalizes well to new domains. The gradual refinement of the reconstruction quality over time (through GRU Fusion) can also be observed.
The pretrained model of NeuralRecon can generalize reasonably well to outdoor scenes, which are completely out of the domain of the training dataset ScanNet.
NeuralRecon can handle regions with homogeneous texture (e.g., white walls and tables), thanks to the learned surface priors.
NeuralRecon predicts the TSDF with a three-level coarse-to-fine approach that gradually increases the density of sparse voxels. Key-frame images in the local fragment are first passed through the image backbone to extract multi-level features. These image features are then back-projected along each ray and aggregated into a 3D feature volume $\mathbf{F}_t^l$, where $l$ denotes the level index. At the first level ($l=1$), a dense TSDF volume $\mathbf{S}_t^{1}$ is predicted. At the second and third levels, the upsampled $\mathbf{S}_t^{l-1}$ from the previous level is concatenated with $\mathbf{F}_t^l$ and used as the input to the GRU Fusion and MLP modules. A feature volume defined in the world frame is maintained at each level as the global hidden state of the GRU. At the last level, the output $\mathbf{S}_t^l$ is used to replace the corresponding voxels in the global TSDF volume $\mathbf{S}_t^{g}$, yielding the final reconstruction at time $t$.
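The back-projection step that builds the per-fragment feature volume $\mathbf{F}_t^l$ can be sketched as follows. This is a simplified, dense-volume illustration under assumed shapes and conventions (3x4 world-to-pixel projection matrices, averaging features over the views that see each voxel); it is not the released sparse implementation, and the names back_project, proj, volume_origin, and grid_dim are hypothetical.

```python
# A simplified, dense-volume sketch of back-projecting multi-view image
# features into a 3D feature volume by averaging over visible views.
# Shapes, names, and conventions are assumptions made for illustration.
import torch
import torch.nn.functional as F


def back_project(features, proj, volume_origin, voxel_size, grid_dim):
    """features: (V, C, H, W) per-view feature maps.
    proj: (V, 3, 4) world-to-pixel projection matrices (K @ [R|t]).
    grid_dim: (D, H, W) number of voxels along z, y, x.
    Returns a (C, D, H, W) feature volume."""
    n_views, c, h, w = features.shape
    device = features.device
    d_dim, h_dim, w_dim = grid_dim

    # Voxel-center coordinates in the world frame, shape (3, N).
    zs, ys, xs = torch.meshgrid(
        torch.arange(d_dim, device=device),
        torch.arange(h_dim, device=device),
        torch.arange(w_dim, device=device),
        indexing="ij",
    )
    coords = torch.stack([xs, ys, zs], dim=0).reshape(3, -1).float()
    world = coords * voxel_size + volume_origin.view(3, 1)
    world_h = torch.cat([world, torch.ones(1, world.shape[1], device=device)], dim=0)

    volume = torch.zeros(c, world.shape[1], device=device)
    count = torch.zeros(1, world.shape[1], device=device)
    for v in range(n_views):
        cam = proj[v] @ world_h                 # (3, N) homogeneous pixel coords
        z = cam[2].clamp(min=1e-6)
        u = (cam[0] / z) / (w - 1) * 2 - 1      # normalize to [-1, 1] for grid_sample
        v_ = (cam[1] / z) / (h - 1) * 2 - 1
        grid = torch.stack([u, v_], dim=-1).view(1, 1, -1, 2)
        sampled = F.grid_sample(features[v:v + 1], grid, align_corners=True)
        valid = ((u.abs() <= 1) & (v_.abs() <= 1) & (cam[2] > 0)).float()
        volume += sampled.view(c, -1) * valid
        count += valid
    volume = volume / count.clamp(min=1)        # average over views that see the voxel
    return volume.view(c, d_dim, h_dim, w_dim)


# Toy usage: three key-frame views sharing one (assumed) projection matrix.
feat = torch.randn(3, 8, 60, 80)
K = torch.tensor([[100., 0., 40., 0.],
                  [0., 100., 30., 0.],
                  [0., 0., 1., 0.]])
vol = back_project(feat, K.expand(3, 3, 4), torch.tensor([-0.3, -0.3, 1.0]),
                   0.04, (16, 16, 16))
print(vol.shape)  # torch.Size([8, 16, 16, 16])
```

At each level, a volume like the one returned above would be concatenated with the upsampled coarser prediction, fused into the global hidden state, and decoded by an MLP into TSDF values; the actual system restricts all of this to sparse voxels near the predicted surface for real-time performance.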
Only the inference time on key-frames is measured. Back-face culling is enabled during rendering. Ground truth is captured using the LiDAR sensor on an iPad Pro.
B5-Scene 1:
B5-Scene 2:
@article{sun2021neucon,
  title={{NeuralRecon}: Real-Time Coherent {3D} Reconstruction from Monocular Video},
  author={Sun, Jiaming and Xie, Yiming and Chen, Linghao and Zhou, Xiaowei and Bao, Hujun},
  journal={CVPR},
  year={2021}
}
We would like to specially thank Reviewer 3 for the insightful and constructive comments, and Sida Peng, Siyu Zhang, and Qi Fang for proofreading.
You are also welcome to check out our CVPR 2021 work on Transformer-based feature matching (LoFTR) and human reconstruction (NeuralBody and Mirrored-Human).