NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video

CVPR 2021 (Oral)

Jiaming Sun1,2*, Yiming Xie1*, Linghao Chen1, Xiaowei Zhou1, Hujun Bao1

1State Key Lab of CAD & CG, Zhejiang University    2SenseTime Research
* denotes equal contribution


NeuralRecon reconstructs 3D scene geometry from a monocular video with known camera poses in real-time🔥.

We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, we propose to directly reconstruct local surfaces represented as sparse TSDF volumes for each video fragment sequentially by a neural network. A learning-based TSDF fusion module based on gated recurrent units is used to guide the network to fuse features from previous fragments. This design allows the network to capture local smoothness prior and global shape prior of 3D surfaces when sequentially reconstructing the surfaces, resulting in accurate, coherent, and real-time surface reconstruction. The experiments on ScanNet and 7-Scenes datasets show that our system outperforms state-of-the-art methods in terms of both accuracy and speed. To the best of our knowledge, this is the first learning-based system that is able to reconstruct dense coherent 3D geometry in real-time.

Reconstruction showcase

Zoom in by scrolling. You can toggle the “Single Sided” option in Model Inspector (pressing I key) to enable back-face culling (see through walls). Select “Matcap” to inspect the geometry without textures.

Real-time incremental reconstruction

The data is captured around the working area with an iPhone, and the camera poses are obtained from ARKit. The model used here is trained only on ScanNet, which indicates that NeuralRecon generalizes well to new domains. Notice that NeuralRecon can handle homogeneous textures (e.g. the booth area and the white walls in the header video), thanks to the learned surface priors. The gradual refinement of the reconstruction quality over time (through GRU-Fusion) can also be observed.

Pipeline overview (video coming soon)

NeuralRecon Architecture

NeuralRecon predicts the TSDF with a three-level coarse-to-fine approach that gradually increases the density of sparse voxels. Key-frame images in the local fragment are first passed through the image backbone to extract multi-level features. These image features are then back-projected along each ray and aggregated into a 3D feature volume $\mathbf{F}_t^l$, where $l$ represents the level index. At the first level ($l=1$), a dense TSDF volume $\mathbf{S}_t^{1}$ is predicted. At the second and third levels, the upsampled $\mathbf{S}_t^{l-1}$ from the previous level is concatenated with $\mathbf{F}_t^l$ and used as the input for the GRU-Fusion and MLP modules. A feature volume defined in the world frame is maintained at each level as the global hidden state of the GRU. At the last level, the output $\mathbf{S}_t^l$ is used to replace the corresponding voxels in the global TSDF volume $\mathbf{S}_t^{g}$, yielding the final reconstruction at time $t$.
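The back-projection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera model, world-to-camera poses given as a rotation `R` and translation `t`, nearest-neighbour sampling of the feature map, and plain averaging over the views in which a voxel is visible. All names (`project`, `backproject_features`, the `views` dictionary layout) are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project(point_w: Vec3, R: Sequence[Sequence[float]], t: Vec3,
            fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[int, int]]:
    """Project a world-space point to integer pixel coordinates (u, v).

    Returns None when the point lies behind the camera.
    """
    # World -> camera transform: p_c = R @ p_w + t
    pc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    if pc[2] <= 1e-6:
        return None
    u = fx * pc[0] / pc[2] + cx
    v = fy * pc[1] / pc[2] + cy
    return int(round(u)), int(round(v))


def backproject_features(voxel_centers: List[Vec3],
                         views: List[Dict]) -> Tuple[List[List[float]], List[int]]:
    """Aggregate per-view image features into one feature vector per voxel.

    Each view is a dict (hypothetical layout) with keys:
      'feat': H x W x C nested lists, 'R': 3x3 rotation, 't': translation,
      'K': (fx, fy, cx, cy).
    Returns the averaged features and the number of views that saw each voxel.
    """
    channels = len(views[0]["feat"][0][0])
    features, counts = [], []
    for p in voxel_centers:
        acc = [0.0] * channels
        n_visible = 0
        for view in views:
            feat = view["feat"]
            height, width = len(feat), len(feat[0])
            uv = project(p, view["R"], view["t"], *view["K"])
            if uv is None:
                continue  # voxel is behind this camera
            u, v = uv
            if not (0 <= u < width and 0 <= v < height):
                continue  # voxel projects outside the image
            acc = [a + b for a, b in zip(acc, feat[v][u])]
            n_visible += 1
        if n_visible > 0:
            acc = [a / n_visible for a in acc]
        features.append(acc)
        counts.append(n_visible)
    return features, counts
```

Voxels that no key frame observes keep a zero feature vector and a visibility count of zero, so a later stage can tell them apart from observed voxels. A real system would operate on GPU tensors and sparse volumes rather than nested lists.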

Comparison with state-of-the-art methods

Only the inference time on key frames is measured. Back-face culling is enabled during rendering. The ground truth is captured using the LiDAR sensor on an iPad Pro.

B5-Scene 1:

B5-Scene 2:

Comparison with Atlas on a large scene (30 m × 10 m)


  title={{NeuralRecon}: Real-Time Coherent {3D} Reconstruction from Monocular Video},
  author={Sun, Jiaming and Xie, Yiming and Chen, Linghao and Zhou, Xiaowei and Bao, Hujun},


We would especially like to thank Reviewer 3 for the insightful and constructive comments. We would also like to thank Sida Peng, Siyu Zhang and Qi Fang for proofreading.

Recommendations to other works from our group

You are welcome to check out our work on Transformer-based feature matching (LoFTR) and human reconstruction (NeuralBody and Mirrored-Human) in CVPR 2021.