Multi-view Reconstruction via SfM-guided Monocular Depth Estimation

CVPR 2025

¹Zhejiang University    ²Beijing Normal University
* Equal contribution.
^ Work done during internship at Zhejiang University.
Corresponding author.


Overview

In this paper, we present a new method for multi-view geometric reconstruction. Large vision models have developed rapidly in recent years, performing well across a wide range of tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, and the resulting depth maps have been used to facilitate multi-view reconstruction in an indirect manner. However, because monocular depth estimation is inherently ambiguous, the estimated depth values are usually not accurate enough, which limits their utility for multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, which improves the quality of the depth predictions and allows them to be used directly for multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves depth estimation quality compared with previous monocular depth estimation works. We also evaluate the reconstruction quality of our approach on various types of scenes, including indoor, streetscape, and aerial views, where it surpasses state-of-the-art MVS methods.

Depth Map Visualization and Comparison

Framework

[Figure: training pipeline]

Given multi-view images, we first employ a Structure-from-Motion (SfM) method to derive sparse 3D scene structures (a). These 3D structures are then encoded into an intermediate explicit representation (b), which is used as a condition for depth estimation (c). Finally, we apply TSDF (truncated signed distance function) fusion to obtain the final reconstruction (d).
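To make step (b) more concrete, the sketch below shows one plausible way to turn sparse SfM points into an explicit per-view representation: projecting them into a camera to form a sparse depth map. This is an illustrative assumption, not necessarily the exact encoding used in the paper. The function name `project_to_sparse_depth`, the pinhole intrinsics `K`, and the world-to-camera pose `(R, t)` are hypothetical names chosen for this example.

```python
import numpy as np


def project_to_sparse_depth(points_world, K, R, t, height, width):
    """Project sparse 3D points into one view and return a sparse depth map.

    points_world: (N, 3) array of 3D points in world coordinates.
    K: (3, 3) pinhole intrinsics; R (3, 3), t (3,): world-to-camera pose.
    Pixels that receive no point are left at 0.
    """
    points_cam = points_world @ R.T + t      # world frame -> camera frame
    z = points_cam[:, 2]
    in_front = z > 1e-6                      # discard points behind the camera
    points_cam, z = points_cam[in_front], z[in_front]

    uv = points_cam @ K.T                    # homogeneous pixel coordinates
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # Keep the nearest point when several points fall on the same pixel.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0
    return depth
```

A depth network could then take such a map as an extra input channel alongside the image. How the condition is actually injected into the network is not specified on this page, so that part is left out of the sketch.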

Reconstruction Showcase

We present reconstruction results on UrbanScene3D. Because of Sketchfab's file size limit, the models displayed here are of relatively low resolution. You may download the models with higher-resolution geometry and texture if you are interested.


Combination with Gaussian Splatting

We run gsplat with and without our method providing initialization and supervision, and compare the results. Our method effectively constrains the Gaussian positions, particularly in textureless and reflective regions such as lakes.
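As a rough illustration of how a predicted depth map could supply initial Gaussian positions, the sketch below back-projects every valid pixel to a 3D point in world coordinates. It is a minimal example under assumed conventions (pinhole intrinsics `K`, world-to-camera pose `(R, t)`, z-depth in camera space). The helper `unproject_depth` is hypothetical, and the sketch does not show gsplat's API or the exact initialization procedure used in the experiments.

```python
import numpy as np


def unproject_depth(depth, K, R, t):
    """Lift a depth map to world-space 3D points.

    depth: (H, W) array of z-depths; non-positive values are treated as invalid.
    K: (3, 3) pinhole intrinsics; R (3, 3), t (3,): world-to-camera pose.
    Returns an (M, 3) array with one point per valid pixel.
    """
    height, width = depth.shape
    v, u = np.mgrid[0:height, 0:width]       # pixel row (v) and column (u) grids
    valid = depth > 0
    z = depth[valid]

    pixels = np.stack([u[valid], v[valid], np.ones_like(z)], axis=1)
    rays = pixels @ np.linalg.inv(K).T       # camera-frame rays with z = 1
    points_cam = rays * z[:, None]

    # Invert x_cam = R @ x_world + t, i.e. x_world = R^T @ (x_cam - t).
    return (points_cam - t) @ R
```

The returned points could serve as initial Gaussian centers, and the same depth map could be compared against rendered depth as a supervision signal. Both uses are assumptions about one reasonable setup rather than a description of the authors' pipeline.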

Citation

@inproceedings{guo2025murre,
  title={Multi-view Reconstruction via SfM-guided Monocular Depth Estimation},
  author={Guo, Haoyu and Zhu, He and Peng, Sida and Lin, Haotong and Yan, Yunzhi and Xie, Tao and Wang, Wenguan and Zhou, Xiaowei and Bao, Hujun},
  booktitle={CVPR},
  year={2025},
}