TL;DR: We propose a new Gaussian Splatting representation combined with a video diffusion model for novel view synthesis and surface reconstruction from extremely sparse (3-4) unposed images of unbounded 360° scenes.


How it works

In this paper, we propose a novel neural rendering framework for unposed, extremely sparse-view 3D reconstruction in unbounded 360° scenes. To resolve the spatial ambiguity inherent in unbounded scenes with sparse input views, we propose a layered Gaussian-based representation that models the scene with distinct spatial layers. Using a dense stereo reconstruction model to recover coarse geometry, we introduce a layer-specific bootstrap optimization that refines the noisy initial geometry and fills in occluded regions of the reconstruction. Furthermore, we propose an iterative fusion of reconstruction and generation, together with an uncertainty-aware training scheme, so that the two processes condition and enhance each other.
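As an illustration of the layered representation, the sketch below partitions an initial point cloud into distinct spatial layers before layer-specific bootstrap optimization. It is only a conceptual sketch, not the actual implementation: the radial layer boundaries and the three-layer split are assumptions made for the example.

```python
# Illustrative sketch only: partition an initial point cloud into spatial
# layers by radial distance from the scene center, so that each layer can be
# optimized with its own bootstrap schedule. The boundaries (2.0, 8.0) are
# assumed for illustration, not values from the actual method.
import numpy as np

def split_into_layers(points, bounds=(2.0, 8.0)):
    """Assign each 3D point to a near / mid / far spatial layer.

    points : (N, 3) array of initial point-cloud positions.
    bounds : assumed radial distances separating the layers.
    """
    center = points.mean(axis=0)
    radii = np.linalg.norm(points - center, axis=1)
    layer_ids = np.digitize(radii, bounds)           # 0: near, 1: mid, 2: far
    return [points[layer_ids == k] for k in range(len(bounds) + 1)]

# Each returned layer would seed its own set of Gaussians, which are then
# refined (densified / pruned) during layer-specific bootstrap optimization.
layers = split_into_layers(np.random.rand(10_000, 3) * 20.0)
print([layer.shape[0] for layer in layers])
```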







Comparisons to other methods

Compare the renderings and surface reconstructions of our method (right) with baseline methods (left) on Mip-NeRF 360 (3-view) and Tanks and Temples (4-view). Try selecting different methods and scenes!





Method overview

A diagram explaining the method in broad strokes, as described in the caption below.
(a-d) Given unposed, extremely sparse views, we employ a dense stereo reconstruction model [18, 38] to recover camera poses and an initial point cloud of the scene. A layered Gaussian-based representation is built upon the initial point cloud to enable layer-specific bootstrap optimization. (e) We design an iterative fusion of reconstruction and generation with a diffusion model [49]. Unknown views are iteratively generated, conditioned on consistent GS renderings of known views; in turn, the generated views are used to enhance GS training.
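The sketch below shows the shape of the iterative fusion loop in (e). It is a structural sketch under assumptions, not the actual implementation: the callables for the GS renderer, GS trainer, diffusion-based view generator, and uncertainty estimator are hypothetical placeholders for the corresponding components.

```python
from typing import Callable, Sequence

# Structural sketch of the iterative fusion of reconstruction and generation.
# All callables are hypothetical stand-ins for the components in the caption.
def iterative_fusion(
    gaussians,
    known_views: Sequence,       # sparse input views with poses recovered in (a-d)
    unknown_poses: Sequence,     # target poses for generated views
    render_gs: Callable,         # (gaussians, pose) -> image
    generate_view: Callable,     # (condition_images, pose) -> image (diffusion model)
    train_gs: Callable,          # (gaussians, views, weights) -> gaussians
    uncertainty: Callable,       # image -> value in [0, 1]
    num_rounds: int = 3,
):
    views = list(known_views)
    weights = [1.0] * len(views)         # real observations are fully trusted
    for _ in range(num_rounds):
        # Generate unknown views conditioned on consistent GS renderings of the
        # known views, so generation stays anchored to the reconstructed scene.
        cond = [render_gs(gaussians, v.pose) for v in known_views]
        generated = [generate_view(cond, p) for p in unknown_poses]
        # Uncertainty-aware weighting: down-weight hallucinated content.
        gen_w = [1.0 - uncertainty(img) for img in generated]
        # In turn, the generated views enhance GS training.
        gaussians = train_gs(gaussians, views + generated, weights + gen_w)
    return gaussians
```

In this sketch, each round re-renders the known views from the current Gaussians before generating, so the generation is always conditioned on the latest reconstruction, and the uncertainty weights keep generated views from dominating the real observations during training.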