StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

¹SenseTime Research  ²State Key Lab of CAD&CG, Zhejiang University  ³Tetras.AI
* Equal Contribution · Corresponding Authors


Note: This is an ongoing project; new results will be continuously updated on this page.

The following sections present the method description and results from the arXiv paper.

Abstract

Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions, facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen's superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods.

Method Overview

(a) We introduce a spatiotemporal autoregression framework for long-range scene generation. The generated scene is represented as a set of sparsely sampled posed images. The generation of the current sliding window of images (blue dotted box) is conditioned on previously generated, spatially adjacent images (green frustums) and the temporally overlapping image (blue solid box) from the preceding window.
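To make the autoregression concrete, here is a minimal Python sketch of the sliding-window loop. The helpers generate_window (one conditioned generation pass over a window of target poses) and select_spatial_neighbors (which picks previously generated images whose frustums overlap the current window) are hypothetical placeholders for illustration, not the released StarGen code.

# Minimal sketch of the spatiotemporal autoregression loop (assumed helpers,
# not the StarGen API): the trajectory is generated window by window, each
# window conditioned on spatially adjacent past images and on the temporally
# overlapping frame of the preceding window.
def autoregress_scene(poses, window_size, generate_window, select_spatial_neighbors):
    """poses: list of target camera poses covering the full trajectory.
    generate_window(target_poses, spatial_cond, temporal_cond) -> list of posed images.
    select_spatial_neighbors(generated, target_poses) -> spatially adjacent posed images.
    """
    generated = []          # all posed images produced so far
    temporal_cond = None    # last frame of the previous window (temporal overlap)
    step = max(window_size - 1, 1)
    for start in range(0, len(poses), step):
        target_poses = poses[start:start + window_size]
        # Spatial condition: previously generated images whose frustums
        # overlap the current window (green frustums in the figure).
        spatial_cond = select_spatial_neighbors(generated, target_poses)
        window = generate_window(target_poses, spatial_cond, temporal_cond)
        # The first frame of the window duplicates the temporal condition,
        # so only the new frames are appended.
        generated.extend(window if temporal_cond is None else window[1:])
        temporal_cond = window[-1]
    return generated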

(b) The spatial conditioning images are processed by a large reconstruction model, which extracts the 3D information and renders the reconstructed latent features to each novel view. These spatial features, together with the temporal conditioning image, are used to condition the generation of the current window through a video diffusion model and a ControlNet.
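The sketch below illustrates how one denoising step could combine the two conditions. Here recon_model, control_net, and video_unet are hypothetical callables standing in for the large reconstruction model, the ControlNet, and the video diffusion backbone, and the way the temporal frame is injected (channel-wise concatenation with the noisy latents) is an assumption for illustration, not the paper's specification.

import torch

# Sketch of one denoising step with spatial and temporal conditioning.
# All module interfaces below are assumed, not taken from the paper's code.
def denoise_window(noisy_latents, t, target_poses,
                   spatial_images, spatial_poses, temporal_latent,
                   recon_model, control_net, video_unet):
    # (1) Reconstruct 3D latent features from the spatially adjacent
    #     conditioning images and render them into every novel view.
    feats_3d = recon_model.reconstruct(spatial_images, spatial_poses)
    rendered = torch.stack(
        [recon_model.render(feats_3d, pose) for pose in target_poses], dim=1
    )  # (B, T, C, h, w): one rendered latent feature map per target view

    # (2) A ControlNet consumes the rendered spatial features and produces
    #     residual features injected into the video diffusion UNet.
    control_residuals = control_net(noisy_latents, t, rendered)

    # (3) Temporal condition: the latent of the overlapping frame from the
    #     previous clip, broadcast over the window and concatenated with the
    #     noisy latents along the channel axis (an assumed injection scheme).
    temporal = temporal_latent.unsqueeze(1).expand(-1, noisy_latents.shape[1], -1, -1, -1)
    model_input = torch.cat([noisy_latents, temporal], dim=2)

    # (4) Predict the noise for every frame of the current window.
    return video_unet(model_input, t, control_residuals)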

(c) The framework is used to implement three downstream tasks: sparse view interpolation, perpetual view generation, and layout-conditioned city generation.


Video Comparisons

Sparse View Interpolation

Perpetual View Generation

Layout-conditioned City Generation

Quantitative Comparisons

Accuracy comparison on the RealEstate10K and ACID datasets
Accuracy comparison on the RealEstate10K and Tanks and Temples datasets
Scalability comparison on long-range videos from the RealEstate10K dataset

Ablation Study

Ablation on the spatial and temporal conditions for perpetual view generation.
Video of Ablation on Spatial and Temporal Conditions

BibTeX


@misc{zhai2025stargen,
  title={StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation},
  author={Shangjin Zhai and Zhichao Ye and Jialin Liu and Weijian Xie and Jiaqi Hu and Zhen Peng and Hua Xue and Danpeng Chen and Xiaomeng Wang and Lei Yang and Nan Wang and Haomin Liu and Guofeng Zhang},
  year={2025},
  eprint={2501.05763},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.05763},
}