Diffusion-based methods have achieved remarkable results in 2D image and 3D object generation. However, generating 3D scenes, and even $360^{\circ}$ images, remains constrained by the limited number of scene datasets, the complexity of 3D scenes themselves, and the difficulty of producing consistent multi-view images. To address these issues, we first establish a large-scale panoramic video-text dataset containing millions of consecutive panoramic keyframes with corresponding panoramic depths, camera poses, and text descriptions. We then propose a novel text-driven panoramic generation framework, termed DiffPano, to achieve scalable, consistent, and diverse panoramic scene generation. Specifically, building on the powerful generative capabilities of Stable Diffusion, we fine-tune a single-view text-to-panorama diffusion model with LoRA on the established panoramic video-text dataset. We further design a spherical epipolar-aware multi-view diffusion model to ensure the multi-view consistency of the generated panoramic images. Extensive experiments demonstrate that DiffPano can generate scalable, consistent, and diverse panoramic images from unseen text descriptions and camera poses.
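As a rough illustration of the single-view stage, the sketch below fine-tunes a Stable Diffusion UNet with LoRA on panorama-caption pairs using the diffusers and peft libraries. The data loader `panorama_loader`, the base checkpoint, and the hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler
from peft import LoraConfig

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae, text_encoder = pipe.unet, pipe.vae, pipe.text_encoder
scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

# Freeze the base model; only the injected LoRA weights are trained.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.requires_grad_(False)
unet.add_adapter(LoraConfig(r=16, lora_alpha=16,
                            target_modules=["to_q", "to_k", "to_v", "to_out.0"]))
optimizer = torch.optim.AdamW(
    [p for p in unet.parameters() if p.requires_grad], lr=1e-4)

for batch in panorama_loader:  # yields equirectangular images and tokenized captions
    latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    text_emb = text_encoder(batch["input_ids"])[0]

    # Standard epsilon-prediction objective on panorama latents.
    pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred.float(), noise.float())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```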
Panoramic Video Construction and Caption Pipeline. We use the Habitat Simulator to randomly sample positions within scenes of the Habitat-Matterport 3D (HM3D) dataset and render six-face cubemaps, which are then interpolated and stitched into panoramas. This yields panoramas with clear tops and bottoms. To generate more precise text descriptions for the panoramas, we first use BLIP2 to caption each cubemap face and then employ an LLM to summarize these captions into accurate and complete descriptions. Furthermore, the Habitat Simulator allows us to render images along camera trajectories within the HM3D scenes, enabling the creation of a dataset that simultaneously includes camera poses, panoramas, panoramic depths, and their corresponding text descriptions.
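For concreteness, a minimal NumPy sketch of the cubemap-to-equirectangular stitching step is shown below; the face-orientation convention and nearest-neighbour sampling are simplifying assumptions (the actual pipeline interpolates), not the exact implementation used to build the dataset.

```python
import numpy as np

def cubemap_to_equirect(faces, height):
    """Stitch six 90-degree cubemap faces ('front', 'back', 'left', 'right',
    'up', 'down', each an (fsz, fsz, 3) array) into a (height, 2*height, 3)
    equirectangular panorama via nearest-neighbour sampling."""
    width, fsz = 2 * height, faces["front"].shape[0]

    # Longitude/latitude of every output pixel.
    u = (np.arange(width) + 0.5) / width
    v = (np.arange(height) + 0.5) / height
    lon, lat = np.meshgrid((u * 2 - 1) * np.pi, (0.5 - v) * np.pi)

    # Unit viewing direction (+z forward, +x right, +y up).
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    ax, ay, az = np.abs(x) + 1e-9, np.abs(y) + 1e-9, np.abs(z) + 1e-9
    major = np.argmax(np.stack([ax, ay, az]), axis=0)  # dominant axis per pixel

    out = np.zeros((height, width, 3), dtype=faces["front"].dtype)

    def paste(mask, name, a, b):
        # (a, b) in [-1, 1] are coordinates on the selected face.
        i = np.clip(((b * 0.5 + 0.5) * fsz).astype(int), 0, fsz - 1)[mask]
        j = np.clip(((a * 0.5 + 0.5) * fsz).astype(int), 0, fsz - 1)[mask]
        out[mask] = faces[name][i, j]

    paste((major == 2) & (z > 0), "front",  x / az, -y / az)
    paste((major == 2) & (z < 0), "back",  -x / az, -y / az)
    paste((major == 0) & (x > 0), "right", -z / ax, -y / ax)
    paste((major == 0) & (x < 0), "left",   z / ax, -y / ax)
    paste((major == 1) & (y > 0), "up",     x / ay,  z / ay)
    paste((major == 1) & (y < 0), "down",   x / ay, -z / ay)
    return out
```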
Comparisons between PanFusion and Ours. PanFusion uses BLIP2 to caption panoramas directly, producing very concise descriptions of only four or five words. The CLIP Score (CS) cannot reflect the accuracy of these text descriptions, and the PanFusion dataset suffers from blurry tops and bottoms. In contrast, our panoramic video dataset construction pipeline first captions perspective images with BLIP2 and then uses an LLM to summarize them, yielding more detailed text descriptions. At the same time, the tops and bottoms of our panoramas are clear, and the dataset is larger (millions of panoramic keyframes). We also provide the camera pose of each panorama, the corresponding panoramic depth map, and more.
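The caption pipeline described above can be sketched as follows with Hugging Face's BLIP2 checkpoint; `cube_face_paths` and the `summarize_with_llm` call are hypothetical placeholders for the six rendered faces and whichever LLM is used for summarization.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to("cuda")

def caption(image):
    """One short BLIP2 caption for a single perspective (cubemap face) image."""
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    out_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(out_ids, skip_special_tokens=True)[0].strip()

face_captions = [caption(Image.open(p)) for p in cube_face_paths]  # six views

# Merge the six per-view captions into one panorama-level description.
prompt = ("These are captions of six views of the same scene:\n"
          + "\n".join(f"- {c}" for c in face_captions)
          + "\nSummarize them into one complete description of the panorama.")
panorama_caption = summarize_with_llm(prompt)
```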
DiffPano Framework. The DiffPano framework consists of a single-view panoramic diffusion model and a multi-view diffusion model based on panoramic epipolar-aware attention. It supports both text-to-panorama and multi-view panorama generation.
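To illustrate what panoramic epipolar-aware attention can mean geometrically, the sketch below builds a cross-view attention mask from the spherical epipolar constraint $d_{tgt}^{\top} E\, d_{src} \approx 0$ between two equirectangular views with relative pose (R, t); the threshold and the way the mask gates attention are assumptions, not DiffPano's exact formulation.

```python
import torch

def equirect_ray_grid(h, w):
    """Unit viewing direction for every pixel of an (h, w) equirectangular image."""
    v, u = torch.meshgrid((torch.arange(h) + 0.5) / h,
                          (torch.arange(w) + 0.5) / w, indexing="ij")
    lon, lat = (u * 2 - 1) * torch.pi, (0.5 - v) * torch.pi
    rays = torch.stack([torch.cos(lat) * torch.sin(lon),
                        torch.sin(lat),
                        torch.cos(lat) * torch.cos(lon)], dim=-1)
    return rays.reshape(-1, 3)                       # (h*w, 3)

def spherical_epipolar_mask(R, t, h, w, thresh=0.05):
    """Boolean (h*w, h*w) mask that is True where a target pixel lies near the
    spherical epipolar great circle of a source pixel under relative pose (R, t)."""
    d_src = equirect_ray_grid(h, w)                  # rays in the source frame
    d_tgt = equirect_ray_grid(h, w)                  # rays in the target frame
    t = t / (t.norm() + 1e-8)
    t_skew = torch.tensor([[0.0, -t[2], t[1]],
                           [t[2], 0.0, -t[0]],
                           [-t[1], t[0], 0.0]])
    E = t_skew @ R                                   # essential matrix
    residual = (d_tgt @ E @ d_src.T).abs()           # ~0 on the epipolar curve
    return residual < thresh

# The mask can then gate cross-view attention logits, e.g.
# logits = logits.masked_fill(~spherical_epipolar_mask(R, t, h, w), float("-inf"))
```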
Text to Panorama Comparison between Ours vs PanFusion vs Text2Light. Compared with PanFusion, our method generates panoramas with clear tops and bottoms, whereas the tops and bottoms of PanFusion's results are blurred. Compared with Text2Light, our method has better left-right consistency, i.e., continuity across the horizontal wrap-around seam.
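A simple illustrative check of left-right consistency (not the paper's evaluation metric) is to compare the leftmost and rightmost columns of the generated panorama:

```python
import numpy as np

def seam_error(pano):
    """Mean absolute difference between the first and last columns of an
    equirectangular panorama; lower values mean a cleaner wrap-around seam."""
    left = pano[:, 0].astype(np.float32)
    right = pano[:, -1].astype(np.float32)
    return float(np.abs(left - right).mean())
```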
Diversity of Text to Panorama Generation with Our Method. Given the same text prompt, such as "A cozy living room with wooden floors and a couch", our method can generate diverse yet consistent panoramas.
Generalizability of Text to Panorama Generation with Our Method. Although our method is trained only on indoor scene datasets, it can still generate outdoor panoramic scenes conditioned on text, which shows that our method has a certain degree of generalization. In the future, we plan to explore outdoor scene reconstruction to increase the diversity of the training data and further improve the generalization of our method.
Text to Multi-View Panorama Comparison between Ours vs Modified MVDream. Experiments show that DiffPano converges more easily than MVDream and generates more consistent multi-view panoramas. "MVDream×2" denotes MVDream trained with twice as many iterations as DiffPano.
Text to Multi-View Panorama of Our Method. DiffPano enables scalable and consistent panorama generation (i.e., room switching) given unseen text descriptions and camera poses. Each column shows the generated multi-view panoramas, switching from one room to another.
Text to Panoramic Video Generation of Our Method. Our method can generate longer panoramic videos via image-based panorama generation, which demonstrates its scalability. To generate a panoramic video, we first generate multi-view panoramas of different rooms with large pose changes conditioned on the text, and then run image-to-multi-view panorama generation conditioned on those generated panoramas; repeating this step extends the output into longer panoramic videos, as sketched below.
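A high-level sketch of this two-stage procedure is given below; `text_to_multiview_panorama` and `image_to_multiview_panorama` are hypothetical stand-ins for DiffPano's text-conditioned and image-conditioned multi-view samplers.

```python
def generate_panoramic_video(prompt, keyframe_poses, dense_pose_chunks):
    """Extend multi-view panorama generation into a longer panoramic video.

    keyframe_poses:    camera poses with large inter-frame motion (room switches).
    dense_pose_chunks: one list of intermediate poses per keyframe.
    """
    # Stage 1: text-conditioned generation of sparse multi-view keyframe panoramas.
    keyframes = text_to_multiview_panorama(prompt, keyframe_poses)

    # Stage 2: image-conditioned generation anchored on each keyframe fills in the
    # intermediate views; repeating this step extends the video indefinitely.
    video = []
    for keyframe, poses in zip(keyframes, dense_pose_chunks):
        video.extend(image_to_multiview_panorama(cond_image=keyframe, poses=poses))
    return video
```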
@inproceedings{Ye2024DiffPano,
title={DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion},
author={Weicai Ye and Chenhao Ji and Zheng Chen and Junyao Gao and Xiaoshui Huang and Song-Hai Zhang and Wanli Ouyang and Tong He and Cairong Zhao and Guofeng Zhang},
booktitle={NeurIPS},
year={2024},
}