HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation

1Zhejiang University, 2ByteDance

Hierarchical scene generation with HiScene.

Abstract

Scene-level 3D generation represents a critical frontier in multimedia and computer graphics, yet existing approaches either suffer from limited object categories or lack editing flexibility for interactive applications. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical "objects" under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.

Video

Pipeline


Our hierarchical framework generates 3D scenes with compositional identities through three main stages. First, we create a 3D scene from a generated isometric view. Next, we perform scene parsing to obtain precise object segmentation, followed by multi-view rendering and detailed occlusion analysis for each identified instance. Finally, we apply our video-diffusion-based amodal completion to generate complete views of each instance, which serve as guidance for regenerating intact objects with proper spatial alignment in the scene. The resulting 3D scene features fully compositional identities, facilitating user-directed modifications like interactive scene editing.

Hierarchical scene parsing

HiScene first initializes a 3D Gaussian Splatting scene from a generated isometric view, then performs hierarchical scene parsing with semantic segmentation to identify distinct objects. For each identified object, it produces multi-view renderings and an occlusion analysis.

2D & 3D Segmentation

Object-Centric Multiview Rendering & Occlusion Analysis
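
The page does not spell out how the occlusion analysis is computed, so the following is only a minimal sketch of one plausible measure. It assumes that, for every rendered view, two binary masks are available: the instance as seen in the full scene (`visible_mask`) and the instance rendered in isolation (`full_mask`). Both names, and the helpers `visibility_ratio` and `rank_views`, are hypothetical and not taken from HiScene.

```python
import numpy as np


def visibility_ratio(visible_mask: np.ndarray, full_mask: np.ndarray) -> float:
    """Fraction of the instance's isolated silhouette that stays visible in the scene."""
    full = full_mask.astype(bool)
    # Only count visible pixels that lie inside the isolated silhouette.
    visible = visible_mask.astype(bool) & full
    total = int(full.sum())
    if total == 0:
        # The instance does not appear in this view at all.
        return 0.0
    return float(visible.sum()) / total


def rank_views(visible_masks, full_masks):
    """Return (view indices sorted from least to most occluded, per-view ratios)."""
    ratios = [visibility_ratio(v, f) for v, f in zip(visible_masks, full_masks)]
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    return order, ratios
```

Under these assumptions, a view with a ratio near 1.0 shows the object almost unoccluded, while low-ratio views are the ones that would benefit most from amodal completion.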

Video-diffusion-based Amodal Completion

A key challenge during identity refinement is that the rendered instance views often exhibit significant occlusions. Despite advances in 3D object generation, reconstructing complete objects from occluded views remains ill-posed. To tackle this problem, we reformulate the instance refinement as a 2D amodal completion and 3D regeneration task, and propose a video-diffusion-based completion framework to handle it. Our method treats the amodal completion process as a temporal transition video effect, where occlusions gradually dissolve to reveal the complete object.

Dataset curation

During object completion, apart from filling occluded parts, we need to remove notable visual artifacts caused by occlusion, such as shadows. To create training data with realistic shadow effects, we combined carefully filtered Objaverse 3D objects (181K) with rigid body simulation and path-tracing rendering in Blender, generating 468K synthetic images. This synthetic dataset was then integrated with existing data, resulting in a comprehensive collection of 1.32 million image pairs. These pairs were converted to video format using linear blending, transitioning from foreground objects to their complete versions.
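
The linear-blending step described above can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name `blend_to_video`, the default of 16 frames, and the float32 output are assumptions, and the two inputs are assumed to be aligned images of the same shape.

```python
import numpy as np


def blend_to_video(foreground: np.ndarray, complete: np.ndarray,
                   num_frames: int = 16) -> np.ndarray:
    """Linearly cross-fade from the occluded foreground image to its complete version.

    Returns an array of shape (num_frames, *foreground.shape) in float32.
    """
    if foreground.shape != complete.shape:
        raise ValueError("image pair must have identical shapes")
    if num_frames < 2:
        raise ValueError("need at least two frames for a transition")

    # Blend weights go from 0 (pure foreground) to 1 (pure complete image).
    alphas = np.linspace(0.0, 1.0, num_frames, dtype=np.float32)
    # Reshape to (T, 1, 1, ...) so the weights broadcast over every pixel.
    alphas = alphas.reshape(-1, *([1] * foreground.ndim))

    start = foreground.astype(np.float32)
    end = complete.astype(np.float32)
    return (1.0 - alphas) * start + alphas * end
```

The first frame equals the occluded input and the last frame equals the complete target, which matches the "occlusions gradually dissolve" framing of the completion task.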

Spatial Aligned Generation

After obtaining objects’ occlusion-free views, we aim to regenerate each object to achieve intact instances while preserving their original scale and poses. To accomplish this, we design a spatial alignment mechanism with shape prior injection that ensures refined objects maintain proper geometric alignment with the original scene context.
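
To make the goal of "preserving original scale and poses" concrete, here is a deliberately simple stand-in, not the shape prior injection mechanism itself: it rescales and translates a regenerated object's points so that its axis-aligned bounding box matches that of the original instance. The function name `fit_to_reference_bbox` and the choice of uniform scaling by the largest extent are assumptions made for illustration, and rotation is not handled.

```python
import numpy as np


def fit_to_reference_bbox(points: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Uniformly scale and translate `points` (N, 3) to match the bounding box of `reference` (M, 3)."""
    p_min, p_max = points.min(axis=0), points.max(axis=0)
    r_min, r_max = reference.min(axis=0), reference.max(axis=0)

    p_extent = (p_max - p_min).max()
    r_extent = (r_max - r_min).max()
    if p_extent == 0:
        raise ValueError("regenerated object has zero extent")

    # Uniform scale keeps the regenerated object's proportions intact.
    scale = r_extent / p_extent
    p_center = (p_min + p_max) / 2.0
    r_center = (r_min + r_max) / 2.0

    # Center at the origin, rescale, then move to the reference center.
    return (points - p_center) * scale + r_center
```

A bounding-box fit like this only constrains position and overall size; the method described above additionally conditions generation on a shape prior so that the regenerated geometry itself stays consistent with the scene.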


More Examples

We show more examples of interactive 3D scene generation and scene editing.

BibTeX

@article{dong2025hiscene,
  author    = {Dong, Wenqi and Yang, Bangbang and Yang, Zesong and Li, Yuan and Hu, Tao and Bao, Hujun and Ma, Yuewen and Cui, Zhaopeng},
  title     = {HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation},
  journal   = {arxiv},
  year      = {2025},
}