PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding

CVPR 2025

¹State Key Lab of CAD & CG, Zhejiang University, ²RayNeo
PanoGS Results.

We propose PanoGS, a novel and effective approach to 3D panoptic open-vocabulary scene understanding. Unlike previous methods, which only generate heatmaps between scene features and text queries, PanoGS achieves more accurate segmentation and produces 3D instance-level results for open-vocabulary text queries.

Abstract

Recently, 3D Gaussian Splatting (3DGS) has shown encouraging performance on open-vocabulary scene understanding tasks. However, previous methods usually predict only a heatmap between the scene features and the text query, and therefore cannot distinguish 3D instance-level information. In this paper, we propose PanoGS, a novel and effective approach to 3D panoptic open-vocabulary scene understanding. Technically, to learn accurate 3D language features that scale to large indoor scenes, we adopt a pyramid tri-plane to model the latent continuous parametric feature space and use a 3D feature decoder to regress the multi-view fused 2D feature cloud. In addition, we propose language-guided graph cuts that synergistically leverage the reconstructed geometry and the learned language cues to group 3D Gaussian primitives into a set of super-primitives. To obtain 3D-consistent instances, we perform graph-clustering-based segmentation, with SAM-guided edge affinities computed between different super-primitives. Extensive experiments on widely used datasets show that PanoGS achieves better or more competitive performance on 3D panoptic open-vocabulary scene understanding.
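To make the pyramid tri-plane idea more concrete, the following is a minimal sketch of how a latent code could be queried for a 3D point from axis-aligned feature planes at several resolutions. It is not the authors' implementation. The names `bilinear` and `query_pyramid_triplane`, the nested-list plane layout, the normalization of coordinates to [0, 1], and the aggregation rule (sum over the three planes of a level, concatenation across levels) are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical layout: plane[row][col] is a feature vector with C channels.
Plane = List[List[List[float]]]


def bilinear(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly interpolate a feature plane at normalized coords (u, v) in [0, 1]."""
    rows, cols = len(plane), len(plane[0])
    # Clamp, then map normalized coordinates to continuous grid indices.
    x = min(max(u, 0.0), 1.0) * (cols - 1)
    y = min(max(v, 0.0), 1.0) * (rows - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - tx) * (1 - ty) * plane[y0][x0][c]
        + tx * (1 - ty) * plane[y0][x1][c]
        + (1 - tx) * ty * plane[y1][x0][c]
        + tx * ty * plane[y1][x1][c]
        for c in range(channels)
    ]


def query_pyramid_triplane(
    levels: Sequence[Dict[str, Plane]], point: Sequence[float]
) -> List[float]:
    """Return a latent code for `point` = (x, y, z), each coordinate in [0, 1].

    Each pyramid level holds three planes under the keys "xy", "xz", "yz".
    Assumed aggregation: sum the three plane features within a level and
    concatenate the per-level results, coarse to fine.
    """
    x, y, z = point
    latent: List[float] = []
    for level in levels:
        f_xy = bilinear(level["xy"], x, y)
        f_xz = bilinear(level["xz"], x, z)
        f_yz = bilinear(level["yz"], y, z)
        latent.extend(a + b + c for a, b, c in zip(f_xy, f_xz, f_yz))
    return latent
```

In a full pipeline the returned latent code would be passed to a learned 3D feature decoder that regresses the language feature of each Gaussian primitive; the decoder is omitted here because the page does not describe its architecture.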


Method

PanoGS.

Overview of our approach. (a) Given posed RGB-D images, we reconstruct the scene with 3D Gaussian primitives, each of which is associated with an additional latent language code $g$ generated from a latent continuous pyramid tri-plane feature space. (b) After geometry reconstruction, we obtain fused 2D primitive-level features and confidences via back projection, which are used for efficient 3D language feature regression and for optimizing the latent pyramid tri-plane and the 3D decoder. (c) We run a language-guided graph cuts algorithm to construct super-primitives and use the 2D instance masks generated by SAM to conduct progressive graph clustering.
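Steps (b) and (c) can be illustrated with a small sketch under simplifying assumptions. It is not the method as published. `fuse_features` assumes the multi-view fusion is a confidence-weighted average of per-view features for one primitive. `progressive_clustering` assumes the edge affinities between super-primitives are already given (in PanoGS they are derived from SAM masks, which is not reproduced here), that the affinity between two clusters is the mean of the edge affinities linking them, and that "progressive" means merging over a sequence of decreasing thresholds. Both function names and all parameters are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def fuse_features(
    observations: Sequence[Tuple[Sequence[float], float]]
) -> Tuple[Optional[List[float]], float]:
    """Fuse per-view (feature, confidence) pairs observed for one primitive.

    Returns the confidence-weighted mean feature and the total confidence,
    or (None, 0.0) if the primitive was never observed with positive weight.
    """
    total = sum(conf for _, conf in observations)
    if total <= 0:
        return None, 0.0
    dim = len(observations[0][0])
    fused = [sum(f[c] * w for f, w in observations) / total for c in range(dim)]
    return fused, total


def progressive_clustering(
    num_nodes: int,
    affinities: Dict[Tuple[int, int], float],
    thresholds: Sequence[float],
) -> List[int]:
    """Merge super-primitives into instances over several rounds.

    `affinities` maps an undirected edge (a, b) to a score in [0, 1].
    In each round, clusters whose mean connecting affinity reaches the
    round's threshold are merged. Returns one cluster label per node.
    """
    labels = list(range(num_nodes))
    for threshold in thresholds:
        # Aggregate edge affinities between the current clusters.
        sums: Dict[Tuple[int, int], float] = {}
        counts: Dict[Tuple[int, int], int] = {}
        for (a, b), w in affinities.items():
            la, lb = labels[a], labels[b]
            if la == lb:
                continue
            key = (min(la, lb), max(la, lb))
            sums[key] = sums.get(key, 0.0) + w
            counts[key] = counts.get(key, 0) + 1

        # Union-find over the current cluster labels.
        parent = {label: label for label in set(labels)}

        def find(label: int) -> int:
            while parent[label] != label:
                parent[label] = parent[parent[label]]  # path halving
                label = parent[label]
            return label

        for key, total in sums.items():
            if total / counts[key] >= threshold:
                ra, rb = find(key[0]), find(key[1])
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)
        labels = [find(label) for label in labels]
    return labels
```

For example, with four super-primitives connected in a chain by affinities 0.9, 0.2, and 0.8 and thresholds `[0.85, 0.7]`, the first round merges the first pair and the second round merges the last pair, while the weak middle edge keeps the two instances apart.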

Experiments

3D Semantic Segmentation on the ScanNetV2 dataset


3D Panoptic Segmentation on the ScanNetV2 dataset


Video Results



BibTeX

@inproceedings{panogs,
      title={{PanoGS}: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding},
      author={Zhai, Hongjia and Li, Hai and Li, Zhenzhe and Pan, Xiaokun and He, Yijia and Zhang, Guofeng},
      booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year={2025},
    }