Representing Long Volumetric Video
with Temporal Gaussian Hierarchy

SIGGRAPH Asia 2024 (TOG)


Zhen Xu¹* Yinghao Xu²* Zhiyuan Yu³* Sida Peng¹ Jiaming Sun¹ Hujun Bao¹ Xiaowei Zhou¹†

¹Zhejiang University    ²Stanford University    ³HKUST
*Equal Contribution    †Corresponding Author

Real-time rendering of long volumetric videos (up to a few minutes, i.e., thousands of frames) on the SelfCap dataset.

Overview Video


Abstract


This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, like feature grids or point cloud sequences, to achieve high-quality rendering results. However, they are typically limited to short (1~2s) video clips and often suffer from large memory footprints when dealing with longer videos. To solve this issue, we propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. Our key observation is that dynamic scenes generally exhibit varying degrees of temporal redundancy, as they consist of areas changing at different speeds. Motivated by this observation, our representation organizes 4D Gaussians into a multi-level hierarchy of temporal segments, so that slowly changing regions are shared across long segments while fast motion is captured by short ones, and only a small, nearly constant subset of Gaussians is needed to render any given moment. Extensive experimental results demonstrate the superiority of our method over alternative methods in terms of training cost, rendering speed, and storage usage. To our knowledge, this work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality.

Method


Given a long multi-view video sequence, our method generates a compact volumetric video with low training cost and memory usage, while supporting real-time rendering at state-of-the-art quality.

  • (a) We propose a hierarchical structure where each level consists of multiple temporal segments. Each segment stores a set of 4D Gaussians [Yang et al. 2023b] to parameterize the scene. As shown at the bottom, the 4D Gaussians in different segments represent motions at different granularities, modeling video dynamics efficiently and effectively (see the first sketch after this list).
  • (b) The appearance model leverages gradient thresholding to obtain sparse Spherical Harmonics coefficients, yielding very compact storage while preserving view-dependent effects (see the second sketch after this list).
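As a concrete illustration of (a), the snippet below is a minimal Python sketch of the hierarchy's structure, not the paper's implementation: level l splits the video into 2^l equal temporal segments, each owning its own 4D Gaussians, and rendering a moment t touches exactly one segment per level, so the working set stays roughly constant regardless of video length. The names (Segment, build_hierarchy, gaussians_at) are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float    # segment start time (seconds)
    end: float      # segment end time (seconds)
    gaussians: list = field(default_factory=list)  # 4D Gaussians owned by this segment

def build_hierarchy(duration: float, num_levels: int) -> list[list[Segment]]:
    """Level l covers the video with 2**l equal-length temporal segments."""
    levels = []
    for l in range(num_levels):
        seg_len = duration / (2 ** l)
        levels.append([Segment(i * seg_len, (i + 1) * seg_len)
                       for i in range(2 ** l)])
    return levels

def gaussians_at(levels: list[list[Segment]], t: float) -> list:
    """Collect the Gaussians needed to render time t: exactly one segment per
    level contains t, so the working set does not grow with video length."""
    active = []
    for level in levels:
        seg_len = level[0].end - level[0].start
        i = min(int(t / seg_len), len(level) - 1)  # index of the segment covering t
        active += level[i].gaussians
    return active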

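For (b), the following hedged sketch shows one way gradient thresholding could yield sparse Spherical Harmonics: higher-order SH coefficients whose accumulated training gradients stay below a threshold are dropped, and only the surviving (index, value) pairs are stored. The tensor shapes, names, and threshold value are assumptions for illustration, not the paper's exact procedure.

import torch

def sparsify_sh(sh_coeffs: torch.Tensor, grad_accum: torch.Tensor,
                threshold: float = 1e-4):
    """sh_coeffs:  (N, K, 3) per-Gaussian SH coefficients (K bands x RGB).
    grad_accum: (N, K) accumulated gradient magnitudes gathered during training."""
    keep = grad_accum > threshold  # coefficients whose gradients mark them as useful
    keep[:, 0] = True              # always keep the degree-0 (diffuse) term
    indices = keep.nonzero()       # (M, 2) sparse coordinates of kept coefficients
    values = sh_coeffs[keep]       # (M, 3) their RGB values
    return indices, values         # compact storage; dropped coefficients are treated as zero

At render time, a dense coefficient buffer can be recovered by scattering the stored values back into a zero tensor, so the rasterizer itself needs no changes.
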
Real-Time Demos


More Real-Time Demos on the DNA-Rendering Dataset
More Real-Time Demos on the Sports Dataset
More Real-Time Demos on the MobileStage Dataset
More Real-Time Demos on the CMU-Panoptic Dataset
More Real-Time Demos on the Neural3DV Dataset
More Real-Time Demos on the ENeRF-Outdoor Dataset

Real-Time VR Demos


Real-Time VR Demos on Apple Vision Pro and Meta Quest 3

Baseline Comparisons


Comparisons with 4K4D, K-Planes, and ENeRF


Citation


@Article{xu2024longvolcap,
  author  = {Xu, Zhen and Xu, Yinghao and Yu, Zhiyuan and Peng, Sida and Sun, Jiaming and Bao, Hujun and Zhou, Xiaowei},
  title   = {Representing Long Volumetric Video with Temporal Gaussian Hierarchy},
  journal = {ACM Transactions on Graphics},
  number  = {6},
  volume  = {43},
  month   = {November},
  year    = {2024},
  url     = {https://zju3dv.github.io/longvolcap}
}

Business Inquiries


For business inquiries and other collaboration opportunities, please fill out this form and we will get back to you soon.