Representing Long Volumetric Video
with Temporal Gaussian Hierarchy

SIGGRAPH Asia 2024 (TOG)


Zhen Xu¹* Yinghao Xu²* Zhiyuan Yu³* Sida Peng¹ Jiaming Sun¹ Hujun Bao¹ Xiaowei Zhou¹†

¹Zhejiang University    ²Stanford University    ³HKUST
*Equal Contribution    †Corresponding Author

Real-time rendering of long volumetric videos (up to a few minutes, i.e., thousands of frames) on the SelfCap dataset.

Overview Video


Abstract


This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, like feature grids or point cloud sequences, to achieve high-quality rendering results. However, they are typically limited to short (1~2s) video clips and often suffer from large memory footprints when dealing with longer videos. To solve this issue, we propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. Our key observation is that dynamic scenes generally exhibit varying degrees of temporal redundancy, as they consist of areas changing at different speeds. Motivated by this observation, our representation organizes 4D Gaussians into a multi-level hierarchy of temporal segments, so that slowly changing regions are shared across long segments while fast motion is captured by short ones, and only a small, nearly constant subset of Gaussians is needed to render any given moment. Extensive experimental results demonstrate the superiority of our method over alternative methods in terms of training cost, rendering speed, and storage usage. To our knowledge, this work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality.

Method


Given a long multi-view video sequence, our method generates a compact volumetric video with low training cost and memory usage, while supporting real-time rendering at state-of-the-art quality.

  • (a) We propose a hierarchical structure where each level consists of multiple temporal segments. Each segment stores a set of 4D Gaussians [Yang et al. 2023b] to parameterize the scene. As shown at the bottom, the 4D Gaussians in different segments represent motions at different granularities, modeling video dynamics efficiently and effectively (see the first sketch after this list).
  • (b) The appearance model leverages gradient thresholding to obtain sparse Spherical Harmonics coefficients, yielding very compact storage while preserving view-dependent effects (see the second sketch after this list).
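As a concrete illustration of (a), the snippet below is a minimal Python sketch of the hierarchy's structure, not the paper's implementation: level l splits the video into 2^l equal temporal segments, each owning its own 4D Gaussians, and rendering a moment t touches exactly one segment per level, so the working set stays roughly constant regardless of video length. The names (Segment, build_hierarchy, gaussians_at) are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float    # segment start time (seconds)
    end: float      # segment end time (seconds)
    gaussians: list = field(default_factory=list)  # 4D Gaussians owned by this segment

def build_hierarchy(duration: float, num_levels: int) -> list[list[Segment]]:
    """Level l covers the video with 2**l equal-length temporal segments."""
    levels = []
    for l in range(num_levels):
        seg_len = duration / (2 ** l)
        levels.append([Segment(i * seg_len, (i + 1) * seg_len)
                       for i in range(2 ** l)])
    return levels

def gaussians_at(levels: list[list[Segment]], t: float) -> list:
    """Collect the Gaussians needed to render time t: exactly one segment per
    level contains t, so the working set does not grow with video length."""
    active = []
    for level in levels:
        seg_len = level[0].end - level[0].start
        i = min(int(t / seg_len), len(level) - 1)  # index of the segment covering t
        active += level[i].gaussians
    return active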

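For (b), the following hedged sketch shows one way gradient thresholding could yield sparse Spherical Harmonics: higher-order SH coefficients whose accumulated training gradients stay below a threshold are dropped, and only the surviving (index, value) pairs are stored. The tensor shapes, names, and threshold value are assumptions for illustration, not the paper's exact procedure.

import torch

def sparsify_sh(sh_coeffs: torch.Tensor, grad_accum: torch.Tensor,
                threshold: float = 1e-4):
    """sh_coeffs:  (N, K, 3) per-Gaussian SH coefficients (K bands x RGB).
    grad_accum: (N, K) accumulated gradient magnitudes gathered during training."""
    keep = grad_accum > threshold  # coefficients whose gradients mark them as useful
    keep[:, 0] = True              # always keep the degree-0 (diffuse) term
    indices = keep.nonzero()       # (M, 2) sparse coordinates of kept coefficients
    values = sh_coeffs[keep]       # (M, 3) their RGB values
    return indices, values         # compact storage; dropped coefficients are treated as zero

At render time, a dense coefficient buffer can be recovered by scattering the stored values back into a zero tensor, so the rasterizer itself needs no changes.
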
Real-Time Demos


More Real-Time Demos on the DNA-Rendering Dataset
More Real-Time Demos on the Sports Dataset
More Real-Time Demos on the MobileStage Dataset
More Real-Time Demos on the CMU-Panoptic Dataset
More Real-Time Demos on the Neural3DV Dataset
More Real-Time Demos on the ENeRF-Outdoor Dataset

Real-Time VR Demos


Real-Time VR Demos on Apple Vision Pro and Meta Quest 3

Baseline Comparisons


Comparisons with 4K4D, K-Planes, and ENeRF


Citation


@Article{xu2024longvolcap,
  author  = {Xu, Zhen and Xu, Yinghao and Yu, Zhiyuan and Peng, Sida and Sun, Jiaming and Bao, Hujun and Zhou, Xiaowei},
  title   = {Representing Long Volumetric Video with Temporal Gaussian Hierarchy},
  journal = {ACM Transactions on Graphics},
  number  = {6},
  volume  = {43},
  month   = {November},
  year    = {2024},
  url     = {https://zju3dv.github.io/longvolcap}
}

Business Inquiries


For business inquiries and other collaboration opportunities, please fill out this form and we will get back to you soon.