¹Zhejiang University  ²The University of Hong Kong  ³Deep Glint  ⁴Shenzhen University
(1) This paper focuses on recovering world-grounded global human motion from monocular videos. 💃
(2) The key idea is predicting human pose in a novel Gravity-View coordinate system, which can be naturally and uniquely inferred from each image using the gravity direction and the camera-view direction.
Therefore, it helps alleviate error accumulation in estimating global motion for long videos. 💪
(3) The model is trained on AMASS, BEDLAM, H36M, and 3DPW. All code and weights are publicly available. 🎉
The proposed network, excluding preprocessing (2D tracking, feature extraction, relative camera rotation estimation), takes 280 ms to process a 1430-frame (~45 seconds) video on an RTX 4090 GPU.
We present a novel method for recovering world-grounded human motion from monocular video.
The main challenge of this problem stems from the ambiguity of defining the world coordinate system, which varies from sequence to sequence. Previous approaches attempt to alleviate this issue by predicting relative motion between frames in an autoregressive manner, but they are prone to accumulating errors.
Instead, we propose to solve this challenge by estimating human poses in a novel Gravity-View (GV)
coordinate system, which is defined by the world gravity and the camera view direction.
The proposed GV system is naturally gravity-aligned and uniquely defined for each video frame, which significantly reduces the ambiguity of learning the image-to-pose mapping.
The estimated poses in the GV frame can be transformed back to a world coordinate system given camera
motions to form a global motion sequence.
Moreover, per-frame estimation avoids the error accumulation inherent in autoregressive methods.
Experimental results on in-the-wild benchmarks demonstrate that our method recovers more realistic motion in both camera-space and world-grounded settings, outperforming state-of-the-art methods in both accuracy and speed.
GV Coordinates: (1) naturally account for gravity; (2) are uniquely defined for each image; (3) avoid error accumulation along the gravity direction across consecutive frames.
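To make the definition concrete, the following is a minimal sketch (not the authors' released code) of how a Gravity-View frame could be constructed for a single image from the gravity direction and the camera optical axis; the axis ordering and handedness chosen here are illustrative assumptions.

import numpy as np

def gravity_view_frame(gravity, view_dir):
    """Build a rotation matrix whose columns are the GV axes.

    gravity:  (3,) vector pointing along gravity (i.e., "down").
    view_dir: (3,) camera viewing direction (optical axis).
    Returns a 3x3 rotation whose columns are the horizontal view direction,
    a horizontal axis completing the basis, and the up direction.
    """
    up = -gravity / np.linalg.norm(gravity)
    # Project the view direction onto the horizontal plane (Gram-Schmidt step).
    # Note: this degenerates if the camera looks straight along gravity.
    forward = view_dir - np.dot(view_dir, up) * up
    forward = forward / np.linalg.norm(forward)
    right = np.cross(up, forward)  # completes a right-handed basis
    return np.stack([forward, right, up], axis=1)

# Example: gravity along -y, camera pitched 30 degrees downward.
gravity = np.array([0.0, -1.0, 0.0])
view = np.array([0.0, -np.sin(np.radians(30)), np.cos(np.radians(30))])
R_gv = gravity_view_frame(gravity, view)

Because the frame depends only on gravity and the current view direction, no information from earlier frames is required, which is why errors cannot accumulate along the gravity direction.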
Pipeline. Given a monocular video (left), GVHMR preprocesses the video by tracking the human bounding box, detecting 2D keypoints, extracting image features, and estimating camera relative rotation using visual odometry (VO) or a gyroscope. GVHMR then fuses these features into per-frame tokens, which are processed with a relative transformer and multitask MLPs. The outputs include: (1) intermediate representations (middle), i.e. human orientation in the Gravity-View coordinate system, root velocity in the SMPL coordinate system, and the stationary probability for predefined joints; and (2) camera frame SMPL parameters (right-top). Finally, the global trajectory (right-bottom) is recovered by transforming the intermediate representations to the world coordinate system, as described in Sec. 3.1.
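As a rough illustration of the final step, the sketch below composes per-frame outputs into a global trajectory, assuming per-frame GV-to-world rotations derived from gravity and the relative camera rotations are available; variable names, shapes, and the simple velocity integration are assumptions for illustration, not the repository's API.

import numpy as np

def recover_global_trajectory(R_gv, R_world_from_gv, v_root, fps=30.0):
    """Compose per-frame predictions into a world-grounded trajectory.

    R_gv:            (T, 3, 3) human orientation in each frame's GV coordinates.
    R_world_from_gv: (T, 3, 3) rotation from each frame's GV coordinates to a
                     shared world frame (from gravity + relative camera rotation).
    v_root:          (T, 3) root velocity in the SMPL/body frame (m/s assumed).
    """
    T = R_gv.shape[0]
    # Orientation is obtained per frame, with no accumulation across frames.
    R_world = np.einsum('tij,tjk->tik', R_world_from_gv, R_gv)
    # Translation integrates the body-frame velocity rotated into the world.
    # (The predicted stationary probabilities can additionally be used to
    #  reduce foot skating; omitted here.)
    t_world = np.zeros((T, 3))
    for i in range(1, T):
        t_world[i] = t_world[i - 1] + (R_world[i - 1] @ v_root[i - 1]) / fps
    return R_world, t_world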
Training: GVHMR is trained on a mixed dataset consisting of AMASS, BEDLAM, H36M, and 3DPW. The model is trained from scratch and converges after 420 epochs with a batch size of 256. Training takes 13 hours on 2 RTX 4090 GPUs.
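For orientation only, a hypothetical configuration summarizing the setup above might look as follows; the keys and structure are assumptions rather than the repository's actual config schema.

train_config = {
    "datasets": ["AMASS", "BEDLAM", "H36M", "3DPW"],
    "epochs": 420,
    "batch_size": 256,
    "pretrained": None,  # trained from scratch
    "num_gpus": 2,       # RTX 4090
}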
Evaluation is reported with both world-grounded metrics and camera-space metrics.
@inproceedings{shen2024gvhmr,
title={World-Grounded Human Motion Recovery via Gravity-View Coordinates},
author={Shen, Zehong and Pi, Huaijin and Xia, Yan and Cen, Zhi and Peng, Sida and Hu, Zechen and Bao, Hujun and Hu, Ruizhen and Zhou, Xiaowei},
booktitle={SIGGRAPH Asia Conference Proceedings},
year={2024}
}