World-Grounded Human Motion Recovery via
Gravity-View Coordinates

SIGGRAPH Asia 2024


Zehong Shen1*   Huaijin Pi1,2*   Yan Xia1   Zhi Cen1   Sida Peng1†   Zechen Hu3   Hujun Bao1   Ruizhen Hu4   Xiaowei Zhou1

1Zhejiang University   2The University of Hong Kong   3Deep Glint   4Shenzhen University

TL;DR


(1) This paper focuses on recovering world-grounded global human motion from monocular videos. 💃
(2) The key idea is to predict human pose in novel Gravity-View (GV) coordinates, which can be naturally and uniquely inferred from each image using the gravity direction and the camera-view direction. This helps alleviate error accumulation when estimating global motion over long videos. 💪
(3) The model is trained on AMASS, BEDLAM, H36M, and 3DPW. All code and weights are publicly available. 🎉

Abstract


The proposed network, excluding preprocessing (2D tracking, feature extraction, relative camera rotation estimation), takes 280 ms to process a 1430-frame (~45 seconds) video on an RTX 4090 GPU.

We present a novel method for recovering world-grounded human motion from monocular video. The main challenge of this problem stems from the ambiguity of defining the world coordinate system, which varies from sequence to sequence. Previous approaches attempt to alleviate this issue by predicting relative motion between frames in an autoregressive manner, but they are prone to accumulated errors. Instead, we propose to solve this challenge by estimating human poses in a novel Gravity-View (GV) coordinate system, which is defined by the world gravity and the camera view direction. The proposed GV system is naturally gravity-aligned and uniquely defined for each video frame, largely reducing the ambiguity of learning the image-pose mapping. Given camera motions, the estimated poses in the GV frame can be transformed back to the world coordinate system to form a global motion sequence. Moreover, the per-frame estimation avoids the error accumulation of autoregressive methods. Experimental results on in-the-wild benchmarks demonstrate that our method recovers more realistic motion in both the camera space and world-grounded settings, outperforming state-of-the-art methods in both accuracy and speed.
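In symbols (notation ours, a schematic restatement of the idea above, not taken from the paper): writing $R^{gv}_t$ for the human orientation predicted in the GV frame of frame $t$, and $R^{w\leftarrow gv}_t$ for the rotation from that GV frame to the world frame, fixed by the gravity direction and the camera view direction at frame $t$, the world-frame orientation is

$$R^{\mathrm{world}}_t = R^{w\leftarrow gv}_t \, R^{gv}_t,$$

computed independently for each frame, so no error accumulates along the gravity axis.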

Method


GV Coordinates: (1) Naturally account for gravity. (2) Uniquely defined for each image. (3) Avoid error accumulation along the gravity direction across consecutive images.
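Below is a minimal sketch of how such a gravity-and-view-defined frame can be constructed from a gravity direction and a camera optical-axis direction via Gram-Schmidt. The axis conventions and function name are illustrative assumptions, not taken from the released GVHMR code.

import numpy as np

def gravity_view_frame(gravity_dir, view_dir):
    """Build one plausible Gravity-View (GV) frame from the world gravity
    direction and the camera view (optical-axis) direction, both 3-vectors
    in the same reference frame. Assumes the camera is not looking straight
    along gravity. Returns a 3x3 rotation whose columns are the GV axes."""
    up = -gravity_dir / np.linalg.norm(gravity_dir)   # "up" opposes gravity
    # Project the view direction onto the horizontal plane (Gram-Schmidt),
    # so the frame is uniquely fixed by gravity + camera view.
    fwd = view_dir - np.dot(view_dir, up) * up
    fwd = fwd / np.linalg.norm(fwd)
    right = np.cross(fwd, up)                          # completes a right-handed frame
    return np.stack([right, fwd, up], axis=1)          # columns: right, forward, up

# e.g., with a z-up world and gravity pointing down:
# R_gv = gravity_view_frame(np.array([0., 0., -1.]), np.array([1., 0., -0.2]))

Because the frame depends only on per-frame quantities (gravity and the current view direction), it is uniquely defined for every image, which is what removes the sequence-level ambiguity discussed above.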

Pipeline. Given a monocular video (left), GVHMR preprocesses it by tracking the human bounding box, detecting 2D keypoints, extracting image features, and estimating relative camera rotations with visual odometry (VO) or a gyroscope. GVHMR then fuses these features into per-frame tokens, which are processed by a relative transformer and multi-task MLPs. The outputs include: (1) intermediate representations (middle), i.e., the human orientation in the Gravity-View coordinate system, the root velocity in the SMPL coordinate system, and the stationary probability for predefined joints; and (2) camera-frame SMPL parameters (right-top). Finally, the global trajectory (right-bottom) is recovered by transforming the intermediate representations into the world coordinate system, as described in Sec. 3.1.
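The following is a minimal sketch of that final trajectory-recovery step, assuming the per-frame intermediate predictions are already available; the function name, argument conventions, and the assumption of a velocity in meters per second are ours and do not mirror the released implementation.

import numpy as np

def recover_global_trajectory(R_gv, v_root_local, R_w_gv, fps=30):
    """Illustrative recovery of a world trajectory from per-frame predictions.

    R_gv         : (T, 3, 3) human root orientation in each frame's GV coordinates
    v_root_local : (T, 3)    root velocity expressed in the SMPL (root-local) frame
    R_w_gv       : (T, 3, 3) rotation taking each frame's GV frame to a shared world
                             frame (from the gravity direction and camera rotations,
                             e.g. obtained via VO or a gyroscope)
    Returns world-frame orientations (T, 3, 3) and root positions (T, 3)."""
    T = R_gv.shape[0]
    R_world = R_w_gv @ R_gv                 # per-frame orientation in the world frame
    t_world = np.zeros((T, 3))
    for i in range(1, T):
        # rotate the local root velocity into the world and integrate over one frame
        step = R_world[i - 1] @ v_root_local[i - 1] / fps
        t_world[i] = t_world[i - 1] + step
    return R_world, t_world

The stationary-joint probabilities mentioned above would then be used to reduce foot skating in a post-processing step, which this sketch omits.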

Training & Evaluation Metrics


Training: GVHMR is trained on a mixed dataset consisting of AMASS, BEDLAM, H36M, and 3DPW. The model is trained from scratch and converges after 420 epochs with a batch size of 256. Training takes 13 hours on 2 RTX 4090 GPUs.

World-grounded Metrics

Camera-space Metrics

More results


Applications




Citation


@inproceedings{shen2024gvhmr,
  title={World-Grounded Human Motion Recovery via Gravity-View Coordinates},
  author={Shen, Zehong and Pi, Huaijin and Xia, Yan and Cen, Zhi and Peng, Sida and Hu, Zechen and Bao, Hujun and Hu, Ruizhen and Zhou, Xiaowei},
  booktitle={SIGGRAPH Asia Conference Proceedings},
  year={2024}
}