¹Zhejiang University  ²The University of Hong Kong  ³Deep Glint  ⁴Shenzhen University
(1) This paper focuses on recovering world-grounded global human motion from monocular videos. 💃
(2) The key idea is predicting human pose in a novel Gravity-View coordinate system, which can be naturally and uniquely inferred from each image using the gravity direction and the camera-view direction.
Therefore, it helps alleviate error accumulation in estimating global motion for long videos. 💪
(3) The model is trained on AMASS, BEDLAM, H36M, and 3DPW. All code and weights are publicly available. 🎉
The proposed network, excluding preprocessing (2D tracking, feature extraction, relative camera rotation estimation), takes 280 ms to process a 1430-frame (~45 seconds) video on an RTX 4090 GPU.
We present a novel method for recovering world-grounded human motion from monocular video.
The main challenge of this problem stems from the ambiguity of defining the world coordinate system, which varies from sequence to sequence. Previous approaches attempt to alleviate this issue by predicting relative motion between frames in an autoregressive manner, but they are prone to accumulating errors.
Instead, we propose to solve this challenge by estimating human poses in a novel Gravity-View (GV)
coordinate system, which is defined by the world gravity and the camera view direction.
The proposed GV system is naturally gravity-aligned and uniquely defined for each video frame, which significantly reduces the ambiguity of learning the image-to-pose mapping.
The estimated poses in the GV frame can be transformed back to a world coordinate system given camera
motions to form a global motion sequence.
Moreover, per-frame estimation avoids the error accumulation inherent in autoregressive methods.
Experimental results on in-the-wild benchmarks demonstrate that our method recovers more realistic motion in both camera-space and world-grounded settings, outperforming state-of-the-art methods in both accuracy and speed.
GV Coordinates: (1) naturally account for gravity; (2) are uniquely defined for each image; (3) avoid error accumulation along the gravity direction across consecutive frames.
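To make the definition concrete, the following is a minimal sketch (not the authors' released code) of how a Gravity-View frame could be constructed for a single image from the gravity direction and the camera optical axis; the axis ordering and handedness chosen here are illustrative assumptions.

import numpy as np

def gravity_view_frame(gravity, view_dir):
    """Build a rotation matrix whose columns are the GV axes.

    gravity:  (3,) vector pointing along gravity (i.e., "down").
    view_dir: (3,) camera viewing direction (optical axis).
    Returns a 3x3 rotation whose columns are the horizontal view direction,
    a horizontal axis completing the basis, and the up direction.
    """
    up = -gravity / np.linalg.norm(gravity)
    # Project the view direction onto the horizontal plane (Gram-Schmidt step).
    # Note: this degenerates if the camera looks straight along gravity.
    forward = view_dir - np.dot(view_dir, up) * up
    forward = forward / np.linalg.norm(forward)
    right = np.cross(up, forward)  # completes a right-handed basis
    return np.stack([forward, right, up], axis=1)

# Example: gravity along -y, camera pitched 30 degrees downward.
gravity = np.array([0.0, -1.0, 0.0])
view = np.array([0.0, -np.sin(np.radians(30)), np.cos(np.radians(30))])
R_gv = gravity_view_frame(gravity, view)

Because the frame depends only on gravity and the current view direction, no information from earlier frames is required, which is why errors cannot accumulate along the gravity direction.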
Pipeline. Given a monocular video (left), GVHMR preprocesses the video by tracking the human bounding box, detecting 2D keypoints, extracting image features, and estimating camera relative rotation using visual odometry (VO) or a gyroscope. GVHMR then fuses these features into per-frame tokens, which are processed with a relative transformer and multitask MLPs. The outputs include: (1) intermediate representations (middle), i.e. human orientation in the Gravity-View coordinate system, root velocity in the SMPL coordinate system, and the stationary probability for predefined joints; and (2) camera frame SMPL parameters (right-top). Finally, the global trajectory (right-bottom) is recovered by transforming the intermediate representations to the world coordinate system, as described in Sec. 3.1.
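As a rough illustration of the final step, the sketch below composes per-frame outputs into a global trajectory, assuming per-frame GV-to-world rotations derived from gravity and the relative camera rotations are available; variable names, shapes, and the simple velocity integration are assumptions for illustration, not the repository's API.

import numpy as np

def recover_global_trajectory(R_gv, R_world_from_gv, v_root, fps=30.0):
    """Compose per-frame predictions into a world-grounded trajectory.

    R_gv:            (T, 3, 3) human orientation in each frame's GV coordinates.
    R_world_from_gv: (T, 3, 3) rotation from each frame's GV coordinates to a
                     shared world frame (from gravity + relative camera rotation).
    v_root:          (T, 3) root velocity in the SMPL/body frame (m/s assumed).
    """
    T = R_gv.shape[0]
    # Orientation is obtained per frame, with no accumulation across frames.
    R_world = np.einsum('tij,tjk->tik', R_world_from_gv, R_gv)
    # Translation integrates the body-frame velocity rotated into the world.
    # (The predicted stationary probabilities can additionally be used to
    #  reduce foot skating; omitted here.)
    t_world = np.zeros((T, 3))
    for i in range(1, T):
        t_world[i] = t_world[i - 1] + (R_world[i - 1] @ v_root[i - 1]) / fps
    return R_world, t_world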
Training: GVHMR is trained on a mixed dataset consisting of AMASS, BEDLAM, H36M, and 3DPW. The model is trained from scratch and converges after 420 epochs with a batch size of 256. Training takes 13 hours on 2 RTX 4090 GPUs.
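For orientation only, a hypothetical configuration summarizing the setup above might look as follows; the keys and structure are assumptions rather than the repository's actual config schema.

train_config = {
    "datasets": ["AMASS", "BEDLAM", "H36M", "3DPW"],
    "epochs": 420,
    "batch_size": 256,
    "pretrained": None,  # trained from scratch
    "num_gpus": 2,       # RTX 4090
}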
Evaluation is reported with both world-grounded metrics and camera-space metrics.
@inproceedings{shen2024gvhmr,
title={World-Grounded Human Motion Recovery via Gravity-View Coordinates},
author={Shen, Zehong and Pi, Huaijin and Xia, Yan and Cen, Zhi and Peng, Sida and Hu, Zechen and Bao, Hujun and Hu, Ruizhen and Zhou, Xiaowei},
booktitle={SIGGRAPH Asia Conference Proceedings},
year={2024}
}