Motion-2-to-3: Leveraging 2D Motion Data
to Boost 3D Motion Generation


Huaijin Pi1,2*   Ruoxi Guo1,3*   Zehong Shen1   Qing Shuai1   Zechen Hu3   Zhumei Wang3   Yajiao Dong3   Ruizhen Hu4   Taku Komura2   Sida Peng1   Xiaowei Zhou1

1Zhejiang University   2The University of Hong Kong   3Deep Glint   4Shenzhen University

TL;DR


(1) This paper focuses on text-driven 3D human motion generation. 💃
(2) The motivation is that 3D motion capture data is expensive to collect, while 2D human videos offer a vast and accessible source of 2D motion data. 💪
(3) Our key idea is to use 2D human motion extracted from videos to improve 3D human motion generation. 🎉

Demo video


Abstract


(a) Our approach leverages 2D motion data to improve 3D motion generation by unifying 2D and 3D motion data. (b) Our framework yields better FID and generates a broader range of motion types.

Text-driven human motion synthesis has attracted significant attention for its ability to generate intricate movements from abstract text cues, with the potential to revolutionize motion design in film, virtual reality, and computer games. Existing methods often rely on 3D motion capture data, which requires special setups and costly acquisition, ultimately limiting the diversity and scope of the motions they can produce. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore leveraging 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-motion pairs. We then fine-tune this generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Experiments on the HumanML3D dataset and on novel text prompts demonstrate that our method efficiently utilizes 2D data, enabling realistic 3D human motion generation and broadening the range of supported motion types. Our code will be made publicly available.
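As a rough illustration of the local/global disentanglement mentioned above, the sketch below splits a 3D motion sequence into root-relative ("local") joint positions and per-frame root velocity. The array shapes, the choice of the pelvis as the root joint, and the frame rate are assumptions for illustration, not the authors' exact representation.

import numpy as np

def disentangle(joints, root_idx=0, fps=20):
    """Split global 3D joints (T, J, 3) into local motion and root velocity.

    joints: global joint positions over T frames.
    root_idx: index of the root joint (pelvis assumed here).
    Returns root-relative joint positions (T, J, 3) and root velocity (T, 3).
    """
    root = joints[:, root_idx]                                # global root path
    local = joints - root[:, None, :]                         # root-relative motion
    root_vel = np.diff(root, axis=0, prepend=root[:1]) * fps  # per-frame velocity
    return local, root_vel

Under this assumed representation, the local part can be projected to 2D and learned from video-derived data, while the root velocity captures the global movement.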

Method


Pipeline. We design a Multi-view Diffusion model (a) to generate view-consistent 2D motion. During inference, the Multi-view Diffusion model predicts 2D local motion and root velocity (b). We then use triangulation to recover 3D local joint positions (c) and accumulate the root velocity to obtain the 3D global trajectory (d), yielding the final 3D motion (e).
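A minimal sketch of the inference-time reconstruction in steps (b)-(e) is given below, assuming the camera projection matrices of the generated views are known. The function names, array shapes, and the DLT-based triangulation are illustrative choices, not the released implementation.

import numpy as np

def triangulate_joint(points_2d, proj_mats):
    """DLT triangulation of one joint from V views.

    points_2d: (V, 2) pixel coordinates of the joint in each view.
    proj_mats: (V, 3, 4) camera projection matrices.
    Returns the 3D point (3,) in world coordinates.
    """
    A = []
    for (x, y), P in zip(points_2d, proj_mats):
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    A = np.stack(A)                       # (2V, 4) homogeneous linear system
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def reconstruct_motion(local_2d, root_vel, proj_mats, dt=1.0 / 20):
    """Recover 3D motion from multi-view 2D local motion and root velocity.

    local_2d: (T, V, J, 2) predicted 2D local joints per frame and view.
    root_vel: (T, 3) predicted per-frame root velocity.
    proj_mats: (V, 3, 4) projection matrices of the V virtual views.
    """
    T, V, J, _ = local_2d.shape
    # (c) triangulate view-consistent 2D joints into 3D local joints
    local_3d = np.zeros((T, J, 3))
    for t in range(T):
        for j in range(J):
            local_3d[t, j] = triangulate_joint(local_2d[t, :, j], proj_mats)
    # (d) integrate root velocity into a global root trajectory
    root_traj = np.cumsum(root_vel * dt, axis=0)
    # (e) global motion = local joints translated by the root trajectory
    return local_3d + root_traj[:, None, :]

Because each joint is triangulated from view-consistent 2D predictions, the local 3D pose and the integrated root trajectory can be recovered independently and composed at the end.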

Comparison video




More Results




Citation


@article{pi2024motion2to3,
  title={Motion-2-to-3: Leveraging 2D Motion Data to Boost 3D Motion Generation},
  author={Pi, Huaijin and Guo, Ruoxi and Shen, Zehong and Shuai, Qing and Hu, Zechen and Wang, Zhumei and Dong, Yajiao and Hu, Ruizhen and Komura, Taku and Peng, Sida and Zhou, Xiaowei},
  journal={arXiv preprint arXiv:2412.},
  year={2024}
}