This paper addresses the challenge of text-conditioned streaming motion generation, which requires predicting the next-step human pose from variable-length historical motions and incoming texts. Existing methods struggle with streaming motion generation: diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed responses and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate the information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while enabling more applications, including multi-round generation, long-term generation, and dynamic motion composition. The code will be released to ensure reproducibility.
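To make the streaming pipeline concrete, below is a minimal sketch of the generation loop described above: the autoregressive model consumes the text condition plus all previously generated continuous latents, a diffusion head samples the next latent, and a causal decoder emits the next pose online. All module names (`text_encoder`, `ar_model`, `diffusion_head`, `motion_decoder`) are hypothetical placeholders for illustration, not the released API.

```python
import torch

@torch.no_grad()
def stream_motion(text, text_encoder, ar_model, diffusion_head,
                  motion_decoder, max_steps=300):
    """Sketch of streaming generation: one continuous latent per step,
    decoded online, conditioned on the text and all past latents."""
    cond = text_encoder(text)                  # (1, T_text, D), assumed shape
    history = []                               # continuous latents so far
    for _ in range(max_steps):
        # Causal context: text tokens followed by all generated latents.
        ctx = torch.cat([cond] + history, dim=1) if history else cond
        h = ar_model(ctx)[:, -1:]              # hidden state of the last step
        z = diffusion_head.sample(h)           # sample next continuous latent
        history.append(z)
        # Online decoding: the causal decoder needs only past latents,
        # so each new pose can be emitted immediately.
        pose = motion_decoder(torch.cat(history, dim=1))[:, -1]
        yield pose
```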
Motion reconstruction in causal latent space
Streaming motion generation framework
Overall visualization: Dynamic Motion Composition of three additional motions after the initial motion "walk forward".
a man jumps on one leg.
"a man walks forward with arms swinging."
"then he jumps up."
"he turns around."
"he faces another side."
"A man is walking forward , favoring his left leg and shifting his walk. He is possibly drunk."
"He turns around."
The causal motion encoder explicitly models the temporal causal structure in the latent space, enabling streaming generation of unseen motions (see the sketch below).
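A minimal sketch of such a temporally causal encoder, built from left-padded 1D convolutions so that latent z_t depends only on poses up to time t. The layer widths, latent size, and dilation pattern are illustrative assumptions, not the paper's exact architecture; only the 272-dim pose input matches the representation noted below.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """Conv1d padded only on the left, preserving strict temporal causality."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = dilation * (kernel_size - 1)

    def forward(self, x):                      # x: (B, C, T)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return super().forward(x)

class CausalMotionEncoder(nn.Module):
    """Maps a pose sequence to causal continuous latents (sizes assumed)."""
    def __init__(self, pose_dim=272, latent_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(pose_dim, hidden, 3), nn.SiLU(),
            CausalConv1d(hidden, hidden, 3, dilation=2), nn.SiLU(),
            CausalConv1d(hidden, latent_dim, 3, dilation=4),
        )

    def forward(self, poses):                  # poses: (B, T, pose_dim)
        return self.net(poses.transpose(1, 2)).transpose(1, 2)  # (B, T, latent)
```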
"A man is walking forward , favoring his left leg and shifting his walk. He is possibly drunk."
"He turns around."
A man walks in a circular path, with his torso slightly leaned back.
A person jumps repeatedly, throwing their arms above their head and stretching their legs with each jump.
We refine the motion representation to enable direct conversion from the predicted joint rotations to SMPL body parameters.
(Processing scripts for the modified 272-dim motion representation are available here.)
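A hedged sketch of the rotation-to-SMPL conversion this note refers to, assuming per-joint rotations are stored in the continuous 6D form and that 22 body joints are predicted; the exact layout of the 272-dim vector is an assumption here, so consult the released processing scripts for the actual format.

```python
import torch

def rot6d_to_matrix(d6):
    """6D rotation representation -> rotation matrix via Gram-Schmidt.
    d6: (..., 6) -> (..., 3, 3)."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    b2 = torch.nn.functional.normalize(
        a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)

def matrix_to_axis_angle(R):
    """Rotation matrix -> axis-angle (non-degenerate cases only).
    R: (..., 3, 3) -> (..., 3)."""
    trace = R[..., 0, 0] + R[..., 1, 1] + R[..., 2, 2]
    angle = torch.acos(((trace - 1) / 2).clamp(-1, 1))
    axis = torch.stack([R[..., 2, 1] - R[..., 1, 2],
                        R[..., 0, 2] - R[..., 2, 0],
                        R[..., 1, 0] - R[..., 0, 1]], dim=-1)
    axis = torch.nn.functional.normalize(axis, dim=-1)
    return axis * angle[..., None]

def joints6d_to_smpl_pose(rot6d):
    """Predicted 6D joint rotations -> SMPL axis-angle pose parameters.
    rot6d: (T, 22, 6) -> (T, 66)  [global orient + body pose, assumed]."""
    return matrix_to_axis_angle(rot6d_to_matrix(rot6d)).reshape(rot6d.shape[0], -1)
```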
@article{xiao2025motionstreamer,
  title={MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
  author={Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
  journal={arXiv preprint arXiv:2503.15451},
  year={2025}
}