MotionStreamer: Streaming Motion Generation
via Diffusion-based Autoregressive Model
in Causal Latent Space

1 Zhejiang University 2 The Chinese University of Hong Kong, Shenzhen 3 The University of Hong Kong 4 Shanghai Jiao Tong University 5 DeepGlint 6 Shanghai AI Lab
[Teaser figure]

Abstract

This paper addresses the challenge of text-conditioned streaming motion generation, which requires predicting the next-step human pose from variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation: diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed responses and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate the information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully exploits the available information for accurate online motion decoding. Experiments show that our method outperforms existing approaches while enabling more applications, including multi-round generation, long-term generation, and dynamic motion composition. The code will be released to ensure reproducibility.
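As a rough illustration of this streaming loop, here is a minimal sketch, not the released implementation: `backbone`, `diffusion_head`, and `causal_decoder` are hypothetical stand-ins for the paper's components. The idea is to condition a causal autoregressive model on the text and all previously generated continuous latents, sample the next latent with a diffusion head, and decode poses online:

```python
import torch

@torch.no_grad()
def stream_generate(text_emb, backbone, diffusion_head, causal_decoder,
                    latent_dim=16, max_latents=200):
    """Sample continuous motion latents one step at a time and decode them on the fly."""
    latent_seq = text_emb.new_zeros(1, 0, latent_dim)      # empty motion history
    for _ in range(max_latents):
        h = backbone(text_emb, latent_seq)                 # causal state for the next step
        z = diffusion_head.sample(h).unsqueeze(1)          # denoise noise -> next latent
        latent_seq = torch.cat([latent_seq, z], dim=1)     # grow the history
        yield causal_decoder(latent_seq)[0, -1]            # newest pose, decoded online
```

Because each pose depends only on past latents, the decoder can emit frames as soon as their latent is sampled, avoiding the fixed-length constraint of standard motion diffusion models.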

Pipeline Overview

Motion reconstruction in causal latent space

Streaming motion generation framework
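The reconstruction stage compresses motions into continuous latents in which the code at frame t depends only on frames up to t. A minimal sketch of such causality, built from left-padded 1D convolutions (the layer sizes, names, and depths here are assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that sees only current and past frames (left padding)."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation      # pad on the left only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                            # x: (batch, channels, frames)
        return self.conv(F.pad(x, (self.pad, 0)))    # no leakage from future frames

class CausalMotionEncoder(nn.Module):
    """Maps a motion sequence to continuous latents; latent t depends only on frames <= t."""
    def __init__(self, motion_dim=272, latent_dim=16, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(motion_dim, hidden, 3), nn.SiLU(),
            CausalConv1d(hidden, hidden, 3, dilation=2), nn.SiLU(),
            CausalConv1d(hidden, latent_dim, 3, dilation=4),
        )

    def forward(self, motion):                       # motion: (batch, frames, motion_dim)
        return self.net(motion.transpose(1, 2)).transpose(1, 2)
```

A mirrored causal decoder then reconstructs poses frame by frame, which is what makes online decoding during streaming generation possible.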


Text-to-motion Generation



A man is jogging around.


a person is dancing elegantly.

he is moving back and forth
while practicing his kickboxing.


A person spins 360 degrees clockwise.


a person is walking like a mummy.

a man walks up and down from either stairs, rocks, or some uneven
terrain requiring a step.


a person is doing star jumps.


A man is performing a somersault.

a man is walking forward, stumbles
and loses balance.


a man walks sideways like a crab.


the person is doing a punch karate kick.

a man crawls forward like a zombie
and then stands up.


Streaming Long-term Motion Generation



Multi-round Text-to-motion Generation


We build an interactive Blender add-on to support the multi-round generation in our work.


Dynamic Motion Composition


We support regenerating subsequent motions by altering the text condition while preserving the initially generated prefix motion (see the sketch after the examples below).


"walk forward" , "jump forward" .


"walk forward" , "sit down" .


"walk forward" , "turn around" .

Overall visualization: Dynamic Motion Composition of 3 additional motions after the initial motion "walk forward".
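Under the same hypothetical interfaces as the streaming sketch above, composition keeps the already generated prefix latents and simply restarts autoregressive sampling with a new text embedding:

```python
import torch

def compose_next_motion(prefix_latents, new_text_emb, backbone, diffusion_head,
                        max_latents=200):
    """Regenerate the continuation under a new prompt; the prefix is left untouched."""
    latent_seq = prefix_latents.clone()                # (1, T_prefix, latent_dim)
    for _ in range(max_latents):
        h = backbone(new_text_emb, latent_seq)         # condition switches, history stays
        z = diffusion_head.sample(h).unsqueeze(1)
        latent_seq = torch.cat([latent_seq, z], dim=1)
    return latent_seq                                  # prefix + regenerated suffix
```

Since the latents are causal, changing the suffix condition cannot retroactively alter the already decoded prefix motion.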


Comparison with Text-to-motion Models

a man jumps on one leg.


Comparison with Long-term Text-to-motion Models

"a man walks forward with arms swinging."
"then he jumps up." "he turns around." "he faces another side."


Ablation Study on Causal Property of Motion Encoder

"A man is walking forward , favoring his left leg and shifting his walk. He is possibly drunk."
"He turns around."

The causal motion encoder explicitly models temporal causal structure in the latent space, enabling streaming generation of unseen motions.


Ablation Study on Two-Forward Strategy

"A man is walking forward , favoring his left leg and shifting his walk. He is possibly drunk."
"He turns around."


Ablation Study on QK Normalization Technique

A man walks in a circular path, with his torso slightly leaned back.
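QK normalization is a known stabilization trick for attention: queries and keys are normalized per head before the dot product, which bounds the attention logits and prevents training instabilities. A minimal sketch using LayerNorm on queries and keys (the exact variant and placement used in the paper may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Causal self-attention with LayerNorm applied to queries and keys."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(self.head_dim)   # normalize each head's queries...
        self.k_norm = nn.LayerNorm(self.head_dim)   # ...and keys, bounding logit scale

    def forward(self, x):                           # x: (batch, tokens, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, T, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)       # QK normalization
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, -1))
```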


Ablation Study on Text Tokenizer

A person jumps repeatedly, throwing their arms above their head
and stretching their legs with each jump.


Generation Diversity


a person walks in a circle.

Modified 272-dim Motion Representation


The widely used 263-dimensional motion representation of the HumanML3D dataset [Guo et al. (2022)]
suffers from artifacts introduced by its Inverse Kinematics post-processing.


We refine the motion representation to enable direct conversion from the predicted joint rotations to SMPL body parameters.

(Processing scripts for the modified 272-dim motion representation are available here.)
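Because the modified representation stores joint rotations directly, converting a predicted sequence to SMPL parameters reduces to a rotation-format change plus a forward pass through the body model. Below is a sketch of that idea, assuming a hypothetical (frames, 22, 6) rotation layout in the continuous 6D format of Zhou et al. (2019); the actual layout is defined by the processing scripts linked above.

```python
import numpy as np
import torch
import torch.nn.functional as F
import smplx
from scipy.spatial.transform import Rotation

def rot6d_to_matrix(d6):
    """Continuous 6D rotation -> 3x3 rotation matrix (Zhou et al., 2019)."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack((b1, b2, b3), dim=-2)

def to_smpl(joint_rot6d, transl, model_path):
    """joint_rot6d: (frames, 22, 6) predicted rotations; transl: (frames, 3) root translation."""
    T = joint_rot6d.shape[0]
    mats = rot6d_to_matrix(joint_rot6d).reshape(-1, 3, 3).detach().numpy()
    aa = Rotation.from_matrix(mats).as_rotvec().reshape(T, 22, 3)  # axis-angle per joint
    body_pose = np.concatenate(
        [aa[:, 1:].reshape(T, -1), np.zeros((T, 6))], axis=1)      # pad 2 hand joints -> 69-D
    model = smplx.create(model_path, model_type='smpl', batch_size=T)
    return model(global_orient=torch.as_tensor(aa[:, 0], dtype=torch.float32),
                 body_pose=torch.as_tensor(body_pose, dtype=torch.float32),
                 transl=transl)                                    # SMPL output: vertices, joints
```

Skipping the Inverse Kinematics fitting step removes a source of post-processing artifacts and keeps the conversion deterministic.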

BibTeX

@article{xiao2025motionstreamer,
  title={MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
  author={Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
  journal={arXiv preprint arXiv:2503.15451},
  year={2025}
}