This paper addresses the challenge of text-conditioned streaming motion generation, which requires predicting the next-step human pose from variable-length historical motions and incoming texts. Existing methods struggle with streaming motion generation: diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed responses and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate the information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while enabling more applications, including multi-round generation, long-term generation, and dynamic motion composition. The code will be released to ensure reproducibility.
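To make the streaming pipeline concrete, below is a minimal sketch of the generation loop described above: the autoregressive model consumes the text condition plus all previously generated continuous latents, a diffusion head samples the next latent, and a causal decoder emits the next pose online. All module names (`text_encoder`, `ar_model`, `diffusion_head`, `motion_decoder`) are hypothetical placeholders for illustration, not the released API.

```python
import torch

@torch.no_grad()
def stream_motion(text, text_encoder, ar_model, diffusion_head,
                  motion_decoder, max_steps=300):
    """Sketch of streaming generation: one continuous latent per step,
    decoded online, conditioned on the text and all past latents."""
    cond = text_encoder(text)                  # (1, T_text, D), assumed shape
    history = []                               # continuous latents so far
    for _ in range(max_steps):
        # Causal context: text tokens followed by all generated latents.
        ctx = torch.cat([cond] + history, dim=1) if history else cond
        h = ar_model(ctx)[:, -1:]              # hidden state of the last step
        z = diffusion_head.sample(h)           # sample next continuous latent
        history.append(z)
        # Online decoding: the causal decoder needs only past latents,
        # so each new pose can be emitted immediately.
        pose = motion_decoder(torch.cat(history, dim=1))[:, -1]
        yield pose
```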
Motion reconstruction in causal latent space
Streaming motion generation framework
Overall visualization: Dynamic Motion Composition of three additional motions after the initial motion "walk forward".
a man jumps on one leg.
"a man walks forward with arms swinging."
"then he jumps up."
"he turns around."
"he faces another side."
"A man is walking forward , favoring his left leg and shifting his walk. He is possibly drunk."
"He turns around."
The causal motion encoder explicitly models the temporal causal structure in the latent space, enabling streaming generation of unseen motions (see the sketch below).
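A minimal sketch of such a temporally causal encoder, built from left-padded 1D convolutions so that latent z_t depends only on poses up to time t. The layer widths, latent size, and dilation pattern are illustrative assumptions, not the paper's exact architecture; only the 272-dim pose input matches the representation noted below.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """Conv1d padded only on the left, preserving strict temporal causality."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = dilation * (kernel_size - 1)

    def forward(self, x):                      # x: (B, C, T)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return super().forward(x)

class CausalMotionEncoder(nn.Module):
    """Maps a pose sequence to causal continuous latents (sizes assumed)."""
    def __init__(self, pose_dim=272, latent_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(pose_dim, hidden, 3), nn.SiLU(),
            CausalConv1d(hidden, hidden, 3, dilation=2), nn.SiLU(),
            CausalConv1d(hidden, latent_dim, 3, dilation=4),
        )

    def forward(self, poses):                  # poses: (B, T, pose_dim)
        return self.net(poses.transpose(1, 2)).transpose(1, 2)  # (B, T, latent)
```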
"A man is walking forward , favoring his left leg and shifting his walk. He is possibly drunk."
"He turns around."
A man walks in a circular path, with his torso slightly leaned back.
A person jumps repeatedly, throwing their arms above their head and stretching their legs with each jump.
We refine the motion representation to enable direct conversion from the predicted joint rotations to SMPL body parameters.
(Processing scripts for the modified 272-dim motion representation are available here.)
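A hedged sketch of the rotation-to-SMPL conversion this note refers to, assuming per-joint rotations are stored in the continuous 6D form and that 22 body joints are predicted; the exact layout of the 272-dim vector is an assumption here, so consult the released processing scripts for the actual format.

```python
import torch

def rot6d_to_matrix(d6):
    """6D rotation representation -> rotation matrix via Gram-Schmidt.
    d6: (..., 6) -> (..., 3, 3)."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    b2 = torch.nn.functional.normalize(
        a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)

def matrix_to_axis_angle(R):
    """Rotation matrix -> axis-angle (non-degenerate cases only).
    R: (..., 3, 3) -> (..., 3)."""
    trace = R[..., 0, 0] + R[..., 1, 1] + R[..., 2, 2]
    angle = torch.acos(((trace - 1) / 2).clamp(-1, 1))
    axis = torch.stack([R[..., 2, 1] - R[..., 1, 2],
                        R[..., 0, 2] - R[..., 2, 0],
                        R[..., 1, 0] - R[..., 0, 1]], dim=-1)
    axis = torch.nn.functional.normalize(axis, dim=-1)
    return axis * angle[..., None]

def joints6d_to_smpl_pose(rot6d):
    """Predicted 6D joint rotations -> SMPL axis-angle pose parameters.
    rot6d: (T, 22, 6) -> (T, 66)  [global orient + body pose, assumed]."""
    return matrix_to_axis_angle(rot6d_to_matrix(rot6d)).reshape(rot6d.shape[0], -1)
```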
@article{xiao2025motionstreamer,
  title={MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
  author={Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
  journal={arXiv preprint arXiv:2503.15451},
  year={2025}
}