¹Xiangjiang Lab  ²ZJU  ³FDU  ⁴THU  ⁵SZU
*Equal Contribution  †Corresponding Author
We present Visual Action Prompts (VAP), a unified action representation for action-to-video generation of complex, high-DoF interactions that maintains transferable visual dynamics across domains. Action-driven video generation faces a precision-generality tradeoff: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We develop robust pipelines to construct skeletons from two interaction-rich data sources — human-object interactions (HOI) and dexterous robotic manipulation — enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1, and DROID demonstrate the effectiveness of our approach.
VAP encodes the action-induced structural dynamics of humans and robots as 2D visual action prompts — sequences of 2D skeletons — providing a unified control signal for action-conditioned video generation. We develop robust pipelines to extract these skeletons from human-object interaction (HOI) and robot videos, then condition and fine-tune a pretrained video generator to synthesize interaction-driven visual dynamics consistent with the prompt.
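As a concrete illustration, the sketch below (not the authors' released pipeline) shows how per-frame 2D keypoints could be rasterized into skeleton images and stacked into a visual action prompt; the joint layout, edge list, and function names are illustrative assumptions.

```python
# Minimal sketch, not the released code: rasterize per-frame 2D keypoints into
# skeleton images and stack them into a (T, H, W, 3) visual action prompt.
import numpy as np
import cv2

# Hypothetical edge list connecting joint indices of a hand/gripper skeleton.
EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]

def render_skeleton_frame(keypoints_2d, height, width, radius=3, thickness=2):
    """Draw one frame's joints, given as a (J, 2) array of pixel coordinates."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    pts = keypoints_2d.round().astype(int)
    for a, b in EDGES:
        cv2.line(canvas, (int(pts[a, 0]), int(pts[a, 1])),
                 (int(pts[b, 0]), int(pts[b, 1])), (0, 255, 0), thickness)
    for x, y in pts:
        cv2.circle(canvas, (int(x), int(y)), radius, (0, 0, 255), -1)
    return canvas

def build_visual_action_prompt(keypoint_sequence, height, width):
    """Stack per-frame skeleton renderings into a (T, H, W, 3) prompt sequence."""
    return np.stack([render_skeleton_frame(k, height, width) for k in keypoint_sequence])

# Toy example: 16 frames of a 7-joint chain drifting to the right.
sequence = [np.stack([np.full(7, 40.0 + 4 * t), 30.0 + 10.0 * np.arange(7)], axis=1)
            for t in range(16)]
prompt = build_visual_action_prompt(sequence, 128, 128)  # conditioning input for the video model
```

During fine-tuning, each rendered frame is paired with the corresponding video frame, so the generator learns to follow the drawn skeleton rather than a raw action vector.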
We first train a separate model on each dataset. Agent-centric action signals (e.g., the 7-DoF end-effector poses used in IRASim) work well for a single embodiment with fixed viewpoints (e.g., RT-1) but degrade on datasets with randomized cameras (e.g., DROID). In contrast, VAP's 2D skeleton prompts remain robust, maintaining performance under randomized viewpoints and in contact-rich interactions.
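One way to see why the 2D prompts handle randomized cameras: the camera is applied when the skeleton is rendered, so the conditioning signal already lives in image space. Below is a simple pinhole-projection sketch (an assumed model with illustrative names, not the paper's exact pipeline).

```python
# Illustrative pinhole projection (assumed model): 3D joints are projected into
# each camera before rendering, so the skeleton prompt already encodes the
# viewpoint that raw 7-DoF end-effector actions do not carry.
import numpy as np

def project_joints(joints_3d, K, world_to_cam):
    """joints_3d: (J, 3) world coords; K: (3, 3) intrinsics; world_to_cam: (4, 4) extrinsics."""
    homo = np.concatenate([joints_3d, np.ones((len(joints_3d), 1))], axis=1)  # (J, 4)
    cam = (world_to_cam @ homo.T).T[:, :3]                                    # camera frame
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]                                             # (J, 2) pixel coords

# Toy usage: identity extrinsics, a 128x128 camera looking down +z.
K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
joints = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.1, 0.1, 1.2]])
print(project_joints(joints, K, np.eye(4)))
```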
We then train a single VAP model jointly across datasets. A single set of VAPs supports interactive video generation for both hand-object interactions and robotic manipulation, demonstrating generalization across embodiments.
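A hypothetical sketch of such joint training is below (an assumption about the setup, not the released recipe): batches mix HOI and robot samples so a single model sees both embodiments during fine-tuning.

```python
# Hypothetical joint-training sampler (assumed setup): draw each batch element
# from the HOI or robot dataset with a mixing ratio, so one model is fine-tuned
# on both embodiments.
import random

def sample_mixed_batch(hoi_dataset, robot_dataset, batch_size, hoi_ratio=0.5):
    """Each item is a (video_clip, visual_action_prompt) pair from either source."""
    batch = []
    for _ in range(batch_size):
        source = hoi_dataset if random.random() < hoi_ratio else robot_dataset
        batch.append(source[random.randrange(len(source))])
    return batch
```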
Visual action prompts are a flexible control interface. We can instantiate them as 2D skeletons, mesh renderings, depth maps, etc.; all serve as effective action surrogates, enabling fine-grained, temporally coherent control of motion and interaction.
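For instance, a small adapter could map every instantiation to one conditioning tensor format so the generator never needs to know which modality it receives; the sketch below is a hypothetical interface, not released code.

```python
# Hypothetical adapter (assumed interface): skeleton renderings, mesh renderings,
# and depth maps are all normalized to the same (T, 3, H, W) float tensor, so the
# conditioned generator is agnostic to which prompt instantiation is used.
import numpy as np

def to_prompt_tensor(frames):
    """frames: (T, H, W) depth or (T, H, W, 3) renderings -> (T, 3, H, W) floats in [0, 1]."""
    arr = np.asarray(frames, dtype=np.float32)
    if arr.ndim == 3:                        # single-channel input (e.g., depth)
        arr = np.repeat(arr[..., None], 3, axis=-1)
    lo, hi = arr.min(), arr.max()
    arr = (arr - lo) / (hi - lo + 1e-8)      # per-sequence min-max normalization
    return arr.transpose(0, 3, 1, 2)

# Both calls yield tensors with identical shape and range for the generator.
skeleton_prompt = to_prompt_tensor(np.random.randint(0, 256, (16, 128, 128, 3)))
depth_prompt = to_prompt_tensor(np.random.rand(16, 128, 128) * 2.0)
```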
By altering the visual action prompt, VAP modulates the output video to simulate different actions.
@InProceedings{VAP_2025_ICCV,
    author    = {Wang, Yuang and Wen, Chao and Guo, Haoyu and Peng, Sida and Qin, Minghan and Bao, Hujun and Zhou, Xiaowei and Hu, Ruizhen},
    title     = {Precise Action-to-Video Generation Through Visual Action Prompts},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025}
}