¹Xiangjiang Lab  ²ZJU  ³FDU  ⁴THU  ⁵SZU
*Equal Contribution  †Corresponding Author
We present Visual Action Prompts (VAP), a unified action representation for action-to-video generation of complex, high-DoF interactions that maintains transferable visual dynamics across domains. Action-driven video generation faces a precision-generality tradeoff: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We develop robust pipelines to construct skeletons from two interaction-rich data sources — human-object interactions (HOI) and dexterous robotic manipulation — enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1, and DROID demonstrate the effectiveness of our approach.
VAP encodes the action-induced structural dynamics of humans and robots as 2D visual action prompts — sequences of 2D skeletons — providing a unified control signal for action-conditioned video generation. We develop robust pipelines to extract these skeletons from human-object interaction (HOI) and robot videos, then condition and fine-tune a pretrained video generator to synthesize interaction-driven visual dynamics consistent with the prompt.
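As a concrete illustration, the sketch below (not the authors' released pipeline) shows how per-frame 2D keypoints could be rasterized into skeleton images and stacked into a visual action prompt; the joint layout, edge list, and function names are illustrative assumptions.

```python
# Minimal sketch, not the released code: rasterize per-frame 2D keypoints into
# skeleton images and stack them into a (T, H, W, 3) visual action prompt.
import numpy as np
import cv2

# Hypothetical edge list connecting joint indices of a hand/gripper skeleton.
EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]

def render_skeleton_frame(keypoints_2d, height, width, radius=3, thickness=2):
    """Draw one frame's joints, given as a (J, 2) array of pixel coordinates."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    pts = keypoints_2d.round().astype(int)
    for a, b in EDGES:
        cv2.line(canvas, (int(pts[a, 0]), int(pts[a, 1])),
                 (int(pts[b, 0]), int(pts[b, 1])), (0, 255, 0), thickness)
    for x, y in pts:
        cv2.circle(canvas, (int(x), int(y)), radius, (0, 0, 255), -1)
    return canvas

def build_visual_action_prompt(keypoint_sequence, height, width):
    """Stack per-frame skeleton renderings into a (T, H, W, 3) prompt sequence."""
    return np.stack([render_skeleton_frame(k, height, width) for k in keypoint_sequence])

# Toy example: 16 frames of a 7-joint chain drifting to the right.
sequence = [np.stack([np.full(7, 40.0 + 4 * t), 30.0 + 10.0 * np.arange(7)], axis=1)
            for t in range(16)]
prompt = build_visual_action_prompt(sequence, 128, 128)  # conditioning input for the video model
```

During fine-tuning, each rendered frame is paired with the corresponding video frame, so the generator learns to follow the drawn skeleton rather than a raw action vector.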
We first train a separate model on each dataset. Agent-centric action signals (e.g., the 7-DoF end-effector poses used in IRASim) work well for a single embodiment with fixed viewpoints (e.g., RT-1) but degrade on datasets with randomized cameras (e.g., DROID). In contrast, VAP's 2D skeleton prompts remain robust, maintaining performance under randomized viewpoints and in contact-rich interactions.
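One way to see why the 2D prompts handle randomized cameras: the camera is applied when the skeleton is rendered, so the conditioning signal already lives in image space. Below is a simple pinhole-projection sketch (an assumed model with illustrative names, not the paper's exact pipeline).

```python
# Illustrative pinhole projection (assumed model): 3D joints are projected into
# each camera before rendering, so the skeleton prompt already encodes the
# viewpoint that raw 7-DoF end-effector actions do not carry.
import numpy as np

def project_joints(joints_3d, K, world_to_cam):
    """joints_3d: (J, 3) world coords; K: (3, 3) intrinsics; world_to_cam: (4, 4) extrinsics."""
    homo = np.concatenate([joints_3d, np.ones((len(joints_3d), 1))], axis=1)  # (J, 4)
    cam = (world_to_cam @ homo.T).T[:, :3]                                    # camera frame
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]                                             # (J, 2) pixel coords

# Toy usage: identity extrinsics, a 128x128 camera looking down +z.
K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
joints = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.1, 0.1, 1.2]])
print(project_joints(joints, K, np.eye(4)))
```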
We then train a single VAP model jointly across datasets. A single set of VAPs supports interactive video generation for both hand-object interactions and robotic manipulation, demonstrating generalization across embodiments.
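A hypothetical sketch of such joint training is below (an assumption about the setup, not the released recipe): batches mix HOI and robot samples so a single model sees both embodiments during fine-tuning.

```python
# Hypothetical joint-training sampler (assumed setup): draw each batch element
# from the HOI or robot dataset with a mixing ratio, so one model is fine-tuned
# on both embodiments.
import random

def sample_mixed_batch(hoi_dataset, robot_dataset, batch_size, hoi_ratio=0.5):
    """Each item is a (video_clip, visual_action_prompt) pair from either source."""
    batch = []
    for _ in range(batch_size):
        source = hoi_dataset if random.random() < hoi_ratio else robot_dataset
        batch.append(source[random.randrange(len(source))])
    return batch
```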
Visual action prompts are a flexible control interface. We can instantiate them as 2D skeletons, mesh renderings, depth maps, etc.; all serve as effective action surrogates, enabling fine-grained, temporally coherent control of motion and interaction.
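For instance, a small adapter could map every instantiation to one conditioning tensor format so the generator never needs to know which modality it receives; the sketch below is a hypothetical interface, not released code.

```python
# Hypothetical adapter (assumed interface): skeleton renderings, mesh renderings,
# and depth maps are all normalized to the same (T, 3, H, W) float tensor, so the
# conditioned generator is agnostic to which prompt instantiation is used.
import numpy as np

def to_prompt_tensor(frames):
    """frames: (T, H, W) depth or (T, H, W, 3) renderings -> (T, 3, H, W) floats in [0, 1]."""
    arr = np.asarray(frames, dtype=np.float32)
    if arr.ndim == 3:                        # single-channel input (e.g., depth)
        arr = np.repeat(arr[..., None], 3, axis=-1)
    lo, hi = arr.min(), arr.max()
    arr = (arr - lo) / (hi - lo + 1e-8)      # per-sequence min-max normalization
    return arr.transpose(0, 3, 1, 2)

# Both calls yield tensors with identical shape and range for the generator.
skeleton_prompt = to_prompt_tensor(np.random.randint(0, 256, (16, 128, 128, 3)))
depth_prompt = to_prompt_tensor(np.random.rand(16, 128, 128) * 2.0)
```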
By altering the visual action prompt, VAP modulates the output video to simulate different actions.
@InProceedings{VAP_2025_ICCV,
    author    = {Wang, Yuang and Wen, Chao and Guo, Haoyu and Peng, Sida and Qin, Minghan and Bao, Hujun and Zhou, Xiaowei and Hu, Ruizhen},
    title     = {Precise Action-to-Video Generation Through Visual Action Prompts},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025}
}