HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers

SIGGRAPH 2025


Zhiyuan Yu1*   Zhe Li1*   Hujun Bao3   Can Yang1x   Xiaowei Zhou3x  

1Mathematics, HKUST   2Huawei   3State Key Lab of CAD&CG, Zhejiang University
* Equal contribution     x Corresponding authors

Abstract


We propose HumanRAM, a novel approach for feed-forward novel view synthesis (reconstruction) and novel pose synthesis (animation) from sparse-view or single-view human images. The animation poses are taken from the ActorsHQ and AMASS datasets.

3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM). Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets.


Sparse-view Results



Single-view Results



Method


Pose Image. We first render position maps by rasterizing the mesh with canonical SMPL-X coordinates as vertex colors; the resulting position maps are then used to sample a triplane-based neural texture.
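The snippet below is a minimal PyTorch sketch of this sampling step, under assumed shapes and hypothetical names (TriplaneNeuralTexture, feat_dim, valid_mask are illustrative, not the paper's implementation): a rasterized position map holding per-pixel canonical SMPL-X coordinates is used as a query into three learnable feature planes.

        import torch
        import torch.nn.functional as F

        class TriplaneNeuralTexture(torch.nn.Module):
            """Hypothetical triplane neural texture queried by canonical positions."""
            def __init__(self, feat_dim=32, res=256):
                super().__init__()
                # Three learnable feature planes: XY, XZ, YZ.
                self.planes = torch.nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)

            def forward(self, pos_map, valid_mask):
                """
                pos_map:    (B, 3, H, W) canonical SMPL-X coordinates, normalized to [-1, 1].
                valid_mask: (B, 1, H, W) 1 where the rasterized SMPL-X mesh covers the pixel.
                returns:    (B, feat_dim, H, W) pose-image features.
                """
                B, _, H, W = pos_map.shape
                x, y, z = pos_map[:, 0], pos_map[:, 1], pos_map[:, 2]
                # Build one 2D sampling grid per plane, each of shape (B, H, W, 2).
                grids = [torch.stack(g, dim=-1) for g in ((x, y), (x, z), (y, z))]
                feats = 0.0
                for plane, grid in zip(self.planes, grids):
                    plane_b = plane.unsqueeze(0).expand(B, -1, -1, -1)
                    feats = feats + F.grid_sample(plane_b, grid, align_corners=True)
                return feats * valid_mask  # zero out background pixels

Summing the three plane lookups is one common triplane aggregation choice; concatenation would work analogously.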

Pipeline. HumanRAM adopts transformers for human reconstruction and animation from sparse-view images in a feed-forward manner. We first patchify and project sparse-view RGB images, together with their corresponding Plücker rays and pose images, into input tokens through a linear layer. The pose images are acquired by rasterizing the SMPL-X neural texture onto the input views. Similarly, given the target novel view under the same or a novel pose, the target tokens are created from the target Plücker rays and pose images through another linear layer. Both input tokens and target tokens are then fed into transformer blocks. Finally, a DPT-based decoder regresses the intermediate target tokens to a high-fidelity human image under the target view and target pose. Overall, HumanRAM realizes feed-forward reconstruction and animation by controlling the target views and target poses at the input end.
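As a rough illustration of the tokenization step, the following PyTorch sketch computes per-pixel Plücker rays from camera parameters and patchifies the concatenated (RGB | Plücker | pose image) maps into tokens via a per-patch linear projection. Shapes, channel counts, and names (plucker_rays, Tokenizer, patch size 16, token dim 1024) are assumptions for illustration, not the paper's exact configuration.

        import torch
        import torch.nn as nn

        def plucker_rays(K, c2w, H, W):
            """Per-pixel Plücker coordinates (direction, moment), shape (6, H, W)."""
            device = K.device
            v, u = torch.meshgrid(
                torch.arange(H, device=device, dtype=torch.float32),
                torch.arange(W, device=device, dtype=torch.float32),
                indexing="ij",
            )
            # Unproject pixel centers to camera-space ray directions.
            dirs_cam = torch.stack([(u + 0.5 - K[0, 2]) / K[0, 0],
                                    (v + 0.5 - K[1, 2]) / K[1, 1],
                                    torch.ones_like(u)], dim=0)        # (3, H, W)
            dirs = torch.einsum("ij,jhw->ihw", c2w[:3, :3], dirs_cam)  # rotate to world
            dirs = dirs / dirs.norm(dim=0, keepdim=True)
            origin = c2w[:3, 3].view(3, 1, 1).expand_as(dirs)
            moment = torch.cross(origin, dirs, dim=0)                  # m = o x d
            return torch.cat([dirs, moment], dim=0)                    # (6, H, W)

        class Tokenizer(nn.Module):
            """Patchify concatenated image maps and project each patch to a token."""
            def __init__(self, in_ch, dim=1024, patch=16):
                super().__init__()
                # A strided conv is equivalent to patchify + per-patch linear layer.
                self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

            def forward(self, x):                               # (B, in_ch, H, W)
                return self.proj(x).flatten(2).transpose(1, 2)  # (B, N_tokens, dim)

        # Hypothetical usage: input tokens carry RGB (3) + Plücker (6) + pose features
        # (e.g. 32); target tokens omit RGB, since the target image is what the model
        # predicts from the target Plücker rays and pose image.
        input_tokenizer = Tokenizer(in_ch=3 + 6 + 32)
        target_tokenizer = Tokenizer(in_ch=6 + 32)

The input and target token sequences would then be processed jointly by the transformer blocks, with the DPT-based decoder reading out the target tokens into the final image.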


Applications to Volumetric Video



Full Demo Video




Citation


        @inproceedings{yu2025humanram,
          title={HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers},
          author={Yu, Zhiyuan and Li, Zhe and Bao, Hujun and Yang, Can and Zhou, Xiaowei},
          booktitle={SIGGRAPH Conference Proceedings},
          year={2025}
        }