HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers

SIGGRAPH 2025


Zhiyuan Yu1*   Zhe Li1*   Hujun Bao3   Can Yang1x   Xiaowei Zhou3x  

1Mathematics, HKUST   2Huawei   3State Key Lab of CAD&CG, Zhejiang University
* Equal contribution     x Corresponding authors

Abstract


We propose HumanRAM, a novel approach for feed-forward novel view synthesis (reconstruction) and novel pose synthesis (animation) from sparse-view or single-view human images. The animation poses are taken from the ActorsHQ and AMASS datasets.

3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM). Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets.


Sparse-view Results



Single-view Results



Method


Pose Image. We first render position maps by rasterizing the mesh with canonical SMPL-X coordinates as vertex colors; the resulting position maps are then used to sample a triplane-based neural texture.
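The snippet below is a minimal PyTorch sketch of this sampling step, under assumed shapes and hypothetical names (TriplaneNeuralTexture, feat_dim, valid_mask are illustrative, not the paper's implementation): a rasterized position map holding per-pixel canonical SMPL-X coordinates is used as a query into three learnable feature planes.

        import torch
        import torch.nn.functional as F

        class TriplaneNeuralTexture(torch.nn.Module):
            """Hypothetical triplane neural texture queried by canonical positions."""
            def __init__(self, feat_dim=32, res=256):
                super().__init__()
                # Three learnable feature planes: XY, XZ, YZ.
                self.planes = torch.nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)

            def forward(self, pos_map, valid_mask):
                """
                pos_map:    (B, 3, H, W) canonical SMPL-X coordinates, normalized to [-1, 1].
                valid_mask: (B, 1, H, W) 1 where the rasterized SMPL-X mesh covers the pixel.
                returns:    (B, feat_dim, H, W) pose-image features.
                """
                B, _, H, W = pos_map.shape
                x, y, z = pos_map[:, 0], pos_map[:, 1], pos_map[:, 2]
                # Build one 2D sampling grid per plane, each of shape (B, H, W, 2).
                grids = [torch.stack(g, dim=-1) for g in ((x, y), (x, z), (y, z))]
                feats = 0.0
                for plane, grid in zip(self.planes, grids):
                    plane_b = plane.unsqueeze(0).expand(B, -1, -1, -1)
                    feats = feats + F.grid_sample(plane_b, grid, align_corners=True)
                return feats * valid_mask  # zero out background pixels

Summing the three plane lookups is one common triplane aggregation choice; concatenation would work analogously.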

Pipeline. HumanRAM adopts transformers for human reconstruction and animation from sparse-view images in a feed-forward manner. We first patchify and project sparse-view RGB images, together with their corresponding Plücker rays and pose images, into input tokens through a linear layer. The pose images are acquired by rasterizing the SMPL-X neural texture onto the input views. Similarly, given the target novel view under the same or a novel pose, the target tokens are created from the target Plücker rays and pose images through another linear layer. Both input tokens and target tokens are then fed into transformer blocks. Finally, a DPT-based decoder regresses the intermediate target tokens to a high-fidelity human image under the target view and target pose. Overall, HumanRAM realizes feed-forward reconstruction and animation by controlling the target views and target poses at the input end.
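As a rough illustration of the tokenization step, the following PyTorch sketch computes per-pixel Plücker rays from camera parameters and patchifies the concatenated (RGB | Plücker | pose image) maps into tokens via a per-patch linear projection. Shapes, channel counts, and names (plucker_rays, Tokenizer, patch size 16, token dim 1024) are assumptions for illustration, not the paper's exact configuration.

        import torch
        import torch.nn as nn

        def plucker_rays(K, c2w, H, W):
            """Per-pixel Plücker coordinates (direction, moment), shape (6, H, W)."""
            device = K.device
            v, u = torch.meshgrid(
                torch.arange(H, device=device, dtype=torch.float32),
                torch.arange(W, device=device, dtype=torch.float32),
                indexing="ij",
            )
            # Unproject pixel centers to camera-space ray directions.
            dirs_cam = torch.stack([(u + 0.5 - K[0, 2]) / K[0, 0],
                                    (v + 0.5 - K[1, 2]) / K[1, 1],
                                    torch.ones_like(u)], dim=0)        # (3, H, W)
            dirs = torch.einsum("ij,jhw->ihw", c2w[:3, :3], dirs_cam)  # rotate to world
            dirs = dirs / dirs.norm(dim=0, keepdim=True)
            origin = c2w[:3, 3].view(3, 1, 1).expand_as(dirs)
            moment = torch.cross(origin, dirs, dim=0)                  # m = o x d
            return torch.cat([dirs, moment], dim=0)                    # (6, H, W)

        class Tokenizer(nn.Module):
            """Patchify concatenated image maps and project each patch to a token."""
            def __init__(self, in_ch, dim=1024, patch=16):
                super().__init__()
                # A strided conv is equivalent to patchify + per-patch linear layer.
                self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

            def forward(self, x):                               # (B, in_ch, H, W)
                return self.proj(x).flatten(2).transpose(1, 2)  # (B, N_tokens, dim)

        # Hypothetical usage: input tokens carry RGB (3) + Plücker (6) + pose features
        # (e.g. 32); target tokens omit RGB, since the target image is what the model
        # predicts from the target Plücker rays and pose image.
        input_tokenizer = Tokenizer(in_ch=3 + 6 + 32)
        target_tokenizer = Tokenizer(in_ch=6 + 32)

The input and target token sequences would then be processed jointly by the transformer blocks, with the DPT-based decoder reading out the target tokens into the final image.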


Applications to Volumetric Video



Full Demo Video




Citation


        @inproceedings{yu2025humanram,
          title={HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers},
          author={Yu, Zhiyuan and Li, Zhe and Bao, Hujun and Yang, Can and Zhou, Xiaowei},
          booktitle={SIGGRAPH Conference Proceedings},
          year={2025}
        }