Learning Neural Volumetric Representations
of Dynamic Humans in Minutes
Figure: Comparison of optimization speed between our method and baseline.
This paper addresses the challenge of quickly reconstructing free-viewpoint videos of dynamic humans from sparse multi-view videos. Some recent works represent a dynamic human as a canonical neural radiance field (NeRF) plus a motion field, both learned from videos through differentiable rendering; these generally require a lengthy optimization process. Other, generalization-based methods leverage priors learned from datasets and reduce the optimization time by only fine-tuning on new scenes, at the cost of visual fidelity. In this paper, we propose a novel method that synthesizes free-viewpoint human performances from sparse-view videos in minutes with competitive visual quality. Specifically, we leverage the human body prior to define a novel part-based voxelized NeRF representation, which distributes the representational power of the canonical human model efficiently. Furthermore, we propose a novel 2D motion parameterization scheme that reduces the dimensionality of the human deformation field and thereby increases its convergence rate. Experiments demonstrate that our approach can be trained 100 times faster than prior per-scene optimization methods while remaining competitive in rendering quality. Given a 100-frame, 512x512 video of a human performer, our model typically takes about 5 minutes of training on a single RTX 3090 GPU to produce photorealistic free-viewpoint videos. The code will be released for reproducibility.
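To make the part-based idea concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation): each body part owns a small feature voxel grid over its own bounding box in canonical space, and a query point is routed to the part that contains it before trilinear interpolation. The class `PartVoxelField`, its grid resolution, and the box layout are all hypothetical names and parameters chosen for this example.

```python
import numpy as np

def trilerp(grid, p):
    """Trilinearly interpolate an (R, R, R, C) feature grid at p in [0, 1]^3."""
    R = grid.shape[0]
    x = p * (R - 1)
    i = np.clip(np.floor(x).astype(int), 0, R - 2)
    f = x - i
    out = np.zeros(grid.shape[-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))
                out += w * grid[i[0] + dx, i[1] + dy, i[2] + dz]
    return out

class PartVoxelField:
    """Toy part-based voxel field: one small feature grid per body part,
    so representational capacity is spent only where the body actually is."""
    def __init__(self, part_boxes, res=8, feat_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        self.boxes = part_boxes  # list of (lo, hi) corners, each an array of shape (3,)
        self.grids = [rng.normal(size=(res, res, res, feat_dim))
                      for _ in part_boxes]

    def query(self, p):
        # Assign the canonical-space point to the first part whose box
        # contains it, then interpolate that part's grid in normalized
        # box coordinates; points outside all parts are empty space.
        for (lo, hi), grid in zip(self.boxes, self.grids):
            if np.all(p >= lo) and np.all(p <= hi):
                return trilerp(grid, (p - lo) / (hi - lo))
        return np.zeros(self.grids[0].shape[-1])
```

Because each grid only covers one part's bounding box, the total voxel count stays small even at fine effective resolution, which is one plausible reading of how the representation "distributes representational power efficiently".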