Learning Human Mesh Recovery in 3D Scenes

Given an input image and a pre-scanned scene, SA-HMR uses a single forward pass of the network to estimate the global position (blue ball), the contact scene points (colored scene points), and a scene-aware human mesh in 170 ms.

Abstract

We present a novel method for recovering the absolute pose and shape of a human in a pre-scanned scene given a single image. Unlike previous methods that perform scene-aware mesh optimization, we propose to first estimate absolute position and dense scene contacts with a sparse 3D CNN, and later enhance a pretrained human mesh recovery network by cross-attention with the derived 3D scene cues. Joint learning on images and scene geometry enables our method to reduce the ambiguity caused by depth and occlusion, resulting in more reasonable global postures and contacts. Encoding scene-aware cues in the network also allows the proposed method to be optimization-free, and opens up the opportunity for real-time applications. The experiments show that the proposed network is capable of recovering accurate and physically-plausible meshes by a single forward pass and outperforms state-of-the-art methods in terms of both accuracy and speed.

Pipeline

\(\textbf{Overview of the proposed SA-HMR.}\) \(\textbf{1.}\) The human root and scene contact estimation module (Sec. 3.2) first predicts an initial root and then refines it with 3D scene cues using a sparse 3D CNN. The module also predicts a contact label for each scene point. \(\textbf{2.}\) The scene-aware human mesh recovery module (Sec. 3.3) enhances the pretrained METRO network with a parallel scene network. The scene network takes the predicted contact scene points as input and uses cross-attention to pass messages to the intermediate features of the METRO network.
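
The message passing in the second module can be illustrated with a minimal NumPy sketch of single-head cross-attention, in which mesh features act as queries and the features of contact scene points act as keys and values. This is not the SA-HMR implementation: the function name `cross_attention`, the feature sizes, the random projection matrices, the residual update, and the boolean `contact_mask` used to select contact points are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(mesh_feats, scene_feats, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper, not from the paper).

    mesh_feats:  (M, d) intermediate features of the mesh network (queries)
    scene_feats: (S, d) features of the contact scene points (keys/values)
    Returns the updated mesh features (M, d) and attention weights (M, S).
    """
    q = mesh_feats @ w_q
    k = scene_feats @ w_k
    v = scene_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each mesh token attends over scene points
    return mesh_feats + weights @ v, weights  # residual update (an assumption)

rng = np.random.default_rng(0)
d = 16                                        # assumed feature dimension
mesh_feats = rng.normal(size=(8, d))          # 8 mesh tokens (assumed)
all_scene_feats = rng.normal(size=(32, d))    # 32 scene points (assumed)

# Assumed stand-in for the predicted contact labels: keep every other point.
contact_mask = np.arange(32) % 2 == 0
contact_feats = all_scene_feats[contact_mask]

w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
updated, weights = cross_attention(mesh_feats, contact_feats, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the selected contact points, so every mesh token receives a weighted mixture of scene features on top of its original value.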