HTD-Refine: Natural Human Motion Recovery
by Aligning High-order Temporal Dynamics
from Monocular Videos

CVPR 2026 Oral Award Candidate

Dingkun Wei¹,²,* Zehong Shen²,* Yan Xia³ Georgios Pavlakos³ Yujun Shen² Xiaowei Zhou¹

*Equal contribution ¹Zhejiang University ²Ant Group ³The University of Texas at Austin

TL;DR We refine existing HMR methods with estimated 3D velocity and 3D acceleration to recover natural human motion in global coordinates.

Abstract

Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues, namely velocity and acceleration, which are essential for reconstructing motion with realistic momentum, timing, and high-frequency detail.

Methods

HTD-Refine recovers natural global human motion from a monocular video in three stages. First, we initialize a world-space trajectory by combining camera-space SMPL estimates from an off-the-shelf HMR model with per-frame camera extrinsics. Second, our lightweight temporal transformer, PVA-Net, predicts stabilized 2D keypoints together with camera-space 3D joint velocities and accelerations. Third, we optimize pose, global orientation, and translation over the full sequence to match the velocity and acceleration targets from PVA-Net, constrained by the 2D keypoints, a jerk-smoothness term, and regularization toward the initialization.
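As a rough illustration of the first and third stages, the sketch below maps camera-space joints to world space and evaluates a simple refinement energy built from finite differences. It is not the released implementation: the function names, the weights, the use of joint positions rather than SMPL parameters, and the omission of the 2D keypoint term are all assumptions made to keep the example short. It also assumes the velocity and acceleration targets have already been expressed in the same coordinate frame as the joints being optimized.

```python
import numpy as np

def camera_to_world(joints_cam, R_c2w, t_c2w):
    """Map camera-space joints (T, J, 3) to world space.

    R_c2w: (T, 3, 3) per-frame camera-to-world rotations (assumed convention).
    t_c2w: (T, 3) per-frame camera-to-world translations (assumed convention).
    """
    # out[t, k] = R_c2w[t] @ joints_cam[t, k] + t_c2w[t]
    return np.einsum("tij,tkj->tki", R_c2w, joints_cam) + t_c2w[:, None, :]

def finite_differences(x, fps):
    """Return velocity, acceleration, and jerk of x (T, J, 3) by forward differences."""
    dt = 1.0 / fps
    vel = np.diff(x, n=1, axis=0) / dt          # (T-1, J, 3)
    acc = np.diff(x, n=2, axis=0) / dt ** 2     # (T-2, J, 3)
    jerk = np.diff(x, n=3, axis=0) / dt ** 3    # (T-3, J, 3)
    return vel, acc, jerk

def refinement_energy(joints, joints_init, vel_target, acc_target, fps,
                      w_vel=1.0, w_acc=1.0, w_jerk=1e-3, w_reg=0.1):
    """Hypothetical energy: velocity and acceleration alignment, jerk
    smoothness, and regularization toward the initialization.
    The 2D keypoint term is omitted here; all weights are placeholders."""
    vel, acc, jerk = finite_differences(joints, fps)
    e_vel = np.mean(np.sum((vel - vel_target) ** 2, axis=-1))
    e_acc = np.mean(np.sum((acc - acc_target) ** 2, axis=-1))
    e_jerk = np.mean(np.sum(jerk ** 2, axis=-1))
    e_reg = np.mean(np.sum((joints - joints_init) ** 2, axis=-1))
    return w_vel * e_vel + w_acc * e_acc + w_jerk * e_jerk + w_reg * e_reg
```

In practice such an energy would be minimized over the full sequence with a gradient-based optimizer acting on the body model parameters. The NumPy version here only shows how the individual terms could be assembled.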

Main Results

MPJVE: Mean Per-Joint Velocity Error. MPJAE: Mean Per-Joint Acceleration Error.
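One plausible way to compute these two metrics is sketched below. The page does not state the exact definition, units, or frame-rate normalization, so the finite-difference scheme, the Euclidean norm, and the helper name `mean_per_joint_derivative_error` are assumptions for illustration only, and the numbers it produces are not guaranteed to match the scale of the tables.

```python
import numpy as np

def mean_per_joint_derivative_error(pred, gt, order, fps):
    """Mean Euclidean error between the order-th finite-difference
    derivatives of predicted and ground-truth joints, both (T, J, 3)."""
    dt = 1.0 / fps
    d_pred = np.diff(pred, n=order, axis=0) / dt ** order
    d_gt = np.diff(gt, n=order, axis=0) / dt ** order
    # Average the per-joint, per-frame error norms.
    return float(np.mean(np.linalg.norm(d_pred - d_gt, axis=-1)))

def mpjve(pred, gt, fps):
    """Velocity error (first derivative), under the assumptions above."""
    return mean_per_joint_derivative_error(pred, gt, order=1, fps=fps)

def mpjae(pred, gt, fps):
    """Acceleration error (second derivative), under the assumptions above."""
    return mean_per_joint_derivative_error(pred, gt, order=2, fps=fps)
```

Under this definition a constant positional offset between prediction and ground truth contributes nothing to either metric, which is why both are reported alongside position errors such as W-MPJPE rather than instead of them.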

EMDB-2 Moving Cameras

Model	Jitter	FS	MPJVE	MPJAE	WA-MPJPE	W-MPJPE	RTE
TRAM (w/ traj filter)	25.1	12.0	0.6	12.3	78.8	221.3	1.5
TRAM + HTD-Refine	6.6	7.5	0.4	8.0	71.7	204.9	1.5
GVHMR	17.2	4.0	0.6	10.4	118.7	292.7	2.1
GVHMR + HTD-Refine	7.2	5.7	0.4	7.9	69.2	192.4	1.5
Human3R	529.6	60.0	2.9	143.3	169.0	367.1	2.2
Human3R + HTD-Refine	132.5	23.2	1.3	39.4	156.2	391.4	2.2

RICH Static Cameras

Model	Jitter	FS	MPJVE	MPJAE	WA-MPJPE	W-MPJPE	RTE
TRAM (w/ traj filter)	18.7	12.9	0.6	8.7	103.6	168.4	2.7
TRAM + HTD-Refine	4.2	6.5	0.4	5.1	90.2	145.3	2.5
GVHMR	13.0	3.3	0.4	6.8	77.4	124.0	2.5
GVHMR + HTD-Refine	3.6	3.3	0.3	4.8	75.2	123.8	2.3

Citation

@inproceedings{wei2026htdrefine,
  title     = {Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos},
  author    = {Wei, Dingkun and Shen, Zehong and Xia, Yan and Pavlakos, Georgios and Shen, Yujun and Zhou, Xiaowei},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
