1SenseTime Research
2Zhejiang University
3National University of Defense Technology
* denotes equal contribution
This paper addresses the problem of real-time 6-DoF object tracking from an RGB video. Prior optimization-based methods estimate the object pose by aligning the projected model to the image using handcrafted features, which makes them prone to suboptimal solutions. Recent learning-based methods use a deep network to predict the pose, but suffer from limited generalizability or computational efficiency. We propose a learning-based active contour model that combines the strengths of both approaches. Specifically, given the initial pose, we project the object model onto the image plane to obtain the initial contour, and use a lightweight network to predict how the contour should move to match the true object boundary; this prediction provides the gradients used to optimize the object pose. We also devise an efficient optimization algorithm to train our model end-to-end with pose supervision. Experimental results on semi-synthetic and real-world 6-DoF object tracking datasets demonstrate that our model outperforms state-of-the-art methods by a substantial margin in pose accuracy, while achieving real-time performance on a mobile device.
$\textbf{1.}$ The method uses an FPN-Lite CNN to extract multi-level features from the current cropped frame $_c\mathit{I}_k$, and represents the local region around the contour with a correspondence line model. $\textbf{2.}$ A contour feature map $_{ct}\mathbf{F}_k^s$ is built by sampling a cycle of correspondence lines on the image feature map, and a boundary prediction module then predicts the boundary location probability $_{ct}\mathbf{B}_k^s$. $\textbf{3.}$ A differentiable optimization layer estimates the pose $\mathbf{P}_k^s$ in a coarse-to-fine manner. The superscript $s$ denotes the feature level.
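The correspondence-line idea in steps 1 and 2 can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the paper's implementation. The function names `sample_correspondence_lines` and `expected_boundary_offset` are hypothetical. Nearest-neighbour sampling is used for brevity. A hand-written edge score stands in for the learned boundary prediction module. The soft-argmax that turns the per-line probability into a signed offset is an illustrative choice that the text above does not confirm.

```python
import numpy as np


def sample_correspondence_lines(feature_map, centers, normals, half_len):
    """Sample features along the normal at each contour point.

    feature_map: (H, W, C) array.
    centers:     (N, 2) contour points as (x, y) pixel coordinates.
    normals:     (N, 2) unit normals at those points.
    half_len:    number of samples on each side of the contour point.
    Returns (lines, offsets): lines is (N, L, C), offsets is (L,), L = 2*half_len + 1.
    """
    h, w, _ = feature_map.shape
    offsets = np.arange(-half_len, half_len + 1, dtype=np.float64)
    # (N, L, 2): every contour point shifted along its own normal.
    pts = centers[:, None, :] + offsets[None, :, None] * normals[:, None, :]
    # Nearest-neighbour lookup (an assumption made for brevity).
    xs = np.clip(np.rint(pts[..., 0]).astype(int), 0, w - 1)
    ys = np.clip(np.rint(pts[..., 1]).astype(int), 0, h - 1)
    return feature_map[ys, xs], offsets


def expected_boundary_offset(logits, offsets):
    """Softmax over each line, then the probability-weighted mean offset."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    prob = np.exp(z)
    prob /= prob.sum(axis=1, keepdims=True)
    return prob, prob @ offsets


# Toy scene: a filled disk of radius 20 px; the initial contour has radius 17 px.
H = W = 64
yy, xx = np.mgrid[0:H, 0:W]
mask = ((xx - 32) ** 2 + (yy - 32) ** 2 <= 20 ** 2).astype(np.float64)
feature_map = mask[..., None]  # one-channel stand-in for a CNN feature map

theta = np.linspace(0.0, 2.0 * np.pi, 36, endpoint=False)
normals = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # outward normals
centers = 32.0 + 17.0 * normals                             # initial contour

lines, offsets = sample_correspondence_lines(feature_map, centers, normals, half_len=8)

# Hypothetical boundary score: gradient magnitude along each line.
# A trained boundary prediction module would replace this step.
logits = 20.0 * np.abs(np.gradient(lines[..., 0], axis=1))
prob, shift = expected_boundary_offset(logits, offsets)

# Moving each contour point by its predicted shift lands near the true boundary.
moved = centers + shift[:, None] * normals
radius = np.linalg.norm(moved - 32.0, axis=1)
print(shift.round(2), radius.round(2))
```

In this toy setting every line reports an outward shift of about 3 px, which is the gap between the initial and the true contour. Signed per-point displacements of this kind are the sort of signal a pose optimizer could consume as residuals.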