PVO: Panoptic Visual Odometry

arXiv 2022

Weicai Ye*1, Xinyue Lan*1, Shuo Chen1, Yuhang Ming2, Xinyuan Yu1,3, Hujun Bao1, Zhaopeng Cui1, Guofeng Zhang1†

1State Key Lab of CAD & CG, Zhejiang University   2Visual Information Laboratory, University of Bristol   3Wuhan University
* denotes equal contribution
† denotes corresponding author



We present a novel panoptic visual odometry framework, termed PVO, to achieve a more comprehensive modeling of the scene's motion, geometry, and panoptic segmentation information. PVO models visual odometry (VO) and video panoptic segmentation (VPS) in a unified view, enabling the two tasks to facilitate each other. Specifically, we introduce a panoptic update module into the VO module, which operates on the image panoptic segmentation. This Panoptic-Enhanced VO module can reduce the interference of dynamic objects in camera pose estimation by adjusting the weights of the optimized camera poses. On the other hand, the VO-Enhanced VPS module improves the segmentation accuracy by fusing the panoptic segmentation result of the current frame into the adjacent frames on the fly, using geometric information such as camera pose, depth, and optical flow obtained from the VO module. These two modules contribute to each other through a recurrent iterative optimization. Extensive experiments demonstrate that PVO outperforms state-of-the-art methods in both visual odometry and video panoptic segmentation tasks.
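The weight-adjustment idea can be illustrated with a toy example. The sketch below is not the PVO implementation: it replaces the pose optimization with a weighted least-squares estimate of a single 2D image translation, and the names `adjust_weights` and `weighted_translation` are hypothetical. It only shows how down-weighting pixels that a segmentation marks as dynamic removes their influence on a motion estimate.

```python
import numpy as np

def adjust_weights(confidence, dynamic_prob):
    """Scale per-pixel confidence by (1 - probability that the pixel is dynamic).

    Hypothetical helper: the actual weighting rule in PVO is not specified here.
    """
    confidence = np.asarray(confidence, dtype=np.float64)
    dynamic_prob = np.clip(np.asarray(dynamic_prob, dtype=np.float64), 0.0, 1.0)
    return confidence * (1.0 - dynamic_prob)

def weighted_translation(flow, weights, eps=1e-12):
    """Weighted least-squares estimate of one 2D translation from a flow field.

    A toy stand-in for pose optimization, used only for illustration.
    """
    w = np.asarray(weights, dtype=np.float64).reshape(-1, 1)
    f = np.asarray(flow, dtype=np.float64).reshape(-1, 2)
    return (w * f).sum(axis=0) / np.maximum(w.sum(), eps)

# Static background moves by (1, 0); a 2x2 moving object moves by (5, 5).
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
flow[1:3, 1:3] = [5.0, 5.0]

confidence = np.ones((4, 4))
dynamic_prob = np.zeros((4, 4))
dynamic_prob[1:3, 1:3] = 1.0  # segmentation marks the object as dynamic

naive = weighted_translation(flow, confidence)
robust = weighted_translation(flow, adjust_weights(confidence, dynamic_prob))
```

With uniform weights the moving object biases the estimate away from the background motion, while the adjusted weights recover it.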

System overview


Panoptic Visual Odometry Architecture. Our method consists of three modules, namely, an image panoptic segmentation module for system initialization (blue), a Panoptic-Enhanced VO module (orange), and a VO-Enhanced VPS module (red). The last two modules contribute to each other in a recurrent iterative manner.


We compare our method with several state-of-the-art methods on both tasks. For visual odometry, we conduct experiments on three datasets with dynamic scenes (VKITTI2, KITTI, and the TUM RGBD dynamic sequences) and evaluate the accuracy of the camera trajectory, primarily using Absolute Trajectory Error. For video panoptic segmentation, we evaluate the VPQ metric used in FuseTrack on the VKITTI2, Cityscapes, and VIPER datasets.
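For reference, Absolute Trajectory Error is commonly reported as the RMSE of the positional differences between estimated and ground-truth camera positions after aligning the two trajectories. The following is a minimal sketch assuming a similarity (scale, rotation, translation) alignment in the style of Umeyama, which is a usual choice for monocular methods; the function names are hypothetical and the exact evaluation protocol used for PVO is not stated here.

```python
import numpy as np

def umeyama_align(src, dst):
    """Align src (N, 3) to dst (N, 3) with a least-squares similarity transform."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n = src.shape[0]
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    cov = dst_c.T @ src_c / n                 # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # avoid a reflection

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    scale = (D * np.diag(S)).sum() / var_src
    t = mu_dst - scale * R @ mu_src
    return scale * src @ R.T + t

def ate_rmse(est, gt, align=True):
    """RMSE of position errors, optionally after similarity alignment."""
    est = np.asarray(est, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if align:
        est = umeyama_align(est, gt)
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))

# Synthetic check: an estimate that differs from ground truth only by a
# similarity transform should have (numerically) zero ATE after alignment.
rng = np.random.default_rng(0)
gt = rng.normal(size=(50, 3))
c, s = np.cos(0.3), np.sin(0.3)
R0 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
est = 2.0 * gt @ R0.T + np.array([1.0, -2.0, 0.5])
error_aligned = ate_rmse(est, gt)
```

Passing `align=False` compares the raw positions directly, which is only meaningful when both trajectories already share a frame and scale.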

  • Visual Odometry

  • As shown in Tab. 1 and Fig. 6, our PVO outperforms DROID-SLAM by a large margin on all sequences except vkitti02. As shown in Fig. 7, our pose estimation error is nearly half that of DROID-SLAM, which demonstrates the good generalization ability of PVO. Tab. 2 shows that our method performs better than DROID-SLAM on all datasets, and we achieve the best results on 5 out of 9 datasets. Note that PointCorr is a state-of-the-art RGB-D SLAM method using Point Correlation, while ours uses only monocular RGB video.

  • Video Panoptic Segmentation

  • We observe that our method with PanopticFCN outperforms the state-of-the-art method, achieving 1.6% higher VPQ than VPSNet-Track on the Cityscapes-Val dataset. Compared with VPSNet-FuseTrack, our method with PanopticFCN achieves much higher scores (51.5 VPQ vs. 48.4 VPQ) on the VIPER dataset. As shown in Tab. 5, the VO-Enhanced VPS module is effective in improving segmentation accuracy and tracking consistency.
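As a rough illustration of how geometric information can carry a segmentation from one frame to the next, the sketch below forward-warps a panoptic label map with a dense optical flow field and then fills the holes from a per-frame prediction. It is a simplified stand-in under stated assumptions, not the VO-Enhanced VPS module itself: the functions `warp_labels` and `fuse_labels` and the fusion rule (prefer the warped label wherever one exists) are hypothetical.

```python
import numpy as np

def warp_labels(labels, flow, fill=-1):
    """Forward-warp an integer label map (H, W) with flow (H, W, 2) = (dx, dy).

    Pixels that receive no label are set to `fill`. If several source pixels
    land on the same target, the last one written wins.
    """
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs + flow[..., 0]).astype(int)
    yt = np.rint(ys + flow[..., 1]).astype(int)
    valid = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)

    warped = np.full((h, w), fill, dtype=labels.dtype)
    warped[yt[valid], xt[valid]] = labels[valid]
    return warped

def fuse_labels(warped, current, fill=-1):
    """Keep the warped label where available, else the per-frame prediction."""
    return np.where(warped != fill, warped, current)

# Instance 7 occupies the first column and everything moves one pixel right.
labels_t = np.zeros((3, 4), dtype=np.int64)
labels_t[:, 0] = 7
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0

warped = warp_labels(labels_t, flow)
current = np.full((3, 4), 3, dtype=np.int64)  # per-frame prediction at t+1
fused = fuse_labels(warped, current)
```

Keeping the instance id fixed while it moves is what gives temporally consistent ids; a real system would also need to handle occlusions and flow errors, which this sketch ignores.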

    PVO Demo

    Panoptic Visual Odometry takes a monocular video as input and outputs a panoptic 3D map while simultaneously localizing the camera with respect to that map. We show the panoptic 3D map produced by our method. The red triangle indicates the camera pose, and different colors indicate different instances.


        title={PVO: Panoptic Visual Odometry},
        author={Ye, Weicai and Lan, Xinyue and Chen, Shuo and Ming, Yuhang and Yu, Xinyuan and Bao, Hujun and Cui, Zhaopeng and Zhang, Guofeng},


    This work was partially supported by NSF of China (No. 61932003) and ZJU-SenseTime Joint Lab of 3D Vision.