You Don't Only Look Once: Constructing Spatial-Temporal Memory
for Integrated 3D Object Detection and Tracking

ICCV 2021

Jiaming Sun1,2*, Yiming Xie1*, Siyu Zhang2, Linghao Chen1, Guofeng Zhang1, Hujun Bao1, Xiaowei Zhou1

1State Key Lab of CAD & CG, Zhejiang University    2SenseTime Research
* denotes equal contribution


TL;DR: By leveraging spatial and temporal information, UDOLO jointly detects and tracks 3D objects without searching for objects in each frame from scratch.

Humans are able to continuously detect and track surrounding objects by constructing a spatial-temporal memory of the objects while looking around. In contrast, 3D object detectors in existing tracking-by-detection systems often search for objects in every new video frame from scratch, without fully leveraging memory from previous detection results. In this work, we propose a novel system for integrated 3D object detection and tracking, which uses a dynamic object occupancy map and previous object states as spatial-temporal memory to assist object detection in future frames. This memory, together with the ego-motion from back-end odometry, guides the detector to achieve more efficient object proposal generation and more accurate object state estimation. Experiments on the ScanNet and KITTI datasets demonstrate the effectiveness of the proposed system. Moreover, the proposed system produces stable bounding boxes and pose trajectories over time, and is able to handle occluded and truncated objects.
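The occupancy-map-guided proposal generation described above can be sketched as a simple voxel lookup: propose only where the memory says objects are likely, or where the scene has never been observed. This is a minimal illustrative sketch; the function name, the dictionary-based occupancy map, and the thresholding rule are assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_proposal_points(points, occ_map, voxel_size=0.5, occ_thresh=0.5):
    """Pick the points where the RPN should generate proposals: points in
    voxels with high object occupancy (remembered objects, "red points")
    plus points in voxels never observed before ("blue points"), where
    new objects may appear.

    occ_map: dict mapping voxel index tuples -> occupancy score in [0, 1];
    voxels absent from the map are treated as unobserved.
    (Hypothetical data layout for illustration only.)
    """
    keep = []
    for i, p in enumerate(points):
        v = tuple(np.floor(p / voxel_size).astype(int))
        if v not in occ_map:             # unobserved region: new objects possible
            keep.append(i)
        elif occ_map[v] > occ_thresh:    # high occupancy: remembered object
            keep.append(i)
    return np.asarray(keep, dtype=int)
```

Points in observed, low-occupancy voxels are skipped entirely, which is what makes proposal generation cheaper than running the RPN over the full point cloud each frame.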

Overview video (5 min)


Pipeline overview

UDOLO Architecture

At each time step, the front-end region proposal network (OOM-Guided RPN) takes the point cloud as input and extracts current-frame object proposals only in regions with high object occupancy scores (red points) according to the object occupancy map, as well as in unobserved regions (blue points) where new objects may appear. These proposals are then fused with the back-end's predicted object states from the last frame and passed through the second stage of the detector, Fusion R-CNN. After association with existing tracklets, the current front-end predictions are fed into the Kalman filter to produce the fused object states as the final bounding box predictions. The object occupancy map is then updated according to the future object states given by the motion prediction module. Solid arrows denote the major data flow; red arrows denote the feedback mechanism.
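The back-end fusion step above, where a tracklet's motion prediction is combined with the associated front-end detection, can be sketched with a standard constant-velocity Kalman filter over the box center. This is a minimal sketch under assumed state layout and noise values; the class name, the center-only observation model, and all constants are illustrative, not the paper's implementation.

```python
import numpy as np

class KalmanBoxTracker:
    """Constant-velocity Kalman filter over a 3D box center (x, y, z)."""

    def __init__(self, center, dt=1.0):
        # State: [x, y, z, vx, vy, vz]
        self.x = np.hstack([center, np.zeros(3)])
        self.P = np.eye(6) * 10.0                # initial state covariance
        self.F = np.eye(6)                       # constant-velocity motion model
        self.F[:3, 3:] = np.eye(3) * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # observe center only
        self.Q = np.eye(6) * 0.01                # process noise (assumed)
        self.R = np.eye(3) * 0.1                 # measurement noise (assumed)

    def predict(self):
        """Motion prediction: propagate the state to the next frame."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]                        # predicted future center

    def update(self, z):
        """Fuse the current front-end detection z with the predicted state."""
        y = z - self.H @ self.x                  # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S) # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]                        # fused center
```

Per frame, `predict()` supplies the future object state used both for fusion and for updating the occupancy map, while `update()` produces the fused state that becomes the final bounding box estimate.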


Citation

@inproceedings{sun2021udolo,
  title={{You Don't Only Look Once}: Constructing Spatial-Temporal Memory for Integrated 3D Object Detection and Tracking},
  author={Sun, Jiaming and Xie, Yiming and Zhang, Siyu and Zhang, Guofeng and Bao, Hujun and Zhou, Xiaowei},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2021}
}


We would like to specially thank Reviewers 2 and 3 for their insightful and constructive comments.