You Don't Only Look Once: Constructing Spatial-Temporal Memory
for Integrated 3D Object Detection and Tracking

ICCV 2021

Jiaming Sun¹,², Yiming Xie¹, Siyu Zhang², Linghao Chen¹, Guofeng Zhang¹, Hujun Bao¹, Xiaowei Zhou¹

¹State Key Lab of CAD & CG, Zhejiang University ²SenseTime Research
* denotes equal contribution

Abstract

TL;DR: By leveraging spatial and temporal information, UDOLO jointly detects and tracks 3D objects without searching for objects from scratch in every frame.

Humans are able to continuously detect and track surrounding objects by constructing a spatial-temporal memory of the objects when looking around. In contrast, 3D object detectors in existing tracking-by-detection systems often search for objects in every new video frame from scratch, without fully leveraging memory from previous detection results. In this work, we propose a novel system for integrated 3D object detection and tracking, which uses a dynamic object occupancy map and previous object states as spatial-temporal memory to assist object detection in future frames. This memory, together with the ego-motion from back-end odometry, guides the detector to achieve more efficient object proposal generation and more accurate object state estimation. Experiments on the ScanNet and KITTI datasets demonstrate the effectiveness of the proposed system. Moreover, the proposed system produces stable bounding boxes and pose trajectories over time, while being able to handle occluded and truncated objects.

Pipeline overview

At each time step, the front-end region proposal network, OOM-Guided RPN, takes the point cloud as input and extracts current-frame object proposals only in regions with high object occupancy scores (red points), as given by the object occupancy map (OOM), and in unobserved regions (blue points), where new objects may appear. These proposals are then fused with the back-end future-state predictions made at the previous frame and passed through the second stage of the detector, Fusion R-CNN. After association with the tracklets, the current front-end predictions are fed into a Kalman filter, which produces the fused object states as the final bounding box predictions. The object occupancy map is then updated according to the future object states given by the motion prediction module. Solid arrows denote the major data flow; red arrows denote the feedback mechanism.
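
To make two steps of the pipeline concrete, here is a minimal NumPy sketch. It is not the UDOLO implementation. It assumes the occupancy map is a dense voxel grid of scores with a boolean "observed" mask, that points falling outside the grid count as unobserved, and that the tracked state is just a 3D box center with an identity measurement model. The function names select_points and kalman_update, the threshold value, and the grid layout are all hypothetical choices made for illustration.

```python
import numpy as np


def select_points(points, occupancy, observed, origin, voxel_size, score_thresh=0.5):
    """Keep points that lie in high-occupancy or unobserved voxels.

    points:    (N, 3) array of xyz coordinates.
    occupancy: (X, Y, Z) array of object occupancy scores (assumed layout).
    observed:  (X, Y, Z) boolean array, True where the voxel has been seen.
    """
    idx = np.floor((points - origin) / voxel_size).astype(int)
    shape = np.array(occupancy.shape)
    inside = np.all((idx >= 0) & (idx < shape), axis=1)

    # Assumption: points outside the grid are treated as unobserved, so kept.
    keep = ~inside
    i, j, k = idx[inside].T
    keep[inside] = (occupancy[i, j, k] > score_thresh) | ~observed[i, j, k]
    return points[keep]


def kalman_update(x_pred, P_pred, z, R):
    """Standard Kalman measurement update with measurement matrix H = I.

    x_pred, P_pred: predicted state and covariance (back-end prediction).
    z, R:           current measurement and its noise covariance (front-end).
    """
    S = P_pred + R                      # innovation covariance
    K = P_pred @ np.linalg.inv(S)       # Kalman gain
    x = x_pred + K @ (z - x_pred)       # fused state
    P = (np.eye(len(x_pred)) - K) @ P_pred
    return x, P
```

In this sketch the detector would only run on the output of select_points, which is where the efficiency gain described above would come from, and kalman_update would be called once per associated tracklet. A full system would also track box size, orientation, and velocity in the state, which is omitted here.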