# SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention (ECCV 2022)
This is the official repository for SpatialDETR, which will be published at ECCV 2022.
*(Demo video: `scene_26a6b03c8e2f4e6692f174a7074e54ff.mp4`)*
**Authors:** Simon Doll, Richard Schulz, Lukas Schneider, Viviane Benzin, Markus Enzweiler, Hendrik P.A. Lensch
Based on the key idea of DETR, this paper introduces an object-centric 3D object detection framework that operates on a limited number of 3D object queries instead of dense bounding box proposals followed by non-maximum suppression. After image feature extraction, a decoder-only transformer architecture is trained with a set-based loss. SpatialDETR infers the classification and bounding box estimates based on attention both spatially within each image and across the different views. To fuse the multi-view information in the attention block, we introduce a novel geometric positional encoding that incorporates the view ray geometry to explicitly consider the extrinsic and intrinsic camera setup. This way, the spatially-aware cross-view attention exploits arbitrary receptive fields to integrate cross-sensor data and therefore global context. Extensive experiments on the nuScenes benchmark demonstrate the potential of global attention and result in state-of-the-art performance.
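The snippet below is a minimal sketch of the core idea, not the repository's actual implementation (the module name `RayPositionalEncoding` and its interface are illustrative). It shows how per-pixel view rays, derived from the camera intrinsics and extrinsics, can be embedded and added to the image features so that cross-attention between 3D object queries and multi-view features becomes geometry-aware:

```python
# Minimal sketch (illustrative, not the repository's code): embed the
# viewing ray of every image feature so that attention across cameras
# can reason about the extrinsic / intrinsic sensor setup.
import torch
import torch.nn as nn


class RayPositionalEncoding(nn.Module):
    """Embeds per-pixel view-ray directions into the attention feature space."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, intrinsics: torch.Tensor, cam_to_ref: torch.Tensor,
                height: int, width: int) -> torch.Tensor:
        # Homogeneous pixel coordinates (u, v, 1) on the feature-map grid.
        v, u = torch.meshgrid(
            torch.arange(height, dtype=torch.float32),
            torch.arange(width, dtype=torch.float32),
            indexing="ij",
        )
        pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)  # (H, W, 3)
        # Unproject pixels to rays in the camera frame (row-vector form),
        # rotate them into the shared reference frame, and normalize.
        rays_cam = pix @ torch.linalg.inv(intrinsics).T         # K^-1 [u v 1]^T
        rays_ref = rays_cam @ cam_to_ref[:3, :3].T              # apply rotation
        rays_ref = rays_ref / rays_ref.norm(dim=-1, keepdim=True)
        return self.mlp(rays_ref)                               # (H, W, C)


# Usage: add the encoding to each camera's feature map before it serves
# as keys/values in the decoder's cross-attention.
enc = RayPositionalEncoding(embed_dim=256)
K = torch.tensor([[500.0, 0.0, 20.0], [0.0, 500.0, 12.5], [0.0, 0.0, 1.0]])
pos = enc(K, torch.eye(4), height=25, width=40)
```

Because every feature carries an embedding of where it "looks" in the shared reference frame, a query can attend to features from any camera, which is what enables the global cross-sensor receptive field described above.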
If you find this repository useful, please cite:
```bibtex
@inproceedings{Doll2022ECCV,
  author    = {Doll, Simon and Schulz, Richard and Schneider, Lukas and Benzin, Viviane and Enzweiler, Markus and Lensch, Hendrik P.A.},
  title     = {SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2022}
}
```
To set up the repository and run trainings, please refer to getting_started.md.
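As a rough orientation (the authoritative steps live in getting_started.md), the repository follows the usual mmdetection3d plugin workflow of building a model from one of the config files listed in the results table below. A minimal sketch, with an illustrative config path:

```python
# Minimal sketch of the standard mmdetection3d (rc1.0-style) workflow.
# The config path is illustrative; see getting_started.md and the table
# below for the configs actually shipped with the repository.
from mmcv import Config
from mmdet3d.models import build_model

cfg = Config.fromfile("configs/query_proj_value_proj.py")  # illustrative path
model = build_model(
    cfg.model,
    train_cfg=cfg.get("train_cfg"),
    test_cfg=cfg.get("test_cfg"),
)
model.init_weights()
# Training / evaluation are then typically launched through the standard
# mmdetection3d entry points (tools/train.py / tools/test.py).
```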
- We moved the codebase to the new coordinate conventions of mmdetection3d rc1.0.
- The performance might vary slightly compared to the original runs on mmdetection3d 0.17 reported in the paper.
The baseline models have been trained on 4x V100 GPUs, the submission models on 8x A100 GPUs. For more details we refer to the corresponding configuration / log files. Keep in mind that the performance can vary between runs and that the current codebase uses mmdetection3d rc1.0.
Config | Logfile | Set | #GPUs | mmdet3d | mAP | ATE | ASE | AOE | AVE | AAE | NDS |
---|---|---|---|---|---|---|---|---|---|---|---|
query_proj_value_proj.py (baseline) | log / model | val | 4 | rc1.0 | 0.315 | 0.843 | 0.279 | 0.497 | 0.787 | 0.208 | 0.396 |
query_proj_value_proj.py | log | val | 4 | 0.17 | 0.313 | 0.850 | 0.274 | 0.494 | 0.814 | 0.213 | 0.392 |
query_center_proj_no_value_proj_shared.py | log | val | 8 | 0.17 | 0.351 | 0.772 | 0.274 | 0.395 | 0.847 | 0.217 | 0.425 |
query_center_proj_no_value_proj_shared_cbgs_vovnet_trainval.py | log | test | 8 | 0.17 | 0.425 | 0.614 | 0.253 | 0.402 | 0.857 | 0.131 | 0.487 |
For license details, see license_infos.md.
This repo contains the implementation of SpatialDETR. Our implementation is a plugin to MMDetection3D and also uses a fork of DETR3D. Full credits belong to the contributors of those frameworks, and we truly thank them for enabling our research!