IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection

This repository contains the PyTorch implementation of the CVPR'2024 paper (Highlight), IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection. This work simultaneously models instance-level and scene-level multimodal contexts to enhance 3D detection performance.

Updates

[2024.4.30] Code of IS-Fusion is released.

Abstract

Bird’s eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation typically exhibit small sizes, and the associated point cloud context is inherently sparse, which leads to great challenges for reliable 3D perception. In this paper, we propose IS-FUSION, an innovative multimodal fusion framework that jointly captures the Instance- and Scene-level contextual information. IS-FUSION essentially differs from existing approaches that only focus on the BEV scene-level fusion by explicitly incorporating instance-level multimodal information, thus facilitating the instance-centric tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-FUSION outperforms all the published multimodal works to date.

Citation

If you find this project helpful, please cite our paper:

@inproceedings{yin2024isfusion,
  title={IS-FUSION: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection},
  author={Yin, Junbo and Shen, Jianbing and Chen, Runnan and Li, Wei and Yang, Ruigang and Frossard, Pascal and Wang, Wenguan},
  booktitle={CVPR},
  year={2024}
}

Main Results

3D object detection results on the nuScenes dataset (L = LiDAR, C = camera).

Method	Modality	mAP (val)	NDS (val)	mAP (test)	NDS (test)
TransFusion-L (Baseline)	L	65.1	70.1	65.5	70.2
TransFusion-LC	L+C	67.5	71.3	68.9	71.7
BEVFusion	L+C	68.5	71.4	70.2	72.9
IS-Fusion (Ours)	L+C	72.8	74.0	73.0	75.2

Use IS-Fusion

Installation

This project is based on torch 1.10.1, mmdet 2.14.0, mmcv 1.4.0 and mmdet3d 0.16.0. Please install mmdet3d following getting_started.md. In addition, install TorchEx by running cd mmdet3d/ops/TorchEx followed by pip install -v . (the trailing dot is part of the command).
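The installation steps above can be sketched as a small script. This is only a sketch under stated assumptions: the version pins come from the text, but the package name mmcv-full, the editable install of mmdet3d from the repository root, and the helper names run, run_in and install_isfusion are assumptions that are not taken from getting_started.md. The plain torch pin also ignores the CUDA-specific wheel you may need. Follow getting_started.md where it differs.

```shell
#!/usr/bin/env bash
# Version pins taken from the Installation text.
TORCH_VERSION="1.10.1"
MMCV_VERSION="1.4.0"
MMDET_VERSION="2.14.0"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: like run, but inside a sub-directory.
# A subshell keeps the caller's working directory unchanged.
run_in() {
  dir="$1"; shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ (cd ${dir} && $*)"
  else
    (cd "$dir" && "$@")
  fi
}

# Run from the repository root. Stops at the first failing step.
install_isfusion() {
  run pip install "torch==${TORCH_VERSION}" &&
  run pip install "mmcv-full==${MMCV_VERSION}" &&   # assumed package name for mmcv 1.x with ops
  run pip install "mmdet==${MMDET_VERSION}" &&
  run pip install -v -e . &&                        # assumed: mmdet3d installed from this repo
  run_in mmdet3d/ops/TorchEx pip install -v .       # TorchEx, as described above
}
```

Setting DRY_RUN=1 before calling install_isfusion prints the five commands without running them, which is a cheap way to review the plan before touching the environment.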

Dataset Preparation

Please refer to data_preparation.md to prepare the nuScenes dataset.

python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes --version v1.0
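Before running the command above, it can help to confirm that the raw dataset is where --root-path points. The check below is a minimal sketch: the sub-directory names (maps, samples, sweeps, v1.0-trainval) follow the standard nuScenes release layout and are an assumption about what this repository expects, and check_nuscenes_root is a hypothetical helper name. If you also prepare the test split, a v1.0-test directory would be needed as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that the expected nuScenes sub-directories exist.
# Returns 0 when all are present, 1 otherwise, and lists what is missing on stderr.
check_nuscenes_root() {
  root="${1:-./data/nuscenes}"
  missing=0
  # Assumed layout: standard nuScenes release directories.
  for sub in maps samples sweeps v1.0-trainval; do
    if [ ! -d "${root}/${sub}" ]; then
      echo "missing: ${root}/${sub}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_nuscenes_root ./data/nuscenes && python tools/create_data.py nuscenes ... only starts the preparation step when the layout check passes.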

Training and Evaluation

We provide the multimodal 3D detection config in isfusion_0075voxel.py. IS-FUSION uses a two-stage training paradigm that only requires a pretrained image model and converges in just 10 epochs. This is more efficient than other multimodal detection approaches such as BEVFusion, which typically require a three-stage training strategy (image stage, LiDAR stage and fusion stage). Start training and evaluation by running:

bash tools/run-nus.sh extra-tag

To obtain detection results using the pretrained model, run the following command:

bash tools/dist_test.sh configs/isfusion/isfusion_0075voxel.py path_to_ckpt_directory 1 --eval bbox
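The evaluation command above can be wrapped so that a mistyped checkpoint path fails fast instead of after the distributed launcher starts. This is a minimal sketch: eval_isfusion is a hypothetical helper name, and the argument order (config, checkpoint, GPU count) simply mirrors the command shown above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the evaluation command from this README.
# Usage: eval_isfusion <checkpoint_path> [num_gpus]
eval_isfusion() {
  ckpt="$1"
  gpus="${2:-1}"
  config="configs/isfusion/isfusion_0075voxel.py"

  # Fail early when the checkpoint path does not exist.
  if [ ! -e "$ckpt" ]; then
    echo "checkpoint not found: $ckpt" >&2
    return 1
  fi

  # Same invocation as in the text, with the arguments filled in.
  bash tools/dist_test.sh "$config" "$ckpt" "$gpus" --eval bbox
}
```

Run it from the repository root, since both the config path and tools/dist_test.sh are relative.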

Pretrained Models

Models	Link
Pretrained Image Backbone	https://drive.google.com/file/d/1k3Eiy5SeeAt36SJVcVwpEUBtal8Uiz9P/view?usp=sharing
Pretrained IS-Fusion	https://drive.google.com/file/d/1mY2juJ2n0Dw5NWDSraZXrdU1RwkE-40h/view?usp=sharing
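On a headless machine the two Google Drive links above cannot be opened in a browser. The sketch below extracts the file ID from a share link and hands it to gdown, a third-party downloader that is not mentioned in this README. Its use here, the output file names, and the helper names drive_file_id and fetch_drive_file are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the file ID embedded in a Google Drive share link
# of the form https://drive.google.com/file/d/<ID>/view?...
drive_file_id() {
  printf '%s\n' "$1" | sed -n 's#.*/file/d/\([^/]*\)/.*#\1#p'
}

# Hypothetical helper: download a share link to a local path.
# Assumes the third-party gdown tool is installed (pip install gdown).
fetch_drive_file() {
  id="$(drive_file_id "$1")"
  if [ -z "$id" ]; then
    echo "could not parse a file id from: $1" >&2
    return 1
  fi
  gdown "https://drive.google.com/uc?id=${id}" -O "$2"
}
```

A call might look like fetch_drive_file "<share link from the table>" isfusion.pth, where isfusion.pth is a placeholder name of your choosing.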

License

This project is released under the MIT license; see LICENSE.

Acknowledgement

Our project is partially supported by the following codebases. We would like to thank their authors for their contributions.
