By Wen Wang, Yang Cao, Jing Zhang, and Dacheng Tao
This repository is the official implementation of FP-DETR from the paper FP-DETR: Detection Transformer Advanced by Fully Pre-training, accepted at ICLR 2022.
Abstract. Large-scale pre-training has proven effective for visual representation learning on downstream tasks, especially for improving robustness and generalization. However, recently developed detection transformers only employ pre-training on the backbone, leaving the key component, i.e., a 12-layer transformer, trained from scratch, which prevents the model from enjoying the above benefits. This separated training paradigm is mainly caused by the discrepancy between the upstream and downstream tasks. To mitigate the issue, we propose FP-DETR, a new method that Fully Pre-Trains an encoder-only transformer and smoothly fine-tunes it for object detection via a task adapter. Inspired by the success of textual prompts in NLP, we treat query positional embeddings as visual prompts that help the model attend to the target area (prompting) and recognize the object. To this end, we propose a task adapter that leverages self-attention to model the contextual relations between object query embeddings. Experiments on the challenging COCO dataset demonstrate that FP-DETR achieves competitive performance. Moreover, it enjoys better robustness to common corruptions and generalization to small-size datasets than state-of-the-art detection transformers.
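The task adapter's actual implementation lives in this repo's model code; as a rough, illustrative sketch of the idea described above (self-attention modeling contextual relations between object query embeddings), consider the following. All names and dimensions here are our assumptions, not the repo's:

```python
import torch
import torch.nn as nn

class TaskAdapterSketch(nn.Module):
    """Illustrative only: self-attention over object query embeddings,
    mirroring the idea in the abstract (not the repo's actual module)."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, queries: torch.Tensor) -> torch.Tensor:
        # queries: (num_queries, batch, embed_dim), the visual "prompts"
        out, _ = self.attn(queries, queries, queries)
        # Residual connection + LayerNorm, transformer-style
        return self.norm(queries + out)

# Example: 300 object queries, batch of 2, 256-d embeddings
q = torch.randn(300, 2, 256)
print(TaskAdapterSketch()(q).shape)  # torch.Size([300, 2, 256])
```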
The experimental results and model weights trained on COCO 2017 are shown below.
Model | mAP | mAP@50 | mAP@75 | mAP@S | mAP@M | mAP@L | Model Weights |
---|---|---|---|---|---|---|---|
FP-DETR-Lite | 37.9 | 57.5 | 41.1 | 21.7 | 40.6 | 50.7 | Google Drive |
FP-DETR-Small | 42.5 | 62.6 | 45.9 | 25.3 | 45.5 | 56.9 | Google Drive |
FP-DETR-Base | 43.2 | 63.1 | 47.5 | 25.7 | 46.7 | 57.5 | Google Drive |
FP-DETR-Base_IN21K-P | 43.7 | 64.1 | 47.8 | 26.5 | 46.7 | 58.2 | Google Drive |
Note:
- The ImageNet pre-trained weights are available here.
- FP-DETR-Base_IN21K-P is pre-trained on ImageNet-21K-P, while other models are pre-trained on ImageNet-1K.
- FP-DETR-Lite and FP-DETR-Base are re-implemented, so their results differ slightly from those reported in our paper.
- All experiments are run on NVIDIA A100 GPUs.
Our implementation is based on mmdetection. Please refer to get_started.md for installation.
Note that the implementation is based on `mmdet==2.12.0` and `mmcv==1.3.2`.
```shell
pip install -r requirements/custom.txt
```
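To sanity-check that the environment matches the pinned versions, a quick verification snippet (assuming `mmdet` and `mmcv` are importable after installation):

```python
import mmcv
import mmdet

# The repo pins mmdet==2.12.0 and mmcv==1.3.2; fail fast on a mismatch.
assert mmdet.__version__.startswith("2.12"), mmdet.__version__
assert mmcv.__version__.startswith("1.3"), mmcv.__version__
print(f"mmdet {mmdet.__version__}, mmcv {mmcv.__version__}")
```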
The COCO 2017 dataset can be downloaded from here and the Cityscapes dataset can be downloaded from here. The Cityscapes annotations in COCO format can be obtained from here. Afterward, please organize the datasets and annotations as follows:
```
data
├─ cityscapes
│  ├─ leftImg8bit
│  │  ├─ train
│  │  └─ val
│  └─ annotations
│     ├─ instancesonly_filtered_gtFine_train.json
│     └─ instancesonly_filtered_gtFine_val.json
└─ coco
   ├─ annotations
   ├─ train2017
   └─ val2017
```
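For reference, this layout is what a standard mmdetection-style data config points at; below is a minimal illustrative fragment for the COCO part (field values are our assumptions; the configs under `configs/fp_detr/` are authoritative, and pipelines are omitted here):

```python
# Illustrative mmdetection-style data config matching the COCO layout above.
data_root = 'data/coco/'
data = dict(
    train=dict(
        type='CocoDataset',
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/'),
    val=dict(
        type='CocoDataset',
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/'))
```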
Please download the ImageNet pre-trained weights here and organize them as follows:
```
pretrained_weights
├─ fp-detr-lite
│  └─ pretrained_epoch_299.pth
├─ fp-detr-small
│  └─ pretrained_epoch_299.pth
├─ fp-detr-base
│  └─ pretrained_epoch_299.pth
└─ fp-detr-base_21k-p
   └─ pretrained_epoch_299.pth
```
Note that in each folder, `pretrained_epoch_299.pth` is obtained by running `postprocess_ckpt.py` on `epoch_299.pth`; only `pretrained_epoch_299.pth` is needed for fine-tuning.
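`postprocess_ckpt.py` ships with this repo; conceptually, such post-processing typically just extracts the model weights from a full training checkpoint and drops the optimizer state. A hypothetical sketch, not the repo's actual script:

```python
import torch

# Hypothetical stand-in for postprocess_ckpt.py: keep only the model
# weights from a full training checkpoint for use in fine-tuning.
ckpt = torch.load('epoch_299.pth', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)  # training ckpts nest weights here
torch.save({'state_dict': state_dict}, 'pretrained_epoch_299.pth')
```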
```shell
./tools/dist_train.sh configs/fp_detr/fp-detr-lite_in1k.py 8
./tools/dist_train.sh configs/fp_detr/fp-detr-small_in1k.py 8
./tools/dist_train.sh configs/fp_detr/fp-detr-base_in1k.py 8
./tools/dist_train.sh configs/fp_detr/fp-detr-base_in21k-p.py 8
```
```shell
./tools/dist_train.sh configs/fp_detr_city/fp-detr-lite_in1k_city.py 4
```
Better performance can be achieved on Cityscapes by training with a batch size of 4.
You can download the trained model (links are in the "Main Results" section above), then run the following command to evaluate it on the validation set:
```shell
./tools/dist_test.sh <path to config> <path to pre-trained model> <num gpus> --eval bbox
```
e.g., to evaluate the trained FP-DETR-Lite on the COCO 2017 validation set with 8 GPUs:
```shell
./tools/dist_test.sh configs/fp_detr/fp-detr-lite_in1k.py work_dirs/fp-detr-lite_in1k/epoch_50.pth 8 --eval bbox
```
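For quick single-image inference outside `dist_test.sh`, mmdetection's high-level Python API can also load the trained checkpoint (the demo image path below is an assumption):

```python
from mmdet.apis import init_detector, inference_detector

# Paths follow the evaluation example above; the image is an assumed input.
config = 'configs/fp_detr/fp-detr-lite_in1k.py'
checkpoint = 'work_dirs/fp-detr-lite_in1k/epoch_50.pth'

model = init_detector(config, checkpoint, device='cuda:0')
result = inference_detector(model, 'demo/demo.jpg')  # per-class bbox arrays
model.show_result('demo/demo.jpg', result, out_file='result.jpg')
```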
This project is released under the Apache 2.0 license.
If you find FP-DETR useful in your research, please consider citing:
@inproceedings{
wang2022fpdetr,
title={{FP}-{DETR}: Detection Transformer Advanced by Fully Pre-training},
author={Wen Wang and Yang Cao and Jing Zhang and Dacheng Tao},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=yjMQuLLcGWK}
}