FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors

This repository is the official implementation of the paper Feature Aggregated Queries for Transformer-based Video Object Detectors.

Installation

The codebase is built on top of Deformable DETR.

Requirements

Linux, CUDA>=9.2, GCC>=5.4
Python>=3.7

We recommend using Anaconda to create a conda environment:
```
conda create -n FAQ python=3.7 pip
```
Then, activate the environment:
```
conda activate FAQ
```
PyTorch>=1.5.1, torchvision>=0.6.1 (following the instructions here)

For example, if your CUDA version is 9.2, you could install PyTorch and torchvision as follows:
```
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch
```
Other requirements
```
pip install -r requirements.txt
```
Build MultiScaleDeformableAttention
```
cd ./models/ops
sh ./make.sh
```
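
Before building the ops, it can help to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch, not part of this repository: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and it only compares version numbers (it does not check CUDA or GCC).

```python
import importlib
import platform
import re

# Minimum versions taken from the Requirements list above.
MINIMUMS = {"torch": "1.5.1", "torchvision": "0.6.1"}


def parse_version(text):
    """Turn a string such as '1.5.1+cu92' into (1, 5, 1)."""
    parts = []
    for piece in str(text).split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so '3.7' compares equal to '3.7.0'
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Map each requirement name to True/False (False if not installed)."""
    report = {"python": meets_minimum(platform.python_version(), "3.7")}
    for name, minimum in MINIMUMS.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = False
            continue
        report[name] = meets_minimum(module.__version__, minimum)
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Running it inside the activated FAQ environment prints one line per requirement.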

Usage

Dataset preparation

Please download the ILSVRC2015 DET and ILSVRC2015 VID datasets from here. Then convert the JSON annotations of the two datasets using the code; the joint JSON of the two datasets is provided. After that, we recommend symlinking the path to the datasets to datasets/. The path structure should be as follows:

code_root/
└── data/
    └── vid/
        ├── Data/
        │   ├── VID/
        │   └── DET/
        └── annotations/
            ├── imagenet_vid_train.json
            ├── imagenet_vid_train_joint_30.json
            └── imagenet_vid_val.json
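
The layout above can be checked before training starts. The following is a minimal sketch, not part of this repository: the function name `missing_dataset_paths` is hypothetical, and the expected paths are copied from the tree above. Because `Path.exists` follows symlinks, a symlinked dataset directory passes the check.

```python
from pathlib import Path

# Paths relative to code_root, copied from the directory tree above.
EXPECTED = [
    "data/vid/Data/VID",
    "data/vid/Data/DET",
    "data/vid/annotations/imagenet_vid_train.json",
    "data/vid/annotations/imagenet_vid_train_joint_30.json",
    "data/vid/annotations/imagenet_vid_val.json",
]


def missing_dataset_paths(code_root):
    """Return the expected paths that do not exist under code_root."""
    root = Path(code_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths(".")
    if missing:
        print("Missing dataset paths:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Dataset layout looks complete.")
```

Run it from code_root; an empty result means every directory and annotation file in the tree was found.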

Training

We use ResNet50 and ResNet101 as the network backbones. To train FAQ with a ResNet50 backbone:

Training on single node

Train SingleBaseline. You can download the COCO pretrained weights from Deformable DETR.

GPUS_PER_NODE=8 ./tools/run_dist_launch.sh $1 r50 $2 configs/r50_train_single.sh

Train FAQ, using the model weights of SingleBaseline as the resume model.

GPUS_PER_NODE=8 ./tools/run_dist_launch.sh $1 r50 $2 configs/r50_train_multi.sh
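
Since the second stage resumes from the SingleBaseline weights, it can be convenient to locate the most recent checkpoint from the first stage. The following is a minimal sketch under stated assumptions: the function name `latest_checkpoint` is hypothetical, checkpoints are assumed to be `.pth` files, and the experiment directory passed in is whatever output directory your SingleBaseline run wrote to (this README does not state it).

```python
from pathlib import Path


def latest_checkpoint(exp_dir, pattern="*.pth"):
    """Return the most recently modified checkpoint in exp_dir, or None.

    Assumes checkpoints are saved as .pth files directly inside exp_dir.
    """
    candidates = list(Path(exp_dir).glob(pattern))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    import sys

    # Pass the SingleBaseline output directory as the first argument.
    exp_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    print(latest_checkpoint(exp_dir))
```

The returned path would then be set as the resume model wherever configs/r50_train_multi.sh expects it; the exact option is not stated here, so check the config file.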

Training on slurm cluster

If you are using a Slurm cluster, you can run the following command to train on 1 node with 8 GPUs:

GPUS_PER_NODE=8 ./tools/run_dist_slurm.sh <partition> r50 8 configs/r50_train_multi.sh

Evaluation

You can get the config file and pretrained model of FAQ (the link is in the "Main Results" section), then put the pretrained model into the corresponding folder:

code_root/
└── exps/
    └── our_models/
        ├── COCO_pretrained_model
        ├── exps_single
        └── exps_multi

Then run the following command to evaluate it on the ImageNet VID validation set:

GPUS_PER_NODE=8 ./tools/run_dist_launch.sh $1 eval_r50 $2 configs/r50_eval_multi.sh

Citing FAQ

If you find FAQ useful in your research, please consider citing:

@misc{cui2023faq,
      title={FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors}, 
      author={Yiming Cui and Linjie Yang},
      year={2023},
      eprint={2303.08319},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
