The official PyTorch implementation of our CVPR 2023 paper:
Generalized Relation Modeling for Transformer Tracking
Shenyuan Gao, Chunluan Zhou, Jun Zhang
[CVF Open Access] [ArXiv Preprint] [YouTube Video] [Trained Models] [Raw Results] [SOTA Paper List]
Compared with previous two-stream trackers, the recent one-stream tracking pipeline, which allows earlier interaction between the template and search region, has achieved a remarkable performance gain. However, existing one-stream trackers always let the template interact with all parts inside the search region throughout all the encoder layers. This could potentially lead to target-background confusion when the extracted feature representations are not sufficiently discriminative. To alleviate this issue, we propose generalized relation modeling (GRM) based on adaptive token division. The proposed method is a generalized formulation of attention-based relation modeling for Transformer tracking, which inherits the merits of both previous two-stream and one-stream pipelines whilst enabling more flexible relation modeling by selecting appropriate search tokens to interact with template tokens.
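To make the idea concrete, here is a minimal PyTorch sketch of attention with adaptive token division. Everything in it (the `DividedAttention` name, the linear gate, the Gumbel-Softmax two-way split) is our own illustrative assumption rather than the repository's actual implementation: a gate scores each search token, and only the selected tokens are allowed to exchange information with the template, while search-search and template-template attention are left untouched.

```python
# Illustrative sketch of adaptive token division in attention.
# Assumptions (NOT the repo's actual code): a linear gate produces a hard
# two-way split of search tokens via straight-through Gumbel-Softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DividedAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 2)  # per-search-token scores: interact vs. isolate

    def forward(self, x, num_t):
        # x: (B, N, C); the first num_t tokens come from the template.
        B, N, C = x.shape
        logits = self.gate(x[:, num_t:])                    # (B, N_s, 2)
        keep = F.gumbel_softmax(logits, hard=True)[..., 0]  # 1 = may interact

        # Mask only template<->search interaction; all other pairs stay visible.
        mask = x.new_ones(B, N, N)
        mask[:, :num_t, num_t:] = keep.unsqueeze(1)         # template -> search
        mask[:, num_t:, :num_t] = keep.unsqueeze(-1)        # search -> template
        mask = mask.unsqueeze(1)                            # broadcast over heads

        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each (B, H, N, C/H)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1) * mask                  # zero out blocked pairs
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because the hard split is produced with a straight-through Gumbel-Softmax and applied multiplicatively after the softmax, the gate remains trainable end to end.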
| Variant | GRM-GOT | GRM | GRM-L320 |
|---|---|---|---|
| Model Config | ViT-B, 256×256 resolution | ViT-B, 256×256 resolution | ViT-L, 320×320 resolution |
| Training Setting | GOT-10k only, 100 epochs | 4 datasets, 300 epochs | 4 datasets, 300 epochs |
| GOT-10k (AO / SR 0.5 / SR 0.75) | 73.4 / 82.9 / 70.4 | - | - |
| LaSOT (AUC / Norm P / P) | - | 69.9 / 79.3 / 75.8 | 71.4 / 81.2 / 77.9 |
| TrackingNet (AUC / Norm P / P) | - | 84.0 / 88.7 / 83.3 | 84.4 / 88.9 / 84.0 |
| AVisT (AUC / OP50 / OP75) | - | 54.5 / 63.1 / 45.2 | 55.1 / 63.8 / 46.9 |
| NfS30 (AUC) | - | 65.6 | 66.0 |
| UAV123 (AUC) | - | 70.2 | 72.2 |
Our baseline model (ViT-B backbone, 256×256 resolution) can run at 45 fps (frames per second) on a single NVIDIA GeForce RTX 3090.
It takes less than half a day to train our baseline model for 300 epochs on 8 NVIDIA GeForce RTX 3090 GPUs (24GB memory each).
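Throughput figures like this depend on hardware and measurement details. A rough, generic way to reproduce such a number (our own sketch, assuming a CUDA device; this script is not part of the repo) is:

```python
# Generic throughput sketch (not part of this repo): times repeated forward
# passes of any nn.Module on fixed inputs to estimate frames per second.
import time
import torch

@torch.no_grad()
def measure_fps(model, example_inputs, warmup=20, iters=200):
    model.eval()
    for _ in range(warmup):          # let CUDA kernels and caches settle
        model(*example_inputs)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(*example_inputs)
    torch.cuda.synchronize()         # wait for all queued kernels to finish
    return iters / (time.time() - start)
```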
- Trained Models (including the baseline model GRM, GRM-GOT, and a stronger variant GRM-L320) [download zip file]
- Raw Results (including the raw tracking results on the six datasets we benchmarked in the paper and listed above) [download zip file]

Download and unzip these two zip files into the `output` directory under the GRM project path; both can then be used directly by our code.
- Our experiments are conducted with Ubuntu 20.04 and CUDA 11.6.
- Clone our repository to your local project directory.
- Download the pre-trained weights from MAE or DeiT, and place the files into the `pretrained_models` directory under the GRM project path. You may want to try different pre-trained weights, so the links of all pre-trained models integrated in this project are listed below.

| Backbone Type | Model File | Checkpoint Link |
|---|---|---|
| 'vit_base' | 'mae_pretrain_vit_base.pth' | download |
| 'vit_large' | 'mae_pretrain_vit_large.pth' | download |
| 'vit_base' | 'deit_base_patch16_224-b5f2ef4d.pth' | download |
| 'vit_base' | 'deit_base_distilled_patch16_224-df68dfff.pth' | download |
- Download the training datasets (LaSOT, TrackingNet, GOT-10k, COCO2017) and testing datasets (NfS, UAV123, AVisT) to your disk. The organized directories should look like:

```
--LaSOT/
   |--airplane
   |...
   |--zebra
--TrackingNet/
   |--TRAIN_0
   |...
   |--TEST
--GOT10k/
   |--test
   |--train
   |--val
--COCO/
   |--annotations
   |--images
--NFS30/
   |--anno
   |--sequences
--UAV123/
   |--anno
   |--data_seq
--AVisT/
   |--anno
   |--full_occlusion
   |--out_of_view
   |--sequences
```
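An optional sanity check before moving on (our own sketch; `data_root` is a placeholder you should adjust):

```python
# Verify the expected dataset layout exists on disk (illustrative helper,
# not part of the repo).
import os

data_root = '/path/to/your/datasets'  # adjust to your disk
expected = ['LaSOT', 'TrackingNet', 'GOT10k', 'COCO', 'NFS30', 'UAV123', 'AVisT']
for name in expected:
    path = os.path.join(data_root, name)
    print(f"{name:12s} {'ok' if os.path.isdir(path) else 'MISSING'}  ({path})")
```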
- Edit the paths in `lib/test/evaluation/local.py` and `lib/train/admin/local.py` to the proper ones.
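For reference, the test-side file follows the PyTracking convention and typically looks like the excerpt below. The attribute names here are assumptions on our part; keep whatever names your copy of the file already defines and only change the path strings:

```python
# Hypothetical excerpt of lib/test/evaluation/local.py after editing
# (PyTracking-style; attribute names may differ in your copy).
from lib.test.evaluation.environment import EnvSettings

def local_env_settings():
    settings = EnvSettings()
    settings.lasot_path = '/path/to/your/datasets/LaSOT'
    settings.got10k_path = '/path/to/your/datasets/GOT10k'
    settings.trackingnet_path = '/path/to/your/datasets/TrackingNet'
    settings.results_path = './output/test/tracking_results'
    return settings
```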
- We use conda to manage the environment.

```
conda create --name grm python=3.9
conda activate grm
bash install.sh
```
- Multiple GPU training by DDP (assuming you have 8 GPUs):

```
python tracking/train.py --mode multiple --nproc 8
```
- Single GPU debugging (too slow, not recommended for training):

```
python tracking/train.py
```
- For GOT-10k evaluation, remember to set `--config vitb_256_got_ep100`.
- To pursue better performance, switch to the stronger variant by setting `--config vitl_320_ep300`.
- Make sure you have prepared the trained model.
- LaSOT

```
python tracking/test.py --dataset lasot
```

Then evaluate the raw results using the official MATLAB toolkit.
- TrackingNet

```
python tracking/test.py --dataset trackingnet
python lib/test/utils/transform_trackingnet.py
```

Then upload `test/tracking_results/grm/vitb_256_ep300/trackingnet_submit.zip` to the online evaluation server.
- GOT-10k

```
python tracking/test.py --param vitb_256_got_ep100 --dataset got10k_test
python lib/test/utils/transform_got10k.py
```

Then upload `test/tracking_results/grm/vitb_256_got_ep100/got10k_submit.zip` to the online evaluation server.
- NfS30, UAV123, AVisT

```
python tracking/test.py --dataset nfs
python tracking/test.py --dataset uav
python tracking/test.py --dataset avist
python tracking/analysis_results.py
```
- For multi-threaded inference, add `--threads 40` after `tracking/test.py` (assuming you want to use 40 threads in total).
- To show the prediction results during inference, set `settings.show_result = True` in `lib/test/evaluation/local.py` (this may be buggy on a remote server).
- Please refer to DynamicViT Example for the visualization of search token division results; a standalone alternative is sketched below.
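The following overlay sketch is ours, not from the repo: it paints a binary token-division map over a search-region crop, assuming 16×16 ViT patches on a 256×256 input. The random `image` and `keep` arrays are placeholders for a real crop and the tracker's per-token decisions:

```python
# Standalone overlay sketch (illustrative): dim the search tokens that were
# not selected to interact with the template.
import numpy as np
import matplotlib.pyplot as plt

image = np.random.rand(256, 256, 3)      # replace with a real search-region crop
keep = np.random.rand(16, 16) > 0.5      # replace with the tracker's per-token decisions

mask = np.kron(keep, np.ones((16, 16)))  # upsample the 16x16 token grid to pixels
overlay = image.copy()
overlay[mask == 0] *= 0.3                # darken tokens isolated from the template

plt.imshow(overlay)
plt.axis('off')
plt.show()
```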
❤️❤️❤️ Our idea is implemented based on the following projects. We really appreciate their excellent open-source work!
- OSTrack [related paper]
- AiATrack [related paper]
- DynamicViT [related paper]
- PyTracking [related paper]
If any part of our paper or code helps your research, please consider citing us and giving a star to our repository.
```bibtex
@inproceedings{gao2023generalized,
  title={Generalized Relation Modeling for Transformer Tracking},
  author={Gao, Shenyuan and Zhou, Chunluan and Zhang, Jun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18686--18695},
  year={2023}
}
```
If you have any questions or concerns, feel free to open an issue or contact me directly via the channels listed on my GitHub homepage. Suggestions and collaborations are also highly welcome!