By Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, Xiang Bai.
This repo holds the code for TadTR, described in the paper End-to-end temporal action detection with Transformer published in IEEE Transactions on Image Processing (TIP) 2022.
We have also explored fully end-to-end training from RGB images with TadTR. See our CVPR 2022 work E2E-TAD.
TadTR is an end-to-end Temporal Action Detection TRansformer. It has the following advantages over previous methods:
- Simple. It adopts a set-prediction pipeline and achieves TAD with a single network. It does not require a separate proposal generation stage.
- Flexible. It removes hand-crafted design such as anchor setting and NMS.
- Sparse. It produces very sparse detections (e.g. 10 on ActivityNet), thus requiring lower computation cost.
- Strong. As a self-contained temporal action detector, TadTR achieves state-of-the-art performance on HACS and THUMOS14. It is also much stronger than concurrent Transformer-based methods such as RTD-Net and AGT.
[2023.2.19] Fix a bug in loss calculation (issue #21). Thanks to @zachpvin for raising this issue!
[2022.8.7] Add support for training/testing on THUMOS14!
[2022.7.4] Glad to share that this paper will appear in IEEE Transactions on Image Processing (TIP). Although I am still busy with my thesis, I will try to make the code accessible soon. Thanks for your patience.
[2022.6] Update the technical report of this work on arxiv (now v3).
[2022.3] Our new work E2E-TAD based on TadTR is accepted to CVPR 2022. It supports fully end-to-end training from RGB images.
[2021.9.15] Update the performance on THUMOS14.
[2021.9.1] Add demo code.
[2021.7] Our revised paper was submitted to IEEE Transactions on Image Processing.
[2021.6] Our revised paper was uploaded to arxiv.
[2021.1.21] Our paper was submitted to IJCAI 2021.
- add model code
- add inference code
- add training code
- support training/inference with video input. See E2E-TAD
- HACS Segments
Method | Feature | [email protected] | [email protected] | [email protected] | Avg. mAP |
---|---|---|---|---|---|
TadTR | I3D RGB | 47.14 | 32.11 | 10.94 | 32.09 |
- THUMOS14
Method | Feature | [email protected] | [email protected] | [email protected] | [email protected] | [email protected] | Avg. mAP |
---|---|---|---|---|---|---|---|
TadTR | I3D 2stream | 74.8 | 69.1 | 60.1 | 46.6 | 32.8 | 56.7 |
- ActivityNet-1.3
Method | Feature | [email protected] | [email protected] | [email protected] | Avg. mAP |
---|---|---|---|---|---|
TadTR | TSN 2stream | 51.29 | 34.99 | 9.49 | 34.64 |
TadTR | TSP | 53.62 | 37.52 | 10.56 | 36.75 |
- Linux or Windows
- Python>=3.7
- (Optional) CUDA>=9.2, GCC>=5.4
- PyTorch>=1.5.1, torchvision>=0.6.1 (following instructions here)
- Other requirements
pip install -r requirements.txt
The RoIAlign operator is implemented as a CUDA extension.
If your machine has an NVIDIA GPU with CUDA support, run the following step to compile it. Otherwise, please set disable_cuda=True in opts.py.
cd model/ops;
# If you have multiple installations of CUDA Toolkits, you'd better add a prefix
# CUDA_HOME=<your_cuda_toolkit_path> to specify the correct version.
python setup.py build_ext --inplace
python demo.py
Currently we only support thumos14.
Download all data from [BaiduDrive(code: adTR)] or [OneDrive].
- Features: Download the I3D features I3D_2stream_Pth.tar. They were originally provided by the authors of P-GCN. I have concatenated the RGB and Flow features (dropping the tail of the longer one if the lengths are inconsistent) and converted the data to float32 precision to save space.
- Annotations: The annotations of action instances and the meta information of feature files. Both are in JSON format (th14_annotations_with_fps_duration.json and th14_i3d2s_ft_info.json).
- Pre-trained Reference Models: Our pretrained model that uses I3D features, thumos14_i3d2s_tadtr_reference.pth. This model corresponds to the config file configs/thumos14_i3d2s_tadtr.yml.
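The feature preprocessing described above (concatenate RGB and Flow, drop the tail of the longer stream, store as float32) can be sketched roughly as follows. This is only an illustration, not the script actually used to build I3D_2stream_Pth.tar: the function name fuse_two_stream, the (num_snippets, channels) array layout, and the 1024-dimensional dummy features are all assumptions.

```python
import numpy as np


def fuse_two_stream(rgb, flow):
    """Concatenate RGB and Flow features along the channel axis.

    Hypothetical helper: assumes both inputs are arrays of shape
    (num_snippets, channels).
    """
    # Keep only the snippets present in both streams,
    # i.e. drop the tail of the longer one.
    num_snippets = min(len(rgb), len(flow))
    fused = np.concatenate([rgb[:num_snippets], flow[:num_snippets]], axis=1)
    # Store in float32 precision to save space.
    return fused.astype(np.float32)


# Dummy features with inconsistent lengths (sizes are made up).
rgb = np.random.rand(100, 1024)
flow = np.random.rand(98, 1024)
fused = fuse_two_stream(rgb, flow)
print(fused.shape, fused.dtype)
```

With these dummy inputs the fused array has 98 snippets, the length of the shorter stream, and twice the channel count.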
After downloading is finished, extract the archived feature files in place with cd data;tar -xf I3D_2stream_Pth.tar. Then put the features, the annotations, and the model under the data/thumos14 directory. We expect the following structure in the root folder.
- data
  - thumos14
    - I3D_2stream_Pth
      - xxxxx
      - xxxxx
    - th14_annotations_with_fps_duration.json
    - th14_i3d2s_ft_info.json
    - thumos14_tadtr_reference.pth
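Before running the code, it may help to check that the layout above is in place. The snippet below is a small sketch for that purpose and is not part of this repo: the helper missing_entries and the EXPECTED list are hypothetical names, and only the feature folder and the two annotation files are checked.

```python
from pathlib import Path

# Entries expected under data/thumos14 (taken from the structure above).
EXPECTED = [
    "I3D_2stream_Pth",
    "th14_annotations_with_fps_duration.json",
    "th14_i3d2s_ft_info.json",
]


def missing_entries(root="data/thumos14", expected=EXPECTED):
    """Return the expected files/folders that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


# Run from the root folder of the repo.
missing = missing_entries()
print("missing:", missing if missing else "none")
```

An empty result means the features and annotations are where the code expects them.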
Run
python main.py --cfg CFG_PATH --eval --resume CKPT_PATH
CFG_PATH is the path to the YAML-format config file that defines the experimental setting, for example configs/thumos14_i3d2s_tadtr.yml. CKPT_PATH is the path of the pre-trained model. Alternatively, you can execute the shell script bash scripts/test_reference_models.sh thumos14 for simplicity.
Run the following command
python main.py --cfg CFG_PATH
This codebase supports running on both CPU and GPU.
- To run on CPU: please add --device cpu to the above command. You also need to set disable_cuda=True in opts.py. The CPU mode does not support actionness regression and the detection performance is lower.
- To run on GPU: since the model is very lightweight, just one GPU is enough. You may specify the GPU device ID (e.g., 0) to use by adding the prefix CUDA_VISIBLE_DEVICES=ID before the above command. To run on multiple GPUs, please refer to scripts/run_parallel.sh.
During training, our code automatically performs testing every N epochs (N is the test_interval in opts.py). Training takes 6~10 minutes on THUMOS14 if you use a modern GPU (e.g. TITAN Xp). You can also monitor the training process with TensorBoard (you need to set cfg.tensorboard=True in opts.py). The TensorBoard record and the checkpoint will be saved at output_dir (can be modified in the config file).
After training is done, you can also test your trained model by running
python main.py --cfg CFG_PATH --eval
It will automatically use the best model checkpoint. If you want to manually specify the model checkpoint, run
python main.py --cfg CFG_PATH --eval --resume CKPT_PATH
Note that the performance of a model you train yourself may differ from that of the reference model, even though all seeds are fixed. The reason is that TadTR uses the grid_sample operator, whose gradient computation involves the non-deterministic AtomicAdd operator. Please refer to ref1 ref2 ref3(Chinese) for details.
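A rough intuition for why this matters: atomic additions from parallel threads can land in a different order on each run, and floating-point addition is not associative, so the accumulated gradient can differ in its last bits. The toy example below only illustrates that effect in plain Python; it does not use grid_sample or CUDA, and the helper accumulate and the three values are made up for illustration.

```python
def accumulate(values, order):
    """Sum values in the given order, like contributions arriving one by one."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Three made-up gradient contributions to the same memory location.
grads = [0.1, 0.2, 0.3]

forward = accumulate(grads, [0, 1, 2])   # ((0.1 + 0.2) + 0.3)
backward = accumulate(grads, [2, 1, 0])  # ((0.3 + 0.2) + 0.1)

print(forward, backward)
print(forward == backward)  # False: the two orders round differently
```

Such tiny differences are harmless for a single step but can accumulate over many training iterations, which is why fixing the seeds alone does not make training bit-for-bit reproducible.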
The code is based on DETR and Deformable DETR. We also borrow the implementation of RoIAlign1D from G-TAD. Thanks for their great work.
@article{liu2022end,
title={End-to-end Temporal Action Detection with Transformer},
author={Liu, Xiaolong and Wang, Qimeng and Hu, Yao and Tang, Xu and Zhang, Shiwei and Bai, Song and Bai, Xiang},
journal={IEEE Transactions on Image Processing (TIP)},
year={2022}
}
For questions and suggestions, please contact Xiaolong Liu by email ("liuxl at hust dot edu dot cn").