This is a PyTorch implementation for our ECCV 2022 paper "Real-time Online Video Detection with Temporal Smoothing Transformers".
The code is developed with CUDA 10.2, Python >= 3.7.7, and PyTorch >= 1.7.1.
Clone the repo recursively.
```
git clone --recursive git@github.com:zhaoyue-zephyrus/TeSTra.git
```
[Optional but recommended] Create a new conda environment.
```
conda create -n testra python=3.7.7
```
And activate the environment.
```
conda activate testra
```
Install the requirements:
```
pip install -r requirements.txt
```
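To quickly confirm that the environment matches the versions listed above, a minimal check (not part of the repo's own tooling) is:

```
# Optional sanity check: print the PyTorch version (expected >= 1.7.1) and CUDA availability
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```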
You can directly download the pre-extracted features (.zip) from the UTBox links below.

For THUMOS'14:

Description | backbone | pretrain | UTBox Link |
---|---|---|---|
frame label | N/A | N/A | link |
RGB | ResNet-50 | Kinetics-400 | link |
Flow (TV-L1) | BN-Inception | Kinetics-400 | link |
Flow (NVOF) | BN-Inception | Kinetics-400 | link |
RGB | ResNet-50 | ANet v1.3 | link |
Flow (TV-L1) | ResNet-50 | ANet v1.3 | link |

For EK100 (EPIC-Kitchens-100):

Description | backbone | pretrain | UTBox Link |
---|---|---|---|
action label | N/A | N/A | link |
noun label | N/A | N/A | link |
verb label | N/A | N/A | link |
RGB | BN-Inception | IN-1k + EK100 | link |
Flow (TV-L1) | BN-Inception | IN-1k + EK100 | link |
Object | Faster-RCNN | MS-COCO + EK55 | link |
- Note: The features are converted from RULSTM to be compatible with the codebase.
- Note: The object feature is not used in TeSTra; it is uploaded for completeness only.
Once the zipped files are downloaded, it is recommended to unzip them and follow the file organization described below.
On non-GUI systems, it may be easier to download from static links via wget. To do so, simply change the UTBox link from https://utexas.box.com/s/xxxx to https://utexas.box.com/shared/static/xxxx.zip. Unfortunately, UTBox does not support customized URL names. Therefore, to wget while keeping the filename readable, please refer to the bash scripts provided in DATASET.md.
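As a minimal sketch (the hash xxxx is a placeholder copied from above, and the output filename is only illustrative), wget's -O flag keeps the saved file readable:

```
# Illustrative only: replace xxxx with the hash from the shared static link,
# and name the output after the feature it contains.
wget https://utexas.box.com/shared/static/xxxx.zip -O rgb_kinetics_resnet50.zip
```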
You can also try to prepare the datasets from scratch by yourself.
For TH14, please refer to LSTR.
For EK100, please find more details at RULSTM.
I will release a pure-Python version of DenseFlow in the near future and will post a cross-link here once it is done.
If you want to use our dataloaders, please make sure to arrange the files in the following structure:
THUMOS'14 dataset:
```
$YOUR_PATH_TO_THUMOS_DATASET
├── rgb_kinetics_resnet50/
│   ├── video_validation_0000051.npy (of size L x 2048)
│   ├── ...
├── flow_kinetics_bninception/
│   ├── video_validation_0000051.npy (of size L x 1024)
│   ├── ...
├── target_perframe/
│   ├── video_validation_0000051.npy (of size L x 22)
│   ├── ...
```
EK100 dataset:
```
$YOUR_PATH_TO_EK_DATASET
├── rgb_kinetics_bninception/
│   ├── P01_01.npy (of size L x 2048)
│   ├── ...
├── flow_kinetics_bninception/
│   ├── P01_01.npy (of size L x 2048)
│   ├── ...
├── target_perframe/
│   ├── P01_01.npy (of size L x 3807)
│   ├── ...
├── noun_perframe/
│   ├── P01_01.npy (of size L x 301)
│   ├── ...
├── verb_perframe/
│   ├── P01_01.npy (of size L x 98)
│   ├── ...
```
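As a quick sanity check on the downloaded features (paths are illustrative; file names follow the trees above), the feature array and the per-frame target of the same video should share the same temporal length L:

```
# Illustrative check: both arrays should share the same first dimension L
python -c "import numpy as np; print(np.load('$YOUR_PATH_TO_THUMOS_DATASET/rgb_kinetics_resnet50/video_validation_0000051.npy').shape)"   # expect (L, 2048)
python -c "import numpy as np; print(np.load('$YOUR_PATH_TO_THUMOS_DATASET/target_perframe/video_validation_0000051.npy').shape)"         # expect (L, 22)
```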
Create softlinks of datasets:
```
cd TeSTra
ln -s $YOUR_PATH_TO_THUMOS_DATASET data/THUMOS
ln -s $YOUR_PATH_TO_EK_DATASET data/EK100
```
The commands for training are as follows.
```
cd TeSTra/
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES
# Finetuning from a pretrained model
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT
```
For existing checkpoints, please refer to the next section.
Run the online inference in batch mode for performance benchmarking.
```
cd TeSTra/
# Online inference in batch mode
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE batch
```
Run the online inference in stream mode to calculate runtime in the streaming setting.
```
cd TeSTra/
# Online inference in stream mode
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream
# The above command will take quite a long time over the entire dataset.
# If you only want to look at a particular video, attach an additional argument:
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream \
DATA.TEST_SESSION_SET "['$VIDEO_NAME']"
```
For more details on the difference between batch mode and stream mode, please check out LSTR.

Results on THUMOS'14:

method | kernel type | mAP (%) | config | checkpoint |
---|---|---|---|---|
LSTR (baseline) | Cross Attention | 69.9 | yaml | UTBox link |
TeSTra | Laplace (α=e^-λ=0.97) | 70.8 | yaml | UTBox link |
TeSTra | Box (α=e^-λ=1.0) | 71.2 | yaml | UTBox link |
TeSTra (lite) | Box (α=e^-λ=1.0) | 67.3 | yaml | UTBox link |

Results on EK100:

method | kernel type | verb (overall) | noun (overall) | action (overall) | config | checkpoint |
---|---|---|---|---|---|---|
TeSTra | Laplace (α=e^-λ=0.9) | 30.8 | 35.8 | 17.6 | yaml | UTBox link |
TeSTra | Box (α=e^-λ=1.0) | 31.4 | 33.9 | 17.0 | yaml | UTBox link |
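For reference, the kernel columns report α = e^(-λ), so the decay rate λ can be read off directly as λ = -ln α (α = 1.0 corresponds to λ = 0, i.e. the box kernel):

```
# Recover the decay rate lambda from the reported alpha = exp(-lambda)
python -c "import math; print(-math.log(0.97), -math.log(0.9))"   # ~0.0305 and ~0.1054
```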
If you are using the data/code/model provided here in a publication, please cite our paper:
```
@inproceedings{zhao2022testra,
  title={Real-time Online Video Detection with Temporal Smoothing Transformers},
  author={Zhao, Yue and Kr{\"a}henb{\"u}hl, Philipp},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022}
}
```
For any questions, feel free to raise an issue or drop me an email at yzhao [at] cs.utexas.edu.
This project is licensed under the Apache-2.0 License.
This codebase is built upon LSTR.
The code snippet for evaluation on EK100 is borrowed from RULSTM.
Also, thanks to Mingze Xu for his assistance in reproducing the features on THUMOS'14.