TeSTra: Real-time Online Video Detection with Temporal Smoothing Transformers

Introduction

This is a PyTorch implementation for our ECCV 2022 paper "Real-time Online Video Detection with Temporal Smoothing Transformers".

[Teaser figure]

Environment

  • The code is developed with CUDA 10.2, Python >= 3.7.7, PyTorch >= 1.7.1

    1. Clone the repo recursively.

      git clone --recursive git@github.com:zhaoyue-zephyrus/TeSTra.git
      
    2. [Optional but recommended] create a new conda environment.

      conda create -n testra python=3.7.7
      

      And activate the environment.

      conda activate testra
      
    3. Install the requirements

      pip install -r requirements.txt
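
After installing, a quick optional check (not part of the repo) can confirm that PyTorch and CUDA are visible:

```python
# Optional environment check: confirm PyTorch and CUDA are usable.
import torch

print(torch.__version__)           # expect >= 1.7.1
print(torch.cuda.is_available())   # expect True on a CUDA-enabled machine
```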
      

Data Preparation

Pre-extracted Features

You can directly download the pre-extracted features (.zip) from the UTBox links below.

THUMOS'14

| Description  | Backbone     | Pretraining  | UTBox Link |
|--------------|--------------|--------------|------------|
| frame label  | N/A          | N/A          | link       |
| RGB          | ResNet-50    | Kinetics-400 | link       |
| Flow (TV-L1) | BN-Inception | Kinetics-400 | link       |
| Flow (NVOF)  | BN-Inception | Kinetics-400 | link       |
| RGB          | ResNet-50    | ANet v1.3    | link       |
| Flow (TV-L1) | ResNet-50    | ANet v1.3    | link       |

EK100

| Description  | Backbone     | Pretraining    | UTBox Link |
|--------------|--------------|----------------|------------|
| action label | N/A          | N/A            | link       |
| noun label   | N/A          | N/A            | link       |
| verb label   | N/A          | N/A            | link       |
| RGB          | BN-Inception | IN-1k + EK100  | link       |
| Flow (TV-L1) | BN-Inception | IN-1k + EK100  | link       |
| Object       | Faster-RCNN  | MS-COCO + EK55 | link       |
  • Note: The features are converted from RULSTM to be compatible with this codebase.
  • Note: The object feature is not used in TeSTra; it is uploaded for completeness only.

Once the zipped files are downloaded, we suggest unzipping them and organizing them according to the file structure below.

(Alternative) Static links

It may be easier to download from static links via wget on non-GUI systems. To do so, simply change the UTBox link from https://utexas.box.com/s/xxxx to https://utexas.box.com/shared/static/xxxx.zip. Unfortunately, UTBox does not support customized URL names; therefore, to wget while keeping the file name readable, please refer to the bash scripts provided in DATASET.md. A hypothetical example is sketched below.
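
```bash
# Hypothetical example: xxxx stands for the share ID taken from a UTBox link
# above; the output filename is illustrative.
wget -O thumos14_rgb_kinetics_resnet50.zip \
    "https://utexas.box.com/shared/static/xxxx.zip"
```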

(Alternative) Prepare dataset from scratch

You can also try to prepare the datasets from scratch by yourself.

THUMOS14

For TH14, please refer to LSTR.

EK100

For EK100, please find more details at RULSTM.

Computing Optical Flow

I will release a pure-Python version of DenseFlow in the near future and will post a cross-link here once it is done.

Data Structure

  1. If you want to use our dataloaders, please make sure the files are organized in the following structure (a quick sanity check is sketched after this list):

    • THUMOS'14 dataset:

      $YOUR_PATH_TO_THUMOS_DATASET
      ├── rgb_kinetics_resnet50/
      │   ├── video_validation_0000051.npy (of size L x 2048)
      │   ├── ...
      ├── flow_kinetics_bninception/
      │   ├── video_validation_0000051.npy (of size L x 1024)
      │   ├── ...
      ├── target_perframe/
      │   ├── video_validation_0000051.npy (of size L x 22)
      │   ├── ...
      
    • EK100 dataset:

      $YOUR_PATH_TO_EK_DATASET
      ├── rgb_kinetics_bninception/
      │   ├── P01_01.npy (of size L x 2048)
      │   ├── ...
      ├── flow_kinetics_bninception/
      │   ├── P01_01.npy (of size L x 2048)
      │   ├── ...
      ├── target_perframe/
      │   ├── P01_01.npy (of size L x 3807)
      │   ├── ...
      ├── noun_perframe/
      │   ├── P01_01.npy (of size L x 301)
      │   ├── ...
      ├── verb_perframe/
      │   ├── P01_01.npy (of size L x 98)
      │   ├── ...
      
  2. Create softlinks of datasets:

    cd TeSTra
    ln -s $YOUR_PATH_TO_THUMOS_DATASET data/THUMOS
    ln -s $YOUR_PATH_TO_EK_DATASET data/EK100
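
The following is a minimal sanity-check sketch (not part of the codebase) to verify that a downloaded feature file and its per-frame labels line up; the video name and directories follow the THUMOS'14 example above.

```python
# Minimal sanity check (illustrative, not part of the repo): confirm that a
# feature file and its per-frame label file cover the same number of frames L.
import numpy as np

video = "video_validation_0000051"  # example video from the structure above
rgb = np.load(f"data/THUMOS/rgb_kinetics_resnet50/{video}.npy")  # shape (L, 2048)
target = np.load(f"data/THUMOS/target_perframe/{video}.npy")     # shape (L, 22)

assert rgb.shape[0] == target.shape[0], "feature/label frame counts differ"
print(f"{video}: {rgb.shape[0]} frames, feature dim {rgb.shape[1]}")
```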
    

Training

The commands for training are as follows.

```
cd TeSTra/
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES
# Fine-tuning from a pretrained model
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT
```

Online Inference

For existing checkpoints, please refer to the next section.

Batch mode

Run the online inference in batch mode for performance benchmarking.

```
cd TeSTra/
# Online inference in batch mode
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE batch
```

Stream mode

Run the online inference in stream mode to calculate runtime in the streaming setting.

```
cd TeSTra/
# Online inference in stream mode
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream
# The above command takes quite long over the entire dataset.
# If you only want to look at a particular video, attach an additional argument:
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream \
    DATA.TEST_SESSION_SET "['$VIDEO_NAME']"
```

For more details on the difference between batch mode and stream mode, please check out LSTR.

Main Results and Checkpoints

THUMOS14

| Method          | Kernel Type           | mAP (%) | Config | Checkpoint |
|-----------------|-----------------------|---------|--------|------------|
| LSTR (baseline) | Cross Attention       | 69.9    | yaml   | UTBox link |
| TeSTra          | Laplace (α=e^-λ=0.97) | 70.8    | yaml   | UTBox link |
| TeSTra          | Box (α=e^-λ=1.0)      | 71.2    | yaml   | UTBox link |
| TeSTra (lite)   | Box (α=e^-λ=1.0)      | 67.3    | yaml   | UTBox link |

EK100

| Method | Kernel Type          | Verb (overall) | Noun (overall) | Action (overall) | Config | Checkpoint |
|--------|----------------------|----------------|----------------|------------------|--------|------------|
| TeSTra | Laplace (α=e^-λ=0.9) | 30.8           | 35.8           | 17.6             | yaml   | UTBox link |
| TeSTra | Box (α=e^-λ=1.0)     | 31.4           | 33.9           | 17.0             | yaml   | UTBox link |
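
For intuition on the kernel types above: with a Laplace kernel, the attention statistics decay geometrically over time, so they can be maintained as running sums updated in O(1) per incoming frame; α=e^-λ=1.0 recovers the box (uniform) kernel. Below is a toy single-query sketch of this recursion under illustrative assumptions (random features, feature dimension D = 8); it is not the repository's implementation.

```python
# Toy sketch of temporally smoothed streaming attention (illustrative only).
# alpha = e^{-lambda}: 0.97 mimics the Laplace kernel row above; alpha = 1.0
# degenerates to the box (uniform) kernel, i.e. plain running sums.
import numpy as np

def stream_smoothed_attention(frames_kv, query, alpha=0.97):
    num, den = 0.0, 0.0                  # running value-sum and weight-sum
    for k, v in frames_kv:
        w = np.exp(query @ k)            # unnormalized attention weight
        num = alpha * num + w * v        # decay old evidence, add new frame
        den = alpha * den + w
        yield num / den                  # smoothed readout after this frame

# Hypothetical usage with random per-frame keys/values:
rng = np.random.default_rng(0)
D = 8                                    # illustrative feature dimension
frames = [(0.1 * rng.standard_normal(D), rng.standard_normal(D)) for _ in range(5)]
for out in stream_smoothed_attention(frames, query=0.1 * rng.standard_normal(D)):
    print(out[:3])                       # first few dims of the readout
```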

Citations

If you use the data, code, or models provided here in a publication, please cite our paper:

@inproceedings{zhao2022testra,
	title={Real-time Online Video Detection with Temporal Smoothing Transformers},
	author={Zhao, Yue and Kr{\"a}henb{\"u}hl, Philipp},
	booktitle={European Conference on Computer Vision (ECCV)},
	year={2022}
}

Contacts

For any questions, feel free to raise an issue or drop me an email at yzhao [at] cs.utexas.edu.

License

This project is licensed under the Apache-2.0 License.

Acknowledgements

This codebase is built upon LSTR.

The code snippet for evaluation on EK100 is borrowed from RULSTM.

Also, thanks to Mingze Xu for his assistance in reproducing the features on THUMOS'14.
