An implementation of our work (Online Real-time Multiple Spatiotemporal Action Localisation and Prediction) published in ICCV 2017.
Originally, we used the Caffe implementation of SSD-V2 for the publication. I have forked the version of SSD-CAFFE that I used to generate the results for the paper; you can use that repo if you prefer Caffe, otherwise I would recommend using this version. This implementation differs slightly from the original work: it performs slightly better at lower IoU thresholds and slightly worse at higher ones. The tube generation part is the same as in the original implementation. I found that this implementation of SSD is slightly worse at IoU thresholds greater than or equal to 0.5 on the UCF24 dataset.
I decided to release the code with a PyTorch implementation of SSD because it is easier to reuse than the Caffe version (where installation itself can be a big issue). We build on the PyTorch implementation of SSD by Max deGroot and Ellis Brown. We made a few changes, such as using different learning rates for biases and weights during optimization, and simplified some parts to accommodate the UCF24 dataset.
- Install PyTorch (version v0.2; you can try v0.3, but that would require a few fixes) by selecting your environment on the website and running the appropriate command.
- Please install cv2 as well. I recommend using Anaconda 3.6 and its opencv package.
- You will also need Matlab. If you have a distributed computing license it will be faster; otherwise it should still be fine. Just replace parfor with a simple for in the Matlab scripts. I would be happy to accept a PR for a Python version of this part.
- Clone this repository.
- Note: We currently only support Python 3+ with PyTorch version v0.2 on Linux.
- We currently only support UCF24 with the revised annotations released with our paper. We will try to add JHMDB21 as soon as possible but can't promise; you can check out our BMVC2016 code to get started with your experiments on JHMDB21.
- To simulate the same training and evaluation setup, we provide extracted rgb images from videos along with optical flow images (both brox flow and real-time flow) computed for the UCF24 dataset. You can download them from my google drive link.
- We also support Visdom for visualization of loss and frame-meanAP on a subset during training.
- To use Visdom in the browser:
    # First install Python server and client
    pip install visdom
    # Start the server (probably in a screen or tmux)
    python -m visdom.server --port=8097
- Then (during training) navigate to http://localhost:8097/ (see the Training section below for more details).
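Before launching a long training run with Visdom enabled, it can help to confirm that the server started above is actually reachable. The snippet below is a minimal sketch using only the Python standard library. The helper name `visdom_is_reachable` is hypothetical and not part of this repo; the host and port simply mirror the command above.

```python
import socket


def visdom_is_reachable(host="localhost", port=8097, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper (not part of this repo). It only checks that a
    server is listening; it does not verify that the server is Visdom.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if visdom_is_reachable():
        print("A server is listening on port 8097")
    else:
        print("Nothing is listening on port 8097; "
              "start it with: python -m visdom.server --port=8097")
```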
To make things easy, we provide extracted rgb images from videos along with optical flow images (both brox flow and real-time flow) computed for the ucf24 dataset; you can download them from my google drive link.
It is a tarball of almost 6 GB; download it and extract it wherever you are going to store your experiments.
UCF24DETECTION is a dataset loader class in data/ucf24.py that inherits torch.utils.data.Dataset, making it fully compatible with the torchvision.datasets API.
- Requires fc-reduced VGG-16 model weights; the weights are already in the dataset tarball under the train_data subfolder.
- By default, we assume that you have downloaded that dataset.
- To train SSD using the training script, simply specify the parameters listed in train-ucf24.py as flags or change them manually.
Let's assume that you extracted the dataset in the /home/user/ucf24/ directory; then your train command from the root directory of this repo is going to be:
CUDA_VISIBLE_DEVICES=0 python3 train-ucf24.py --data_root=/home/user/ucf24/ --save_root=/home/user/ucf24/
--visdom=True --input_type=rgb --stepvalues=70000,90000 --max_iter=120000
To train on flow inputs:
CUDA_VISIBLE_DEVICES=0 python3 train-ucf24.py --data_root=/home/user/ucf24/ --save_root=/home/user/ucf24/
--visdom=True --input_type=brox --stepvalues=70000,90000 --max_iter=120000
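If you launch several runs (for example one per input type), the commands above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in the commands above. The helper names `build_train_command` and `run_training` are hypothetical and not part of this repo, and the script only prints the commands unless you call `run_training` yourself.

```python
import os
import shlex
import subprocess


def build_train_command(data_root, save_root, input_type="rgb",
                        visdom=True, stepvalues=(70000, 90000),
                        max_iter=120000):
    """Assemble the argument list for train-ucf24.py (hypothetical helper).

    Only flags that appear in the commands above are used.
    """
    return [
        "python3", "train-ucf24.py",
        "--data_root=" + data_root,
        "--save_root=" + save_root,
        "--visdom=" + str(visdom),
        "--input_type=" + input_type,
        "--stepvalues=" + ",".join(str(s) for s in stepvalues),
        "--max_iter=" + str(max_iter),
    ]


def run_training(cmd, gpu="0"):
    """Run the command from the repo root with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    return subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    for input_type in ("rgb", "brox"):
        cmd = build_train_command("/home/user/ucf24/", "/home/user/ucf24/",
                                  input_type=input_type)
        # Print instead of running; call run_training(cmd) to launch.
        print(" ".join(shlex.quote(part) for part in cmd))
```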
Different parameters in train-ucf24.py will result in different performance.
- Note:
- The network occupies almost 9.2GB of VRAM on a GPU; we used a 1080Ti for training, and normal training takes about 32-40 hrs.
- For instructions on Visdom usage/installation, see the Installation section. By default, it is off.
- If you don't want to use Visdom, you can always keep track of training using the log file, which is saved under the save_root directory.
- During training, a checkpoint is saved every 10K iterations, and its frame-level frame-mean-ap on a subset of 22k test images is logged.
- We recommend training for 120K iterations for all the input types.
To generate the tubes and evaluate them, you first need frame-level detections; then you can navigate to online-tubes to generate tubes using I01onlineTubes and I02genFusedTubes.
Once you have trained the network, you can use test-ucf24.py to generate frame-level detections.
To evaluate SSD using the test script, simply specify the parameters listed in test-ucf24.py as flags or change them manually, e.g.:
CUDA_VISIBLE_DEVICES=0 python3 test-ucf24.py --data_root=/home/user/ucf24/ --save_root=/home/user/ucf24/
--input_type=rgb --eval_iter=120000
To evaluate the optical flow models:
CUDA_VISIBLE_DEVICES=0 python3 test-ucf24.py --data_root=/home/user/ucf24/ --save_root=/home/user/ucf24/
--input_type=brox --eval_iter=120000
- Note:
- By default, it will compute frame-level detections and store them, as well as compute frame-mean-AP for the models saved at the 90k and 120k iterations.
- A log file is created for each iteration's frame-level evaluation.
You will need frame-level detections, and you will need to navigate to online-tubes.
Step 1: specify data_root, save_root and iteration_num_* in I01onlineTubes and I02genFusedTubes.
Step 2: run I01onlineTubes and I02genFusedTubes in Matlab; this will print out the video-mean-ap and save the results in a .mat file.
Results are saved in save_root/results.mat. Additionally, action-path and action-tubes are also stored under the save_root\ucf24\* folders.
- NOTE: I01onlineTubes and I02genFusedTubes not only produce video-level mAP; they also produce video-level classification accuracy on the 24 classes of UCF24.
To compute frame-mAP, you can use the frameAP.m script. You will need to specify data_root and save_root.
Use this script, not the Python one, to produce results for your publication; both are almost identical, but their AP computation from precision and recall is slightly different.
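The text above does not spell out how the two AP computations differ, so the following is only an illustration of why AP values computed from the same precision and recall can disagree slightly. It compares two common conventions: the PASCAL VOC 11-point interpolation and the area under the monotonically non-increasing precision envelope. Neither function is taken from frameAP.m or from the Python evaluation code, and both function names are hypothetical.

```python
def ap_11_point(recall, precision):
    """11-point interpolated AP: mean over t = 0.0, 0.1, ..., 1.0 of the
    maximum precision among points whose recall is at least t."""
    total = 0.0
    for i in range(11):
        t = i / 10.0
        candidates = [p for r, p in zip(recall, precision) if r >= t]
        total += max(candidates) if candidates else 0.0
    return total / 11.0


def ap_area(recall, precision):
    """Area under the precision envelope, where the envelope makes
    precision non-increasing as recall grows."""
    mrec = [0.0] + list(recall) + [1.0]
    mpre = [0.0] + list(precision) + [0.0]
    # Sweep right to left so each precision is the max of everything after it.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall actually changes.
    return sum((mrec[i] - mrec[i - 1]) * mpre[i]
               for i in range(1, len(mrec)) if mrec[i] != mrec[i - 1])


if __name__ == "__main__":
    # Toy curve: 5 ground-truth boxes, detections ranked TP, TP, FP, TP, TP.
    recall = [0.2, 0.4, 0.4, 0.6, 0.8]
    precision = [1.0, 1.0, 2.0 / 3.0, 0.75, 0.8]
    print("11-point AP:", ap_11_point(recall, precision))
    print("area AP:    ", ap_area(recall, precision))
```

On this toy curve the two conventions give roughly 0.745 and 0.72, which is the kind of small gap that makes it worth reporting numbers from a single script.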
The table below is similar to Table 1 in our paper. It contains more information than the table in the paper, mostly about this implementation.
| IoU Threshold = | 0.20 | 0.50 | 0.75 | 0.5:0.95 | frame-mAP@0.5 | accuracy(%) |
|---|---|---|---|---|---|---|
| Peng et al [3] RGB+BroxFLOW | 73.67 | 32.07 | 00.85 | 07.26 | -- | -- |
| Saha et al [2] RGB+BroxFLOW | 66.55 | 36.37 | 07.94 | 14.37 | -- | -- |
| Singh et al [4] RGB+FastFLOW | 70.20 | 43.00 | 14.10 | 19.20 | -- | -- |
| Singh et al [4] RGB+BroxFLOW | 73.50 | 46.30 | 15.00 | 20.40 | -- | 91.12 |
| This implementation[4] RGB | 71.71 | 39.36 | 14.57 | 17.95 | 64.12 | 88.68 |
| This implementation[4] FastFLOW | 73.50 | 67.63 | 03.57 | 11.56 | 46.33 | 85.60 |
| This implementation[4] BroxFLOW | 44.62 | 14.43 | 00.12 | 03.42 | 21.94 | 70.55 |
| This implementation[4] RGB+FastFLOW (boost-fusion) | 70.61 | 40.18 | 11.42 | 17.03 | 64.40 | 89.01 |
| This implementation[4] RGB+FastFLOW (union-set) | 72.80 | 43.23 | 13.14 | 18.51 | 60.70 | 89.89 |
| This implementation[4] RGB+FastFLOW (mean fusion) | 74.34 | 44.27 | 13.50 | 18.96 | 60.70 | 91.54 |
| This implementation[4] RGB+BroxFLOW (boost-fusion) | 73.58 | 43.76 | 12.60 | 18.60 | 67.60 | 91.10 |
| This implementation[4] RGB+BroxFLOW (union-set) | 74.88 | 45.14 | 13.93 | 19.73 | 64.36 | 92.64 |
| This implementation[4] RGB+BroxFLOW (mean fusion) | 76.91 | 47.56 | 15.14 | 20.66 | 67.01 | 93.08 |
| Kalogeiton et al. [5] RGB+BroxFLOW (stack of flow images) (mean fusion) | 76.50 | 49.20 | 19.70 | 23.40 | 69.50 | -- |
Effect of training iterations:
The choice of learning rate and the number of training iterations both affect the results.
If you train the SSD network at the initial learning rate for many iterations, it performs better at lower IoU thresholds, which is what is done here.
In the original work, using the Caffe implementation of SSD, I trained the SSD network with a learning rate of 0.0005 for the first 30K iterations, then dropped the learning rate by a factor of 5 (divided by 5) and trained further up to 45K iterations.
In this implementation, all the models are trained for 120K iterations; the initial learning rate is set to 0.0005, and the learning rate is dropped by a factor of 5 after 70K and 90K iterations.
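The two schedules described above can be written as one step function. This is a minimal sketch, not the code used in train-ucf24.py: the function name `step_lr` is hypothetical, and treating the drop as taking effect exactly at the step iteration is an assumption.

```python
def step_lr(iteration, base_lr=0.0005, stepvalues=(70000, 90000), gamma=0.2):
    """Learning rate at a given iteration for a step schedule.

    The rate is multiplied by gamma (0.2, i.e. divided by 5) once for each
    step value that has been reached. Assumption: the drop applies from the
    step iteration onwards.
    """
    drops = sum(1 for step in stepvalues if iteration >= step)
    return base_lr * (gamma ** drops)


if __name__ == "__main__":
    # Schedule of this implementation: 120K iterations, drops at 70K and 90K.
    for it in (0, 69999, 70000, 90000, 119999):
        print("pytorch", it, step_lr(it))
    # Schedule of the original Caffe run: 45K iterations, one drop at 30K.
    for it in (0, 30000, 44999):
        print("caffe  ", it, step_lr(it, stepvalues=(30000,)))
```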
Kalogeiton et al. [5] make use of mean fusion, so I thought we could try it in our pipeline, where it was very easy to incorporate.
It is evident from the table above that mean fusion performs better than the other fusion techniques.
Also, their method relies on multiple frames as input, in addition to post-processing of bounding box coordinates at the tubelet level.
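As a rough illustration of the idea behind mean fusion, the sketch below averages the per-class scores of the appearance and flow streams box by box. It is not the fusion code from this repo: it assumes both streams score the same candidate boxes in the same order, and the function name `mean_fuse_scores` is hypothetical.

```python
def mean_fuse_scores(rgb_scores, flow_scores):
    """Average per-class scores of two streams, box by box.

    Assumption: rgb_scores[i] and flow_scores[i] are the class scores of the
    same candidate box, each given as a list with one score per class.
    """
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both streams must score the same boxes")
    fused = []
    for rgb_row, flow_row in zip(rgb_scores, flow_scores):
        if len(rgb_row) != len(flow_row):
            raise ValueError("both streams must score the same classes")
        fused.append([(a + b) / 2.0 for a, b in zip(rgb_row, flow_row)])
    return fused


if __name__ == "__main__":
    rgb = [[0.75, 0.25], [0.5, 0.5]]   # two boxes, two classes
    flow = [[0.25, 0.75], [1.0, 0.0]]
    print(mean_fuse_scores(rgb, flow))
```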
This implementation is mainly focused on producing the best numbers (mAP) in the simplest manner; it can be modified to run faster. There are a few aspects that would need changes:
- NMS is performed once in Python and then again in Matlab; it should instead be done on the GPU in Python.
- Most of the time spent during tube generation is taken by disk operations, which can be eliminated completely.
- IoU computation during action path generation is done multiple times just to keep the code clean; this can be handled more smartly.
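For readers unfamiliar with the two operations named in the list above, here is a minimal pure-Python sketch of box IoU and greedy NMS. It is a CPU reference for illustration only, not the code used in this repo and not the GPU version suggested above; the function names, the (xmin, ymin, xmax, ymax) box format, and the 0.45 threshold are all assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.45):
    """Return indices of the kept boxes, highest score first.

    A box is kept only if its IoU with every already kept box does not
    exceed iou_threshold (0.45 is an assumed value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))
```

Running NMS once, and caching IoU values that are needed again when linking detections into action paths, would avoid the repeated work mentioned in the list.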
Contact me if you want to implement the real-time version. A proper real-time version would require converting the Matlab part into Python. I presented the timing of individual components in the paper, which still holds true.
To use a pre-trained model, download the pre-trained weights from the links given below and make changes in test-ucf24.py to accept the downloaded weights.
- Currently, we provide the following PyTorch models:
  - SSD300 trained on ucf24; available from my google drive
    - appearance model trained on rgb-images (named rgb-ssd300_ucf24_120000)
    - accurate flow model trained on brox-images (named brox-ssd300_ucf24_120000)
    - real-time flow model trained on fastOF-images (named fastOF-ssd300_ucf24_120000)
- These models can be used to reproduce the table above, which is almost identical to the one in our paper.
- Incorporate JHMDB-21 dataset
- Convert matlab part into python (happy to accept PR)
If this work has been helpful in your research please consider citing [1] and [4]
@inproceedings{singh2016online,
title={Online Real time Multiple Spatiotemporal Action Localisation and Prediction},
author={Singh, Gurkirt and Saha, Suman and Sapienza, Michael and Torr, Philip and Cuzzolin, Fabio},
booktitle={ICCV},
year={2017}
}
- [1] Wei Liu, et al. SSD: Single Shot MultiBox Detector. ECCV 2016.
- [2] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. BMVC 2016.
- [3] X. Peng and C. Schmid. Multi-region two-stream R-CNN for action detection. ECCV 2016.
- [4] G. Singh, S. Saha, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. Online Real time Multiple Spatiotemporal Action Localisation and Prediction. ICCV 2017.
- [5] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. Action Tubelet Detector for Spatio-Temporal Action Localization. ICCV 2017.
- Original SSD Implementation (CAFFE)
- A huge thanks to Max deGroot and Ellis Brown for the PyTorch implementation of SSD