This repository contains code for 3D-RetinaNet, a novel single-stage action detection network proposed along with the ROAD dataset. Our TPAMI paper contains a detailed description of 3D-RetinaNet and the ROAD dataset. This code contains training and evaluation for the ROAD and UCF-24 datasets.
We need three things to get started with training: datasets, Kinetics pre-trained weights, and PyTorch with torchvision and tensorboardX.
- We currently only support the following two datasets:
  - ROAD dataset, described in the dataset release paper
  - UCF24 with revised annotations released with our ICCV-2017 paper
- Visit ROAD dataset for download and pre-processing instructions.
- You can download rgb-images for the UCF24 dataset from my Google Drive link.
- Download annotations from corrected-UCF10-annots-repo.
- Your data directory should look like:
  - ucf24/
    - pyannot_with_class_names.pkl
    - rgb-images
      - class-name ...
        - video-name ...
          - images ......
- Install PyTorch and torchvision.
- Install tensorboardX via: pip install tensorboardx
- Pre-trained weights on Kinetics-400: change the current directory to kinetics-pt and run the bash file get_kinetics_weights.sh, or download them from Google Drive. Name the folder kinetics-pt; it is important to name it right.
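Before launching a long training run, it can help to confirm that the UCF24 layout and the weights folder match the description above. The following is a minimal sketch under that assumption; `check_ucf24_layout` and `check_weights_dir` are hypothetical helpers written for illustration and are not part of this repository.

```python
from pathlib import Path


def check_ucf24_layout(data_root):
    """Return a list of problems found; an empty list means the layout looks right.

    Assumes the layout described in this README:
    <data_root>/ucf24/pyannot_with_class_names.pkl
    <data_root>/ucf24/rgb-images/<class-name>/<video-name>/<image files>
    """
    root = Path(data_root) / "ucf24"
    problems = []
    if not (root / "pyannot_with_class_names.pkl").is_file():
        problems.append("missing annotation file pyannot_with_class_names.pkl")
    images = root / "rgb-images"
    if not images.is_dir():
        problems.append("missing rgb-images directory")
        return problems
    # Collect every <class-name>/<video-name> directory.
    videos = [v for c in images.iterdir() if c.is_dir()
              for v in c.iterdir() if v.is_dir()]
    if not videos:
        problems.append("no <class-name>/<video-name> directories under rgb-images")
    for video in videos:
        if not any(f.is_file() for f in video.iterdir()):
            problems.append(f"no images in {video.relative_to(root)}")
    return problems


def check_weights_dir(path):
    """The weights folder must exist and must be called kinetics-pt."""
    p = Path(path)
    return p.is_dir() and p.name == "kinetics-pt"
```

Calling `check_ucf24_layout("/home/user/")` returns the list of problems, so an empty list is the signal that the data directory is ready.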
- We assume that you have downloaded the dataset and pre-trained weights and put them in the correct places.
- To train 3D-RetinaNet using the training script, simply specify the parameters listed in main.py as flags or change them manually.
You will need 4 GPUs (each with at least 10GB VRAM) to run training.
Let's assume that you extracted the dataset in /home/user/road/ and the weights in /home/user/kinetics-pt/. Then your training command, run from the root directory of this repo, is going to be:
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py /home/user/ /home/user/ /home/user/kinetics-pt/ --MODE=train --ARCH=resnet50 --MODEL_TYPE=I3D --DATASET=road --TRAIN_SUBSETS=train_3 --SEQ_LEN=8 --TEST_SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.0041
The second instance of /home/user/ in the above command specifies where checkpoint weights and logs are going to be stored. In this case, checkpoints and logs will be in /home/user/road/cache/<experiment-name>/.
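The command above follows a fixed shape: three positional paths (data root, save root, pre-trained weights) followed by `--NAME=value` flags. As a rough sketch of that shape, the hypothetical helper below (`build_command` is not part of this repository) assembles the same string from Python, which can be convenient when sweeping over settings. It only uses flags that appear in this README.

```python
import shlex


def build_command(data_root, save_root, model_path, gpus=(0, 1, 2, 3), **flags):
    """Assemble a main.py call: three positional paths, then --NAME=value flags.

    Hypothetical helper for illustration; the flag names are passed through
    unchanged, so they must match the ones defined in main.py.
    """
    args = ["python", "main.py", data_root, save_root, model_path]
    args += [f"--{name}={value}" for name, value in flags.items()]
    prefix = "CUDA_VISIBLE_DEVICES=" + ",".join(str(g) for g in gpus)
    return prefix + " " + " ".join(shlex.quote(a) for a in args)


road_cmd = build_command(
    "/home/user/", "/home/user/", "/home/user/kinetics-pt/",
    MODE="train", ARCH="resnet50", MODEL_TYPE="I3D", DATASET="road",
    TRAIN_SUBSETS="train_3", SEQ_LEN=8, TEST_SEQ_LEN=8, BATCH_SIZE=4, LR=0.0041,
)
print(road_cmd)
```

The printed string reproduces the ROAD training command shown above, so it can be pasted into a shell or passed to a job scheduler.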
Different parameters in main.py will result in different performance. The validation split is selected automatically based on the training split number in road.
You can train on the ucf24 dataset by changing some command-line parameters, as the training schedule and learning rate differ from those used for road training.
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py /home/user/ /home/user/ /home/user/kinetics-pt/ --MODE=train --ARCH=resnet50 --MODEL_TYPE=I3D --DATASET=ucf24 --TRAIN_SUBSETS=train --VAL_SUBSETS=val --SEQ_LEN=8 --TEST_SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.00245 --MILESTONES=6,8 --MAX_EPOCHS=10
- Training notes:
  - The network occupies almost 9.7GB of VRAM on each GPU. We used 1080Ti GPUs for training, and normal training takes about 24 hours on the road dataset.
  - During training, a checkpoint is saved every epoch and its frame-level frame-mean-ap on a subset of the validation split is logged.
  - The crucial parameters for the training process are LR, MILESTONES, MAX_EPOCHS, and BATCH_SIZE.
  - label_types is a very important variable: it defines which label types are used for training, and at validation time it is bumped up by one with the ego-action label type. It is created in data\dataset.py for each dataset separately, copied to args in main.py, and used again at evaluation time.
  - Event detection and triplet detection are used interchangeably in this code base.
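To see how LR, MILESTONES, and MAX_EPOCHS interact, here is a small sketch of a step schedule using the UCF24 values from the command above. It assumes that each milestone is an epoch at which the learning rate is multiplied by a decay factor, and it assumes a factor of 0.1; neither assumption is stated in this README, so check main.py for the actual behaviour. `lr_at_epoch` is a hypothetical name.

```python
def lr_at_epoch(epoch, base_lr, milestones, gamma=0.1):
    """Step schedule: multiply the rate by gamma once per milestone already reached.

    gamma=0.1 is an assumed decay factor, not a value taken from this repository.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# UCF24 settings from the command above: --LR=0.00245 --MILESTONES=6,8 --MAX_EPOCHS=10
for epoch in range(10):
    print(epoch, lr_at_epoch(epoch, 0.00245, [6, 8]))
```

Under these assumptions the rate stays at 0.00245 for epochs 0 to 5, drops by a factor of ten at epoch 6, and drops again at epoch 8.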
To generate the tubes and evaluate them, you first need frame-level detections, which are then linked. It is pretty simple in our case. Similar to the training command, you can run the following commands. These can run on a single GPU.
There are various MODEs in main.py. You can do each step independently or together. At the moment, the gen_dets mode generates and evaluates frame-wise detections and finally performs tube building and evaluation.
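To make the idea of linking frame-level detections into tubes concrete, the sketch below uses a simple greedy rule: each tube that reached the previous frame is extended with the unused box that overlaps its last box the most, provided the IoU clears a threshold. This is a generic illustration and not the tube-building method implemented in this repository (see tubes.py for that); the function names and the 0.25 default threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_detections(frames, iou_thresh=0.25):
    """Greedily link per-frame boxes into tubes.

    frames: one list per frame, each holding (box, score) pairs.
    Returns tubes as lists of (frame_index, box, score).
    """
    tubes = []
    for t, dets in enumerate(frames):
        unused = list(dets)
        for tube in tubes:
            last_t, last_box, _ = tube[-1]
            # Only tubes that are alive in the previous frame can be extended.
            if last_t != t - 1 or not unused:
                continue
            best = max(unused, key=lambda d: iou(last_box, d[0]))
            if iou(last_box, best[0]) >= iou_thresh:
                tube.append((t, best[0], best[1]))
                unused.remove(best)
        # Every detection left over starts a new tube.
        for box, score in unused:
            tubes.append([(t, box, score)])
    return tubes
```

A tube that finds no match in a frame simply ends there, so this sketch does not bridge missed detections or trim tubes in time.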
For ROAD dataset, run the following commands.
python main.py /home/user/ /home/user/ /home/user/kinetics-pt/ --MODE=gen_dets --MODEL_TYPE=I3D --TEST_SEQ_LEN=8 --TRAIN_SUBSETS=train_3 --SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.0041
and for UCF24
python main.py /home/user/ /home/user/ /home/user/kinetics-pt/ --MODE=gen_dets --ARCH=resnet50 --MODEL_TYPE=I3D --DATASET=ucf24 --TRAIN_SUBSETS=train --VAL_SUBSETS=val --SEQ_LEN=8 --TEST_SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.00245 --EVAL_EPOCHS=10 --GEN_NMS=80 --TOPK=20 --PATHS_IOUTH=0.25 --TRIM_METHOD=indiv
- Testing notes
  - Evaluation can be done on a single GPU for test sequence lengths up to 32.
  - No temporal trimming is performed for the ROAD dataset; however, we use class-specific alphas with the temporal trimming formulation described in the paper, which relies on temporal label consistency.
  - Please go through the hyperparameters in main.py to understand their functions.
  - After tube building, a detection .json file is dumped, which is used for evaluation; see tubes.py for more details.
  - See modules\evaluation.py and data\dataset.py for the frame-level and video-level evaluation code that computes frame-mAP and video-mAP.
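For readers unfamiliar with the frame-mAP metric mentioned above, the following is a generic sketch of average precision for a single class at a fixed IoU threshold; averaging it over classes gives a mean AP. It is a common textbook formulation written for illustration, and the repository's own evaluation code may differ in details such as interpolation or tie handling. All names and the 0.5 default threshold are assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_thresh=0.5):
    """AP for one class.

    detections: list of (frame_id, score, box).
    ground_truth: dict mapping frame_id to a list of boxes.
    """
    total_gt = sum(len(boxes) for boxes in ground_truth.values())
    if total_gt == 0:
        return 0.0
    matched = {f: [False] * len(boxes) for f, boxes in ground_truth.items()}
    tp_flags = []
    # Visit detections from most to least confident.
    for frame_id, _, box in sorted(detections, key=lambda d: d[1], reverse=True):
        gts = ground_truth.get(frame_id, [])
        overlaps = [box_iou(box, g) for g in gts]
        best = max(range(len(gts)), key=lambda i: overlaps[i], default=None)
        hit = (best is not None and overlaps[best] >= iou_thresh
               and not matched[frame_id][best])
        if hit:
            matched[frame_id][best] = True  # each ground-truth box matches once
        tp_flags.append(hit)
    # Precision and recall after each detection.
    precisions, recalls, tp = [], [], 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += int(flag)
        precisions.append(tp / rank)
        recalls.append(tp / total_gt)
    # Make precision non-increasing in recall, then integrate the curve.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

The same structure applies at video level if boxes are replaced by tubes and the overlap function by a spatio-temporal IoU.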
Here you will find the reproduced results from our paper. We use training split #3 for reproduction, on different machines from those used to generate the results for the paper. Below are the results on validation split #3, which is closer to the test set than the other splits in terms of environmental conditions. There is a small change in the learning rate here, so the results differ slightly from the paper. Also, there are six tasks in the ROAD dataset, which makes it difficult to balance the learning among tasks.
The model is set to I3D with a resnet50 backbone. Kinetics pre-trained weights are used for resnet50I3D; the download link is given above in the Requirements section. Results on split #3 with a test sequence length of 8 are reported as <frame-mAP>/<video-mAP>.
| Model | I3D |
|---|---|
| Agentness | 54.7/-- |
| Agent | 31.1/26.0 |
| Action | 22.0/16.1 |
| Location | 27.3/24.2 |
| Duplexes | 23.7/19.5 |
| Events/triplets | 13.9/15.5 |
| AV-action | 44.8/-- |
| UCF24 results | |
| Actionness | -- |
| Action detection | -- |
| ActionNess-framewise | -- |
- Currently, we provide the models from the above table:
  - Trained weights are available from my Google Drive.
  - These models can be used to reproduce the above table, which is almost the same as in our paper.
If this work has been helpful in your research, please cite the following articles:
@ARTICLE {singh2022road,
author = {Singh, Gurkirt and Akrigg, Stephen and Di Maio, Manuele and Fontana, Valentina and Alitappeh, Reza Javanmard and Saha, Suman and Jeddisaravi, Kossar and Yousefi, Farzad and Culley, Jacob and Nicholson, Tom and others},
journal = {IEEE Transactions on Pattern Analysis & Machine Intelligence},
title = {ROAD: The ROad event Awareness Dataset for autonomous Driving},
year = {2022},
volume = {},
number = {01},
issn = {1939-3539},
pages = {1-1},
keywords = {roads;autonomous vehicles;task analysis;videos;benchmark testing;decision making;vehicle dynamics},
doi = {10.1109/TPAMI.2022.3150906},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {feb}
}
@inproceedings{singh2017online,
title={Online real-time multiple spatiotemporal action localisation and prediction},
author={Singh, Gurkirt and Saha, Suman and Sapienza, Michael and Torr, Philip HS and Cuzzolin, Fabio},
booktitle={Proceedings of the IEEE International Conference on Computer Vision},
pages={3637--3646},
year={2017}
}
@article{maddern20171,
title={1 year, 1000 km: The Oxford RobotCar dataset},
author={Maddern, Will and Pascoe, Geoffrey and Linegar, Chris and Newman, Paul},
journal={The International Journal of Robotics Research},
volume={36},
number={1},
pages={3--15},
year={2017},
publisher={SAGE Publications Sage UK: London, England}
}