This repository contains all the code related to the participation of the Universitat Politècnica de Catalunya (UPC) in the ActivityNet Challenge 2016, held at CVPR.
The code allows reproducing and evaluating the model proposed for the classification and detection tasks over the video dataset. All the stages required to reproduce the results, as well as how to obtain predictions with the proposed model, are explained step by step.
The first step is to set up a Python virtual environment and work inside it. Python 2.7 is recommended, as all the experiments have been run with this version.
Next, install all the required packages:
virtualenv -p python2.7 --system-site-packages venv
source venv/bin/activate
pip install -r requirements.txt
In addition, OpenCV needs to be installed on the machine. Follow the OpenCV installation steps and make sure its Python package is installed as well. OpenCV can be installed system-wide: since the virtual environment has been created with
--system-site-packages
enabled, the cv package will be visible from inside the virtual
environment.
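Once installed, a quick way to check that the OpenCV Python bindings are visible from inside the virtual environment (cv2 is the usual module name for the bindings) is:
python -c "import cv2; print(cv2.__version__)"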
All the experiments have been done with the Keras framework using
Theano as the computational backend. The Keras version used is a
fork with some modifications that allow the 3D operations from C3D to run on GPU (the
original implementation crashed on GPU). To run it successfully on GPU, the file ~/.theanorc
should look like this:
[global]
floatX = float32
device = gpu
optimizer_including = cudnn
[lib]
cnmem = 1
[dnn]
enabled = True
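To quickly verify that Theano picks up this configuration and runs on the GPU, a one-liner like the following can be used (it simply prints the active device, assuming Theano is installed in the environment):
python -c "import theano; print(theano.config.device)"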
To run the full pipeline, first download the weights of both models, C3D and our trained model:
cd data/models
sh get_c3d_sports_weights.sh
sh get_temporal_location_weights.sh
Then, simply run the script specifying the input video:
python scripts/run_all_pipeline.py -i path/to/test/video.mp4
The dataset is made up of YouTube videos, so they have to be downloaded from the internet. To
do so, the youtube-dl package is used. To download the whole
dataset, the YouTube IDs of all the dataset videos have been extracted to the file videos_ids.lst.
Some of the videos are no longer available and have been removed from the list, but
some others require signing in to YouTube to download them. For this reason, the download script
requires a valid YouTube login. By default, all the videos are downloaded into
the directory ./data/videos
, but you can also specify the directory where you want to store them.
cd dataset
# This will download the videos to the default directory
sh download_videos.sh username password
# This will download the videos to the directory you specify
sh download_videos.sh username password /path/you/want
The next step is to pass all the videos through the C3D network, so first download the weights ported to Keras.
cd data/models
sh get_c3d_sports_weights.sh
Then, with the weights in place, there is a script that reads all the videos and extracts their features. Reading the videos also requires the OpenCV framework.
>> python -u scripts/extract_features.py -h
usage: extract_features.py [-h] [-d DIRECTORY] [-o OUTPUT] [-b BATCH_SIZE]
[-t NUM_THREADS] [-q QUEUE_SIZE] [-g NUM_GPUS]
Extract video features using C3D network
optional arguments:
-h, --help show this help message and exit
-d DIRECTORY, --videos-dir DIRECTORY
videos directory (default: data/videos)
-o OUTPUT, --output-dir OUTPUT
directory where to store the extracted features
(default: data/dataset)
-b BATCH_SIZE, --batch-size BATCH_SIZE
batch size when extracting features (default: 32)
-t NUM_THREADS, --num-threads NUM_THREADS
number of threads to fetch videos (default: 8)
-q QUEUE_SIZE, --queue-size QUEUE_SIZE
maximum number of elements at the queue when fetching
videos (default 12)
-g NUM_GPUS, --num-gpus NUM_GPUS
number of gpus to use for extracting features
(default: 1)
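For example, to extract the features of all the downloaded videos with the default paths, batch size and a single GPU (all values match the defaults listed above):
python -u scripts/extract_features.py -d data/videos -o data/dataset -b 32 -g 1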
Extracting features from such a big dataset (the ActivityNet videos take 600GB once downloaded) requires the whole process to be done very efficiently.
The script is based on the producer/consumer paradigm: multiple processes fetch videos from disk (a task that only requires CPU), then one or more (not tested) processes are created, each one working with a single GPU, loading the model and extracting the features. Finally, to store the extracted features safely, they are placed in a queue from which a single process writes them into an HDF5 file.
If any error appears when Theano tries to allocate memory, try running on a GPU with more memory or reduce the batch size.
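For reference, the layout described above can be sketched as a small, self-contained Python example. The actual video decoding and C3D feature extraction are replaced here by placeholders, so the names and values below are illustrative and not the repository's API:
from multiprocessing import Process, Queue

def producer(paths, video_queue):
    # CPU-bound: read videos from disk and push them into the queue
    for path in paths:
        video_queue.put(path)              # in the real script: the decoded clip
    video_queue.put(None)                  # sentinel: this producer is done

def consumer(video_queue, feature_queue, num_producers):
    # GPU-bound: load the model once, then process clips coming from the queue
    finished = 0
    while finished < num_producers:
        item = video_queue.get()
        if item is None:
            finished += 1
            continue
        feature_queue.put((item, 'features'))   # stand-in for model.predict(clip)
    feature_queue.put(None)

def writer(feature_queue):
    # Single writer process: the only one touching the output file (HDF5 in the repo)
    while True:
        item = feature_queue.get()
        if item is None:
            break
        print('would store features for %s' % item[0])

if __name__ == '__main__':
    video_queue = Queue(maxsize=12)        # mirrors the --queue-size option
    feature_queue = Queue()
    producers = [Process(target=producer, args=(['a.mp4', 'b.mp4'], video_queue))]
    workers = [Process(target=consumer, args=(video_queue, feature_queue, len(producers)))]
    sink = Process(target=writer, args=(feature_queue,))
    for p in producers + workers + [sink]:
        p.start()
    for p in producers + workers + [sink]:
        p.join()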
Once all the features have been extracted, the videos have to be arranged into batches that preserve temporal continuity between them, so that a recurrent neural network can be trained with a stateful approach.
The following script creates the stateful dataset for the training and validation data and stores it in an HDF5 file:
>> python scripts/create_stateful_dataset.py -h
usage: create_stateful_dataset.py [-h] [-i VIDEO_FEATURES_FILE]
[-v VIDEOS_INFO] [-l LABELS] [-o OUTPUT_DIR]
[-b BATCH_SIZE] [-t TIMESTEPS]
[-s {training,validation}]
Put all the videos features into the correct way to train a RNN in a stateful
way
optional arguments:
-h, --help show this help message and exit
-i VIDEO_FEATURES_FILE, --video-features VIDEO_FEATURES_FILE
HDF5 where the video features have been extracted
(default: data/dataset/video_features.hdf5)
-v VIDEOS_INFO, --videos-info VIDEOS_INFO
File containing the annotations of all the videos on
the dataset (default: dataset/videos.json)
-l LABELS, --labels LABELS
File containing the labels of the whole dataset
(default: dataset/labels.txt)
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
directory where to store the stateful dataset
(default: data/dataset)
-b BATCH_SIZE, --batch-size BATCH_SIZE
batch size desired to use for training (default: 256)
-t TIMESTEPS, --timesteps TIMESTEPS
timesteps desired for training the RNN (default: 20)
-s {training,validation}, --subset {training,validation}
Subset you want to create the stateful dataset
(default: training and validation)
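For example, building the stateful dataset for the training subset with the default batch size and timesteps (the paths are the defaults listed above):
python scripts/create_stateful_dataset.py -i data/dataset/video_features.hdf5 -b 256 -t 20 -s training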
The next step is to train the RNN using the provided script. The script also allows changing the configuration, such as the learning rate, the number of LSTM cells or even the number of layers. During training, snapshots of the model weights are stored so that the best model can later be used for prediction.
In src/visualize
there is a function to plot the training progress.
>> python scripts/train.py -h
usage: train.py [-h] [--id EXPERIMENT_ID] [-i INPUT_DATASET] [-n NUM_CELLS]
[--num-layers NUM_LAYERS] [-p DROPOUT_PROBABILITY]
[-b BATCH_SIZE] [-t TIMESTEPS] [-e EPOCHS] [-l LEARNING_RATE]
[-w LOSS_WEIGHT]
Train the RNN
optional arguments:
-h, --help show this help message and exit
--id EXPERIMENT_ID Experiment ID to track and not overwrite resulting
models
-i INPUT_DATASET, --input-data INPUT_DATASET
File where the stateful dataset is stored (default:
data/dataset/dataset_stateful.hdf5)
-n NUM_CELLS, --num-cells NUM_CELLS
Number of cells for each LSTM layer (default: 512)
--num-layers NUM_LAYERS
Number of LSTM layers of the network to train
(default: 1)
-p DROPOUT_PROBABILITY, --drop-prob DROPOUT_PROBABILITY
Dropout Probability (default: 0.5)
-b BATCH_SIZE, --batch-size BATCH_SIZE
batch size used to create the stateful dataset
(default: 256)
-t TIMESTEPS, --timesteps TIMESTEPS
timesteps used to create the stateful dataset
(default: 20)
-e EPOCHS, --epochs EPOCHS
number of epochs to last the training (default: 100)
-l LEARNING_RATE, --learning-rate LEARNING_RATE
learning rate for training (default: 1e-05)
-w LOSS_WEIGHT, --loss-weight LOSS_WEIGHT
value to weight the loss to the background samples
(default: 0.3)
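For instance, a training run with the default hyper-parameters could be launched as follows (the experiment ID is just an example name):
python scripts/train.py --id example_run -b 256 -t 20 -e 100 -l 1e-5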
Once the model is trained, it is time to predict the results for the validation and testing subsets. To do so:
>> python scripts/predict.py -h
usage: predict.py [-h] [--id EXPERIMENT_ID] [-i VIDEO_FEATURES] [-n NUM_CELLS]
[--num-layers NUM_LAYERS] [-e EPOCH] [-o OUTPUT_PATH]
[-s {validation,testing}]
Predict the output with the trained RNN
optional arguments:
-h, --help show this help message and exit
--id EXPERIMENT_ID Experiment ID to track and not overwrite resulting
models
-i VIDEO_FEATURES, --video-features VIDEO_FEATURES
File where the video features are stored (default:
data/dataset/video_features.hdf5)
-n NUM_CELLS, --num-cells NUM_CELLS
Number of cells for each LSTM layer when trained
(default: 512)
--num-layers NUM_LAYERS
Number of LSTM layers of the network to train when
trained (default: 1)
-e EPOCH, --epoch EPOCH
epoch at which you want to load the weights from the
trained model (default: 100)
-o OUTPUT_PATH, --output OUTPUT_PATH
path to store the output file (default: data/dataset)
-s {validation,testing}, --subset {validation,testing}
Subset you want to predict the output (default:
validation and testing)
Be sure to correctly specify the experiment_id
and the epoch
of the previously trained model in order to load the correct weights.
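For example, to obtain the predictions for the validation subset with the weights stored at epoch 100 of the example experiment used above:
python scripts/predict.py --id example_run -e 100 -s validation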
Finally, obtaining the classification and temporal localization of activities on the ActivityNet dataset requires some post-processing. The provided script lets you choose some values, but the defaults are the ones that gave the best performance. The script returns 4 json
files (classification and detection tasks for both the validation and testing subsets) with all the results in the format required by the ActivityNet Challenge.
>> python scripts/process_prediction.py -h
usage: process_prediction.py [-h] [--id EXPERIMENT_ID] [-p PREDICTIONS_PATH]
[-o OUTPUT_PATH] [-k SMOOTHING_K]
[-t ACTIVITY_THRESHOLD] [-s {validation,testing}]
Post-process the prediction of the RNN to obtain the classification and
temporal localization of the videos activity
optional arguments:
-h, --help show this help message and exit
--id EXPERIMENT_ID Experiment ID to track and not overwrite results
-p PREDICTIONS_PATH, --predictions-path PREDICTIONS_PATH
Path where the predictions file is stored (default:
data/dataset)
-o OUTPUT_PATH, --output-path OUTPUT_PATH
Path where is desired to store the results (default:
data/dataset)
-k SMOOTHING_K Smoothing factor at post-processing (default: 5)
-t ACTIVITY_THRESHOLD
Activity threshold at post-processing (default: 0.2)
-s {validation,testing}, --subset {validation,testing}
Subset you want to post-process the output (default:
validation and testing)
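For example, post-processing the validation predictions with the default smoothing factor and activity threshold (again using the example experiment ID):
python scripts/process_prediction.py --id example_run -k 5 -t 0.2 -s validation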