Official code of HierVL: Learning Hierarchical Video-Language Embeddings, CVPR 2023.
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. We pretrain on Ego4D narrations and summaries and also transfer the representations to Charades-Ego, EPIC-KITCHENS and HowTo100M.
To create a conda environment with the required dependencies, run the following commands:
```bash
conda env create -f environment.yml
source activate hiervl
```
Please refer to the EgoVLP codebase for data preparation. We use the downsampled and chunked video outputs as the input to our method (the output of `utils/video_chunk.py`). For summary sentences, we provide the processed summary and narration hierarchy here. The `egosummary_full.csv` file we use is available here.
All references to the datasets must be set correctly to run the code. To help with this process, we have replaced all such paths with suitable placeholder strings and documented them in PATHS. Use `git grep <path>` to find all occurrences of a placeholder filepath and replace it with your processed path.
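A minimal sketch of that substitution, assuming a hypothetical placeholder string `DATASET_ROOT` (use the actual strings documented in PATHS):

```bash
PLACEHOLDER="DATASET_ROOT"          # hypothetical; see PATHS for the real strings
REAL_PATH="/mnt/data/ego4d_chunked" # your processed data location

# List the tracked files containing the placeholder, then substitute in place.
git grep -l "$PLACEHOLDER" | xargs sed -i "s|$PLACEHOLDER|$REAL_PATH|g"
```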
We use four nodes for distributed training; each node has 32 GB GPUs and 480 GB of CPU memory. The pretraining can be run as
```bash
python -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --master_addr $CHIEF_IP --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_egoaggregate.py --config configs/pt/egoaggregation.json
```
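The launcher variables come from your job scheduler; a hedged example for a 4-node run (all values are illustrative, not prescribed by this repository):

```bash
export HOST_NUM=4          # total number of nodes
export INDEX=0             # rank of this node (0..HOST_NUM-1), set per node
export CHIEF_IP=10.0.0.1   # address of the rank-0 (master) node
export HOST_GPU_NUM=8      # GPUs available on each node (illustrative count)
```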
We experiment mainly on SLURM, and the instructions to run this code on SLURM are given next.
To run the pretraining on a distributed SLURM system, copy the contents of `slurm_scripts` to this directory level and run

```bash
bash mover_trainer.sh job_name
```
The parameters of the SLURM job can be changed in the `trainer.sh` script. We use 4 nodes, each with 32 GB GPUs. The submission schedule first copies the required scripts to a different folder and then runs them from there. This copying ensures the code can be safely edited while a job waits in the SLURM queue.
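For orientation, a hypothetical sketch of the kind of SLURM resource request `trainer.sh` configures; the actual values live in `slurm_scripts/trainer.sh` and must match your cluster:

```bash
#!/bin/bash
# Hypothetical SLURM header; adjust partition, GPU count, and limits to your cluster.
#SBATCH --job-name=hiervl_pretrain
#SBATCH --nodes=4            # 4 nodes, as used for pretraining
#SBATCH --mem=480G           # CPU memory per node
#SBATCH --time=72:00:00      # wall-clock limit (illustrative)
# ...followed by the distributed launch command shown above.
```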
The pretraining checkpoint is available here.
Change the following flags to run the baselines and ablations:

- HierVL-Avg: change `self-attention` to `average` in `configs/pt/egoaggregation.json` (see the sketch after this list).
- HierVL-w/o Joint: set `catastrophic_forgetting_baseline` to True in `trainer/trainer_egoaggregate.py`.
- HierVL-w/o Hier: set `append_summary_baseline` to True in `EgoClip_EgoMCQ_dataset.py` and run EgoVLP pretraining.
- HierVL-w/o Summ: set `only_sa_no_summary_baseline` to True in `trainer/trainer_egoaggregate.py`.
- HierVL-w/o Summ <-> Narr: set `only_video_with_summary_baseline` to True in `trainer/trainer_egoaggregate.py`.
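For example, a minimal sketch of the HierVL-Avg switch, assuming the config stores the aggregation type as the literal string `self-attention`:

```bash
# Swap the aggregation module from self-attention to simple averaging.
sed -i 's/self-attention/average/g' configs/pt/egoaggregation.json
```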
To run the downstream tasks, modify the `trainer.sh` commands with the following flags:

- `--experiment charades --config configs/ft/charades.json` for Charades-Ego action classification downstream training
- `--experiment epic_mir --config configs/ft/epic.json` for EPIC-KITCHENS-100 MIR downstream training
- `--experiment howto100m --config configs/ft/howto100m.json` for HowTo100M long video classification
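For instance, a hedged sketch of a Charades-Ego finetuning launch, assuming the same entry point as pretraining (check `trainer.sh` for the exact script name):

```bash
python -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX \
    --master_addr $CHIEF_IP --nproc_per_node $HOST_GPU_NUM --master_port 8081 \
    run/train_egoaggregate.py --experiment charades --config configs/ft/charades.json
```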
To test the Charades-Ego performance, run

```bash
python run/test_charades.py
```

Remember to use the released finetuned checkpoint here or the zero-shot checkpoint here.
To test the EPIC-KITCHENS performance, run

```bash
python run/test_epic.py
```

Remember to use the released finetuned checkpoint here or the zero-shot checkpoint here.
Please open an issue in this repository (preferred for better visibility) or reach out to [email protected].
See the CONTRIBUTING file for how to help out.
If you use the code or the method, please cite the following paper:
```bibtex
@InProceedings{Ashutosh_2023_CVPR,
    author    = {Ashutosh, Kumar and Girdhar, Rohit and Torresani, Lorenzo and Grauman, Kristen},
    title     = {HierVL: Learning Hierarchical Video-Language Embeddings},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {23066-23078}
}
```
The pretraining and the Charades-Ego and EPIC-KITCHENS finetuning codebases are based on the EgoVLP repository. The Ego4D LTA code is based on the Ego4D baseline code. We thank the authors and maintainers of these codebases.
HierVL is licensed under the CC-BY-NC license.