# Single-Stage Visual Query Localization in Egocentric Videos
## Installation

```bash
conda create --name vqloc python=3.8
conda activate vqloc

# Install pytorch, or use your own torch version
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.6 -c pytorch -c conda-forge

pip install -r requirements.txt
```
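To quickly confirm the environment works, you can check the installed PyTorch build and CUDA availability; the expected versions below simply mirror the install command above.

```bash
# Sanity check: should print 1.12.0, 11.6, and True on a CUDA-capable machine.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```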
We provide the model weights trained on the VQ2D dataset here.
## Train

- Please follow steps 1/2/4/5 of the vq2d baseline to process the dataset into video clips.
- Use `./train.sh` and change your training config accordingly (see the launch sketch after this list).
- The default training configuration requires about 200GB of GPU memory at most, e.g., 8 A40 GPUs with 40GB of VRAM each.
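As a rough illustration, a multi-GPU training launch could look like the sketch below. It assumes `train.sh` picks up the GPUs exposed through `CUDA_VISIBLE_DEVICES` and reads its training configuration internally; adjust both to your hardware and config.

```bash
# Hypothetical launch on 8 GPUs; edit the training config before running.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./train.sh
```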
## Inference

- Use `./inference_predict.sh` to run inference on the target video clips. Change the path to your model checkpoint accordingly (a combined sketch of the full pipeline follows this list).
- Use `python inference_results.py --cfg ./config/val.yaml` to format the results. Use `--eval` and `--cfg ./config/eval.yaml` for evaluation (to submit to the leaderboard).
- Use `python evaluate.py` to get the numbers. Please change `--pred-file` and `--gt-file` accordingly.
- The hard negative mining is not very stable, so we set `use_hnm=False` by default.
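Putting the inference steps together, a minimal end-to-end run might look like the sketch below. All file paths are placeholders (the actual prediction and annotation filenames depend on your config and dataset layout), so substitute your own.

```bash
# End-to-end inference sketch; every path below is a placeholder.
./inference_predict.sh                                        # predict on the target clips (set your checkpoint path first)
python inference_results.py --cfg ./config/val.yaml           # format the raw predictions
python inference_results.py --eval --cfg ./config/eval.yaml   # or: format for leaderboard submission
python evaluate.py \
    --pred-file /path/to/formatted_predictions.json \
    --gt-file /path/to/vq2d_annotations.json                  # compute the final numbers
```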
## Citation

```bibtex
@article{jiang2023vqloc,
  title={Single-Stage Visual Query Localization in Egocentric Videos},
  author={Jiang, Hanwen and Ramakrishnan, Santhosh and Grauman, Kristen},
  journal={ArXiv},
  year={2023},
  volume={2306.09324}
}
```