Junyu Xie1, Tengda Han1, Max Bain1, Arsha Nagrani1, Gül Varol1 2, Weidi Xie1 3, Andrew Zisserman1
1 Visual Geometry Group, Department of Engineering Science, University of Oxford
2 LIGM, École des Ponts, Univ Gustave Eiffel, CNRS
3 CMIC, Shanghai Jiao Tong University
- Basic Dependencies: `pytorch=2.0.0`, `Pillow`, `pandas`, `decord`, `opencv`, `moviepy=1.0.3`, `transformers=4.37.2`, `accelerate==0.26.1`
- VideoLLaMA2: After installation, modify the `sys.path.append("/path/to/VideoLLaMA2")` in `stage1/main.py` and `stage1/utils.py`. Please download the VideoLLaMA2-7B checkpoint here.
- Set up the cache model path (for LLaMA3, etc.) by modifying `os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"` in `stage1/main.py` and `stage2/main.py`.
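For convenience, the two modifications above can be sketched together as they would appear near the top of `stage1/main.py` (the cache path also goes into `stage2/main.py`); the `/path/to/...` values are placeholders for your local setup:

```python
# Minimal sketch of the path setup described above (placeholder paths,
# replace with your local locations).
import os
import sys

# Make the VideoLLaMA2 repository importable (needed in stage1/main.py and stage1/utils.py).
sys.path.append("/path/to/VideoLLaMA2")

# Cache directory for Hugging Face models such as LLaMA3
# (needed in stage1/main.py and stage2/main.py).
os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"
```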
In this work, we evaluate our model on CMD-AD, MAD-Eval, and TV-AD.
- CMD-AD can be downloaded here.
- MAD-Eval can be downloaded here.
- TV-AD adopts a subset of TVQA as visual sources (3fps), and can be downloaded here. Each folder containing .jpg video frames needs to be converted to a .tar file. This can be done with the code provided in `tools/compress_subdir.py` (a sketch of the expected behaviour is shown after this list). For example,
```
# --root_dir: downloaded raw (.jpg folders) files from TVQA
# --save_dir: output directory for the compressed tar files
python tools/compress_subdir.py \
    --root_dir="resources/example_file_structures/tvad_raw/" \
    --save_dir="resources/example_file_structures/tvad/"
```
- All annotations can be found in
resources/annotations
- The AutoAD-Zero predictions can be downloaded here.
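For reference, the frame-folder-to-tar conversion performed by `tools/compress_subdir.py` can be sketched roughly as below. This is only an illustration of the expected behaviour (pack every frame folder under `--root_dir` into a `.tar` under `--save_dir`); the script shipped in the repository is the one to use.

```python
# Rough sketch of what tools/compress_subdir.py is expected to do:
# pack each .jpg frame folder under --root_dir into a .tar under --save_dir.
# Illustrative only; use the script provided in the repository.
import argparse
import os
import tarfile

def compress_subdirs(root_dir, save_dir):
    os.makedirs(save_dir, exist_ok=True)
    for name in sorted(os.listdir(root_dir)):
        sub_dir = os.path.join(root_dir, name)
        if not os.path.isdir(sub_dir):
            continue
        tar_path = os.path.join(save_dir, f"{name}.tar")
        with tarfile.open(tar_path, "w") as tar:
            # keep frames grouped under the folder name inside the archive
            tar.add(sub_dir, arcname=name)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--root_dir", required=True)
    parser.add_argument("--save_dir", required=True)
    args = parser.parse_args()
    compress_subdirs(args.root_dir, args.save_dir)
```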
The pre-computed character recognition results are available in `resources/annotations` (e.g. `resources/annotations/cmdad_anno_with_face_0.2_0.4.csv`), which can be directly fed into stage I (next step).
It is also possible to run the character recognition code from scratch. Please refer to the `char_recog` folder for more details.
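Before running stage 1, a quick way to sanity-check one of these annotation files is to load it with pandas; no particular column names are assumed here, the snippet simply prints whatever the csv contains:

```python
# Quick look at a pre-computed annotation file (columns are whatever the
# csv provides, e.g. the predicted face IDs and bounding boxes).
import pandas as pd

anno = pd.read_csv("resources/annotations/cmdad_anno_with_face_0.2_0.4.csv")
print(anno.columns.tolist())
print(anno.head())
```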
```
# --dataset:       e.g. "cmdad"
# --anno_path:     e.g. "resources/annotations/cmdad_anno_with_face_0.2_0.4.csv"
# --charbank_path: e.g. "resources/charbanks/cmdad_charbank.json"
python stage1/main.py \
    --dataset={dataset} \
    --video_dir={video_dir} \
    --anno_path={anno_path} \
    --charbank_path={charbank_path} \
    --model_path={videollama2_ckpt_path} \
    --output_dir={output_dir}
```
- `--dataset`: choices are `cmdad`, `madeval`, and `tvad`.
- `--video_dir`: directory of video datasets; example file structures can be found in `resources/example_file_structures` (files are empty, for reference only).
- `--anno_path`: path to AD annotations (with predicted face IDs and bboxes), available in `resources/annotations`.
- `--charbank_path`: path to external character banks, available in `resources/charbanks`.
- `--model_path`: path to the VideoLLaMA2 checkpoint.
- `--output_dir`: directory to save the output csv.
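As an illustration, a fully specified stage 1 call on CMD-AD could look like the following; the video directory, checkpoint path, and output directory are placeholders for your local setup, not values required by the repository:

```python
# Hypothetical stage 1 invocation on CMD-AD; adjust the placeholder paths.
import subprocess

subprocess.run([
    "python", "stage1/main.py",
    "--dataset=cmdad",
    "--video_dir=/path/to/CMD-AD",            # local video directory (placeholder)
    "--anno_path=resources/annotations/cmdad_anno_with_face_0.2_0.4.csv",
    "--charbank_path=resources/charbanks/cmdad_charbank.json",
    "--model_path=/path/to/VideoLLaMA2-7B",   # downloaded checkpoint (placeholder)
    "--output_dir=outputs/stage1",            # any writable directory (placeholder)
], check=True)
```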
```
# --dataset: e.g. "cmdad"
python stage2/main.py \
    --dataset={dataset} \
    --pred_path={stage1_result_path}
```
- `--dataset`: choices are `cmdad`, `madeval`, and `tvad`.
- `--pred_path`: path to the csv file saved by stage 1.
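And a matching stage 2 call, picking up the csv written by stage 1 (the exact filename below is a placeholder; use whatever stage 1 wrote into your `--output_dir`):

```python
# Hypothetical stage 2 invocation; the stage 1 csv path is a placeholder.
import subprocess

subprocess.run([
    "python", "stage2/main.py",
    "--dataset=cmdad",
    "--pred_path=outputs/stage1/<stage1_result>.csv",  # replace with the actual stage 1 output csv
], check=True)
```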
If you find this repository helpful, please consider citing our work:
```
@InProceedings{xie2024autoad0,
    title={AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
    author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and G\"ul Varol and Weidi Xie and Andrew Zisserman},
    booktitle={ACCV},
    year={2024}
}
```
VideoLLaMA2: https://github.com/DAMO-NLP-SG/VideoLLaMA2
LLaMA3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct