HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics


This is the official repository of our papers:

  • "HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics"
  • "BREASE: Bridging Episodes and Semantics, A Novel Framework for Long-Form Video Understanding" (ECCVW'24).

[Teaser figure]

Model Overview

[Model overview figure]

🔥 News

  • [2024.08.24] ⌨️ Our short paper "BREASE: Bridging Episodes and Semantics, A Novel Framework for Long-Form Video Understanding" has been accepted to the EVAL-FoMo workshop at ECCV'24.

Requirements

Clone the repository and install HERMES by running:

git clone https://github.com/joslefaure/HERMES.git
cd HERMES
pip install -e .
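
If you prefer to work inside a conda environment, a minimal sketch is below; the environment name and Python version are assumptions rather than values taken from this repo.

# Hypothetical conda setup; adjust the name and Python version as needed
conda create -n hermes python=3.9 -y
conda activate hermes
pip install -e .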

Supported Datasets

Prepare MovieChat-1k

  1. Download the training data (if you want to fine-tune HERMES) from here and the test data from here.

  2. Extract the frames at 10 FPS and organize them as follows (a sketch of one way to do this with ffmpeg is shown after the tree):

├── data
    └── moviechat
        ├── annotation
        ├── frames
            └── {video_id}
                ├── frame000001.jpg
                ├── ...
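
One way to produce this layout is sketched below; the input video path and VIDEO_ID are placeholders, while the 10 FPS rate and frame naming follow the tree above.

# Placeholder paths; adjust to where your MovieChat-1k videos live
VIDEO_ID=example_video
mkdir -p data/moviechat/frames/${VIDEO_ID}
# Sample at 10 FPS and write frame000001.jpg, frame000002.jpg, ...
ffmpeg -i videos/${VIDEO_ID}.mp4 -vf fps=10 data/moviechat/frames/${VIDEO_ID}/frame%06d.jpg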

Running

Download Pre-trained LLM

We use Vicuna-v1.1 as our pre-trained LLM (we report results with the 7B model only). You can download the weights from this link and arrange them in the format below.

We load bert-base-uncased locally as well, so it is included in the layout. Download it from there.

├── llm
     ├── vicuna-7b
     ├── vicuna-13b
     ├── bert-base-uncased
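
As a sketch, the bert-base-uncased part of this layout can be fetched with the Hugging Face CLI (requires a recent huggingface_hub); the Vicuna weights still come from the link above.

mkdir -p llm
# Fetch bert-base-uncased from the Hugging Face Hub
huggingface-cli download bert-base-uncased --local-dir llm/bert-base-uncased
# Place the Vicuna weights downloaded from the link above under llm/vicuna-7b (and llm/vicuna-13b if used)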

Inference

We run inference on 4 V100 GPUs (32 GB).

First, export your OpenAI API key (only needed for the MovieChat dataset, as we use GPT-3.5 for scoring): export OPENAI_API_KEY='sk-*****'. For the other datasets, we report top-1 accuracy.

# Zero-shot
bash run_scripts/moviechat/test.sh

# Fully-supervised
bash run_scripts/moviechat/test.sh path/to/your/model.pth

The same applies to the other datasets; all scripts are included in run_scripts (see the generic pattern below).
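
The scripts follow a common pattern; {dataset} below is a placeholder for any folder under run_scripts.

# Zero-shot
bash run_scripts/{dataset}/test.sh

# Fully-supervised, with a fine-tuned checkpoint
bash run_scripts/{dataset}/test.sh path/to/your/model.pth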

Pretrained Checkpoints

Coming Soon

Train

We train the model on 8 V100 GPUs (32 GB).

bash run_scripts/{dataset}/train.sh
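
For example, to fine-tune on MovieChat-1k (assuming the frames and annotations are organized as described above):

bash run_scripts/moviechat/train.sh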

Citation

If you find our code or our paper useful for your research, please ★ star this repo and cite the following paper:

@misc{faure2024bridgingepisodessemanticsnovel,
      title={Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding}, 
      author={Gueter Josmy Faure and Jia-Fong Yeh and Min-Hung Chen and Hung-Ting Su and Winston H. Hsu and Shang-Hong Lai},
      year={2024},
      eprint={2408.17443},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.17443}, 
}

Acknowledgement

We thank the authors of the following repositories for open-sourcing their code.
