The pursuit of artificial general intelligence (AGI) has been accelerated by Multimodal Large Language Models (MLLMs), which exhibit superior reasoning, generalization capabilities, and proficiency in processing multimodal inputs. A crucial milestone in the evolution of AGI is the attainment of human-level planning, a fundamental ability for making informed decisions in complex environments, and solving a wide range of real-world problems. Despite the impressive advancements in MLLMs, a question remains: How far are current MLLMs from achieving human-level planning?
To shed light on this question, we introduce EgoPlan-Bench, a comprehensive benchmark to evaluate the planning abilities of MLLMs in real-world scenarios from an egocentric perspective, mirroring human perception. EgoPlan-Bench emphasizes the evaluation of planning capabilities of MLLMs, featuring realistic tasks, diverse action plans, and intricate visual observations. Our rigorous evaluation of a wide range of MLLMs reveals that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning. To facilitate this advancement, we further present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
This repository describes how to use our evaluation data (EgoPlan-Val and EgoPlan-Test) and instruction-tuning data (EgoPlan-IT), and provides the corresponding code for evaluating and fine-tuning MLLMs on our benchmark. We welcome you to evaluate your models and to explore methods for improving their planning capabilities on our benchmark!
The EgoPlan datasets are constructed from two existing egocentric video sources: Epic-Kitchens-100 and Ego4D.
Download the RGB frames of Epic-Kitchens-100. The folder structure of the dataset is shown below:
EPIC-KITCHENS
└── P01
    └── rgb_frames
        └── P01_01
            ├── frame_0000000001.jpg
            └── ...
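As a rough illustration of how a frame index could be mapped onto this layout, the sketch below builds the path of a single RGB frame. The helper name `epic_frame_path` is hypothetical and not part of this repository. It assumes that the participant folder (e.g. `P01`) is the part of the video ID before the underscore and that frame files use 10-digit zero padding, as in `frame_0000000001.jpg`.

```python
from pathlib import Path


def epic_frame_path(root: str, video_id: str, frame_idx: int) -> Path:
    """Build the path of one RGB frame (hypothetical helper, not repo code)."""
    # Assumption: "P01_01" lives under the participant folder "P01".
    participant = video_id.split("_")[0]
    # Assumption: 10-digit zero padding, matching frame_0000000001.jpg.
    filename = f"frame_{frame_idx:010d}.jpg"
    return Path(root) / participant / "rgb_frames" / video_id / filename


if __name__ == "__main__":
    print(epic_frame_path("EPIC-KITCHENS", "P01_01", 1).as_posix())
```

Whether a sample's `current_observation_frame` can be used directly as `frame_idx` depends on how the frames were extracted, so it is worth checking against a few files on disk.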
Download the videos of Ego4D. The folder structure of the dataset is shown below:
Ego4D
└── v1_288p
    ├── 000786a7-3f9d-4fe6-bfb3-045b368f7d44.mp4
    └── ...
Questions from the human-verified evaluation data are formatted as multiple-choice problems. MLLMs need to select the most reasonable answer from four candidate choices. The primary metric is Accuracy.
We divide the evaluation data into two subsets: EgoPlan-Val (3,355 samples) for validation and EgoPlan-Test (1,584 samples) for testing. The ground-truth answers of EgoPlan-Test are not released publicly. An example from the validation set is shown below:
{
"sample_id": 115,
"video_source": "EpicKitchens",
"video_id": "P01_13",
"task_goal": "store cereal",
"question": "Considering the progress shown in the video and my current observation in the last frame, what action should I take next in order to store cereal?",
"choice_a": "put cereal box into cupboard",
"choice_b": "take cereal bag",
"choice_c": "open cupboard",
"choice_d": "put cereal bag into cereal box",
"golden_choice_idx": "A",
"answer": "put cereal box into cupboard",
"current_observation_frame": 760,
"task_progress_metadata": [
{
"narration_text": "take cereal bag",
"start_frame": 36,
"stop_frame": 105
},
{
"narration_text": "fold cereal bag",
"start_frame": 111,
"stop_frame": 253
},
{
"narration_text": "put cereal bag into cereal box",
"start_frame": 274,
"stop_frame": 456
},
{
"narration_text": "close cereal box",
"start_frame": 457,
"stop_frame": 606
},
{
"narration_text": "open cupboard",
"start_frame": 689,
"stop_frame": 760
}
]
}
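To make the multiple-choice format and the Accuracy metric concrete, here is a minimal sketch that turns one validation sample into a text prompt and scores a set of predicted letters. The function names, the prompt wording, and the choice to count missing predictions as wrong are assumptions made for illustration. The evaluation scripts in this repository may do this differently.

```python
CHOICE_KEYS = ("choice_a", "choice_b", "choice_c", "choice_d")
LETTERS = ("A", "B", "C", "D")


def format_prompt(sample: dict) -> str:
    """Render the question and its four candidate choices as plain text."""
    lines = [sample["question"]]
    for letter, key in zip(LETTERS, CHOICE_KEYS):
        lines.append(f"{letter}. {sample[key]}")
    # Assumption: this closing instruction is illustrative wording only.
    lines.append("Answer with the letter of the most reasonable choice.")
    return "\n".join(lines)


def accuracy(samples: list, predictions: dict) -> float:
    """Fraction of samples whose predicted letter equals golden_choice_idx.

    `predictions` maps sample_id -> letter. Missing predictions count as wrong.
    """
    if not samples:
        return 0.0
    correct = sum(
        predictions.get(s["sample_id"]) == s["golden_choice_idx"] for s in samples
    )
    return correct / len(samples)


# Abbreviated copy of the validation example above.
sample = {
    "sample_id": 115,
    "question": "Considering the progress shown in the video and my current "
                "observation in the last frame, what action should I take "
                "next in order to store cereal?",
    "choice_a": "put cereal box into cupboard",
    "choice_b": "take cereal bag",
    "choice_c": "open cupboard",
    "choice_d": "put cereal bag into cereal box",
    "golden_choice_idx": "A",
}

if __name__ == "__main__":
    print(format_prompt(sample))
    print(accuracy([sample], {115: "A"}))
```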
We provide an automatically constructed instruction-tuning dataset, EgoPlan-IT, which contains 50K samples for fine-tuning models. An example from EgoPlan-IT is shown below:
{
"sample_id": 39,
"video_source": "EpicKitchens",
"video_id": "P07_113",
"task_goal": "Cut and peel the onion",
"question": "Considering the progress shown in the video and my current observation in the last frame, what action should I take next in order to cut and peel the onion?",
"answer": "grab onion",
"current_observation_frame": 9308,
"task_progress_metadata": [
{
"narration_text": "open drawer",
"start_frame": 9162,
"stop_frame": 9203
},
{
"narration_text": "grab knife",
"start_frame": 9214,
"stop_frame": 9273
},
{
"narration_text": "close drawer",
"start_frame": 9272,
"stop_frame": 9303
}
],
"negative_answers": [
"open drawer",
"grab knife",
"close drawer",
"slice onion",
"remove peel from onion",
"peel onion"
]
}
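One possible use of the `negative_answers` field is to build multiple-choice questions that resemble the evaluation format. The sketch below samples three distractors and shuffles them with the correct answer. This is only an illustration under our own assumptions: `to_multiple_choice` is a hypothetical helper, and the fine-tuning code in this repository may consume `negative_answers` in a different way.

```python
import random
import string


def to_multiple_choice(sample: dict, num_choices: int = 4, seed: int = 0) -> dict:
    """Build a shuffled multiple-choice question from an EgoPlan-IT sample."""
    rng = random.Random(seed)
    # Drop duplicates (keeping order) and anything equal to the correct answer.
    negatives = [
        n for n in dict.fromkeys(sample["negative_answers"]) if n != sample["answer"]
    ]
    # Raises ValueError if there are fewer than num_choices - 1 negatives.
    distractors = rng.sample(negatives, num_choices - 1)
    options = distractors + [sample["answer"]]
    rng.shuffle(options)
    letters = string.ascii_uppercase[:num_choices]
    return {
        "question": sample["question"],
        "options": dict(zip(letters, options)),
        "golden_choice_idx": letters[options.index(sample["answer"])],
    }


# Abbreviated copy of the EgoPlan-IT example above.
it_sample = {
    "sample_id": 39,
    "question": "Considering the progress shown in the video and my current "
                "observation in the last frame, what action should I take "
                "next in order to cut and peel the onion?",
    "answer": "grab onion",
    "negative_answers": [
        "open drawer", "grab knife", "close drawer",
        "slice onion", "remove peel from onion", "peel onion",
    ],
}

if __name__ == "__main__":
    mc = to_multiple_choice(it_sample)
    for letter, text in mc["options"].items():
        print(f"{letter}. {text}")
    print("golden:", mc["golden_choice_idx"])
```

Fixing the seed keeps the generated options reproducible across runs, which helps when comparing training configurations.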
Clone the repo and install dependent packages:
git clone https://github.com/ChenYi99/EgoPlan.git
cd EgoPlan
pip install -r requirements.txt
Prepare egocentric videos: download the RGB frames of Epic-Kitchens-100 and the videos of Ego4D.
Prepare EgoPlan datasets: download the validation dataset EgoPlan_validation.json and the training dataset EgoPlan_IT.json, and put both JSON files under the directory data/.
For details of the data structure, please refer to Data.
We use Video-LLaMA as an example for evaluation and instruction-tuning.
- The checkpoint of the vanilla Video-LLaMA can be downloaded from Video-LLaMA-2-7B-Finetuned.
- Alternatively, the checkpoint of the Video-LLaMA that has been further tuned on our EgoPlan-IT can be downloaded from EgoPlan-Video-LLaMA-2-7B.
Video-LLaMA is based on Llama2 Chat 7B. The corresponding LLM weights can be downloaded from Llama-2-7b-chat-hf.
If the server cannot access the Internet, the following weights should be downloaded in advance:
- ViT (eva_vit_g.pth)
- Q-Former (blip2_pretrained_flant5xxl.pth)
- BERT (bert-base-uncased)
Configure the paths for the model weights in video_llama_eval_only_vl.yaml.
Set the paths for the project root, the Epic-Kitchens-100 RGB frames, and the Ego4D videos in eval_video_llama.sh.
Then, run the script on a single V100 (32G) GPU:
bash scripts/eval_video_llama.sh
Configure the paths for the model weights in egoplan_video_llama_eval_only_vl.yaml.
Set the paths for the project root, the Epic-Kitchens-100 RGB frames, and the Ego4D videos in eval_egoplan_video_llama.sh.
Then, run the script on a single V100 (32G) GPU:
bash scripts/eval_egoplan_video_llama.sh
To increase instruction diversity, in addition to EgoPlan-IT we also include a collection of 164K instruction samples, following Video-LLaMA:
- 3K image-based instructions from MiniGPT-4 [link].
- 150K image-based instructions from LLaVA [link]. The images can be downloaded from here.
- 11K video-based instructions from VideoChat [link]. The videos can be downloaded by following the instructions in the official GitHub repo of WebVid.
Configure the paths for the model weights and datasets in visionbranch_stage3_finetune_on_EgoPlan_IT.yaml.
Set the path for the project root in finetune_egoplan_video_llama.sh.
Then, run the script on eight V100 (32G) GPUs:
bash scripts/finetune_egoplan_video_llama.sh
We maintain an EgoPlan-Bench Leaderboard. To have your model's performance shown on the leaderboard, please contact [email protected] and attach your prediction files for the validation and test sets.
We ONLY accept ".json" files. Submissions should use the following format:
[
{
"sample_id": "int",
"label": "str"
},
...
]
where the "sample_id" field should be an integer and the "label" field should be a string within ["A","B","C","D"]. An example submission file for the validation set can be found here.
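Before sending a prediction file, it can help to check it against the format described above. The following is a minimal sketch of such a check. The `validate_submission` helper is hypothetical, and the duplicate `sample_id` check is our own assumption rather than a stated leaderboard rule.

```python
import json

VALID_LABELS = {"A", "B", "C", "D"}


def validate_submission(entries) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    if not isinstance(entries, list):
        return ["top-level JSON value must be a list"]
    errors = []
    seen = set()
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            errors.append(f"entry {i}: expected an object")
            continue
        sid = entry.get("sample_id")
        label = entry.get("label")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(sid, int) or isinstance(sid, bool):
            errors.append(f"entry {i}: sample_id must be an integer")
        elif sid in seen:
            # Assumption: each sample should be answered at most once.
            errors.append(f"entry {i}: duplicate sample_id {sid}")
        else:
            seen.add(sid)
        if not isinstance(label, str) or label not in VALID_LABELS:
            errors.append(f"entry {i}: label must be one of A, B, C, D")
    return errors


if __name__ == "__main__":
    predictions = [{"sample_id": 115, "label": "A"}]
    problems = validate_submission(predictions)
    if problems:
        print("\n".join(problems))
    else:
        # The output file name is an arbitrary placeholder.
        with open("predictions.json", "w", encoding="utf-8") as f:
            json.dump(predictions, f, indent=2)
```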
If you find our project helpful, please consider starring our repository and citing our paper as follows:
@misc{chen2024egoplanbenchbenchmarkingmultimodallarge,
title={EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning},
author={Yi Chen and Yuying Ge and Yixiao Ge and Mingyu Ding and Bohao Li and Rui Wang and Ruifeng Xu and Ying Shan and Xihui Liu},
year={2024},
eprint={2312.06722},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2312.06722},
}
This repo benefits from Epic-Kitchens, Ego4D, Video-LLaMA, LLaMA, MiniGPT-4, LLaVA, and VideoChat. Thanks for their wonderful work!