[NeurIPS 2024] Official PyTorch implementation of Mamba-based traversal of rationale (Meteor), which improves vision-language performance across diverse capabilities.

ByungKwanLee/Meteor

Meteor: Mamba-based traversal of rationale for Large Language and Vision Models [ArXiv]

📰 News

  • An online demo of Meteor is now available on 🤗 Hugging Face Spaces, thanks to ZeroGPU support (NVIDIA A100) from the Hugging Face staff. ⚠️ Warning: input queries are limited, and optimization libraries such as Causal-Conv1d and Mamba-SSM cannot be used within the Space, so inference is slower there than with this official repository.
  • Meteor has been featured in 🤗 Hugging Face Daily Papers.
  • Meteor is now available on 🤗 Hugging Face Models: Meteor-Mamba, Meteor-MLM.
  • The curated 1.1M question-rationale-answer triples are now available on 🤗 Hugging Face Datasets.
  • The preprint of Meteor has been uploaded to arXiv.

Official PyTorch implementation of Mamba-based traversal of rationale (Meteor), which improves performance on numerous vision-language benchmarks with an efficient model size. The code was developed from scratch, and I have tried to make it more readable and simpler than LLaVA, whose code is structured in a relatively complex way.

The contributions of Meteor can be summarized as follows:

  • Curated 1.1M question-rationale-answer triples.
  • Meteor is an efficient 7B model compared with much larger LLVMs.
  • Meteor-7B acquires diverse capabilities and shows surprisingly strong vision-language performance.

💡 Highlights

Open-source LLVMs with Standard Model Size

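| LLVMs | SQA-IMG | POPE | MME | MMB | MathVista | SEED-IMG | MM-Vet | LLaVA-W |
|---|---|---|---|---|---|---|---|---|
| Yi-VL-6B | 71.7 | 82.5 | 1915 | 64.2 | 29.7 | 67.5 | 32.1 | 51.9 |
| LLaVA-NeXT-7B | 70.1 | 86.5 | 1851 | 69.6 | 34.6 | 70.2 | 43.9 | 72.3 |
| MM1-7B | 72.6 | 86.6 | 1858 | 72.3 | 35.9 | 70.9 | 42.1 | - |
| Meteor-7B | 88.3 | 88.7 | 2229 | 82.9 | 53.4 | 75.0 | 57.3 | 87.1 |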
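
The comparison in this table can be checked mechanically. The sketch below copies the four rows into a dictionary and reports, for each benchmark, Meteor-7B's margin over the best other model listed. The names `BENCHMARKS`, `SCORES`, and `margins` are illustrative and are not part of this repository.

```python
# Scores copied from the standard-size table above; None marks a missing entry ("-").
BENCHMARKS = ["SQA-IMG", "POPE", "MME", "MMB", "MathVista", "SEED-IMG", "MM-Vet", "LLaVA-W"]
SCORES = {
    "Yi-VL-6B":      [71.7, 82.5, 1915, 64.2, 29.7, 67.5, 32.1, 51.9],
    "LLaVA-NeXT-7B": [70.1, 86.5, 1851, 69.6, 34.6, 70.2, 43.9, 72.3],
    "MM1-7B":        [72.6, 86.6, 1858, 72.3, 35.9, 70.9, 42.1, None],
    "Meteor-7B":     [88.3, 88.7, 2229, 82.9, 53.4, 75.0, 57.3, 87.1],
}


def margins(scores, target="Meteor-7B"):
    """Return {benchmark: target score minus the best score among the other models}."""
    result = {}
    for i, name in enumerate(BENCHMARKS):
        # Ignore the target model and any missing entries when finding the best baseline.
        others = [row[i] for model, row in scores.items()
                  if model != target and row[i] is not None]
        result[name] = round(scores[target][i] - max(others), 1)
    return result


if __name__ == "__main__":
    for name, gap in margins(SCORES).items():
        print(f"{name:10s} {gap:+}")
```

A positive margin in every column means Meteor-7B has the highest reported score among the models in this table.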

Open-source LLVMs with Large Model Sizes

| LLVMs | AI2D | ChartQA | MME | MMB | MathVista | MM-Vet | LLaVA-W |
|---|---|---|---|---|---|---|---|
| InternVL1.5-40B | 79.0 | 68.0 | 2175 | 82.2 | 47.7 | 48.9 | - |
| InternVL1.5-26B | 80.7 | 83.8 | 2188 | 82.2 | 53.5 | 62.8 | - |
| MM1-30B | - | - | 2069 | 75.1 | 39.4 | 48.7 | - |
| MiniGemini-34B | - | - | 2105 | 79.6 | 38.9 | 53.0 | - |
| MiniGemini-HD-34B | - | - | 2141 | 80.6 | 43.3 | 59.3 | - |
| LLaVA-NeXT-8B | 71.6 | 69.5 | 1972 | 72.1 | 37.5 | - | 80.1 |
| LLaVA-NeXT-34B | 74.9 | 68.7 | 2030 | 79.3 | 46.0 | 57.4 | 88.8 |
| LLaVA-NeXT-72B | 77.4 | 77.0 | 2159 | 80.5 | 46.6 | - | 89.2 |
| LLaVA-NeXT-110B | 80.4 | 80.4 | 2201 | 80.5 | 49.0 | - | 90.4 |
| Meteor-7B | 77.9 | 74.9 | 2229 | 82.9 | 53.4 | 57.3 | 87.1 |

Closed-source LLVMs

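| LLVMs | SQA-IMG | AI2D | ChartQA | MME | MMB | MathVista | SEED-IMG | MMStar |
|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Plus | 71.6 | 75.9 | 78.1 | 2183 | 67.0 | 43.3 | 72.7 | 39.7 |
| Gemini-Pro | 80.1 | 73.9 | 74.1 | 1933 | 73.6 | 45.2 | 70.7 | 41.6 |
| GPT-4V | 84.6 | 78.2 | 78.5 | 1927 | 77.0 | 49.9 | 69.1 | 46.1 |
| Meteor-7B | 88.3 | 77.9 | 74.9 | 2229 | 82.9 | 53.4 | 75.0 | 52.8 |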

😎 How to run the demo?

Run the following commands in order.

bash install
pip install -r requirements.txt

Then run the demo (enjoy Meteor).

python demo.py
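
The News section mentions Causal-Conv1d and Mamba-SSM as optimization libraries used with this repository. If the demo fails at import time, a quick check such as the one below can show which packages are missing. This is a sketch and not part of the repository: the helper `find_missing` is hypothetical, and the import names `torch`, `causal_conv1d`, and `mamba_ssm` are assumptions about what the install steps provide.

```python
import importlib.util

# Assumed import names; adjust them to match what `bash install` and
# requirements.txt actually install in your environment.
REQUIRED = ["torch", "causal_conv1d", "mamba_ssm"]


def find_missing(modules):
    """Return the top-level module names from `modules` that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

`importlib.util.find_spec` only locates a module without importing it, so the check is fast and does not initialize the GPU.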

(Optional) If you want to build your own 📻 Gradio demo, run the following file or modify it to fit your style.

python app.py

(Optional) If you want to explore the curated question-rationale-answer triples, step through the following file in a debugger.

python check_dataset.py

(Optional) If you want to run the vision-language evaluation, run the following script.

bash run

📋 Gathered & Curated Dataset Description

Gathered Total: 2130830, 2.1M

------------------------------
* Real-World Image: 755k
* Document & Chart & Diagram & Sign & Symbol: 627k
* Math: 747k
    - Math with Vision: 180k
    - Math with Text only: 566k
------------------------------

- ShareGPT4V-Caption [without SAM] (91021, 91k)
- ShareGPT4V-Instruction [Without few samples of OCR-VQA] (664703, 664k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (27670, 27k)
- DocDownstream (574268, 574k)
- DocReason (25877, 25k)
- GLLaVA-Align (60252, 60k)
- GLLaVA-QA (117205, 117k)
- MathVision (3040, 3k)
- MathInstruct [TextOnlyDataset] (262040, 262k)
- MathPlus [TextOnlyDataset] (304754, 304k)

Curated Total: 1059382, 1.1M

--------------------------------------------
Real-World Image: 338K
Document & Chart & Diagram & Sign & Symbol: 379K
Math: 342K
     Math with Vision: 165K
     Math with Text only: 177K
--------------------------------------------


- ShareGPT4V-Caption (72507, 73K)
- ShareGPT4V-Instruction (266072, 266K)
- MiniGemini-Instruction (26885, 27K)
- DocDownstream (298748, 299K)
- DocReason (53065, 53K)
- GLLaVA (162378, 162K)
- MathVision (2992, 3K)
- MathInstruct (81496, 81K)
- MathPlus (95239, 95K)
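
The totals and category subtotals above follow from the per-dataset counts. A minimal sketch that recomputes them is shown below. The assignment of datasets to categories is inferred from the numbers and the dataset names rather than stated explicitly here, and the names `GATHERED`, `CURATED`, `subtotals`, and `total` are illustrative.

```python
# Per-dataset sample counts copied from the lists above, grouped by category.
# The grouping is an inference that happens to reproduce the stated subtotals.
GATHERED = {
    "real_world": {"ShareGPT4V-Caption": 91021, "ShareGPT4V-Instruction": 664703},
    "document": {"MiniGemini-Instruction": 27670, "DocDownstream": 574268, "DocReason": 25877},
    "math_vision": {"GLLaVA-Align": 60252, "GLLaVA-QA": 117205, "MathVision": 3040},
    "math_text": {"MathInstruct": 262040, "MathPlus": 304754},
}
CURATED = {
    "real_world": {"ShareGPT4V-Caption": 72507, "ShareGPT4V-Instruction": 266072},
    "document": {"MiniGemini-Instruction": 26885, "DocDownstream": 298748, "DocReason": 53065},
    "math_vision": {"GLLaVA": 162378, "MathVision": 2992},
    "math_text": {"MathInstruct": 81496, "MathPlus": 95239},
}


def subtotals(groups):
    """Sum the dataset counts inside each category."""
    return {category: sum(counts.values()) for category, counts in groups.items()}


def total(groups):
    """Sum all dataset counts across every category."""
    return sum(subtotals(groups).values())


if __name__ == "__main__":
    for label, groups in (("Gathered", GATHERED), ("Curated", CURATED)):
        print(label, total(groups), subtotals(groups))
```

The computed totals are 2130830 for the gathered set and 1059382 for the curated set, matching the figures stated above.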

🚀 Download Training Datasets

We collect the following eight datasets. For MiniGemini, we use only the data samples for DocVQA, ChartQA, DVQA, and AI2D, so you do not need to download all of the MiniGemini data.

Gathered Dataset Layout

Meteor_Dataset_Path
├── llava                                                       # ShareGPT4V
│   └── llava_pretrain
│       └── images
├── coco                                                        # ShareGPT4V
│   └── train2017
├── sam                                                         # ShareGPT4V
│   └── images
├── gqa                                                         # ShareGPT4V
│   └── images
├── ocr_vqa                                                     # ShareGPT4V
│   └── images
├── textvqa                                                     # ShareGPT4V
│   └── train_images
├── vg                                                          # ShareGPT4V
│   ├── VG_100K
│   └── VG_100K_2
├── share_textvqa                                               # ShareGPT4V
│   └── images
├── web-celebrity                                               # ShareGPT4V
│   └── images
├── web-landmark                                                # ShareGPT4V
│   └── images
├── wikiart                                                     # ShareGPT4V
│   └── images
├── share_textvqa                                               # ShareGPT4V
│   └── images
├── docvqa                                                      # MiniGemini
│   └── images
├── chartqa                                                     # MiniGemini
│   └── train
│       └── images
├── dvqa                                                        # MiniGemini
│   └── images
├── ai2d                                                        # MiniGemini
│   └── images
├── imgs                                                        # DocDownstream & DocReason
│   └── ChartQA
│   └── DUE_Benchmark
│       └── DeepForm
│       └── DocVQA
│       └── InfographicsVQA
│       └── KleisterCharity
│       └── TabFact
│       └── WikiTableQuestions
│   └── TextCaps
│   └── TextVQA
│   └── VisualMRC
├── geo3k                                                       # GLLaVA
│   └── train
├── geoqa_plus                                                  # GLLaVA
├── images                                                      # MathVision
│
├── sharegpt4v_instruct_gpt4-vision_cap100k.json                # ShareGPT4V-Caption
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json  # ShareGPT4V-Instruction
├── train.jsonl                                                 # DocDownstream
├── detailed_explanation.jsonl                                  # DocReason
├── minigemini_instruction.json                                 # MiniGemini-Instruction
├── gllava_align.parquet                                        # GLLaVA-Align
├── gllava_qa.parquet                                           # GLLaVA-QA
├── mathvision.parquet                                          # MathVision
├── MathInstruct.json                                           # MathInstruct
└── mathplus.parquet                                            # MathPlus
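
Before training, it can help to confirm that the downloaded files match this layout. The sketch below checks a handful of the paths listed above. `missing_entries`, the `EXPECTED` list, and the script name `check_layout.py` are illustrative and not provided by this repository, and the list is deliberately partial.

```python
import sys
from pathlib import Path

# A partial list of relative paths taken from the layout above; extend as needed.
EXPECTED = [
    "llava/llava_pretrain/images",
    "coco/train2017",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "chartqa/train/images",
    "imgs/DUE_Benchmark/DocVQA",
    "geo3k/train",
    "sharegpt4v_instruct_gpt4-vision_cap100k.json",
    "train.jsonl",
    "gllava_qa.parquet",
    "MathInstruct.json",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Hypothetical usage: python check_layout.py /path/to/Meteor_Dataset_Path
    dataset_root = sys.argv[1] if len(sys.argv) > 1 else "Meteor_Dataset_Path"
    for rel in missing_entries(dataset_root):
        print("missing:", rel)
```

The check only tests for existence, so it does not verify that an image folder is complete or that an annotation file is well formed.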

📂 Evaluation Benchmarks

The evaluation datasets are listed below. After downloading them, place each dataset in a folder according to the following directory layout.

Evaluation Dataset Directory Layout

Evaluation_Dataset_Path
├── LLVisionQA-QBench               # Q-Bench
├── ScienceQA                       # SQA-IMG
├── ai2d                            # AI2D
├── chartqa                         # ChartQA
├── SEED-Bench                      # SEED-IMG
├── POPE                            # POPE
├── HallusionBench                  # HallusionBench
├── MME_Benchmark_release_version   # MME
├── MathVista                       # MathVista
├── MMBench                         # MMB
├── mm-vet                          # MM-Vet
├── llava-bench-in-the-wild         # LLaVA Bench in the Wild
├── MMStar                          # MMStar
└── MathVerse                       # MathVerse
