Model Zoo

Performance

| Method | Resolution | Visual Tokens | LLM | MME | MMB | SEED | RealWorldQA | MMMU | MMVet | Text | Doc | POPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ConvLLaVA | 768 | 144 | 7B | 1541 | 68 | 68.8 | 55.9 | 36.3 | 44.8 | 59.1 | 44.8 | 87.3 |
| ConvLLaVA | 1024 | 256 | 7B | 1553 | 68.8 | 69.3 | 58.8 | 35.1 | 44.4 | 62.5 | 48.5 | 87.7 |
| ConvLLaVA | 1536 | 576 | 7B | 1575 | 68.7 | 70.2 | 59.9 | 35.8 | 45.9 | 65.8 | 59 | 87.3 |

| Method | Resolution | Visual Tokens | LLM | RefCOCO val | RefCOCO test-A | RefCOCO test-B | RefCOCO+ val | RefCOCO+ test-A | RefCOCO+ test-B | RefCOCOg val | RefCOCOg test | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ConvLLaVA | 768 | 144 | 7B | 84.5 | 89.0 | 79.2 | 77.7 | 84.9 | 69.7 | 79.8 | 79.7 | 80.6 |
| ConvLLaVA | 1024 | 256 | 7B | 85.5 | 89.6 | 78.8 | 79.3 | 86.1 | 70.3 | 80.6 | 81.2 | 81.4 |
| ConvLLaVA | 1536 | 576 | 7B | 86.5 | 90.6 | 80.5 | 80.0 | 86.8 | 71.5 | 82.0 | 82.4 | 82.3 |

Download

We release checkpoints after vision-language pretraining and after visual instruction tuning. You can use the sft model directly, or finetune the vision-language pretraining checkpoints on your own data.

| model | Huggingface | ModelScope | WiseModel |
|---|---|---|---|
| ConvLLaVA-768 | pretrain, sft | pretrain, sft | pretrain, sft |
| ConvLLaVA-1024 | pretrain, sft | pretrain, sft | pretrain, sft |
| ConvLLaVA-1536 | pretrain, sft | pretrain, sft | pretrain, sft |

Above, pretrain denotes checkpoints after the second-stage vision-language pretraining, and sft denotes checkpoints after the third-stage instruction tuning.
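As one way to fetch a checkpoint programmatically, the sketch below uses `huggingface_hub`. Note that the `"ConvLLaVA/ConvLLaVA-<stage>-<resolution>"` repo-id pattern is an assumption for illustration; take the real ids from the links in the table above.

```python
def convllava_repo_id(resolution: int, stage: str) -> str:
    """Build a Hugging Face repo id for a ConvLLaVA checkpoint.

    The naming pattern here is an assumption -- check the download
    table for the actual repo ids.
    stage: "pretrain" (after stage 2) or "sft" (after stage 3).
    """
    if resolution not in (768, 1024, 1536):
        raise ValueError("ConvLLaVA is released at 768, 1024, or 1536")
    if stage not in ("pretrain", "sft"):
        raise ValueError("stage must be 'pretrain' or 'sft'")
    return f"ConvLLaVA/ConvLLaVA-{stage}-{resolution}"


def download_checkpoint(resolution: int, stage: str) -> str:
    """Download the full checkpoint snapshot and return its local path."""
    # Deferred import: requires `pip install huggingface_hub`.
    from huggingface_hub import snapshot_download

    return snapshot_download(convllava_repo_id(resolution, stage))
```

For example, `download_checkpoint(768, "sft")` would fetch the instruction-tuned 768-resolution model into the local Hugging Face cache.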

Usage of the scripts

The training scripts for the three stages are listed below:

  • Projector Initialization: stage1
  • Vision Language Pretraining: stage2
  • Instruction Tuning: stage3

Customize training

If you want to customize your model, you can directly load the second-stage pretrained visual encoder and LLM for instruction tuning. Training the 768-resolution model on LLaVA-Instruct-665k takes about 5 hours on a single node with 8 A800 GPUs.

Training from scratch

If you want to train from scratch, you can download our processed ConvNeXt model (modified from LAION ConvNeXt), then follow the three-stage training scripts to train the model.

ConvNeXt: huggingface, modelscope

You need to modify the config files in the downloaded folder to match the resolution you want to train your model at:

  • config.json: image_size
  • preprocessor_config.json: size, crop_size
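The config edits above can be sketched as follows. This assumes the fields are plain integers named exactly as in the bullet list; verify against the actual files in the downloaded folder, since some preprocessor configs store `size` as a nested dict instead.

```python
import json
from pathlib import Path


def set_resolution(model_dir: str, resolution: int) -> None:
    """Rewrite the resolution fields in the ConvNeXt config files.

    Field names follow the bullet list above (config.json: image_size;
    preprocessor_config.json: size, crop_size) and are assumed to be
    plain integers -- check the downloaded files before relying on this.
    """
    folder = Path(model_dir)

    cfg_path = folder / "config.json"
    cfg = json.loads(cfg_path.read_text())
    cfg["image_size"] = resolution
    cfg_path.write_text(json.dumps(cfg, indent=2))

    pp_path = folder / "preprocessor_config.json"
    pp = json.loads(pp_path.read_text())
    pp["size"] = resolution
    pp["crop_size"] = resolution
    pp_path.write_text(json.dumps(pp, indent=2))
```

For example, `set_resolution("path/to/convnext", 1536)` would prepare the folder for training the 1536-resolution model.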

Then load those weights and start training.