Method | Resolution | Visual Tokens | LLM | MME | MMB | SEED | RealWorldQA | MMMU | MMVet | TextVQA | DocVQA | POPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ConvLLaVA | 768 | 144 | 7B | 1541 | 68 | 68.8 | 55.9 | 36.3 | 44.8 | 59.1 | 44.8 | 87.3 |
ConvLLaVA | 1024 | 256 | 7B | 1553 | 68.8 | 69.3 | 58.8 | 35.1 | 44.4 | 62.5 | 48.5 | 87.7 |
ConvLLaVA | 1536 | 576 | 7B | 1575 | 68.7 | 70.2 | 59.9 | 35.8 | 45.9 | 65.8 | 59 | 87.3 |

Method | Resolution | Visual Tokens | LLM | RefCOCO val | RefCOCO test-A | RefCOCO test-B | RefCOCO+ val | RefCOCO+ test-A | RefCOCO+ test-B | RefCOCOg val | RefCOCOg test | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ConvLLaVA | 768 | 144 | 7B | 84.5 | 89.0 | 79.2 | 77.7 | 84.9 | 69.7 | 79.8 | 79.7 | 80.6 |
ConvLLaVA | 1024 | 256 | 7B | 85.5 | 89.6 | 78.8 | 79.3 | 86.1 | 70.3 | 80.6 | 81.2 | 81.4 |
ConvLLaVA | 1536 | 576 | 7B | 86.5 | 90.6 | 80.5 | 80.0 | 86.8 | 71.5 | 82.0 | 82.4 | 82.3 |
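
The visual token counts in the tables above are consistent with a 64x spatial downsampling per side (768 / 64 = 12, and 12 x 12 = 144). The sketch below reproduces that arithmetic. The helper name `visual_tokens` and the `downsample=64` default are assumptions inferred from the table, not part of the released code.

```python
def visual_tokens(resolution: int, downsample: int = 64) -> int:
    """Number of visual tokens for a square input image.

    Assumes the encoder reduces each side by `downsample` (64 is inferred
    from the table, not taken from the released code).
    """
    if resolution % downsample != 0:
        raise ValueError("resolution must be a multiple of the downsampling factor")
    side = resolution // downsample  # tokens along one side of the feature map
    return side * side


# Resolutions reported in the tables above.
for res in (768, 1024, 1536):
    print(res, visual_tokens(res))
```

Under this assumption, doubling the resolution quadruples the number of visual tokens passed to the LLM.
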

We release checkpoints after vision-language pretraining and after visual instruction tuning. You can use the sft model directly, or finetune the vision-language pretraining checkpoints on your own data.

model | Huggingface | ModelScope | WiseModel |
---|---|---|---|
ConvLLaVA-768 | pretrain, sft | pretrain, sft | pretrain, sft |
ConvLLaVA-1024 | pretrain, sft | pretrain, sft | pretrain, sft |
ConvLLaVA-1536 | pretrain, sft | pretrain, sft | pretrain, sft |

In the table above, pretrain refers to the checkpoints after the second-stage vision-language pretraining, and sft refers to the checkpoints after the third-stage instruction tuning.
The training scripts for the three stages are listed below:
If you want to customize your model, you can directly load the second-stage pretrained visual encoder and LLM for instruction tuning. Training the 768-resolution model with LLaVA-Instruct-665k takes about 5 hours on a single node with 8 A800 GPUs.
If you want to train from scratch, you can download our processed ConvNeXt model (modified from LAION ConvNeXt), then follow the three-stage training scripts to train the model.
ConvNeXt: huggingface, modelscope
You need to modify the config files in the downloaded folder to match the resolution you want to train your model at:
- config.json: image_size
- preprocessor_config: size, crop_size
Then load the weights and start training.
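
The config edits above can also be scripted. The following is a minimal sketch, not part of the released code: the helper name `set_resolution` is hypothetical, and it assumes that the preprocessor file is named `preprocessor_config.json`, that `image_size` is a top-level key in `config.json`, and that `size` and `crop_size` are either plain integers or dictionaries whose values should all equal the target resolution. Check the downloaded files before relying on it.

```python
import json
from pathlib import Path


def _resize(value, resolution):
    """Return `value` with every dimension replaced by `resolution`.

    Handles both a plain integer and a dict such as {"height": ..., "width": ...}.
    """
    if isinstance(value, dict):
        return {key: resolution for key in value}
    return resolution


def set_resolution(model_dir, resolution):
    """Rewrite the config files in `model_dir` for a new training resolution."""
    model_dir = Path(model_dir)

    # config.json: image_size (assumed to be a top-level key)
    config_path = model_dir / "config.json"
    config = json.loads(config_path.read_text())
    config["image_size"] = resolution
    config_path.write_text(json.dumps(config, indent=2))

    # preprocessor config: size, crop_size (file name is an assumption)
    pre_path = model_dir / "preprocessor_config.json"
    pre = json.loads(pre_path.read_text())
    for key in ("size", "crop_size"):
        pre[key] = _resize(pre.get(key), resolution)
    pre_path.write_text(json.dumps(pre, indent=2))
```

For example, `set_resolution("path/to/convnext", 1024)` would prepare a local copy of the ConvNeXt folder for 1024-resolution training, where `path/to/convnext` is a placeholder for wherever you downloaded the model.
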