Model Zoo

Performance

Method	Resolution	Visual Tokens	LLM	MME	MMB	SEED	RealWorldQA	MMMU	MMVet	TextVQA	DocVQA	POPE
ConvLLaVA	768	144	7B	1541	68	68.8	55.9	36.3	44.8	59.1	44.8	87.3
ConvLLaVA	1024	256	7B	1553	68.8	69.3	58.8	35.1	44.4	62.5	48.5	87.7
ConvLLaVA	1536	576	7B	1575	68.7	70.2	59.9	35.8	45.9	65.8	59	87.3
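
The visual token counts in the table are consistent with the encoder shrinking each image side by a factor of 64 and emitting one token per remaining spatial position. That factor is inferred from the numbers above rather than stated in this document, so treat the following as a minimal sketch under that assumption. The function name `visual_tokens` and the `downsample` parameter are hypothetical, introduced only for illustration.

```python
def visual_tokens(resolution: int, downsample: int = 64) -> int:
    """Estimate the number of visual tokens for a square input image.

    Assumes (not confirmed by this document) that the encoder reduces each
    side by `downsample` and produces one token per spatial position.
    """
    side = resolution // downsample  # feature-map side length
    return side * side


if __name__ == "__main__":
    for res in (768, 1024, 1536):
        print(res, visual_tokens(res))
```

Under this assumption, doubling the resolution quadruples the number of tokens passed to the LLM, which is worth keeping in mind when picking a checkpoint.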

Method	Resolution	Visual Tokens	LLM	RefCOCO val	RefCOCO test-A	RefCOCO test-B	RefCOCO+ val	RefCOCO+ test-A	RefCOCO+ test-B	RefCOCOg val	RefCOCOg test	Avg
ConvLLaVA	768	144	7B	84.5	89.0	79.2	77.7	84.9	69.7	79.8	79.7	80.6
ConvLLaVA	1024	256	7B	85.5	89.6	78.8	79.3	86.1	70.3	80.6	81.2	81.4
ConvLLaVA	1536	576	7B	86.5	90.6	80.5	80.0	86.8	71.5	82.0	82.4	82.3

Download

We release checkpoints after vision-language pretraining and after visual instruction tuning. You can use the sft model directly, or finetune the vision-language pretraining checkpoints on your own data.

model	Huggingface	ModelScope	WiseModel
ConvLLaVA-768	pretrain, sft	pretrain, sft	pretrain, sft
ConvLLaVA-1024	pretrain, sft	pretrain, sft	pretrain, sft
ConvLLaVA-1536	pretrain, sft	pretrain, sft	pretrain, sft

In the table above, pretrain denotes the checkpoint after the second-stage vision-language pretraining, and sft denotes the checkpoint after the third-stage instruction tuning.

Usage of the scripts

The training scripts for the three stages are listed below:

Projector Initialization: stage1
Vision Language Pretraining: stage2
Instruction Tuning: stage3

Customize training

If you want to customize your model, you can directly load the second-stage pretrained visual encoder and LLM for instruction tuning. Training the 768-resolution model on LLaVA-Instruct-665k takes about 5 hours on a single node with 8 A800 GPUs.

Training from scratch

If you want to train from scratch, you can download our processed ConvNeXt model (modified from LAION ConvNeXt), then follow the three-stage training scripts to train the model.

ConvNeXt: huggingface, modelscope

You need to modify the configs in the downloaded folder to match the resolution you want to train your model on:

config.json: image_size
preprocessor_config: size, crop_size

Then load the weights and start training.
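
The config edits listed above can be scripted. The following is a minimal sketch, not the repository's own tooling: it assumes the downloaded folder contains `config.json` and `preprocessor_config.json` (the `.json` suffix on the latter is an assumption), that `image_size` is a top-level key, and that `size` and `crop_size` are either plain integers or dictionaries whose values should all equal the resolution. The helper names `set_resolution` and `_with_resolution` are hypothetical. Check the actual files before relying on it.

```python
import json
from pathlib import Path


def _with_resolution(value, resolution):
    """Return `value` with its size entries replaced by `resolution`.

    Handles both an integer size and a dict such as {"height": ..., "width": ...}.
    """
    if isinstance(value, dict):
        return {key: resolution for key in value}
    return resolution


def set_resolution(model_dir, resolution):
    """Rewrite the resolution fields in the two config files under `model_dir`."""
    model_dir = Path(model_dir)

    # config.json: image_size
    config_path = model_dir / "config.json"
    config = json.loads(config_path.read_text())
    config["image_size"] = resolution
    config_path.write_text(json.dumps(config, indent=2))

    # preprocessor_config.json (assumed file name): size, crop_size
    pre_path = model_dir / "preprocessor_config.json"
    pre = json.loads(pre_path.read_text())
    for key in ("size", "crop_size"):
        pre[key] = _with_resolution(pre.get(key), resolution)
    pre_path.write_text(json.dumps(pre, indent=2))
```

For example, `set_resolution("path/to/convnext", 1024)` would prepare a local copy for 1024-resolution training, where `path/to/convnext` is a placeholder for wherever you saved the download. All other keys in both files are left untouched.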
