TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
- Pre-trained models
For the VQA and image captioning tasks, we perform additional continued pre-training on 4M image-text pairs, starting from mplug.en.large, to obtain mplug.en.large.v2.
| Model | Visual Backbone | Text Enc Layers | Fusion Layers | Text Dec Layers | #params | Download |
|---|---|---|---|---|---|---|
- Pre-train Datasets
| | COCO | VG | SBU | CC3M | CC13M |
|---|---|---|---|---|---|
| image | 113K | 100K | 860K | 3M | 10M |
| text | 567K | 769K | 860K | 3M | 10M |
- PyTorch version >= 1.11.0
- Install other libraries via
pip install -r requirements.txt
Coming soon.
Download json files of downstream tasks
- VQA
- Download the VQA v2 dataset and the Visual Genome dataset from the original websites (VQA 2.0).
- Download and extract the provided dataset json files.
- In configs/vqa_mplug_base.yaml, set the paths for the json files and the image paths.
- Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
sh scripts/vqa_mplug_base.sh
- Evaluate the result using the official evaluation server.
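The VQA v2 evaluation server expects a single JSON file containing a list of `{"question_id", "answer"}` records. A minimal sketch of packaging predictions in that format (the file name and the sample question ids below are illustrative, not produced by the scripts in this repo):

```python
# Package VQA predictions for upload to the official evaluation server.
# Each entry pairs a question_id (int) with the predicted answer (str).
import json

predictions = [
    {"question_id": 262148000, "answer": "down"},
    {"question_id": 262148001, "answer": "2"},
]

with open("vqa_result.json", "w") as f:
    json.dump(predictions, f)
```

The server rejects files with missing or extra keys, so keep each record to exactly these two fields.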
- Image Captioning
- Download the COCO Caption dataset from the original website.
- Download and extract the provided dataset json files.
- Download the language evaluation tool (language_evaluation).
- In configs/caption_mplug_base.yaml, set the paths for the json files and the image paths.
- Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
sh scripts/caption_mplug_base.sh
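COCO-style caption evaluation tools consume a JSON list of `{"image_id", "caption"}` records. A minimal sketch of that result format (file name, ids, and captions are illustrative):

```python
# Write caption predictions in the COCO result format, one record per image.
import json

results = [
    {"image_id": 391895, "caption": "a man riding a bike on a dirt road"},
    {"image_id": 60623, "caption": "a group of people standing around a table"},
]

with open("caption_result.json", "w") as f:
    json.dump(results, f)
```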
- Image-Text Retrieval
- Download the MSCOCO or Flickr30k datasets from the original websites.
- Download and extract the provided dataset json files.
- In configs/retrieval_flickr30k_mplug_large.yaml or configs/retrieval_coco_mplug_large.yaml, set the paths for the json files and the image paths.
- Finetune the pre-trained checkpoint using 8 A100 GPUs:
sh scripts/retrieval_flickr30k_mplug_base.sh
sh scripts/retrieval_coco_mplug_base.sh
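Retrieval on COCO/Flickr30k is reported as Recall@K over a similarity matrix between images and texts. A minimal sketch of that metric, assuming the ground-truth text for image i sits at column i (the toy scores below are illustrative):

```python
# Image-to-text Recall@K: a query counts as a hit if its ground-truth
# text appears among the top-K highest-scoring columns of its row.
def recall_at_k(sim, k):
    hits = 0
    for i, row in enumerate(sim):
        topk = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += int(i in topk)
    return hits / len(sim)

sim = [
    [0.9, 0.2, 0.1],  # image 0: correct text ranked 1st
    [0.3, 0.7, 0.8],  # image 1: correct text ranked 2nd
    [0.2, 0.4, 0.7],  # image 2: correct text ranked 1st
]
print(recall_at_k(sim, 1))  # 2 of 3 queries hit at rank 1
```

Text-to-image recall is computed the same way on the transposed matrix.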
- Visual Grounding
- Download the RefCOCO datasets from the original websites.
- Download and extract the provided dataset json files.
- In configs/grounding_mplug_large.yaml, set the paths for the json files and the image paths. Data preparation can follow TransVG.
- Finetune the pre-trained checkpoint using 8 A100 GPUs:
sh scripts/grounding_mplug_base.sh
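Referring-expression grounding is typically scored as accuracy at an IoU threshold of 0.5: a predicted box counts as correct when its overlap with the ground-truth box exceeds 0.5. A minimal sketch, with illustrative boxes in (x1, y1, x2, y2) format:

```python
# Intersection-over-union between two axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (10, 10, 60, 60), (20, 20, 70, 70)
score = iou(pred, gt)
print(score, score > 0.5)  # this prediction falls just below the threshold
```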