TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
- Pre-trained models
For the VQA and image captioning tasks, we perform additional continued pre-training on 4M image-text pairs, starting from mplug.en.large, to obtain mplug.en.large.v2.
| Model | Visual Backbone | Text Enc Layers | Fusion Layers | Text Dec Layers | #params | Download |
|---|---|---|---|---|---|---|
- Pre-train Datasets
| | COCO | VG | SBU | CC3M | CC13M |
|---|---|---|---|---|---|
| image | 113K | 100K | 860K | 3M | 10M |
| text | 567K | 769K | 860K | 3M | 10M |
- PyTorch version >= 1.11.0
- Install other libraries via
pip install -r requirements.txt
Coming soon.
Download json files of downstream tasks
- VQA
- Download the VQA v2 dataset and the Visual Genome dataset from the original websites (VQA 2.0).
- Download and extract the provided dataset json files.
- In configs/vqa_mplug_base.yaml, set the paths for the json files and the image paths.
- Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
sh scripts/vqa_mplug_base.sh
- Evaluate the result using the official evaluation server.
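The VQA v2 evaluation server expects a single JSON file containing a list of `{"question_id", "answer"}` records. A minimal sketch of packaging predictions in that format (the file name and the sample question ids below are illustrative, not produced by the scripts in this repo):

```python
# Package VQA predictions for upload to the official evaluation server.
# Each entry pairs a question_id (int) with the predicted answer (str).
import json

predictions = [
    {"question_id": 262148000, "answer": "down"},
    {"question_id": 262148001, "answer": "2"},
]

with open("vqa_result.json", "w") as f:
    json.dump(predictions, f)
```

The server rejects files with missing or extra keys, so keep each record to exactly these two fields.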
- Image Captioning
- Download the COCO Caption dataset from the original website.
- Download and extract the provided dataset json files.
- Download the language evaluation tool (language_evaluation).
- In configs/caption_mplug_base.yaml, set the paths for the json files and the image paths.
- Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
sh scripts/caption_mplug_base.sh
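COCO-style caption evaluation tools consume a JSON list of `{"image_id", "caption"}` records. A minimal sketch of that result format (file name, ids, and captions are illustrative):

```python
# Write caption predictions in the COCO result format, one record per image.
import json

results = [
    {"image_id": 391895, "caption": "a man riding a bike on a dirt road"},
    {"image_id": 60623, "caption": "a group of people standing around a table"},
]

with open("caption_result.json", "w") as f:
    json.dump(results, f)
```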
- Image-Text Retrieval
- Download the MSCOCO or Flickr30k datasets from the original websites.
- Download and extract the provided dataset json files.
- In configs/retrieval_flickr30k_mplug_large.yaml or configs/retrieval_coco_mplug_large.yaml, set the paths for the json files and the image paths.
- Finetune the pre-trained checkpoint using 8 A100 GPUs:
sh scripts/retrieval_flickr30k_mplug_base.sh
sh scripts/retrieval_coco_mplug_base.sh
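Retrieval on COCO/Flickr30k is reported as Recall@K over a similarity matrix between images and texts. A minimal sketch of that metric, assuming the ground-truth text for image i sits at column i (the toy scores below are illustrative):

```python
# Image-to-text Recall@K: a query counts as a hit if its ground-truth
# text appears among the top-K highest-scoring columns of its row.
def recall_at_k(sim, k):
    hits = 0
    for i, row in enumerate(sim):
        topk = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += int(i in topk)
    return hits / len(sim)

sim = [
    [0.9, 0.2, 0.1],  # image 0: correct text ranked 1st
    [0.3, 0.7, 0.8],  # image 1: correct text ranked 2nd
    [0.2, 0.4, 0.7],  # image 2: correct text ranked 1st
]
print(recall_at_k(sim, 1))  # 2 of 3 queries hit at rank 1
```

Text-to-image recall is computed the same way on the transposed matrix.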
- Visual Grounding
- Download the RefCOCO datasets from the original websites.
- Download and extract the provided dataset json files.
- In configs/grounding_mplug_large.yaml, set the paths for the json files and the image paths. Data preparation can follow TransVG.
- Finetune the pre-trained checkpoint using 8 A100 GPUs:
sh scripts/grounding_mplug_base.sh
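Referring-expression grounding is typically scored as accuracy at an IoU threshold of 0.5: a predicted box counts as correct when its overlap with the ground-truth box exceeds 0.5. A minimal sketch, with illustrative boxes in (x1, y1, x2, y2) format:

```python
# Intersection-over-union between two axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (10, 10, 60, 60), (20, 20, 70, 70)
score = iou(pred, gt)
print(score, score > 0.5)  # this prediction falls just below the threshold
```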