Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, Zheng-Jun Zha
NeurIPS 2024
- [2025/01/14] Add Urban1k for evaluation, following Long-CLIP.
- [2024/11/26] Release long captions of LAION and COYO on Hugging Face.
- [2024/10/20] Upload LoTLIP checkpoints and evaluation code.
- [2024/10/13] Upload long text-image retrieval evaluation for CLIP.
- [2024/09/26] 🎉 LoTLIP is accepted by NeurIPS 2024!
- 🔥 Enhancing the long text understanding ability of language-image pre-training models.
- 🔥 Models trained on short text-image datasets tend to neglect certain tokens in long texts.
- 🔥 A new long text-image retrieval benchmark.
conda create -n lotlip python=3.9
conda activate lotlip
make install-training
make install-test
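If the installation completes without errors, a quick sanity check such as the sketch below should run; it assumes the make targets install PyTorch and an open_clip-style package importable as open_clip, which may differ in this fork.

```python
# Hypothetical sanity check: confirm PyTorch and the open_clip-style package import.
# The `open_clip` import name is an assumption carried over from the upstream project.
import torch
import open_clip

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("known model configs:", open_clip.list_models()[:5])
```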
Prepare the datasets for long text-image retrieval following EVAL_DATASETS.md.
Please download the pre-trained weights of BERT, ViT-B-16-in21k, and ViT-B-32-in21k to cache-dir and arrange them as follows (a download sketch is given after the layout):
$cache-dir/
|–– vit_base_patch16_224.augreg_in21k/
|–– vit_base_patch32_224.augreg_in21k/
|–– bert-base-uncased/
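One way to fetch these weights is via huggingface_hub, as sketched below; the repo ids (the timm mirrors of the in21k ViTs and the standard bert-base-uncased repo) are assumptions, and any other download method that produces the same directory layout works just as well.

```python
# Minimal sketch: download the three pre-trained backbones into cache-dir.
# Repo ids are assumptions; adjust them if the weights are hosted elsewhere.
from huggingface_hub import snapshot_download

cache_dir = "/path/to/cache-dir"  # replace with your own cache-dir

for repo_id, subdir in [
    ("timm/vit_base_patch16_224.augreg_in21k", "vit_base_patch16_224.augreg_in21k"),
    ("timm/vit_base_patch32_224.augreg_in21k", "vit_base_patch32_224.augreg_in21k"),
    ("google-bert/bert-base-uncased", "bert-base-uncased"),
]:
    snapshot_download(repo_id=repo_id, local_dir=f"{cache_dir}/{subdir}")
```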
- Evaluate CLIP-ViT-B-16 from OpenAI (pre-trained on a 400M-scale dataset); a standalone scoring sketch follows the command:
python -m training.main \
--share4v-retrieval $path_to_SA-1B_dataset$ \
--share4v-anno dataloaders/share4v/share4v_sam_10k.json \
--share4v_val_num 1000,10000 \
--dci-retrieval $path_to_dci_dataset$ \
--iiw-retrieval $path_to_iiw_dataset$ \
--model ViT-B-16 \
--pretrained 'openai'
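For context, the standalone sketch below (not part of the evaluation script; the image path and caption are placeholders) shows how a single image-text retrieval score is computed with the OpenAI CLIP weights via open_clip. The stock tokenizer truncates every caption to CLIP's 77-token context, which is exactly the long-text limitation LoTLIP addresses.

```python
# Standalone sketch of CLIP image-text scoring; example.jpg and the caption are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
caption = "A paragraph-length description of the image ..."
text = tokenizer([caption])  # silently truncated to the 77-token context length

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    score = (img_feat @ txt_feat.T).item()  # cosine similarity used for retrieval ranking

print(f"image-text similarity: {score:.4f}")
```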
- Evaluate LoTLIP-ViT-B-16 (pre-trained on a 100M-scale dataset); a programmatic loading sketch follows the command:
Download the LoTLIP-ViT-B-16 checkpoint to path_to_lotlip_checkpoints/.
Note: If you built the environment before 2024/10/20, please run pip install transformers==4.39.3 before evaluation.
python -m training.main \
--share4v-retrieval $path_to_SA-1B_dataset$ \
--share4v-anno dataloaders/share4v/share4v_sam_10k.json \
--share4v_val_num 1000,10000 \
--dci-retrieval $path_to_dci_dataset$ \
--iiw-retrieval $path_to_iiw_dataset$ \
--cache-dir $cache-dir$ \
--model lotlip_bert-ViT-B-16 \
--pretrained $path_to_lotlip_checkpoints$/model.pt
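To load the LoTLIP checkpoint programmatically rather than through training.main, something like the sketch below may work; it assumes the fork registers lotlip_bert-ViT-B-16 with the upstream open_clip factory and accepts a checkpoint path via pretrained, mirroring the command-line flags above (this is an assumption, not a documented API of this repository).

```python
# Hypothetical programmatic loading, mirroring the CLI flags above.
# Whether the fork's factory accepts these exact arguments is an assumption.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "lotlip_bert-ViT-B-16",                            # --model
    pretrained="path_to_lotlip_checkpoints/model.pt",  # --pretrained
    cache_dir="/path/to/cache-dir",                    # --cache-dir (BERT/ViT backbones)
)
tokenizer = open_clip.get_tokenizer("lotlip_bert-ViT-B-16")
model.eval()
```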
| Model | Pre-training Data Scale | DCI I2T | DCI T2I | IIW I2T | IIW T2I | SV-10k I2T | SV-10k T2I | Urban-1k I2T | Urban-1k T2I |
|---|---|---|---|---|---|---|---|---|---|
| CLIP-ViT-B-32 | 400M | 43.06 | 40.32 | 86.76 | 84.15 | 58.08 | 51.77 | 60.90 | 47.00 |
| LoTLIP-ViT-B-32 | 100M | 59.90 | 56.36 | 93.14 | 91.83 | 83.76 | 78.97 | 84.10 | 81.80 |
| CLIP-ViT-B-16 | 400M | 45.45 | 43.01 | 88.24 | 87.58 | 60.22 | 56.16 | 67.10 | 52.90 |
| Long-CLIP-ViT-B-16 | 400M | 51.68 | 57.28 | 89.61 | 93.20 | 79.24 | 77.06 | 78.90 | 79.50 |
| LoTLIP-ViT-B-16 | 100M | 64.11 | 62.63 | 94.28 | 92.65 | 88.40 | 82.72 | 88.80 | 84.80 |
- [2024/11/13] We updated the evaluation of Long-CLIP and added its results on the long text-image retrieval tasks.
@inproceedings{LoTLIP,
  title={LoTLIP: Improving Language-Image Pre-training for Long Text Understanding},
  author={Wu, Wei and Zheng, Kecheng and Ma, Shuailei and Lu, Fan and Guo, Yuxin and Zhang, Yifei and Chen, Wei and Guo, Qingpei and Shen, Yujun and Zha, Zheng-Jun},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024}
}
Our code is built on top of open_clip. Thanks for their nice work!
Thanks to DCI, IIW, and ShareGPT4V for their valuable datasets, which are used in our long text-image retrieval benchmark.
We also thank InstructBLIP, ShareGPT4V, and LLaVA for their pre-trained models and code.