
LoTLIP: Improving Language-Image Pre-training for Long Text Understanding

Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, Zheng-Jun Zha

NeurIPS 2024

📬 News

  • [2025/01/14] Added Urban-1k for evaluation, following Long-CLIP.
  • [2024/11/26] Released the long captions of LAION and COYO on Hugging Face.
  • [2024/10/20] Released LoTLIP checkpoints and the evaluation code for LoTLIP.
  • [2024/10/13] Released the long text-image retrieval evaluation for CLIP.
  • [2024/09/26] 🎉 LoTLIP was accepted by NeurIPS 2024!

💡 Highlights

  • 🔥 Enhancing the long text understanding ability of language-image pre-training models.

  • 🔥 Models trained on short text-image datasets tend to neglect certain tokens in long texts.

  • 🔥 A new long text-image retrieval benchmark.

💻 How to Install

conda create -n lotlip python=3.9
conda activate lotlip

make install-training
make install-test
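
As a quick sanity check (a minimal sketch, assuming only that the make targets above install PyTorch and the open_clip-based codebase), you can verify the environment from Python:

# Environment sanity check: confirm that torch and open_clip import, and report versions.
import torch
import open_clip

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("open_clip pretrained tags listed:", len(open_clip.list_pretrained()))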

🗒 Long Text-Image Retrieval

Data Preparation

Prepare the datasets for long text-image retrieval following EVAL_DATASETS.md.

Pre-trained Weights Preparation

Please download the pre-trained weights of BERT, ViT-B-16-in21k, and ViT-B-32-in21k to cache-dir; a scripted download sketch follows the directory layout below.

$cache-dir/
|-- vit_base_patch16_224.augreg_in21k/
|-- vit_base_patch32_224.augreg_in21k/
|-- bert-base-uncased/
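
If you prefer to script the download, here is a minimal sketch that pulls the three checkpoints from the Hugging Face Hub into cache-dir. The repo ids (timm/vit_base_patch16_224.augreg_in21k, timm/vit_base_patch32_224.augreg_in21k, bert-base-uncased) are assumptions inferred from the folder names above, so adjust them if your sources differ.

# Hypothetical download helper (sketch): fetch the pre-trained weights into
# cache-dir with huggingface_hub; repo ids are inferred from the folder names.
import os
from huggingface_hub import snapshot_download

cache_dir = "./cache-dir"  # replace with your own cache-dir
repos = {
    "vit_base_patch16_224.augreg_in21k": "timm/vit_base_patch16_224.augreg_in21k",
    "vit_base_patch32_224.augreg_in21k": "timm/vit_base_patch32_224.augreg_in21k",
    "bert-base-uncased": "bert-base-uncased",
}
for folder, repo_id in repos.items():
    snapshot_download(repo_id=repo_id, local_dir=os.path.join(cache_dir, folder))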

How to Evaluate

  • Evaluate CLIP-ViT-B-16 from OpenAI (pre-trained on a 400M-scale dataset):
python -m training.main \
    --share4v-retrieval $path_to_SA-1B_dataset$ \
    --share4v-anno dataloaders/share4v/share4v_sam_10k.json \
    --share4v_val_num 1000,10000 \
    --dci-retrieval $path_to_dci_dataset$ \
    --iiw-retrieval $path_to_iiw_dataset$ \
    --model ViT-B-16 \
    --pretrained 'openai'
  • Evaluate LoTLIP-ViT-B-16 (pre-trained on a 100M-scale dataset); a minimal sketch of the retrieval metric these commands report follows the command below:

Download LoTLIP-ViT-B-16 to path_to_lotlip_checkpoints/

Note: If you built the environment before 2024/10/20, please run pip install transformers==4.39.3 before evaluation.

python -m training.main \
    --share4v-retrieval $path_to_SA-1B_dataset$ \
    --share4v-anno dataloaders/share4v/share4v_sam_10k.json \
    --share4v_val_num 1000,10000 \
    --dci-retrieval $path_to_dci_dataset$ \
    --iiw-retrieval $path_to_iiw_dataset$ \
    --cache-dir $cache-dir$ \
    --model lotlip_bert-ViT-B-16 \
    --pretrained $path_to_lotlip_checkpoints$/model.pt
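
For orientation, the following is a minimal sketch of the image-to-text / text-to-image retrieval metric (Recall@1) that these commands report, using a plain open_clip CLIP model on a few hypothetical image-caption pairs; it is a simplified stand-in, not the actual benchmark code in training.main.

# Simplified sketch of I2T / T2I retrieval Recall@1 with an open_clip model.
# The image paths and captions below are placeholders, not benchmark data.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]   # image i is described by caption i
captions = ["a long caption describing image 0 ...",
            "a long caption describing image 1 ...",
            "a long caption describing image 2 ..."]

with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths])
    img_feat = model.encode_image(images)
    txt_feat = model.encode_text(tokenizer(captions))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sim = img_feat @ txt_feat.t()   # rows: images, columns: texts

# Recall@1: the matching pair should be the top hit in each row (I2T) and column (T2I).
i2t_r1 = (sim.argmax(dim=1) == torch.arange(sim.size(0))).float().mean()
t2i_r1 = (sim.argmax(dim=0) == torch.arange(sim.size(0))).float().mean()
print(f"I2T R@1: {i2t_r1:.2%} | T2I R@1: {t2i_r1:.2%}")

Note that the standard CLIP tokenizer truncates captions at 77 tokens, so long captions are partially discarded in this sketch; the evaluation in training.main targets exactly this long-text setting.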

Evaluation Results

| Model | Pre-training Data Scale | DCI I2T | DCI T2I | IIW I2T | IIW T2I | SV-10k I2T | SV-10k T2I | Urban-1k I2T | Urban-1k T2I |
|---|---|---|---|---|---|---|---|---|---|
| CLIP-ViT-B-32 | 400M | 43.06 | 40.32 | 86.76 | 84.15 | 58.08 | 51.77 | 60.90 | 47.00 |
| LoTLIP-ViT-B-32 | 100M | 59.90 | 56.36 | 93.14 | 91.83 | 83.76 | 78.97 | 84.10 | 81.80 |
| Long-CLIP-ViT-B-16 | 400M | 51.68 | 57.28 | 89.61 | 93.20 | 79.24 | 77.06 | 78.90 | 79.50 |
| CLIP-ViT-B-16 | 400M | 45.45 | 43.01 | 88.24 | 87.58 | 60.22 | 56.16 | 67.10 | 52.90 |
| LoTLIP-ViT-B-16 | 100M | 64.11 | 62.63 | 94.28 | 92.65 | 88.40 | 82.72 | 88.80 | 84.80 |
  • [2024/11/13] We updated the evaluation to include Long-CLIP and report its performance on the long text-image retrieval tasks.

🔷 Bibtex

@inproceedings{LoTLIP,
  title={LoTLIP: Improving Language-Image Pre-training for Long Text Understanding},
  author={Wu, Wei and Zheng, Kecheng and Ma, Shuailei and Lu, Fan and Guo, Yuxin and Zhang, Yifei and Chen, Wei and Guo, Qingpei and Shen, Yujun and Zha, Zheng-Jun},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}

❤️ Acknowledgements

Our code is built on top of open_clip. Thanks for their nice work!

Thanks to DCI, IIW, and ShareGPT4V for their valuable datasets, which are used in our long text-image retrieval benchmark.

We also thank InstructBLIP, ShareGPT4V, and LLaVA for their pre-trained models and code.
