
LoTLIP: Improving Language-Image Pre-training for Long Text Understanding

Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, Zheng-Jun Zha

NeurIPS 2024

📬 News

  • [2025/01/14] Added Urban-1k for evaluation, following Long-CLIP.
  • [2024/11/26] Released the long captions of LAION and COYO on Hugging Face.
  • [2024/10/20] Released LoTLIP checkpoints and the evaluation code for LoTLIP.
  • [2024/10/13] Released the long text-image retrieval evaluation for CLIP.
  • [2024/09/26] 🎉 LoTLIP was accepted by NeurIPS 2024!

💡 Highlights

  • 🔥 Enhancing the long text understanding ability of language-image pre-training models.

  • 🔥 Models trained on short text-image datasets tend to neglect certain tokens in long texts.

  • 🔥 A new long text-image retrieval benchmark.

💻 How to Install

conda create -n lotlip python=3.9
conda activate lotlip

make install-training
make install-test
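
As a quick sanity check (a minimal sketch, assuming only that the make targets above install PyTorch and the open_clip-based codebase), you can verify the environment from Python:

# Environment sanity check: confirm that torch and open_clip import, and report versions.
import torch
import open_clip

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("open_clip pretrained tags listed:", len(open_clip.list_pretrained()))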

🗒 Long Text-Image Retrieval

Data Preparation

Prepare the datasets for long text-image retrieval following EVAL_DATASETS.md.

Pre-trained Weights Preparation

Please download the pre-trained weights of BERT, ViT-B-16-in21k, and ViT-B-32-in21k to cache-dir; a scripted download sketch follows the directory layout below.

$cache-dir/
|-- vit_base_patch16_224.augreg_in21k/
|-- vit_base_patch32_224.augreg_in21k/
|-- bert-base-uncased/
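
If you prefer to script the download, here is a minimal sketch that pulls the three checkpoints from the Hugging Face Hub into cache-dir. The repo ids (timm/vit_base_patch16_224.augreg_in21k, timm/vit_base_patch32_224.augreg_in21k, bert-base-uncased) are assumptions inferred from the folder names above, so adjust them if your sources differ.

# Hypothetical download helper (sketch): fetch the pre-trained weights into
# cache-dir with huggingface_hub; repo ids are inferred from the folder names.
import os
from huggingface_hub import snapshot_download

cache_dir = "./cache-dir"  # replace with your own cache-dir
repos = {
    "vit_base_patch16_224.augreg_in21k": "timm/vit_base_patch16_224.augreg_in21k",
    "vit_base_patch32_224.augreg_in21k": "timm/vit_base_patch32_224.augreg_in21k",
    "bert-base-uncased": "bert-base-uncased",
}
for folder, repo_id in repos.items():
    snapshot_download(repo_id=repo_id, local_dir=os.path.join(cache_dir, folder))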

How to Evaluate

  • Evaluate CLIP-ViT-B-16 from OpenAI (pre-trained on a 400M-scale dataset):
python -m training.main \
    --share4v-retrieval $path_to_SA-1B_dataset$ \
    --share4v-anno dataloaders/share4v/share4v_sam_10k.json \
    --share4v_val_num 1000,10000 \
    --dci-retrieval $path_to_dci_dataset$ \
    --iiw-retrieval $path_to_iiw_dataset$ \
    --model ViT-B-16 \
    --pretrained 'openai'
  • Evaluate LoTLIP-ViT-B-16 (pre-trained on a 100M-scale dataset); a minimal sketch of the retrieval metric these commands report follows the command below:

Download LoTLIP-ViT-B-16 to path_to_lotlip_checkpoints/

Note: If you built the environment before 2024/10/20, please run pip install transformers==4.39.3 before evaluation.

python -m training.main \
    --share4v-retrieval $path_to_SA-1B_dataset$ \
    --share4v-anno dataloaders/share4v/share4v_sam_10k.json \
    --share4v_val_num 1000,10000 \
    --dci-retrieval $path_to_dci_dataset$ \
    --iiw-retrieval $path_to_iiw_dataset$ \
    --cache-dir $cache-dir$ \
    --model lotlip_bert-ViT-B-16 \
    --pretrained $path_to_lotlip_checkpoints$/model.pt
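
For orientation, the following is a minimal sketch of the image-to-text / text-to-image retrieval metric (Recall@1) that these commands report, using a plain open_clip CLIP model on a few hypothetical image-caption pairs; it is a simplified stand-in, not the actual benchmark code in training.main.

# Simplified sketch of I2T / T2I retrieval Recall@1 with an open_clip model.
# The image paths and captions below are placeholders, not benchmark data.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]   # image i is described by caption i
captions = ["a long caption describing image 0 ...",
            "a long caption describing image 1 ...",
            "a long caption describing image 2 ..."]

with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths])
    img_feat = model.encode_image(images)
    txt_feat = model.encode_text(tokenizer(captions))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sim = img_feat @ txt_feat.t()   # rows: images, columns: texts

# Recall@1: the matching pair should be the top hit in each row (I2T) and column (T2I).
i2t_r1 = (sim.argmax(dim=1) == torch.arange(sim.size(0))).float().mean()
t2i_r1 = (sim.argmax(dim=0) == torch.arange(sim.size(0))).float().mean()
print(f"I2T R@1: {i2t_r1:.2%} | T2I R@1: {t2i_r1:.2%}")

Note that the standard CLIP tokenizer truncates captions at 77 tokens, so long captions are partially discarded in this sketch; the evaluation in training.main targets exactly this long-text setting.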

Evaluation Results

| Model | Pre-training Data Scale | DCI I2T | DCI T2I | IIW I2T | IIW T2I | SV-10k I2T | SV-10k T2I | Urban-1k I2T | Urban-1k T2I |
|---|---|---|---|---|---|---|---|---|---|
| CLIP-ViT-B-32 | 400M | 43.06 | 40.32 | 86.76 | 84.15 | 58.08 | 51.77 | 60.90 | 47.00 |
| LoTLIP-ViT-B-32 | 100M | 59.90 | 56.36 | 93.14 | 91.83 | 83.76 | 78.97 | 84.10 | 81.80 |
| Long-CLIP-ViT-B-16 | 400M | 51.68 | 57.28 | 89.61 | 93.20 | 79.24 | 77.06 | 78.90 | 79.50 |
| CLIP-ViT-B-16 | 400M | 45.45 | 43.01 | 88.24 | 87.58 | 60.22 | 56.16 | 67.10 | 52.90 |
| LoTLIP-ViT-B-16 | 100M | 64.11 | 62.63 | 94.28 | 92.65 | 88.40 | 82.72 | 88.80 | 84.80 |
  • [2024/11/13] We updated the evaluation to include Long-CLIP and report its performance on the long text-image retrieval tasks.

🔷 Bibtex

@inproceedings{LoTLIP,
  title={LoTLIP: Improving Language-Image Pre-training for Long Text Understanding},
  author={Wu, Wei and Zheng, Kecheng and Ma, Shuailei and Lu, Fan and Guo, Yuxin and Zhang, Yifei and Chen, Wei and Guo, Qingpei and Shen, Yujun and Zha, Zheng-Jun},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}

❤️ Acknowledgements

Our code is built on top of open_clip. Thanks for their nice work!

Thanks to DCI, IIW, and ShareGPT4V for their valuable datasets, which are used in our long text-image retrieval benchmark.

We also thank InstructBLIP, ShareGPT4V, and LLaVA for their pre-trained models and code.
