Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao
In response to the many requests for the BiomedCLIP fine-tuning code, we have updated the repo to include it. Follow the steps here.
Abstract: Segmentation of anatomical structures and pathological regions in medical images is essential for modern clinical diagnosis, disease research, and treatment planning. While significant advancements have been made in deep learning-based segmentation techniques, many of these methods still suffer from limitations in data efficiency, generalizability, and interactivity. As a result, developing precise segmentation methods that require fewer labeled datasets remains a critical challenge in medical image analysis. Recently, the introduction of foundation models like CLIP and Segment-Anything-Model (SAM), with robust cross-domain representations, has paved the way for interactive and universal image segmentation. However, further exploration of these models for data-efficient segmentation in medical imaging is still needed and highly relevant. In this paper, we introduce MedCLIP-SAMv2, a novel framework that integrates the CLIP and SAM models to perform segmentation on clinical scans using text prompts, in both zero-shot and weakly supervised settings. Our approach includes fine-tuning the BiomedCLIP model with a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss, and leveraging the Multi-modal Information Bottleneck (M2IB) to create visual prompts for generating segmentation masks from SAM in the zero-shot setting. We also investigate using zero-shot segmentation labels within a weakly supervised paradigm to enhance segmentation quality further. Extensive testing across four diverse segmentation tasks and medical imaging modalities (breast tumor ultrasound, brain tumor MRI, lung X-ray, and lung CT) demonstrates the high accuracy of our proposed framework.
Public datasets used in our study:
- Radiology Objects in COntext (ROCO)
- MedPix
- Breast UltraSound Images (BUSI)
- UDIAT
- COVID-QU-Ex
- Brain Tumors
- Lung CT
You can download the segmentation datasets here.
In the main working directory, create a directory for the data you want to work with, structured as follows:
data
├── breast_tumors
│ ├── train_images
│ ├── train_masks
│ ├── val_images
│ ├── val_masks
│ ├── test_images
│ └── test_masks
│
├── brain_tumors
│ ├── train_images
│ ├── train_masks
│ ├── val_images
│ ├── val_masks
│ ├── test_images
│ └── test_masks
│
└── ...
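The layout above can also be created or checked from Python. The following is a minimal sketch, not part of the repository: the helper names `expected_dirs`, `make_layout`, and `missing_dirs` are hypothetical, and the split and folder names are simply read off the tree shown above.

```python
from pathlib import Path

# Split and folder-kind names taken from the directory tree above.
SPLITS = ("train", "val", "test")
KINDS = ("images", "masks")


def expected_dirs(root, datasets):
    """Return every sub-directory the layout expects under `root`."""
    root = Path(root)
    return [
        root / dataset / f"{split}_{kind}"
        for dataset in datasets
        for split in SPLITS
        for kind in KINDS
    ]


def make_layout(root, datasets):
    """Create the empty directory skeleton (hypothetical helper)."""
    for directory in expected_dirs(root, datasets):
        directory.mkdir(parents=True, exist_ok=True)


def missing_dirs(root, datasets):
    """List expected directories that do not exist yet (hypothetical helper)."""
    return [d for d in expected_dirs(root, datasets) if not d.is_dir()]


if __name__ == "__main__":
    # Read-only check of an existing data folder; nothing is created here.
    for directory in missing_dirs("data", ["breast_tumors", "brain_tumors"]):
        print("missing:", directory)
```

Calling `make_layout("data", ["breast_tumors", "brain_tumors"])` builds the skeleton, after which the images and masks can be copied into place.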
Install Anaconda following the Anaconda installation documentation. Create an environment with all required packages using the following commands:
conda env create -f medclipsamv2_env.yml
conda activate medclipsamv2
Then set up the segment-anything library:
cd segment-anything
pip install -e .
cd ..
Finally, set up the nnUNet framework:
cd weak_segmentation
pip install -e .
cd ..
Three versions of the SAM model are available, each with a different backbone size. Each can be instantiated through the segment-anything library once its checkpoint has been downloaded.
Click the links below to download the checkpoint for the corresponding model type and place it in segment-anything/sam_checkpoints (for example, segment-anything/sam_checkpoints/sam_vit_h_4b8939.pth for ViT-H):
- default or vit_h: ViT-H SAM model.
- vit_l: ViT-L SAM model.
- vit_b: ViT-B SAM model.
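As an illustration of how a model type maps to a checkpoint file, here is a minimal sketch. It is not code from this repository: `checkpoint_path` and `load_sam` are hypothetical helpers, the ViT-L and ViT-B file names are those published by the upstream segment-anything project, and placing them in the same `sam_checkpoints` folder as the ViT-H file is an assumption. `sam_model_registry` is the upstream library's registry and is imported lazily, so the path helper works without the library installed.

```python
from pathlib import Path

# Checkpoint file names as published by the upstream segment-anything project.
SAM_CHECKPOINTS = {
    "vit_h": "sam_vit_h_4b8939.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_b": "sam_vit_b_01ec64.pth",
}
# "default" is an alias for the ViT-H model.
SAM_CHECKPOINTS["default"] = SAM_CHECKPOINTS["vit_h"]


def checkpoint_path(model_type="default", root="segment-anything/sam_checkpoints"):
    """Return the expected checkpoint location for a SAM model type.

    Assumption: all checkpoints live in the same folder as the ViT-H one.
    """
    if model_type not in SAM_CHECKPOINTS:
        raise ValueError(f"unknown SAM model type: {model_type!r}")
    return Path(root) / SAM_CHECKPOINTS[model_type]


def load_sam(model_type="default"):
    """Build a SAM model; needs segment-anything and a downloaded checkpoint."""
    from segment_anything import sam_model_registry  # imported lazily

    path = checkpoint_path(model_type)
    return sam_model_registry[model_type](checkpoint=str(path))
```

For example, `checkpoint_path("vit_b")` points at `segment-anything/sam_checkpoints/sam_vit_b_01ec64.pth`, and `load_sam("vit_b")` would build the corresponding model if that file exists.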
You can fine-tune the pre-trained BiomedCLIP model using our DHN-NCE loss.
Place your image-text dataset in biomedclip_finetuning/open_clip/src/data (please refer to the MedPix dataset to see how your custom dataset should be structured).
You can then start fine-tuning BiomedCLIP like this:
bash biomedclip_finetuning/scripts/biomedclip.sh
If your model is saved in the .pt format, you can convert it to .bin by moving the saved model checkpoint to saliency_maps/model and then calling:
python saliency_maps/model/convert.py
Our fine-tuned model can be downloaded here. Place it at saliency_maps/model/pytorch_model.bin
You can run the whole zero-shot framework with the following:
bash zeroshot.sh <path/to/dataset>
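To run the zero-shot pipeline over several datasets in one go, the command above can be wrapped in a small loop. This is a sketch only: `zeroshot_commands` and `run_all` are hypothetical helpers, and it assumes that each sub-folder of `data` (for example `data/breast_tumors`) is a valid `<path/to/dataset>` argument and that the script is launched from the repository root.

```python
import shlex
import subprocess
from pathlib import Path


def zeroshot_commands(data_root="data"):
    """Build one `bash zeroshot.sh <dataset>` command per dataset folder."""
    root = Path(data_root)
    if not root.is_dir():
        return []
    datasets = sorted(p for p in root.iterdir() if p.is_dir())
    return [["bash", "zeroshot.sh", p.as_posix()] for p in datasets]


def run_all(data_root="data", dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for cmd in zeroshot_commands(data_root):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first dataset whose run fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all("data", dry_run=True)
```

With `dry_run=True` the commands are only printed, which makes it easy to confirm the dataset paths before starting a long run.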
You can change the settings by specifying which CLIP model you want to use, the post-processing algorithm, the SAM model and the type of visual prompts to use (boxes/points/both).
The text prompts we used can be found here.
Some scripts to reproduce the results can be found in zeroshot_scripts.
Go to weak_segmentation:
cd weak_segmentation
Please follow this guideline to prepare your datasets. Place all your prepared datasets in data.
nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity
nnUNetv2_train DATASET_ID 2d all --npz --num_epochs EPOCHS --num_of_cycles CYCLES
nnUNetv2_predict_from_folder --dataset DATASET_ID --fold all --input_folder INPUT_PATH --output_folder OUTPUT_PATH --rule RULE
nnUNetv2_run_uncertainty_on_fold --proba_dir PATH --raw_path PATH --labels PATH --score_type TYPE --output_pred_path PATH
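The four commands above share several placeholders, so it can help to fill them in from one place. The sketch below only assembles the command lines exactly as written above; `weak_segmentation_commands` is a hypothetical helper, and the accepted values for `RULE` and `TYPE` are not listed here, so they are passed through unchanged.

```python
import shlex


def weak_segmentation_commands(dataset_id, epochs, cycles, input_path,
                               output_path, rule, proba_dir, raw_path,
                               labels_path, score_type, pred_path):
    """Return the four commands above with their placeholders filled in."""
    d = str(dataset_id)
    return [
        # Plan and preprocess the dataset.
        ["nnUNetv2_plan_and_preprocess", "-d", d, "--verify_dataset_integrity"],
        # Train the 2d configuration on all folds.
        ["nnUNetv2_train", d, "2d", "all", "--npz",
         "--num_epochs", str(epochs), "--num_of_cycles", str(cycles)],
        # Predict from a folder of images.
        ["nnUNetv2_predict_from_folder", "--dataset", d, "--fold", "all",
         "--input_folder", input_path, "--output_folder", output_path,
         "--rule", rule],
        # Run the uncertainty step on the predicted probabilities.
        ["nnUNetv2_run_uncertainty_on_fold", "--proba_dir", proba_dir,
         "--raw_path", raw_path, "--labels", labels_path,
         "--score_type", score_type, "--output_pred_path", pred_path],
    ]


if __name__ == "__main__":
    # Placeholder values only; replace them with your own settings.
    commands = weak_segmentation_commands(
        "DATASET_ID", "EPOCHS", "CYCLES", "INPUT_PATH", "OUTPUT_PATH",
        "RULE", "PATH", "PATH", "PATH", "TYPE", "PATH",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute the steps in order.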
Special thanks to open_clip, M2IB, nnUNet, and segment-anything for making their valuable code publicly available.
If you use MedCLIP-SAM, please consider citing:
@article{koleilat2024medclipsamv2,
title={MedCLIP-SAMv2: Towards Universal Text-Driven Medical Image Segmentation},
author={Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
journal={arXiv preprint arXiv:2409.19483},
year={2024}
}
@inproceedings{koleilat2024medclip,
title={MedCLIP-SAM: Bridging text and image towards universal medical image segmentation},
author={Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
pages={643--653},
year={2024},
organization={Springer}
}