Repository for the paper Using diffusion models to generate synthetic labeled data for medical image segmentation.
Medical image analysis has become a prominent area where machine learning has been applied. However, high-quality, publicly available data are limited either due to patient privacy laws or the time and cost required for experts to annotate images. In this retrospective study, we designed and evaluated a pipeline to generate synthetic labeled polyp images for augmenting medical image segmentation models with the aim of reducing this data scarcity.
We trained diffusion models on the HyperKvasir dataset, comprising 1000 images of polyps in the human GI tract from 2008 to 2016. Qualitative expert review, Fréchet Inception Distance (FID), and Multi-Scale Structural Similarity (MS-SSIM) were tested for evaluation. Additionally, various segmentation models were trained with the generated data and evaluated using Dice score (DS) and Intersection over Union (IoU).
Our pipeline produced images more akin to real polyp images based on FID scores. Segmentation model performance also showed improvements over GAN methods when trained entirely, or partially, with synthetic data, despite requiring less compute for training. Moreover, the improvement persists when tested on different datasets, showcasing the transferability of the generated images.
The proposed pipeline produced realistic image and mask pairs which could reduce the need for manual data annotation when performing a machine learning task. We support this use case by showing that the methods proposed in this study enhanced segmentation model performance, as measured by Dice and IoU scores, when trained fully or partially on synthetic data.
Pytorch 1.12.1, Python 3.9
$ conda create --name diffgen python=3.9
$ conda activate diffgen
$ pip install ftfy regex matplotlib lpips kornia opencv-python color-matcher blobfile scikit-learn pytorch_msssim
$ pip install torch==1.12.1 torchvision==0.13.1
$ pip install git+https://github.com/openai/CLIP.git
$ (conda or pip) install mpi4py
To generate images, please download the pre-trained diffusion model(s)
Full Model LINK ~2 GB
Cluster (20) Models LINK ~40GB
download the model into ./checkpoints
folder.
The .pkl file containing the cluster assignments may be found in the clustering
folder. Note that cluster assignments for 10, 20, 30, and 40 are provided, as well as a simple notebook to generate the clusters.
Unfortunately, we aren't able to provide pre-trained models for
# Cluster Models
python ~/diffusion-gen/guided_diffusion/BOI_train.py \
--data_dir ~/diffusion-gen/guided_diffusion/segmented-images \
--cluster_file ~/diffusion-gen/clusters.pkl \
--cluster_index 1 \
--image_size 256 \
--out_dir ~/diffusion-gen/checkpoints \
--batch_size 1
# Full Model
python ~/diffusion-gen/guided_diffusion/image_train.py \
--data_dir ~/diffusion-gen/guided_diffusion/segmented-images \
--image_size 256 \
--out_dir ~/diffusion-gen/checkpoints \
--batch_size 1
--cluster_index
may be changed to any cluster index from the cluster file
python ~/diffusion-gen/main.py \
--cluster_path ~/diffusion-gen/clusters.pkl \
--output_path ~/diffusion-gen/cluster_image_samples \
--model_path ~/diffusion-gen/checkpoints/cluster_models/model110000.pt \
--diff_iter 100 \
--timestep_respacing 200 \
--skip_timesteps 80 \
--model_output_size 256 \
--num_samples 1 \
--batch_size 1 \
--use_noise_aug_all \
--use_colormatch \
-fti -sty -inp -spi
The cluster path, output path, and model path can be changed to your own path.
Note the format of the model path: each model will be of the form cluster_{cluster index}_model_{model iteration}.pt
. For example, cluster_0_model_110000.pt
is the model for cluster 0 at iteration 110000. In the main.py
file, we loop through all the cluster models with the given iteration number and generate samples for each cluster.
-fti
selects a random image from the cluster to use as the target image for image-guided generation
-sty
applies styling
-inp
uses the inpainting technique
-spi
instead of just sampling {num_samples} images, we sample {num_samples} images from each image
python ~/diffusion-gen/main.py \
--data_dir ~/diffusion-gen/guided_diffusion/segmented-images/masked-images \
--output_path ~/diffusion-gen/image_samples \
--model_path ~/diffusion-gen/logs/256models/model200000.pt \
--diff_iter 100 \
--timestep_respacing 200 \
--skip_timesteps 80 \
--model_output_size 256 \
--num_samples 1 \
--batch_size 1 \
--use_noise_aug_all \
--use_colormatch \
-fti -sty -inp -spi
Refer to above section for clarification on the arguments.
Our source code rely on Blended-diffusion, guided-diffusion, flexit, splicing vit, and DiffuseIT
If you find this work useful, please consider citing:
@article{saragihUsingDiffusionModels2024,
title = {Using Diffusion Models to Generate Synthetic Labeled Data for Medical Image Segmentation},
author = {Saragih, Daniel G. and Hibi, Atsuhiro and Tyrrell, Pascal N.},
year = {2024},
month = aug,
journal = {International Journal of Computer Assisted Radiology and Surgery},
volume = {19},
number = {8},
pages = {1615--1625},
issn = {1861-6429},
doi = {10.1007/s11548-024-03213-z},
urldate = {2024-08-19},
}