This is the official repository for the MedClip research project from the MindKind research group.
The project investigates and compares different pretraining tasks for medical image feature extraction and captioning.
The experiments are structured as follows:
download datasets ---> prepare data ---> build models ---> run experiments
Each stage is run by its own Python script, with options that can be customized at every step.
To download the datasets, run the download_all.py script from the src directory. The script downloads and extracts the zip files for each dataset and sorts their contents into the data/raw directory.
```
python download_all.py
```
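The download-and-extract step can be sketched as below. This is an illustrative outline only, not the actual contents of download_all.py; the dataset URL and helper name are made-up placeholders:

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical dataset registry -- the real list lives in download_all.py.
DATASET_URLS = {
    "example_dataset": "https://example.com/example_dataset.zip",
}

RAW_DIR = Path("data/raw")


def download_and_extract(name: str, url: str, raw_dir=RAW_DIR) -> Path:
    """Download one dataset zip and unpack it into <raw_dir>/<name>."""
    target = Path(raw_dir) / name
    target.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    archive.extractall(target)
    return target
```

Each dataset ends up in its own subdirectory of data/raw, so later stages can locate files by dataset name.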
The downloaded data is used to produce clean dataframes as well as model training material. Each downloaded dataset generates a dataframe with the following structure:
File | Modality | Anatomy | Patient history | Findings | Impression | Diagnosis |
---|---|---|---|---|---|---|
path to file | imaging modality | imaged anatomy | clinical history of patient in natural language | findings in natural language | diagnostic impression | concise diagnosis |
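As an illustration, a dataframe following this schema can be built with pandas. The row values below are invented for the example and do not come from any of the project's datasets:

```python
import pandas as pd

# Column names follow the shared schema described above.
COLUMNS = [
    "File", "Modality", "Anatomy", "Patient history",
    "Findings", "Impression", "Diagnosis",
]

# A single illustrative row -- all values are made up for the example.
rows = [{
    "File": "data/raw/example/img_0001.png",
    "Modality": "X-ray",
    "Anatomy": "chest",
    "Patient history": "Non-smoker with persistent cough.",
    "Findings": "No focal consolidation.",
    "Impression": "No acute cardiopulmonary abnormality.",
    "Diagnosis": "normal",
}]

df = pd.DataFrame(rows, columns=COLUMNS)
```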
Some datasets may have extra or missing columns, but the column names are consistent across all generated dataframes. To prepare the data, run the prepare_data.py script from the src directory:
```
python prepare_data.py
```
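Because individual datasets can carry extra or missing columns, downstream code that wants a uniform view can align each dataframe to the shared schema. A minimal sketch, assuming pandas and the column names from the table above (the helper name is hypothetical, not part of prepare_data.py):

```python
import pandas as pd

# Shared schema from the table above.
SCHEMA = [
    "File", "Modality", "Anatomy", "Patient history",
    "Findings", "Impression", "Diagnosis",
]


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Align a per-dataset dataframe to the shared schema:
    extra columns are dropped, missing ones are filled with NA."""
    return df.reindex(columns=SCHEMA)
```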
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.