Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

ibrahimethemhamamci/CT-CHAT

CT-CHAT

Welcome to the official repository for CT-CHAT, a vision-language chat model designed specifically for 3D chest CT volumes. This repository provides an open-source codebase and pre-trained models built on CT-CLIP and a VQA (Visual Question Answering) dataset adapted from CT-RATE, so that researchers worldwide can use and extend them. The VQA dataset and model weights are available via the HuggingFace repository.

System Requirements

Before you get started, ensure that your environment meets the following requirements:

  • Python version: > 3.12.4
  • Necessary dependencies: Install CT-CLIP’s dependencies by following the instructions in the CT-CLIP repository.
  • Additional libraries: Ensure that the following libraries are installed:
    • PyTorch v2.4.0
    • CUDA v12.4
    • SciPy v1.14.0
    • Torchvision v0.19.0
    • Scikit-learn v1.2.2
    • Pandas v2.2.2
    • Transformers v4.44.0
    • NumPy v1.26.4
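
As a convenience, the pinned versions above can be checked programmatically. The following is a minimal sketch, not part of the CT-CHAT codebase: the `PINNED` table, `release_prefix`, and `check_environment` are illustrative names, and the package names are the usual PyPI distribution names for the libraries listed. CUDA is not a Python package, so it is not covered here.

```python
from importlib import metadata

# Versions copied from the requirements list, keyed by PyPI distribution name.
# CUDA is omitted because it is not installed as a Python package.
PINNED = {
    "torch": "2.4.0",
    "scipy": "1.14.0",
    "torchvision": "0.19.0",
    "scikit-learn": "1.2.2",
    "pandas": "2.2.2",
    "transformers": "4.44.0",
    "numpy": "1.26.4",
}


def release_prefix(version):
    """Drop local build tags, so that '2.4.0+cu124' compares as '2.4.0'."""
    return version.split("+", 1)[0]


def check_environment(pinned=PINNED, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatch.

    `found` is None when the package is not installed.
    """
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found is None or release_prefix(found) != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_environment().items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Running the script prints one line per mismatch and nothing when the environment matches the list exactly.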

Hardware Requirements

  • For training:
    • Small models: Minimum of 2 A100 GPUs with 80GB VRAM.
    • Large models (70B Llama 3.1): Minimum of 4 A100 GPUs.
  • For inference:
    • Small models: 1 A100 GPU.
    • Large models: At least 2 A100 GPUs.

Training

To train the model, follow the provided scripts. Before training, run the training data through the image encoder and save the resulting embeddings. Use the provided Encoder Script as a reference for encoding a single image. Note that this differs from the latent-saving process in CT-CLIP: here the encoder outputs must be saved before latent projection. Then update the training scripts with the correct path to the saved encodings and any other necessary configuration.
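
The pre-encoding step can be organised as a simple resumable batch loop. The sketch below covers only the bookkeeping: it pairs each input volume with an output file and skips volumes that already have a saved encoding. The directory layout, the `.nii.gz` and `.npz` suffixes, and the `encode_and_save` hook are assumptions made for illustration. The actual encoding call should follow the provided Encoder Script and save the encoder output before latent projection.

```python
from pathlib import Path


def plan_encoding_jobs(volume_dir, output_dir, in_suffix=".nii.gz", out_suffix=".npz"):
    """Pair every input volume with the file its encoding should be saved to.

    Volumes whose output file already exists are skipped, so an interrupted
    run can be resumed. The suffixes are assumptions, not CT-CHAT requirements.
    """
    volume_dir, output_dir = Path(volume_dir), Path(output_dir)
    jobs = []
    for volume in sorted(volume_dir.rglob(f"*{in_suffix}")):
        stem = volume.name[: -len(in_suffix)]
        target = output_dir / f"{stem}{out_suffix}"
        if not target.exists():
            jobs.append((volume, target))
    return jobs


def encode_all(volume_dir, output_dir, encode_and_save):
    """Call `encode_and_save(volume_path, target_path)` for each pending volume.

    `encode_and_save` is a hypothetical, caller-supplied hook that should wrap
    the logic of the Encoder Script. Returns the number of volumes processed.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    jobs = plan_encoding_jobs(volume_dir, output_dir)
    for volume, target in jobs:
        encode_and_save(volume, target)
    return len(jobs)
```

Keeping the encoder behind a callable means the same loop can be reused for the validation data needed for CLI inference.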

Inference

For inference, refer to the serve scripts. To perform CLI-based inference, first encode the validation data in the same way as the training data, then adjust the required paths in the CT-CHAT validation scripts. Once the latent embeddings have been computed, expected throughput is 5-10 tokens/s for Llama 70B on 4 A100 GPUs and 10-20 tokens/s for Llama 8B on 2 A100 GPUs.
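
To put the quoted throughput figures in perspective, the short sketch below converts them into approximate generation times. The 200-token answer length is an illustrative assumption, not a measured property of CT-CHAT outputs, and `generation_time_range` is a helper defined only for this example.

```python
def generation_time_range(num_tokens, slow_tps, fast_tps):
    """Return (best_case_seconds, worst_case_seconds) for generating num_tokens."""
    if slow_tps <= 0 or fast_tps <= 0:
        raise ValueError("throughput must be positive")
    return num_tokens / fast_tps, num_tokens / slow_tps


# Throughput ranges quoted in this README, in tokens per second.
LLAMA_70B_TPS = (5, 10)   # on 4 A100 GPUs
LLAMA_8B_TPS = (10, 20)   # on 2 A100 GPUs

answer_tokens = 200  # assumed answer length, for illustration only

print(generation_time_range(answer_tokens, *LLAMA_70B_TPS))  # (20.0, 40.0)
print(generation_time_range(answer_tokens, *LLAMA_8B_TPS))   # (10.0, 20.0)
```

Under this assumption, a 200-token answer would take roughly 20 to 40 seconds with the 70B model and 10 to 20 seconds with the 8B model, excluding the time needed to encode the volume.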

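For GUI-based inference, run the following commands. Each one starts a long-running process, so launch them in separate terminals (or in the background), in the order shown: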

python -m llava.serve.controller --host 0.0.0.0 --port 10000
python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path "path_to_model" --model-base "path_to_model"
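
If managing three terminals is inconvenient, the same commands can be started from a single Python script. This is a minimal sketch, assuming the default ports shown above. The module names and flags are copied from the commands above, while `build_commands`, `launch`, and the fixed `startup_delay` are illustrative choices and not part of the CT-CHAT codebase.

```python
import subprocess
import sys
import time


def build_commands(model_path, model_base, controller_port=10000, worker_port=40000):
    """Build the three serve commands as argument lists, in launch order."""
    controller_url = f"http://localhost:{controller_port}"
    worker_url = f"http://localhost:{worker_port}"
    py = [sys.executable, "-m"]
    return [
        py + ["llava.serve.controller", "--host", "0.0.0.0",
              "--port", str(controller_port)],
        py + ["llava.serve.gradio_web_server", "--controller", controller_url,
              "--model-list-mode", "reload"],
        py + ["llava.serve.model_worker", "--host", "0.0.0.0",
              "--controller", controller_url, "--port", str(worker_port),
              "--worker", worker_url,
              "--model-path", model_path, "--model-base", model_base],
    ]


def launch(model_path, model_base, startup_delay=10.0):
    """Start controller, web server, and worker; return their Popen handles.

    The fixed delay between launches is a crude assumption to give each
    process time to start; adjust it for your machine.
    """
    processes = []
    for command in build_commands(model_path, model_base):
        processes.append(subprocess.Popen(command))
        time.sleep(startup_delay)
    return processes
```

Calling `launch("path_to_model", "path_to_model")` mirrors the three commands above. The returned handles can be used to terminate the processes when the session ends.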

Pretrained Models

We offer pre-trained models for several LLMs, trained on the VQA dataset described in our paper. You can download them from the HuggingFace repository.

VQA Dataset

The VQA dataset has been derived from the CT-RATE data using the Llama 3.1 70B model with the scripts provided here. Short-answer questions have been sampled from the RadGenome Chest CT dataset. The dataset is available in the CT-RATE HuggingFace repository.

Citing Us

If you use CT-CHAT, CT-CLIP, or our CT-RATE dataset in your research, please cite our paper.

License

We are committed to fostering innovation and collaboration in the research community. To this end, all elements of CT-RATE, CT-CLIP, and CT-CHAT are released under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. This license allows our contributions to be freely used for non-commercial research purposes and encourages contributions and modifications, provided that the original work is properly cited and any derivative works are shared under similar terms.

Acknowledgements

We would like to express our sincere gratitude to the following works, whose contributions were invaluable to our research. Our VQA dataset includes a subset of data from RadGenome Chest CT. Our CT-CHAT model is a 3D adaptation of the LLaVA model for CT volumes, and it uses the CT-ViT architecture, introduced as part of GenerateCT, as its vision encoder. We are deeply appreciative of these researchers for their outstanding open-source contributions. If you use our VQA data or CT-CHAT model in your work, we kindly ask that you also cite the related works to acknowledge their impact.
