oncotrialLLM: Large Language Models applied for the Extraction of Genomic Biomarkers from cancer clinical trials

Abstract

Clinical trials are an essential component of drug development for new cancer treatments, yet the information required to determine a patient's eligibility for enrollment is scattered across large amounts of unstructured text. Genomic biomarkers play an important role in precision medicine, particularly in targeted therapies, underscoring the need to consider them for patient-to-trial matching. Large language models (LLMs) can accurately handle information extraction from clinical trials to assist physicians and patients in identifying potential matches. Here, we investigate different LLM strategies to extract genomic biomarkers from oncology trials to boost the likelihood of enrollment for a potential patient. Our findings suggest that out-of-the-box open-source language models can capture complex logical expressions and structure the genomic biomarkers in disjunctive normal form, outperforming closed-source models such as GPT-4 and GPT-3.5-Turbo. Additionally, fine-tuning open-source models with sufficient data can further enhance their performance.
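To illustrate what structuring biomarkers in disjunctive normal form (DNF) can look like, here is a minimal sketch. It assumes a nested-list representation in which the outer list is an OR over clauses and each inner list is an AND over biomarker strings. The representation, the `matches` helper, and the example biomarkers are illustrative assumptions, not the repository's actual output schema.

```python
from typing import List, Set

# Assumed (hypothetical) representation:
# outer list = OR over clauses, inner list = AND over biomarkers.
DNF = List[List[str]]


def matches(patient_biomarkers: Set[str], criteria: DNF) -> bool:
    """Return True if the patient satisfies at least one AND-clause."""
    return any(
        all(biomarker in patient_biomarkers for biomarker in clause)
        for clause in criteria
    )


# "(BRAF V600E AND MSI-High) OR KRAS G12C" written in DNF.
criteria: DNF = [["BRAF V600E", "MSI-High"], ["KRAS G12C"]]

print(matches({"KRAS G12C"}, criteria))   # second clause is satisfied
print(matches({"BRAF V600E"}, criteria))  # first clause is only partly satisfied
```

Because every clause is a plain conjunction, checking a patient against a trial reduces to a subset test per clause, which is one reason DNF is convenient for patient-to-trial matching.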

Datasets and Evaluation

Raw Clinical Trials
  1. The clinical trials randomly selected for this project from previously filtered clinicaltrials.gov data can be found here.

  2. The manually annotated clinical trial samples can be found here.

  3. The reviewed synthetic clinical trial samples can be found here.

Datasets used for DPO Fine-tuning
  1. The training data used for fine-tuning Hermes-FT can be found here.

  2. The training data used for fine-tuning Hermes-synth-FT can be found in results.
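For readers unfamiliar with DPO training data, the sketch below shows a common layout for a single preference record: a prompt paired with a preferred ("chosen") and a dispreferred ("rejected") completion, serialized as one JSONL line. The `prompt`/`chosen`/`rejected` keys follow the convention used by libraries such as Hugging Face TRL. The prompt text, the `inclusion_biomarker` field, and the `make_preference_record` helper are hypothetical and may not match the files in this repository.

```python
import json


def make_preference_record(prompt: str, chosen: str, rejected: str) -> dict:
    """Bundle one DPO preference pair (hypothetical helper)."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


record = make_preference_record(
    prompt=(
        "Extract the genomic biomarkers from: "
        "'Patients must have a BRAF V600E mutation.'"
    ),
    # Preferred output: the full variant is captured (illustrative schema).
    chosen=json.dumps({"inclusion_biomarker": [["BRAF V600E"]]}),
    # Dispreferred output: the variant is lost (illustrative negative).
    rejected=json.dumps({"inclusion_biomarker": [["BRAF"]]}),
)

# One record per line is the usual JSONL convention.
line = json.dumps(record)
print(line)
```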

Datasets used for Evaluation
  1. The test dataset used for evaluation can be found here.

  2. The evaluation metrics per strategy can be found in results.
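The repository includes a script for plotting F2 scores (scripts/plot_f2_scores.py). As a reminder of what that metric computes, here is a minimal sketch of the standard F-beta formula with beta = 2, which weights recall more heavily than precision. It is not the code from utils/evaluation.py, and how true positives, false positives, and false negatives are counted for extracted biomarkers is left to the caller as an assumption.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Compute the F-beta score from raw counts (beta=2 gives F2)."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only, not results from this project.
score = f_beta(tp=8, fp=2, fn=4)
print(round(score, 4))
```

With beta = 2, a missed biomarker (false negative) lowers the score more than a spurious one (false positive), which suits settings where missing a potentially eligible patient is the costlier error.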

Repository Structure

./llm/*
├── assets/              # Asset files such as images or other resources
├── conf/                # Configuration file containing environment variables and settings
├── data/                # Datasets
│   ├── interim/         # Intermediate data
│   ├── processed/       # Processed data ready for analysis
│   ├── raw/             # Raw, unprocessed data
│   └── simulated/       # GPT-4 simulated data ready for analysis
├── results/             # Results of analyses and experiments
├── figures/             # Figures generated from data analysis
├── modules/             # Python modules for handling specific tasks
│   ├── biomarker_handler.py     # Module for biomarker data handling
│   ├── chromadb_handler.py      # Module for ChromaDB handling
│   └── gpt_handler.py           # Module for GPT-based operations
├── prompts/             # Prompt files used with OpenAI models
├── scripts/             # Python scripts for various analyses and model operations
│   ├── aacr_analysis.py              # Script for AACR analysis
│   ├── dpo_train.py                 # Script for training with Direct Preference Optimization (DPO)
│   ├── evaluate_gpt_chain_of_prompts.py   # Script for evaluating GPT models with chain of prompts
│   ├── evaluate_gpt_fewshots.py          # Script for evaluating GPT models with few-shot learning
│   ├── evaluate_hermes_models.py         # Script for evaluating Hermes models
│   ├── generate_jsonL.py                 # Script for generating JSONL data from JSON
│   ├── generate_negatives.py             # Script for preparing the training data for Fine-tuning with DPO
│   ├── plot_cancer_patient_distribution.py  # Script for plotting cancer patient distribution
│   ├── plot_f2_scores.py                # Script for plotting F2 scores
│   ├── plot_token_distribution.py       # Script for plotting token distribution
│   ├── process_civic.py                 # Script for processing CIViC data
│   ├── random_trial_selection.py        # Script for random trial selection
│   └── simulate_trials_gpt4.py          # Script for simulating trials using GPT-4
├── utils/               # Utility scripts used across the project
│   ├── evaluation.py               # Utility functions for model evaluation
│   ├── jsons.py                    # Utility functions for handling JSON files
│   └── __init__.py                 # Initialization file for utils module
├── venv-llm/            # Virtual environment for LLMs
├── .gitignore           # Git ignore file
├── Makefile             # Makefile for automating tasks
├── pyproject.toml       # Poetry project configuration
└── README.md            # Project overview and instructions

Getting Started

Setup Environment

Start by cloning the repository:

git clone https://github.com/BIMSBbioinfo/oncotrialLLM.git
cd oncotrialLLM

Once you have cloned the repository and navigated to its root directory, run the following commands to create and activate the environment:

make install-env
source venv-llm/bin/activate

Reproducibility

To ensure that you can reproduce the results we obtained, please follow the detailed instructions provided in the Reproducibility Guide. This guide will walk you through setting up the configuration, preparing the data, running the necessary scripts, and verifying the outputs.

Citation

If you use this code/repository in your research, please cite the following paper:

# paper