CTO

Code for CTO: A Large Clinical Trial Outcome

Paper Link: Automatically Labeling $200B Life-Saving Datasets: A Large Clinical Trial Outcome Benchmark

Please see Tutorials if you want to simply get started as soon as possible.

Please see the following modules for implementations of each of the following sources of weakly supervised labelling functions.

llm_prediction_on_pubmed

PubMed abstracts have been automatically linked to clinical trials by initiatives such as the Clinical Trials Transformation Initiative (CTTI). We focuse only on the Derived and Results categories to concentrate on clinical trial outcomes. For trials with multiple abstracts, we selected the top two based on their title similarity to the trial's official title to ensure relevance. Using these abstracts, we prompted the gpt-3.5 model to summarize important trial-related statistical tests and predict the outcomes. Additionally, we had the model generate QA pairs about the trials, which are included as a supplement to our benchmark. The prompts used for these tasks are provided in the paper.

clinical_trial_linkage

The journey from drug discovery to FDA approval involves several stages, starting with Phase 1 trials for safety and dosage, followed by Phase 2 and 3 trials to evaluate efficacy and compare with existing therapies. After Phase 3, a drug may be submitted for FDA approval. This linkage is difficult due to unstructured data, inconsistent reporting, and discrepancies in intervention details. Despite these challenges, linking trials is invaluable for clinical trial outcome predictions. We present a novel trial-linking algorithm, the first systematic attempt to connect different phases of clinical trials, involving two steps: linking trials across phases and matching Phase 3 trials with FDA approvals.

news_headlines

We conducted web scraping, sending requests to Google News for headlines related to the top 1000 industry sponsors, covering approximately 80% of industry-sponsored trials (27,720 trials). Due to rate limits, we queried at a rate of 1 query every 3-5 seconds, collecting up to 100 articles per month from the sponsor's earliest trial to the present, totaling 1,115,017 news articles. We used FinBERT to classify the financial sentiment of these headlines as 'Positive' or 'Negative,' discarding 'Neutral' sentiments. For news/trial matching, we filtered and reranked headlines using a top-K retriever and cross-encoder, encoding trials and headlines with PubMedBERT and calculating relevancy scores.

stock_price

The stock price of pharmaceutical and biotech companies often reflects market expectations about clinical trial outcomes. Positive expectations can lead to a stock rise, while low expectations or past failures may result in poor performance. We used Yahoo Finance to gather historical stock data for companies with publicly available tickers for completed trials. By calculating a 5-day simple moving average (SMA) of stock prices, we smoothed out short-term fluctuations to highlight underlying trends. The slope of the SMA over a 7-day window from the trial's completion date indicates the trend's direction and steepness, capturing the immediate impact of trial completions.

labeling

For Phases 1, 2, and 3, we determine specific quantile thresholds from a range of values for all tunable labeling functions (LFs), fine-tuning each on the TOP training dataset. Our final labeling process employs both an unsupervised data programming approach and a supervised Random Forest model to estimate labels and align weakly supervised signals with the human-annotated TOP training data. In data programming, we incorporate TOP training labels with other weakly supervised LFs. For the supervised approach, we train a Random Forest model using other weakly supervised LF outputs.

Additionally, lfs.py contains LFs computed from clinical trial metrics, such as status, number of adverse events, number of admenments, etc.

Reference

@article{gao2024automatically,
  title={Automatically Labeling $200 B Life-Saving Datasets: A Large Clinical Trial Outcome Benchmark},
  author={Gao, Chufan and Pradeepkumar, Jathurshan and Das, Trisha and Thati, Shivashankar and Sun, Jimeng},
  journal={arXiv preprint arXiv:2406.10292},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
.github/workflows		.github/workflows
baselines		baselines
clinical_trial_linkage		clinical_trial_linkage
labeling		labeling
llm_prediction_on_pubmed		llm_prediction_on_pubmed
news_headlines		news_headlines
stock_price		stock_price
tutorials		tutorials
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CTO.png		CTO.png
LICENSE		LICENSE
README.md		README.md
arrange_labels.py		arrange_labels.py
download_ctti.py		download_ctti.py
index.html		index.html
pipeline.sh		pipeline.sh
requirements.txt		requirements.txt
update_labels.sh		update_labels.sh
update_news.sh		update_news.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CTO

llm_prediction_on_pubmed

clinical_trial_linkage

news_headlines

stock_price

labeling

Reference

Authors

About

Releases

Packages

Contributors 3

Languages

License

chufangao/CTOD

Folders and files

Latest commit

History

Repository files navigation

CTO

Reference

Authors

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Languages