Welcome to our repository dedicated to evaluating DxGPT across various AI models with many-shot learning, broken down by different categories of rare diseases.
This project is, in a way, a continuation of a previous paper (https://www.medrxiv.org/content/10.1101/2024.05.08.24307062v1) that evaluated the baseline effectiveness of different LLMs in diagnosing rare diseases.
This project investigates two further main questions:
1. Which types of rare diseases do LLMs excel at diagnosing? Do LLMs perform better on diseases that affect certain biological systems?
2. Given the success that many-shot learning has had at improving LLM results on certain tasks, how much improvement, if any, does many-shot learning bring to LLM diagnosis?
This repository mainly contains the code created to carry out the tests, as well as the data gathered in the tests. The data is quite extensive, so we recommend becoming familiar with the naming conventions to navigate through it more easily. We may also add notebooks and plots as the analysis of the data continues.
The naming convention of the files in this repository is systematic and provides quick insights into the contents and purpose of each file. Understanding the naming structure will help you navigate and utilize the data effectively.
Each file name is composed of six main parts:
- Evaluation data prefix: All files related to model evaluation scores begin with `scores_`. This prefix is a clear indicator that the file contains data from the evaluation process. The `diagnoses_` prefix is used for files that contain the actual diagnoses from each test run, following the same naming convention as the scores files.
Dataset: The dataset name is included to provide context. Example datasets include:
RAMEDIS
is the RAMEDIS dataset from RareBenchPUMCH_ADM
is the PUMCH dataset from RareBenchMME
is the MME dataset from RareBenchHHS
is the HHS dataset from RareBenchaggregated
is all of the datasets from Rarebench aggregated into one
- Model identifier: Following the dataset name, the file name includes an identifier for the AI model used during the evaluation. Some of the possible model identifiers are:
  - `gpt4o`: data evaluated using the GPT-4o model
  - `llama3_70b`: data evaluated using the LLaMA 3 70B model
  - `c3opus`: data evaluated using the Claude 3 Opus model
- Shot: Identifies whether the test was run many-shot or no-shot.
- Categorized: Identifies whether categories were taken into account when generating examples or whether the examples were general (`cat`/`nocat`).
- Dataset examples included: Identifies whether or not examples were taken from the dataset being tested (`i`/`ni`). All of the tests so far have been run `ni`, but the option is there in case further testing is done.
Example: `diagnoses_PUMCH_ADM_gpt4omini_manyshot_cat_ni.csv` is the CSV file containing diagnoses for the PUMCH_ADM dataset produced by GPT-4o mini using many-shot examples, taking categories into account, and not including cases from the dataset itself as examples.
This structured approach to file naming ensures that each file is easily identifiable and that its contents are self-explanatory based on the name alone.
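As a convenience, here is a minimal sketch of how these file names could be split back into their components. The function and constant names are our own (not part of the repository's code), and the dataset list is taken from the section above; everything between the dataset and the shot field is treated as the model identifier, since both may contain underscores.

```python
import os

# Dataset names listed in this README (assumption: these are the only ones used).
KNOWN_DATASETS = ["RAMEDIS", "PUMCH_ADM", "MME", "HHS", "aggregated"]

def parse_results_filename(filename):
    """Split a scores_/diagnoses_ file name into its six naming-convention parts."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    prefix, rest = stem.split("_", 1)                      # scores / diagnoses
    dataset = next(d for d in KNOWN_DATASETS if rest.startswith(d + "_"))
    middle = rest[len(dataset) + 1:]
    # The last three fields are shot, categorized, and dataset-examples flags;
    # whatever remains in front of them is the model identifier.
    model, shot, categorized, included = middle.rsplit("_", 3)
    return {
        "prefix": prefix,              # scores / diagnoses
        "dataset": dataset,            # e.g. RAMEDIS, PUMCH_ADM, aggregated
        "model": model,                # e.g. gpt4o, llama3_70b, c3opus
        "shot": shot,                  # many-shot vs. no-shot run
        "categorized": categorized,    # cat / nocat
        "dataset_examples": included,  # i / ni
    }

print(parse_results_filename("diagnoses_PUMCH_ADM_gpt4omini_manyshot_cat_ni.csv"))
# {'prefix': 'diagnoses', 'dataset': 'PUMCH_ADM', 'model': 'gpt4omini',
#  'shot': 'manyshot', 'categorized': 'cat', 'dataset_examples': 'ni'}
```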
- Strict Accuracy (P1): Top suggestion matches the ground truth.
- Top-5 Accuracy (P1+P5): Ground truth appears within the top 5 suggestions.
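For illustration, a minimal sketch of how these two metrics can be computed from ranked suggestions. It assumes exact string matching between a suggestion and the ground truth, which the repository's actual scoring may relax (e.g. with model-assisted matching); the function names and example data below are illustrative only.

```python
def strict_accuracy(predictions, ground_truths):
    """P1: fraction of cases where the top suggestion matches the ground truth."""
    hits = sum(preds[0] == truth for preds, truth in zip(predictions, ground_truths))
    return hits / len(ground_truths)

def top5_accuracy(predictions, ground_truths):
    """P1+P5: fraction of cases where the ground truth is among the top 5 suggestions."""
    hits = sum(truth in preds[:5] for preds, truth in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Illustrative example with two cases, each with a ranked list of suggestions.
preds = [
    ["Fabry disease", "Gaucher disease", "Pompe disease", "MPS I", "Niemann-Pick disease"],
    ["Marfan syndrome", "Ehlers-Danlos syndrome", "Loeys-Dietz syndrome", "Homocystinuria", "Stickler syndrome"],
]
truths = ["Fabry disease", "Loeys-Dietz syndrome"]
print(strict_accuracy(preds, truths))  # 0.5 -> only the first case is a top-1 hit
print(top5_accuracy(preds, truths))    # 1.0 -> both ground truths appear in the top 5
```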