
Evaluation of large language models for discovery of gene set function

Description

Code associated with the paper "Evaluation of large language models for discovery of gene set function"

Dependencies

Set up an environment

conda create -n llm_eval python=3.11.5

Set up an environment variable to store the GPT-4 API key

conda activate llm_eval
conda env config vars set OPENAI_API_KEY="<your api key>" 
conda deactivate  # deactivate, then reactivate so the variable takes effect

conda activate llm_eval
echo $OPENAI_API_KEY  # confirm the key is set

Then load the key in Python:

import os
import openai

# read the API key from the environment rather than hard-coding it
openai.api_key = os.environ["OPENAI_API_KEY"]

See the OpenAI website for best practices on API key safety.
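
To confirm the key works end to end, a minimal test request can be made. This is only a sketch: it assumes the pre-1.0 openai Python package (consistent with the openai.api_key usage above), and the prompt is arbitrary.

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# a minimal sanity-check request; any short prompt works
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)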

Python requirements:

The code was developed using Python 3.11.5.

git clone git@github.com:idekerlab/llm_evaluation_for_gene_set_interpretation.git

cd llm_evaluation_for_gene_set_interpretation

pip install -r requirements.txt

DDOT is required for downloading GO and can be installed in one of two ways:

To install DDOT by downloading the zip file of the source tree:

wget https://github.com/idekerlab/ddot/archive/refs/heads/python3.zip
unzip python3.zip
cd ddot-python3
python setup.py bdist_wheel
pip install dist/ddot*py3*whl

To install DDOT by cloning the repo:

git clone --branch python3 https://github.com/idekerlab/ddot.git
cd ddot
python setup.py bdist_wheel
pip install dist/ddot*py3*whl
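
After either route, a quick sanity check is simply importing the package:

# verify the DDOT installation by importing it
import ddot
print(ddot.__file__)  # location of the installed package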

Documentation

The notebooks are numbered according to the evaluation steps.

  1. Data Preparation (this step can be omitted for testing purposes)

    The data are already in the data directory (refer to the README in that directory for detailed information about the data)

    If you need to download GO, run the code below:

    ## download and parse GO_BP terms
    outdir='data/GO_BP/'
    namespace='biological_process'
    python process_the_gene_ontology.py $outdir --namespace $namespace
    

    See also the notebook for parsing GO terms.

    The addition of contamination to the gene sets is found in this notebook.

    If you need to download the omics data, run the notebook. The notebook processes the omics data and saves it as a tab-delimited text file.
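
    As a minimal sketch, the resulting tab-delimited table can then be inspected with pandas (the file path below is hypothetical; use whatever path the notebook writes):

    import pandas as pd

    # load the tab-delimited omics gene set table written by the notebook
    # (replace the path with the actual output file)
    omics_sets = pd.read_csv('data/omics_gene_sets.txt', sep='\t')
    print(omics_sets.head())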

  2. Query GPT-4 for names and supporting analysis and run functional enrichment

    GO gene set GPT-4 analysis is stored in Run_LLM_analysis

    GO gene set analysis with different models

    Batch-run 1,000 GO terms as a Slurm job using the parameter file

    Omics gene set GPT-4 analysis and omics gene set gProfiler analysis

    ## example code to process the 1st to 5th terms in the table
    # run in the command line

    input_file='data/GO_term_analysis/toy_example.csv'  # input table path
    config='./jsonFiles/GOLLMrun_config.json'  # configuration file
    set_index='GO'  # column used as the table index
    gene_column='Genes'  # name of the gene list column
    start=0
    end=5
    out_file='data/GO_term_analysis/LLM_processed_toy_example_gpt_4'  # output path prefix

    source activate llm_eval
    # run the Python script for the given range of rows
    python query_llm_for_analysis.py --config $config \
                --initialize \
                --input $input_file \
                --input_sep ',' \
                --set_index $set_index \
                --gene_column $gene_column \
                --gene_sep ' ' \
                --start $start \
                --end $end \
                --output_file $out_file
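
    For reference, a minimal sketch of the input table layout implied by the parameters above (a comma-separated file with a 'GO' index column and space-separated gene symbols in a 'Genes' column); the rows below are illustrative and are not the contents of the repository's toy_example.csv:

    import pandas as pd

    # illustrative rows only; the real toy example ships with the repository
    toy = pd.DataFrame({
        'GO': ['GO:0006281', 'GO:0006915'],
        'Genes': ['BRCA1 RAD51 ATM', 'CASP3 BAX BCL2'],
    })
    toy.to_csv('data/GO_term_analysis/my_toy_example.csv', index=False)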
    
  3. Semantic similarity evaluation of names

    GO gene set analysis evaluation

    # get the ranking of similarities from the GO gene set analysis
    python rank_GOterm_LLM_sim_rand.py \
        --input_file ./data/GO_term_analysis/LLM_processed_toy_example_w_contamination_gpt_4.tsv \
        --emb_file data/all_go_terms_embeddings_dict.pkl \
        --topn 3 \
        --output_file ./data/GO_term_analysis/simrank_LLM_processed_toy_example.tsv \
        --background_file data/GO_term_analysis/all_go_sim_scores_toy.txt
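
    The ranking logic itself lives in rank_GOterm_LLM_sim_rand.py; the sketch below only illustrates the underlying idea of ranking GO terms by cosine similarity of name embeddings, and it assumes (from the file name) that the pickle maps GO term names to embedding vectors:

    import pickle
    import numpy as np

    # assumption: the pickle maps GO term names to embedding vectors
    with open('data/all_go_terms_embeddings_dict.pkl', 'rb') as f:
        go_embeddings = pickle.load(f)

    def cosine_similarity(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # given an embedding of an LLM-proposed name (obtained with the same
    # embedding model), rank GO terms by similarity and keep the top 3:
    # sims = {term: cosine_similarity(llm_name_vec, vec) for term, vec in go_embeddings.items()}
    # top3 = sorted(sims, key=sims.get, reverse=True)[:3]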
    
  4. Further evaluation of the performance: model comparison, gene set functional enrichment, and gene set similarity comparison

    Evaluation Task 1 related:

    Model Comparison

    Analysis related to Fig. 2a: compare the semantic similarities between models

    Analysis related to Fig. 3: run GO gene set functional enrichment as a control

    Compare the confidence scores between real, contaminated, and random gene sets

    Check broader concepts of the LLM names

    Analysis for Fig. 2d

    Analysis for whether the best matching GO term is a broader concept than the queried term

    Evaluation Task 2 related: count the genes supporting the LLM name, then calculate the LLM name Jaccard index (see the sketch at the end of this step)

    Analysis related to Fig. 4

    Omics data naming evaluation

    To evaluate whether the LLM name matches any significantly enriched GO term name, use this notebook.
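
    For the Jaccard index mentioned under Evaluation Task 2 above, the calculation itself is standard; here is a minimal, generic sketch (the gene sets below are illustrative, not taken from the repository data):

    # Jaccard index between two gene sets: |intersection| / |union|
    def jaccard_index(set_a, set_b):
        a, b = set(set_a), set(set_b)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    genes_supporting_llm_name = {'BRCA1', 'RAD51', 'ATM'}
    genes_in_query_set = {'BRCA1', 'RAD51', 'ATM', 'TP53'}
    print(jaccard_index(genes_supporting_llm_name, genes_in_query_set))  # 0.75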

  5. Development and assessment of the citation module

  6. Quantification of the citation module: see the check citation module notebook

  7. Visualization of results

    Extended Data Fig. 1, Fig. 2, and Fig. 3

    Extract sub-hierarchy (Fig. 2e)

    Omics figures (Fig. 4, Extended Data Fig. 5)

License

MIT License

Citing

Hu M, Alkhairy S, Lee I, Pillich RT, Bachelder R, Ideker T, Pratt D. Evaluation of large language models for discovery of gene set function. Preprint at https://doi.org/10.48550/arXiv.2309.04019 (2023)
