# Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks
✨In this repository, we have the datasets, models, and code used in our study!✨
First, please clone this repository and create a corresponding conda environment 🐍.
❗ NOTE: For the PyTorch installation, please install the version appropriate for your hardware; see the official PyTorch installation instructions.
```bash
conda create -n tplm python=3.10
conda activate tplm
conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install scikit-learn==1.3.1
pip install -U "huggingface_hub[cli]"
```
We provide the `environment.yml`, but we recommend running the commands above instead of installing from the yml file.
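If you do install from the yml file instead, the standard conda commands are:

```bash
conda env create -f environment.yml  # the environment name is defined inside the yml
conda activate tplm                  # adjust if the yml defines a different name
```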
❗ NOTE: These experiments were performed on an NVIDIA A6000 GPU with CUDA 12.3. Please note that exact reproducibility is not guaranteed across devices; see the PyTorch notes on reproducibility.
To reproduce the results from our study in sequential order, please follow the steps listed below (an example invocation follows this list).

1. `download_data_embs.sh`
2. `run_tplm_benchmarks.sh`
3. `run_embedding_fusion_benchmarks.sh`
4. `run_ppi.sh`
5. `run_cath.sh`
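For example, assuming the scripts are run with bash from the repository root:

```bash
bash download_data_embs.sh               # 1️⃣ fetch datasets and precomputed embeddings
bash run_tplm_benchmarks.sh              # 2️⃣ benchmark tpLMs against ESM2 3B
bash run_embedding_fusion_benchmarks.sh  # 3️⃣ benchmark embedding fusion
bash run_ppi.sh                          # 4️⃣ PPI combination search and evaluation
bash run_cath.sh                         # 5️⃣ CATH combination search and evaluation
```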
## 1️⃣ Downloading data and embeddings

The data and embeddings are stored on HuggingFace, and our `download_data_embs.sh` script uses `huggingface-cli` to download the necessary files.

❗ NOTE: Before running `download_data_embs.sh`, please add your HuggingFace token after the `--token` flag. Once added, run `download_data_embs.sh`.

```bash
huggingface-cli login --add-to-git-credential --token # Add your HuggingFace token here
```
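To confirm the token was registered before launching the download, you can use the standard `huggingface_hub` CLI check:

```bash
huggingface-cli whoami  # prints your username if the token is valid
```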
## Dataset Details
The datasets used in this study were created by the following authors:

- AAV, GB1, and Meltome: https://github.com/J-SNACKKB/FLIP
- GFP and Stability: https://github.com/songlab-cal/tape
- Location: https://github.com/HannesStark/protein-localization
- PPI: https://github.com/daisybio/data-leakage-ppi-prediction
- CATH/Homologous sequence recovery: https://www.cathdb.info/
## Generating New Embeddings
We have provided sample scripts for generating embeddings for each protein language model (pLM) in the `embedding_generation/` directory. To generate your own embeddings using the pLMs from this study, follow these steps:
1. **Clone the Repository:**
   - Clone the repository of the respective pLM you intend to use, and follow the specific setup and environment instructions detailed in each pLM's repository.
2. **Generate Embeddings:**
   - Copy the embedding generation script we provide in `embedding_generation/` into the cloned pLM's directory. Each pLM has a different embedding generation script, so please make sure you use the appropriate one.
   - Execute these scripts within the pLM's environment and directory to generate new embeddings. Ensure that the outputs are directed to the appropriate location (a minimal sketch of such a script follows this list).
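For reference, the sketch below shows the general shape of an embedding generation script, using a small ESM2 checkpoint through the `transformers` library. It is an illustration only: the scripts in `embedding_generation/` follow each pLM's own API, which may differ from this.

```python
# Illustrative sketch only -- each pLM's actual script in embedding_generation/
# uses that model's own loading/inference API.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t6_8M_UR50D"  # small ESM2 checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example protein sequence
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the per-residue embeddings (dropping the special CLS/EOS tokens)
# into a single fixed-length vector for the protein.
residue_embs = outputs.last_hidden_state[0, 1:-1]
protein_emb = residue_embs.mean(dim=0)
torch.save(protein_emb, "example_embedding.pt")
```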
## 2️⃣ Benchmarking tpLMs against ESM2 3B

Run `run_tplm_benchmarks.sh` to train models for benchmarking tpLMs against ESM2 3B on AAV, GB1, GFP, Location, Meltome, and Stability.
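As a rough illustration of what these benchmarks do conceptually, the sketch below fits a simple scikit-learn model on precomputed embeddings. The file names and the Ridge model are hypothetical; `run_tplm_benchmarks.sh` runs the study's actual models and splits.

```python
# Hypothetical files and model -- see run_tplm_benchmarks.sh for the real pipeline.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

X_train = np.load("train_embeddings.npy")  # (n_train, d) protein embeddings
y_train = np.load("train_labels.npy")      # e.g., fitness or stability values
X_test = np.load("test_embeddings.npy")
y_test = np.load("test_labels.npy")

model = Ridge(alpha=1.0).fit(X_train, y_train)
print("Test R^2:", r2_score(y_test, model.predict(X_test)))
```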
## 3️⃣ Benchmarking embedding fusion with tpLMs

Run `run_embedding_fusion_benchmarks.sh` to train models for benchmarking embedding fusion with tpLMs on AAV, GB1, GFP, Location, Meltome, and Stability.
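At its core, embedding fusion combines per-protein vectors from different models into one representation. A minimal concatenation-based sketch is shown below (hypothetical file names; see `run_embedding_fusion_benchmarks.sh` for the exact fusion used in the study):

```python
# Hypothetical embedding files -- concatenation is the simplest fusion strategy.
import torch

emb_esm2 = torch.load("esm2_protein_embedding.pt")  # shape (d1,)
emb_tplm = torch.load("tplm_protein_embedding.pt")  # shape (d2,)
fused = torch.cat([emb_esm2, emb_tplm], dim=-1)     # shape (d1 + d2,)
```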
## 4️⃣ Identifying optimal combinations and evaluating performance on protein-protein interaction prediction

Run `run_ppi.sh` to use the greedy heuristic to identify a promising combination of embeddings, then train models with all possible combinations of embeddings to identify the true best combination.
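The greedy heuristic follows the usual forward-selection pattern; below is a sketch assuming a user-supplied `evaluate` scoring function (hypothetical; `run_ppi.sh` implements the study's actual search):

```python
# Hypothetical evaluate(combo) -> validation score; higher is better.
def greedy_select(embedding_names, evaluate):
    """Greedily add the embedding that most improves the score until none helps."""
    selected, best_score = [], float("-inf")
    remaining = list(embedding_names)
    while remaining:
        scores = {name: evaluate(selected + [name]) for name in remaining}
        best_name = max(scores, key=scores.get)
        if scores[best_name] <= best_score:
            break  # no single addition improves the current combination
        selected.append(best_name)
        remaining.remove(best_name)
        best_score = scores[best_name]
    return selected, best_score
```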
## 5️⃣ Identifying optimal combinations and evaluating performance on homologous sequence recovery

Run `run_cath.sh` to use the greedy heuristic (the same procedure sketched above) to identify a promising combination of embeddings, then evaluate all possible combinations of embeddings to identify the true best combination.