# Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks
✨In this repository, we have the datasets, models, and code used in our study!✨
First, please clone this repository and create a corresponding conda environment 🐍.
❗ NOTE: For the PyTorch installation, please install the version appropriate for your hardware; see the official PyTorch installation instructions.
```bash
conda create -n tplm python=3.10
conda activate tplm
conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install scikit-learn==1.3.1
pip install -U "huggingface_hub[cli]"
```
We provide the `environment.yml`, but we recommend running the commands above instead of installing from the yml file.
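If you do install from the yml file instead, the standard conda commands are:

```bash
conda env create -f environment.yml  # the environment name is defined inside the yml
conda activate tplm                  # adjust if the yml defines a different name
```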
❗ NOTE: These experiments were performed on an NVIDIA A6000 GPU with CUDA 12.3. Please note that exact reproducibility is not guaranteed across devices; see the PyTorch notes on reproducibility.
To reproduce the results from our study in sequential order, please follow the steps listed below (an example invocation follows this list).

1. `download_data_embs.sh`
2. `run_tplm_benchmarks.sh`
3. `run_embedding_fusion_benchmarks.sh`
4. `run_ppi.sh`
5. `run_cath.sh`
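For example, assuming the scripts are run with bash from the repository root:

```bash
bash download_data_embs.sh               # 1️⃣ fetch datasets and precomputed embeddings
bash run_tplm_benchmarks.sh              # 2️⃣ benchmark tpLMs against ESM2 3B
bash run_embedding_fusion_benchmarks.sh  # 3️⃣ benchmark embedding fusion
bash run_ppi.sh                          # 4️⃣ PPI combination search and evaluation
bash run_cath.sh                         # 5️⃣ CATH combination search and evaluation
```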
## 1️⃣ Downloading data and embeddings

The data and embeddings are stored on HuggingFace, and our `download_data_embs.sh` script uses `huggingface-cli` to download the necessary files.

❗ NOTE: Before running `download_data_embs.sh`, please add your HuggingFace token after the `--token` flag. Once added, run `download_data_embs.sh`.

```bash
huggingface-cli login --add-to-git-credential --token # Add your HuggingFace token here
```
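To confirm the token was registered before launching the download, you can use the standard `huggingface_hub` CLI check:

```bash
huggingface-cli whoami  # prints your username if the token is valid
```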
## Dataset Details
The datasets used in this study were created by the following authors:

- AAV, GB1, and Meltome: https://github.com/J-SNACKKB/FLIP
- GFP and Stability: https://github.com/songlab-cal/tape
- Location: https://github.com/HannesStark/protein-localization
- PPI: https://github.com/daisybio/data-leakage-ppi-prediction
- CATH/Homologous sequence recovery: https://www.cathdb.info/
## Generating New Embeddings
We have provided sample scripts for generating embeddings for each protein language model (pLM) in the `embedding_generation/` directory. To generate your own embeddings using the pLMs from this study, follow these steps:
1. **Clone the Repository:**
   - Clone the repository of the respective pLM you intend to use, and follow the specific setup and environment instructions detailed in each pLM's repository.
2. **Generate Embeddings:**
   - Copy the embedding generation script we provide in `embedding_generation/` into the cloned pLM's directory. Each pLM has a different embedding generation script, so please make sure you use the appropriate one.
   - Execute these scripts within the pLM's environment and directory to generate new embeddings. Ensure that the outputs are directed to the appropriate location (a minimal sketch of such a script follows this list).
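For reference, the sketch below shows the general shape of an embedding generation script, using a small ESM2 checkpoint through the `transformers` library. It is an illustration only: the scripts in `embedding_generation/` follow each pLM's own API, which may differ from this.

```python
# Illustrative sketch only -- each pLM's actual script in embedding_generation/
# uses that model's own loading/inference API.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t6_8M_UR50D"  # small ESM2 checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example protein sequence
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the per-residue embeddings (dropping the special CLS/EOS tokens)
# into a single fixed-length vector for the protein.
residue_embs = outputs.last_hidden_state[0, 1:-1]
protein_emb = residue_embs.mean(dim=0)
torch.save(protein_emb, "example_embedding.pt")
```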
## 2️⃣ Benchmarking tpLMs against ESM2 3B

Run `run_tplm_benchmarks.sh` to train models for benchmarking tpLMs against ESM2 3B on AAV, GB1, GFP, Location, Meltome, and Stability.
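As a rough illustration of what these benchmarks do conceptually, the sketch below fits a simple scikit-learn model on precomputed embeddings. The file names and the Ridge model are hypothetical; `run_tplm_benchmarks.sh` runs the study's actual models and splits.

```python
# Hypothetical files and model -- see run_tplm_benchmarks.sh for the real pipeline.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

X_train = np.load("train_embeddings.npy")  # (n_train, d) protein embeddings
y_train = np.load("train_labels.npy")      # e.g., fitness or stability values
X_test = np.load("test_embeddings.npy")
y_test = np.load("test_labels.npy")

model = Ridge(alpha=1.0).fit(X_train, y_train)
print("Test R^2:", r2_score(y_test, model.predict(X_test)))
```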
## 3️⃣ Benchmarking embedding fusion with tpLMs

Run `run_embedding_fusion_benchmarks.sh` to train models for benchmarking embedding fusion with tpLMs on AAV, GB1, GFP, Location, Meltome, and Stability.
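At its core, embedding fusion combines per-protein vectors from different models into one representation. A minimal concatenation-based sketch is shown below (hypothetical file names; see `run_embedding_fusion_benchmarks.sh` for the exact fusion used in the study):

```python
# Hypothetical embedding files -- concatenation is the simplest fusion strategy.
import torch

emb_esm2 = torch.load("esm2_protein_embedding.pt")  # shape (d1,)
emb_tplm = torch.load("tplm_protein_embedding.pt")  # shape (d2,)
fused = torch.cat([emb_esm2, emb_tplm], dim=-1)     # shape (d1 + d2,)
```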
## 4️⃣ Identifying optimal combinations and evaluating performance on protein-protein interaction prediction

Run `run_ppi.sh` to use the greedy heuristic to identify a promising combination of embeddings, then train models with all possible combinations of embeddings to identify the true best combination.
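The greedy heuristic follows the usual forward-selection pattern; below is a sketch assuming a user-supplied `evaluate` scoring function (hypothetical; `run_ppi.sh` implements the study's actual search):

```python
# Hypothetical evaluate(combo) -> validation score; higher is better.
def greedy_select(embedding_names, evaluate):
    """Greedily add the embedding that most improves the score until none helps."""
    selected, best_score = [], float("-inf")
    remaining = list(embedding_names)
    while remaining:
        scores = {name: evaluate(selected + [name]) for name in remaining}
        best_name = max(scores, key=scores.get)
        if scores[best_name] <= best_score:
            break  # no single addition improves the current combination
        selected.append(best_name)
        remaining.remove(best_name)
        best_score = scores[best_name]
    return selected, best_score
```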
## 5️⃣ Identifying optimal combinations and evaluating performance on homologous sequence recovery

Run `run_cath.sh` to use the greedy heuristic (the same procedure sketched above) to identify a promising combination of embeddings, then evaluate all possible combinations of embeddings to identify the true best combination.