Update README.md

Implementing GREMLIN's ProteinGym performances
niklases · Jul 1, 2024 · c628dd5
1 parent aacafbb
Showing 4 changed files with 14 additions and 0 deletions.
diff --git a/.github/imgs/multi_point_mut_performance.png b/.github/imgs/multi_point_mut_performance.png
diff --git a/.github/imgs/single_point_mut_performance.png b/.github/imgs/single_point_mut_performance.png
diff --git a/README.md b/README.md
@@ -461,6 +461,19 @@ Other well-performing zero-shot prediction methods with available source code ar
   
 This list is by no means complete, see ProteinGym [repository](https://github.com/OATML-Markslab/ProteinGym) and [website](https://proteingym.org/) for a more detailed overview of available methods and achieved performances (as well as for getting many benchmark data sets).
 
+The performance of the GREMLIN model used here on a selection of ProteinGym datasets is shown below for predicting
+(I) single-substitution effects
+<p align="center">
+    <img src=".github/imgs/single_point_mut_performance.png" alt="GREMLIN performance on single-substitution ProteinGym datasets" width="500"/>
+</p>
+
+(II) multi-substitution effects
+<p align="center">
+    <img src=".github/imgs/multi_point_mut_performance.png" alt="GREMLIN performance on multi-substitution ProteinGym datasets" width="500"/>
+</p>
+
+These results were computed using the scripts located at [scripts/ProteinGym_runs](scripts/ProteinGym_runs).
+
 <a name="api-usage"></a>
 ## API Usage for Sequence Encoding
 For script-based encoding of sequences using PyPEF and the available AAindex-, OneHot- or DCA-based techniques, the corresponding classes and functions can be imported, i.e. `OneHotEncoding`, `AAIndexEncoding`, `GREMLIN` (DCA), `PLMC` (DCA), and `DCAHybridModel`. In addition, the implemented functions for CV-based tuning of regression models can be used to train and validate models and then to evaluate their performance on held-out test data. An exemplary script and a Jupyter notebook for CV-based (low-*N*) tuning of models and using them for testing are provided at [scripts/Encoding_low_N/api_encoding_train_test.py](scripts/Encoding_low_N/api_encoding_train_test.py) and [scripts/Encoding_low_N/api_encoding_train_test.ipynb](scripts/Encoding_low_N/api_encoding_train_test.ipynb), respectively.

diff --git a/scripts/ProteinGym_runs/README.md b/scripts/ProteinGym_runs/README.md
@@ -1,3 +1,4 @@
 ## Benchmark runs on publicly available ProteinGym protein variant sequence-fitness datasets
 
 Data is taken (script-based download) from "DMS Assays"-->"Substitutions" and "Multiple Sequence Alignments"-->"DMS Assays" data from https://proteingym.org/download.
+First, run `download_proteingym_and_extract_data.py` to download and extract the ProteinGym data, then run `run_performance_tests_proteingym_data.py` to compute the predictions and the resulting performance on those datasets.
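
Note on the added README line: the two-step order matters, since the benchmark script relies on the data the download script produces. A minimal sketch of a wrapper that enforces this order is given below. The script names come from the README line above; the helper `run_in_order` is hypothetical, and it assumes that both scripts take no command-line arguments and are run from inside their own directory.

```python
import subprocess
import sys
from pathlib import Path

# Script names as given in the README; running them without arguments
# from their own directory is an assumption.
SCRIPTS = [
    "download_proteingym_and_extract_data.py",
    "run_performance_tests_proteingym_data.py",
]


def run_in_order(script_dir, scripts=SCRIPTS):
    """Run each script with the current interpreter, stopping at the first failure.

    Hypothetical helper, not part of PyPEF.
    """
    script_dir = Path(script_dir)
    completed = []
    for name in scripts:
        path = script_dir / name
        if not path.is_file():
            raise FileNotFoundError(f"Script not found: {path}")
        # check=True raises CalledProcessError on a non-zero exit code,
        # so the benchmark step is never started after a failed download.
        subprocess.run([sys.executable, path.name], cwd=script_dir, check=True)
        completed.append(name)
    return completed
```

From the repository root, this could be called as `run_in_order("scripts/ProteinGym_runs")`.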