Update README.md
Implementing GREMLIN's ProteinGym performances
niklases committed Jul 1, 2024
1 parent aacafbb commit c628dd5
Showing 4 changed files with 14 additions and 0 deletions.
Binary file added .github/imgs/multi_point_mut_performance.png
Binary file added .github/imgs/single_point_mut_performance.png
13 changes: 13 additions & 0 deletions README.md
@@ -461,6 +461,19 @@ Other well-performing zero-shot prediction methods with available source code ar
This list is by no means complete; see the ProteinGym [repository](https://github.com/OATML-Markslab/ProteinGym) and [website](https://proteingym.org/) for a more detailed overview of available methods and their performance, as well as for many benchmark datasets.
The following plots show the performance of the GREMLIN model used for predicting
(I) single-substitution effects
<p align="center">
<img src=".github/imgs/single_point_mut_performance.png" alt="drawing" width="500"/>
</p>
(II) multi-substitution effects
<p align="center">
<img src=".github/imgs/multi_point_mut_performance.png" alt="drawing" width="500"/>
</p>
for some ProteinGym datasets computed using the scripts located at [scripts/ProteinGym_runs](scripts/ProteinGym_runs).
<a name="api-usage"></a>
## API Usage for Sequence Encoding
For script-based encoding of sequences using PyPEF and the available AAindex-, OneHot-, or DCA-based techniques, the corresponding classes and functions can be imported, namely `OneHotEncoding`, `AAIndexEncoding`, `GREMLIN` (DCA), `PLMC` (DCA), and `DCAHybridModel`. In addition, the implemented functions for CV-based tuning of regression models can be used to train and validate models and then to evaluate their performance on data held out for testing. An example script and a Jupyter notebook for CV-based (low-*N*) tuning of models and their subsequent testing are provided at [scripts/Encoding_low_N/api_encoding_train_test.py](scripts/Encoding_low_N/api_encoding_train_test.py) and [scripts/Encoding_low_N/api_encoding_train_test.ipynb](scripts/Encoding_low_N/api_encoding_train_test.ipynb), respectively.
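
To illustrate what a one-hot sequence encoding produces, the following is a minimal standard-library sketch. It does not use PyPEF's `OneHotEncoding` class or any of its signatures; the function names `apply_substitutions` and `one_hot_encode`, the `A2G/K3R` variant notation, and the `/` separator are assumptions made for this example only. For the actual API, refer to the script and notebook linked above.

```python
from typing import List

# Illustrative only: NOT PyPEF's API. All names here are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def apply_substitutions(wild_type: str, variant: str, sep: str = "/") -> str:
    """Apply substitutions such as 'A2G/K3R' (1-based positions) to a sequence.

    The separator is an assumption; some datasets use ':' instead.
    """
    seq = list(wild_type)
    for sub in variant.split(sep):
        wt_aa, pos, mut_aa = sub[0], int(sub[1:-1]), sub[-1]
        if wild_type[pos - 1] != wt_aa:
            raise ValueError(f"{sub}: wild type has {wild_type[pos - 1]} at {pos}")
        seq[pos - 1] = mut_aa
    return "".join(seq)


def one_hot_encode(sequence: str) -> List[int]:
    """Flatten a sequence into a 0/1 vector of length 20 * len(sequence)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vector: List[int] = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1  # raises KeyError for non-canonical residues
        vector.extend(row)
    return vector


variant_seq = apply_substitutions("MAKT", "A2G/K3R")
features = one_hot_encode(variant_seq)
print(variant_seq, len(features))
```

Each variant thus becomes a fixed-length feature vector, which can be passed to a regression model together with the measured fitness values.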
1 change: 1 addition & 0 deletions scripts/ProteinGym_runs/README.md
@@ -1,3 +1,4 @@
## Benchmark runs on publicly available ProteinGym protein variant sequence-fitness datasets

Data is taken (via script-based download) from the "DMS Assays"-->"Substitutions" and "Multiple Sequence Alignments"-->"DMS Assays" entries at https://proteingym.org/download.
First, run `download_proteingym_and_extract_data.py` to download and extract the ProteinGym data; then run `run_performance_tests_proteingym_data.py` to obtain the predictions and the performance on those datasets.
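
As a rough illustration of how such a performance value can be computed, the sketch below assumes that performance is reported as Spearman's rank correlation between measured fitness and predicted scores, which is the usual metric for zero-shot prediction on ProteinGym. That assumption is not stated in this README, and the code is a standalone standard-library example rather than an excerpt from the scripts above; the function names and the numbers are made up.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(measured, predicted):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(measured), average_ranks(predicted)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up example values: measured fitness vs. model scores for four variants.
measured = [0.1, 0.5, 0.3, 0.9]
predicted = [-2.0, -1.0, -0.5, 0.2]
rho = spearman_rho(measured, predicted)
print(round(rho, 3))
```

Because only the ranks enter the calculation, the score scale of the model does not matter, which is why rank correlation suits unsupervised scores that are not calibrated to the assay's fitness units.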
