Added zero-shot commands
niklases committed Jan 4, 2024
1 parent 2c00079 commit 61fa903
Showing 3 changed files with 105 additions and 8 deletions.
91 changes: 89 additions & 2 deletions .vscode/launch.json
@@ -16,8 +16,9 @@
"--help"
]
},

{
"name": "Python: PyPEF MKLSTS",
"name": "Python: PyPEF MKLSTS ANEH",
"type": "python",
"request": "launch",
"env": {"PYTHONPATH": "${workspaceFolder}"},
@@ -30,7 +31,93 @@
"--wt", "${workspaceFolder}/datasets/ANEH/Sequence_WT_ANEH.fasta",
"--input", "${workspaceFolder}/datasets/ANEH/37_ANEH_variants.csv"
]
}
},

{
"name": "Python: PyPEF MKLSTS avGFP",
"type": "python",
"request": "launch",
"env": {"PYTHONPATH": "${workspaceFolder}"},
"program": "${workspaceFolder}/pypef/main.py",
"console": "integratedTerminal",
"justMyCode": true,
"cwd": "${workspaceFolder}/datasets/AVGFP/",
"args": [
"mklsts",
"--wt", "P42212_F64L.fasta",
"--input", "avGFP.csv"
]
},

{ // GREMLIN zero-shot steps:
// 1. $pypef param_inference --msa uref100_avgfp_jhmmer_119.a2m --opt_iter 100
// 2. $pypef hybrid -t TS.fasl --params GREMLIN
// or
// 2. $pypef hybrid -m GREMLIN -t TS.fasl --params GREMLIN
"name": "Python: PyPEF save GREMLIN avGFP model",
"type": "python",
"request": "launch",
"env": {"PYTHONPATH": "${workspaceFolder}"},
"program": "${workspaceFolder}/pypef/main.py",
"console": "integratedTerminal",
"justMyCode": true,
"cwd": "${workspaceFolder}/datasets/AVGFP/",
"args": [
"param_inference",
"--msa", "uref100_avgfp_jhmmer_119.a2m",
"--opt_iter", "100"
]
},

{
"name": "Python: PyPEF hybrid/only-TS-zero-shot GREMLIN-DCA avGFP",
"type": "python",
"request": "launch",
"env": {"PYTHONPATH": "${workspaceFolder}"},
"program": "${workspaceFolder}/pypef/main.py",
"console": "integratedTerminal",
"justMyCode": true,
"cwd": "${workspaceFolder}/datasets/AVGFP/",
"args": [
"hybrid",
//"-m", "GREMLIN", // optional, not required
"--ts", "TS.fasl",
"--params", "GREMLIN"
]
},

{ // PLMC zero-shot steps:
// 1. $pypef param_inference --params uref100_avgfp_jhmmer_119_plmc_42.6.params
// 2. $pypef hybrid -t TS.fasl --params PLMC
"name": "Python: PyPEF save PLMC avGFP model",
"type": "python",
"request": "launch",
"env": {"PYTHONPATH": "${workspaceFolder}"},
"program": "${workspaceFolder}/pypef/main.py",
"console": "integratedTerminal",
"justMyCode": true,
"cwd": "${workspaceFolder}/datasets/AVGFP/",
"args": [
"param_inference",
"--params", "uref100_avgfp_jhmmer_119_plmc_42.6.params"
]
},

{
"name": "Python: PyPEF hybrid/only-TS-zero-shot PLMC-DCA avGFP",
"type": "python",
"request": "launch",
"env": {"PYTHONPATH": "${workspaceFolder}"},
"program": "${workspaceFolder}/pypef/main.py",
"console": "integratedTerminal",
"justMyCode": true,
"cwd": "${workspaceFolder}/datasets/AVGFP/",
"args": [
"hybrid",
"--ts", "TS.fasl",
"--params", "PLMC",
"--threads", "24"
]
}
]
}
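
Each launch configuration above corresponds to a plain command line executed from the configuration's `cwd`. The following is a minimal sketch of that mapping, assuming the entry has already been parsed into a Python dict (the file itself contains `//` comments, so the standard `json` module would not read it directly). The helper name `launch_config_to_command` is hypothetical, `python` is assumed to be the interpreter on the path, and the `env` setting (`PYTHONPATH`) is ignored for brevity.

```python
import shlex


def launch_config_to_command(config: dict, workspace_folder: str) -> str:
    """Render a VS Code launch configuration as an equivalent shell command.

    Hypothetical helper for illustration: only the `program`, `args`, and
    `cwd` keys are used, and `${workspaceFolder}` is the only variable that
    gets substituted.
    """
    def expand(value: str) -> str:
        return value.replace("${workspaceFolder}", workspace_folder)

    command = ["python", expand(config["program"])]
    command += [expand(arg) for arg in config.get("args", [])]
    line = shlex.join(command)
    if "cwd" in config:
        # Change into the working directory first, as the debugger would.
        line = f"cd {shlex.quote(expand(config['cwd']))} && {line}"
    return line


# The "save GREMLIN avGFP model" entry from the diff, written as a dict.
gremlin_config = {
    "program": "${workspaceFolder}/pypef/main.py",
    "cwd": "${workspaceFolder}/datasets/AVGFP/",
    "args": [
        "param_inference",
        "--msa", "uref100_avgfp_jhmmer_119.a2m",
        "--opt_iter", "100",
    ],
}

print(launch_config_to_command(gremlin_config, "/path/to/PyPEF"))
```

The workspace path `/path/to/PyPEF` is a placeholder; substituting the real checkout location should give a command that can be pasted into a terminal.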
17 changes: 12 additions & 5 deletions README.md
@@ -30,7 +30,7 @@ Preprint available at bioRxiv: https://doi.org/10.1101/2022.06.07.495081.
- [Model Hyperparameter Grids for Training](#model-hyperparameter-grids-for-training)
- [Setting Up the Scripts Yourself](#setting-up-the-scripts-yourself)
- [Preprocessing for DCA-based Sequence Encoding](#preprocessing-for-dca-based-sequence-encoding)
- [Unsupervised (DCA-based) zero-shot prediction](#unsupervised-dca-based-zero-shot-prediction)
- [Unsupervised/zero-shot prediction](#unsupervisedzero-shot-prediction)
- [API Usage for Sequence Encoding](#api-usage-for-sequence-encoding)
---

@@ -412,15 +412,22 @@ python3 ./pypef/main.py
```
<a name="zero-shot-prediction"></a>
## Unsupervised (DCA-based) zero-shot prediction
## Unsupervised/zero-shot prediction
Several developed methods allow unsupervised prediction of a protein's fitness based on its sequence (and/or structure).
These methods have the advantage that no initial knowledge about a protein's fitness is required for prediction, while a correlation between the predicted score and a protein's natural fitness is assumed.
DCA itself was a statistical/unsupervised method based on MSA information that outperforms simpler MSA-based methods (such as (un)coupled raw MSA sequence frequencies or BLOSUM scores), see [scripts/GREMLIN_numba/using_gremlin_functionalities.ipynb](scripts/GREMLIN_numba/using_gremlin_functionalities.ipynb).
To make zero-shot predictions using PyPEF (plmc-DCA or GREMLIN-DCA) just do not provide a train set for model testing and use the DCA encoding method, e.g.
DCA itself is a statistical/unsupervised method based on MSA information that outperforms simpler MSA-based methods (such as (un)coupled raw MSA sequence frequencies or BLOSUM scores), e.g., see [scripts/GREMLIN_numba/using_gremlin_functionalities.ipynb](scripts/GREMLIN_numba/using_gremlin_functionalities.ipynb).
To make zero-shot predictions using PyPEF (plmc-DCA or GREMLIN-DCA), provide only a test set (no training set) and apply the DCA encoding method, e.g., for the avGFP data,
```
TODO
pypef param_inference --msa uref100_avgfp_jhmmer_119.a2m
pypef hybrid -t AVGFP_TS.fasl --params GREMLIN
```
using the GREMLIN parameters, or,
```
pypef param_inference --params uref100_avgfp_jhmmer_119_plmc_42.6.params
pypef hybrid -t TS.fasl --params PLMC
```
using the plmc parameters.
Other well-performing zero-shot prediction methods with available source code are (list not complete, see ProteinGym [repository](https://github.com/OATML-Markslab/ProteinGym) and [website](https://proteingym.org/) for a more detailed overview of available methods and achieved performances):
- ESM-1v/ESM-2 (https://github.com/facebookresearch/esm)
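
The two-step zero-shot workflow described in this README hunk (save the DCA parameters, then run `hybrid` with a test set only) can be sketched as a small Python wrapper. This is an illustrative sketch, not part of PyPEF: the function names `zero_shot_commands` and `run_commands` are hypothetical, and only the subcommands and flags that appear in the diff (`param_inference --msa`, `param_inference --params`, `hybrid -t ... --params`) are used.

```python
import subprocess


def zero_shot_commands(method: str, source_file: str, test_set: str) -> list[list[str]]:
    """Build the two PyPEF calls for a zero-shot run as argument lists.

    `method` is "GREMLIN" (source_file is an MSA) or "PLMC" (source_file is
    a plmc parameter file), mirroring the two variants shown in the README.
    """
    if method == "GREMLIN":
        infer = ["pypef", "param_inference", "--msa", source_file]
    elif method == "PLMC":
        infer = ["pypef", "param_inference", "--params", source_file]
    else:
        raise ValueError(f"Unknown method: {method!r}")
    # Step 2: no training set is passed, only the test set and the saved model.
    predict = ["pypef", "hybrid", "-t", test_set, "--params", method]
    return [infer, predict]


def run_commands(commands: list[list[str]], cwd: str) -> None:
    """Run the commands in order inside `cwd`, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


for command in zero_shot_commands("GREMLIN", "uref100_avgfp_jhmmer_119.a2m", "TS.fasl"):
    print(" ".join(command))
```

Calling `run_commands(...)` with the dataset directory as `cwd` would execute both steps, assuming the `pypef` entry point is installed and on the path.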
5 changes: 4 additions & 1 deletion pypef/dca/hybrid_model.py
@@ -576,7 +576,10 @@ def get_model_path(model: str):
elif isfile(f'Pickles/{model}'):
model_path = f'Pickles/{model}'
else:
raise SystemError("Did not find specified model file.")
raise SystemError("Did not find specified model file in current working directory "
"or /Pickles subdirectory. Make sure to train/save a model first "
"(e.g., for saving a GREMLIN model, type \"pypef param_inference --msa TARGET_MSA.a2m\" "
"or, for saving a plmc model, type \"pypef param_inference --params TARGET_PLMC.params\").")
return model_path
except TypeError:
raise SystemError("No provided model. "
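
The lookup order implied by this hunk (a model file in the current working directory first, then in a `Pickles` subdirectory) can be illustrated in isolation. The sketch below is not the PyPEF implementation: the name `find_model_file` and the `base_dir` parameter are hypothetical, and it raises `FileNotFoundError` instead of the `SystemError` used in the diff.

```python
from pathlib import Path


def find_model_file(model: str, base_dir: str = ".") -> Path:
    """Return the path of a saved model, searching two locations in order.

    Checks `<base_dir>/<model>` first and `<base_dir>/Pickles/<model>` second;
    raises FileNotFoundError with a hint if neither file exists.
    """
    base = Path(base_dir)
    for candidate in (base / model, base / "Pickles" / model):
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"Did not find model file {model!r} in {base} or {base / 'Pickles'}. "
        "Save a model first, e.g. with 'pypef param_inference --msa TARGET_MSA.a2m'."
    )
```

For example, `find_model_file("GREMLIN")` would return `Pickles/GREMLIN` when that file exists and no file named `GREMLIN` is present in the working directory itself.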