Protein Models (prtm) is an inference-only library for deep learning protein models.
This library started out as a learning project to catch up on the deep learning models being
used in protein science. After cloning a few repos it became clear that a nascent ecosystem was
forming and that there was a need for a common interface to accelerate the creation of new workflows.
The goal of prtm is to provide a (hopefully) enjoyable and interactive API for running, comparing, and
chaining together protein DL models. Currently covered use cases include:
- Folding
- Inverse folding
- Structure design
- Sequence language modeling
- Ligand docking
With many more to come!
A very common workflow is to design a protein structure, apply inverse folding to generate plausible sequences, and then fold those sequences to see if they match the designed structure.
In prtm, we accomplish this with a few lines of code:
from prtm import models
from prtm import visual
# Define models for structure design, inverse folding and folding
designer = models.RFDiffusionForStructureDesign(model_name="auto")
inverse_folder = models.ProteinMPNNForInverseFolding(model_name="ca_only_model-20")
folder = models.OmegaFoldForFolding()
# Tell RFDiffusion to create a structure with exactly 128 residues
designed_structure, _ = designer(
    models.rfdiffusion_config.UnconditionalSamplerConfig(
        contigmap_params=models.rfdiffusion_config.ContigMap(contigs=["128-128"]),
    )
)
# Design a sequence and fold it!
designed_sequence, _ = inverse_folder(designed_structure)
predicted_designed_structure, _ = folder(designed_sequence)
# Visualize the designed structure and the predicted structure overlaid in a notebook
visual.view_superimposed_structures(designed_structure, predicted_designed_structure)
# Convert to PDB
pdb_str = predicted_designed_structure.to_pdb()
# Try docking a ligand (methotrexate) to the designed structure
ligand = "CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C(=O)NC(CCC(=O)O)C(=O)O"
docker = models.DiffDockForLigandDocking()
poses, aux_output = docker(predicted_designed_structure, ligand)
# Visualize the predicted ligand poses
visual.view_structure_with_ligand(predicted_designed_structure, poses)
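The workflow above ends with a visual overlay of the designed and refolded structures. To put a number on how well they match, one common choice is the RMSD after optimal superposition (the Kabsch algorithm). The sketch below is not part of prtm: `kabsch_rmsd` is a hypothetical helper, and it assumes you have already extracted matching coordinates (for example, C-alpha atoms) from both structures as `(N, 3)` NumPy arrays. How to pull those arrays out of a prtm structure object is not shown here.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Hypothetical helper, not part of prtm. Rows of P and Q must correspond
    to the same atoms (e.g. C-alpha atoms in residue order).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (N, 3)")

    # Remove translation by centering both point sets on their centroids.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)

    # Cross-covariance matrix and its SVD.
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rotate P onto Q (row-vector convention) and measure the deviation.
    P_rot = P0 @ R.T
    return float(np.sqrt(((P_rot - Q0) ** 2).sum(axis=1).mean()))
```

A lower value means the refolded sequence recovers the designed backbone more closely. The function only compares coordinates, so both arrays must have the same length and atom ordering.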
At this early stage, prtm has only been tested on a Linux system with a CUDA-enabled GPU.
There are no guarantees that it will work on other systems.
Before getting started, it's assumed that you've already installed conda or mamba (preferred).
Then clone this repo and create a prtm environment:
git clone https://github.com/conradry/prtm.git
cd prtm
mamba env create -f environment.yaml
mamba activate prtm
pip install -e .
To make prtm more accessible, custom CUDA kernels were removed from all models that
previously used them, so that's it for most cases!
Optionally, PyRosetta is a soft dependency of prtm and is only required for the
protein_seq_des model. A license is required to use PyRosetta and can
be obtained for free for academic use. For installation instructions, see here.
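After installing, a quick sanity check can confirm that the environment matches the tested setup (Linux, CUDA-enabled GPU) and whether the optional PyRosetta dependency is present. This is a minimal sketch and not part of prtm: `check_environment` is a hypothetical helper, and it assumes PyTorch is importable as `torch` and PyRosetta as `pyrosetta`.

```python
import importlib.util
import platform


def check_environment():
    """Report whether this machine resembles the setup prtm was tested on.

    Hypothetical helper, not part of prtm.
    """
    report = {
        "linux": platform.system() == "Linux",
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        # PyRosetta is optional; it is only needed for protein_seq_des.
        "pyrosetta_installed": importlib.util.find_spec("pyrosetta") is not None,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken torch install should not crash the check itself.
            report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

A `False` for `pyrosetta_installed` is fine unless you plan to run the protein_seq_des model.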
Note: Most, but not all, models allow commercial use. Please check the license of each model.
AlphaFold is written in JAX, but all other models are written in PyTorch, so we chose not
to directly integrate the AlphaFold inference code into this repo. Both OpenFold and Uni-Fold
allow for the conversion of the AlphaFold JAX weights into PyTorch. The Uni-Fold implementation
is designed to work with MMSeqs2 and has support for multimers, which is why we adopted it.
Eventually, we may decide to subsume the OpenFold models under Uni-Fold.
Links to papers can be found in the GitHub repo for each model.
A real docs page is a work in progress, but to get started the provided notebooks should be enough.
In addition to minimal usage notebooks for each implemented model, there are also more general notebooks
that cover common use cases and some features of the prtm API. A good order to try is:
For more complex design algorithms like RFDiffusion and ProteinGenerator, there are detailed
example notebooks to look at:
The currently implemented models only scratch the surface of what's available. There's a rough model-tracking Google Sheet for papers and code repos that are being considered for implementation. If you'd like to contribute or suggest priorities, please open an issue or PR and we can discuss!
There is, of course, also a lot of technical debt to pay off, accumulated from duct-taping together code from many different sources. Docstrings, API improvements, bug fixes, and better tests are very welcome!
This project is an achievement of copy-paste engineering 😉. It would not have been possible without the hard work of the authors of the models that are implemented here. Please cite their work if you use their model!