Accelerating Inference in Molecular Diffusion Models with Latent Representations of Protein Structure
This is the official code release for the paper Accelerating Inference in Molecular Diffusion Models with Latent Representations of Protein Structure. This work was accepted as a spotlight paper at the NeurIPS 2023 Generative AI and Biology Workshop.
All code in this repository was run on python 3.10. In my experience, installing with python 3.11 caused dependency conflicts. I recommend using conda or mamba to create a python 3.10 environment. You should install the following dependencies in this order. You can copy and paste this code block into a shell script for easy use.
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
mamba install pytorch-cluster pytorch-scatter -c pyg -y
mamba install -c dglteam/label/cu118 dgl -y
mamba install -c conda-forge rdkit -y
mamba install -c conda-forge openbabel -y
mamba install -c conda-forge pandas -y
mamba install -c conda-forge biopython -y
Specifications of the model and the data that the model is trained on are all packaged into one config file. The config files are just yaml files. Once you setup your config file, you pass it as input to the data processing scripts in addition to the training scripts. An example config file is provided at configs/dev_config.yml
. This example config file also has some helpful comments in it describing what the different parameters mean.
Actual config files used to train models presented in the paper are available in the trained_models/
directory.
Note, you don't have to reprocess the dataset for every model you train, as long as the models you are training contain the same parameters under the dataset
section of the config file.
All of the steps of training, sampling, and evaluation are run through various scripts in this repo. In this readme, I describe in words the inputs that are provided to each script. Each of these scripts implements command line arguments via the argparse library. You can always run python <script_name>.py --help
to see a list of command line arguments that the script accepts. You can also just open the script and inspect the parse_args()
function to see what command line arguments are accepted.
For the paper all models are trained on the bindingmoad dataset. We use the same splits of bindingmoad as in DiffSBDD. Clone the DiffSBDD repo, and follow the directions for downloadng the bindingmoad dataset into the data/
folder of the DiffSBDD Repo. You don't need to run their scripts for processing the data, you just need to download it.
You will notice that there is some code in this repository for processing/working with the crossdocked dataset. The results in the paper only focus on models trained on the bindingmoad dataset. Future releases of this repository / versions of this paper may include results on the crossdocked dataset. However, for now, the crossdocked dataset is not used for training or evaluation, and the processing scripts are not guaranteed to work.
There are two scripts in this repository for processing the crossdocked and bindingmoad datasets so they can be ingested by our dataset class at training/evaluation time. These are process_crossdocked.py
and process_bindingmoad.py
. Unfortunately, process_crossdocked.py
might be broken at the moment. The required inputs for process_bindingmoad.py
are a config file and a valid filepath to the data/
directory of the DiffSBDD repo.
example command for processing bindingmoad dataset:
python process_bindingmoad.py --config_file=configs/dev_config.yml --data_dir=/home/ian/projects/mol_diffusion/DiffSBDD/data/
Running either process_crossdocked.py
or process_bindingmoad.py
will write a processed version of the dataset to whatever directory was specified by the --output_dir
command-line argument.
Models are trained using the train.py
script. You pretty much only have to provide a config file to train.py
using the --config
option. You can also override certain parameters in the config file via command-line arguments to the training script. However, this functionality was only implemented for weights and biases hyperparameters sweeps and as such, not all model hyperparameters defined in the config files are exposed via command-line arguments.
Within a config file, the parameter results_dir
in the experiment
section specifies where the results of training will be written.
Each time train.py
is run, a new directory will be created within results_dir
with a timestamped name. Within this directory will be the model's config file, pickled versions of the model, and a pickle file containing logs of the training and validation metrics.
This repository contains the all the trained models used to produce the results in the paper. These models are located in the trained_models/
directory which contains a subdirectory for each model.
The script byop.py
(bring your own protein) accepts as input: the location of a trained model, a .pdb
or .mmcif
file defining a protein structure, and a .sdf
file specifying a "reference ligand" which is located inside the protein binding pocket. This reference ligand need not actually be a ligand. The script will simply use the coordinates of the reference ligand to define the binding pocket and it will start sampling molecules by placing atoms at the center of mass of the reference ligand.
byop.py
will do a force-field minimization of generated ligands inside the binding pocket if you specify the --pocket_minimization
flag.
Example command for byop sampling:
python byop.py receptor.pdb reference_ligand.sdf --model_dir=trained_models/gvp_40kp
Use the script sample.py
with the --visualize
flag. Ideally, the visualize feature should be incorporated into test.py
because these scripts are doing almost the exact same thing, iirc. Note that this only allows you to visualize the sampling process for proteins in the original dataset. I haven't yet implemented sampling visualization with byop.py
.
The script test.py
will sample ligands for ligand-receptor complexes contained in the test/validation sets. You just need to point it to an output directory produced by train.py
with the --model_dir
flag. The test script will then load the last-saved model state. You can also instead specify a specific model checkpoint to use via the --model_file
flag.
Note that test.py
can only sample complexes from the process dataset described in the config file. If you want to sample ligands for a user-provided protein, see the byop section.
test.py
will write the sampled molecules to the directory specified by the --output_dir
flag. This directory will contain a subdirectory sampled_mols
which itself contains a directory for each pocket that is sampled. The structure of these subdirectories is very specific and this particular structure is required for the downstream evaluation scripts to work.
There are two options for doing ff-minimization of generated molecules inside the binding pocket:
- You can do it while sampling molecules by passing the
--pocket_minimization
flag totest.py
. This will causetest.py
to write the minimized molecules into the same directory as the raw sampled molecules. - You can do it after sampling molecules by running the script
analysis/pocket_minimization.py
. This script accepts as input a directory of sampled molecules produced bytest.py
.
To elbaorate on the second option here: the second method is advantageous if you are minimizing many molecules across many pockets because you can parallelize the minimization across a cluster. The force-field minimzation of a single binding pocket is performed by the script analysis/pocket_minimization.py
. You can run the script gen_pocket_min_cmds.py
and pass it a sampled_mols
directory produced by test.py
, and then it will generate a shell script where each line of the shell script calls pocket_minimization.py
on a different pocket. You can then run this shell script to run the force-field minimization on all of the sampled pockets. You could also use this shell script to construct an array job to parallelize minimization across a cluster. For a given pocket directory produced by test.py
, the minimization script analysis/pocket_minimization.py
will write the minimized molecules into that same directory as well as a csv containing the RMSDs of molecules before and after force-field minimization.
We actually use gnina to do molecule scoring. gnina returns Autodock Vina scores on molecules in addition to scores from its own scoring function. So to use our scoring pipeline, you will need to install gnina.
The script gen_docking_cmds.py
accepts as input a directory of sampled molecules produced by test.py
and produces as output a text file containing list of gnina commands which then can be run as a shell script or array jobs across a cluster. By default, the commands generated will completely redock the reference ligand. This is not what we do for the paper. For the paper, we use the --minimize
flag to generate commands that will simply minimize the ligand pose with respect to the Vina scoring function.
We consider complete re-docking to be an unreliable/inaccurate indicater of generative model performance, as it can hide/obscure failures of the model to respect the geometry of the binding pocket and protein-ligand interactions. This idea is supported by the results of the PoseChecker paper.
Additional metrics of ligand quality (validity, QED, SA, per-pocket diversity) can be computed using the compute_metrics.py
script. This script accepts as input a directory of sampled molecules produced by test.py
and will write a pickle file into the parent directory of the sampled_mols
directory. This pickle file contains a dictionary with all of these metrics.