This directory contains the code and resources of the following paper:
"The Human-In-the-Loop Drug Design Framework with Equivariant Rectified Flow". Under review.
- HIL-DD is a new Human-In-the-Loop Drug Design framework that enables human experts and AI to co-design molecules in 3D space, conditioned on a protein pocket.
- The backbone model is a surprisingly simple generative model called rectified flow (RF), which is based on ordinary differential equations (ODEs) [1-2]. By combining RF with equivariant graph neural networks (EGNNs) [3], we obtain an equivariant rectified flow model (ERFM).
- Our HIL-DD framework is built upon ERFM. It takes molecules generated by ERFM conditioned on a protein pocket as input and incorporates human experts' preferences to generate new molecules that reflect those preferences.
- Our experimental results are based on the CrossDocked dataset [4], which is available here.
- If you have any issues using this software, please do not hesitate to contact Youming Zhao ([email protected]). We will try our best to assist you. Any feedback is highly appreciated.
We introduce HIL-DD to bridge the gap between human experts and AI. Please check out this 1.5-minute video of a real-time interaction, which showcases how human experts inject their preference (better Vina scores in this case) into HIL-DD and how quickly HIL-DD learns that preference.
Step 1. Construct an equivariant rectified flow model (ERFM) and train it on the CrossDocked dataset
In this step, we combine EGNNs and RF to create the ERFM. The ERFM is then trained on the CrossDocked dataset using protein pockets as a condition.
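To give a concrete feel for what this step amounts to, below is a minimal, schematic PyTorch sketch of one rectified-flow training step: interpolate linearly between a noise sample and a data sample at a random time t, then regress the predicted velocity onto the straight-line direction between the two. The small velocity_net here is a toy stand-in for the actual EGNN-based, pocket-conditioned architecture of ERFM and is purely illustrative.

```python
import torch
import torch.nn as nn

# Toy stand-in for the EGNN-based velocity field used in ERFM.
# The real model is equivariant and conditioned on the protein pocket.
velocity_net = nn.Sequential(nn.Linear(3 + 1, 128), nn.SiLU(), nn.Linear(128, 3))

def rectified_flow_loss(x1):
    """One rectified-flow training step on a batch of 3D points x1 (data)."""
    x0 = torch.randn_like(x1)                # noise endpoint
    t = torch.rand(x1.shape[0], 1)           # random time in [0, 1]
    xt = t * x1 + (1.0 - t) * x0             # straight-line interpolation
    target_v = x1 - x0                       # constant velocity along the line
    pred_v = velocity_net(torch.cat([xt, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean() # simple MSE regression

# Example: one optimization step on random "atom coordinates".
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-4)
opt.zero_grad()
loss = rectified_flow_loss(torch.randn(32, 3))
loss.backward()
opt.step()
```

The straight-line interpolation is what makes the ODE trajectories easy to simulate at sampling time; the actual training code lives in train_ERFM.py.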
Step 2. Sample molecules with the well-trained ERFM
We utilize a well-trained ERFM to generate molecules conditioned on a protein pocket of interest.
Step 3. Propose promising molecules as positive samples and unpromising molecules as negative samples
Given the generated samples and a specific preference, say binding affinity, we select molecules with high binding affinity (measured by the Vina score in our work) as promising samples and molecules with low binding affinity as unpromising samples.
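As a deliberately simplified illustration of this selection step (not the actual logic in select_proposals.py), the snippet below partitions generated molecules by their Vina scores using the default thresholds of -7 and -9 mentioned later in this README; the dictionary format of the candidates is assumed for illustration only.

```python
# Illustrative split of generated molecules into promising (positive) and
# unpromising (negative) samples by Vina score. Lower (more negative)
# Vina scores indicate stronger predicted binding.
GOOD_VINA = -9.0  # default thresholds mentioned in this README
BAD_VINA = -7.0

def split_by_vina(candidates):
    """candidates: list of dicts like {'name': str, 'vina': float} (illustrative)."""
    positives = [m for m in candidates if m['vina'] <= GOOD_VINA]
    negatives = [m for m in candidates if m['vina'] >= BAD_VINA]
    return positives, negatives

# Example usage with made-up scores:
mols = [{'name': 'mol_a', 'vina': -9.8}, {'name': 'mol_b', 'vina': -6.2}]
pos, neg = split_by_vina(mols)
print(len(pos), 'positive,', len(neg), 'negative')
```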
Step 4. Learn human preferences by finetuning ERFM
With the human annotations obtained from the previous step, we finetune the well-trained ERFM using the HIL-DD algorithm.
For more detailed information, please refer to our paper.
- [configs] contains configuration files for training, sampling, and finetuning.
- [statistics] includes statistics about the training data, such as the bond-length, bond-angle, and dihedral-angle distributions.
- [datasets] contains files for preprocessing data.
- [models] contains the architectures of ERFM and the classifier.
- [toy-experiment] contains code for conducting a toy experiment to validate our algorithm.
- [utils] contains various helper functions.
Before you check out our ERFM and HIL-DD, you can try to run the toy experiment. You will see the beauty of preference learning in a few minutes.
HIL-DD requires a standard computer with at least one graphics processing unit (GPU). The code has been tested on NVIDIA GeForce RTX 2060 and RTX 4090 GPUs. If you want to run our web-based graphical user interface (GUI), additional GPUs improve real-time performance. We usually run the web-based GUI with 3 GPUs, which are used to propose samples, learn human preferences, and evaluate learning performance, respectively.
This package is supported for Windows and Linux. The package has been tested on the following systems:
- Windows 11 Pro
- Linux: Ubuntu 20.04.3
We recommend using Anaconda to create an environment for installing all dependencies. If you have Anaconda installed, please run the following command to install all packages. Normally, this can be done within a few minutes:
conda create --name HIL-DD --file configs/spec-file.txt
The main dependencies are as follows:
- Python=3.9+
- PyTorch==1.12.1+
- PyTorch Geometric==2.1.0
- NumPy==1.23.3
- OpenBabel==3.1.1
- RDKit==2022.03.5
- QVina==2.1.0
- SciPy==1.9.1
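After creating the environment, a quick sanity check along the following lines can confirm that the core Python dependencies import correctly and that PyTorch can see your GPU(s); OpenBabel and QVina are external tools and are easier to check from the command line.

```python
# Quick sanity check for the main Python dependencies.
import numpy, scipy, torch, torch_geometric, rdkit

print('NumPy            ', numpy.__version__)
print('SciPy            ', scipy.__version__)
print('PyTorch          ', torch.__version__)
print('PyTorch Geometric', torch_geometric.__version__)
print('RDKit            ', rdkit.__version__)
print('CUDA available:', torch.cuda.is_available(),
      '| GPUs visible:', torch.cuda.device_count())
```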
We trained and tested ERFM and HIL-DD using the same datasets as SBDD, Pocket2Mol, and TargetDiff.
If you only want to sample molecules for the pockets of the CrossDocked test set, we have stored those pockets in configs/test_data_list.pt, so you can skip the following steps.
- Download the dataset archive crossdocked_pocket10.tar.gz and the split file split_by_name.pt from this link and place them under ../data/. The folder data is supposed to be parallel with the folder HIL-DD.
- Extract the TAR archive using the command: tar -xzvf crossdocked_pocket10.tar.gz
- Download test_protein.zip from here and unzip it under ./configs.
- Preprocess the data by running the first command in the Code Usage section.
Please note that it may take approximately 2 hours to preprocess the data when training ERFM or HIL-DD for the first time. This step is required for training and preference learning.
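If you would like to peek at the test pockets bundled in configs/test_data_list.pt (mentioned above), a generic inspection like the one below should work; the exact structure of the stored objects is not documented here, so the inspection stays intentionally generic.

```python
# Inspect the bundled CrossDocked test pockets.
# Run this from the repository root so that any custom classes stored in the
# file can be unpickled.
import torch

test_data = torch.load('configs/test_data_list.pt')
print('loaded object type:', type(test_data))
try:
    print('number of entries: ', len(test_data))
    print('first entry type:  ', type(test_data[0]))
except TypeError:
    pass  # the stored object may not be a sequence
```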
We provide interactive software to run HIL-DD. Before running it, please make sure that the data preprocessing is done. Then run the backend code first, followed by the frontend code.
The backend code is available in this repository, in the backend folder. To run it, activate the conda environment installed in the Software Requirements section, and then run the following command inside the backend folder:
python manage.py runserver
The frontend code was written by a co-author and is available here. It takes about 5 minutes to install and compile.
To calculate Vina docking scores, you need to download the full protein pocket files from here and place them in the configs folder. Then, unzip the files.
If all your experiments are based on the CrossDocked dataset, please skip the following two steps.
If you want to compute the binding affinity for the generated molecules conditioned on your own pocket, it is recommended to create a separate environment to install MGLTools. This is because MGLTools and OpenBabel may not be compatible.
- Put the untailored PDB file under the examples/ folder and run the following command:
python utils/prepare_receptor4.py -r examples/xxxx_full.pdb -o examples/xxxx_full.pdbqt
- Put the tailored PDB file under examples/.
To prepare proposals for HIL-DD, please follow the steps below:
- Choose a protein pocket of interest, either from the test set or from another dataset. If the protein pocket of interest is a member of the CrossDocked test set, refer to this .csv file for the corresponding PDB ID.
- To sample molecules from the chosen protein pocket, use the following command:
python sampling.py --device cuda --config configs/sampling.yml --pocket_id 4 --num_samples 1000
Make sure to replace the --pocket_id value with the index of the desired pocket. Run this command 13 times to generate 13 result files; these result files will be used to select good and bad samples. Note that if you don't mind the samples overlapping among the 12 preference injections, you can run the command only 3 times. (A helper script for automating these runs is sketched at the end of this section.)
- Move all the result files from logs_sampling/datetime/sample-results/datetime.pt to a new folder named tmp/samples_pocket4.
- Calculate the metrics for the samples using the following command:
python cal_metric_one_pocket.py tmp/samples_pocket4
- Select the good and bad molecules using the command:
python select_proposals.py tmp/samples_pocket4 tmp/samples_pocket4_proposals
In the select_proposals.py file, you can specify the lower and upper thresholds for various preferences such as Vina score, bond angle, bond length, benzene ring, large ring, and dihedral angle deviation. By default, the thresholds for the Vina score are -7 and -9. For more details, please refer to the last lines of the select_proposals.py file.
The minimum numbers of positive and negative samples are determined by config.pref.num_positive_samples x config.pref.proposal_factor and config.pref.num_negative_samples x config.pref.proposal_factor, respectively.
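If you prefer to automate the repeated sampling runs and the file collection described in the steps above, a small helper script along these lines may save some typing; the glob pattern for the log directories is inferred from the path logs_sampling/datetime/sample-results/datetime.pt shown above and may need adjusting for your setup.

```python
# Hypothetical helper: run sampling.py several times and gather the results
# into tmp/samples_pocket4, mirroring the manual steps described above.
import glob
import shutil
import subprocess
from pathlib import Path

POCKET_ID = 4
NUM_RUNS = 13  # or 3 if overlapping samples across preference injections are acceptable

for _ in range(NUM_RUNS):
    subprocess.run(
        ['python', 'sampling.py', '--device', 'cuda',
         '--config', 'configs/sampling.yml',
         '--pocket_id', str(POCKET_ID), '--num_samples', '1000'],
        check=True,
    )

dest = Path(f'tmp/samples_pocket{POCKET_ID}')
dest.mkdir(parents=True, exist_ok=True)
# Result files are written to logs_sampling/<datetime>/sample-results/<datetime>.pt
for result in glob.glob('logs_sampling/*/sample-results/*.pt'):
    shutil.move(result, str(dest / Path(result).name))
```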
To train ERFM, use the following command:
python train_ERFM.py --device cuda --config configs/config_ERFM.yml
To sample with a pretrained ERFM for all 100 pockets in the CrossDocked test set, run the following command:
python sampling.py --device cuda --config configs/sampling.yml
To sample with a protein pocket that is not in the CrossDocked test set, make sure to place your PDB file under the examples/ directory. Then, execute the following command:
python sampling4pocket.py --device cuda --config configs/sampling.yml --pdb_path examples/2V3R.pdb
If you need to calculate the binding affinity, ensure that you have the complete protein pocket file in the examples/ directory. Then, run the command as shown below:
python sampling4pocket.py --device cuda --config configs/sampling.yml --pdb_path examples/2V3R.pdb --receptor_path examples/2V3R_full.pdbqt
To finetune a pretrained ERFM, use the following command:
python HIL_DD_pref.py --device cuda --config configs/config_pref.yml
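Before finetuning, it can be useful to double-check the preference-learning settings. Assuming configs/config_pref.yml is standard YAML, that it contains a pref section with the fields mentioned in the proposals section (num_positive_samples, num_negative_samples, proposal_factor), and that PyYAML is available, a quick inspection could look like this:

```python
# Inspect preference-learning settings (field names assumed from this README).
import yaml

with open('configs/config_pref.yml') as f:
    cfg = yaml.safe_load(f)

pref = cfg.get('pref', {})
n_pos = pref.get('num_positive_samples')
n_neg = pref.get('num_negative_samples')
factor = pref.get('proposal_factor')
print('positive samples:', n_pos, '| negative samples:', n_neg,
      '| proposal factor:', factor)
if n_pos is not None and factor is not None:
    print('minimum positive proposals needed:', n_pos * factor)
```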
HIL-DD is licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
[1]. Liu, Xingchao, et al. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." ICLR (2023).
[2]. Liu, Qiang. "Rectified Flow: A Marginal Preserving Approach to Optimal Transport." arXiv preprint arXiv:2209.14577 (2022).
[3]. Satorras, Victor Garcia, et al. "E(n) Equivariant Graph Neural Networks." ICML (2021).
[4]. Francoeur, Paul G., et al. "Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design." Journal of Chemical Information and Modeling (2020).