This is the supporting code for: Grisoni F., Moret M., Lingwood R., Schneider G., "Bidirectional Molecule Generation with Recurrent Neural Networks", Journal of Chemical Information and Modeling (2020), available at https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943.
You can use this repository for the generation of SMILES with bidirectional recurrent neural networks (RNNs). In addition to the methods' code, several pre-trained models for each approach are included.
The following methods are implemented:
- Bidirectional Molecule Design by Alternate Learning (BIMODAL), designed for SMILES generation – see Grisoni et al. 2020.
- Synchronous Forward Backward RNN (FB-RNN), based on Mou et al. 2016.
- Neural Autoregressive Distribution Estimator (NADE), re-adapted for SMILES generation from Berglund et al. 2015.
- Forward RNN, i.e., unidirectional RNN for SMILES generation.
This repository can be cloned with the following command:
git clone https://github.com/ETHmodlab/BIMODAL
To install the packages needed to run the code, we recommend using conda. Once conda is installed, you can create the virtual environment:
cd path/to/repository/
conda env create -f brnn.yml
To activate the dedicated environment:
conda activate brnn
Your code should now be ready to use!
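To quickly check that the environment works, you can try importing the main dependencies from a Python shell (a minimal sanity check; it assumes the brnn environment provides PyTorch, RDKit, NumPy and pandas, which the code relies on):

# Quick sanity check for the 'brnn' environment (assumed dependencies)
import numpy as np
import pandas as pd
import rdkit
import torch

print('PyTorch version:', torch.__version__)
print('RDKit version:', rdkit.__version__)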
In this repository, we provide you with 22 pre-trained models you can use for sampling (stored in evaluation/). These models were trained for 10 epochs on a set of 271,914 bioactive molecules from ChEMBL22 (Kd/Ki/IC50/EC50 < 1 μM).
To sample SMILES, you can create a new file in model/ and use the Sampler class. For example, to sample from the pre-trained BIMODAL model with 512 units:
from sample import Sampler
experiment_name = 'BIMODAL_fixed_512'
s = Sampler(experiment_name)
s.sample(N=100, stor_dir='../evaluation', T=0.7, fold=[1], epoch=[9], valid=True, novel=True, unique=True, write_csv=True)
Parameters:
- experiment_name (str): name of the experiment containing the pre-trained model you want to sample from (pre-trained models can be found in evaluation/)
- stor_dir (str): directory where the models are stored. The sampled SMILES will also be saved there (if write_csv=True)
- N (int): number of SMILES to sample
- T (float): sampling temperature
- fold (list of int): fold(s) of the trained model to use for sampling
- epoch (list of int): epoch(s) to use for sampling
- valid (bool): if set to True, only valid SMILES are accepted (increases the sampling time)
- novel (bool): if set to True, only SMILES that are novel with respect to the training set are accepted (increases the sampling time)
- unique (bool): if set to True, only unique SMILES are accepted (increases the sampling time)
- write_csv (bool): if set to True, a .csv file with the generated SMILES is exported to the specified directory.
Notes:
- For the provided pre-trained models, only fold=[1] and epoch=[9] are available.
- The list of available models and their description are provided in evaluation/model_names.md
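If you want to compare different sampling temperatures, you can call the sampler repeatedly. Below is a minimal sketch that reuses the Sampler API shown above (the temperature values and N are arbitrary examples):

from sample import Sampler

# Sample from the pre-trained BIMODAL model at several temperatures
# (for the provided models, only fold=[1] and epoch=[9] are available)
s = Sampler('BIMODAL_fixed_512')
for T in (0.5, 0.7, 1.0):
    s.sample(N=50, stor_dir='../evaluation', T=T, fold=[1], epoch=[9],
             valid=True, novel=True, unique=True, write_csv=True)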
Alternatively, if you want to pre-train a model on your own data, you will need to execute three steps: (i) data pre-processing, (ii) training, and (iii) evaluation. Please be aware that you will need access to a GPU to pre-train your own model, as this is a computationally intensive step.
Data can be processed by using preprocessing/main_preprocessor.py:
from main_preprocessor import preprocess_data
preprocess_data(filename_in='../data/chembl_smiles', model_type='BIMODAL', starting_point='fixed', augmentation=1)
Parameters:
- filename_in (str): name of the file containing the SMILES strings (.csv or .tar.xz)
- model_type (str): name of the chosen generative method
- starting_point (str): starting point type ('fixed' or 'random')
- augmentation (int): number of augmentation folds (default: 1)
Notes:
- In preprocessing/main_preprocessor.py you will find information on the advanced pre-processing options (e.g., stereochemistry handling, canonicalization).
- Please note that the pre-processed data have to be stored in data/ (see the example sketch below).
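As a concrete example, to prepare a small custom dataset for BIMODAL you could proceed as follows (a sketch only: it assumes a headerless .csv with one SMILES string per line, placed in data/, and that preprocess_data appends the file extension as in the example above; see preprocessing/main_preprocessor.py for the exact expected format):

import pandas as pd
from main_preprocessor import preprocess_data

# Toy dataset: a few SMILES strings written to data/ (assumed format: one SMILES per line, no header)
smiles = ['CCO', 'c1ccccc1', 'CC(=O)Oc1ccccc1C(=O)O']
pd.DataFrame(smiles).to_csv('../data/my_smiles.csv', header=False, index=False)

# Pre-process for BIMODAL with a fixed starting point and no augmentation
preprocess_data(filename_in='../data/my_smiles', model_type='BIMODAL',
                starting_point='fixed', augmentation=1)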
Training requires a parameter file (.ini) with a given set of parameters. You can find examples for all models in experiments/, and further details about the parameters below:
| Section | Parameter | Description | Comments |
| --- | --- | --- | --- |
| Model | model | Model type | ForwardRNN, FBRNN, BIMODAL, NADE |
| | hidden_units | Number of hidden units | Suggested value: 256 for ForwardRNN, FBRNN and NADE; 128 for BIMODAL |
| | generation | To be defined only for NADE (for the other models it is set during preprocessing) | fixed, random |
| Data | data | Name of the data file | Has to be located in data/ |
| | encoding_size | Number of different SMILES tokens | 55 |
| | molecular_size | Length of the string with padding | See preprocessing |
| | missing_token | To be added to the parameter file only for NADE | M |
| Training | epochs | Number of epochs | Suggested value: 10 |
| | learning_rate | Learning rate | Suggested value: 0.001 |
| | n_folds | Number of folds for cross-validation | More than 1 for cross-validation; 1 to use only one fold of the data for validation |
| | batch_size | Batch size | Suggested value: 128 |
| Evaluation | samples | Number of SMILES generated after each epoch | |
| | temp | Sampling temperature | Suggested value: 0.7 |
| | starting_token | Starting token for sampling | G for all models except NADE, which requires a sequence consisting of missing values (see publication) |
Note:
- Be aware that values such as the number of tokens or the missing token for NADE have to be defined as in the example above. We kept them as parameters so that you can easily change them if you wish to use this code for other applications (an example parameter file sketch is shown below).
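For illustration, a training parameter file for BIMODAL could look roughly as follows (a sketch assembled from the table above; the section and key names, the data file name, and the molecular_size value are placeholders that should be checked against the template .ini files in experiments/):

[Model]
model = BIMODAL
hidden_units = 128

[Data]
; data file located in data/ (placeholder name)
data = chembl_smiles_BIMODAL_fixed
encoding_size = 55
; placeholder: use the padded length reported by the preprocessing step
molecular_size = 152

[Training]
epochs = 10
learning_rate = 0.001
; 1 = single fold for validation, >1 = cross-validation
n_folds = 5
batch_size = 128

[Evaluation]
samples = 100
temp = 0.7
starting_token = G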
Options for training:
- Cross-validation:
from trainer import Trainer
t = Trainer(experiment_name = 'BIMODAL_fixed_512')
t.cross_validation(stor_dir = '../evaluation/', restart = False)
- Single run (1/n_folds of the data is used for validation):
from trainer import Trainer
t = Trainer(experiment_name = 'BIMODAL_fixed_512')
t.single_run(stor_dir = '../evaluation/', restart = False)
Parameters:
- experiment_name: name of the parameter file (.ini)
- stor_dir: directory where the outputs are stored
- restart: if True, training is automatically restarted from the saved models (e.g., to be used if your training was interrupted before completion)
You can evaluate the outputs of your experiment with evaluation/main_evaluator.py, with the following possibilities:
from evaluation import Evaluator
stor_dir = '../evaluation/'
e = Evaluator(experiment_name = 'BIMODAL_fixed_512')
# Plot training and validation loss within one figure
e.eval_training_validation(stor_dir=stor_dir)
# Plot percentage of novel, valid and unique SMILES
e.eval_molecule(stor_dir=stor_dir)
Parameters:
- experiment_name: name of the parameter file (.ini)
- stor_dir: Directory where outputs can be found
Note:
- with stor_dir = '../evaluation/', the loss plot can be found in '../evaluation/{experiment_name}/statistic/all_statistic.png'
- the plot of the novel, valid and unique SMILES can be found in '../evaluation/{experiment_name}/molecules/novel_valid_unique_molecules.png'
Fine-tuning requires a pre-trained model and a parameter file (.ini). Examples of the parameter files (BIMODAL and ForwardRNN) are provided in experiments/.
You can start the fine-tuning procedure with model/main_fine_tuner.py.
| Section | Parameter | Description | Comments |
| --- | --- | --- | --- |
| Model | model | Model type | ForwardRNN, FBRNN, BIMODAL, NADE |
| | hidden_units | Number of hidden units | Suggested value: 256 for ForwardRNN, FBRNN and NADE; 128 for BIMODAL |
| | generation | To be defined only for NADE (for the other models it is set during preprocessing) | fixed, random |
| Data | data | Name of the data file | Has to be located in data/ |
| | encoding_size | Number of different SMILES tokens | 55 |
| | molecular_size | Length of the string with padding | See preprocessing |
| | missing_token | To be added to the parameter file only for NADE | M |
| Training | epochs | Number of epochs | Suggested value: 10 |
| | learning_rate | Learning rate | Suggested value: 0.001 |
| | batch_size | Batch size | Suggested value: 128 |
| Evaluation | samples | Number of SMILES generated after each epoch | |
| | temp | Sampling temperature | Suggested value: 0.7 |
| | starting_token | Starting token for sampling | G for all models except NADE, which requires a sequence consisting of missing values (see publication) |
| Fine-Tuning | start_model | Name of the pre-trained model to be used for fine-tuning | |
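Compared with a training parameter file, the main addition is a fine-tuning entry that points to the pre-trained model, roughly along these lines (an illustrative sketch; the exact section name, key and model name should be taken from the fine-tuning templates in experiments/):

[FineTuning]
; pre-trained model to start from (placeholder name)
start_model = BIMODAL_fixed_512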
To fine-tune a model, you can run:
from fine_tuner import FineTuner

t = FineTuner(experiment_name = 'BIMODAL_random_512_FineTuning_template')
t.fine_tuning(stor_dir='../evaluation/', restart=False)
Parameters:
- experiment_name: name of the parameter file (.ini)
- stor_dir: directory where the outputs are stored
- restart: if True, training is automatically restarted from the saved models (e.g., to be used if your training was interrupted before completion)
Note:
- The batch size should not exceed the number of SMILES that you have in your fine-tuning file (taking into account the data augmentation).
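Putting the pieces together, a typical fine-tuning workflow could look like the sketch below, which combines the calls shown earlier (file and experiment names are placeholders; note that the preprocessing script lives in preprocessing/ and the fine-tuner/sampler scripts in model/, so in practice each step is run from its respective directory):

# 1. Pre-process the fine-tuning set (run from preprocessing/; placeholder file name, stored in data/)
from main_preprocessor import preprocess_data
preprocess_data(filename_in='../data/my_finetuning_set', model_type='BIMODAL',
                starting_point='random', augmentation=5)

# 2. Fine-tune a pre-trained model using the corresponding .ini file from experiments/ (run from model/)
from fine_tuner import FineTuner
t = FineTuner(experiment_name='BIMODAL_random_512_FineTuning_template')
t.fine_tuning(stor_dir='../evaluation/', restart=False)

# 3. Sample from the fine-tuned model (run from model/)
from sample import Sampler
s = Sampler('BIMODAL_random_512_FineTuning_template')
s.sample(N=100, stor_dir='../evaluation', T=0.7, fold=[1], epoch=[9],
         valid=True, novel=True, unique=True, write_csv=True)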
- Robin Lingwood (https://github.com/robinlingwood)
- Francesca Grisoni (https://github.com/grisoniFr)
- Michael Moret (https://github.com/michael1788)
See also the list of contributors who participated in this project.
This code is licensed under a Creative Commons Attribution 4.0 International License.
If you use this code (or parts thereof), please cite it as:
@article{grisoni2020,
title={Bidirectional Molecule Generation with Recurrent Neural Networks},
author={Grisoni, Francesca and Moret, Michael and Lingwood, Robin and Schneider, Gisbert},
journal={Journal of Chemical Information and Modeling},
volume={60},
number={3},
pages={1175--1183},
year={2020},
doi = {10.1021/acs.jcim.9b00943},
url = {https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943},
publisher={ACS Publications}
}
A preprint (not peer-reviewed) version of the original manuscript is available as a PDF in this repository (preprint folder). This document is the unedited version of a Submitted Work that was subsequently accepted for publication in the Journal of Chemical Information and Modeling, copyright © American Chemical Society, after peer review. To access the final edited and published work, see https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943.