BARTSmiles is a chemical language model based on BART, trained on 1.7 billion SMILES strings from the ZINC20 dataset.
BARTSmiles can be fine-tuned for chemical property prediction and for generative tasks such as chemical reaction prediction and retrosynthesis, and it achieves multiple state-of-the-art results.
You can use the Hugging Face version of the model from here.
Clone the BARTSmiles repo in the root directory:
git clone https://github.com/YerevaNN/BARTSmiles.git
Set up a conda environment:
conda env create --file=./BARTSmiles/environment.yml
conda activate bartsmiles
Clone and install Fairseq in the root directory:
cd ./
git clone https://github.com/facebookresearch/fairseq.git
cd ./fairseq
pip install --editable ./
You need to add add_if_not_exist=False to this line:
tokens = self.task.source_dictionary.encode_line(bpe_sentence, append_eos=False, add_if_not_exist=False)
in this file:
./fairseq/fairseq/models/bart/hub_interface.py
NOTE! If you don't add this, fairseq will add a new vocabulary entry for every unknown token instead of mapping it to <unk>.
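To see the effect, here is a minimal standalone check (not part of the repo; the toy vocabulary below is hypothetical):

from fairseq.data import Dictionary

d = Dictionary()
d.add_symbol("C")  # toy vocabulary with a single chemistry token

# With add_if_not_exist=False the out-of-vocabulary symbol maps to <unk>
ids = d.encode_line("C Xyz", append_eos=False, add_if_not_exist=False)
print(d.string(ids), len(d))   # "C <unk>", vocabulary size unchanged

# With the default add_if_not_exist=True the dictionary silently grows
ids = d.encode_line("C Xyz", append_eos=False)
print(len(d))                  # one symbol larger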
Download the BARTSmiles pre-trained model and the vocabulary:
cd ..   # back to the root directory
mkdir -p ./chemical/tokenizer
cd ./chemical/tokenizer
wget http://public.storage.yerevann.com/BARTSmiles/chem.model
wget http://public.storage.yerevann.com/BARTSmiles/chem.vocab.fs
cd ../..   # back to the root directory
mkdir -p ./chemical/checkpoints/evaluation_data
cd ./chemical/checkpoints
wget http://public.storage.yerevann.com/BARTSmiles/pretrained.pt
cd ../..   # back to the root directory
mv ./BARTSmiles/data_name ./chemical/checkpoints/evaluation_data
cd ./BARTSmiles/
dict.txt is the vocabulary file without the special tokens. You need to provide the data_name directory structure (shown for esol further below).
from fairseq.models.bart import BARTModel

# model_dir must contain the processed data (including dict.txt) for your dataset
model_dir = "./chemical/checkpoints/evaluation_data/data_name/processed/input0"
bart = BARTModel.from_pretrained(
    model_dir,
    checkpoint_file="./chemical/checkpoints/pretrained.pt",
    bpe="sentencepiece",
    sentencepiece_model="./chemical/tokenizer/chem.model",
)
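Optionally, switch the model to evaluation mode (and move it to a GPU if available); these are standard PyTorch / fairseq hub-interface calls, not repo-specific helpers:

bart.eval()    # disable dropout for deterministic feature extraction
# bart.cuda()  # uncomment to run on a GPU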
Extract the last layer's features:
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # an example SMILES string (aspirin)
last_layer_features = bart.extract_features(bart.encode(smiles))
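To turn the per-token features into a single molecule-level embedding, one simple option is mean pooling over the sequence dimension (a sketch; the pooling choice is an assumption, not a repo utility):

# last_layer_features has shape (1, seq_len, hidden_dim)
embedding = last_layer_features.mean(dim=1).squeeze(0)   # average over all tokens
print(embedding.shape)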
Alternatively, you can use this script to extract features in batches:
python ./BARTSmiles/utils/extract_features.py --path [the path where your BARTSmiles folder is located] --dataset-name esol --batch-size 32 --output-path [where you want to store the outputs]
- Download and preprocess MoleculeNet datasets: Use the following command from the BARTSmiles folder:
python preprocess/process_datasets.py --dataset-name esol --is-MoleculeNet True --root [the path where your BARTSmiles folder is located]
This will create the following folders in the ./chemical/checkpoints/evaluation_data/esol directory:
esol
│
├───esol
│       train_esol.csv
│       valid_esol.csv
│       test_esol.csv
│
├───processed
│   ├───input0
│   │       dict.txt
│   │       preprocess.log
│   │       test.bin
│   │       train.bin
│   │       valid.bin
│   │       test.idx
│   │       valid.idx
│   │       train.idx
│   │
│   └───label
│           dict.txt
│           preprocess.log
│           test.bin
│           valid.bin
│           train.bin
│           test.idx
│           valid.idx
│           train.idx
│           test.label
│           valid.label
│           train.label
│
├───raw
│       test.input
│       test.target
│       valid.input
│       valid.target
│       train.input
│       train.target
│
└───tokenized
        test.input
        valid.input
        train.input
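As a quick sanity check (a small sketch, not part of the repo), you can verify that the raw SMILES inputs and target values are aligned:

root = "./chemical/checkpoints/evaluation_data/esol"
with open(f"{root}/raw/train.input") as f_in, open(f"{root}/raw/train.target") as f_tg:
    inputs, targets = f_in.readlines(), f_tg.readlines()
assert len(inputs) == len(targets)   # one target value per SMILES string
print(inputs[0].strip(), targets[0].strip())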
- Generate the grid of training hyperparameters by running the script ./BARTSmiles/fine-tuning/generate_grid_bartsmiles.py. This will write the grid search parameters to the ./BARTSmiles/fine-tuning/grid_search.csv file.
Command for the regression tasks:
python fine-tuning/generate_grid_bartsmiles.py --root [the path where your BARTSmiles folder is located] --dataset-name esol --single-task True --dataset-size 1128 --is-Regression True
Command for classification tasks with a single subtask:
python fine-tuning/generate_grid_bartsmiles.py --root [the path where your BARTSmiles folder is located] --dataset-name BBBP --single-task True --dataset-size 2039
Command for a specific subtask of a multilabel classification task:
python fine-tuning/generate_grid_bartsmiles.py --root [the path where your BARTSmiles folder is located] --dataset-name Tox21 --subtasks 12 --single-task False --dataset-size 7831
All required training parameters are now in grid_search.csv, and you can start training.
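Optionally, inspect the generated grid before launching the runs (a sketch; the exact column names depend on the script's output and are not guaranteed here):

import pandas as pd

grid = pd.read_csv("./BARTSmiles/fine-tuning/grid_search.csv")
print(grid.head())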
- Log in to wandb. Before starting the training you have to log in to wandb so that the training runs are tracked. To log in, follow: https://docs.wandb.ai/ref/cli/wandb-login
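If you prefer to log in from Python instead of the CLI, the standard wandb API also works (this is not a repo script):

import wandb

wandb.login()   # prompts for your API key if you are not already logged in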
- Train the models using the following command:
mkdir ./chemical/log
python fine-tuning/train_grid_bartsmiles.py --root [the path where your BARTSmiles folder is located] --disk [the path where you want to store your checkpoints] >> ./chemical/log/esol.log
This will produce checkpoints in a folder such as disk/clintox_1_bs_16_dropout_0.1_lr_5e-6_totalNum_739_warmup_118/.
- Write the wandb run URLs into the ./BARTSmiles/evaluation/wandb_url.csv file, for example:
url
gayanec/Fine_Tune_clintox_0/6p76cyzr
- Perform Stochastic Weight Averaging (SWA) and evaluate from ./BARTSmiles/evaluation using the following command:
python evaluation/evaluate_swa_bartsmiles.py --root [the path where your BARTSmiles folder is located] --disk [the path where your checkpoints are located] --dataset-type [dataset type: train, valid or test]
This will produce a log file with the output in ./chemical/log/ and the averaged checkpoints in the disk/clintox_1_bs_16_dropout_0.1_lr_5e-6_totalNum_739_warmup_118/ folder.
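For reference, SWA here amounts to averaging the weights of several saved checkpoints. A minimal sketch (illustrative only; evaluate_swa_bartsmiles.py handles this for you, and the checkpoint file names below are hypothetical):

import torch

paths = ["checkpoint3.pt", "checkpoint4.pt", "checkpoint5.pt"]
states = [torch.load(p, map_location="cpu")["model"] for p in paths]   # fairseq stores weights under the "model" key

# element-wise average of every parameter tensor
averaged = {k: sum(s[k].float() for s in states) / len(states) for k in states[0]}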
If you want to fine-tune on another dataset, you have to add its details to the datasets.json file and add your preprocessing code to the ./preprocess/process_datasets.py file at line 103. The key must not contain the '_' symbol unless the characters that follow it are numbers.
@article{chilingaryan2022bartsmiles,
  title={Bartsmiles: Generative masked language models for molecular representations},
  author={Chilingaryan, Gayane and Tamoyan, Hovhannes and Tevosyan, Ani and Babayan, Nelly and Khondkaryan, Lusine and Hambardzumyan, Karen and Navoyan, Zaven and Khachatrian, Hrant and Aghajanyan, Armen},
  journal={arXiv preprint arXiv:2211.16349},
  year={2022}
}