This repo contains the official implementation of the paper Generating π-Functional Molecules Using STGG+ with Active Learning. Our method combines STGG+ with active learning (STGG+AL).
Our method is based on STGG+, an improvement over the original STGG method by Sungsoo Ahn et al. (2022).
Please refer to the STGG+ repository for more details on the code hyperparameters.
Figure: Molecule from the Conjugated-xTB dataset versus molecule generated by STGG+AL
You need to create a (free) Neptune account and replace YOUR_API_KEY and YOUR_PROJECT_KEY in the Neptune initialization in train_condgenerator.py and train_generator.py.
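For reference, here is a minimal sketch of what such a Neptune initialization typically looks like; this assumes the code uses Lightning's `NeptuneLogger` (the exact wiring in train_condgenerator.py and train_generator.py may differ):

```python
# Minimal sketch (assumption): typical Neptune logger setup with Lightning.
# Replace the placeholders with your own Neptune credentials.
from lightning.pytorch.loggers import NeptuneLogger

neptune_logger = NeptuneLogger(
    api_key="YOUR_API_KEY",      # from your Neptune account settings
    project="YOUR_PROJECT_KEY",  # e.g. "my-workspace/my-project"
)
# The logger is then passed to the Lightning Trainer, e.g. Trainer(logger=neptune_logger).
```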
You need to change every mention of 'CHANGE_TO_YOUR_DIR' to a directory on your computer in src/train_condgenerator.py, src/train_generator.py, src/props/xtb/stda_xtb.py, src/utils/mol_to_coord.py, sTDA-xTB/comp_xtb4stascore.py, sTDA-xTB/mol_to_coord.py, src/experiments/xtb_finetune_active_learning_fosc_IR.sh, src/experiments/xtb_finetune_active_learning_fosc.sh, and src/experiments/xtb_pretrain.sh.
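If you want to double-check that no file was missed, a small helper like the one below (purely illustrative, not part of the repo) lists every .py and .sh file that still contains the placeholder:

```python
# Hypothetical helper (not part of the repo): list files that still contain the
# CHANGE_TO_YOUR_DIR placeholder so you know which paths remain to be edited.
import pathlib

for path in pathlib.Path(".").rglob("*"):
    if path.suffix in {".py", ".sh"} and "CHANGE_TO_YOUR_DIR" in path.read_text(errors="ignore"):
        print(path)
```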
You must install all the requirements below and build the vocabulary and valencies for each dataset.
## Make env from scratch (replace 'CHANGE_TO_YOUR_DIR' with your directory)
```bash
module load python/3.10
module load cuda/11.8
python -m venv CHANGE_TO_YOUR_DIR/stgg_active_learning
source CHANGE_TO_YOUR_DIR/stgg_active_learning/bin/activate
pip install --upgrade pip setuptools wheel
pip install --upgrade --pre torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install lightning neptune
pip install torch-geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.3.0+cu118.html
pip install cython molsets rdkit pomegranate==0.14.8 pyyaml scikit-learn pandas numpy networkx
pip install fcd_torch
git clone https://github.com/AlexiaJM/moses_fixed  # fixes two annoying bugs in MOSES, which is no longer maintained
cd moses_fixed
python setup.py install

# install xtb (based on https://hackmd.io/@o_wZpkUYSNeE_lvbb5NqGQ/ryrpH49M5)
cd CHANGE_TO_YOUR_DIR
wget https://github.com/grimme-lab/xtb/releases/download/v6.7.1/xtb-6.7.1-linux-x86_64.tar.xz
tar xvf xtb-6.7.1-linux-x86_64.tar.xz
export PATH=$PATH:CHANGE_TO_YOUR_DIR/xtb-dist/bin

rm -rf xtb4stda
git clone https://github.com/grimme-lab/xtb4stda.git
mkdir xtb4stda/exe
cd xtb4stda/exe
wget https://github.com/grimme-lab/stda/releases/download/v1.6.3/xtb4stda
wget https://github.com/grimme-lab/stda/releases/download/v1.6.3/stda_v1.6.3
chmod +x *
export XTB4STDAHOME=CHANGE_TO_YOUR_DIR/xtb4stda
export PATH=$PATH:$XTB4STDAHOME/exe
```
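After exporting these variables, a quick sanity check (a sketch, assuming xtb, xtb4stda, and stda_v1.6.3 were installed as above) can confirm that the executables are visible from Python:

```python
# Sanity-check sketch (assumption: xtb, xtb4stda, and stda_v1.6.3 were installed as above).
import os
import shutil

for exe in ["xtb", "xtb4stda", "stda_v1.6.3"]:
    print(exe, "->", shutil.which(exe) or "NOT FOUND (check your PATH exports)")
print("XTB4STDAHOME =", os.environ.get("XTB4STDAHOME", "NOT SET"))
```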
The dataset is located in resources/data/xtb. It can be loaded with pandas in Python as follows (make sure that you are in the resources/data/xtb folder):
```python
import pandas as pd
import glob

df = pd.concat(map(pd.read_csv, sorted(glob.glob("random_generation_stda_xtb_*.csv"))))
```
Alternatively, you can load it from HuggingFace:
```python
from datasets import load_dataset

dataset = load_dataset('SamsungSAILMontreal/Conjugated-xTB_2M_molecules')
```
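If you prefer working with a pandas DataFrame, the HuggingFace dataset can be converted directly; this assumes the previous snippet was run and that the data lives in the default 'train' split:

```python
# Convert the HuggingFace dataset to pandas (assumes the default 'train' split).
df = dataset["train"].to_pandas()
print(df.shape)
```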
The tools to approximate the two properties (absorption wavelength and oscillator strength) can be found in sTDA-xTB/comp_xtb4stascore.py. Note: make sure to also change src/props/properties.py. Here is an example of how to use it in Python (make sure that you are in the sTDA-xTB folder):
```python
from comp_xtb4stascore import xtb4stascore

stdaxtb = xtb4stascore(constraint="none")
smiles = [
    'Fc1cc(F)c(F)c(-c2c[nH]c(-c3ccc4c(c3)C3(c5ccccc5-4)c4ccccc4-c4ccc(-c5cccc6n[nH]nc56)cc43)c2)c1F',
    'N#Cc(ccc1c2ccc(N(c3ccccc3)c4ccccc4)cc2)c5c1nc(c(cccc6)c6c7c8ccc(n(c9c%10cccc9)c%11c%10[nH]c%12c%11cccc%12)c7)c8n5',
    'O=C1C(=Cn2c3ccc(-c4ccc5c6c(ccc(-c7cc8cccc9c%10cccc%11cccc(c(c7)c89)c%11%10)c46)-c4nccnc4-5)cc3c3c4sc5ccccc5c4ccc32)C(=O)c2ccccc21',
    'F[B-]1(F)n2c(cc(-c3cc4ccccn4c3)c2-c2cn3ccnc3c(-c3cc4cc5sccc5cc4s3)n2)C=C2C=CC=[N+]21',
    'C#Cc1ccc2c3cccc4cc(-n5c6ccccc6c6c7ccccc7sc65)cc(c5c(-c6nc7c8ccccc8c8ccccc8c7[nH]6)ccc1c25)c43',
]
props = stdaxtb(smiles)
print(props)
```
To reproduce the experiments from the paper, you can run the examples in experiments/exps.sh.
If you find the code useful, please consider citing our work:
```bibtex
@misc{jolicoeurmartineau2025activelearningstgg,
      title={Generating $\pi$-Functional Molecules Using STGG+ with Active Learning},
      author={Alexia Jolicoeur-Martineau and Yan Zhang and Boris Knyazev and Aristide Baratin and Cheng-Hao Liu},
      year={2025},
      eprint={2502.14842},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
and the STGG+ paper:
```bibtex
@misc{jolicoeurmartineau2024anyproperty,
      title={Any-Property-Conditional Molecule Generation with Self-Criticism using Spanning Trees},
      author={Alexia Jolicoeur-Martineau and Aristide Baratin and Kisoo Kwon and Boris Knyazev and Yan Zhang},
      year={2024},
      eprint={2407.09357},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
and the original STGG paper:
```bibtex
@inproceedings{ahn2022spanning,
      title={Spanning Tree-based Graph Generation for Molecules},
      author={Sungsoo Ahn and Binghong Chen and Tianzhe Wang and Le Song},
      booktitle={International Conference on Learning Representations},
      year={2022},
      url={https://openreview.net/forum?id=w60btE_8T2m}
}
```
Note that this code is based on the original STGG code, which can be found in the Supplementary Material section of https://openreview.net/forum?id=w60btE_8T2m.