This repo contains the official implementation of the paper Generating π-Functional Molecules Using STGG+ with Active Learning. Our method combines STGG+ with active learning (STGG+AL).
Our method is based on STGG+, an improvement over the original STGG method by Sungsoo Ahn et al. (2022).
Please refer to the STGG+ repository for more details on the code hyperparameters.
Figure: Molecule from the Conjugated-xTB dataset versus molecule generated by STGG+AL
You need to create a (free) Neptune account and replace YOUR_API_KEY and YOUR_PROJECT_KEY in the Neptune initialization in train_condgenerator.py and train_generator.py.
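For reference, here is a minimal sketch of what such a Neptune initialization typically looks like; this assumes the code uses Lightning's `NeptuneLogger` (the exact wiring in train_condgenerator.py and train_generator.py may differ):

```python
# Minimal sketch (assumption): typical Neptune logger setup with Lightning.
# Replace the placeholders with your own Neptune credentials.
from lightning.pytorch.loggers import NeptuneLogger

neptune_logger = NeptuneLogger(
    api_key="YOUR_API_KEY",      # from your Neptune account settings
    project="YOUR_PROJECT_KEY",  # e.g. "my-workspace/my-project"
)
# The logger is then passed to the Lightning Trainer, e.g. Trainer(logger=neptune_logger).
```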
You need to change every mention of 'CHANGE_TO_YOUR_DIR' to a directory on your computer in src/train_condgenerator.py, src/train_generator.py, src/props/xtb/stda_xtb.py, src/utils/mol_to_coord.py, sTDA-xTB/comp_xtb4stascore.py, sTDA-xTB/mol_to_coord.py, src/experiments/xtb_finetune_active_learning_fosc_IR.sh, src/experiments/xtb_finetune_active_learning_fosc.sh, and src/experiments/xtb_pretrain.sh.
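If you want to double-check that no file was missed, a small helper like the one below (purely illustrative, not part of the repo) lists every .py and .sh file that still contains the placeholder:

```python
# Hypothetical helper (not part of the repo): list files that still contain the
# CHANGE_TO_YOUR_DIR placeholder so you know which paths remain to be edited.
import pathlib

for path in pathlib.Path(".").rglob("*"):
    if path.suffix in {".py", ".sh"} and "CHANGE_TO_YOUR_DIR" in path.read_text(errors="ignore"):
        print(path)
```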
You must install all the requirements below and build the vocabulary and valencies for each dataset.
## Make env from scratch (replace 'CHANGE_TO_YOUR_DIR' with your directory)
```bash
module load python/3.10
module load cuda/11.8
python -m venv CHANGE_TO_YOUR_DIR/stgg_active_learning
source CHANGE_TO_YOUR_DIR/stgg_active_learning/bin/activate
pip install --upgrade pip setuptools wheel
pip install --upgrade --pre torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install lightning neptune
pip install torch-geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.3.0+cu118.html
pip install cython molsets rdkit pomegranate==0.14.8 pyyaml scikit-learn pandas numpy networkx
pip install fcd_torch
git clone https://github.com/AlexiaJM/moses_fixed  # fixes two annoying bugs in MOSES, which is no longer maintained
cd moses_fixed
python setup.py install

# install xtb (based on https://hackmd.io/@o_wZpkUYSNeE_lvbb5NqGQ/ryrpH49M5)
cd CHANGE_TO_YOUR_DIR
wget https://github.com/grimme-lab/xtb/releases/download/v6.7.1/xtb-6.7.1-linux-x86_64.tar.xz
tar xvf xtb-6.7.1-linux-x86_64.tar.xz
export PATH=$PATH:CHANGE_TO_YOUR_DIR/xtb-dist/bin

rm -rf xtb4stda
git clone https://github.com/grimme-lab/xtb4stda.git
mkdir xtb4stda/exe
cd xtb4stda/exe
wget https://github.com/grimme-lab/stda/releases/download/v1.6.3/xtb4stda
wget https://github.com/grimme-lab/stda/releases/download/v1.6.3/stda_v1.6.3
chmod +x *
export XTB4STDAHOME=CHANGE_TO_YOUR_DIR/xtb4stda
export PATH=$PATH:$XTB4STDAHOME/exe
```
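After exporting these variables, a quick sanity check (a sketch, assuming xtb, xtb4stda, and stda_v1.6.3 were installed as above) can confirm that the executables are visible from Python:

```python
# Sanity-check sketch (assumption: xtb, xtb4stda, and stda_v1.6.3 were installed as above).
import os
import shutil

for exe in ["xtb", "xtb4stda", "stda_v1.6.3"]:
    print(exe, "->", shutil.which(exe) or "NOT FOUND (check your PATH exports)")
print("XTB4STDAHOME =", os.environ.get("XTB4STDAHOME", "NOT SET"))
```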
The dataset is located in resources/data/xtb. It can be loaded with pandas in Python as follows (make sure that you are in the resources/data/xtb folder):
```python
import pandas as pd
import glob

df = pd.concat(map(pd.read_csv, sorted(glob.glob("random_generation_stda_xtb_*.csv"))))
```
Alternatively, you can load it from HuggingFace:
```python
from datasets import load_dataset

dataset = load_dataset('SamsungSAILMontreal/Conjugated-xTB_2M_molecules')
```
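If you prefer working with a pandas DataFrame, the HuggingFace dataset can be converted directly; this assumes the previous snippet was run and that the data lives in the default 'train' split:

```python
# Convert the HuggingFace dataset to pandas (assumes the default 'train' split).
df = dataset["train"].to_pandas()
print(df.shape)
```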
The tools to approximate the two properties (absorption wavelength and oscillator strength) can be found in sTDA-xTB/comp_xtb4stascore.py. Note: make sure to also change src/props/properties.py. Here is an example of how to use it in Python (make sure that you are in the sTDA-xTB folder):
```python
from comp_xtb4stascore import xtb4stascore

stdaxtb = xtb4stascore(constraint="none")
smiles = [
    'Fc1cc(F)c(F)c(-c2c[nH]c(-c3ccc4c(c3)C3(c5ccccc5-4)c4ccccc4-c4ccc(-c5cccc6n[nH]nc56)cc43)c2)c1F',
    'N#Cc(ccc1c2ccc(N(c3ccccc3)c4ccccc4)cc2)c5c1nc(c(cccc6)c6c7c8ccc(n(c9c%10cccc9)c%11c%10[nH]c%12c%11cccc%12)c7)c8n5',
    'O=C1C(=Cn2c3ccc(-c4ccc5c6c(ccc(-c7cc8cccc9c%10cccc%11cccc(c(c7)c89)c%11%10)c46)-c4nccnc4-5)cc3c3c4sc5ccccc5c4ccc32)C(=O)c2ccccc21',
    'F[B-]1(F)n2c(cc(-c3cc4ccccn4c3)c2-c2cn3ccnc3c(-c3cc4cc5sccc5cc4s3)n2)C=C2C=CC=[N+]21',
    'C#Cc1ccc2c3cccc4cc(-n5c6ccccc6c6c7ccccc7sc65)cc(c5c(-c6nc7c8ccccc8c8ccccc8c7[nH]6)ccc1c25)c43',
]
props = stdaxtb(smiles)
print(props)
```
To reproduce the experiments from the paper, you can run the examples in experiments/exps.sh.
If you find the code useful, please consider citing our work:
```bibtex
@misc{jolicoeurmartineau2025activelearningstgg,
      title={Generating $\pi$-Functional Molecules Using STGG+ with Active Learning},
      author={Alexia Jolicoeur-Martineau and Yan Zhang and Boris Knyazev and Aristide Baratin and Cheng-Hao Liu},
      year={2025},
      eprint={2502.14842},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
and the STGG+ paper:
```bibtex
@misc{jolicoeurmartineau2024anyproperty,
      title={Any-Property-Conditional Molecule Generation with Self-Criticism using Spanning Trees},
      author={Alexia Jolicoeur-Martineau and Aristide Baratin and Kisoo Kwon and Boris Knyazev and Yan Zhang},
      year={2024},
      eprint={2407.09357},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
and the original STGG paper:
```bibtex
@inproceedings{ahn2022spanning,
      title={Spanning Tree-based Graph Generation for Molecules},
      author={Sungsoo Ahn and Binghong Chen and Tianzhe Wang and Le Song},
      booktitle={International Conference on Learning Representations},
      year={2022},
      url={https://openreview.net/forum?id=w60btE_8T2m}
}
```
Note that this code is based on the original STGG code, which can be found in the Supplementary Material section of https://openreview.net/forum?id=w60btE_8T2m.