3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure Generation


This repository is the official PyTorch implementation of "3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure Generation" (COLM 2024).


Environment Preparation

Run the following command to create a new conda environment named 3MDiffusion:

conda env create -f environment.yml

Alternatively, download the molca.tar.gz conda environment from the link and run the following command:

tar -xzf molca.tar.gz -C [path to your conda environments directory]

Download the pretrained text encoder for 3M-Diffusion:

Download the checkpoints from the link, then put epoch=49.pt into the folder all_checkpoints/stage1 and graphcl_80.pth into polymers/gin_pretrained.

Training for VAE

Filter small molecules

The polymers folder contains the molecule generation scripts. The generation experiments in the paper can be reproduced through the following steps:

python preprocess_filter.py --input_file ../data/ChEBI-20_data/train.txt --output_file ../data/ChEBI-20_data/train_filter.txt 
python preprocess_filter.py --input_file ../data/ChEBI-20_data/test.txt --output_file ../data/ChEBI-20_data/test_filter.txt 
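The exact filtering criteria live in preprocess_filter.py; purely as an illustration of this kind of preprocessing, here is a minimal RDKit-based sketch that keeps a line only if its first whitespace-separated field is a parsable SMILES with a handful of heavy atoms. The column layout and the size threshold are assumptions, not the script's actual behavior:

# Illustrative only: the real criteria are implemented in preprocess_filter.py.
# Assumes the SMILES is the first whitespace-separated field of each line and
# uses an arbitrary heavy-atom threshold.
from rdkit import Chem

def filter_file(input_file, output_file, min_heavy_atoms=3):
    kept = 0
    with open(input_file) as fin, open(output_file, "w") as fout:
        for line in fin:
            fields = line.split()
            if not fields:
                continue
            mol = Chem.MolFromSmiles(fields[0])
            if mol is not None and mol.GetNumHeavyAtoms() >= min_heavy_atoms:
                fout.write(line)
                kept += 1
    print(f"kept {kept} molecules -> {output_file}")

filter_file("../data/ChEBI-20_data/train.txt",
            "../data/ChEBI-20_data/train_filter.txt")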

Motif Extraction

Extract substructure vocabulary from a given set of molecules:

mkdir vocab_chebi
python get_vocab.py --min_frequency 100 --ncpu 8 --input_file ../data/ChEBI-20_data/train_filter.txt --output_file ./vocab_chebi/

The --min_frequency flag discards any large motif that occurs fewer than 100 times in the dataset; discarded motifs are decomposed into simple rings and bonds. Change --ncpu to set the number of processes used for multiprocessing.
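Conceptually, vocabulary construction amounts to counting how often each extracted motif occurs across the training molecules and keeping only the frequent ones. The sketch below shows just that thresholding step; the motif decomposition itself is handled by get_vocab.py and is not reproduced here:

# Illustrative sketch of the --min_frequency threshold; the motif extraction
# itself is implemented in get_vocab.py.
from collections import Counter

def build_vocab(motif_lists, min_frequency=100):
    """motif_lists: iterable of per-molecule motif (SMILES) lists."""
    counts = Counter(m for motifs in motif_lists for m in motifs)
    # Motifs above the threshold become vocabulary entries; the rest would be
    # decomposed into simple rings and bonds by the real pipeline.
    return {m for m, c in counts.items() if c >= min_frequency}

vocab = build_vocab([["c1ccccc1", "C(=O)O"], ["c1ccccc1"]], min_frequency=2)
print(vocab)  # {'c1ccccc1'}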

Data Preprocessing

Preprocess the dataset using the vocabulary extracted in the previous step:

python preprocess.py --train ../data/ChEBI-20_data/train_filter.txt --vocab ./vocab_chebi/ --ncpu 8 
mkdir train_processed
mv tensor* train_processed/

Training

Train the generative model with KL regularization weight beta=0.1 and VAE latent dimension 24; you can change these with the --beta and --latent_size arguments. A minimal sketch of the beta-weighted objective is shown after the commands below.

mkdir -p ckpt/tmp
python vae_train.py --train train_processed/ --vocab ./vocab_chebi/ --save_dir ckpt/tmp
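Conceptually, the objective pairs a reconstruction term with a beta-weighted KL term. Below is a minimal PyTorch sketch of such a loss; the tensor names are hypothetical and do not correspond to vae_train.py:

# Illustrative beta-weighted VAE objective; names are hypothetical and do
# not correspond to vae_train.py.
import torch

def vae_loss(recon_loss, mu, logvar, beta=0.1):
    # KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return recon_loss + beta * kl.mean()

mu, logvar = torch.zeros(4, 24), torch.zeros(4, 24)  # latent_size = 24
print(vae_loss(torch.tensor(1.0), mu, logvar))       # tensor(1.) when KL = 0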

Training for Diffusion Model

cd polymers 

python main.py --adam_weight_decay 0.00001 --num_train_steps 100000 --batch_size 64 --tx_dim 256 --tx_depth 8 --objective pred_x0 --num_samples 1000 --scale_shift --beta_schedule linear --loss_type l2   --wandb_name train_100_smi_d8_decoder --timesteps 100 --sampling_timesteps 50 --text_hidden_dim 256 --train ./train_processed_chebi/ --vocab ./vocab_chebi_30/ --model ./ckpt/tmp-chebi-clip/model.49 --lr 0.001 --epochs 500 --test ../data/ChEBI-20_data/test_filter.txt --output_dir ./results_chebi/
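To unpack the main flags: --beta_schedule linear fixes a linear noise schedule, --objective pred_x0 trains the denoiser to predict the clean latent directly, and --loss_type l2 applies a mean-squared error. The sketch below illustrates that combination; model is a hypothetical stand-in for the repository's denoising network, not its actual interface:

# Illustrative pred_x0 objective with a linear beta schedule and l2 loss;
# `model` is a placeholder, not the actual network defined in main.py.
import torch
import torch.nn.functional as F

timesteps = 100
betas = torch.linspace(1e-4, 0.02, timesteps)      # linear beta schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, x0, text_emb):
    t = torch.randint(0, timesteps, (x0.size(0),))
    a_bar = alphas_cumprod[t].view(-1, 1)           # \bar{alpha}_t per sample
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # noised latent
    x0_pred = model(x_t, t, text_emb)               # predict the clean latent
    return F.mse_loss(x0_pred, x0)                  # l2 loss on x0

dummy = lambda x_t, t, emb: x_t                     # trivial stand-in denoiser
print(diffusion_loss(dummy, torch.randn(4, 24), torch.randn(4, 256)))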

Evaluation

We provide an example of inference on the ChEBI-20 dataset.

To reproduce the results, you first need to download five files from the link. Put epoch=49.pt into the folder all_checkpoints/stage1, graphcl_80.pth into polymers/gin_pretrained, model-winoise100_train_decoder.pt into polymers/results_chebi, model.49 into ckpt/tmp-chebi-clip/, and tensors-0.pkl into train_processed_chebi.

Then you can run the following command for inference on the ChEBI-20 dataset:

cd polymers

python evaluate_diffusion.py --adam_weight_decay 0.00001 --num_train_steps 100000 --batch_size 64 --tx_dim 256 --tx_depth 8 --objective pred_x0 --num_samples 1000 --scale_shift --beta_schedule linear --loss_type l2   --wandb_name train_100_smi_d8_decoder --timesteps 100 --sampling_timesteps 50 --text_hidden_dim 256 --train ./train_processed_chebi/ --vocab ./vocab_chebi_30/ --model ./ckpt/tmp-chebi-clip/model.49 --lr 0.001 --epochs 500 --test ../data/ChEBI-20_data/test_filter.txt --output_dir ./results_new/ --resume_dir ./results_chebi/
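Note that --sampling_timesteps 50 draws samples with only half of the 100 training timesteps, i.e. the reverse process is run on a strided subset of steps (DDIM-style accelerated sampling). The snippet below only illustrates how such a subset can be chosen; it is not taken from the repository's sampler:

# Illustrative only: evenly spaced timestep indices for accelerated sampling,
# mirroring --timesteps 100 --sampling_timesteps 50.
import torch

def sampling_schedule(timesteps=100, sampling_timesteps=50):
    steps = torch.linspace(0, timesteps - 1, sampling_timesteps).long()
    return steps.flip(0)  # run from the noisiest step down to 0

print(sampling_schedule())  # 50 step indices from 99 down to 0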

Citation

If you find this work useful, please cite our paper:

@inproceedings{zhu20243m,
  title={3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure Generation},
  author={Zhu, Huaisheng and Xiao, Teng and Honavar, Vasant G},
  booktitle={First Conference on Language Modeling},
  year={2024}
}