# 3M-Diffusion

This repository is the official PyTorch implementation of "3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure Generation" (COLM 2024).
## Environment

Run the following command to create a new Anaconda environment named `3MDiffusion`:

```
conda env create -f environment.yml
```

Alternatively, download the packed conda environment `molca.tar.gz` through the link and extract it into your conda environments directory:

```
tar -xzf molca.tar.gz -C [path of your conda's environment]
```
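As an optional sanity check that the environment is set up correctly, you can verify that the key dependencies import (assuming the usual PyTorch + RDKit stack of this codebase):

```python
# Optional sanity check: verify that the core dependencies import correctly.
import torch
from rdkit import Chem

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("rdkit parses SMILES:", Chem.MolFromSmiles("CCO") is not None)
```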
Then download `epoch=49.pt` and `graphcl_80.pth` through the link, placing `epoch=49.pt` in the folder `all_checkpoints/stage1` and `graphcl_80.pth` in `polymers/gin_pretrained`.
## Training

The `polymers` folder contains the molecule generation scripts. The molecule generation experiments in the paper can be reproduced through the following steps.

First, filter the raw ChEBI-20 dataset:
```
python preprocess_filter.py --input_file ../data/ChEBI-20_data/train.txt --output_file ../data/ChEBI-20_data/train_filter.txt
python preprocess_filter.py --input_file ../data/ChEBI-20_data/test.txt --output_file ../data/ChEBI-20_data/test_filter.txt
```
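As a rough illustration only (not the actual logic of `preprocess_filter.py`, which defines its own criteria), a typical filtering step of this kind keeps only entries whose SMILES string parses with RDKit:

```python
from rdkit import Chem

def is_valid_smiles(smiles: str) -> bool:
    """Illustrative filter predicate: True if RDKit can parse the SMILES."""
    return Chem.MolFromSmiles(smiles) is not None
```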
Next, extract the substructure vocabulary from the filtered training set:

```
mkdir vocab_chebi
python get_vocab.py --min_frequency 100 --ncpu 8 --input_file ../data/ChEBI-20_data/train_filter.txt --output_file ./vocab_chebi/
```
The `--min_frequency` flag discards any large motif with fewer than 100 occurrences in the dataset; discarded motifs are decomposed into simple rings and bonds. Change `--ncpu` to set the number of processes used for multiprocessing.
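For intuition, here is a minimal sketch of how such a frequency threshold works (illustrative names; not the repository's `get_vocab.py` implementation):

```python
from collections import Counter

def filter_motifs(motifs, min_frequency=100):
    """Keep motifs that occur at least `min_frequency` times in the dataset."""
    counts = Counter(motifs)
    # Motifs below the threshold are dropped from the vocabulary; the real
    # pipeline then decomposes them into simple rings and bonds.
    return {m for m, c in counts.items() if c >= min_frequency}
```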
Preprocess the dataset using the vocabulary extracted in the previous step:

```
python preprocess.py --train ../data/ChEBI-20_data/train_filter.txt --vocab ./vocab_chebi/ --ncpu 8
mkdir train_processed
mv tensor* train_processed/
```
Train the generative model with KL regularization weight `beta=0.1` and VAE latent dimension 24; these can be changed via the `--beta` and `--latent_size` arguments:

```
mkdir -p ckpt/tmp
python vae_train.py --train train_processed/ --vocab ./vocab_chebi/ --save_dir ckpt/tmp
```
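For reference, `--beta` plays the usual role of the KL weight in a VAE objective. A minimal conceptual sketch (illustrative names, not the repository's `vae_train.py` code):

```python
import torch

def vae_loss(recon_loss, mu, logvar, beta=0.1):
    """Standard beta-weighted VAE objective: reconstruction + beta * KL."""
    # KL divergence between the approximate posterior N(mu, sigma^2)
    # and the standard normal prior, averaged over the batch.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
    return recon_loss + beta * kl
```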
Next, train the latent diffusion model on top of the trained autoencoder:

```
cd polymers
python main.py --adam_weight_decay 0.00001 --num_train_steps 100000 --batch_size 64 --tx_dim 256 --tx_depth 8 --objective pred_x0 --num_samples 1000 --scale_shift --beta_schedule linear --loss_type l2 --wandb_name train_100_smi_d8_decoder --timesteps 100 --sampling_timesteps 50 --text_hidden_dim 256 --train ./train_processed_chebi/ --vocab ./vocab_chebi_30/ --model ./ckpt/tmp-chebi-clip/model.49 --lr 0.001 --epochs 500 --test ../data/ChEBI-20_data/test_filter.txt --output_dir ./results_chebi/
```
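For intuition about the `--objective pred_x0` and `--loss_type l2` flags: the denoiser is trained to predict the clean latent `x0` directly rather than the added noise, with an L2 loss. A conceptual sketch under those assumptions (shapes and names are illustrative, not the repository's `main.py` code):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, t, alphas_cumprod):
    """One pred_x0 training step: noise x0 to step t, predict x0 back, L2 loss."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1)                 # cumulative alpha_bar_t
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward noising step
    x0_pred = model(x_t, t)                               # network predicts x0
    return F.mse_loss(x0_pred, x0)                        # --loss_type l2
```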
## Inference

We provide an example of inference on the ChEBI-20 dataset. To reproduce the results, first download the following five files through the link and place them as follows:

- `epoch=49.pt` into `all_checkpoints/stage1`
- `graphcl_80.pth` into `polymers/gin_pretrained`
- `model-winoise100_train_decoder.pt` into `polymers/results_chebi`
- `model.49` into `ckpt/tmp-chebi-clip/`
- `tensors-0.pkl` into `train_processed_chebi`
Then run the following command for inference on the ChEBI-20 dataset:

```
cd polymers
python evaluate_diffusion.py --adam_weight_decay 0.00001 --num_train_steps 100000 --batch_size 64 --tx_dim 256 --tx_depth 8 --objective pred_x0 --num_samples 1000 --scale_shift --beta_schedule linear --loss_type l2 --wandb_name train_100_smi_d8_decoder --timesteps 100 --sampling_timesteps 50 --text_hidden_dim 256 --train ./train_processed_chebi/ --vocab ./vocab_chebi_30/ --model ./ckpt/tmp-chebi-clip/model.49 --lr 0.001 --epochs 500 --test ../data/ChEBI-20_data/test_filter.txt --output_dir ./results_new/ --resume_dir ./results_chebi/
```
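If you want to inspect the generated molecules yourself, a common way to compare a generated SMILES against its reference is fingerprint Tanimoto similarity. A small sketch using RDKit Morgan fingerprints (illustrative only, not the repository's evaluation code):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a: str, smiles_b: str) -> float:
    """Tanimoto similarity between two molecules via Morgan fingerprints."""
    fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_a), 2, nBits=2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_b), 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)
```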
## Citation

If you find this work useful, please cite our paper:

```
@inproceedings{zhu20243m,
  title={3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure Generation},
  author={Zhu, Huaisheng and Xiao, Teng and Honavar, Vasant G},
  booktitle={First Conference on Language Modeling},
  year={2024}
}
```