LDMol

Official GitHub repository for LDMol, a latent text-to-molecule diffusion model. The details can be found in the following paper:

LDMol: Text-Conditioned Molecule Diffusion Model Leveraging Chemically Informative Latent Space. (arXiv 2024)

Beyond generating molecules from a given text prompt, LDMol can also perform downstream tasks such as molecule-to-text retrieval and text-guided molecule editing.

The model checkpoint and data are too large to include in this repo and can be found here.

Requirements

Run conda env create -f requirements.yaml to create a conda environment named ldmol.

Inference

See the arguments in each script file for more details.

1. text-to-molecule generation

zero-shot: The model gets a hand-written text prompt.

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 inference_demo.py --num-samples 100 --ckpt ./Pretrain/checkpoint_ldmol.pt --prompt="This molecule includes benzoyl group." --cfg-scale=5

benchmark dataset: The model performs text-to-molecule generation on the ChEBI-20 test set. The evaluation metrics are printed at the end.

TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 inference_t2m.py --ckpt ./Pretrain/checkpoint_ldmol_chebi20.pt --cfg-scale=3.5
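Both commands above pass --cfg-scale. Assuming this is the usual classifier-free guidance weight (as in DiT, which this repository builds on), the minimal sketch below shows how such a scale typically combines unconditional and text-conditioned noise predictions. The function name apply_cfg and the list-based inputs are hypothetical and for illustration only; the repository's code works on tensors and may differ in detail.

```python
def apply_cfg(eps_uncond, eps_cond, cfg_scale):
    """Combine two noise predictions with classifier-free guidance.

    Illustrative only: eps_uncond and eps_cond stand in for the model's
    noise predictions without and with the text prompt (plain lists here).
    """
    # Move from the unconditional prediction toward the conditional one,
    # scaled by cfg_scale. A scale of 1.0 returns the conditional prediction.
    return [u + cfg_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 3.0]
    print(apply_cfg(uncond, cond, 1.0))  # same as cond
    print(apply_cfg(uncond, cond, 5.0))  # pushed further along (cond - uncond)
```

Under this assumption, a larger scale follows the prompt more strongly, which is one plausible reason the demo command uses 5 while the benchmark command uses 3.5.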

2. molecule-to-text retrieval

The model performs molecule-to-text retrieval on the given dataset. --level controls the quality of the query text (paragraph or sentence). --n-iter is the number of function evaluations of the model.

TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 inference_retrieval_m2t.py --ckpt ./Pretrain/checkpoint_ldmol.pt --dataset="./data/PCdes/test.txt" --level="paragraph" --n-iter=10
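One common way to retrieve text with a conditional diffusion model is to score each candidate description by the model's denoising error and pick the lowest, averaging over several evaluations. Whether inference_retrieval_m2t.py does exactly this is an assumption. The sketch below only illustrates the idea and how a parameter like --n-iter could enter. The names retrieve_text and toy_error are hypothetical, and toy_error is a stand-in for a real model call.

```python
import random


def retrieve_text(error_fn, candidates, n_iter=10, seed=0):
    """Return the candidate text with the lowest mean error over n_iter draws."""
    if n_iter < 1:
        raise ValueError("n_iter must be at least 1")
    rng = random.Random(seed)
    # Reuse the same noise draws for every candidate so scores are comparable.
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_iter)]
    mean_error = {}
    for text in candidates:
        errors = [error_fn(text, noise) for noise in draws]
        mean_error[text] = sum(errors) / n_iter
    best = min(candidates, key=mean_error.get)
    return best, mean_error


def toy_error(text, noise):
    """Hypothetical stand-in for one model evaluation (not the real model).

    Pretends the query molecule contains fluorine, so descriptions that
    mention fluorine get a smaller denoising error.
    """
    mismatch = 0.0 if "fluorine" in text else 1.0
    return (mismatch + 0.1 * noise) ** 2


if __name__ == "__main__":
    texts = [
        "This molecule contains bromine.",
        "This molecule contains fluorine.",
    ]
    best, scores = retrieve_text(toy_error, texts, n_iter=10)
    print(best)
```

In this sketch, each candidate costs n_iter evaluations, so a larger value gives a steadier score at a proportionally higher cost.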

3. text-guided molecule editing

The model performs DDS-style text-guided molecule editing. --source-text should describe the molecule given by --input-smiles. --target-text describes the desired molecule.

TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 inference_dds.py --ckpt ./Pretrain/checkpoint_ldmol.pt --input-smiles="C[C@H](CCc1ccccc1)Nc1ccc(C#N)cc1F" --source-text="This molecule contains fluorine." --target-text="This molecule contains bromine."
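SMILES strings contain characters such as parentheses, brackets, and # that a shell may interpret if they are not quoted. When launching the editing script from Python, passing the arguments as a list avoids shell quoting altogether. The sketch below is one way to assemble the command shown above. The helper name build_dds_command is hypothetical; the flags and default checkpoint path are taken from that command.

```python
import shlex


def build_dds_command(input_smiles, source_text, target_text,
                      ckpt="./Pretrain/checkpoint_ldmol.pt"):
    """Build the argument list for the text-guided editing command."""
    return [
        "torchrun", "--nnodes=1", "--nproc_per_node=1", "inference_dds.py",
        "--ckpt", ckpt,
        f"--input-smiles={input_smiles}",
        f"--source-text={source_text}",
        f"--target-text={target_text}",
    ]


argv = build_dds_command(
    input_smiles="C[C@H](CCc1ccccc1)Nc1ccc(C#N)cc1F",
    source_text="This molecule contains fluorine.",
    target_text="This molecule contains bromine.",
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(argv))

# To actually launch it, something like the following could be used, with
# TOKENIZERS_PARALLELISM and CUDA_VISIBLE_DEVICES supplied through env:
#   subprocess.run(argv, env={**os.environ, "TOKENIZERS_PARALLELISM": "false",
#                             "CUDA_VISIBLE_DEVICES": "0"}, check=True)
```

Because each argument is a separate list element, the SMILES string reaches the script unchanged regardless of which shell characters it contains.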

Acknowledgement

The code for the DiT diffusion model is based on and modified from the official DiT code.
The code for BERT with cross-attention layers (xbert.py) and the schedulers are modified from those in ALBEF.
