Chemeleon is a text-guided diffusion model designed for crystal structure generation. The tool allows users to explore and generate crystal structures either through natural language descriptions or by specifying target compositions and navigating chemical systems.
Text-Guided Generation
: Generate crystal structures from descriptive text inputs, enabling intuitive exploration of material properties.Composition-Based Generation
: Create structures with specific chemical compositions, aiding targeted material discovery.Chemical System Navigation
: Systematically explore complex chemical systems by specifying elements of interest.
A demo version of Chemeleon is available at Chemeleon App.
Note
The demo is currently in beta and hosted on Lightning Studio. The server will automatically shut down when not in use and may take approximately 2 minutes to restart.
For local deployment, you can run the application on your machine with the following command:
streamlit run app/steamlit_app.py
- Python 3.10 or later
conda create -n chemeleon python=3.10
conda activate chemeleon
- Pytorch >= 1.12 (install from the official website suitable for your environment)
Note
It is recommended to install PyTorch prior to installing chemeleon
to avoid potential issues with GPU support.
The training of chemeleon models was implemented using PyTorch 2.1.0.
chemeleon
can be installed from PyPI or the source code.
pip install chemeleon
git clone https://github.com/hspark1212/chemeleon.git
cd chemeleon
pip install -e .
To verify the installation, run the code in the demo.ipynb notebook.
Chemeleon
provides both a Python API and a command-line interface (CLI) for generating crystal structures based on text prompts, target specific compositions and navigating chemical systems.
Let's start with a simple example using the Python API. You can run this code in the demo.ipynb notebook:
import chemeleon
from chemeleon.visualize import Visualizer
# Load default model checkpoint (general text types)
chemeleon = chemeleon.load_general_text_model()
# Set parameters
n_samples = 5
n_atoms = 6
text_inputs = "A crystal structure of LiMnO4 with orthorhombic symmetry"
# Generate crystal structure
atoms_list = chemeleon.sample(text_inputs, n_atoms, n_samples)
# Visualize the generated crystal structure
visualizer = Visualizer(atoms_list)
visualizer.view(index=1)
Generate crystal structures from textual descriptions, enabling intuitive material design.
Note
The current version of Chemeleon v0.1.1 was trained with textual description consisting of composition (e.g. LiMnO4) and crystal system (e.g. orthorhombic, hexagonal, cubic). The model was trained with general type of text inputs created by LLMs (OpenAI API).
chemeleon sample prompt --text-input "A crystal structure of LiMnO4 with orthorhombic symmetry" --n-atoms 6 --n-samples 3 --save-dir results/prompt
In this example, Chemeleon
generates 3 (--n-samples
) crystal structures based on the text prompt provided (--text-input
). Each structure will contain 6 (--n-atoms
) atoms in the unit cell. The results are saved in the results/prompt (--save-dir
) directory.
Options
--text-input
or-t
: Textual description of the crystal structure to generate.--n-atoms
: Number of atoms in the generated crystal structure.--n-samples
: Number of crystal structures to generate.--save-dir
or-s
: Directory to save the generated crystal structures.
Tip
- Ensure your text prompt includes both the composition and the crystal symmetry for better results.
--n-atoms
parameter should be consistent with the stoichiometry of the composition provided.
Generate crystal structures based on specific chemical compositions, aiding in targeted material discovery.
Note
For the target composition, Chemeleon uses a composition-based model trained with textual descriptions of compositions in a strict format (e.g., Ti1 O2).
chemeleon sample composition --target-composition "TiO2" --n-samples 100 --max-natoms 40 --max-factor 13 --save-dir results/TiO2
In this example, Chemeleon
generates structures for multiple stoichiometries of the target composition TiO2 (--target-composition
), up to a maximum Z (multiplication) factor of 13 (--max-factor
).
This means it will generate structures for Ti1O2, Ti2O4, ..., Ti13O26, ensuring the total number of atoms does not exceed 40 atoms (--max-natoms
).
For each stoichiometry, 100 structures (--n-samples
) are generated. Duplicate structures are removed using Pymatgen's StructureMatcher
, and unique structures are saved in the results/TiO2 (--save-dir
) directory.
Options
Options:--target-composition
or-t
: The target chemical composition (e.g., TiO2).--n-samples
: Number of structures to generate (default: 100).--max-natoms
: Maximum number of atoms allowed in a structure (default: 40).--max-factor
: Maximum Z (multiplication) factor for the composition's stoichiometry (default: 13).--save-dir
or-s
: Directory to save the generated structures (default: results/TiO2).
Tip
- Chemeleon will generate structures for stoichiometries from the base composition up to the specified --max-factor.
- If you want to generate structures for a single stoichiometry, set --max-natoms to the maximum number of atoms in the stoichiometry and --max-factor to 1.
- After generation, duplicate structures are filtered out to retain only unique structures.
Systematically explore complex chemical systems by specifying elements of interest, narrowing down the search space using chemical filters provided by SMACT.
Note
For navigating chemical systems, Chemeleon uses a composition-based model trained with textual descriptions of compositions in a strict format (e.g., Ti1 O2).
chemeleon navigate system --elements Zn,Ti,O --n-samples 100 --max-stoich 8 --max-natoms 40 --max-factor 8 --save-dir results/navigate
In this example, Chemeleon
explores chemical systems containing zinc (Zn), titanium (Ti), and oxygen (O) elements (--elements
). It enumerates all possible compositions with up to 8 stoichiometric coefficient (--max-stoich
), following the chemical filters provided by SMACT. Then, it generates 100 structures (--n-samples
) for each composition, ensuring that the total number of atoms in each structure does not exceed 40 atoms (--max-natoms
) and the Z factor is less than 8 (--max-factor
). After generation, duplicate structures are filtered out to retain only unique structures using Pymatgen's StructureMatcher. The unique structures are saved in the results/navigate directory (--save-dir
).
Options
--elements
or-e
: Elements to include in the chemical system (e.g., Zn, Ti, O).--max-stoich
: Maximum stoichiometric coefficient for the composition (default: 8).--n-samples
: Number of samples to generate for each composition (default: 100).--max-natoms
: Maximum number of atoms allowed in a structure (default: 40).--max-factor
: Maximum Z (multiplication) factor for the composition's stoichiometry (default: 8).--save-dir
or-s
: Directory to save the generated structures (default: results/navigate).
Tip
- The specified elements are used to generate all possible compositions within the stoichiometric range up to --max-stoich.
smact_validity
function from SMACT is used to filter out invalid compositions.- After generation, duplicate structures are filtered out to retain only unique structures.
The training dataset, mp-40
, consists of crystal structures obtained from the Materials Project, each with fewer than 40 atoms per unit cell. Textual descriptions of these materials are generated using the OpenAI API.
A detailed guide for preprocessing this dataset can be found in the Jupyter notebook located at:
data/mp-40/data_preparation.ipynb
.
The processed dataset is split into training, validation, and test sets, saved in:
data/mp-40/train.csv
data/mp-40/val.csv
data/mp-40/test.csv
The first step involves training the Crystal CLIP
, which is a Transformer-encoder based (BERT) model trained with contrastive learning to align text embeddings with 3D structural embeddings derived from graph neural networks (GNNs).
python run_crystal_clip.py with clip_prompt
This command runs the training of Crystal CLIP with general text inputs (e.g. "A crystal structure of LiMnO4 with orthorhombic symmetry").
Alternatively, the text encoder can be trained with different types of text inputs:
- Composition-based inputs:
clip_composition
(e.g., Li1 Mn1 O4) - Formatted text inputs:
clip_composition_crystal_system
(e.g., "composition: LiMnO4, crystal system: orthorhombic")
Note
The detailed configuration can be found at config.py
.
After training the text encoder, the next step is to train the denoising diffusion model. This model generates crystal structures conditioned on text embeddings derived from Crystal CLIP
. Conditional sampling is implemented using a classifier-free guidance scheme.
python run.py with chemeleon_clip_prompt
Like the text encoder, the diffusion model can be trained with different text input types:
- General text:
chemeleon_clip_prompt
- Composition-based text:
chemeleon_clip_composition
- Formatted text:
chemeleon_clip_composition_crystal_system
Note
The detailed configuration can be found at config.py
.
Once training is complete, the model can be evaluated by sampling crystal structures based on textual descriptions in the test set data/mp-40/test.csv
. The model generates 20 structures for each textual description.
To run the evaluation script, use:
python chemeleon/scripts/evaluate.py with model_path=`path/to/model`
This script will output the following metrics:
valid_samples
: Number of valid samples generated.unique_samples
: Number of unique samples generated.structure_matching
: Percentage of structures that match the ground truth.meta_stable
: Percentage of structures that have lower energy than the ground truth.composition_matching
: Percentage of structures that have the same composition as the ground truth.crystal_system_matching
: Percentage of structures that have the same crystal system as the ground truth.lattice_system_matching
: Percentage of structures that have the same lattice system as the ground truth.
Tip
The result of the evaluation for Chemeleon v0.1.1 can be found at WandB.
If you find our work helpful, please refer to the following publication: "Exploration of crystal chemical space using text-guided generative artificial intelligence" Chemxiv (2024)
@article{chemeleon,
title={Exploration of crystal chemical space using text-guided generative artificial intelligence},
author={Park, Hyunsoo and Onwuli, Anthony and Walsh, Aron},
journal={ChemRxiv},
doi={10.26434/chemrxiv-2024-rw8p5},
year={2024}
}
This project is licensed under the MIT License, developed by Hyunsoo Park as part of the Materials Design Group at Imperial College London. See the LICENSE file for more details.