
SSEC-JHU bluephos


BluePhos: An automated pipeline optimizing the synthesis and analysis of blue phosphorescent materials.

BluePhos Pipeline Introduction

Overview

The BluePhos pipeline is an automated computational tool that streamlines the development and analysis of blue phosphorescent materials. It combines computational chemistry with machine learning to predict and refine the properties of key compounds used in light-emitting technologies.

Workflow Evolution

The BluePhos pipeline works like an automated assembly line: a structured yet adaptable workflow distributes tasks efficiently across computing resources, optimizing batch processing and resource allocation while handling molecules individually for streamlined operation.

The current version of the pipeline comprises the following sequential tasks:

  • Ligand Generation Task: Ingests aromatic boronic acids and aromatic halides and generates candidate ligand molecules via Suzuki coupling reactions.
  • SMILES to SDF Conversion Task: Converts molecular structures encoded as SMILES strings into SDF files, facilitating in-depth chemical data manipulation (see the sketch after this list).
  • Neural Network (NN) Task: Extracts and engineers features from each ligand, then passes them through a trained graph neural network to predict the ligand's z-score, indicative of synthetic potential.
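
The conversion step can be reproduced outside the pipeline with RDKit; the following is a minimal sketch (not the pipeline's own implementation; the SMILES string and file name are illustrative):

from rdkit import Chem
from rdkit.Chem import AllChem

# Parse a SMILES string (here 2-phenylpyridine, a classic cyclometalating ligand).
mol = Chem.MolFromSmiles('c1ccc(-c2ccccn2)cc1')
mol = Chem.AddHs(mol)                       # add explicit hydrogens before embedding
AllChem.EmbedMolecule(mol, randomSeed=42)   # generate a 3D conformer
writer = Chem.SDWriter('ligand.sdf')        # write the structure to an SDF file
writer.write(mol)
writer.close()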

Planned enhancements include:

  • Geometry Optimization Task: Optimizes molecular geometries so that the ligands adopt energetically favorable conformations (see the xTB sketch below).
  • Density Functional Theory (DFT) Calculation Task: Applies DFT calculations to the optimized geometries for in-depth quantum mechanical insight into the ligands' electronic properties.
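
For orientation, a standalone xTB geometry optimization can be driven from Python roughly as follows (a sketch only, assuming the xtb binary is on your PATH; the input file name is illustrative):

import subprocess

# Run a tight geometry optimization on an xyz structure with the xtb binary.
result = subprocess.run(
    ['xtb', 'ligand.xyz', '--opt', 'tight'],
    capture_output=True, text=True, check=True,
)
# xtb writes the optimized structure to xtbopt.xyz in the working directory.
print(result.stdout.splitlines()[-1])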

Setup and Installation

Step 1: Clone the GitHub Repository

  • git clone https://github.com/ssec-jhu/bluephos.git

Step 2: Set Up the Runtime Environment

After cloning, navigate to the bluephos directory and create the blue_env environment using Conda:

cd bluephos
conda env create -f blue_env.yml
conda activate blue_env


Running the Pipeline

Usage

  • python bluephos_pipeline.py [options]

Command-Line Arguments

  • --halides (optional, string, default None): Path to the CSV file containing halides data. Required when no input directory or ligand SMILES CSV file is specified.
  • --acids (optional, string, default None): Path to the CSV file containing boronic acids data. Required when no input directory or ligand SMILES CSV file is specified.
  • --features (required, string): Path to the element feature file used for neural network predictions.
  • --train (required, string): Path to the train stats file used to normalize input data.
  • --weights (required, string): Path to the full energy model weights file for the neural network.
  • --input_dir (optional, string, default None): Directory containing input parquet files for rerun mode (mode 2); used when no ligand SMILES CSV file is provided.
  • --out-dir (optional, string, default None): Directory where the pipeline's output files will be saved. If not specified, defaults to the current directory.
  • --t_nn (optional, float, default 1.5): Threshold for the neural network 'z' score. Candidates with an absolute 'z' score below this threshold are kept.
  • --t_ste (optional, float, default 1.9): Threshold for 'ste' (singlet-triplet energy gap). Candidates with an absolute 'ste' value below this threshold are kept.
  • --t_dft (optional, float, default 2.0): Threshold for 'dft' (dft_energy_diff). Candidates with an absolute 'dft' value below this threshold are kept.
  • --ligand_smiles (optional, string, default None): Path to the CSV file containing ligand SMILES data. If provided, the pipeline runs in mode 3.
  • --no_xtb (optional, flag): Disable xTB optimization. xTB optimization is enabled unless this flag is passed.

The BluePhos Discovery Pipeline now supports three modes of input data:

  1. Generate data from Halides and Acids CSV files: This mode is used when no input directory or ligand SMILES CSV file is specified. It generates ligand pairs from the provided halides and acids CSV files.
  2. Rerun data from parquet files: This mode is used when an input directory is specified. It reruns the pipeline using existing parquet files for ligand data.
  3. Input data from a ligand SMILES CSV file: This mode is prioritized if a ligand SMILES CSV file is provided. It directly processes ligands from the SMILES data.

  The priority order for these modes is 3 > 2 > 1, meaning:

  • If a ligand SMILES CSV file (--ligand_smiles) is provided, the pipeline operates in mode 3.
  • If an input directory (--input_dir) is specified and no ligand SMILES CSV file is provided, the pipeline operates in mode 2.
  • If neither a ligand SMILES CSV file nor an input directory is provided, the pipeline defaults to mode 1.
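
The selection logic amounts to a simple priority check; a minimal sketch (function and argument names are hypothetical, not the pipeline's actual code):

def select_mode(ligand_smiles=None, input_dir=None, halides=None, acids=None):
    """Return the input mode following the 3 > 2 > 1 priority order."""
    if ligand_smiles is not None:   # mode 3: direct ligand SMILES input
        return 3
    if input_dir is not None:       # mode 2: rerun from existing parquet files
        return 2
    if halides is None or acids is None:
        raise ValueError('Mode 1 requires both --halides and --acids')
    return 1                        # mode 1: generate pairs from halides and acids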

Example Commands

  1. Generating Ligand Pairs and Running the Full Pipeline (Mode 1)
    To generate ligand pairs from halides and acids files and run the full pipeline, specify the paths to both files:
  • python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
  2. Rerunning the Pipeline with Existing Parquet Files (Mode 2)
    If you have already run the pipeline and want to refilter or recalculate ligands based on previous results:
  • python bluephos_pipeline.py --input_dir path/to/parquet_directory --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
  3. Using a Ligand SMILES CSV File (Mode 3)
  • python bluephos_pipeline.py --ligand_smiles path/to/ligand_smiles.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5
  4. Specifying Different Thresholds for NN and STE
    Adjust the thresholds for the neural network 'z' score and the singlet-triplet energy gap ('ste') as needed:
  • python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --t_nn 2.0 --t_ste 2.5
  5. Using a Different DFT Package
    By default, the pipeline uses the ORCA DFT package, but you can switch to ASE (to be implemented later) if preferred:
  • python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --dft_package ase
  6. Disabling xTB Optimization
    By default, the geometry optimization task uses the xTB package. To disable it, run:
  • python bluephos_pipeline.py --halides path/to/halides.csv --acids path/to/acids.csv --features path/to/features.csv --train path/to/train_stats.csv --weights path/to/model_weights.h5 --no_xtb

Execute the BluePhos pipeline within a tox environment for a consistent and reproducible setup:

  • tox -e run-pipeline -- --halide /path/to/aromatic_halides.csv --acid /path/to/aromatic_boronic_acids.csv --feature /path/to/element_features.csv --train /path/to/train_stats.csv --weight /path/to/model_weights.pt -o /path/to/output_dir/

Replace /path/to/... with the actual paths to your datasets and parameter files.

Example Usage with Test Data

To run the pipeline using example data provided in the repository:

  • tox -e run-pipeline -- --halide ./tests/input/aromatic_halides_with_id.csv --acid ./tests/input/aromatic_boronic_acids_with_id.csv --feature ./bluephos/parameters/element_features.csv --train ./bluephos/parameters/train_stats.csv --weight ./bluephos/parameters/full_energy_model_weights.pt -o .

This command uses test data to demonstrate the pipeline's functionality, ideal for initial testing and familiarization.

Results

Note:

  • The default output (-o/--out-dir) dataframe is stored in Parquet format for its efficient storage, fast data access, and good support for complex data structures.
  • The pipeline's results are organized by task, with filtered-out data stored in task-specific subdirectories within the /output directory. For example:
    - The filtered-out data from the NN task is stored in /NN_filter_out.
    - For the XTB task, the filtered-out data is saved in /XTB_filter_out.
    - For the final DFT task, the results are split into two directories: /DFT_filter_in for filtered-in data and /DFT_filter_out for filtered-out data.
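
Assuming the pyarrow engine is installed, pandas can load an entire subdirectory of Parquet parts in one call, e.g. the NN task's filtered-out ligands (paths illustrative):

import pandas as pd

# Read every Parquet part under the NN task's filter-out directory at once.
nn_rejected = pd.read_parquet('output/NN_filter_out')
print(nn_rejected.shape)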

The Parquet file can be accessed in several ways:

Using Pandas

Pandas can be used to read and analyze Parquet files.

import pandas as pd

# Load one of the UUID-named Parquet files produced by a run.
df = pd.read_parquet('08ca147e-f618-11ee-b38f-eab1f408aca3-8.parquet')
print(df.describe())  # summary statistics for the numeric columns
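
Loaded results can then be screened the same way the pipeline filters them, e.g. keeping rows whose absolute neural-network score is below --t_nn (the column name 'z' here is an assumption for illustration):

t_nn = 1.5  # default neural-network threshold (see --t_nn)
# NOTE: the z-score column name 'z' is assumed, not confirmed by the source.
candidates = df[df['z'].abs() < t_nn]
print(len(candidates), 'candidates below the NN threshold')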

Using DuckDB

DuckDB provides an efficient way to query Parquet files directly using SQL syntax.

import duckdb as ddb

# Query the Parquet file directly with SQL; returns a DuckDB relation.
query_result = ddb.query('''SELECT * FROM '08ca147e-f618-11ee-b38f-eab1f408aca3-8.parquet' LIMIT 10''')
print(query_result.to_df())  # materialize the first ten rows as a DataFrame
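
DuckDB can also query all Parquet files in a directory at once through a glob pattern (the output path is illustrative):

import duckdb as ddb

# Aggregate over every Parquet file in the output directory via a glob pattern.
counts = ddb.query("SELECT COUNT(*) AS n_rows FROM 'output/*.parquet'")
print(counts.to_df())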

Contributing

We welcome contributions! Please see our CONTRIBUTING.md for guidelines on how to contribute to this project.
