Merge branch 'main' into 58-test-python-313
c-w-feldmann authored Oct 7, 2024
2 parents d396871 + ebd9887 commit e2f0855
Showing 64 changed files with 2,608 additions and 647 deletions.
Binary file added .github/molpipeline.png
9 changes: 4 additions & 5 deletions .github/workflows/linting.yml
@@ -22,7 +22,7 @@ jobs:
pip install pylint
- name: Analysing the code with pylint
run: |
pylint -d C0301,R0913,W1202 $(git ls-files '*.py') --ignored-modules "rdkit"
pylint -d C0301,R0913,W1202 $(git ls-files '*.py') --ignored-modules "rdkit" --max-positional-arguments 10
mypy:
runs-on: ubuntu-latest
steps:
@@ -34,7 +34,6 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install "numpy<2.0.0"
pip install mypy
mypy . || exit_code=$?
mypy --install-types --non-interactive
@@ -151,7 +150,7 @@ jobs:
pip install isort
- name: Analysing the code with isort
run: |
isort --profile black .
isort --profile black --check-only .
test_basis:
needs:
@@ -182,7 +181,7 @@ jobs:
- name: Run unit-tests
run: |
# Run only the core test suite in the tests directory.
coverage run -m unittest discover tests
coverage run --source=molpipeline,tests -m unittest discover tests
# Create a coverage report. Fail if the coverage is below 85%. Exclude extra packages from the report.
coverage report --fail-under=85 --omit="*chemprop*","*/*chemprop*/*"
@@ -208,7 +207,7 @@ jobs:
- name: Run unit-tests for chemprop
run: |
# Run only the chemprop test suite.
coverage run -m unittest discover test_extras/test_chemprop
coverage run --source=molpipeline,tests -m unittest discover test_extras/test_chemprop
# Create a coverage report. Fail if the coverage is below 85%. Include only chemprop files in the report.
coverage report --fail-under=85 --include="*chemprop*","*/*chemprop*/*"
1 change: 1 addition & 0 deletions .gitignore
@@ -3,4 +3,5 @@ __pycache__
molpipeline.egg-info/
lib/
build/
lightning_logs/

83 changes: 69 additions & 14 deletions README.md
@@ -1,35 +1,56 @@
# MolPipeline
MolPipeline is a Python package providing RDKit functionality in a Scikit-learn like fashion.
MolPipeline is a Python package for processing molecules with RDKit in scikit-learn.

<p align="center"><img src=".github/molpipeline.png" height="250"/></p>

## Background

The open-source package [scikit-learn](https://scikit-learn.org/) provides a large variety of machine
The [scikit-learn](https://scikit-learn.org/) package provides a large variety of machine
learning algorithms and data processing tools, among which is the `Pipeline` class, allowing users to
prepend custom data processing steps to the machine learning model.
`MolPipeline` extends this concept to the field of chemoinformatics by
wrapping default functionalities of [RDKit](https://www.rdkit.org/), such as reading and writing SMILES strings
`MolPipeline` extends this concept to the field of cheminformatics by
wrapping standard [RDKit](https://www.rdkit.org/) functionality, such as reading and writing SMILES strings
or calculating molecular descriptors from a molecule-object.

A notable difference to the `Pipeline` class of scikit-learn is that the Pipeline from `MolPipeline` allows for
instances to fail during processing without interrupting the whole pipeline.
Such behaviour is useful when processing large datasets, where some SMILES strings might not encode valid molecules
or some descriptors might not be calculable for certain molecules.
MolPipeline aims to provide:

- Automated end-to-end processing from molecule data sets to deployable machine learning models.
- Scalable parallel processing and low memory usage through instance-based processing.
- Standard pipeline building blocks for flexibly building custom pipelines for various
cheminformatics tasks.
- Consistent error handling for tracking, logging, and replacing failed instances (e.g., a
SMILES string that could not be parsed correctly).
- Integrated and self-contained pipeline serialization for easy deployment and tracking
in version control.

## Publications

The publication is freely available [here](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036).
[Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P, and Mathea M, MolPipeline: A python package for processing
molecules with RDKit in scikit-learn, J. Chem. Inf. Model., doi:10.1021/acs.jcim.4c00863, 2024](https://doi.org/10.1021/acs.jcim.4c00863)
\
Further links: [ChemRxiv](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036)

Feldmann CW, Sieg J, and Mathea M, Analysis of uncertainty of neural
fingerprint-based models, 2024
\
Further links: [repository](https://github.com/basf/neural-fingerprint-uncertainty)

## Installation
```commandline
pip install molpipeline
```

## Usage
## Documentation

The [notebooks](notebooks) folder contains many basic and advanced examples of how to use MolPipeline.

A nice introduction to the basic usage is in the [01_getting_started_with_molpipeline notebook](notebooks/01_getting_started_with_molpipeline.ipynb).

See the [notebooks](notebooks) folder for basic and advanced examples of how to use Molpipeline.
## Quick Start

A basic example of how to use MolPipeline to create a fingerprint-based model is shown below (see also the [notebook](notebooks/01_getting_started_with_molpipeline.ipynb)):
### Model building

Create a fingerprint-based prediction model:
```python
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
@@ -58,8 +79,42 @@ pipeline.predict(["CCC"])
# output: array([0.29])
```

Molpipeline also provides custom estimators for standard cheminformatics tasks that can be integrated into pipelines,
like clustering for scaffold splits (see also the [notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb)):
### Feature calculation

Calculating molecular descriptors from SMILES strings is straightforward. For example, physicochemical properties can
be calculated like this:
```python
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToRDKitPhysChem

pipeline_physchem = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        (
            "physchem",
            MolToRDKitPhysChem(
                standardizer=None,
                descriptor_list=["HeavyAtomMolWt", "TPSA", "NumHAcceptors"],
            ),
        ),
    ],
    n_jobs=-1,
)
physchem_matrix = pipeline_physchem.transform(["CCCCCC", "c1ccccc1(O)"])
physchem_matrix
# output: array([[72.066, 0. , 0. ],
# [88.065, 20.23 , 1. ]])
```

MolPipeline provides further features and descriptors from RDKit,
for example Morgan (binary/count) fingerprints and MACCS keys.
See the [04_feature_calculation notebook](notebooks/04_feature_calculation.ipynb) for more examples.

### Clustering

MolPipeline provides several clustering algorithms as sklearn-like estimators. For example, molecules can be
clustered by their Murcko scaffold. See the [02_scaffold_split_with_custom_estimators notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb) for scaffold splits and further examples.

```python
from molpipeline.estimators import MurckoScaffoldClustering
# ...
```
64 changes: 62 additions & 2 deletions molpipeline/abstract_pipeline_elements/any2mol/string2mol.py
@@ -4,8 +4,11 @@

import abc

from molpipeline.abstract_pipeline_elements.core import AnyToMolPipelineElement
from molpipeline.utils.molpipeline_types import OptionalMol
from molpipeline.abstract_pipeline_elements.core import (
AnyToMolPipelineElement,
InvalidInstance,
)
from molpipeline.utils.molpipeline_types import OptionalMol, RDKitMol


class StringToMolPipelineElement(AnyToMolPipelineElement, abc.ABC):
@@ -43,3 +46,60 @@ def pretransform_single(self, value: str) -> OptionalMol:
OptionalMol
RDKit molecule if representation was valid, else InvalidInstance.
"""


class SimpleStringToMolElement(StringToMolPipelineElement, abc.ABC):
    """Transforms string representation to RDKit Mol objects."""

    def pretransform_single(self, value: str) -> OptionalMol:
        """Transform string to molecule.

        Parameters
        ----------
        value: str
            string representation.

        Returns
        -------
        OptionalMol
            RDKit molecule if valid string representation, else InvalidInstance.
        """
        if value is None:
            return InvalidInstance(
                self.uuid,
                f"Invalid representation: {value}",
                self.name,
            )

        if not isinstance(value, str):
            return InvalidInstance(
                self.uuid,
                f"Not a string: {value}",
                self.name,
            )

        mol: RDKitMol = self.string_to_mol(value)

        if not mol:
            return InvalidInstance(
                self.uuid,
                f"Invalid representation: {value}",
                self.name,
            )
        mol.SetProp("identifier", value)
        return mol

    @abc.abstractmethod
    def string_to_mol(self, value: str) -> RDKitMol:
        """Transform string representation to molecule.

        Parameters
        ----------
        value: str
            string representation

        Returns
        -------
        RDKitMol
            RDKit molecule if valid representation, else None.
        """
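
The class added in this file follows a template-method layout: `pretransform_single` performs the shared input checks and delegates the actual parsing to the abstract `string_to_mol`. Below is a minimal, dependency-free sketch of that control flow. `InvalidInstanceStub`, `SimpleStringToValueStub`, and `CsvToInts` are hypothetical names invented for illustration and are not part of molpipeline; the real class works on RDKit molecules and also sets an `identifier` property on the result, which this sketch omits.

```python
import abc
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class InvalidInstanceStub:
    """Hypothetical stand-in for molpipeline's InvalidInstance."""

    element_id: str
    message: str
    element_name: str


class SimpleStringToValueStub(abc.ABC):
    """Hypothetical base class mirroring the checks in the diff above."""

    def __init__(self, name: Optional[str] = None, uuid: str = "stub-uuid") -> None:
        # Assumption: a fixed placeholder uuid is enough for this sketch.
        self.name = self.__class__.__name__ if name is None else name
        self.uuid = uuid

    def pretransform_single(
        self, value: object
    ) -> Union[list[int], InvalidInstanceStub]:
        # Shared validation lives in the base class.
        if value is None:
            return InvalidInstanceStub(
                self.uuid, f"Invalid representation: {value}", self.name
            )
        if not isinstance(value, str):
            return InvalidInstanceStub(self.uuid, f"Not a string: {value}", self.name)
        parsed = self.string_to_value(value)
        # A falsy parse result is reported as an invalid instance.
        if not parsed:
            return InvalidInstanceStub(
                self.uuid, f"Invalid representation: {value}", self.name
            )
        return parsed

    @abc.abstractmethod
    def string_to_value(self, value: str) -> Optional[list[int]]:
        """Parse the string; return None if it cannot be parsed."""


class CsvToInts(SimpleStringToValueStub):
    """Toy subclass: only the parsing step has to be implemented."""

    def string_to_value(self, value: str) -> Optional[list[int]]:
        try:
            return [int(token) for token in value.split(",")]
        except ValueError:
            return None


element = CsvToInts()
print(element.pretransform_single("1,2,3"))
print(element.pretransform_single("1,x"))
```

With this layout a concrete subclass supplies only the parser, and every failure path yields the same kind of marker object instead of raising.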
75 changes: 10 additions & 65 deletions molpipeline/abstract_pipeline_elements/core.py
@@ -97,21 +97,23 @@ class ABCPipelineElement(abc.ABC):

def __init__(
self,
name: str = "ABCPipelineElement",
name: Optional[str] = None,
n_jobs: int = 1,
uuid: Optional[str] = None,
) -> None:
"""Initialize ABCPipelineElement.
Parameters
----------
name: str
name: Optional[str], optional (default=None)
Name of PipelineElement
n_jobs: int
Number of cores used for processing.
uuid: Optional[str]
Unique identifier of the PipelineElement.
"""
if name is None:
name = self.__class__.__name__
self.name = name
self.n_jobs = n_jobs
if uuid is None:
@@ -182,12 +184,12 @@ def get_params(self, deep: bool = True) -> dict[str, Any]:
"uuid": self.uuid,
}

def set_params(self, **parameters: dict[str, Any]) -> Self:
def set_params(self, **parameters: Any) -> Self:
"""As the setter function cannot be accessed with super(), this method is implemented for inheritance.
Parameters
----------
parameters: dict[str, Any]
parameters: Any
Parameters to be set.
Returns
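
The annotation change in this hunk (`**parameters: dict[str, Any]` to `**parameters: Any`) rests on how Python types variadic keyword arguments: the annotation describes each individual value, while the collected `parameters` object is always a `dict` keyed by `str`. A minimal sketch follows; `collect_params` is a hypothetical helper written for illustration, not a molpipeline function.

```python
from typing import Any


def collect_params(**parameters: Any) -> dict[str, Any]:
    """Return the keyword arguments unchanged.

    `Any` annotates each value. Writing `dict[str, Any]` here instead would
    tell a type checker that every single value must itself be a dict.
    """
    # Inside the function, `parameters` already has type dict[str, Any].
    return dict(parameters)


params = collect_params(n_jobs=2, name="smi2mol")
print(params)
```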
@@ -338,15 +340,15 @@ class TransformingPipelineElement(ABCPipelineElement):

def __init__(
self,
name: str = "ABCPipelineElement",
name: Optional[str] = None,
n_jobs: int = 1,
uuid: Optional[str] = None,
) -> None:
"""Initialize ABCPipelineElement.
Parameters
----------
name: str
name: Optional[str], optional (default=None)
Name of PipelineElement
n_jobs: int
Number of cores used for processing.
@@ -377,12 +379,12 @@ def parameters(self) -> dict[str, Any]:
return self.get_params()

@parameters.setter
def parameters(self, **parameters: dict[str, Any]) -> None:
def parameters(self, **parameters: Any) -> None:
"""Set the parameters of the object.
Parameters
----------
parameters: dict[str, Any]
parameters: Any
Object parameters as a dictionary.
Returns
@@ -616,25 +618,6 @@ class MolToMolPipelineElement(TransformingPipelineElement, abc.ABC):
_input_type = "RDKitMol"
_output_type = "RDKitMol"

def __init__(
self,
name: str = "MolToMolPipelineElement",
n_jobs: int = 1,
uuid: Optional[str] = None,
) -> None:
"""Initialize MolToMolPipelineElement.
Parameters
----------
name: str
Name of the PipelineElement.
n_jobs: int
Number of cores used for processing.
uuid: Optional[str]
Unique identifier of the PipelineElement.
"""
super().__init__(name=name, n_jobs=n_jobs, uuid=uuid)

def transform(self, values: list[OptionalMol]) -> list[OptionalMol]:
"""Transform list of molecules to list of molecules.
@@ -700,25 +683,6 @@ class AnyToMolPipelineElement(TransformingPipelineElement, abc.ABC):

_output_type = "RDKitMol"

def __init__(
self,
name: str = "AnyToMolPipelineElement",
n_jobs: int = 1,
uuid: Optional[str] = None,
) -> None:
"""Initialize AnyToMolPipelineElement.
Parameters
----------
name: str
Name of the PipelineElement.
n_jobs: int
Number of cores used for processing.
uuid: Optional[str]
Unique identifier of the PipelineElement.
"""
super().__init__(name=name, n_jobs=n_jobs, uuid=uuid)

def transform(self, values: Any) -> list[OptionalMol]:
"""Transform list of instances to list of molecules.
@@ -756,25 +720,6 @@ class MolToAnyPipelineElement(TransformingPipelineElement, abc.ABC):

_input_type = "RDKitMol"

def __init__(
self,
name: str = "MolToAnyPipelineElement",
n_jobs: int = 1,
uuid: Optional[str] = None,
) -> None:
"""Initialize MolToAnyPipelineElement.
Parameters
----------
name: str
Name of the PipelineElement.
n_jobs: int
Number of cores used for processing.
uuid: Optional[str]
Unique identifier of the PipelineElement.
"""
super().__init__(name=name, n_jobs=n_jobs, uuid=uuid)

@abc.abstractmethod
def pretransform_single(self, value: RDKitMol) -> Any:
"""Transform the molecule, but skip parameters learned during fitting.
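
The three `__init__` overrides deleted from this file existed only to give each abstract class its own default `name`. With `name` defaulting to `None` and falling back to `self.__class__.__name__`, the base constructor yields a class-specific name for every subclass. The sketch below illustrates this under stated assumptions: `ElementBase`, `MolToMolStub`, and `SaltRemoverStub` are hypothetical classes, and generating the identifier with `uuid4` is an assumption, since the diff does not show that branch.

```python
from typing import Optional
from uuid import uuid4


class ElementBase:
    """Hypothetical base class using the name fallback from the diff."""

    def __init__(
        self,
        name: Optional[str] = None,
        n_jobs: int = 1,
        uuid: Optional[str] = None,
    ) -> None:
        if name is None:
            # Resolved at runtime, so subclasses get their own class name.
            name = self.__class__.__name__
        self.name = name
        self.n_jobs = n_jobs
        # Assumption: a random identifier is generated when none is given.
        self.uuid = str(uuid4()) if uuid is None else uuid


class MolToMolStub(ElementBase):
    """No __init__ override is needed just to change the default name."""


class SaltRemoverStub(MolToMolStub):
    """A further subclass inherits the same behaviour."""


print(MolToMolStub().name)
print(SaltRemoverStub().name)
print(SaltRemoverStub(name="custom").name)
```

An explicitly passed `name` still takes precedence, so existing callers that set a name are unaffected by this fallback.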