Implementing support for partial charge I/O in SDF #281

Merged
merged 46 commits into from
Apr 11, 2020

Conversation

j-wags
Member

@j-wags j-wags commented Apr 18, 2019

  • Closes #250 (Add support for loading partial charges from SDF tags using new standard) by adding support for partial charge I/O in SDF. The partial charges are stored as a property in the SDF molecule block under the tag <atom.dprop.PartialCharge>.
  • Closes #524 (to/from_openeye can not distinguish None vs. all-zero partial charges).
  • If an OFFMol's partial_charges attribute is set to None (the default value), calling to_openeye will now produce an OE molecule with partial charges set to nan. This would previously produce an OE molecule with partial charges of 0.0, which was a loss of information, since it wouldn't be clear whether the original OFFMol's partial charges were all-zero or None. OpenEye toolkit wrapper methods such as from_smiles and from_file now produce OFFMols with partial_charges = None when appropriate (previously these would produce OFFMols with all-zero charges).
  • Per the new SDF partial charge specification adopted by RDKit, Molecule.to_rdkit now sets partial charges on the RDAtom's PartialCharges property (this was previously set on the partial_charges property). If the OFFMol's partial_charges attribute is None, this property will not be defined.
  • Enforce the behavior during SDF I/O that an SDF may contain multiple MOLECULES, but that the OFF Toolkit will NEVER assume that it contains multiple CONFORMERS of the same molecule. This is an important distinction, since otherwise there is ambiguity around whether properties of one entry in an SDF are shared among several molecule blocks or not (More info here). If the user requests the OFF Toolkit to write a multi-conformer Molecule to SDF, only the first conformer will be written. For more fine-grained control of writing properties, conformers, and partial charges, users will need to call Molecule.to_rdkit or Molecule.to_openeye and use the flexibility offered by those packages.
  • Due to different constraints placed on the data types allowed by external toolkits, we make our best effort to preserve offmol.properties when converting molecules to other packages, but users should be aware that no guarantee of data integrity is made. The only data format for keys and values in the offmol.properties dict that we will try to support through a roundtrip to another toolkit's Molecule object is string.
  • Adds tests
  • Updates release notes
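
The None vs. all-zero distinction described above can be sketched in plain Python. This is only an illustration: `charges_to_toolkit` and `charges_from_toolkit` are hypothetical helper names, not functions from the toolkit, and treating an all-NaN list as "no charges" on import is an assumption about how the round trip is closed.

```python
import math
from typing import List, Optional


def charges_to_toolkit(charges: Optional[List[float]], n_atoms: int) -> List[float]:
    """Export charges; None becomes one NaN per atom so it stays distinct from zeros."""
    if charges is None:
        return [math.nan] * n_atoms
    return [float(q) for q in charges]


def charges_from_toolkit(values: List[float]) -> Optional[List[float]]:
    """Import charges; an all-NaN list is read back as 'no charges assigned' (assumption)."""
    if values and all(math.isnan(q) for q in values):
        return None
    return list(values)
```

With a 0.0 sentinel, both `None` and `[0.0, 0.0, 0.0]` would export to the same values; with NaN they remain distinguishable after the round trip.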

Status

  • Ready for review
  • Ready for merge

@lgtm-com

lgtm-com bot commented Jan 29, 2020

This pull request introduces 1 alert when merging 5c2a31c into 5d97594 - view on LGTM.com

new alerts:

  • 1 for Unreachable code

@codecov-io

codecov-io commented Jan 29, 2020

Codecov Report

Merging #281 into master will increase coverage by 11.85%.
The diff coverage is 100%.

@lgtm-com

lgtm-com bot commented Feb 9, 2020

This pull request introduces 1 alert when merging e48f05c into ee0f715 - view on LGTM.com

new alerts:

  • 1 for Suspicious unused loop iteration variable

@@ -501,9 +493,66 @@ def to_file(self, molecule, file_path, file_format):
ofs = oechem.oemolostream(file_path)
openeye_format = getattr(oechem, 'OEFormat_' + file_format)
ofs.SetFormat(openeye_format)

# OFFTK strictly treats SDF as a single-conformer format.
# We need to override OETK's behavior here if the user is saving a multiconformer molecule.
Collaborator

@jthorton jthorton Feb 24, 2020


I'm slightly confused by this comment: it sounds like we can save a multiconformer molecule to SDF, but we throw away all but the first conformer. I think the spec also says we only write the first conformer to file. So maybe change this to something like
# remove all but the first conformer when writing to SDF as we only support single conformer format?

Collaborator

Ohh, I think I misunderstood what you meant. So we can write an SDF containing multiple molecules, where each molecule could be a conformer of the same molecule; in a round-trip test we won't quite get the input molecule back as one molecule with N conformers, but we get back N molecules that the user can then condense if they want to?

Member Author

Good call -- I've added that comment.

Member Author

Currently, we can NOT write an SDF with multiple conformers/molecules. Though, users are free to concatenate several single-molecule SDFs to get a "multi-conformer" or "multi-molecule" SDF.
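
The concatenation workaround mentioned here could look roughly like the sketch below. It works on SDF text held in strings and relies only on the standard `$$$$` record terminator; the function names are hypothetical and nothing in it comes from the toolkit.

```python
from typing import List

SDF_DELIMITER = "$$$$"


def concatenate_sdf_records(records: List[str]) -> str:
    """Join single-molecule SDF texts into one multi-record SDF string."""
    parts = []
    for record in records:
        body = record.rstrip("\n")
        # Append the record terminator only if the input does not already end with it.
        if not body.endswith(SDF_DELIMITER):
            body += "\n" + SDF_DELIMITER
        parts.append(body)
    return "\n".join(parts) + "\n"


def count_sdf_records(sdf_text: str) -> int:
    """Count records by counting lines that consist of the terminator."""
    return sum(1 for line in sdf_text.splitlines() if line.strip() == SDF_DELIMITER)
```

Reading such a file back yields one molecule per record, so N written conformers come back as N separate molecules rather than one multi-conformer molecule.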


# Then, we take any SD data pairs that were on the oemol, and copy them on to "this_conf_oemcmol".
# These SD pairs will be populated if we're dealing with a single-conformer SDF.
for dp in oechem.OEGetSDDataPairs(oemol):
Collaborator


Just to make sure I follow this bit: if it is a single-conformer SDF, the tags are on the oemol and will be transferred in the first loop; if it's a multi-conformer SDF, the first loop will find no tags, as they are on the conformer molecule instead?

Member Author

Exactly. It's super confusing, but this logic is intended to handle both scenarios.
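
As a toolkit-free illustration of the two scenarios, the tag gathering can be pictured as merging two dictionaries, only one of which is populated in each case. `collect_sd_tags` is a hypothetical name, and letting conformer-level tags take precedence on a clash is an assumption of this sketch, not something stated in the thread.

```python
from typing import Dict


def collect_sd_tags(mol_tags: Dict[str, str], conf_tags: Dict[str, str]) -> Dict[str, str]:
    """Gather the SD tags for one conformer from both places they may live."""
    merged = dict(mol_tags)   # populated when the input was a single-conformer SDF
    merged.update(conf_tags)  # populated when the input was a multi-conformer SDF
    return merged
```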

@jthorton
Collaborator

Overall this looks great, with really good test coverage. I was slightly confused following the spec and implementation of what would happen when writing a multi-conformer molecule to SDF, but I think I follow now. Nothing blocking.

@mattwthompson mattwthompson self-requested a review April 3, 2020 18:39
Member

@mattwthompson mattwthompson left a comment


This PR seems to have progressed to cover several features - awesome! I think this is good to go once a few little things I found are resolved.

General comments:

  1. Is the basic round-tripping of partial charges covered? RDKit and OpenEye I figure are already covered, but is the disk round-trip covered somewhere else? Like
>>> from openforcefield.topology.molecule import Molecule
>>> from openforcefield.tests.test_forcefield import create_ethanol
>>> import numpy as np
>>> eth = create_ethanol()
>>> ref_charges = eth.partial_charges
>>> np.allclose(ref_charges, Molecule.from_openeye(eth.to_openeye()).partial_charges)
True
>>> # Repeat with RDKit and .sdf file
  2. Is there anything in the RDKit block of tests that is fundamentally different than the OpenEye block? I took a more careful look at the latter.

  3. There is a conflict between this state of the API and what is plausibly a future state, in particular with respect to reading in multiple conformers and/or molecules from an SDF file, and how to handle the case of different conformers having different properties. For example, this patch will write only the first conformer of a multi-conformer OpenFF Molecule, whereas in the future that may be different. What is our plan here?

@pytest.mark.skipif(not OpenEyeToolkitWrapper.is_available(), reason='OpenEye Toolkit not available')
def test_write_sdf_charges(self):
"""Test OpenEyeToolkitWrapper for writing partial charges to a sdf file"""
from openforcefield.tests.test_forcefield import create_ethanol
Member


This import (and a few elsewhere) seems to be duplicated from the top of the file

Member Author

Good catch. Removed in b26f19e.


@pytest.mark.skipif(not OpenEyeToolkitWrapper.is_available(), reason='OpenEye Toolkit not available')
def test_write_sdf_no_charges(self):
"""Test OpenEyeToolkitWrapper for importing a charges from a sdf file"""
Member


This docstring should be updated

Member Author

Good call. Fixed this and the corresponding RDKit test docstring in 476cdbd

Comment on lines 408 to 409
"""Test OpenEyeToolkitWrapper for performing a round trip of a molecule with partial charge to and from
a sdf file"""
Member


This could be updated to clarify that it's only testing the properties part of things

Member Author

Good thinking. Updated in 9f6b550.

@j-wags
Member Author

j-wags commented Apr 9, 2020

Thanks @mattwthompson and @jthorton for the careful reviews -- I really appreciate the attention to detail. Having some fresh pairs of eyes really helped catch things I was totally glossing over!

Is the basic round-tripping of partial charges covered?

We cover partial charge roundtrips for SDF in the new tests in this PR, and we previously tested for round tripping with OE/RDMols in tests like test_to_from_openeye_core_props_filled and test_to_from_rdkit_core_props_filled. "Core properties" are how I tried to distinguish "functional" attributes of a molecule, and their counterpart is "secondary" properties (anything that gets stuffed into the offmol.properties dict). A naive attempt at spec equivalence between toolkits can be found here: #135 (when I have time, I'll bake this into the developer docs as well)

Is there anything in the RDKit block of tests that is fundamentally different than the OpenEye block? I took a more careful look at the latter.

I tried to make the tests as identical as possible between the two. In a future refactor, we could likely use pytest.mark.parametrize to loop over the exact same tests for both.
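
The "same test body, once per toolkit" idea can be sketched as follows. With pytest this would normally be written with the `pytest.mark.parametrize` decorator; the sketch swaps in the standard library's `unittest.subTest` purely so that it runs without extra dependencies. The two stub classes and their `roundtrip_charges` method are hypothetical stand-ins, not the real toolkit wrappers.

```python
import io
import unittest


class StubToolkitA:
    """Hypothetical stand-in for one toolkit wrapper."""

    def roundtrip_charges(self, charges):
        return list(charges)


class StubToolkitB:
    """Hypothetical stand-in for a second toolkit wrapper."""

    def roundtrip_charges(self, charges):
        return [float(q) for q in charges]


class TestChargeRoundTrip(unittest.TestCase):
    def test_roundtrip_for_each_toolkit(self):
        charges = [-0.4, 0.1, 0.3]
        # One test body, executed once per wrapper class.
        for wrapper_cls in (StubToolkitA, StubToolkitB):
            with self.subTest(toolkit=wrapper_cls.__name__):
                self.assertEqual(wrapper_cls().roundtrip_charges(charges), charges)


def run_suite() -> unittest.TestResult:
    """Run the test case quietly and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestChargeRoundTrip)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

A failure in one sub-test is reported with the toolkit name attached, which keeps the per-toolkit diagnostics that separate test functions currently provide.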

There is a conflict between this state of the API and what is plausibly a future state, in particular with respect to reading in multiple conformers and/or molecules from an SDF file, and how to handle the case of different conformers having different properties. For example, this patch will write only the first conformer of a multi-conformer OpenFF Molecule, whereas in the future that may be different. What is our plan here?

Thanks for describing this so succinctly.

So, the status as of this posting is that:

  • this branch can read multiple molecules from SDF, no problem.
  • this branch can not collapse multiple conformers of the same molecule from an SDF into a single, multi-conformer molecule
  • this branch can write a single conformer of a molecule to SDF
  • this branch can not write multiple molecules, or multiple conformers of one molecule, to SDF
  • this branch can successfully preserve the core properties of a molecule (except conformers other than the first) through an SDF round trip

Our functionality is limited with regard to "multi-conformer" SDFs because our molecule object model is fundamentally at odds with "multiconformer" SDF. In a multi-molecule SDF, different molecules/conformers can have different property values under the same property name (for example, each conformer could have different partial charges). We can not reconcile this with our Molecule object, in which no data can be attached to individual conformers.

In the future, we'd like to be in a state where we can read and write multi-conformer SDFs. This will require a considerable amount of thought and design, and we may find that the shortest path to reliable and unsurprising behavior is a fundamental redesign of our Molecule class. This is not trivial -- I haven't found any such functionality in RDKit, and the only reference to it is an untouched issue from 2016: rdkit/rdkit#1125. OpenEye has an entire design docs section specifically discussing behavior around multiconformer SDFs.

If, in the future, we do implement support for multiconformer SDFs, we could implement a backward-compatible API as follows:

  • offmol.to_file('multiconformer.sdf', write_multiple_conformers=True, file_format='sdf')
  • Molecule.from_file('multiconformer.sdf', collapse_conformers=True, file_format='sdf')

By default, the write_multiple_conformers and collapse_conformers kwargs could be False, preserving the current behavior.
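
A minimal sketch of how such an opt-in keyword could preserve today's behavior by default. The helper `conformers_to_write` is hypothetical and only models the conformer selection; it does not write any file, and the real signature, if it is ever implemented, may differ.

```python
from typing import List, Sequence


def conformers_to_write(
    conformers: Sequence[list], write_multiple_conformers: bool = False
) -> List[list]:
    """Pick which conformers would be written to SDF."""
    if not conformers:
        return []
    if write_multiple_conformers:
        return list(conformers)
    # Default: current behavior, only the first conformer is written.
    return [conformers[0]]
```

Because the flag defaults to False, existing callers that never pass it would see no change in output.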

In the short term, we can save an OFFMol to disk by pickling the results of offmol.to_dict. In the long term, we can work to implement the API above.
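
The short-term pickling route might look like the sketch below. The dictionary is a hand-written stand-in for the output of offmol.to_dict (the real keys and value types may differ), and the round trip is done in memory to keep the example self-contained; writing to disk would use `pickle.dump` and `pickle.load` on a file opened in binary mode.

```python
import pickle

# Hypothetical stand-in for the output of offmol.to_dict(); real keys may differ.
molecule_dict = {
    "name": "ethanol",
    "partial_charges": None,
    "conformers": [[[0.0, 0.0, 0.0]], [[0.1, 0.0, 0.0]]],
    "properties": {"source": "example"},
}

# Serialize and restore; every conformer survives, unlike the SDF route.
serialized = pickle.dumps(molecule_dict)
restored = pickle.loads(serialized)
```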

Member

@mattwthompson mattwthompson left a comment


Everything I found earlier has been fixed, and what you laid out seems like a reasonable path forward to how the future API will interact with these changes. I think this is good to go (once tests pass)

@j-wags
Member Author

j-wags commented Apr 11, 2020

The failing tests are just the DDoS protection on the DOI link; all other tests are passing. The DOI link issue can be handled separately. I'm going to merge this PR.
