-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementing support for partial charge I/O in SDF #281
Conversation
This pull request introduces 1 alert when merging 5c2a31c into 5d97594 - view on LGTM.com new alerts:
|
This pull request introduces 1 alert when merging e48f05c into ee0f715 - view on LGTM.com new alerts:
|
…no charge and 0 charge in OETK I/O
@@ -501,9 +493,66 @@ def to_file(self, molecule, file_path, file_format): | |||
ofs = oechem.oemolostream(file_path) | |||
openeye_format = getattr(oechem, 'OEFormat_' + file_format) | |||
ofs.SetFormat(openeye_format) | |||
|
|||
# OFFTK strictly treats SDF as a single-conformer format. | |||
# We need to override OETK's behavior here if the user is saving a multiconformer molecule. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly confused with this comment as it sounds like we can save multiconformer molecule into the sdf but we throw away all but the first conformer, I think in the spec you say we only write the first conformer to file as well. So maybe change this to something like
# remove all but the first conformer when writing to SDF as we only support single conformer format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh I think I misunderstood what you mean, so we can write an SDF with multiple molecules in where each molecule could be a conformer of the same molecule, so in a round trip test we won't quite get the input molecule back as a molecule with N conformers but we get back N molecules that the user can then condense if they want to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call -- I've added that comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we can NOT write an SDF with multiple conformers/molecules. Though, users are free to concatenate several single-molecule SDFs to get a "multi-conformer" or "multi-molecule" SDF.
|
||
# Then, we take any SD data pairs that were on the oemol, and copy them on to "this_conf_oemcmol". | ||
# These SD pairs will be populated if we're dealing with a single-conformer SDF. | ||
for dp in oechem.OEGetSDDataPairs(oemol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I follow this bit we are saying if it is a single conformer SDF the tags are in the oemol
and will be transferred in the first loop else if its a multi conformer SDF the first loop will have no tags as they are instead in the conf
molecule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. It's super confusing, but this logic is intended to handle both scenarios.
Overall this looks great with really good test coverage. I was slightly confused following the spec and implementation of what would happen when writing a multi conformer molecule to SDF but I think I follow now. Nothing blocking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR seems to have progressed to cover several features - awesome! I think this is good to go once a few little things I found are resolved.
General comments:
- Is the basic round-tripping of partial charges covered? RDKit and OpenEye I figure are already covered, but is the disk round-trip covered elsewhere somewhere? Like
>>> from openforcefield.topology.molecule import Molecule
>>> from openforcefield.tests.test_forcefield import create_ethanol
>>> import numpy as np
>>> eth = create_ethanol()
>>> ref_charges = eth.partial_charges
>>> np.allclose(ref_charges, Molecule.from_openeye(eth.to_openeye()).partial_charges)
True
>>> # Repeat with RDKit and .sdf file
-
Is there anything in the RDKit block of tests that is fundamentally different than the OpenEye block? I took a more careful look at the latter.
-
There is a conflict between this state of the API and what is plausibly a future state, in particular with respect to reading in multiple conformers and/or molecules from an SDF file, and how to handle the case of different conformers having different properties. For example, in this patch will write only the first conformer of a multi-conformer OpenFF
Molecule
, whereas in the future we that may be different. What is our plan here?
@pytest.mark.skipif(not OpenEyeToolkitWrapper.is_available(), reason='OpenEye Toolkit not available') | ||
def test_write_sdf_charges(self): | ||
"""Test OpenEyeToolkitWrapper for writing partial charges to a sdf file""" | ||
from openforcefield.tests.test_forcefield import create_ethanol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This import (and a few elsewhere) seems to be duplicated from the top of the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Removed in b26f19e.
|
||
@pytest.mark.skipif(not OpenEyeToolkitWrapper.is_available(), reason='OpenEye Toolkit not available') | ||
def test_write_sdf_no_charges(self): | ||
"""Test OpenEyeToolkitWrapper for importing a charges from a sdf file""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This docstring should be updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. Fixed this and the corresponding RDKit test docstring in 476cdbd
"""Test OpenEyeToolkitWrapper for performing a round trip of a molecule with partial charge to and from | ||
a sdf file""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be updated to clarify that it's only testing the properties
part of things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thinking. Updated in 9f6b550.
Co-Authored-By: Matt Thompson <[email protected]>
Co-Authored-By: Matt Thompson <[email protected]>
Thanks @mattwthompson and @jthorton for the careful reviews -- I really appreciate the attention to detail. Having some fresh pairs of eyes really helped catch things I was totally glossing over!
We cover partial charge roundtrips for SDF in the new tests in this PR, and we previously tested for round tripping with OE/RDMols in tests like
I tried to make the tests as identical as possible between the two. In a future refactor, we could likely use
Thanks for describing this so succinctly. So, the status as of this posting is that:
Our functionality is limited with regard to "multi-conformer" SDFs because our molecule object model is fundamentally at odds with "multiconformer" SDF. In a multi-molecule SDF, different molecules/conformers can have different property values under the same property name (for example, each conformer could have different partial charges). We can not reconcile this with our Molecule object, in which no data can be attached to individual conformers. In the future, we'd like to be in a state where we can read and write multi-conformer SDFs. This will require a considerable amount of thought and design, and we may find that the shortest path to reliable and unsurprising behavior is a fundamental redesign of our Molecule class. This is not trivial -- I haven't found any such functionality in RDKit, and the only reference to it is an untouched issue from 2016: rdkit/rdkit#1125. OpenEye has an entire design docs section specifically discussing behavior around multiconformer SDFs. If, in the future, we do implement support for multiconformer SDFs, we could implement a reverse-compatible API as follows:
By default, the In the short term, we can save an OFFMol to disk by pickling the results of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything I found earlier has been fixed, and what you laid out seems like a reasonable path forward to how the future API will interact with these changes. I think this is good to go (once tests pass)
The failing tests are just the DDOS protection on the DOI link -- All other tests are passing. The DOI link issue can be handled separately. I'm going to merge this PR. |
<atom.dprop.PartialCharge>
partial_charges
attribute is set toNone
(the default value), callingto_openeye
will now produce a OE molecule with partial charges set tonan
. This would previously produce an OE molecule with partial charges of 0.0, which was a loss of information, since it wouldn't be clear whether the original OFFMol's partial charges were all-zero orNone
. OpenEye toolkit wrapper methods such asfrom_smiles
andfrom_file
now produce OFFMols withpartial_charges = None
when appropriate (previously these would produce OFFMols with all-zero charges).Molecule.to_rdkit
now sets partial charges on the RDAtom'sPartialCharges
property (this was previously set on thepartial_charges
property). If the OFFMol's partial_charges attribute is None, this property will not be defined.Molecule
to SDF, only the first conformer will be written. For more fine-grained control of writing properties, conformers, and partial charges, users will need to callMolecule.to_rdkit
orMolecule.to_openeye
and use the flexibility offered by those packages.offmol.properties
when converting molecules to other packages, but users should be aware that no guarantee of data integrity is made. The only data format for keys and values in theoffmol.property
dict that we will try to support through a roundtrip to another toolkit's Molecule object isstring
.Status