Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support reading/writing atom props from SD files #2297

Merged
merged 23 commits into from
Mar 3, 2019

Conversation

greglandrum
Copy link
Member

This still needs documentation in the "RDKit Book", but the basic idea is to allow automatic extraction of atomic property values from SDF data fields.
Here's a sample SDF showing the format:

test-mol
     RDKit  2D

  3  3  0  0  0  0  0  0  0  0999 V2000
    0.8660    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4330    0.7500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4330   -0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
  3  1  1  0
M  END
>  <atom.dprop.PartialCharge>  (1) 
0.008 -0.314 0.008

>  <atom.iprop.NumHeavyNeighbors>  (1) 
2 2 2

>  <atom.prop.AtomLabel>  (1) 
C1 N2 C3

>  <atom.bprop.IsCarbon>  (1) 
1 0 1

>  <atom.prop.PartiallyMissing>  (1) 
one n/a three

>  <atom.iprop.PartiallyMissingInt>  (1) 
[?] 2 2 ?
$$$$

This does not happen by default when processing SDFs; you have to call supp.SetProcessPropertyLists(True) in order to have it go.

@greglandrum greglandrum added this to the 2019_03_1 milestone Feb 27, 2019
boost::token_compress_on);
if (tokens.size() < mol.getNumAtoms()) {
BOOST_LOG(rdWarningLog)
<< "Property list " << pn << " too short, only " << tokens.size()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this a bit clearer

Atom Property List ... found, require properties for X atoms.  Ignoring it.

<< " elements found. Ignoring it." << std::endl;
continue;
}
std::string mv = missingValueMarker;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the standard missingValueMarker n/a?

So with this code, one can't write: [n/a] 1 2 since the token can only be one character. Also, what if the form is:

[] 1 2

This would make the missing token "]"

I would take everything between [ and ] for consistency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was incorrect about what substr was doing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add regression test for atom prop with multiple chars

[n/a] 1 2 [n/a]

<< std::endl;
}
unsigned int atomid = i - first_token;
mol.getAtomWithIdx(atomid)->setProp(atompn, apv);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make it clear in the spec that "missing values" don't assign properties at all. This will be important in the docs.

}
inline void processMolPropertyLists(
ROMol &mol, const std::string &missingValueMarker = "n/a") {
applyMolListPropsToAtoms<std::string>(mol, "atom.prop.", missingValueMarker);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might think about moving the loop through getPropList out of the apply:

for(auto prop:mol.getPropList()) {
  if (prop.find("atom.prop.") == 0 ...)
     applyMolListPropsToAtoms<std::string>(mol, prop, missingValueMarker);
}

This would remove some work (3 extra loops and checks) and perhaps make the code a bit clearer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this would make the code less flexible.
It would save a bit of work, but this is not going to end up being a measurable fraction of the time required to construct a molecule, so I think that's ok

@@ -106,13 +106,21 @@ class RDKIT_FILEPARSERS_EXPORT ForwardSDMolSupplier : public MolSupplier {
virtual ROMol *next();
virtual bool atEnd();

void setProcessPropertyLists(bool val) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for this not to be the default (true)?

self.assertTrue(m.HasProp("atom.prop.AtomLabel"))
self.assertFalse(m.GetAtomWithIdx(0).HasProp("AtomLabel"))

sio = BytesIO(self.sdf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the python api, I feel that exposing the property processing to python outside of the forward mol supplier might not be such a bad idea.

@bp-kelley
Copy link
Contributor

@greglandrum I added my comments, I think a few things need to be resolved, but it mostly looks god.

Copy link
Contributor

@bp-kelley bp-kelley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the parser should accept

[n/a] 1 2 n/a

which it currently doesn't (see comments).

We can get rid of some loops in the atom prop handling by removing the loops out of the lowest function.

<< " elements found. Ignoring it." << std::endl;
continue;
}
std::string mv = missingValueMarker;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add regression test for atom prop with multiple chars

[n/a] 1 2 [n/a]

first_token = 1;
}
if(mv.empty()){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants