-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable GLYCAM06j-1 forcefield conversion #156
Conversation
@jchodera @mikemhenry @rafwiewiora : could you help me identify why there are no glycan residues in |
Not sure if it helps or not but this site seems to have the parameter files: |
We managed to get to the root of the issue, and I will try a full conversion today! |
I was able to address the "primary showstopper" (i.e. getting glycan residues into the ffxml, described in detail above) with the help of John's comment. What I did:
This generated After generating the ffxml, I was getting errors like:
But there should be a third line for Here is the function:
And I added the following lines after this line:
After implementing this fix, I re-ran
My temporary fix: At this line, I replace When I re-run with this fix, I can compare the omm vs amber energies:
Notice there are two PeriodicTorsionForces in the amber energies..? And the NonbondedForce energies differ by a lot.. |
amber/convert_amber.py
Outdated
@@ -1247,6 +1258,68 @@ def validate_merged_lipids(ffxml_name, leaprc_name): | |||
os.unlink(f[1]) | |||
if verbose: print('Lipids energy validation for %s done!' % ffxml_name) | |||
|
|||
def validate_glyco_protein(ffxml_name, leaprc_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use this test idea?
@zhang-ivy Can you paste the XML file you generated as a gist? |
@jchodera : Here are the system xmls ( |
Oh, sorry, I was hoping to see the ffxml! |
@jchodera : The ffxml I generated is included in this PR: https://github.com/openmm/openmmforcefields/blob/b73b6b0e420a737bb315da79b1ba4a66397caf33/amber/ffxml/GLYCAM_06j-1_trunc.xml |
@peastman helped me identify that the reason why there is an energy discrepancy between the amber vs openmm-generated test systems is because the prefixes are not correct in I think I'll want to modify this function such that it iterates through all atom types in GLYCAM_06j.dat and if the atom type is present in ff14SB, add the @peastman and @jchodera : Please let me know what other considerations I need to take into account in order to write the appropriate changes necessary to address this prefix issue. |
Let me try to go over all the issues we need to consider. Hopefully I'm not missing any! Prefixes were a compromise to deal with differences in how OpenMM and Amber think about atom types. OpenMM expects them to be globally unique. If you load two files together, they can't each define a type with the same name. But Amber sometimes has multiple files listing the same type. The solution was to add prefixes like Each atom type definition also defines a class, which is the same name but without the prefix:
When defining parameters, they can be defined either by openmmforcefields/amber/ffxml/protein.ff14SB.xml Line 2823 in ca9319f
and openmmforcefields/amber/ffxml/DNA.bsc1.xml Line 1113 in ca9319f
These are basically the same definition, but one using protein types and the other using DNA types. This allows you to load either just the protein force field or just the DNA force field, and it has all parameters it needs. Or you can load both of them together, and you don't have to worry about duplicate or conflicting definitions. In principle we could separate all the common definitions into a different file that would specify them by class rather than type. That would be a substantial change though. All of this works fine for protein, DNA, and RNA. Those are separate molecules so they can each be parametrized independently. But glycans are bonded to other molecules. A single bond/angle/torsion might include both protein and glycan atom types. We need to figure out how to handle that. Atom types appear in one other place. ParmEd/ParmEd#1149 added a list of atom types that should have 1-4 interactions scaled differently. That list appears in an embedded script generated by ParmEd. We could make the conversion script modify them the same way it modifies types in other places, but it might be cleaner to modify ParmEd so it can generate types with correct prefixes right from the start. |
While working on this, I ran into what I think is a bug in the glycam force field.
However, the atom type NH is not defined. This is the only place it is mentioned anywhere in the file. It also is not a standard type used in the protein, DNA, or RNA force fields. |
Based on our meeting today, I created a script to post-process the XML file. It makes the following assumptions about the input file:
It creates a modified XML file with the following changes.
Here's a first draft of it. import xml.etree.ElementTree as etree
import re
import textwrap
# Define atom types
protein_types = set(["C", "CA", "CB", "CC", "CN", "CR", "CT", "CV", "CW", "C*", "CX", "H", "HC", "H1", "HA", "H4", "H5",
"HO", "HS", "HP", "N", "NA", "NB", "N2", "N3", "O", "O2", "OH", "S", "SH", "CO", "2C", "3C", "C8"])
solvent_types = set(["HW", "OW", "Li+", "Na+", "K+", "Rb+", "Cs+", "F-", "Cl-", "Br-", "I-"])
glycam_types = set()
replacements = {}
for type in protein_types:
replacements[type] = 'protein-'+type
# Process <AtomTypes>
tree = etree.parse('glycam.xml')
root = tree.getroot()
types = root.find('AtomTypes')
for type in types.findall('Type'):
name = type.get('name')
if name in protein_types or name in solvent_types:
types.remove(type)
else:
glycam_types.add(name)
replacements[name] = 'glycam-'+name
type.set('name', replacements[name])
# Process <Residues>
residues = root.find('Residues')
for residue in residues.findall('Residue'):
for atom in residue.findall('Atom'):
atom.set('type', replacements[atom.get('type')])
# Process <HarmonicBondForce>
force = root.find('HarmonicBondForce')
for bond in force.findall('Bond'):
types = [bond.get('type1'), bond.get('type2')]
if any(t in glycam_types for t in types):
bond.set('type1', replacements[types[0]])
bond.set('type2', replacements[types[1]])
else:
force.remove(bond)
# Process <HarmonicAngleForce>
force = root.find('HarmonicAngleForce')
for angle in force.findall('Angle'):
types = [angle.get('type1'), angle.get('type2'), angle.get('type3')]
if any(t in glycam_types for t in types):
angle.set('type1', replacements[types[0]])
angle.set('type2', replacements[types[1]])
angle.set('type3', replacements[types[2]])
else:
force.remove(angle)
# Process <PeriodicTorsionForce>
force = root.find('PeriodicTorsionForce')
for tag in ['Proper', 'Improper']:
for torsion in force.findall(tag):
types = [torsion.get('type1'), torsion.get('type2'), torsion.get('type3'), torsion.get('type4')]
if any(t in glycam_types for t in types):
torsion.set('type1', replacements[types[0]])
torsion.set('type2', replacements[types[1]])
torsion.set('type3', replacements[types[2]])
torsion.set('type4', replacements[types[3]])
else:
force.remove(torsion)
# Process <NonbondedForce>
force = root.find('NonbondedForce')
for atom in force.findall('Atom'):
type = atom.get('type')
if type in glycam_types:
atom.set('type', replacements[type])
else:
force.remove(atom)
# Update the unscaled types in the script
script = root.find('Script')
text = script.text
types = ', '.join('"%s"' % replacements[s] for s in sorted(glycam_types))
types = '\n '.join(textwrap.wrap(types))
pattern = re.compile('unscaled_types = \[(.*\n)*?.*?\]')
script.text = pattern.sub('unscaled_types = ['+types+']', text)
tree.write('modified.xml') |
Thanks @peastman ! Excited to try that out. Here is what I did:
# Read lines from prep file
with open("leaprc_GLYCAM_06j-1_2014-03-14/GLYCAM_06j-1.prep", "r") as f:
lines_in = f.readlines()
# Grab residues, keeping only one enantiomer of each
residues = []
for line in lines_in:
residue = line[:3]
if "INT 0" in line and residue.upper() not in residues: # Keep only one enantiomer
residues.append(residue)
# Make list of lines to write to the tleap.in file
lines = ["loadamberprep GLYCAM_06j-1.prep\n"]
for i in range(200, len(residues), 200):
lines.append("saveOff {" + " ".join(residues[i-200:i]) + "} GLYCAM_06j-1.lib\n")
lines.append("quit\n")
# Write lines
with open("make_glycam_lib_one_enantiomer.in", "w") as f:
f.writelines(lines)
Got error:
|
After fixing the external bonds error above, I realized that the approach we outlined in our meeting (keep only one enantiomer for each glycan residue) will not work. When I run However, if I do not filter out any residues when converting the
It seems that the UYB template actually matches many other residues that do not just differ by upper/lower case as I had initially thought, so my code for keeping only one enantiomer above is not going to work. For now, I'll proceed with testing Peter's fix on the ffxml generated for the small subset of glycan residues (( @peastman and @jchodera : Let me know if you have any thoughts on how to resolve this. If not, maybe we can discuss with the amber dev folks. |
I've just modified the truncated ffxml (with only a small subset of glycan residues) using Peter's code and tested the ffxml on a test system (glycosylated RBD): from simtk.openmm import app
# Load the ffxml
ffxml = [ffxml/protein.ff14SB.xml', "ffxml/GLYCAM_06j-1_trunc_modified.xml"]
if isinstance(ffxml, str):
ff = app.ForceField(ffxml)
else:
ff = app.ForceField(*ffxml)
# Load the test pdb
pdb = app.PDBFile("../test.pdb")
# Create system
system_omm = ff.createSystem(pdb.topology) The last line yields this error:
Here are the files necessary to reproduce the error ( |
It's important to remember that tleap matches by residue name, but OpenMM matches by template. It sounds like we'll need to generate a complete translation of the |
Weird! This sounds like another bug. Let's make sure to add this and the other potential bug in the |
OK, this suggests we will really need to add a mode to ParmEd to filter out identical residue templates on export to ffxml, or perhaps add this to @peastman's post-processing script as a short-term workaround. |
Actually, it seems like
Which other potential bug are you referring to here? |
Yes, I'm just not sure how we want to do this. |
Ions get defined along with water, since their parameters can be different for different water models. Add it to |
Woohoo! |
Sounds good! |
I've finished cleaning up this PR! I've updated the first comment in this PR with a summary of the conversion pipeline and changes required by other repos. Note that I'm still waiting on ParmEd/ParmEd#1184 to be merged. I've also spent the afternoon modifying the
The test now outputs:
I should note that I've set the energy tolerance to 1e-1 so that it passes. @jchodera @peastman : Could you please review and merge when you get a chance? |
Excellent, congratulations on getting this done! |
@peastman Thanks so much for your help!! |
@peastman : We should cut a new release with the glycam functionality once you've had a chance to review and merge this. |
It looks good to me. If all the tests pass, it should be good to merge. (I notice that CI is currently failing due to lack of an OpenEye license key, but that's not related to this.) |
Will fix! |
@peastman : I've updated the OpenEye toolkit license and have triggered all the tests to run again. |
@jchodera : It looks like there's still a problem with the open eye license |
Investigating. |
The issue here is that secrets are not disclosed on PRs opened from forks. I'll merge this and we can debug from here. |
@zhang-ivy I've added you to this repo so you can use branches instead of pushing from forks. That should get around this issue next time. |
Convert
leaprc.GLYCAM_06j-1
to an opemm ffxml.Requires ParmEd/ParmEd#1149
TODO:- [ ] The protein atoms inGLYCAM_06j-1
are namedN
,CA
, etc. Need to decide whether we should useff14SB
to match the existing protein atom names or enforce use ofprotein.ff14SB
(which will require changing the protein atom names inGLYCAM_06j-1
toprotein-N
,protein-CA
etc).- [x] Allowoverride_level
parameter in yaml input toconvert_amber.py
- [ ] Write test for validating energies (openmm vs tleap system) -- I've started this, but its not working. See below.Currently, the primary showstopper for the test is thatffxml/GLYCAM_06j-1.xml
does not contain residues that are actually part of glycans. The only residues that are present are the modified protein residues to which glycans can bond (e.g. NLN, which is the modified version of ASN).For example, here are some of the residues that should be present: [‘UYB’, ‘4YB’, ‘VMB’, ‘2MA’, ‘0YB’, ‘0FA’, ‘0LB’]
This problem shows up as an error when I runpython convert_amber.py -i glycan.yaml
athttps://github.com/zhang-ivy/openmmforcefields/blob/convert-glycans/amber/convert_amber.py#L626
Summary of the conversion pipeline:
openmmforcefields/amber/glycam/
, convert the glycam prep file (GLYCAM_06j-1.prep) into a .lib file (GLYCAM_06j-1.lib)tleap -s -f make_glycam_lib.in > make_glycam_lib.out
In
openmmforcefields/amber/glycam/
, create a modified glycam leaprc file (that replaces loading the .prep with loading the .lib)In
openmmforcefields/amber/glycam/
, create a glycan.yaml and make sure to load the modified leaprc.From
openmmforcefields/amber
, runpython convert_amber.py -i glycam/glycan.yaml --verbose
Summary of changes to existing codebases:
Parmed
Openmmforcefields
OpenMM