-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix protein.pdb files; remove GROMACS-specific files; drop LFS usage #52
Conversation
Note: I used the 3ELJ Jnk1 structure for this commit because original (PDB ID 2GMX) was NT. |
Codecov Report
Additional details and impacted files |
Would it be possible to get an overview of what the fix process was here? (it'll help review and probably we should have details for this written somewhere) |
@jchodera: should we be including annotations on what the biological unit (e.g. dimer) for each target is? Including these in @ppxasjsm: would like to run these through BioSimSpace; can try to operate on this branch before we merge this PR. |
@j-wags does it make sense to make sure openff's current alpha release can read these files too? |
Absolutely -- I was including it in the draft of the paper but here's a brief overview of the prep in schrodinger:
|
I encountered a minor issue with the ligands sdf files and RDKit. If we concatenate them and try to read them using RDKit, the parsing fails (you only get one molecule). To solve this issue we would have to add a line jump between the last line with protein-ligand-benchmark/data/pde10/02_ligands/lig_2750/crd/lig_2750.sdf Lines 218 to 220 in bd860dd
should become
As far as I can tell, this only happens with RDkit and when concatenating the files (doesn't happen if I use OpenEye to read them). It is a minor annoyance but I wonder if this is something we would like to fix to support workflows that use RDKit and need to concatenate the files. |
@ijpulidos can you open an issue on rdkit about this? Seems like an easy fix and the hackathon is coming up |
@ijpulidos Yeah this is the same issue as #52 (comment) I checked with the format spec and the SDF files are definitely standard compliant, so this is an RDKit failure. |
The SDF spacing issue with |
From @ijpulidos: rdkit/rdkit#5467 |
@ijpulidos I was super convinced I was getting this issue even when standard compliant, but I just double checked and it that doesn't seem to be the case. We should just add in that extra blank line - @bobym do you want me to do this in #58? |
…de prep documentation
@IAlibay : @bobym has pushed her latest prep. Details:
I believe you're already in the loop, but wanted to ask if you're good to review? Is there anything you'd like me to take on? |
@IAlibay @ijpulidos @dotsdl Is this still an issue? If so, can you check with these files from CDK2, which I split in Maestro during export: |
For the few sdfs I've checked for now I think it's fine, or at least the multi-ligand SDF files load fine in RDKit. I'll check the rest as I go along. |
As discussed in the meeting earlier today, I noticed some changes in the number of ligands with the last commit. I run a simple script that checks this by counting the number of
While most of the targets gain ligands (this should be fine), you can see some of them losing one of the ligands, namely The script I run for this is in https://gist.github.com/ijpulidos/3260d15ae2c4ff06afecc7256eef94eb I suggest using the script in an independent and fresh clone of the repo, since I already nuked my local copy by interrupting it in just the "right" instant (seems like git can be susceptible to corruption by interrupting checkouts this way), you would run it using: python ~/workdir/snippets/check_n_ligands_plbenchmarks.py bd860dde c339a12c 2> /dev/null I'm redirecting the stderr to null to avoid printing noise. |
As previously discussed, I tried running all the transformations in the dataset with perses/openmm, and as far as I can see, there are 3 targets that are showing some issues:
|
All three should be addressed by the updates I'm working on. Or at least the PDBs will make it clear with HETATM residues in cases where there are non standard residues. |
Also @ijpulidos I was under the impression that the OpenFE team was tasked with running these? I'd like to avoid duplication of effort (and duplication of John's cluster time) where possible. |
Just noticed that as of the latest commit we no longer have |
@dotsdl Yes this is expected. The short version is that the preparation of the BACE datasets in our previous and other groups' benchmarking studies has been very far off of assay conditions (assay at pH 4-5, prep at pH 7), which we decided was far off enough to have deleterious effects on the protonation states of the proteins and ligands. PDE10 was exclused because the assays were run using rat PDE10 (maybe mouse? either way, not human), while the crystal structure was of human PDE10. We decided to remove these systems for now until we can validate their preparation and assay conditions. I believe the plan was to include them in the next set, where we will be prepping all systems according to their assay conditions rather than at a specific pH. |
Awesome, thanks @bobym! This sufficiently jogged my memory from our previous discussions! |
Removed large files from prep folder -- input and setup scripts still available.
Having discussed this with @dotsdl, there are some issues that need addressing, we will raise relevant issues in a short while. This should lessen the burden of dealing with a giant PR. |
Added newly prepped PDBs in
04_schrodinger
folders for each protein. Fixed invalid protein.pdb files.