Fix protein.pdb files; remove GROMACS-specific files; drop LFS usage #52

Merged
merged 39 commits into from
Oct 7, 2022

Conversation

bobym
Collaborator

@bobym bobym commented May 9, 2022

Added newly prepped PDBs in 04_schrodinger folders for each protein. Fixed invalid protein.pdb files.

@bobym bobym changed the title Fix data Fix protein.pdb files May 9, 2022
@bobym
Collaborator Author

bobym commented May 9, 2022

Note: I used the 3ELJ Jnk1 structure for this commit because the original (PDB ID 2GMX) was NT.

@codecov

codecov bot commented May 9, 2022

Codecov Report

Merging #52 (c339a12) into main (fe6f969) will not change coverage.
The diff coverage is n/a.

❗ Current head c339a12 differs from pull request most recent head e5a5390. Consider uploading reports for the commit e5a5390 to get more accurate results


@IAlibay
Collaborator

IAlibay commented May 10, 2022

Would it be possible to get an overview of what the fix process was here? (it'll help review and probably we should have details for this written somewhere)

@IAlibay IAlibay self-requested a review May 10, 2022 05:20
@dotsdl dotsdl linked an issue May 10, 2022 that may be closed by this pull request
@dotsdl dotsdl added this to the Release 0.3.0 milestone May 10, 2022
@dotsdl
Member

dotsdl commented May 10, 2022

@jchodera: should we be including annotations on what the biological unit (e.g. dimer) is for each target? The 00 annotations would be the best place for these. The manuscript also needs a column in the table that lists them.

@ppxasjsm: would like to run these through BioSimSpace; can try to operate on this branch before we merge this PR.

@IAlibay
Collaborator

IAlibay commented May 10, 2022

@j-wags does it make sense to make sure openff's current alpha release can read these files too?

@bobym
Collaborator Author

bobym commented May 10, 2022

Would it be possible to get an overview of what the fix process was here? (it'll help review and probably we should have details for this written somewhere)

Absolutely -- I was including it in the draft of the paper but here's a brief overview of the prep in schrodinger:

The co-crystal structures were prepared using Protein Prep Wizard [1] in Schrödinger Suites (version 2021-1). Bond orders were assigned, explicit hydrogens were added, and non-bridging waters (fewer than two H-bonds to non-waters) were removed. Het states were generated for pH 7.4 using Epik [2]. Hydrogen bonds were assigned and the protein was ionized using PROPKA at pH 7.4. A restrained minimization was performed using the OPLS4 force field [3] to a heavy-atom RMSD convergence of 0.3 Å. For entries with multiple biological assemblies in their PDB, the assembly with the highest degree of coverage and lowest disorder was retained for use. The co-crystallized ligands were then removed from the protein.

@ijpulidos
Collaborator

I encountered a minor issue with the ligand SDF files and RDKit.

If we concatenate them and try to read them using RDKit, the parsing fails (you only get one molecule). To solve this issue we would have to add a blank line between each $$$$ line and the line before it. For example, the following

> <s_glide_core_constrain_type>
snapped_core_restrain
$$$$

should become

> <s_glide_core_constrain_type>
snapped_core_restrain

$$$$

As far as I can tell, this only happens with RDKit and only when concatenating the files (it doesn't happen if I use OpenEye to read them). It is a minor annoyance, but I wonder if this is something we would like to fix to support workflows that use RDKit and need to concatenate the files.
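
For anyone who needs to concatenate the per-ligand files in the meantime, here is a minimal sketch of the workaround described above, in plain Python with no RDKit dependency. The function name `join_sdf_records` is made up for illustration, and the rule it applies (insert a blank line before `$$$$` whenever the preceding line is a non-blank data value) is an assumption based on the example in this comment, not a full SDF validator.

```python
def _emit_terminator(out):
    """Append a $$$$ record terminator, adding the blank line that
    closes a data item if the previous line is a non-blank data value."""
    prev = out[-1].strip() if out else ""
    # Records with no data items end at "M  END"; leave those untouched.
    if prev and prev != "M  END":
        out.append("")
    out.append("$$$$")


def join_sdf_records(sdf_texts):
    """Concatenate SDF texts into one multi-record SDF string."""
    out = []
    for text in sdf_texts:
        if not text.strip():
            continue  # skip empty inputs
        # Strip only trailing whitespace: leading blank header lines are legal.
        for line in text.rstrip().splitlines():
            if line.strip() == "$$$$":
                _emit_terminator(out)
            else:
                out.append(line.rstrip())
        # Add a terminator if the input did not end with one.
        if out[-1] != "$$$$":
            _emit_terminator(out)
    return "\n".join(out) + "\n"
```

The joined text can then be written to a single file and read with whichever toolkit the workflow uses. Records that already have the blank line are passed through unchanged.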

@richardjgowers

@ijpulidos can you open an issue on rdkit about this? Seems like an easy fix and the hackathon is coming up

@IAlibay
Collaborator

IAlibay commented Jul 27, 2022

@ijpulidos Yeah this is the same issue as #52 (comment)

I checked with the format spec and the SDF files are definitely standard compliant, so this is an RDKit failure.

@dotsdl
Member

dotsdl commented Aug 2, 2022

The SDF spacing issue with $$$$ may be a consequence of the approach I used to split the all-ligand SDFs @bobym produced. Will verify this.

@dotsdl
Member

dotsdl commented Aug 2, 2022

From @ijpulidos: rdkit/rdkit#5467

@IAlibay
Collaborator

IAlibay commented Aug 2, 2022

@ijpulidos I was super convinced I was getting this issue even when the files were standard compliant, but I just double checked and that doesn't seem to be the case.

We should just add in that extra blank line - @bobym do you want me to do this in #58?

@dotsdl
Member

dotsdl commented Sep 8, 2022

@IAlibay : @bobym has pushed her latest prep. Details:

The methods are in /protein-ligand benchmark/preparation. There's a PDF with an overview, a file with the CLI commands for prep and exportation, and the ligand prep, grid generation, and docking folders for each target, which include the input files, execution scripts, logs, and results. Anyone with Schrodinger would be able to use the input files to set up their own

I believe you're already in the loop, but wanted to ask if you're good to review? Is there anything you'd like me to take on?

@IAlibay
Collaborator

IAlibay commented Sep 8, 2022

@dotsdl yes @bobym has indeed kept me in the loop here. I'll review and will implement any #58 fixes as necessary. I'm not back from leave until the 14th so I can't guarantee I'll be able to action things until then.

@bobym
Collaborator Author

bobym commented Sep 8, 2022

We should just add in that extra blank line - @bobym do you want me to do this in #58?

@IAlibay @ijpulidos @dotsdl Is this still an issue? If so, can you check with these files from CDK2, which I split in Maestro during export:
cdk2_ligs.zip

@IAlibay
Collaborator

IAlibay commented Sep 8, 2022

We should just add in that extra blank line - @bobym do you want me to do this in #58?

@IAlibay @ijpulidos @dotsdl Is this still an issue? If so, can you check with these files from CDK2, which I split in Maestro during export: cdk2_ligs.zip

For the few sdfs I've checked for now I think it's fine, or at least the multi-ligand SDF files load fine in RDKit. I'll check the rest as I go along.

@ijpulidos
Collaborator

ijpulidos commented Sep 14, 2022

As discussed in the meeting earlier today, I noticed some changes in the number of ligands with the last commit. I ran a simple script that checks this by counting the number of SDF files in the previous commit (revision bd860dde) and comparing that to the number of ligands in the merged SDF file in the latest commit (revision c339a12c), for each target. The results are as follows:

Target: hif2a
    at revision bd860dde: 37, at revision c339a12c: 37
Target: cmet
    at revision bd860dde: 5, at revision c339a12c: 4
Target: syk
    at revision bd860dde: 43, at revision c339a12c: 46
Target: pfkfb3
    at revision bd860dde: 33, at revision c339a12c: 38
Target: mcl1
    at revision bd860dde: 24, at revision c339a12c: 25
Target: eg5
    at revision bd860dde: 27, at revision c339a12c: 28
Target: p38
    at revision bd860dde: 29, at revision c339a12c: 31
Target: pde2
    at revision bd860dde: 20, at revision c339a12c: 21
Target: cdk2
    at revision bd860dde: 10, at revision c339a12c: 9
Target: cdk8
    at revision bd860dde: 31, at revision c339a12c: 32
Target: thrombin
    at revision bd860dde: 11, at revision c339a12c: 10
Target: tnks2
    at revision bd860dde: 27, at revision c339a12c: 26
Target: tyk2
    at revision bd860dde: 13, at revision c339a12c: 14
Target: shp2
    at revision bd860dde: 20, at revision c339a12c: 25
Target: ptp1b
    at revision bd860dde: 22, at revision c339a12c: 22

While most of the targets gain ligands (this should be fine), you can see that four of them lose one ligand each, namely cmet, cdk2, thrombin, and tnks2. This may be okay. @bobym I'd appreciate your input here since you probably know why this is the case.

The script I ran for this is at https://gist.github.com/ijpulidos/3260d15ae2c4ff06afecc7256eef94eb

I suggest running the script in a fresh, independent clone of the repo, since I already nuked my local copy by interrupting it at just the "right" instant (it seems git can be susceptible to corruption when checkouts are interrupted this way). You would run it using:

python ~/workdir/snippets/check_n_ligands_plbenchmarks.py bd860dde c339a12c 2> /dev/null

I'm redirecting stderr to /dev/null to avoid printing noise.
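
For readers who don't want to open the gist, the comparison can be sketched roughly as follows. This is not the gist's code: the helper names `count_sdf_records` and `targets_losing_ligands` are hypothetical, and counting `$$$$` terminators assumes every record in the merged SDF file is properly terminated. The counts in the two dictionaries are copied from the results listed above.

```python
def count_sdf_records(sdf_text):
    """Count molecules in a multi-record SDF by counting $$$$ terminators."""
    return sum(1 for line in sdf_text.splitlines() if line.strip() == "$$$$")


def targets_losing_ligands(before, after):
    """Return the targets whose ligand count dropped between two revisions."""
    return sorted(t for t in before if after.get(t, 0) < before[t])


# Ligand counts per target at revision bd860dde (copied from the list above).
before = {
    "hif2a": 37, "cmet": 5, "syk": 43, "pfkfb3": 33, "mcl1": 24,
    "eg5": 27, "p38": 29, "pde2": 20, "cdk2": 10, "cdk8": 31,
    "thrombin": 11, "tnks2": 27, "tyk2": 13, "shp2": 20, "ptp1b": 22,
}
# Ligand counts per target at revision c339a12c (copied from the list above).
after = {
    "hif2a": 37, "cmet": 4, "syk": 46, "pfkfb3": 38, "mcl1": 25,
    "eg5": 28, "p38": 31, "pde2": 21, "cdk2": 9, "cdk8": 32,
    "thrombin": 10, "tnks2": 26, "tyk2": 14, "shp2": 25, "ptp1b": 22,
}

print(targets_losing_ligands(before, after))
# ['cdk2', 'cmet', 'thrombin', 'tnks2']
```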

@ijpulidos
Collaborator

ijpulidos commented Sep 27, 2022

As previously discussed, I tried running all the transformations in the dataset with perses/openmm, and as far as I can see, there are 3 targets that are showing some issues:

  • MCL1: ValueError: No template found for residue 151 (VAL). The set of atoms is similar to NVAL, but it is missing 1 hydrogen atoms.
  • PFKFB3: ValueError: No template found for residue 449 (POP). This might mean your input topology is missing some atoms or bonds, or possibly that you are using the wrong force field.
  • THROMBIN: ValueError: No template found for residue 271 (TYS). This might mean your input topology is missing some atoms or bonds, or possibly that you are using the wrong force field.
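
The POP and TYS failures are both non-standard residues, so a quick pre-flight scan of each protein.pdb can surface them before any simulation is set up. The sketch below is a plain-Python illustration, not part of perses or OpenMM: the function name `find_nonstandard_residues` is hypothetical, it relies on the fixed-column layout of PDB ATOM/HETATM records, and the set of "known" residue names is an assumption (only the 20 canonical amino acids, with water ignored). It would not catch the MCL1 case, where a standard VAL is missing a hydrogen.

```python
STANDARD_RESIDUES = frozenset(
    "ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE "
    "LEU LYS MET PHE PRO SER THR TRP TYR VAL".split()
)


def find_nonstandard_residues(pdb_text, known=STANDARD_RESIDUES, ignore=("HOH",)):
    """Return (chain, residue number, residue name) for each residue
    whose name is neither in `known` nor in `ignore`."""
    found = []
    seen = set()
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 26:
            continue
        resname = line[17:20].strip()  # PDB columns 18-20
        chain = line[21]               # PDB column 22
        resseq = line[22:26].strip()   # PDB columns 23-26
        key = (chain, resseq, resname)
        if resname in known or resname in ignore or key in seen:
            continue
        seen.add(key)  # report each residue once, not once per atom
        found.append(key)
    return found
```

Protonation-state or capping-group residue names (for example HIE or ACE) would also be reported unless they are added to `known`, which may or may not be what you want depending on the force field.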

@IAlibay
Collaborator

IAlibay commented Sep 27, 2022

All three should be addressed by the updates I'm working on. Or at least the PDBs will make non-standard residues clear by marking them as HETATM records.

@IAlibay
Collaborator

IAlibay commented Sep 27, 2022

Also @ijpulidos I was under the impression that the OpenFE team was tasked with running these? I'd like to avoid duplication of effort (and duplication of John's cluster time) where possible.

@IAlibay IAlibay mentioned this pull request Sep 27, 2022
9 tasks
@dotsdl
Member

dotsdl commented Sep 30, 2022

Just noticed that as of the latest commit we no longer have bace, bace_hunt, bace_p2, and pde10 as targets. I may have missed this in conversation, but wanted to check that this is intended and expected?

@bobym
Collaborator Author

bobym commented Oct 3, 2022

Just noticed that as of the latest commit we no longer have bace, bace_hunt, bace_p2, and pde10 as targets. I may have missed this in conversation, but wanted to check that this is intended and expected?

@dotsdl Yes, this is expected. The short version is that the preparation of the BACE datasets in our previous benchmarking studies, and in other groups', has been very far from assay conditions (assay at pH 4-5, prep at pH 7), which we decided was far enough off to have deleterious effects on the protonation states of the proteins and ligands. PDE10 was excluded because the assays were run using rat PDE10 (maybe mouse? either way, not human), while the crystal structure was of human PDE10. We decided to remove these systems for now until we can validate their preparation and assay conditions. I believe the plan was to include them in the next set, where we will be prepping all systems according to their assay conditions rather than at a specific pH.
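
To give a rough sense of why the pH gap matters, the Henderson-Hasselbalch relation can be used to estimate how much of a titratable site is protonated at each pH. The sketch below is illustrative only: the pKa of 6.0 is a made-up value, not a measured pKa for any BACE residue or ligand, and a single-site model ignores coupling between sites.

```python
def fraction_protonated(ph, pka):
    """Henderson-Hasselbalch: fraction of a single titratable site
    that is in its protonated form at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


HYPOTHETICAL_PKA = 6.0  # assumption for illustration, not a measured value

# Compare an assay-like pH (4.5) with the prep pH (7.0).
for ph in (4.5, 7.0):
    frac = fraction_protonated(ph, HYPOTHETICAL_PKA)
    print(f"pH {ph}: {frac:.2f} protonated")
```

For a site with a pKa between the two pH values, the dominant protonation state flips, which is the kind of mismatch between prep and assay conditions described above.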

@dotsdl
Member

dotsdl commented Oct 3, 2022

Awesome, thanks @bobym! This sufficiently jogged my memory from our previous discussions!

Removed large files from prep folder -- input and setup scripts still available.
@dotsdl dotsdl merged commit 262633b into main Oct 7, 2022
@IAlibay
Collaborator

IAlibay commented Oct 7, 2022

Having discussed this with @dotsdl: there are some problems that still need addressing, and we will raise the relevant issues in a short while. This should lessen the burden of dealing with a giant PR.

7 participants