Fix protein.pdb files; remove GROMACS-specific files; drop LFS usage #52

bobym · 2022-05-09T22:47:25Z

Added newly prepped PDBs in 04_schrodinger folders for each protein. Fixed invalid protein.pdb files.

bobym · 2022-05-09T22:48:48Z

Note: I used the 3ELJ Jnk1 structure for this commit because the original (PDB ID 2GMX) was NT.

codecov · 2022-05-09T22:53:37Z

Codecov Report

Merging #52 (c339a12) into main (fe6f969) will not change coverage.
The diff coverage is n/a.

❗ Current head c339a12 differs from pull request most recent head e5a5390. Consider uploading reports for the commit e5a5390 to get more accurate results

IAlibay · 2022-05-10T05:20:29Z

Would it be possible to get an overview of what the fix process was here? (it'll help review and probably we should have details for this written somewhere)

dotsdl · 2022-05-10T16:13:35Z

@jchodera: should we be including annotations on what the biological unit (e.g. dimer) for each target is? The 00 annotations would be the best place for these. The manuscript also needs a column in the table that lists them.

@ppxasjsm: would like to run these through BioSimSpace; can try to operate on this branch before we merge this PR.

IAlibay · 2022-05-10T16:15:21Z

@j-wags does it make sense to make sure openff's current alpha release can read these files too?

bobym · 2022-05-10T16:48:28Z

> Would it be possible to get an overview of what the fix process was here? (it'll help review and probably we should have details for this written somewhere)

Absolutely -- I was including it in the draft of the paper, but here's a brief overview of the prep in Schrödinger:

The co-crystal structures were prepared using the Protein Prep Wizard [1] in the Schrödinger Suite (version 2021-1). Bond orders were assigned, explicit hydrogens were added, and non-bridging waters (fewer than two H-bonds to non-waters) were removed. Het states were generated at pH 7.4 using Epik [2]. Hydrogen bonds were assigned and the protein's ionization states were assigned using PROPKA at pH 7.4. A restrained minimization was performed using the OPLS4 force field [3] to a heavy-atom RMSD convergence of 0.3 Å. For entries with multiple biological assemblies in their PDB, the assembly with the highest degree of coverage and lowest disorder was retained. The co-crystallized ligands were then removed from the protein.

ijpulidos · 2022-07-27T16:41:15Z

I encountered a minor issue with the ligands sdf files and RDKit.

If we concatenate them and try to read them using RDKit, the parsing fails (you only get one molecule). To solve this, we would have to add a blank line between the final $$$$ line and the line before it. For example, the following

protein-ligand-benchmark/data/pde10/02_ligands/lig_2750/crd/lig_2750.sdf

Lines 218 to 220 in bd860dd

    > <s_glide_core_constrain_type>
    snapped_core_restrain
    $$$$

should become

    > <s_glide_core_constrain_type>
    snapped_core_restrain

    $$$$

As far as I can tell, this only happens with RDKit and only when concatenating the files (it doesn't happen if I use OpenEye to read them). It is a minor annoyance, but I wonder whether we would like to fix it to support workflows that use RDKit and need to concatenate the files.
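As a rough illustration of the fix described above, the sketch below inserts the missing blank line before each $$$$ terminator when per-ligand files are concatenated. It is plain string handling and makes no RDKit calls; the function names `ensure_blank_before_terminator` and `concatenate_sdf` are hypothetical and are not part of this repository.

```python
from typing import Iterable


def ensure_blank_before_terminator(sdf_text: str) -> str:
    """Return the SDF text with a blank line before every `$$$$` line.

    Lines are otherwise passed through unchanged. If a blank (or
    whitespace-only) line already precedes the terminator, nothing is added.
    """
    fixed = []
    for line in sdf_text.splitlines():
        is_terminator = line.strip() == "$$$$"
        if is_terminator and fixed and fixed[-1].strip() != "":
            fixed.append("")  # the blank line that closes the last data item
        fixed.append(line)
    return "\n".join(fixed) + "\n"


def concatenate_sdf(records: Iterable[str]) -> str:
    """Concatenate single-molecule SDF texts into one multi-molecule SDF."""
    return "".join(ensure_blank_before_terminator(r) for r in records)
```

Applying the function twice gives the same result as applying it once, so it should be safe to run over files that already have the blank line.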

richardjgowers · 2022-07-27T16:46:48Z

@ijpulidos can you open an issue on rdkit about this? Seems like an easy fix and the hackathon is coming up

IAlibay · 2022-07-27T19:12:17Z

@ijpulidos Yeah this is the same issue as #52 (comment)

I checked against the format spec and the SDF files are definitely standard-compliant, so this is an RDKit failure.

dotsdl · 2022-08-02T16:16:21Z

The SDF spacing issue with $$$$ may be a consequence of my approach to splitting the all-ligand SDFs @bobym produced. Will verify this.

dotsdl · 2022-08-02T16:16:50Z

From @ijpulidos: rdkit/rdkit#5467

IAlibay · 2022-08-02T16:28:23Z

@ijpulidos I was super convinced I was getting this issue even when standard-compliant, but I just double-checked and that doesn't seem to be the case.

We should just add in that extra blank line - @bobym do you want me to do this in #58?

dotsdl · 2022-09-08T01:09:28Z

@IAlibay : @bobym has pushed her latest prep. Details:

> The methods are in /protein-ligand benchmark/preparation. There's a PDF with an overview, a file with the CLI commands for prep and exportation, and the ligand prep, grid generation, and docking folders for each target, which include the input files, execution scripts, logs, and results. Anyone with Schrödinger would be able to use the input files to set up their own

I believe you're already in the loop, but wanted to ask if you're good to review? Is there anything you'd like me to take on?

IAlibay · 2022-09-08T01:16:39Z

@dotsdl yes @bobym has indeed kept me in the loop here. I'll review and will implement any #58 fixes as necessary. I'm not back from leave until the 14th so I can't guarantee I'll be able to action things until then.

bobym · 2022-09-08T14:36:03Z

> We should just add in that extra blank line - @bobym do you want me to do this in #58?

@IAlibay @ijpulidos @dotsdl Is this still an issue? If so, can you check with these files from CDK2, which I split in Maestro during export:
cdk2_ligs.zip

IAlibay · 2022-09-08T14:42:49Z

For the few sdfs I've checked for now I think it's fine, or at least the multi-ligand SDF files load fine in RDKit. I'll check the rest as I go along.

ijpulidos · 2022-09-14T03:28:44Z

As discussed in the meeting earlier today, I noticed some changes in the number of ligands with the last commit. I ran a simple script that checks this by counting the number of SDF files in the previous commit (revision bd860dde) and comparing that to the number of ligands in the merged SDF file in the latest commit (revision c339a12c), for each target. The results are as follows:

Target: hif2a
    at revision bd860dde: 37, at revision c339a12c: 37
Target: cmet
    at revision bd860dde: 5, at revision c339a12c: 4
Target: syk
    at revision bd860dde: 43, at revision c339a12c: 46
Target: pfkfb3
    at revision bd860dde: 33, at revision c339a12c: 38
Target: mcl1
    at revision bd860dde: 24, at revision c339a12c: 25
Target: eg5
    at revision bd860dde: 27, at revision c339a12c: 28
Target: p38
    at revision bd860dde: 29, at revision c339a12c: 31
Target: pde2
    at revision bd860dde: 20, at revision c339a12c: 21
Target: cdk2
    at revision bd860dde: 10, at revision c339a12c: 9
Target: cdk8
    at revision bd860dde: 31, at revision c339a12c: 32
Target: thrombin
    at revision bd860dde: 11, at revision c339a12c: 10
Target: tnks2
    at revision bd860dde: 27, at revision c339a12c: 26
Target: tyk2
    at revision bd860dde: 13, at revision c339a12c: 14
Target: shp2
    at revision bd860dde: 20, at revision c339a12c: 25
Target: ptp1b
    at revision bd860dde: 22, at revision c339a12c: 22

While most of the targets gain ligands (this should be fine), you can see some of them losing one of the ligands, namely cmet, cdk2, thrombin, and tnks2. This may be okay. @bobym I appreciate your input here since you probably know why this is the case.

The script I ran for this is at https://gist.github.com/ijpulidos/3260d15ae2c4ff06afecc7256eef94eb

I suggest using the script in an independent, fresh clone of the repo, since I already nuked my local copy by interrupting it at just the "right" instant (it seems git can be susceptible to corruption when checkouts are interrupted this way). You would run it using:

python ~/workdir/snippets/check_n_ligands_plbenchmarks.py bd860dde c339a12c 2> /dev/null

I'm redirecting the stderr to null to avoid printing noise.
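For readers without access to the gist, the comparison logic can be sketched roughly as follows. This is not the gist's code: `count_sdf_records` and `compare_counts` are hypothetical helpers, and the sketch assumes that counting `$$$$` terminator lines is an adequate way to count ligands in a merged SDF file. The per-target numbers are copied from the results listed above.

```python
from typing import Dict, List

# Ligand counts per target, copied from the results reported in this thread.
COUNTS_BD860DDE = {
    "hif2a": 37, "cmet": 5, "syk": 43, "pfkfb3": 33, "mcl1": 24,
    "eg5": 27, "p38": 29, "pde2": 20, "cdk2": 10, "cdk8": 31,
    "thrombin": 11, "tnks2": 27, "tyk2": 13, "shp2": 20, "ptp1b": 22,
}
COUNTS_C339A12C = {
    "hif2a": 37, "cmet": 4, "syk": 46, "pfkfb3": 38, "mcl1": 25,
    "eg5": 28, "p38": 31, "pde2": 21, "cdk2": 9, "cdk8": 32,
    "thrombin": 10, "tnks2": 26, "tyk2": 14, "shp2": 25, "ptp1b": 22,
}


def count_sdf_records(sdf_text: str) -> int:
    """Count molecule records by counting `$$$$` terminator lines."""
    return sum(1 for line in sdf_text.splitlines() if line.strip() == "$$$$")


def compare_counts(old: Dict[str, int], new: Dict[str, int]) -> Dict[str, List[str]]:
    """Group targets present in both revisions by how their count changed."""
    report: Dict[str, List[str]] = {"gained": [], "lost": [], "unchanged": []}
    for target in sorted(set(old) & set(new)):
        delta = new[target] - old[target]
        key = "gained" if delta > 0 else "lost" if delta < 0 else "unchanged"
        report[key].append(target)
    return report


if __name__ == "__main__":
    for status, targets in compare_counts(COUNTS_BD860DDE, COUNTS_C339A12C).items():
        print(f"{status}: {', '.join(targets)}")
```

Working from counts held in memory, rather than checking out each revision in place, also sidesteps the interrupted-checkout problem mentioned above.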

ijpulidos · 2022-09-27T14:41:42Z

As previously discussed, I tried running all the transformations in the dataset with perses/openmm, and as far as I can see, there are 3 targets that are showing some issues:

MCL1: ValueError: No template found for residue 151 (VAL). The set of atoms is similar to NVAL, but it is missing 1 hydrogen atoms.
PFKFB3: ValueError: No template found for residue 449 (POP). This might mean your input topology is missing some atoms or bonds, or possibly that you are using the wrong force field.
THROMBIN: ValueError: No template found for residue 271 (TYS). This might mean your input topology is missing some atoms or bonds, or possibly that you are using the wrong force field.

IAlibay · 2022-09-27T15:02:13Z

All three should be addressed by the updates I'm working on. Or at least the PDBs will make it clear, using HETATM records, where there are non-standard residues.
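A quick way to spot residues such as POP and TYS before running a simulation is to scan the coordinate records for residue names outside a known set. The sketch below is only an illustration: `nonstandard_residues` is a hypothetical helper, and the `STANDARD_RESIDUES` set (the 20 canonical amino acids plus HOH) is an assumption, not the template list of any particular force field. It checks residue names only, so it would not flag the MCL1 error above, which concerns a missing hydrogen on a standard VAL.

```python
from typing import Dict, Set, Tuple

# Assumption for illustration: 20 canonical amino acids plus water.
STANDARD_RESIDUES = frozenset({
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "HOH",
})

ResidueKey = Tuple[str, str, str]  # (residue name, chain ID, residue number)


def nonstandard_residues(pdb_text: str) -> Dict[ResidueKey, Set[str]]:
    """Map each non-standard residue to the record types it appears under.

    Uses the fixed-column PDB layout: record name in columns 1-6, residue
    name in 18-20, chain ID in 22, residue sequence number in 23-26.
    """
    found: Dict[ResidueKey, Set[str]] = {}
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        resname = line[17:20].strip()
        if resname in STANDARD_RESIDUES:
            continue
        key = (resname, line[21:22], line[22:26].strip())
        found.setdefault(key, set()).add(record)
    return found
```

A non-standard residue that shows up under ATOM rather than HETATM is a likely candidate for the kind of template failure reported above.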

IAlibay · 2022-09-27T15:03:37Z

Also @ijpulidos I was under the impression that the OpenFE team was tasked with running these? I'd like to avoid duplication of effort (and duplication of John's cluster time) where possible.

dotsdl · 2022-09-30T22:31:24Z

Just noticed that as of the latest commit we no longer have bace, bace_hunt, bace_p2, and pde10 as targets. I may have missed this in conversation, but wanted to check that this is intended and expected?

bobym · 2022-10-03T14:10:35Z

> Just noticed that as of the latest commit we no longer have bace, bace_hunt, bace_p2, and pde10 as targets. I may have missed this in conversation, but wanted to check that this is intended and expected?

@dotsdl Yes, this is expected. The short version is that the preparation of the BACE datasets in our previous benchmarking studies, and in other groups', has been very far from assay conditions (assay at pH 4-5, prep at pH 7), which we decided was far enough off to have deleterious effects on the protonation states of the proteins and ligands. PDE10 was excluded because the assays were run using rat PDE10 (maybe mouse? either way, not human), while the crystal structure was of human PDE10. We decided to remove these systems for now until we can validate their preparation and assay conditions. I believe the plan was to include them in the next set, where we will be prepping all systems according to their assay conditions rather than at a specific pH.
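To give a feel for why a shift from pH 4-5 to pH 7 matters, the Henderson-Hasselbalch relation can be evaluated for a single acidic site. This is a minimal sketch: the pKa of 5.0 used below is a hypothetical value chosen for illustration, not a measured pKa of any BACE residue or ligand, and real sites in a protein are coupled to their environment in ways this one-site formula ignores.

```python
def fraction_protonated(ph: float, pka: float) -> float:
    """Fraction of an acidic site (HA <-> A- + H+) in its protonated form.

    From Henderson-Hasselbalch: [HA] / ([HA] + [A-]) = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


if __name__ == "__main__":
    hypothetical_pka = 5.0  # illustrative only, not a measured value
    for ph in (4.0, 4.5, 5.0, 7.0, 7.4):
        frac = fraction_protonated(ph, hypothetical_pka)
        print(f"pH {ph:.1f}: {frac:.3f} protonated")
```

For a site with a pKa in this range, the dominant state at assay pH and at prep pH can differ, which is the concern raised above.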

dotsdl · 2022-10-03T23:57:11Z

Awesome, thanks @bobym! This sufficiently jogged my memory from our previous discussions!

IAlibay · 2022-10-07T15:50:59Z

Having discussed this with @dotsdl, there are some issues that need addressing; we will raise the relevant issues shortly. This should lessen the burden of dealing with a giant PR.

bobym added 20 commits May 9, 2022 18:37

Add files via upload (99ff078)
Add files via upload (df3130a)
Add files via upload (c58acef)
Add files via upload (9d3dce4)
Add files via upload (3cd77fd)
Add files via upload (c634f63)
Add files via upload (2002356)
Add files via upload (5f14b48)
Add files via upload (58189ac)
Add files via upload (e5c29e5)
Add files via upload (607c9bf)
Add files via upload (346d6d1)
Add files via upload (a4ff3a2)
Add files via upload (8bb410d)
Add files via upload (7f6f5f7)
Add files via upload (82a9f66)
Add files via upload (cd766cd)
Add files via upload (f9849b1)
Add files via upload (c9f38a7)
Add files via upload (6b0fa20)

bobym changed the title ~~Fix data~~ Fix protein.pdb files May 9, 2022

IAlibay self-requested a review May 10, 2022 05:20

dotsdl linked an issue May 10, 2022 that may be closed by this pull request: protein.pdb files are not valid PDB files #20 (Closed)

dotsdl added this to the Release 0.3.0 milestone May 10, 2022

IAlibay mentioned this pull request Jul 28, 2022: Incorrect symmetry in mapping of TNSK2 ligand pair? choderalab/perses#1078 (Closed)

Gilli08 mentioned this pull request Aug 8, 2022: Protonation state of HIS248 in HIF2A needs to be adapted #64 (Open)

Yoshanuikabundi mentioned this pull request Sep 1, 2022: Fix runtime issues in toolkit showcase openforcefield/openff-toolkit#1391 (Merged)

dotsdl assigned bobym Sep 6, 2022

Push CLI-prepped protein structures, split non-metal cofactors, include prep documentation (c339a12)

IAlibay mentioned this pull request Sep 27, 2022: Generate benchmark data #65 (Open, 9 tasks)

Remove large files from preparation folder (e5a5390)
Removed large files from prep folder -- input and setup scripts still available.

dotsdl merged commit 262633b into main Oct 7, 2022

This was referenced Oct 10, 2022: Fix PDB files #68 (Closed); [WIP] Fix MCL1 capping, curate thrombin ligands for expanded dynamic range #76 (Closed)

IAlibay mentioned this pull request Jan 10, 2023: Various PDB fixes #87 (Merged, 12 tasks)

ijpulidos mentioned this pull request Jul 26, 2024: Choice of ligands in the repo #79 (Open)