PDBx Parser #297

Open
TimothyStiles opened this issue Mar 21, 2023 · 10 comments
Labels
enhancement New feature or request hard A major or complex undertaking medium priority The default priority for a new issue.

Comments

@TimothyStiles
Collaborator

TimothyStiles commented Mar 21, 2023

We're getting structural, but first we need a parser for PDBx so we can scrape and parse all of the Protein Data Bank.

I've seen a couple of Go parsers for PDBx but am unsure of their quality. In the end we need to be able to parse a lot of these files:

https://www.wwpdb.org/deposition/preparing-pdbx-mmcif-files

@TimothyStiles TimothyStiles converted this from a draft issue Mar 21, 2023
@rkrishnasanka
Contributor

rkrishnasanka commented Mar 29, 2023

Here's the actual file format - http://www.wwpdb.org/documentation/file-format

I might be interested in taking this on in a month or so. Wouldn't mind outlining what needs to get done first.

One of the underlying formats is STAR - https://pubs.acs.org/doi/10.1021/ci00019a005

@carreter
Collaborator

carreter commented Aug 8, 2023

@rkrishnasanka Pinged you over on the Discord about this, pinging you here as well - I'm thinking of picking this up and was wondering if you'd made any progress or would like to collaborate!

@carreter
Collaborator

carreter commented Aug 30, 2023

What actually is PDBx/mmCIF, anyway?

Did a bit more research on this, and it seems like the underlying syntax for PDBx/mmCIF is CIF v1.1, which is a proper subset of STAR and a glorified way of storing key:value pairs.
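
To make the "glorified key:value pairs" point concrete, here is a minimal sketch of reading the core CIF constructs in Go. It is not the parser being written for poly: `Block`, `tokenize`, and `ParseCIF` are hypothetical names, and the sketch only covers `data_` headers, `_tag value` pairs, `loop_` tables, `#` comments, and quoted strings. Semicolon-delimited text fields and save frames are deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Block is a hypothetical in-memory form of one CIF data block. Each tag maps
// to its values: one for a plain item, several for a loop_ column.
type Block struct {
	Name  string
	Items map[string][]string
}

type token struct {
	text   string
	quoted bool // quoted values are never tags or keywords
}

func isSpace(c byte) bool { return c == ' ' || c == '\t' || c == '\r' }

// tokenize drops '#' comments and unwraps quoted strings.
// Semicolon-delimited multi-line text fields are not handled.
func tokenize(src string) []token {
	var toks []token
	for _, line := range strings.Split(src, "\n") {
		for i := 0; i < len(line); {
			c := line[i]
			switch {
			case isSpace(c):
				i++
			case c == '#':
				i = len(line) // comment runs to end of line
			case c == '\'' || c == '"':
				// The closing quote must be followed by whitespace or end of line.
				j := i + 1
				for j < len(line) && !(line[j] == c && (j+1 == len(line) || isSpace(line[j+1]))) {
					j++
				}
				toks = append(toks, token{line[i+1 : j], true})
				i = j + 1
			default:
				j := i
				for j < len(line) && !isSpace(line[j]) {
					j++
				}
				toks = append(toks, token{line[i:j], false})
				i = j
			}
		}
	}
	return toks
}

func isTag(t token) bool { return !t.quoted && strings.HasPrefix(t.text, "_") }

func isKeyword(t token) bool {
	l := strings.ToLower(t.text)
	return !t.quoted && (l == "loop_" || strings.HasPrefix(l, "data_"))
}

// ParseCIF reads data_ blocks, "_tag value" pairs, and loop_ tables.
func ParseCIF(src string) ([]Block, error) {
	toks := tokenize(src)
	var blocks []Block
	var cur *Block
	for i := 0; i < len(toks); {
		t := toks[i]
		lower := strings.ToLower(t.text)
		switch {
		case !t.quoted && strings.HasPrefix(lower, "data_"):
			blocks = append(blocks, Block{Name: t.text[5:], Items: map[string][]string{}})
			cur = &blocks[len(blocks)-1]
			i++
		case cur == nil:
			return nil, fmt.Errorf("%q appears before any data_ block", t.text)
		case !t.quoted && lower == "loop_":
			i++
			var tags []string
			for ; i < len(toks) && isTag(toks[i]); i++ {
				tags = append(tags, toks[i].text)
			}
			if len(tags) == 0 {
				return nil, fmt.Errorf("loop_ without tags")
			}
			n := 0 // values are assigned to the loop's columns round-robin
			for ; i < len(toks) && !isTag(toks[i]) && !isKeyword(toks[i]); i++ {
				tag := tags[n%len(tags)]
				cur.Items[tag] = append(cur.Items[tag], toks[i].text)
				n++
			}
			if n%len(tags) != 0 {
				return nil, fmt.Errorf("loop_ has %d values for %d tags", n, len(tags))
			}
		case isTag(t) && i+1 < len(toks):
			cur.Items[t.text] = []string{toks[i+1].text}
			i += 2
		default:
			return nil, fmt.Errorf("unexpected token %q", t.text)
		}
	}
	return blocks, nil
}
```

Storing every item as a slice of strings keeps plain items and loop columns in one shape; a real parser would likely preserve loop grouping and item order as well.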

On top of this syntax sit the Dictionary Definition Languages (DDLs), which allow for the description of "dictionaries" that grant domain-specific meaning (+ validation) to the data items stored in a CIF file. Seems like DDL is a self-validating format, which is pretty neat! There are two competing DDL versions currently used to define the PDBx/mmCIF dictionary, DDLm and DDL2. Seems like DDLm is a superset of DDL2, so it's probably worth targeting DDLm in our efforts.

So, in summary: PDBx/mmCIF's syntax is defined by CIF v1.1, and its semantics are defined by DDLm (the syntax for which is again CIF v1.1, and the semantics for which are again defined by DDLm (yay recursion!)).

Action Items

It seems to me that the next two tasks are clear:

  • Write a CIF v1.1-compliant parser
  • Write a validator that can take a DDLm dictionary and apply it to a parsed CIF v1.1 file
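
As a rough picture of the second task, the sketch below checks parsed items against a dictionary. Everything here is an assumption for illustration: `ItemDef`, `Dictionary`, and `Validate` are hypothetical names, the dictionary is built by hand rather than loaded from a real DDL file, and the three type codes are simplified stand-ins for what a real dictionary defines.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// ItemDef is a hand-written stand-in for one dictionary item definition.
type ItemDef struct {
	Type      string   // "int", "float", or "text" (simplified type codes)
	Mandatory bool     // item must be present in the block
	Enum      []string // allowed values; empty means unrestricted
}

// Dictionary maps a tag such as "_atom_site.id" to its definition.
type Dictionary map[string]ItemDef

// checkValue applies the type and enumeration rules to a single value.
func checkValue(def ItemDef, v string) error {
	switch def.Type {
	case "int":
		if _, err := strconv.Atoi(v); err != nil {
			return fmt.Errorf("%q is not an integer", v)
		}
	case "float":
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			return fmt.Errorf("%q is not a number", v)
		}
	}
	if len(def.Enum) == 0 {
		return nil
	}
	for _, allowed := range def.Enum {
		if v == allowed {
			return nil
		}
	}
	return fmt.Errorf("%q is not one of %v", v, def.Enum)
}

// Validate returns every problem found, in a deterministic order:
// per-tag problems sorted by tag, then missing mandatory items.
func Validate(dict Dictionary, items map[string][]string) []error {
	var errs []error
	tags := make([]string, 0, len(items))
	for tag := range items {
		tags = append(tags, tag)
	}
	sort.Strings(tags)
	for _, tag := range tags {
		def, ok := dict[tag]
		if !ok {
			errs = append(errs, fmt.Errorf("%s: not defined in dictionary", tag))
			continue
		}
		for _, v := range items[tag] {
			if v == "." || v == "?" {
				continue // CIF placeholders for "inapplicable" and "unknown"
			}
			if err := checkValue(def, v); err != nil {
				errs = append(errs, fmt.Errorf("%s: %w", tag, err))
			}
		}
	}
	var missing []string
	for tag, def := range dict {
		if _, present := items[tag]; def.Mandatory && !present {
			missing = append(missing, tag)
		}
	}
	sort.Strings(missing)
	for _, tag := range missing {
		errs = append(errs, fmt.Errorf("%s: mandatory item is missing", tag))
	}
	return errs
}
```

Collecting all errors instead of stopping at the first one seems useful when scraping many files, since one report then covers everything wrong with an entry.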

Where to from there?

Based on my understanding, it would then be possible to write code generation tools that take a DDLm dictionary and generate the proper Go structs to represent the data. The alternative would be to manually create Go structs based on the current PDBx/mmCIF dictionary, which seems like a slog that would be prone to error. I have no idea how to go about writing code generation tools though, so I will absolutely need help on this!
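
One possible starting point for the code generation idea, using only the standard library: render a struct per category with `text/template`, then run the result through `go/format`. This is a sketch under assumptions, not a design for poly. `Category`, `Field`, `GenerateStructs`, and the `cif` struct tag are hypothetical, and the category description is written by hand here where a real tool would derive it from the dictionary.

```go
package main

import (
	"bytes"
	"go/format"
	"strings"
	"text/template"
)

// Field describes one data item; GoType is chosen by the caller.
type Field struct {
	Tag    string // CIF item name, e.g. "_atom_site.Cartn_x"
	GoType string // e.g. "int", "float64", "string"
}

// Category is a hypothetical, simplified view of one dictionary category.
type Category struct {
	Name   string // e.g. "atom_site"
	Fields []Field
}

// exportName turns "atom_site" into "AtomSite".
func exportName(s string) string {
	parts := strings.FieldsFunc(s, func(r rune) bool { return r == '_' || r == '.' })
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(strings.ToUpper(p[:1]) + p[1:])
	}
	return b.String()
}

// fieldName turns "_atom_site.Cartn_x" into "CartnX" by dropping the
// category prefix. Names that start with a digit are not handled.
func fieldName(tag string) string {
	return exportName(tag[strings.LastIndex(tag, ".")+1:])
}

var structTmpl = template.Must(template.New("struct").Funcs(template.FuncMap{
	"export": exportName,
	"field":  fieldName,
}).Parse("// {{export .Name}} mirrors the {{.Name}} category.\n" +
	"type {{export .Name}} struct {\n" +
	"{{range .Fields}}\t{{field .Tag}} {{.GoType}} `cif:\"{{.Tag}}\"`\n{{end}}" +
	"}\n"))

// GenerateStructs renders one struct per category and gofmt-formats the file.
func GenerateStructs(pkg string, cats []Category) (string, error) {
	var buf bytes.Buffer
	buf.WriteString("package " + pkg + "\n\n")
	for _, c := range cats {
		if err := structTmpl.Execute(&buf, c); err != nil {
			return "", err
		}
		buf.WriteString("\n")
	}
	out, err := format.Source(buf.Bytes())
	if err != nil {
		return "", err
	}
	return string(out), nil
}
```

Running the output through `format.Source` also acts as a cheap sanity check, since it fails if the template produced something that is not syntactically valid Go.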

See also

See Westbrook et al. 2022 for a nice overview of the current state of the PDBx/mmCIF ecosystem, as well as this tutorial on wwPDB for a brief but less rigorous intro to the PDBx/mmCIF format.

@carreter
Collaborator

Update on this: CIF parser is nearing completion, will put a PR up soon™️.

@carreter carreter self-assigned this Sep 16, 2023
@carreter carreter added enhancement New feature or request medium priority The default priority for a new issue. hard A major or complex undertaking labels Sep 16, 2023
@carreter carreter moved this to In Progress in poly development roadmap Sep 23, 2023
@carreter carreter added this to the v1.0 milestone Sep 23, 2023
@TimothyStiles
Collaborator Author

@carreter where is this on your roadmap for after this semester? By chance I met @ethanholz at a conference the other week; they joined our Discord, wrote this in Zig, and had some pretty good insights.

https://github.com/ethanholz/sonic-pdb-parser

@ethanholz

I am very happy to help in whatever way I can! I am a Go dev by trade but started exploring that parser on the side with Zig.

@carreter
Collaborator

> @carreter where is this on your roadmap for after this semester? By chance I met @ethanholz at a conference the other week; they joined our Discord, wrote this in Zig, and had some pretty good insights.
>
> https://github.com/ethanholz/sonic-pdb-parser

Probably not super high priority.

Semester ends on 12/13, gonna relax for a bit then ramp up to full time paid work on poly during January! I'll be focusing on tools for Prof. Weiss's lab, but this will be something I'm still working on during that time.

Currently have a working mmCIF parser, but the DDLm part still needs to be written.

@carreter carreter removed their assignment Dec 19, 2023
@carreter
Collaborator

.take


Thanks for taking this issue! Let us know if you have any questions!


4 participants