PDBx Parser #297
Comments
Here's the actual file format - http://www.wwpdb.org/documentation/file-format I might be interested in taking this on in a month or so. Wouldn't mind outlining what needs to get done first. One of the underlying formats is STAR - https://pubs.acs.org/doi/10.1021/ci00019a005 |
@rkrishnasanka Pinged you over on the Discord about this, pinging you here as well - I'm thinking of picking this up and was wondering if you'd made any progress or would like to collaborate! |
What actually is PDBx/mmCIF, anyway?

Did a bit more research on this, and it seems like the underlying syntax for PDBx/mmCIF is CIF v1.1, which is a proper subset of STAR and a glorified way of storing key:value pairs. On top of this syntax exist the Dictionary Definition Languages, which allow for the description of "dictionaries" that grant domain-specific meaning (+ validation) to the data items stored in a CIF file. Seems like DDL is a self-validating format, which is pretty neat!

There are two competing DDL versions currently used to store the PDBx/mmCIF, DDLm and DDL2. Seems like DDLm is a superset of DDL2, so it's probably worth targeting DDLm in our efforts.

So, in summary: PDBx/mmCIF's syntax is defined by CIF v1.1, and its semantics are defined by DDLm (the syntax for which is again CIF v1.1, and semantics for which is again defined by DDLm (yay recursion!)).

Action Items

It seems to me that the next two tasks are clear:
Where to from there?

Based on my understanding, it would then be possible to write code generation tools that take a DDLm dictionary and generate the proper Go structs to represent the data. The alternative would be to manually create Go structs based on the current PDBx/mmCIF dictionary, which seems like a slog that would be prone to error. I have no idea how to go about writing code generation tools though, so I will absolutely need help on this!

See also

See Westbrook et al. 2022 for a nice overview of the current state of the PDBx/mmCIF ecosystem, as well as this tutorial on wwPDB for a brief but less rigorous intro to the PDBx/mmCIF format. |
Update on this: CIF parser is nearing completion, will put a PR up soon™️. |
@carreter where is this on your roadmap for after this semester? By chance I met @ethanholz at a conference the other week; he joined our Discord, wrote this in Zig, and had some pretty good insights. |
I am very happy to help in whatever way I can! I am a Go dev by trade but started exploring that parser on the side with Zig. |
Probably not super high priority. Semester ends on 12/13, gonna relax for a bit then ramp up to full time paid work on poly during January! I'll be focusing on tools for Prof. Weiss's lab, but this will be something I'm still working on during that time. Currently have a working mmCIF parser, but the DDLm part still needs to be written. |
.take |
Thanks for taking this issue! Let us know if you have any questions! |
We're getting structural, but first we need a parser for PDBx so we can scrape and parse all of the Protein Data Bank.
I've seen a couple of Go parsers for PDBx but am unsure of their quality. In the end we need to be able to parse a lot of these files:
https://www.wwpdb.org/deposition/preparing-pdbx-mmcif-files