Restructure repo to avoid need for LFS; drop LFS #53

Closed
dotsdl opened this issue May 10, 2022 · 5 comments · Fixed by #52

@dotsdl
Member

dotsdl commented May 10, 2022

We'd like to drop LFS use if possible to make this repo more friendly to developers. This would likely entail removing e.g. Gromacs-specific files and sticking with PDBs and SDFs for system representation.

@dotsdl dotsdl added this to the Release 0.4.0 milestone May 10, 2022
@IAlibay
Collaborator

IAlibay commented May 10, 2022

So in terms of size of things we would want to keep:
Size of data//01_protein/crd - (non-solvated protein PDBs + cofactor PDBs) - 8.5 M
Size of data//02_ligands/*/crd - (SDF ligand files) - 5.2 M

@IAlibay
Collaborator

IAlibay commented May 24, 2022

A quick question here - how would we feel about concatenating the ligand SDFs to a single file per system? It would make life a lot easier rather than reading a bunch individually, but maybe there's an alternative use case that would prefer not having everything in the one file?
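
For reference, the concatenation being asked about could look roughly like the following. This is only a minimal sketch: the function name `concatenate_sdfs` and its arguments are hypothetical, and it relies only on the SDF convention that every record ends with a `$$$$` delimiter line.

```python
from pathlib import Path


def concatenate_sdfs(sdf_paths, out_path):
    """Write the records of several SDF files into one multi-record SDF.

    Hypothetical helper for illustration; not part of this repository.
    Returns the number of input files that were written out.
    """
    written = 0
    with open(out_path, "w") as out:
        for path in sdf_paths:
            text = Path(path).read_text()
            if not text.strip():
                continue  # skip empty files
            if not text.endswith("\n"):
                text += "\n"
            # Every SDF record must end with a "$$$$" delimiter line;
            # add one if the input was a bare molfile without it.
            if not text.rstrip().endswith("$$$$"):
                text += "$$$$\n"
            out.write(text)
            written += 1
    return written
```

Called as `concatenate_sdfs(sorted(ligand_dir.glob("*.sdf")), "ligands.sdf")`, where `ligand_dir` is a placeholder `Path`, this would produce one file per system.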

@j-wags
Member

j-wags commented May 24, 2022

I'd recommend keeping them as totally separate files. This will improve interoperability and reliability.

Multi-structure SDFs are a little risky, since there's ambiguity about whether the structures are conformers of the same molecule or totally separate molecules. Different tools make different assumptions here - OpenEye has a whole thing about which behaviors are triggered in which cases and OpenFF just says "multi-conformer SDFs don't exist, they're always separate molecules". I forgot what RDKit does, but my point is that different tools will have different behaviors when confronted with a multi-molecule SDF and it will eventually affect software reliability.

@dotsdl
Member Author

dotsdl commented May 31, 2022

We'll handle this in #52. I've already dropped the LFS filters there in 7718225. That will keep files from being handled by LFS once we merge that PR. We'll remove the Gromacs-specific files and simplify the file structure for targets+ligands in that PR as well.
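
As a rough illustration of what dropping the LFS filters means: Git LFS tracking rules live in `.gitattributes` as lines carrying a `filter=lfs` attribute (for example `*.gro filter=lfs diff=lfs merge=lfs -text`). The sketch below removes such lines from the text of a `.gitattributes` file. It is an assumption about the general mechanism, not a description of what the commit mentioned above actually changed, and the function name `drop_lfs_filters` and the example pattern are hypothetical.

```python
def drop_lfs_filters(gitattributes_text):
    """Return .gitattributes content with Git LFS tracking rules removed.

    Hypothetical helper for illustration. A rule is treated as an LFS rule
    if one of its whitespace-separated attributes is exactly "filter=lfs".
    Comment lines and all other rules are kept unchanged.
    """
    kept = []
    for line in gitattributes_text.splitlines():
        is_comment = line.lstrip().startswith("#")
        if not is_comment and "filter=lfs" in line.split():
            continue  # drop the LFS tracking rule
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

Removing these rules only stops new commits from routing matching files through LFS; objects already stored in LFS in earlier commits are not rewritten by this.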

@jchodera
Member

jchodera commented Sep 1, 2022

All: Apologies for missing this conversation.

Are there any concrete examples where concatenated SDF files present significant, existential risk to real free energy pipelines? It seems like we could easily address this if needed by delivering multimolecule SDF files, but providing a very simple script to break up ligands into different files.

This would have a huge advantage over maintaining literally thousands of individual SDF files in github, since nearly all tools are happy to process these files together. Processing a bunch of separate SDF files for anything, even visualization, is just a huge pain.
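
A "very simple script" of the kind suggested here might look something like the sketch below. It assumes only the SDF convention that each record ends with a `$$$$` line; the function name `split_sdf` and the `ligand_NNNN.sdf` output naming are hypothetical choices, not something that exists in this repository.

```python
from pathlib import Path


def split_sdf(sdf_path, out_dir):
    """Split a multi-record SDF into one file per record.

    Hypothetical helper for illustration. Records are written as
    ligand_0000.sdf, ligand_0001.sdf, ... in input order. Any trailing
    lines that are not terminated by "$$$$" are ignored.
    Returns the list of written paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    record = []
    with open(sdf_path) as handle:
        for line in handle:
            record.append(line)
            if line.strip() == "$$$$":  # end of the current record
                out_path = out_dir / f"ligand_{len(written):04d}.sdf"
                out_path.write_text("".join(record))
                written.append(out_path)
                record = []
    return written
```

Index-based file names sidestep the question of whether record titles are unique or filesystem-safe; a variant could use the first line of each record (the molecule title) instead.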

@dotsdl dotsdl closed this as completed in #52 Oct 7, 2022