manifests -> more interesting things with metadata #1916

Open
ctb opened this issue Mar 30, 2022 · 1 comment
Comments

ctb commented Mar 30, 2022

so, belated realization that I suspect others saw coming a mile away, but...

once we can point at & directly load signatures in other databases, we can do interesting things with metadata.

tl;dr indirection is super cool.

overriding names (and maybe other things)

since we use the md5 column to retrieve sketches, we could rename signatures by simply outputting a manifest with a new name column, and then telling sourmash to take the name from the manifest rather than from the signature itself.
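A minimal sketch of the renaming idea, using only the Python standard library rather than any sourmash API. It assumes a manifest-style CSV with at least `md5` and `name` columns and optional `#` comment lines at the top; `load_name_overrides` and `effective_name` are hypothetical helper names, not existing sourmash functions.

```python
import csv


def load_name_overrides(manifest_text):
    """Return a dict of md5 -> name taken from manifest-style CSV text.

    Assumption: the CSV has 'md5' and 'name' columns; lines starting
    with '#' are treated as comments and skipped.
    """
    lines = [ln for ln in manifest_text.splitlines() if not ln.startswith("#")]
    reader = csv.DictReader(lines)
    # Rows with an empty name do not override anything.
    return {row["md5"]: row["name"] for row in reader if row.get("name")}


def effective_name(md5, signature_name, overrides):
    """Prefer the manifest's name for this md5; fall back to the signature's own."""
    return overrides.get(md5, signature_name)
```

The sketch itself is never touched: only the lookup table changes, so a rename amounts to shipping a new manifest.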

we could also build a standard "patch" format that removes, deprecates, and/or replaces signatures per the ideas in #985.

including taxonomy in manifest-style CSV files

StandaloneManifestIndex should be able to simply ignore extra columns, so there's (almost) no reason not to just provide for manifest+taxonomy columns that can then be used for taxonomic retrieval and so on.

You could further modify commands like sig grep to search even ignored columns, which provides sig grep taxonomy as an extra; e.g. #1868
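As a rough illustration of searching extra columns, the sketch below matches a pattern against every column of each manifest row, including ones the index itself would ignore. It does not reproduce how sig grep works internally; `grep_rows` and the `lineage` column mentioned in the comments are assumptions made for the example.

```python
import re


def grep_rows(rows, pattern, columns=None):
    """Yield rows (dicts, e.g. from csv.DictReader) where any column matches.

    Assumption: 'columns' optionally restricts the search, for example to a
    hypothetical 'lineage' column holding taxonomy; by default every column
    present in the row is searched.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for row in rows:
        cols = columns if columns is not None else row.keys()
        # Missing or empty cells are searched as empty strings.
        if any(regex.search(row.get(col) or "") for col in cols):
            yield row
```

Because the matching rows still carry their md5 values, the result can be written back out as a smaller manifest.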

adding tags

similarly, providing extra columns that could be searched would readily enable tagging and folksonomies (custom ad hoc ontologies).
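One way tagging could look, sketched under the assumption of a free-form `tags` column with semicolon-separated values. Both the column name and the separator are invented for illustration, as is the `build_tag_index` helper.

```python
from collections import defaultdict


def build_tag_index(rows, column="tags", sep=";"):
    """Return a dict of tag -> set of md5s from manifest-style rows.

    Assumption: each row is a dict with an 'md5' key and an optional
    tags column; tags are normalized to lowercase with whitespace stripped.
    """
    index = defaultdict(set)
    for row in rows:
        for tag in (row.get(column) or "").split(sep):
            tag = tag.strip().lower()
            if tag:  # skip empty entries such as a trailing separator
                index[tag].add(row["md5"])
    return dict(index)
```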

allowing/using more structured metadata

CSVs are limiting!
a more intriguing idea is to take the concept of a StandaloneManifestIndex for a ride and support a more flexible metadata format that ultimately references md5s.

The simplest version of this would be (in a YAML-like format for readability) -

---
index_location: path/to/zip
md5: c11126d0591db94cd3d1c8568499375f
---

followed by all the other metadata. Here the only reason to provide an index_location is to make it loadable; you could imagine two extensions:

  • first, provide metadata files that contain md5s w/o a specific index location, and then have a generic way in sourmash to cross-product such metadata with a list of locations where it just takes the first signature w/matching md5sum.
  • second, grow that list of locations into a databases format that references globally addressable databases, and then automatically manage local caching of those databases in some standard-ish location.
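The first extension (metadata without a location, resolved against a list of locations) can be sketched as below. Locations are modeled as plain in-memory dicts of md5 -> signature so the example runs on its own; in practice each would be a loaded database, and `resolve` is a hypothetical name, not a sourmash function.

```python
def resolve(records, locations):
    """Pair each metadata record with the first location holding its md5.

    records:   iterable of dicts, each with at least an 'md5' key.
    locations: ordered list of (location_name, {md5: signature}) pairs;
               earlier locations win when an md5 appears in several.
    Records whose md5 is found nowhere are skipped in this sketch.
    """
    for record in records:
        md5 = record["md5"]
        for location_name, signatures in locations:
            if md5 in signatures:
                yield record, location_name, signatures[md5]
                break  # take only the first match
```

The order of `locations` acts as the precedence rule, so a local cache could simply be listed ahead of remote databases.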

we could allow for several 'standard' keys for references - for example, 'name' could be another one, if we wanted to refer more broadly to metadata about a sequence.

this would also let us store multiple taxonomies in a single metadata file, although of course we'd want to make that file updateable too, so that we can update it with new taxonomy releases.

(maybe bdbags #991 could be a way to distribute metadata files with databases and then update things semi-automatically?)

other thoughts

this links into/enables thoughts in other issues like #268:

  • allowing hypothesis annotations directly on sourmash objects
  • referring to DOIs and databases and provenance
ctb commented Jan 31, 2023
