so, belated realization that I suspect others saw coming a mile away, but...
once we can point at & directly load signatures in other databases, then we can do interesting things with metadata.
tl;dr indirection is super cool.
overriding names (and maybe other things)
since we use the md5 column to retrieve sketches, we could rename signatures by simply outputting a manifest with a new name column, and then telling sourmash to take the name from the manifest rather than the signature itself.
we could also build a standard "patch" format that removes, deprecates, and/or replaces signatures per ideas in #985
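a rough sketch of the renaming idea, in plain Python rather than sourmash internals: read a manifest-style CSV, build an md5 -> name map, and prefer the manifest name over the name embedded in the signature. The function names (`load_name_overrides`, `effective_name`) are hypothetical, and the sketch only assumes the CSV has `md5` and `name` columns plus optional `#` comment lines at the top.

```python
import csv


def load_name_overrides(manifest_text):
    """Build an md5 -> name map from manifest-style CSV text.

    Assumption: the CSV has 'md5' and 'name' columns; lines starting
    with '#' (e.g. a version header) are skipped.
    """
    lines = [ln for ln in manifest_text.splitlines() if not ln.startswith("#")]
    reader = csv.DictReader(lines)
    # Rows with an empty name do not override anything.
    return {row["md5"]: row["name"] for row in reader if row.get("name")}


def effective_name(md5, embedded_name, overrides):
    """Prefer the manifest-supplied name; fall back to the signature's own."""
    return overrides.get(md5, embedded_name)
```

the point is that the signature file is never rewritten; only the lookup table changes.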
including taxonomy in manifest-style CSV files
StandaloneManifestIndex should be able to simply ignore extra columns, so there's (almost) no reason not to just provide for manifest+taxonomy columns that can then be used for taxonomic retrieval and so on.
You could further modify commands like sig grep to search even ignored columns, which provides sig grep taxonomy as an extra; e.g. #1868
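as an illustration of searching extra columns, here is a minimal sketch that greps every column of a manifest-style CSV, including ones the index itself would ignore. This is not how `sig grep` is implemented; `grep_rows` and the `lineage` / `tags` column names are made up for the example.

```python
import csv
import re


def grep_rows(manifest_text, pattern, columns=None):
    """Return rows where `pattern` matches any (or the selected) columns.

    Assumption: `manifest_text` is CSV with a header row; '#' comment
    lines are skipped. Matching is case-insensitive.
    """
    lines = [ln for ln in manifest_text.splitlines() if not ln.startswith("#")]
    regex = re.compile(pattern, re.IGNORECASE)
    matches = []
    for row in csv.DictReader(lines):
        # Default to every column present, so extra columns are searched too.
        fields = columns or row.keys()
        if any(regex.search(row.get(col) or "") for col in fields):
            matches.append(row)
    return matches
```

restricting `columns` to a single taxonomy column would give something like a taxonomy-only grep.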
adding tags
similarly, providing extra columns that could be searched would readily enable tagging and folksonomies (custom ad hoc ontologies).
allowing/using more structured metadata
CSVs are limiting!
a more intriguing idea is to take the concept of a StandaloneManifestIndex for a ride and support a more flexible metadata format that ultimately references md5s.
The simplest version of this would be (in a YAML-like format for readability) -
followed by all the other metadata. Here the only reason to provide an index_location is to make it loadable; you could imagine two extensions -
first, provide metadata files that contain md5s w/o a specific index location, and then have a generic way in sourmash to cross-product such metadata with a list of locations where it just takes the first signature w/matching md5sum.
second, grow that list of locations into a databases format that references globally addressable databases, and then automatically manage local caching of those databases in some standard-ish location.
(this would make the metadata files above portable across systems)
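the first extension could look roughly like this: metadata entries carry only an md5, and a separate ordered list of locations is searched until one contains a matching sketch. Everything here is a stand-in; `resolve_locations` is hypothetical, and each location is modeled as a plain set of md5s rather than a real sourmash index.

```python
def resolve_locations(metadata_entries, locations):
    """Pair each metadata entry with the first location holding its md5.

    Assumptions: `metadata_entries` is a list of dicts with an 'md5' key;
    `locations` is an ordered list of (location_name, set_of_md5s) pairs,
    standing in for real databases.
    """
    resolved = {}
    for entry in metadata_entries:
        md5 = entry["md5"]
        # First match wins; None means no listed location has this sketch.
        resolved[md5] = next(
            (name for name, md5s in locations if md5 in md5s), None
        )
    return resolved
```

because the metadata never names a location, the same file works on any system that supplies its own location list.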
we could allow for several 'standard' keys for references - for example, 'name' could be another one, if we wanted to refer more broadly to metadata about a sequence.
this would also let us store multiple taxonomies in a single metadata file, although of course we'd want to make that file updateable too, so that we can update it with new taxonomy releases.
(maybe bdbags #991 could be a way to distribute metadata files with databases and then update things semi-automatically?)
other thoughts
this links into/enables other thoughts in other issues like #268,
allowing hypothesis annotations directly on sourmash objects
referring to DOIs and databases and provenance