it's been a busy few days wrangling some basic picklist functionality into submission (#1587) and figuring out how to make it work quickly on large databases. Here is a summary!
a guide to the pull requests
picklists!
#1587 added the basic picklist class and associated command-line functionality to sourmash sig extract.
The next PR, #1588, added picklist functionality to Index.select(...), so that we could take advantage of picklists throughout the codebase. I added picklist command line options to search, gather, prefetch, and compare as part of this PR.
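To make the idea concrete, here is a minimal sketch of what a picklist does: it holds a set of wanted values and checks each signature's metadata against them. The class name `PicklistSketch`, its methods, and the CSV column names are all hypothetical, chosen for illustration; this is not the class added in #1587.

```python
import csv
import io


class PicklistSketch:
    """Hypothetical stand-in for a picklist; not sourmash's actual class."""

    def __init__(self, field, values):
        self.field = field          # metadata field to compare against
        self.values = set(values)   # values we want to pick
        self.found = set()          # values matched so far

    @classmethod
    def from_csv(cls, fp, column, field):
        # Read the wanted values from one column of a CSV file.
        reader = csv.DictReader(fp)
        return cls(field, (row[column] for row in reader))

    def matches(self, metadata):
        # True if this signature's metadata value is on the picklist.
        value = metadata.get(self.field)
        if value in self.values:
            self.found.add(value)
            return True
        return False

    def missing(self):
        # Picklist values that never matched anything.
        return self.values - self.found


# Illustrative picklist CSV and signature metadata (made-up values).
pick_csv = io.StringIO("ident,note\nsigA,keep\nsigC,keep\nsigZ,absent\n")
picklist = PicklistSketch.from_csv(pick_csv, column="ident", field="name")

collection = [
    {"name": "sigA", "ksize": 31},
    {"name": "sigB", "ksize": 31},
    {"name": "sigC", "ksize": 21},
]
picked = [meta for meta in collection if picklist.matches(meta)]
```

Tracking which picklist values never matched is a design choice in this sketch; it makes it easy to report entries that are absent from the collection.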
Now that #1588 is merged, we have the basic functionality spread through the code base, but picklists are still relatively slow on large databases. That's where manifests come in.
manifests!
The picklist functionality in #1588 relies on iterating over large collections of signatures, and that means loading every signature just to check its metadata, which is exactly the cost we want to avoid.
So, #1590 introduces manifests into Index objects like databases. Manifests are catalogs of the metadata for all signatures in the Index. Their current fields are:
With this information, manifests support all of the arguments to selectors, i.e. if a manifest is available, db.select(...) can entirely avoid loading any signatures. PR #1590 introduces sourmash sig manifest to create manifests, supports manifests for selection (including with picklists) and adds manifest functionality to Zipfile collections.
So, #1590 makes it possible to do extremely fast subsetting of very large databases, as long as they have manifests.
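As a rough illustration of why this is fast, the sketch below filters a small in-memory catalog of metadata rows and returns only the locations of the matching signatures, without opening any signature. `ManifestSketch`, its `select` arguments, and the row field names are assumptions made for this example; they are not the manifest implementation or field list from #1590.

```python
# Hypothetical manifest rows. The field names are illustrative assumptions,
# not the actual manifest field list.
MANIFEST_ROWS = [
    {"internal_location": "signatures/a.sig", "name": "sigA", "ksize": 31, "moltype": "DNA"},
    {"internal_location": "signatures/b.sig", "name": "sigB", "ksize": 21, "moltype": "DNA"},
    {"internal_location": "signatures/c.sig", "name": "sigC", "ksize": 31, "moltype": "DNA"},
    {"internal_location": "signatures/d.sig", "name": "sigD", "ksize": 31, "moltype": "protein"},
]


class ManifestSketch:
    """Hypothetical catalog of signature metadata; not sourmash's class."""

    def __init__(self, rows):
        self.rows = list(rows)

    def select(self, ksize=None, moltype=None, pick_names=None):
        # Filter on metadata only; no signature is loaded here.
        def keep(row):
            if ksize is not None and row["ksize"] != ksize:
                return False
            if moltype is not None and row["moltype"] != moltype:
                return False
            if pick_names is not None and row["name"] not in pick_names:
                return False
            return True

        return ManifestSketch(row for row in self.rows if keep(row))

    def locations(self):
        # Where the selected signatures live inside the collection.
        return [row["internal_location"] for row in self.rows]


manifest = ManifestSketch(MANIFEST_ROWS)
subset = manifest.select(ksize=31, moltype="DNA", pick_names={"sigA", "sigC", "sigD"})
```

Because `select` returns another manifest, selections can be chained, and only the final list of locations ever needs to be turned into loaded signatures.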
(As a further demonstration of manifests, #1597 adds manifest creation and storage to SBTs.)
how does this figure into greyhound?
greyhound is our codename for the functionality that adds massively parallel search to databases - see #1226. It relies on using Rust to do multithreaded search of large sequence collections.
based on a chat with @luizirber, the current challenge in implementing greyhound in the main code base is this:
we neither want to move all the specialized Index APIs into Rust, nor do we want to pass a lot of signature objects through the Python/Rust FFI layer. The option that we think is preferable is to support one or more Storage classes - ZipStorage in particular - and then pass lists of internal locations from Python to Rust, and do the signature loading and so on in Rust.
And that's where #1598 comes in - #1598 refactors ZipFileLinearCollection to use ZipStorage underneath, and also adds associated manifest creation functionality when creating .zip collections. To my understanding, if and when #1598 is merged, we will be able to "just" pass the name of the Zip file and the internal locations of signatures to search over to Rust, and parallelized search will be possible 🎉
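The hand-off described above can be sketched in pure Python as a stand-in for the Rust side: the caller passes only a zip file name and a list of internal locations, and the workers do the loading themselves. This uses Python threads and the standard `zipfile` module rather than Rust or `ZipStorage`, and the function names `load_entry` and `load_many`, the file names, and the payloads are all hypothetical.

```python
import os
import tempfile
import zipfile
from concurrent.futures import ThreadPoolExecutor


def load_entry(zip_path, location):
    # Each call opens its own handle, so workers share no state.
    with zipfile.ZipFile(zip_path) as zf:
        return location, zf.read(location)


def load_many(zip_path, locations, workers=4):
    # Only the zip name and the internal locations cross this boundary;
    # no loaded signature objects are passed in.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda loc: load_entry(zip_path, loc), locations))


# Build a tiny illustrative zip collection with made-up contents.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "collection.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    for name in ("a", "b", "c"):
        zf.writestr(f"signatures/{name}.sig", f"payload-{name}")

loaded = load_many(zip_path, ["signatures/a.sig", "signatures/c.sig"])
```

The point of the sketch is the shape of the interface: two plain values (a path and a list of strings) are cheap to pass across a language boundary, whereas loaded objects are not.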
feedback and questions welcome :)
ctb changed the title from "the picklist manifest: picklists, manifests, and greyhound" to "the picklist manifesto: picklists, manifests, and greyhound" on Jun 18, 2021