the picklist manifesto: picklists, manifests, and greyhound #1599

Closed
ctb opened this issue Jun 16, 2021 · 2 comments

ctb commented Jun 16, 2021

it's been a busy few days wrangling some basic picklist functionality into submission (#1587) and figuring out how to make it work quickly on large databases. Here is a summary!

a guide to the pull requests

picklists!

#1587 added the basic picklist class and associated command-line functionality to sourmash sig extract.

The next PR, #1588, added picklist functionality to Index.select(...), so that we could take advantage of picklists throughout the codebase. I added picklist command line options to search, gather, prefetch, and compare as part of this PR.

Now that #1588 is merged, we have the basic functionality spread through the code base, but picklists are still relatively slow on large databases. That's where manifests come in.
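To make the idea concrete, here is a minimal sketch of what a picklist does conceptually: it reads one column from a CSV file and keeps only the signatures whose value matches. The function names, the dict-based "signatures", and the CSV contents are all hypothetical stand-ins for illustration, not the sourmash API. Note that the filter has to touch every loaded signature, which is the slow part on large databases.

```python
import csv
import io


def load_picklist(csv_text, column):
    """Hypothetical helper: collect the values of one CSV column into a set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader}


def filter_signatures(signatures, pickset, key="md5"):
    """Yield only the signatures whose `key` value is in the picklist.

    Every signature must already be loaded to be checked here.
    """
    for sig in signatures:
        if sig[key] in pickset:
            yield sig


# Made-up example data; real signatures are richer objects.
PICKFILE = "md5,name\nabc123,genome A\ndef456,genome C\n"
SIGS = [
    {"md5": "abc123", "name": "genome A"},
    {"md5": "999999", "name": "genome B"},
    {"md5": "def456", "name": "genome C"},
]

picks = load_picklist(PICKFILE, "md5")
selected = [s["name"] for s in filter_signatures(SIGS, picks)]
print(selected)
```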

manifests!

The picklist functionality in #1588 relies on iterating over large collections of signatures. That means loading every signature, which is exactly what we want to avoid.

So, #1590 adds manifests to Index objects such as databases. Manifests are catalogs of the metadata for all signatures in the Index. Their current fields are:

('internal_location',
'md5', 'md5short', 'ksize', 'moltype', 'num',
'scaled', 'n_hashes', 'with_abundance',
'name', 'filename')

With this information, manifests support all of the arguments to selectors: if a manifest is available, db.select(...) can avoid loading any signatures at all. PR #1590 introduces sourmash sig manifest to create manifests, uses manifests for selection (including with picklists), and adds manifest functionality to Zipfile collections.
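A rough sketch of why this is fast: selection becomes a filter over small metadata rows, and only the internal locations of the matching rows are needed afterwards. The ManifestRow class and select_rows function below are hypothetical and use only a subset of the fields listed above; they are not the actual sourmash classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestRow:
    """Hypothetical manifest row, using a subset of the fields listed above."""
    internal_location: str
    md5: str
    ksize: int
    moltype: str
    scaled: int
    name: str


def select_rows(rows, *, ksize=None, moltype=None, picklist=None):
    """Filter manifest rows by metadata only; no signature is ever loaded."""
    selected = []
    for row in rows:
        if ksize is not None and row.ksize != ksize:
            continue
        if moltype is not None and row.moltype != moltype:
            continue
        # Here the picklist is assumed to be a set of md5 values.
        if picklist is not None and row.md5 not in picklist:
            continue
        selected.append(row)
    return selected


# Made-up rows for illustration.
ROWS = [
    ManifestRow("signatures/a.sig", "abc123", 31, "DNA", 1000, "genome A"),
    ManifestRow("signatures/b.sig", "999999", 21, "DNA", 1000, "genome B"),
    ManifestRow("signatures/c.sig", "def456", 31, "protein", 100, "genome C"),
]

dna_k31 = select_rows(ROWS, ksize=31, moltype="DNA")
picked = select_rows(ROWS, picklist={"999999", "def456"})
locations = [row.internal_location for row in picked]
print(locations)
```

Only the rows that survive selection would then need their signatures loaded from internal_location.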

So, #1590 makes it possible to do extremely fast subsetting of very large databases, as long as they have manifests.

(As a further demonstration of manifests, #1597 adds manifest creation and storage to SBTs.)

how does this figure into greyhound?

greyhound is our codename for the functionality that adds massively parallel search to databases - see #1226. It relies on using Rust to do multithreaded search of large sequence collections.

based on a chat with @luizirber, the current challenge in implementing greyhound in the main code base is this:

We don't want to move all the specialized Index APIs into Rust, and we don't want to pass a lot of signature objects through the Python/Rust FFI layer either. The option we prefer is to support one or more Storage classes (ZipStorage in particular), pass lists of internal locations from Python to Rust, and do the signature loading and searching in Rust.

And that's where #1598 comes in - #1598 refactors ZipFileLinearCollection to use ZipStorage underneath, and also adds associated manifest creation functionality when creating .zip collections. To my understanding, if and when #1598 is merged, we will be able to "just" pass the name of the Zip file and the internal locations of signatures to search over to Rust, and parallelized search will be possible 🎉
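The shape of that hand-off can be sketched in pure Python, with a thread pool standing in for the Rust side. Everything here is an assumption made for illustration: the toy signature format (a JSON list of hashes), the function names, and the overlap count used as a "search" are not the sourmash or greyhound implementation. The point is only that each worker receives a Zip file name plus an internal location, and does its own loading.

```python
import json
import os
import tempfile
import zipfile
from concurrent.futures import ThreadPoolExecutor


def build_demo_zip(path):
    """Write a toy zip collection; each member is a JSON list of hashes."""
    sigs = {
        "signatures/a.sig": [1, 2, 3, 4],
        "signatures/b.sig": [3, 4, 5],
        "signatures/c.sig": [9, 10],
    }
    with zipfile.ZipFile(path, "w") as zf:
        for location, hashes in sigs.items():
            zf.writestr(location, json.dumps(hashes))


def overlap_at(zip_path, location, query):
    """Worker: open the zip by name, load one member, count shared hashes."""
    with zipfile.ZipFile(zip_path) as zf:
        hashes = set(json.loads(zf.read(location)))
    return location, len(hashes & query)


def parallel_search(zip_path, locations, query, workers=4):
    """Search only the given internal locations, one task per location."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda loc: overlap_at(zip_path, loc, query), locations)
        return dict(results)


with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "demo.zip")
    build_demo_zip(zip_path)
    # The locations would come from manifest selection; c.sig is skipped.
    hits = parallel_search(
        zip_path, ["signatures/a.sig", "signatures/b.sig"], {3, 4, 5}
    )

print(hits)
```

Because only strings cross the boundary (a file name and a list of locations), no signature objects need to be passed between the two sides.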

feedback and questions welcome :)

@ctb ctb changed the title the picklist manifest: picklists, manifests, and greyhound the picklist manifesto: picklists, manifests, and greyhound Jun 18, 2021

ctb commented Jun 19, 2021

something that is kind of supported by manifests with little extra effort - #268 - folksonomy and tagging!

ctb commented Mar 26, 2022

I think this can be closed - greyhound is not yet implemented on the Rust side, but all the machinery is there. 🎉

@ctb ctb closed this as completed Mar 26, 2022