Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

safer ksize selection in signature::Select #3026

Closed
bluegenes opened this issue Feb 22, 2024 · 2 comments · Fixed by #3028
Closed

safer ksize selection in signature::Select #3026

bluegenes opened this issue Feb 22, 2024 · 2 comments · Fixed by #3028
Labels

Comments

@bluegenes
Copy link
Contributor

bluegenes commented Feb 22, 2024

Protein k-mer sizes are k=k*3 internally, but k in a manifest.

Insignature::Select, we account for this discrepancy like this:

valid = if let Some(ksize) = selection.ksize() {
    let k = s.ksize() as u32;
    k == ksize || k == ksize * 3

So here we match exact ksize or k=k*3, regardless of ksize or molecule type. This is b/c we don't have access to is_protein or other relevant properties unless we load the minhash.

This is fine for most common uses, b/c

  1. If using a manifest, manifest selection works on k, so we should only be using signature::select with signatures of the right ksize already
  2. Even if not using manifest, we rarely use k-mer sizes that are 3x other ksizes we use.

But we could be safer about it anyway.

Just after the ksize check, we load the minhash to check scaled and downsample if needed. We could combine these checks and use the minhash to determine molecule type (via matching on encodings::HashFunctions, i think)

@bluegenes bluegenes added the rust label Feb 22, 2024
@ctb
Copy link
Contributor

ctb commented Feb 22, 2024

First let me say that while I think this will have minor real-world impact, I can virtually guarantee it will be a source of at least one nasty time-consuming bug in the next year :). So I really don't like this kind of fuzziness! I appreciate the creation of this issue :).

So here we match exact ksize or k=k*3, regardless of ksize or molecule type. This is b/c we don't have access to is_protein or other relevant properties unless we load the minhash.

The molecule type is available in the manifest; can't we use that? Or is the problem that 'select' doesn't use the manifest/there may not be a manifest?

@bluegenes
Copy link
Contributor Author

bluegenes commented Feb 22, 2024

First let me say that while I think this will have minor real-world impact, I can virtually guarantee it will be a source of at least one nasty time-consuming bug in the next year :). So I really don't like this kind of fuzziness! I appreciate the creation of this issue :).

agreed! #3028 🙂

The molecule type is available in the manifest; can't we use that? Or is the problem that 'select' doesn't use the manifest/there may not be a manifest?

Yes - in manifest::select, which is used on collections, we select on moltype.
subset = collection.select(&selection) produces a pared-down collection with matching params (and scaled values that can be downsampled). After this (or without this, for non-collection sigs) we do select at the signature level.


Ok, this was actually simpler than I was thinking! Sigs have hash_function, which in the JSON signature is always 0.murmur64. However, mh.hash_function is generated using the molecule type in the JSON, so it contains this info and we can match on it.

bluegenes added a commit that referenced this issue Feb 23, 2024
Check signature `.hash_function` to determine whether or not we need to
make protein ksize corrections.

* Fixes #3026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants