Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change --protein k-size meaning #574

Closed
taylorreiter opened this issue Dec 13, 2018 · 6 comments
Closed

Change --protein k-size meaning #574

taylorreiter opened this issue Dec 13, 2018 · 6 comments

Comments

@taylorreiter
Copy link
Contributor

taylorreiter commented Dec 13, 2018

Currently, when the --protein flag is used and a DNA sequence is used as input, the k-size represents the number of nucleotides that go into the hash, not the number of amino acids that are represented in the hash. This is not intuitive, and so I suggest we do it differently where the k-size represents the number of amino acids.

@olgabot thoughts?

https://github.com/dib-lab/sourmash/blob/4aab62f65fb08044e9a43ec49331b65be8b5ae15/sourmash/kmer_min_hash.hh#L172

@ctb
Copy link
Contributor

ctb commented Jan 13, 2019

At the very least we should make sure it's properly documented ;)

@olgabot
Copy link
Collaborator

olgabot commented Mar 26, 2019

Yes this is a major point of confusion when I present.. I agree that it should be the k-mer size of the number of amino acids

@bluegenes
Copy link
Contributor

Following up on this:

I think that translated k=33 (e.g. from rna sequence) and protein k=11 (from protein sequence) should be comparable with sourmash compare. I don't think we do any checking for functionally equivalent protein ksizes, unless I've missed it?

Secondary issue: If you want to build both nucleotide and (translated) protein signatures with the same command/to the same file, the ksize must be divisible by 3 in order to work (e.g. k 31 will not work). Most of our utilities enable multiple-sigs-per-file design (one sig file per fasta) - it'd be nice not to limit the ksizes here.

One solution could be to add a "protein-ksizes" command line argument / variable. To maintain current functionality, if protein-ksizes are not provided, continue calculating them as k/3, but calculate the k/3 number in the command-line wrapper, rather than the underlying minhash code. This enables the k/3 number (e.g. k=11) rather than k (e.g. k= 33) to be kept with the signature, facilitating protein comparisons.

Thoughts @ctb @taylorreiter @luizirber @olgabot ?

A proof-of-concept/workingish implementation of the idea is here. Ignore the added signature utility in sourmash/sig/__main__.py- that solved the minor issue without addressing the major issue!

@ctb
Copy link
Contributor

ctb commented Dec 18, 2019

Hot takes --

  • adding a protein-ksizes parameter sounds complicated...
  • I agree that translated k=33 (e.g. from rna sequence) and protein k=11 (from protein sequence) should be comparable!
  • error messages are good, we should add more checking and better error messages before doing anything else!

I'd actually be a fan of splitting up the protein and DNA signature calculation; seems like the use cases are generally not that overlapping. #751 may also be relevant here.

@luizirber
Copy link
Member

I'd actually be a fan of splitting up the protein and DNA signature calculation; seems like the use cases are generally not that overlapping. #751 may also be relevant here.

re #751: that's what master in doing in Rust. is_protein/dayhoff/hp are disjoint, and there is an enum controlling which type is is (instead of string-typing it). For the Python layer it is still passing as booleans, but is is_protein=True and hp=True, it will unset is_protein.

@ctb
Copy link
Contributor

ctb commented Mar 4, 2021

#1315 fixed the core issue here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants