Change --protein k-size meaning #574
Comments
At the very least we should make sure it's properly documented ;)
Yes, this is a major point of confusion when I present. I agree that the k-mer size should be the number of amino acids.
Following up on this: I think that translated k=33 (e.g. from RNA sequence) and protein k=11 (from protein sequence) should be comparable with
Secondary issue: if you want to build both nucleotide and (translated) protein signatures with the same command or to the same file, the ksize must be divisible by 3 in order to work (e.g. k=31 will not work). Most of our utilities enable a multiple-sigs-per-file design (one sig file per FASTA); it'd be nice not to limit the ksizes here.
One solution could be to add a "protein-ksizes" command line argument / variable. To maintain current functionality, if protein-ksizes are not provided, continue calculating them as
Thoughts @ctb @taylorreiter @luizirber @olgabot ?
A proof-of-concept/workingish implementation of the idea is here. Ignore the added signature utility in
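The two points in the comment above (translated k=33 lining up with protein k=11, and the divisible-by-3 constraint) can be sketched in a few lines of Python. This is a rough illustration, not sourmash code: `resolve_ksizes` and its `protein_ksizes` parameter are hypothetical names, and the fallback of dividing the nucleotide ksize by 3 is an assumption inferred from the k=33 / k=11 example.

```python
def resolve_ksizes(ksizes, protein_ksizes=None):
    """Return (nucleotide_ksizes, protein_ksizes) for one sketching run.

    Hypothetical helper: if protein ksizes are given explicitly, the two
    lists are independent. Otherwise protein ksizes are derived from the
    nucleotide ksizes (assumed here to be ksize // 3), which only works
    when every nucleotide ksize is divisible by 3.
    """
    nucleotide_ksizes = list(ksizes)

    if protein_ksizes is not None:
        # Explicit protein ksizes: no divisibility constraint on ksizes.
        return nucleotide_ksizes, list(protein_ksizes)

    not_divisible = [k for k in nucleotide_ksizes if k % 3 != 0]
    if not_divisible:
        raise ValueError(
            f"ksizes {not_divisible} are not divisible by 3; "
            "pass protein_ksizes explicitly"
        )
    return nucleotide_ksizes, [k // 3 for k in nucleotide_ksizes]
```

With this shape, `resolve_ksizes([21, 33])` derives protein ksizes 7 and 11, while `resolve_ksizes([31], protein_ksizes=[11])` lets k=31 nucleotide signatures share a file with k=11 protein signatures.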
Hot takes --
I'd actually be a fan of splitting up the protein and DNA signature calculation; seems like the use cases are generally not that overlapping. #751 may also be relevant here.
re #751: that's what master is doing in Rust. is_protein/dayhoff/hp are disjoint, and there is an enum controlling which type it is (instead of string-typing it). For the Python layer they are still passed as booleans, but is
#1315 fixed the core issue here.
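As a rough illustration of the enum-versus-booleans point, here is a small Python sketch. `MoleculeType` and `molecule_type_from_flags` are hypothetical names made up for this example; they are not the Rust enum or the sourmash Python API. The sketch assumes the three flags are mutually exclusive, as the comment describes.

```python
from enum import Enum


class MoleculeType(Enum):
    """Hypothetical enum: exactly one molecule type per sketch."""
    DNA = "DNA"
    PROTEIN = "protein"
    DAYHOFF = "dayhoff"
    HP = "hp"


def molecule_type_from_flags(is_protein=False, dayhoff=False, hp=False):
    """Collapse three boolean flags into a single enum value.

    Raises ValueError if more than one flag is set, which is the
    inconsistent state that separate booleans allow and an enum does not.
    """
    flags = (
        (is_protein, MoleculeType.PROTEIN),
        (dayhoff, MoleculeType.DAYHOFF),
        (hp, MoleculeType.HP),
    )
    selected = [molecule for flag, molecule in flags if flag]
    if len(selected) > 1:
        names = [m.value for m in selected]
        raise ValueError(f"conflicting molecule types: {names}")
    # No flag set means a plain nucleotide sketch.
    return selected[0] if selected else MoleculeType.DNA
```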
Currently, when the --protein flag is used with a DNA sequence as input, the k-size represents the number of nucleotides that go into the hash, not the number of amino acids that are represented in the hash. This is not intuitive, so I suggest we change it so that the k-size represents the number of amino acids. @olgabot thoughts?
https://github.com/dib-lab/sourmash/blob/4aab62f65fb08044e9a43ec49331b65be8b5ae15/sourmash/kmer_min_hash.hh#L172
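To make the nucleotide-versus-amino-acid distinction concrete, the following Python sketch translates each nucleotide window of a DNA sequence and shows that a 33-nucleotide k-mer corresponds to 11 amino acids. It is a simplified illustration, not the code at the link above: it assumes the standard genetic code, translates the forward strand only, and does no hashing. The names `translate` and `translated_kmers` are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (third base varying fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def translated_kmers(dna, nucleotide_ksize):
    """Yield the amino acid k-mer for every nucleotide window of the given size.

    Each window of nucleotide_ksize bases becomes nucleotide_ksize // 3
    amino acids, so the size must be divisible by 3.
    """
    if nucleotide_ksize % 3 != 0:
        raise ValueError("nucleotide_ksize must be divisible by 3")
    for start in range(len(dna) - nucleotide_ksize + 1):
        yield translate(dna[start:start + nucleotide_ksize])


# 36 nucleotides, k-size 33 in nucleotides: 4 windows of 11 amino acids each.
sequence = "ATGGCC" * 6
kmers = list(translated_kmers(sequence, 33))
print(len(kmers), {len(kmer) for kmer in kmers})
```

Under the proposal in this issue, the same k-mers would be requested with a k-size of 11 rather than 33.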