Change `--protein` k-size meaning #574

taylorreiter · 2018-12-13T18:07:40Z

Currently, when the `--protein` flag is used with a DNA sequence as input, the k-size represents the number of nucleotides that go into the hash, not the number of amino acids represented in the hash. This is not intuitive, so I suggest changing it so that the k-size represents the number of amino acids.

@olgabot thoughts?

https://github.com/dib-lab/sourmash/blob/4aab62f65fb08044e9a43ec49331b65be8b5ae15/sourmash/kmer_min_hash.hh#L172
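
To make the two meanings concrete, here is a minimal sketch in plain Python. It is not sourmash's implementation: `codons_per_window` is a hypothetical helper, and it ignores reading frames and reverse complements. It only shows that a window of k nucleotides covers k/3 codons, which is the number of amino acids that end up in the hash.

```python
def codons_per_window(dna, ksize):
    """Slide a window of `ksize` nucleotides over `dna` and split each
    window into codons (hypothetical helper, for illustration only)."""
    windows = []
    for start in range(len(dna) - ksize + 1):
        kmer = dna[start:start + ksize]
        # Each group of three nucleotides corresponds to one amino acid.
        codons = [kmer[i:i + 3] for i in range(0, ksize, 3)]
        windows.append(codons)
    return windows


# A k-size of 9 counted in nucleotides gives 3 amino acids per k-mer.
windows = codons_per_window("ATGGCCAAGTTT", 9)
amino_acids_per_kmer = len(windows[0])
```

Under the proposal in this issue, the same k-mers would be requested with a k-size of 3 rather than 9.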

ctb · 2019-01-13T14:21:09Z

At the very least we should make sure it's properly documented ;)

olgabot · 2019-03-26T01:03:46Z

Yes, this is a major point of confusion when I present. I agree that the k-size should be the number of amino acids.

bluegenes · 2019-12-18T06:31:22Z

Following up on this:

I think that translated k=33 (e.g. from RNA sequence) and protein k=11 (from protein sequence) should be comparable with `sourmash compare`. I don't think we do any checking for functionally equivalent protein ksizes, unless I've missed it?

Secondary issue: If you want to build both nucleotide and (translated) protein signatures with the same command/to the same file, the ksize must be divisible by 3 in order to work (e.g. k 31 will not work). Most of our utilities enable multiple-sigs-per-file design (one sig file per fasta) - it'd be nice not to limit the ksizes here.

One solution could be to add a "protein-ksizes" command line argument / variable. To maintain current functionality, if protein-ksizes are not provided, continue calculating them as k/3, but calculate the k/3 number in the command-line wrapper rather than in the underlying minhash code. This enables the k/3 number (e.g. k=11) rather than k (e.g. k=33) to be kept with the signature, facilitating protein comparisons.
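
A rough sketch of what that could look like in the command-line wrapper, using `argparse`. Everything here is an assumption for illustration: the option names, the defaults, and the `resolve_protein_ksizes` helper are hypothetical and are not the existing sourmash CLI. Skipping nucleotide ksizes that are not divisible by 3 is also just one possible choice.

```python
import argparse


def parse_ksizes(text):
    """Parse a comma-separated list such as '21,33' into [21, 33]."""
    return [int(k) for k in text.split(",")]


def build_parser():
    # Hypothetical options; not the real sourmash argument parser.
    parser = argparse.ArgumentParser(description="protein-ksizes sketch")
    parser.add_argument("-k", "--ksizes", type=parse_ksizes, default=[21, 33])
    parser.add_argument("--protein-ksizes", type=parse_ksizes, default=None)
    return parser


def resolve_protein_ksizes(args):
    """Return protein ksizes in amino acids, computed in the wrapper."""
    if args.protein_ksizes is not None:
        # Explicit values win, so nucleotide ksizes are not constrained.
        return args.protein_ksizes
    # Fallback: derive k/3, skipping ksizes that are not divisible by 3
    # (an assumption of this sketch).
    return [k // 3 for k in args.ksizes if k % 3 == 0]


parser = build_parser()
derived = resolve_protein_ksizes(parser.parse_args(["-k", "31,33"]))
explicit_args = parser.parse_args(["-k", "31", "--protein-ksizes", "11"])
explicit = resolve_protein_ksizes(explicit_args)
```

With this shape, `-k 31 --protein-ksizes 11` would allow a nucleotide k=31 signature and a protein k=11 signature to be requested in the same command.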

Thoughts @ctb @taylorreiter @luizirber @olgabot ?

A proof-of-concept/workingish implementation of the idea is here. Ignore the added signature utility in `sourmash/sig/__main__.py`; that solved the minor issue without addressing the major issue!

ctb · 2019-12-18T14:34:53Z

Hot takes --

- adding a protein-ksizes parameter sounds complicated...
- I agree that translated k=33 (e.g. from RNA sequence) and protein k=11 (from protein sequence) should be comparable!
- error messages are good, we should add more checking and better error messages before doing anything else!
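
As a sketch of the kind of checking and error messages meant above (hypothetical helpers in plain Python, not sourmash code), both ksizes can be normalized to amino acids before comparing, with an explicit message when they do not line up:

```python
def amino_acid_ksize(ksize, translated):
    """Return the k-mer length in amino acids.

    `translated=True` means `ksize` counts nucleotides of DNA/RNA input
    that gets translated; otherwise it already counts amino acids.
    """
    if not translated:
        return ksize
    if ksize % 3 != 0:
        raise ValueError(
            f"translated ksize {ksize} is not divisible by 3; "
            "it cannot be expressed as a whole number of amino acids"
        )
    return ksize // 3


def check_comparable(ksize_a, translated_a, ksize_b, translated_b):
    """Raise a descriptive error unless both ksizes match in amino acids."""
    aa_a = amino_acid_ksize(ksize_a, translated_a)
    aa_b = amino_acid_ksize(ksize_b, translated_b)
    if aa_a != aa_b:
        raise ValueError(
            f"incompatible protein ksizes: {aa_a} vs {aa_b} amino acids"
        )
    return aa_a


# Translated k=33 and protein k=11 describe the same amino acid k-mers.
shared = check_comparable(33, True, 11, False)
```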

I'd actually be a fan of splitting up the protein and DNA signature calculation; seems like the use cases are generally not that overlapping. #751 may also be relevant here.

luizirber · 2019-12-19T01:10:58Z

> I'd actually be a fan of splitting up the protein and DNA signature calculation; seems like the use cases are generally not that overlapping. #751 may also be relevant here.

re #751: that's what master is doing in Rust. is_protein/dayhoff/hp are disjoint, and there is an enum controlling which type it is (instead of string-typing it). The Python layer still passes them as booleans, but if is_protein=True and hp=True, it will unset is_protein.
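
For readers who have not looked at that code, the idea can be sketched in Python roughly as follows. The names `HashFunction` and `resolve_hash_function` are made up for this illustration and are not the actual Rust or Python API. The only behaviour taken from the description above is that `hp=True` overrides `is_protein=True`; the ordering between `hp` and `dayhoff` is an assumption.

```python
from enum import Enum


class HashFunction(Enum):
    """Mutually exclusive encodings (hypothetical names)."""
    DNA = "dna"
    PROTEIN = "protein"
    DAYHOFF = "dayhoff"
    HP = "hp"


def resolve_hash_function(is_protein=False, dayhoff=False, hp=False):
    """Collapse boolean flags into a single enum value."""
    if hp:
        # hp wins over is_protein, i.e. is_protein is effectively unset.
        return HashFunction.HP
    if dayhoff:
        # Assumed precedence: dayhoff also wins over is_protein.
        return HashFunction.DAYHOFF
    if is_protein:
        return HashFunction.PROTEIN
    return HashFunction.DNA


chosen = resolve_hash_function(is_protein=True, hp=True)
```

Because the result is a single enum value, a signature cannot be both protein and hp at the same time, which is the disjointness described above.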

ctb · 2021-03-04T14:21:25Z

#1315 fixed the core issue here.

ctb mentioned this issue May 25, 2020

new behavior for protein k-mer size calculations - gathering the issues together. #999

Closed

ctb closed this as completed Mar 4, 2021

ctb mentioned this issue May 15, 2021

summary: further improvements to protein handling in sourmash #1525

Open
