Store hash to kmer map #21

Adamtaranto · 2024-09-10T09:36:57Z

A common use of kmer counters is to dump kmers and their counts to a text file that can be used in other tools. To do this we need to track hash to kmer pairs.

Suggested solution:

Init KmerCountTable objects with a second hashmap that maps hashes to their corresponding canonical kmer sequence.
Add new hash:kmer pairs as kmers are consumed.

To save time and space we could disable this feature by default.

# Enable kmer tracking
new_table = oxli.KmerCountTable(ksize=3, raw=True)

Alternative solutions:

Could we just use the canonical kmers as keys?

Issues:

If a kmer is added with .count_hash then the kmer sequence is unknown. (maybe just report unpaired hashed as warning when retrieving kmers?)
If adding KmerCountTables together we would need to also add hash:kmer pairs to the new object (or just add all new hashes into the output table?)
Tables might get large if consuming large noisy datasets with a high kmer length. (Make kmer tracking optional?)

@ctb thoughts?

The text was updated successfully, but these errors were encountered:

Adamtaranto · 2024-09-11T04:46:20Z

Kmer extraction from strings in the Consume and count functions is handled by sourmash::signature::SeqToHashes.

To avoid duplication in kmer extraction, it might be better to pull that function into this project and modify it to return a (kmer,hash) pair. Then we can optionally store the hash:kmer mapping as a second table in the KmerCountTable class.

@ctb if that sounds sensible, would you mind stripping down the signature module from sourmash so it just has the bits needed for SeqToHashes? I think I can handle updating the output and storing the pairs in Oxli.

ctb · 2024-09-13T12:23:01Z

What about having a second class that handles this, and wraps the first class invisibly? We could support the same API and provide the same tests for many situations. Then users could choose.

It should (in my experience) be a completely optional and generally discouraged feature, because it will consume a lot of memory and people think they need it far more than they do 😜

I am somewhat against the code duplication involved in extracting code from sourmash, largely because that project is well tested and well maintained and I don't want to duplicate that work ;). Can always revisit in future if needs become divergent (although then I might advocate for adding necessary functionality to sourmash...)

Note that sourmash also supports various translated and protein k-mer hashing, which we aren't making use of yet, but could.

Adamtaranto · 2024-09-13T13:29:22Z

I like the idea of wrapping SeqToHashes in a second class and tacking the kmer on to the end. I can see how to do that for single kmers (i.e. in hash_kmer()) but not sure how it would work when processing long sequences with consume().

Agree that kmer tracking should be disabled by default.

ctb · 2024-09-18T13:22:08Z

kmers_and_hashes here, https://github.com/sourmash-bio/sourmash/blob/26b50f3e3566006fd6356a4f8b4d47c5e381aeec/src/sourmash/minhash.py#L393, shows how to build the associated set of k-mers from a sequence.

I'd like to suggest we implement that in Rust in oxli, and then I can backport it into sourmash. I think my rust fu is up to it, finally ;). Thoughts?

ctb · 2024-09-18T15:43:37Z

rough draft in #40.

Note that because of the DNA reverse-comp canonicalization performed by the hashing, we only have L-K+1 hashes and hence the current code only emits "forward" k-mers, from the top strand. There is probably value in supporting reverse complements too. The code to do so is in there, just commented out.

The least confusing thing to do is to leave it as-is, but we might want to support other use cases.

One option: support revcomp via a bool option, so that L-K+1 tuples of (kmer_rc, hashval) are returned.

Another option: optionally or always return both forward and reverse complement k-mers, with the corresponding canonical hashes.

Another option: support non-canonical k-mer calculation in this method. I suspect that will be confusing to users.

Thoughts?

Adamtaranto · 2024-09-19T04:02:16Z

I'd expect output to be (canonical kmer, hash).

Use something like this:

# Import oxli
from oxli import KmerCountTable

# New table
kct = KmerCountTable(ksize=3, raw=True)

# Check hash for  AAA / TTT
kct.hash_kmer("TTT")
#>>>10679328328772601858
kct.hash_kmer("AAA")
#>>>10679328328772601858

# Count kmer
kct.count("TTT")
#>>> 1

# Lookup kmer using hash, return canonical kmer
kct.unhash(10679328328772601858)
#>>> "AAA"

ctb · 2024-09-19T13:10:57Z

perfect, thanks!

ctb · 2024-09-25T13:50:54Z

#40 is up for review - it doesn't actually implement the full hash to kmer map, but it provides the necessary code in a method kmers_and_hashes.

I think next steps could/would include:

turn the Rust code into an iterator so it can be reused for this purpose;
making the code faster by only computing revcomp once;

but, for now, I feel like #40 is a solid step forward and could usefully be merged.

Adamtaranto · 2024-09-25T23:50:04Z

Cool, I will try make it more iteratory.

Will also see about integrating it into the count and consume functions to populate a hash:kmer map if raw option is set when creating the KmerCountTable.

Adamtaranto added the enhancement New feature or request label Sep 10, 2024

Adamtaranto mentioned this issue Sep 15, 2024

Add .dump() method #24

Closed

ctb mentioned this issue Sep 18, 2024

MRG: add kmers_and_hashes method to get canonical kmers + hashes #40

Merged

Adamtaranto self-assigned this Sep 19, 2024

Adamtaranto assigned ctb Sep 20, 2024

Adamtaranto mentioned this issue Oct 12, 2024

MRG: Make kmers_and_hashes iterable #70

Merged

8 tasks

Adamtaranto closed this as completed in #70 Oct 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Store hash to kmer map #21

Store hash to kmer map #21

Adamtaranto commented Sep 10, 2024

Adamtaranto commented Sep 11, 2024

ctb commented Sep 13, 2024

Adamtaranto commented Sep 13, 2024

ctb commented Sep 18, 2024

ctb commented Sep 18, 2024

Adamtaranto commented Sep 19, 2024

ctb commented Sep 19, 2024

ctb commented Sep 25, 2024

Adamtaranto commented Sep 25, 2024

Store hash to kmer map #21

Store hash to kmer map #21

Comments

Adamtaranto commented Sep 10, 2024

Adamtaranto commented Sep 11, 2024

ctb commented Sep 13, 2024

Adamtaranto commented Sep 13, 2024

ctb commented Sep 18, 2024

ctb commented Sep 18, 2024

Adamtaranto commented Sep 19, 2024

ctb commented Sep 19, 2024

ctb commented Sep 25, 2024

Adamtaranto commented Sep 25, 2024