-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store hash to kmer map #21
Comments
Kmer extraction from strings in the To avoid duplication in kmer extraction, it might be better to pull that function into this project and modify it to return a @ctb if that sounds sensible, would you mind stripping down the signature module from sourmash so it just has the bits needed for SeqToHashes? I think I can handle updating the output and storing the pairs in Oxli. |
What about having a second class that handles this, and wraps the first class invisibly? We could support the same API and provide the same tests for many situations. Then users could choose. It should (in my experience) be a completely optional and generally discouraged feature, because it will consume a lot of memory and people think they need it far more than they do 😜 I am somewhat against the code duplication involved in extracting code from sourmash, largely because that project is well tested and well maintained and I don't want to duplicate that work ;). Can always revisit in future if needs become divergent (although then I might advocate for adding necessary functionality to sourmash...) Note that sourmash also supports various translated and protein k-mer hashing, which we aren't making use of yet, but could. |
I like the idea of wrapping Agree that kmer tracking should be disabled by default. |
I'd like to suggest we implement that in Rust in oxli, and then I can backport it into sourmash. I think my rust fu is up to it, finally ;). Thoughts? |
rough draft in #40. Note that because of the DNA reverse-comp canonicalization performed by the hashing, we only have L-K+1 hashes and hence the current code only emits "forward" k-mers, from the top strand. There is probably value in supporting reverse complements too. The code to do so is in there, just commented out. The least confusing thing to do is to leave it as-is, but we might want to support other use cases. One option: support revcomp via a bool option, so that L-K+1 tuples of Another option: optionally or always return both forward and reverse complement k-mers, with the corresponding canonical hashes. Another option: support non-canonical k-mer calculation in this method. I suspect that will be confusing to users. Thoughts? |
I'd expect output to be (canonical kmer, hash). Use something like this: # Import oxli
from oxli import KmerCountTable
# New table
kct = KmerCountTable(ksize=3, raw=True)
# Check hash for AAA / TTT
kct.hash_kmer("TTT")
#>>>10679328328772601858
kct.hash_kmer("AAA")
#>>>10679328328772601858
# Count kmer
kct.count("TTT")
#>>> 1
# Lookup kmer using hash, return canonical kmer
kct.unhash(10679328328772601858)
#>>> "AAA" |
perfect, thanks! |
#40 is up for review - it doesn't actually implement the full hash to kmer map, but it provides the necessary code in a method I think next steps could/would include:
but, for now, I feel like #40 is a solid step forward and could usefully be merged. |
Cool, I will try make it more iteratory. Will also see about integrating it into the count and consume functions to populate a hash:kmer map if raw option is set when creating the KmerCountTable. |
A common use of kmer counters is to dump kmers and their counts to a text file that can be used in other tools. To do this we need to track hash to kmer pairs.
Suggested solution:
Init
KmerCountTable
objects with a second hashmap that maps hashes to their corresponding canonical kmer sequence.Add new hash:kmer pairs as kmers are consumed.
To save time and space we could disable this feature by default.
Alternative solutions:
Issues:
.count_hash
then the kmer sequence is unknown. (maybe just report unpaired hashed as warning when retrieving kmers?)@ctb thoughts?
The text was updated successfully, but these errors were encountered: