[feat] Feat/hash algs #1

Merged
merged 25 commits into from
Sep 19, 2023

St4NNi
Owner

@St4NNi St4NNi commented Aug 19, 2023

Draft for hash algorithms

This PR adds different hash algorithms to the sketch command. This includes:
xxhash, ahash and legacy murmur3.

Ahash uses the ahash fallback algorithm and is only suitable for k < 32, since it internally relies on a u64 bit-kmer representation. Raw hashing with ahash is 10-30x faster than the existing hashing algorithms (0.3 ns vs. 3-8 ns per iteration).
The PR will be merged as soon as all new methods are sufficiently tested (and documented).
Caveat: This currently needs Rust nightly to build properly. The final release will try to remove this constraint.
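
For readers unfamiliar with the u64 bit-kmer representation mentioned above, here is a minimal sketch of the general idea: each base is packed into 2 bits, so a short kmer fits into a single 64-bit integer that can be hashed directly. The function name `encode_kmer` and the A/C/G/T to 0/1/2/3 mapping are assumptions for illustration and are not taken from the jam-rs source. The sketch accepts up to 32 bases because 32 × 2 bits fill a u64; the limit stated in this PR is k < 32.

```rust
/// Pack a DNA kmer into a u64 using 2 bits per base (illustrative mapping:
/// A=0, C=1, G=2, T=3). Returns None for an empty kmer, a kmer longer than
/// 32 bases, or any base outside ACGT.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut packed: u64 = 0;
    for &base in seq {
        let code = match base {
            b'A' | b'a' => 0u64,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None, // ambiguous bases such as N cannot be packed
        };
        // Shift the bases seen so far left by 2 bits and append the new one.
        packed = (packed << 2) | code;
    }
    Some(packed)
}
```

Hashing the packed integer instead of the ASCII bytes of the kmer means the hasher sees a different input, which is why the resulting hash values differ from tools that hash the ASCII sequence.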

Changes

  • Implemented many new features for the sketch command
  • Added sourmash-compatible outputs (import other kinds of fracminhash etc? sourmash-bio/sourmash#2710)
  • Added different hashing algorithms
  • Added bit-kmer support for k < 32 and regular canonical kmers for larger kmers
  • Updated README.md
  • Added basic tests for hashing performance and bit distributions, including a crude Kolmogorov–Smirnov test against a uniform distribution
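
As an illustration of the last item, a crude one-sample Kolmogorov–Smirnov check against a uniform distribution could look roughly like the sketch below. This is not the test code from this PR; the function names are hypothetical, and the 1.36 / sqrt(n) threshold is the usual large-sample approximation for a 5% significance level.

```rust
/// Map a 64-bit hash onto the unit interval by dividing by 2^64.
fn hash_to_unit(h: u64) -> f64 {
    h as f64 / 18446744073709551616.0
}

/// One-sample Kolmogorov-Smirnov statistic D against Uniform(0, 1):
/// the largest gap between the empirical CDF of the samples and the
/// uniform CDF F(x) = x. Sorts the samples in place.
fn ks_statistic_uniform(samples: &mut [f64]) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).expect("samples must not contain NaN"));
    let n = samples.len() as f64;
    let mut d: f64 = 0.0;
    for (i, &x) in samples.iter().enumerate() {
        let ecdf_before = i as f64 / n; // empirical CDF just below x
        let ecdf_at = (i as f64 + 1.0) / n; // empirical CDF at x
        d = d.max((x - ecdf_before).abs()).max((ecdf_at - x).abs());
    }
    d
}

/// Crude acceptance check using the asymptotic 5% critical value.
fn looks_uniform(d: f64, n: usize) -> bool {
    d < 1.36 / (n as f64).sqrt()
}
```

In a test, the hashes of many kmers would be mapped through `hash_to_unit`, and the resulting statistic compared against the threshold. A hash function that clusters its outputs in part of the 64-bit range produces a large D and fails the check.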

Sourmash outputs

You can use jam-rs to create sourmash-compatible outputs. A basic example looks like this:

$ jam sketch <INPUT> -o <OUTPUT> -k 50 --fscale <SCALE> --format sourmash --algorithm [default, ahash, xxhash, murmur3]

Please be aware that for k < 32 the output will differ from sourmash-generated output, since jam-rs internally applies a bit-kmer encoding before hashing instead of hashing the regular ASCII encoding.

  • The default algorithm is ahash for k < 32 and xxhash for larger kmers
  • fscale is the counterpart of sourmash's scaled parameter

So:

$ sourmash sketch dna -p k=50,scaled=1000 <INPUT> -o <OUTPUT>

and

$ jam sketch <INPUT> --fscale 1000 -k 50 --format sourmash --algorithm murmur3 -o <OUTPUT>

will produce similar results.
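
To make the role of the scale value in the two commands above more concrete, here is a minimal sketch of the FracMinHash idea: only hashes below a threshold of roughly 2^64 / scale are kept, so a scale of 1000 retains about one in a thousand hashes. The function names are hypothetical, and the exact threshold rounding used by jam-rs or sourmash may differ from this sketch.

```rust
/// Largest hash value that is kept for a given scale (assumed rounding:
/// integer division of u64::MAX by the scale).
fn max_hash_for_scale(scale: u64) -> u64 {
    assert!(scale > 0, "scale must be positive");
    u64::MAX / scale
}

/// Keep only the hashes at or below the threshold, sorted and deduplicated.
fn fracminhash(hashes: impl IntoIterator<Item = u64>, scale: u64) -> Vec<u64> {
    let threshold = max_hash_for_scale(scale);
    let mut kept: Vec<u64> = hashes
        .into_iter()
        .filter(|&h| h <= threshold)
        .collect();
    kept.sort_unstable();
    kept.dedup();
    kept
}
```

Two sketches are only comparable if they were built with the same hash function and the same scale, which is why the jam command above selects murmur3 when the output is meant to be compared with sourmash.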

Feedback

Please feel free to give feedback on additional updates or changes. This is only the first step in improving the sketching functionality. Improving search, especially for smaller fragmented sequences in larger sets, will come next.

@St4NNi St4NNi marked this pull request as draft August 25, 2023 21:41
@St4NNi St4NNi marked this pull request as ready for review September 19, 2023 19:29
@St4NNi
Owner Author

St4NNi commented Sep 19, 2023

This is still WIP, but I have decided to build on this incrementally to keep PRs reviewable!
Stay tuned, more updates will follow soon!

@St4NNi St4NNi merged commit 310e3de into main Sep 19, 2023
2 checks passed
@St4NNi St4NNi deleted the feat/hash-algs branch September 19, 2023 19:30