Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ntHash is not matching the article description? #7

Closed
luizirber opened this issue Sep 11, 2018 · 19 comments
Closed

ntHash is not matching the article description? #7

luizirber opened this issue Sep 11, 2018 · 19 comments
Assignees

Comments

@luizirber
Copy link

Hi!

Last week I started implementing a Rust version of ntHash, and I can't make it match the output of ntHash (in this repo). I ended up following this answer from @karel-brinda on bfx stack overflow to implement a Python version too, and still couldn't match the output.

Then I checked the supplementary materials, and based on the NT64 function defined there implemented NTF64, NTR64 and NTC64... and couldn't match the output. Either I'm missing something obvious, or there is a bug in the optimized version in this repo...

I prepared a repo with a Jupyter notebook analysing some test cases, and you can Binder to have an interactive version to follow. I also added an example of using autocheck and catch to write an oracle test for an optimized and an unoptimized version of ntHash, I can open a PR to add it to this repo and set up Travis to run it.

@mohamadi
Copy link
Collaborator

Hi @luizirber, in the latest version pushed 6 days ago, we have changed the way we're computing the base hash value to have periodicity greater than 64. I think this is why you're not getting the same hash results as what we had in our older versions. Could you please let me know which commit you have used in your experiments?

@mohamadi mohamadi self-assigned this Sep 11, 2018
@luizirber
Copy link
Author

commit 57af16a, the one from 6 days ago

@luizirber
Copy link
Author

Indeed, using ee4e9c3 works!

@luizirber
Copy link
Author

luizirber commented Sep 11, 2018

I will stick with 1.0.4 as a reference. But shouldn't it all keep working as before on master?

@mohamadi
Copy link
Collaborator

That's why you're getting different results. You're using the latest version where we're using a new approach for computing the hash values. Since in the past versions we had the issue of periodicity of 64 which is extended to 1023 using our new idea. Are you familiar with the periodicity problem for rolling hash?

@mohamadi
Copy link
Collaborator

mohamadi commented Sep 11, 2018

@luizirber In future versions I'll make the periodicity to 65K or even 4Billion. This makes the computation much more accurate. Which implementation of rolling hash you're using in khhmer now?

@luizirber
Copy link
Author

But even with a 2-mer the hashes will be different? Are you planning to make a new version that is backward-incompatible?

I was trying it out for sourmash, but @standage was investigating ntHash to use it in khmer

@mohamadi
Copy link
Collaborator

@luizirber sorry, I didn't get your point. you mean the new version hash value vs older version are not the same? or there is incompatibility between hash values in the new version?

@luizirber
Copy link
Author

the new version hash (master) is different than the older version (1.0.4, for example).

@mohamadi
Copy link
Collaborator

It should be different, because the base hash function is different now. Let me give detailed explanation.

@mohamadi
Copy link
Collaborator

mohamadi commented Sep 12, 2018

In the old version the hash value for a given k-mer s was computed using:

hVal = rol(s[0],k-1) ^ rol(s[1], k-2) ^ ... ^ s[k-1]

As an example suppose k = 65, then:

hval = rol(s[0],0) ^ rol(s[1],63) ^ rol(s[2],62) ^ ... ^ rol(s[64], 0)

The bold part is the periodicity issue which results in more collisions when k>64 since for positions where k%64 are the same we are cancelling out the corresponding bases if they are the same or we have the same hash values for palindromic combination, i.e.:

roll_hash(ACGGTGGCTCATGCCTGTAATCCTAGCACTTTGGGAAGCTGAGGTGAGTGGACTGCCTGAGCTCC)
=
roll_hash(CCGGTGGCTCATGCCTGTAATCCTAGCACTTTGGGAAGCTGAGGTGAGTGGACTGCCTGAGCTCA)

This will increase the collision rate. we call this the periodicity issue and when k > 64 we see a slight increase in false positive rates.

Now, in the new version we addressed this by changing the definition of rol and ror to new functions.

@mohamadi
Copy link
Collaborator

@luizirber In you current khmer check to see if you have this issue which I think it should be the case for khmer as well since you're using https://github.com/lemire/rollinghashcpp

@luizirber
Copy link
Author

I see, thanks!

Do you plan to name the new scheme ntHash2, or something like this? It will be very confusing if both are ntHash...

@mohamadi
Copy link
Collaborator

mohamadi commented Sep 12, 2018

@luizirber As I thought this is an issue if you're using rolling hash, rollinghashcpp, in khmer:

#include "cyclichash.h"

using namespace std;

int main(int argc, char** argv) {
    std::string seq1 = "AAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAG";
    std::string seq2 = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAA";
    const uint n(65);
    const uint L(64);
    CyclicHash<uint64> hf(n,L );
    cerr << hf.hash(seq1) << endl;
    hf.reset();
    cerr << hf.hash(seq2) << endl;
    return 0;
}

Output:

[hmohamadi@hpcg01 rollinghashcpp]$ g++ -o rolltest rolltest.cpp 
[hmohamadi@hpcg01 rollinghashcpp]$ ./rolltest 
15732650108616394752
15732650108616394752
[hmohamadi@hpcg01 rollinghashcpp]$

@mohamadi
Copy link
Collaborator

@luizirber Is it better to name it ntHash2? or the new commit would be OK if people in the community want to use it?

@luizirber
Copy link
Author

luizirber commented Sep 12, 2018

Speaking from experience, people will be confused if it has the same name but you can get different hash values depending on what day you downloaded it =]

For example, with sourmash I'm building sketches for public genomic databases, and I can't compare them if the hashes are not generated in the same way. This might not be an issue for people that discard the hashed data after analysis, but for me it's essential.

@mohamadi
Copy link
Collaborator

@luizirber sure let me update README and give detailed explanation why the rolling hash functions have the problem of periodicity and how we can resolve it and have a new version.

In sourmash, do you have an idea about the max sequence length you're going to hash?

@mohamadi
Copy link
Collaborator

Hi @luizirber, I just released the ntHash version 2 highlighting in the release note the difference between outputs.

@luizirber
Copy link
Author

For sourmash we don't see going further than 64-mers (more detailed explanation), so ntHash 1 would be fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants