Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function #12875

Merged
merged 8 commits into from
Mar 13, 2023

Conversation

davidwendt
Copy link
Contributor

Description

Fixes passing seed parameter to MurmurHash3_32 in cudf::hash() function. The MurmurHash3_32 algorithm takes a seed value which helps provide variation in hash results. The seed parameter was not being passed to the algorithm through the element_hasher but only used in the hash_combine functions. This resulted in only small variations in the hash results. The following example illustrates:

{
  cudf::test::strings_column_wrapper const strings_col({"hello world"});
  auto const input1 = cudf::table_view({strings_col});
  for (int i = 0; i < 5; ++i) {
    auto output1 = cudf::hash(input1, cudf::hash_id::HASH_MURMUR3, i);
    std::cout << i << ": ";
    cudf::test::print(output1->view());
  }
}

The output for the 5 hashes with associated seed values:

seed hash
0    4241098952
1    4241099017
2    4241099082
3    4241099147
4    4241099213

A MurmurHash3_32 algorithm would produce the following variations if given the seed values:

seed hash
0    1586663183
1    1128525090
2    3382554948
3    1761283998
4    1862001904

This PR passes the seed value through to the internal algorithm. Although the results do not exactly match the standard MurmurHash3_32 the variations are much improved:

seed hash
0    4241098952
1    3782960922
2    1742023551
3    120752660
4    221470638

This variation is important for some machine learning operations.

Some gtests have hardcoded the hash values and were updated to match the new results.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt self-assigned this Mar 2, 2023
@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 2, 2023
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Thank you for the fix, and apologies that this slipped past me when writing/reviewing this previously.

@ayushdg
Copy link
Member

ayushdg commented Mar 9, 2023

Can confirm that with this pr, I'm seeing a much larger variation and the results are a lot more in line with what I needed for my use case.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Mar 10, 2023
@davidwendt davidwendt marked this pull request as ready for review March 10, 2023 13:26
@davidwendt davidwendt requested a review from a team as a code owner March 10, 2023 13:26
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving (previously draft).

@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 7bc4a7e into rapidsai:branch-23.04 Mar 13, 2023
@davidwendt davidwendt deleted the hash-string-variance branch March 13, 2023 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants