Add minhash support for MurmurHash3_x64_128 #13796

Merged: 27 commits, Aug 21, 2023

@davidwendt commented on Aug 1, 2023

Description

Adds nvtext::minhash64 to libcudf, along with the Cython/Python changes needed to call it.
The new function calls MurmurHash3_x64_128 and uses only the first uint64 value of the 128-bit result.

The libcudf API was changed to remove the hash_id parameter because it was incompatible with the seed types.
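
As a rough illustration of what a 64-bit minhash computes, the host-side sketch below hashes every fixed-width window of a string and keeps the minimum. It is not the libcudf implementation: the names standin_hash64 and minhash64_reference are hypothetical, the hash is a seeded FNV-1a used as a stand-in for the first uint64 of MurmurHash3_x64_128, windows are measured in bytes rather than characters, and the handling of short and empty strings is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>

// Stand-in 64-bit hash (seeded FNV-1a). NOT MurmurHash3_x64_128; it only
// plays the role of "the first uint64 of the 128-bit hash" in this sketch.
inline std::uint64_t standin_hash64(std::string_view bytes, std::uint64_t seed)
{
  std::uint64_t h = 14695981039346656037ULL ^ seed;  // FNV offset basis mixed with the seed
  for (unsigned char c : bytes) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

// Hypothetical reference: minimum hash over every `width`-byte window of `str`.
// Assumptions of this sketch: width > 0, strings no longer than `width` are
// hashed whole, and an empty string yields 0.
inline std::uint64_t minhash64_reference(std::string_view str,
                                         std::uint64_t seed = 0,
                                         std::size_t width  = 4)
{
  if (str.empty()) { return 0; }
  if (str.size() <= width) { return standin_hash64(str, seed); }
  auto min_hash = std::numeric_limits<std::uint64_t>::max();
  for (std::size_t pos = 0; pos + width <= str.size(); ++pos) {
    min_hash = std::min(min_hash, standin_hash64(str.substr(pos, width), seed));
  }
  return min_hash;
}
```

With the default width of 4, a call such as minhash64_reference("abcde") compares the hashes of "abcd" and "bcde" and returns the smaller one.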

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the feature request, 2 - In Progress, libcudf, and non-breaking labels on Aug 1, 2023
@davidwendt self-assigned this on Aug 1, 2023
@github-actions added the Python label on Aug 1, 2023
@davidwendt changed the title to Add minhash support for MurmurHash3_x64_128 on Aug 2, 2023
@davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label on Aug 10, 2023
@davidwendt marked this pull request as ready for review on August 10, 2023 18:17
@davidwendt requested review from a team as code owners on August 10, 2023 18:17
@davidwendt added the breaking label and removed the non-breaking label on Aug 10, 2023

@bdice left a comment

A few questions / suggestions. I think this will be good after this round of feedback!

@davidwendt changed the title to Add minhash support for MurmurHash3_x64_128 on Aug 14, 2023
@davidwendt requested a review from bdice on August 15, 2023 14:02
@davidwendt changed the title to Add minhash support for MurmurHash3_x64_128 on Aug 18, 2023

@vyasr left a comment

Couple of questions, but looks good.

cudf::size_type width = 4,
cudf::hash_id hash_function = cudf::hash_id::HASH_MURMUR3,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
cudf::numeric_scalar<uint32_t> seed = 0,

Why switch to hardcoding the value here instead of using the constant?

Contributor Author

Because the type may be different. I'd rather be clear that the default seed is actually 0 and would not want to change that if the rest of libcudf decided on a different default. Hopefully that is ok.

*/
std::unique_ptr<cudf::column> minhash64(
cudf::strings_column_view const& input,
cudf::numeric_scalar<uint64_t> seed = 0,

Same question as before, why the different seed choice? Is it because of the potential for unsafe casts depending on the type of DEFAULT_HASH_SEED (currently OK since it's uint32_t)?

Contributor Author

Yes, the type is different here and I'd rather the 2 functions be consistent over any need to use the constant def.
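
The trade-off discussed in this thread can be sketched in a few lines. The example below is only an illustration under stated assumptions: scalar_sketch is a hypothetical stand-in for cudf::numeric_scalar, and the two functions are hypothetical placeholders for the 32-bit and 64-bit entry points. It shows that writing the default as a literal 0 keeps each default tied to its own seed type, independent of the type or value of any shared constant.

```cpp
#include <cstdint>

// Hypothetical stand-in for cudf::numeric_scalar<T>; it only holds a value.
template <typename T>
struct scalar_sketch {
  T value;
  constexpr scalar_sketch(T v) : value(v) {}  // implicit, so `= 0` works as a default argument
};

// Hypothetical placeholders for the two entry points. Each default is the
// literal 0, converted to that function's own seed type, so neither depends
// on a library-wide constant whose type or value might change later.
constexpr std::uint32_t seed_used_by_minhash(scalar_sketch<std::uint32_t> seed = 0)
{
  return seed.value;
}

constexpr std::uint64_t seed_used_by_minhash64(scalar_sketch<std::uint64_t> seed = 0)
{
  return seed.value;
}
```

A caller that needs a seed wider than 32 bits can pass it only to the 64-bit variant, for example seed_used_by_minhash64(std::uint64_t{1} << 40), while both variants still agree on 0 when no seed is given.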

@davidwendt

/merge

@rapids-bot merged commit 261bcb2 into rapidsai:branch-23.10 on Aug 21, 2023
@davidwendt deleted the fea-minhash64 branch on August 21, 2023 21:16
Labels: 3 - Ready for Review, breaking, feature request, libcudf, Python