-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add minhash support for MurmurHash3_x64_128 #13796
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions / suggestions. I think this will be good after this round of feedback!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of questions, but looks good.
cudf::size_type width = 4, | ||
cudf::hash_id hash_function = cudf::hash_id::HASH_MURMUR3, | ||
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); | ||
cudf::numeric_scalar<uint32_t> seed = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why switch to hardcoding the value here instead of using the constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the type may be different. I'd rather be clear that the default seed is actually 0 and would not want to change that if the rest of libcudf decided on a different default. Hopefully that is ok.
*/ | ||
std::unique_ptr<cudf::column> minhash64( | ||
cudf::strings_column_view const& input, | ||
cudf::numeric_scalar<uint64_t> seed = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as before, why the different seed choice? Is it because of the potential for unsafe casts depending on the type of DEFAULT_HASH_SEED
(currently OK since it's uint32_t
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the type is different here and I'd rather the 2 functions be consistent over any need to use the constant def.
/merge |
Description
Adds
nvtext::minhash64
to libcudf and the Cython/Python changes to call it.The
MurmurHash3_x64_128
is called and only the firstuint64
value is used.The libcudf API was changed to remove the
hash_id
parameter since it was incompatible with the seed types.Checklist