-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Serial murmur3 hash with configurable seed #6781
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
@@ -642,12 +642,14 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition( | |||
std::unique_ptr<column> hash(table_view const& input, | |||
hash_id hash_function, | |||
std::vector<uint32_t> const& initial_hash, | |||
uint32_t seed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the initial_hash
not sufficient for the seed
? Can it be made to be sufficient? I'd like to avoid having both initial_hash
and seed
as it is confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it can definitely substitute. I'll include an assertion that it should be a single value in the vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see the difference. Previously initial_hash
requires a value per column. seed
is just a single value. Hm, maybe both are okay then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code change is simple, so it's really a question of what is most intuitive to a user. I had originally confused initial hash with seed values -- which is why I split it off -- but I'm also worried about adding too many arguments to the hash function. I think an argument of a single seed value is generic enough to include though.
Codecov Report
@@ Coverage Diff @@
## branch-0.17 #6781 +/- ##
===============================================
- Coverage 81.94% 81.94% -0.01%
===============================================
Files 96 96
Lines 16164 16166 +2
===============================================
+ Hits 13246 13247 +1
- Misses 2918 2919 +1
Continue to review full report at Codecov.
|
Instead of using [WIP], please create a draft PR and set the "in progress" label. When it's ready for review, add the corresponding label, and click "ready for review". This way reviewers are not notified until you are ready. |
* @param columns array of columns to hash, must have identical number of rows. | ||
* @return the new ColumnVector of 32 character hex strings representing each row's hash value. | ||
*/ | ||
public static ColumnVector serial32BitMurmurHash3(int seed, ColumnVector... columns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input parameters need to be ColumnView
instead of ColumnVector
.
return new ColumnVector(hash(columnViews, HashType.HASH_SERIAL_MURMUR3.getNativeId(), new int[0], seed)); | ||
} | ||
|
||
public static ColumnVector serial32BitMurmurHash3(ColumnVector... columns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we get java docs for this? Also just like for the above function we need ColumnView instead of ColumnVector as the input.
jobject j_object, | ||
jlongArray column_handles, | ||
jint hash_function_id) { | ||
jobject j_object, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indentation appears to be off.
cpp/src/hash/hashing.cu
Outdated
for (int col_index = 0; col_index < device_input.num_columns(); col_index++) { | ||
hash_result = cudf::type_dispatcher( | ||
device_input.column(col_index).type(), | ||
element_hasher_with_seed<MurmurHash3_32, true>{hash_result, hash_result}, | ||
device_input.column(col_index), | ||
row_index); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be better done as a thrust::reduce
(or transform_reduce
) with a thrust::seq
exec policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restructured to use thrust::tabulate
and a lambda with a thrust::reduce
if (has_nulls && col.is_null(row_index)) { return _null_hash; } | ||
|
||
return hash_function<T>{_seed}(col.element<T>(row_index)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a heads up on handling nested types here. One way to add support without requiring recursive examination here would be:
-
Preprocess any nested type column into a uint32_t column of pre-hashed values.
-
Substitute that preprocessed column into the table view in place of the nested type. You'd also have to know not to invoke hash_function<> here and just return the value directly.
-
Depending what it means to hash something like a List<List<Struct<int, float>, int, List>>>> etc you could then potentially generate this preprocessed column using the standard nested type technique of processing each level of nesting as a separate chunk of GPU work, and then recursing on the CPU. A good example of this being something like:
std::unique_ptr<column> concatenate(std::vector<column_view> const& columns,
The plausibility great depends on what it even means to hash data like this, of course.
Note : not suggesting that should go into this PR. Just how it might work when we get to it.
auto output_view = output->mutable_view(); | ||
|
||
if (has_nulls(input)) { | ||
thrust::tabulate(rmm::exec_policy(stream)->on(stream.value()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat!
Co-authored-by: GALI PREM SAGAR <[email protected]>
@rapidsai/ops can you explain why the automerge didn't work here? Why is github saying merging is blocked? I don't see any remaining required reviews. |
@harrism, the automerger did its job correctly here. It shouldn't merge any PRs that aren't all green. It looks like the PR still requires a |
* @return the new ColumnVector of 32 character hex strings representing each row's hash value. | ||
*/ | ||
public static ColumnVector serial32BitMurmurHash3(ColumnView columns[]) { | ||
return serial32BitMurmurHash3(0, columns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably a stupid question but why 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0 is the default seed that cuDF uses, had previously been hard coded a few layers deep with no configurability. Trying to retain and match cuDF existing behavior as consistently as possible.
try (ColumnVector v0 = ColumnVector.fromBoxedInts(0, 100, null, null, Integer.MIN_VALUE, null); | ||
ColumnVector v1 = ColumnVector.fromBoxedInts(0, null, -100, null, null, Integer.MAX_VALUE); | ||
ColumnVector result = ColumnVector.serial32BitMurmurHash3(42, new ColumnVector[]{v0, v1}); | ||
ColumnVector expected = ColumnVector.fromBoxedInts(59727262, 751823303, -1080202046, 42, 723455942, 133916647)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might have a good reason for hard-coding and I am not familiar with hashing in Java, instead of hard-coding, can we generate the expected values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could generate expected values, but the results should be static and never change. Also using org.apache.commons.codec.digest.MurmurHash3 would add other potential failure points (requires converting input to byte arrays, which would not necessarily be an apples to apples comparison).
NEGATIVE_DOUBLE_NAN_UPPER_RANGE, NEGATIVE_DOUBLE_NAN_LOWER_RANGE, | ||
Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY); | ||
ColumnVector result = ColumnVector.serial32BitMurmurHash3(new ColumnVector[]{v}); | ||
ColumnVector expected = ColumnVector.fromBoxedInts(1669671676, 0, -544903190, -1831674681, 150502665, 474144502, 1428788237, 1428788237, 1428788237, 1428788237, 420913893, 1915664072)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about hard-coding the expected values
Others have approved the PR some of them are questions and minor, please feel free to merge |
This PR closes #11296. While implementing Spark list hashing in #11292, I noticed that `HASH_SERIAL_MURMUR3` does not appear to be used except in tests. It is not exposed in Python. While it is exposed in the JNI bindings, it is not used by spark-rapids. I discussed this with @rwlee and it seems that this feature was added only for parallel design with the Spark serial hash implementation in #6781, which is superseded by #11292. We do not need to keep this vestigial feature. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Robert Maynard (https://github.com/robertmaynard) - https://github.com/brandon-b-miller - David Wendt (https://github.com/davidwendt) - Jason Lowe (https://github.com/jlowe) URL: #11383
Expand existing murmur3 hashing functionality to hash the row elements serially rather than using a merge function. Also enables configuring the hash seed and null hash value.