[REVIEW] Serial murmur3 hash with configurable seed #6781

Merged: 15 commits into rapidsai:branch-0.17, Dec 3, 2020

rwlee
Contributor

@rwlee rwlee commented Nov 17, 2020

Expand existing murmur3 hashing functionality to hash the row elements serially rather than using a merge function. Also enables configuring the hash seed and null hash value.
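As a rough illustration of what "serial" means here, the sketch below chains the hash across a row: each element is hashed with the running result as its seed, and a null element leaves the running result unchanged (consistent with the Java test later in this thread, where an all-null row with seed 42 hashes to 42). This is a host-side C++ sketch, not the libcudf kernel. `nullable_int`, `int_column`, and `serial_row_hash` are hypothetical names, and only 4-byte integer keys are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

inline std::uint32_t rotl32(std::uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// Standard MurmurHash3 (x86, 32-bit), specialised for a single 4-byte key.
inline std::uint32_t murmur3_32(std::uint32_t key, std::uint32_t seed)
{
  std::uint32_t k = key * 0xcc9e2d51u;
  k = rotl32(k, 15);
  k *= 0x1b873593u;

  std::uint32_t h = seed ^ k;
  h = rotl32(h, 13);
  h = h * 5u + 0xe6546b64u;

  h ^= 4u;  // key length in bytes
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Hypothetical stand-ins for a nullable INT32 column (not cudf types).
struct nullable_int {
  bool valid;
  std::int32_t value;
};
using int_column = std::vector<nullable_int>;

// Hash one row serially: the result for column i seeds the hash of column i + 1.
// Assumption: a null element passes the running hash through unchanged.
inline std::uint32_t serial_row_hash(std::vector<int_column> const& columns,
                                     std::size_t row,
                                     std::uint32_t seed)
{
  std::uint32_t h = seed;
  for (auto const& col : columns) {
    assert(row < col.size());
    if (col[row].valid) { h = murmur3_32(static_cast<std::uint32_t>(col[row].value), h); }
  }
  return h;
}
```

With a single non-null element the result is plain MurmurHash3_32 of that element; for example, the 4-byte value 0 with seed 0 hashes to 0x2362F9DE.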

@rwlee rwlee added libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Nov 17, 2020
@rwlee rwlee requested a review from a team as a code owner November 17, 2020 00:47
@rwlee rwlee requested review from harrism and nvdbaranec November 17, 2020 00:47
@GPUtester
Collaborator

Please update the changelog in order to start CI tests.


@@ -642,12 +642,14 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
 std::unique_ptr<column> hash(table_view const& input,
                              hash_id hash_function,
                              std::vector<uint32_t> const& initial_hash,
                              uint32_t seed,
Contributor

Is the initial_hash not sufficient for the seed? Can it be made to be sufficient? I'd like to avoid having both initial_hash and seed as it is confusing.

Contributor Author

Yeah, it can definitely substitute. I'll include an assertion that it should be a single value in the vector.

Contributor

Oh, I see the difference. Previously, initial_hash required a value per column, whereas seed is just a single value. Hm, maybe both are okay then.

Contributor Author

The code change is simple, so it's really a question of what is most intuitive to a user. I had originally confused initial hash with seed values -- which is why I split it off -- but I'm also worried about adding too many arguments to the hash function. I think an argument of a single seed value is generic enough to include though.
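To make the shape difference discussed in this thread concrete, here is a minimal sketch: `initial_hash` is either empty or carries one value per column, while `seed` is always a single value. `hash_arguments` and `initial_hash_matches` are hypothetical names used only for illustration; they are not part of the libcudf API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bundle of the two arguments under discussion.
struct hash_arguments {
  std::vector<std::uint32_t> initial_hash;  // empty, or one value per input column
  std::uint32_t seed;                       // a single value for the whole table
};

// True when initial_hash has a shape that fits a table with num_columns columns.
inline bool initial_hash_matches(hash_arguments const& args, std::size_t num_columns)
{
  return args.initial_hash.empty() || args.initial_hash.size() == num_columns;
}
```

A caller who only wants to seed the hash would pass an empty `initial_hash` and a non-default `seed`, which is why the two are easy to confuse but not interchangeable.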

@codecov

codecov bot commented Nov 17, 2020

Codecov Report

Merging #6781 (5593718) into branch-0.17 (e1e3047) will decrease coverage by 0.00%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.17    #6781      +/-   ##
===============================================
- Coverage        81.94%   81.94%   -0.01%     
===============================================
  Files               96       96              
  Lines            16164    16166       +2     
===============================================
+ Hits             13246    13247       +1     
- Misses            2918     2919       +1     
Impacted Files                              Coverage Δ
python/cudf/cudf/core/column/datetime.py    88.44% <0.00%> (-0.51%) ⬇️
python/cudf/cudf/core/column/string.py      86.64% <0.00%> (-0.18%) ⬇️
python/cudf/cudf/core/tools/datetimes.py    81.60% <0.00%> (-0.15%) ⬇️
python/cudf/cudf/core/series.py             91.29% <0.00%> (-0.08%) ⬇️
python/cudf/cudf/core/column/timedelta.py   89.45% <0.00%> (-0.05%) ⬇️
python/cudf/cudf/core/dataframe.py          90.99% <0.00%> (-0.02%) ⬇️
python/cudf/cudf/core/column/numerical.py   94.50% <0.00%> (ø)
python/cudf/cudf/core/frame.py              90.06% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/index.py              93.13% <0.00%> (+0.24%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e1e3047...5593718.

@harrism
Member

harrism commented Nov 18, 2020

Instead of using [WIP], please create a draft PR and set the "in progress" label. When it's ready for review, add the corresponding label, and click "ready for review". This way reviewers are not notified until you are ready.

@rwlee rwlee marked this pull request as draft November 18, 2020 09:21
@rwlee rwlee added the 2 - In Progress Currently a work in progress label Nov 18, 2020
@rwlee rwlee marked this pull request as ready for review November 20, 2020 19:46
@rwlee rwlee requested review from a team as code owners November 20, 2020 19:46
@rwlee rwlee added the 3 - Ready for Review Ready for review by team label Nov 20, 2020
@rwlee rwlee changed the title [WIP] Serial murmur3 hash with configurable seed [REVIEW] Serial murmur3 hash with configurable seed Nov 20, 2020
* @param columns array of columns to hash, must have identical number of rows.
* @return the new ColumnVector of 32 character hex strings representing each row's hash value.
*/
public static ColumnVector serial32BitMurmurHash3(int seed, ColumnVector... columns) {
Contributor

The input parameters need to be ColumnView instead of ColumnVector.

return new ColumnVector(hash(columnViews, HashType.HASH_SERIAL_MURMUR3.getNativeId(), new int[0], seed));
}

public static ColumnVector serial32BitMurmurHash3(ColumnVector... columns) {
Contributor

Could we get Javadocs for this? Also, as with the function above, the input needs to be ColumnView instead of ColumnVector.

jobject j_object,
jlongArray column_handles,
jint hash_function_id) {
jobject j_object,
Contributor

nit: indentation appears to be off.

Comment on lines 743 to 749
for (int col_index = 0; col_index < device_input.num_columns(); col_index++) {
  hash_result = cudf::type_dispatcher(
    device_input.column(col_index).type(),
    element_hasher_with_seed<MurmurHash3_32, true>{hash_result, hash_result},
    device_input.column(col_index),
    row_index);
}
Contributor

This could be better done as a thrust::reduce (or transform_reduce) with a thrust::seq exec policy.

Contributor Author

Restructured to use thrust::tabulate and a lambda with a thrust::reduce
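The loop-to-reduction change discussed above can be sketched on the host with `std::accumulate` standing in for a sequential `thrust::reduce`. This is an analogue under stated assumptions, not the code from this PR: `hash_with_seed` is a hypothetical mixing function (not MurmurHash3), and a row is modelled as a plain vector of already-extracted values. Because each step feeds the previous result in as the seed, the operation is order-dependent, so the fold has to run sequentially from the first column to the last.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for hashing one element with a seed.
inline std::uint32_t hash_with_seed(std::uint32_t value, std::uint32_t seed)
{
  std::uint32_t h = seed ^ (value * 0x9e3779b1u);
  h ^= h >> 15;
  return h * 0x85ebca6bu;
}

// The explicit loop: the running hash seeds the next column's hash.
inline std::uint32_t hash_row_loop(std::vector<std::uint32_t> const& row, std::uint32_t seed)
{
  std::uint32_t h = seed;
  for (auto v : row) { h = hash_with_seed(v, h); }
  return h;
}

// The same computation written as a left fold over the row.
inline std::uint32_t hash_row_fold(std::vector<std::uint32_t> const& row, std::uint32_t seed)
{
  return std::accumulate(row.begin(), row.end(), seed,
                         [](std::uint32_t h, std::uint32_t v) { return hash_with_seed(v, h); });
}
```

Both functions return the seed unchanged for an empty row and agree on every other input, since the fold applies the same steps in the same order as the loop.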

if (has_nulls && col.is_null(row_index)) { return _null_hash; }

return hash_function<T>{_seed}(col.element<T>(row_index));
}
Contributor

@nvdbaranec nvdbaranec Nov 20, 2020

Just as a heads up on handling nested types here. One way to add support without requiring recursive examination here would be:

  • Preprocess any nested type column into a uint32_t column of pre-hashed values.

  • Substitute that preprocessed column into the table view in place of the nested type. You'd also have to know not to invoke hash_function<> here and just return the value directly.

  • Depending on what it means to hash something like a List<List<Struct<int, float>, int, List>>>>, you could then potentially generate this preprocessed column using the standard nested-type technique of processing each level of nesting as a separate chunk of GPU work and then recursing on the CPU. A good example of this is something like:

    std::unique_ptr<column> concatenate(std::vector<column_view> const& columns,

    The plausibility greatly depends on what it even means to hash data like this, of course.

Note: I'm not suggesting that this should go into this PR, just how it might work when we get to it.
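The preprocessing idea above might look roughly like this on the host. Everything here is an assumption made for illustration: `list_column`, `flat_column`, `prehash_lists`, and `hash_element` are hypothetical names, `mix` is a placeholder rather than MurmurHash3, and what it should mean to hash a list is exactly the open question raised in the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder element hash (hypothetical, not MurmurHash3).
inline std::uint32_t mix(std::uint32_t value, std::uint32_t seed)
{
  std::uint32_t h = seed ^ (value * 0x9e3779b1u);
  h ^= h >> 15;
  return h * 0x85ebca6bu;
}

// Hypothetical list column: one list of values per row.
using list_column = std::vector<std::vector<std::uint32_t>>;

// Step 1: collapse each nested row into a single pre-hashed uint32_t.
inline std::vector<std::uint32_t> prehash_lists(list_column const& lists, std::uint32_t seed)
{
  std::vector<std::uint32_t> out;
  out.reserve(lists.size());
  for (auto const& list : lists) {
    std::uint32_t h = seed;
    for (auto v : list) { h = mix(v, h); }
    out.push_back(h);
  }
  return out;
}

// Step 2: the flat column that replaces the nested one in the table view.
struct flat_column {
  std::vector<std::uint32_t> values;
  bool prehashed;  // true: values are already hashes and must not be hashed again
};

// The element hasher returns pre-hashed values directly instead of hashing them.
inline std::uint32_t hash_element(flat_column const& col, std::size_t row, std::uint32_t seed)
{
  assert(row < col.values.size());
  return col.prehashed ? col.values[row] : mix(col.values[row], seed);
}
```

The row hasher then sees only flat columns, so it needs no recursion of its own; the nesting is handled entirely in the preprocessing step.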

cpp/src/hash/hashing.cu (outdated, resolved)
@rwlee rwlee added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed 2 - In Progress Currently a work in progress labels Dec 1, 2020
@nvdbaranec nvdbaranec self-requested a review December 1, 2020 23:44
auto output_view = output->mutable_view();

if (has_nulls(input)) {
thrust::tabulate(rmm::exec_policy(stream)->on(stream.value()),
Member

Neat!

python/cudf/cudf/_lib/hash.pyx (outdated, resolved)
@harrism harrism requested a review from galipremsagar December 3, 2020 01:26
@harrism harrism added 5 - Ready to Merge Testing and reviews complete, ready to merge 6 - Okay to Auto-Merge and removed 3 - Ready for Review Ready for review by team labels Dec 3, 2020
@harrism
Member

harrism commented Dec 3, 2020

@rapidsai/ops can you explain why the automerge didn't work here? Why is github saying merging is blocked? I don't see any remaining required reviews.

@ajschmidt8
Member

> @rapidsai/ops can you explain why the automerge didn't work here? Why is github saying merging is blocked? I don't see any remaining required reviews.

@harrism, the automerger did its job correctly here. It shouldn't merge any PRs that aren't all green.

It looks like the PR still requires a rapidsai/cudf-java-codeowners member review. It appears that @revans2 from that group has left a comment via a PR review, but didn't explicitly approve the PR yet. Once he approves the PR, this should merge assuming all other checks are passing.


* @return the new ColumnVector of 32 character hex strings representing each row's hash value.
*/
public static ColumnVector serial32BitMurmurHash3(ColumnView columns[]) {
return serial32BitMurmurHash3(0, columns);
Contributor

Probably a stupid question, but why 0?

Contributor Author

0 is the default seed that cuDF uses; it had previously been hard-coded a few layers deep with no configurability. I'm trying to retain and match cuDF's existing behavior as consistently as possible.

try (ColumnVector v0 = ColumnVector.fromBoxedInts(0, 100, null, null, Integer.MIN_VALUE, null);
ColumnVector v1 = ColumnVector.fromBoxedInts(0, null, -100, null, null, Integer.MAX_VALUE);
ColumnVector result = ColumnVector.serial32BitMurmurHash3(42, new ColumnVector[]{v0, v1});
ColumnVector expected = ColumnVector.fromBoxedInts(59727262, 751823303, -1080202046, 42, 723455942, 133916647)) {
Contributor

You might have a good reason for hard-coding these, and I am not familiar with hashing in Java, but can we generate the expected values instead?

Contributor Author

I could generate expected values, but the results should be static and never change. Also using org.apache.commons.codec.digest.MurmurHash3 would add other potential failure points (requires converting input to byte arrays, which would not necessarily be an apples to apples comparison).

NEGATIVE_DOUBLE_NAN_UPPER_RANGE, NEGATIVE_DOUBLE_NAN_LOWER_RANGE,
Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY);
ColumnVector result = ColumnVector.serial32BitMurmurHash3(new ColumnVector[]{v});
ColumnVector expected = ColumnVector.fromBoxedInts(1669671676, 0, -544903190, -1831674681, 150502665, 474144502, 1428788237, 1428788237, 1428788237, 1428788237, 420913893, 1915664072)) {
Contributor

Same comment about hard-coding the expected values.

@rapids-bot rapids-bot bot merged commit 73cca47 into rapidsai:branch-0.17 Dec 3, 2020
@razajafri
Contributor

Others have approved the PR, and the remaining comments are only questions or minor points, so please feel free to merge.

rapids-bot bot pushed a commit that referenced this pull request Aug 1, 2022
This PR closes #11296. While implementing Spark list hashing in #11292, I noticed that `HASH_SERIAL_MURMUR3` does not appear to be used except in tests. It is not exposed in Python. While it is exposed in the JNI bindings, it is not used by spark-rapids. I discussed this with @rwlee and it seems that this feature was added only for parallel design with the Spark serial hash implementation in #6781, which is superseded by #11292. We do not need to keep this vestigial feature.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - https://github.com/brandon-b-miller
  - David Wendt (https://github.com/davidwendt)
  - Jason Lowe (https://github.com/jlowe)

URL: #11383