Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark list hashing #11292

Merged
merged 48 commits into from
Jul 28, 2022
Merged

Spark list hashing #11292

merged 48 commits into from
Jul 28, 2022

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Jul 18, 2022

Closes #10378. This PR provides Spark-compliant hash values for list columns.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 18, 2022
@bdice bdice self-assigned this Jul 18, 2022
@bdice bdice added the 2 - In Progress Currently a work in progress label Jul 18, 2022
@bdice bdice added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Jul 18, 2022
@github-actions github-actions bot added the CMake CMake build issue label Jul 19, 2022
Copy link
Contributor Author

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review comments.

cpp/src/hash/spark_murmur_hash.cu Outdated Show resolved Hide resolved
cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
cpp/src/hash/spark_murmur_hash.cu Outdated Show resolved Hide resolved
@bdice bdice added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 20, 2022
@bdice bdice marked this pull request as ready for review July 20, 2022 19:08
@@ -44,6 +44,12 @@ std::unique_ptr<column> murmur_hash3_32(
rmm::cuda_stream_view stream = cudf::default_stream_value,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

std::unique_ptr<column> spark_murmur_hash3_32(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why the APIs here don't have doxygen?

Copy link
Contributor Author

@bdice bdice Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're detail APIs, which don't require docs. The public API is cudf::hash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build does not require detail APIs to have doxygen but programmers would still appreciate documentation.
You can see many detail functions are documented with @copydoc tags for example.
I can add some detail to the https://github.com/rapidsai/cudf/blob/branch-22.08/cpp/docs/DOCUMENTATION.md

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be some significant changes in this file with other planned work (#11296), so I'm going to defer on this until I can do it for the whole file. I added a note to myself to improve this later for all hash functions. #10081 (comment)


void check_hash_compatibility(table_view const& input)
{
using column_checker_fn_t = std::function<void(column_view const&)>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a function wrapper is used here? Why can't just use a lambda?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question - I asked this on a previous PR that did something like this (can't find the reference). The problem is that lambda functions defined as auto don't work when they must be called recursively. https://stackoverflow.com/questions/2067988/recursive-lambda-functions-in-c11

You can see this in other places in libcudf:

column_checker_fn_t check_column = [&](column_view const& c) {

Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Handling the level of complexity of template+typename+template... in this PR is highly appreciated 😄

Nullate const _check_nulls;
table_device_view const _table;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, is there a style guideline or other reasoning for the ordering change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing one of two possibilities:

  • He just chose to alphabetize.
  • The order in which members are initialized is based on the order that they are declared here, not the order that they appear in the initializer list (the part after the : in the constructor). Since check_nulls comes before t in the constructor signature, he may have reordered the initializer list to match, and then reordering this bit becomes necessary to avoid creating an asymmetry that could catch unwary developers off guard (there are subtle bugs that can come from the wrong initialization order if the constructor makes some invalid assumptions).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about initialization order matching member order. Compilers sometimes throw warnings about this, and it’s good practice to make the constructor argument order match the initialization order and member order when possible.

Copy link
Contributor

@rwlee rwlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, all the nits I had seen previously got cleaned up in other iterations. Really exciting nested type functionality!

I added Java tests for nested-structs, lists, and structs-of-lists for better coverage. We can expand this testing in the plugin and in the follow on for lists-of-structs.

Plugin side testing has also indicated good results with this solution outside of the lists-of-structs case that is currently caught and excepted inside cudf C++. Because the error case is non-trivial to catch in the JNI layer, relying on the C++ exception is acceptable.

cpp/src/hash/spark_murmur_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/spark_murmur_hash.cu Outdated Show resolved Hide resolved
@bdice bdice requested review from vyasr and robertmaynard July 27, 2022 22:32
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! I haven't looked at the Java test, but I assume it's an issue with not properly closing a resource.

@bdice bdice mentioned this pull request Jul 28, 2022
3 tasks
@bdice
Copy link
Contributor Author

bdice commented Jul 28, 2022

@rwlee It appears there might be an issue with the Java tests leaking memory. I reverted those tests and have opened PR #11379 to fix this for branch-22.10. I don't think the leak is in libcudf - I see a few Java tests that manually close their column views, so I hope this is one of those cases (but I'm unsure what to do to fix it).

ai.rapids.cudf.RmmException: Could not shut down RMM there appear to be outstanding allocations

@bdice
Copy link
Contributor Author

bdice commented Jul 28, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0891746 into rapidsai:branch-22.08 Jul 28, 2022
rapids-bot bot pushed a commit that referenced this pull request Aug 1, 2022
This PR closes #11296. While implementing Spark list hashing in #11292, I noticed that `HASH_SERIAL_MURMUR3` does not appear to be used except in tests. It is not exposed in Python. While it is exposed in the JNI bindings, it is not used by spark-rapids. I discussed this with @rwlee and it seems that this feature was added only for parallel design with the Spark serial hash implementation in #6781, which is superseded by #11292. We do not need to keep this vestigial feature.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - https://github.com/brandon-b-miller
  - David Wendt (https://github.com/davidwendt)
  - Jason Lowe (https://github.com/jlowe)

URL: #11383
rapids-bot bot pushed a commit that referenced this pull request Aug 1, 2022
This PR adds Java tests for the Spark list hashing feature added in #11292.

Depends on #11292.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Jason Lowe (https://github.com/jlowe)

URL: #11379
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Hashing of LIST columns
6 participants