-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark list hashing #11292
Spark list hashing #11292
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-review comments.
@@ -44,6 +44,12 @@ std::unique_ptr<column> murmur_hash3_32( | |||
rmm::cuda_stream_view stream = cudf::default_stream_value, | |||
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); | |||
|
|||
std::unique_ptr<column> spark_murmur_hash3_32( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why the APIs here don't have doxygen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're detail APIs, which don't require docs. The public API is cudf::hash
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The build does not require detail APIs to have doxygen but programmers would still appreciate documentation.
You can see many detail functions are documented with @copydoc
tags for example.
I can add some detail to the https://github.com/rapidsai/cudf/blob/branch-22.08/cpp/docs/DOCUMENTATION.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will be some significant changes in this file with other planned work (#11296), so I'm going to defer on this until I can do it for the whole file. I added a note to myself to improve this later for all hash functions. #10081 (comment)
|
||
void check_hash_compatibility(table_view const& input) | ||
{ | ||
using column_checker_fn_t = std::function<void(column_view const&)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why a function wrapper is used here? Why can't just use a lambda?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question - I asked this on a previous PR that did something like this (can't find the reference). The problem is that lambda functions defined as auto
don't work when they must be called recursively. https://stackoverflow.com/questions/2067988/recursive-lambda-functions-in-c11
You can see this in other places in libcudf:
cudf/cpp/src/table/row_operators.cu
Line 266 in e7e5f45
column_checker_fn_t check_column = [&](column_view const& c) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work. Handling the level of complexity of template+typename+template... in this PR is highly appreciated 😄
Nullate const _check_nulls; | ||
table_device_view const _table; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, is there a style guideline or other reasoning for the ordering change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing one of two possibilities:
- He just chose to alphabetize.
- The order in which members are initialized is based on the order that they are declared here, not the order that they appear in the initializer list (the part after the
:
in the constructor). Sincecheck_nulls
comes beforet
in the constructor signature, he may have reordered the initializer list to match, and then reordering this bit becomes necessary to avoid creating an asymmetry that could catch unwary developers off guard (there are subtle bugs that can come from the wrong initialization order if the constructor makes some invalid assumptions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about initialization order matching member order. Compilers sometimes throw warnings about this, and it’s good practice to make the constructor argument order match the initialization order and member order when possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, all the nits I had seen previously got cleaned up in other iterations. Really exciting nested type functionality!
I added Java tests for nested-structs
, lists
, and structs-of-lists
for better coverage. We can expand this testing in the plugin and in the follow on for lists-of-structs
.
Plugin side testing has also indicated good results with this solution outside of the lists-of-structs
case that is currently caught and excepted inside cudf C++. Because the error case is non-trivial to catch in the JNI layer, relying on the C++ exception is acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! I haven't looked at the Java test, but I assume it's an issue with not properly closing a resource.
@rwlee It appears there might be an issue with the Java tests leaking memory. I reverted those tests and have opened PR #11379 to fix this for branch-22.10. I don't think the leak is in libcudf - I see a few Java tests that manually close their column views, so I hope this is one of those cases (but I'm unsure what to do to fix it).
|
@gpucibot merge |
This PR closes #11296. While implementing Spark list hashing in #11292, I noticed that `HASH_SERIAL_MURMUR3` does not appear to be used except in tests. It is not exposed in Python. While it is exposed in the JNI bindings, it is not used by spark-rapids. I discussed this with @rwlee and it seems that this feature was added only for parallel design with the Spark serial hash implementation in #6781, which is superseded by #11292. We do not need to keep this vestigial feature. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Robert Maynard (https://github.com/robertmaynard) - https://github.com/brandon-b-miller - David Wendt (https://github.com/davidwendt) - Jason Lowe (https://github.com/jlowe) URL: #11383
This PR adds Java tests for the Spark list hashing feature added in #11292. Depends on #11292. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Nghia Truong (https://github.com/ttnghia) - Jason Lowe (https://github.com/jlowe) URL: #11379
Closes #10378. This PR provides Spark-compliant hash values for list columns.