Add HiveHash support for nested types #9
Conversation
Signed-off-by: ustcfy <[email protected]>
src/main/cpp/src/hive_hash.cu
Outdated
template <typename T, CUDF_ENABLE_IF(cudf::is_nested<T>())>
template <typename T>
We'd better keep the two original methods here to separate the non-nested types from the nested types. It avoids some useless "if" condition checks and "accumulate" executions, which would hurt performance when there are millions of rows or more.
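For reference, a minimal sketch of what keeping the two overloads looks like, assuming cudf's CUDF_ENABLE_IF and cudf::is_nested traits as used in the original code; the bodies are schematic placeholders (hash_functor is an assumed member), not the actual patch:

// Non-nested fast path: no stack bookkeeping, no extra "if"/"accumulate" work.
template <typename T, CUDF_ENABLE_IF(not cudf::is_nested<T>())>
__device__ hive_hash_value_t operator()(cudf::column_device_view const& col,
                                        cudf::size_type row_index) const noexcept
{
  return this->hash_functor.template operator()<T>(col, row_index);  // assumed member
}

// Nested path: only STRUCT/LIST columns pay for the iterative traversal.
template <typename T, CUDF_ENABLE_IF(cudf::is_nested<T>())>
__device__ hive_hash_value_t operator()(cudf::column_device_view const& col,
                                        cudf::size_type row_index) const noexcept
{
  // ... explicit-stack, depth-first traversal discussed below ...
  return HIVE_INIT_HASH;  // placeholder
}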
src/main/cpp/src/hive_hash.cu
Outdated
__device__ hive_hash_value_t operator()(cudf::column_device_view const& col,
                                        cudf::size_type row_index) const noexcept
{
  CUDF_UNREACHABLE("Nested type is not supported");
  cudf::column_device_view curr_col = col.slice(row_index, 1);
  stack_element stack[8] = {stack_element(curr_col), stack_element(curr_col), stack_element(curr_col), stack_element(curr_col),
Why is the size "8"?
I cannot understand how the stack approach works. Better to add some comments here.
Why is the size "8"?
It's just a temporary solution, and I will improve it later.
Move the 8 into the template.
                           true, false, null, false, true, null);
ColumnView structs = ColumnView.makeStructView(strings, integers, doubles, floats, bools);
ColumnVector result = Hash.hiveHash(new ColumnView[]{structs});
ColumnVector expected = Hash.hiveHash(new ColumnVector[]{strings, integers, doubles, floats, bools})) {
This cannot test the correctness of the HiveHash computation. It can only prove that the nested struct has nothing to do with the hash result. If the hash algorithm produces different output than Spark HiveHash, the test can still pass.
ColumnView structs = ColumnView.makeStructView(strings, integers, doubles, floats, bools);
ColumnVector result = Hash.hiveHash(new ColumnView[]{structs});
ColumnVector expected = Hash.hiveHash(new ColumnVector[]{strings, integers, doubles, floats, bools})) {
  assertColumnsAreEqual(expected, result);
Test cases are using the same interface to get both expected and actual results. Please leverage the results in testHiveHashStrings, testHiveHashFloats, ...
The test case will look like:

struct_data = {1.0, 2.3, 5}
actual = Hash.hiveHash(...)
expected = 37 * (37 * hash(1.0) + hash(2.3)) + hash(5)
assert_equals(expected, actual)
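As a hedged, CPU-side sketch of that expected-value arithmetic: hash_int and hash_float are made-up helpers following Hive's primitive hash rules (int hashes to itself, float to its Java Float.floatToIntBits), and note Hive's reference implementation combines with multiplier 31, so treat the 37 above as standing in for whatever HIVE_HASH_FACTOR hive_hash.cu actually defines:

#include <cstdint>
#include <cstdio>
#include <cstring>

constexpr uint32_t HIVE_HASH_FACTOR = 31;  // Hive itself uses 31 as the combine multiplier

int32_t hash_int(int32_t v) { return v; }  // Hive: hash of an int is the int itself

int32_t hash_float(float f)                // Hive: Java's Float.floatToIntBits
{
  int32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return bits;
}

// Combine with well-defined wraparound, matching Java's int arithmetic.
int32_t combine(int32_t seed, int32_t v)
{
  return static_cast<int32_t>(static_cast<uint32_t>(seed) * HIVE_HASH_FACTOR +
                              static_cast<uint32_t>(v));
}

int main()
{
  // struct_data = {1.0f, 2.3f, 5}
  int32_t expected = combine(combine(hash_float(1.0f), hash_float(2.3f)), hash_int(5));
  std::printf("expected struct hash = %d\n", expected);  // compare against Hash.hiveHash
  return 0;
}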
It would be better to use the Spark hive hash interface to get the expected result.
Try to introduce the corresponding Java dependency with test scope:

<dependency>
  <groupId>xx</groupId>
  <artifactId>xx</artifactId>
  <scope>test</scope>
</dependency>
I would not suggest doing this in the JNI repo. We can hardcode the expected results just as other tests do.
But we can test them against Spark hive hash in the Rapids tests.
I tried to read the stack logic; it looks correct, but could we add some more comments?
Also, if it is not much effort on the plugin side, I suggest we test there with an integration test, which has larger coverage and is more convenient to write. OK, it seems that is not so convenient for this case.
src/main/cpp/src/hive_hash.cu
Outdated
__device__ hive_hash_value_t operator()(cudf::column_device_view const& col,
                                        cudf::size_type row_index) const noexcept
{
  CUDF_UNREACHABLE("Nested type is not supported");
  cudf::column_device_view curr_col = col.slice(row_index, 1);
  stack_element stack[8] = {stack_element(curr_col), stack_element(curr_col), stack_element(curr_col), stack_element(curr_col),
Question: Does this mean that we only support nested types whose depth is less than 8?
Yes, we should fall back in the plugin when encountering a nested type more than 8 levels deep.
@@ -510,4 +510,164 @@ void testHiveHashMixed() {
    assertColumnsAreEqual(expected, result);
  }
}

@Test
void testHiveHashStruct() {
Would you please test all the types, especially the DICTIONARY32, TIMESTAMP_MILLISECONDS, and DECIMAL32 types.
switch (dtype.id()) {
case type_id::INT8:
return f.template operator()<typename IdTypeMap<type_id::INT8>::type>(
std::forward<Ts>(args)...);
case type_id::INT16:
return f.template operator()<typename IdTypeMap<type_id::INT16>::type>(
std::forward<Ts>(args)...);
case type_id::INT32:
return f.template operator()<typename IdTypeMap<type_id::INT32>::type>(
std::forward<Ts>(args)...);
case type_id::INT64:
return f.template operator()<typename IdTypeMap<type_id::INT64>::type>(
std::forward<Ts>(args)...);
case type_id::UINT8:
return f.template operator()<typename IdTypeMap<type_id::UINT8>::type>(
std::forward<Ts>(args)...);
case type_id::UINT16:
return f.template operator()<typename IdTypeMap<type_id::UINT16>::type>(
std::forward<Ts>(args)...);
case type_id::UINT32:
return f.template operator()<typename IdTypeMap<type_id::UINT32>::type>(
std::forward<Ts>(args)...);
case type_id::UINT64:
return f.template operator()<typename IdTypeMap<type_id::UINT64>::type>(
std::forward<Ts>(args)...);
case type_id::FLOAT32:
return f.template operator()<typename IdTypeMap<type_id::FLOAT32>::type>(
std::forward<Ts>(args)...);
case type_id::FLOAT64:
return f.template operator()<typename IdTypeMap<type_id::FLOAT64>::type>(
std::forward<Ts>(args)...);
case type_id::BOOL8:
return f.template operator()<typename IdTypeMap<type_id::BOOL8>::type>(
std::forward<Ts>(args)...);
case type_id::TIMESTAMP_DAYS:
return f.template operator()<typename IdTypeMap<type_id::TIMESTAMP_DAYS>::type>(
std::forward<Ts>(args)...);
case type_id::TIMESTAMP_SECONDS:
return f.template operator()<typename IdTypeMap<type_id::TIMESTAMP_SECONDS>::type>(
std::forward<Ts>(args)...);
case type_id::TIMESTAMP_MILLISECONDS:
return f.template operator()<typename IdTypeMap<type_id::TIMESTAMP_MILLISECONDS>::type>(
std::forward<Ts>(args)...);
case type_id::TIMESTAMP_MICROSECONDS:
return f.template operator()<typename IdTypeMap<type_id::TIMESTAMP_MICROSECONDS>::type>(
std::forward<Ts>(args)...);
case type_id::TIMESTAMP_NANOSECONDS:
return f.template operator()<typename IdTypeMap<type_id::TIMESTAMP_NANOSECONDS>::type>(
std::forward<Ts>(args)...);
case type_id::DURATION_DAYS:
return f.template operator()<typename IdTypeMap<type_id::DURATION_DAYS>::type>(
std::forward<Ts>(args)...);
case type_id::DURATION_SECONDS:
return f.template operator()<typename IdTypeMap<type_id::DURATION_SECONDS>::type>(
std::forward<Ts>(args)...);
case type_id::DURATION_MILLISECONDS:
return f.template operator()<typename IdTypeMap<type_id::DURATION_MILLISECONDS>::type>(
std::forward<Ts>(args)...);
case type_id::DURATION_MICROSECONDS:
return f.template operator()<typename IdTypeMap<type_id::DURATION_MICROSECONDS>::type>(
std::forward<Ts>(args)...);
case type_id::DURATION_NANOSECONDS:
return f.template operator()<typename IdTypeMap<type_id::DURATION_NANOSECONDS>::type>(
std::forward<Ts>(args)...);
case type_id::DICTIONARY32:
return f.template operator()<typename IdTypeMap<type_id::DICTIONARY32>::type>(
std::forward<Ts>(args)...);
case type_id::STRING:
return f.template operator()<typename IdTypeMap<type_id::STRING>::type>(
std::forward<Ts>(args)...);
case type_id::LIST:
return f.template operator()<typename IdTypeMap<type_id::LIST>::type>(
std::forward<Ts>(args)...);
case type_id::DECIMAL32:
return f.template operator()<typename IdTypeMap<type_id::DECIMAL32>::type>(
std::forward<Ts>(args)...);
case type_id::DECIMAL64:
return f.template operator()<typename IdTypeMap<type_id::DECIMAL64>::type>(
std::forward<Ts>(args)...);
case type_id::DECIMAL128:
return f.template operator()<typename IdTypeMap<type_id::DECIMAL128>::type>(
std::forward<Ts>(args)...);
case type_id::STRUCT:
return f.template operator()<typename IdTypeMap<type_id::STRUCT>::type>(
std::forward<Ts>(args)...);
- Do you mean, for example, to add fields like TIMESTAMP_MILLISECONDS and DECIMAL32 inside struct? However, it seems that the HiveHash for these two types has not been implemented yet.
- As for DICTIONARY, I will consider adding support if the cases for struct and list are resolved.
Do you mean, for example, to add fields like TIMESTAMP_MILLISECONDS and DECIMAL32 inside struct? However, it seems that the HiveHash for these two types has not been implemented yet.

If HiveHash does not support these two types, we should try to find a way to implement them, or we should fall back in the Rapids plugin. I assume Spark supports these two types.

As for DICTIONARY, I will consider adding support if the cases for struct and list are resolved.

Please help test:
- DICTIONARY for basic types like int, long
- DICTIONARY for nested types (maybe cuDF cannot generate this kind of data)
Support for more primitive types can be a follow-up. This PR focuses on nested type support.
src/main/cpp/src/hive_hash.cu
Outdated
__device__ hive_hash_value_t operator()(cudf::column_device_view const& col,
                                        cudf::size_type row_index) const noexcept
{
  CUDF_UNREACHABLE("Nested type is not supported");
  cudf::column_device_view curr_col = col.slice(row_index, 1);
  stack_element stack[8] = {stack_element(curr_col), stack_element(curr_col), stack_element(curr_col), stack_element(curr_col),
No need to initialize 8 column_device_views.
Consider:

constexpr int len_of_8_stack_element = 8 * sizeof(stack_element);
// column_device_view's default constructor is deleted, so we can not allocate a column_device_view array directly;
// use a byte array to wrap the stack_element list
uint8_t stack_wrapper[len_of_8_stack_element];
stack_element* stack = reinterpret_cast<stack_element*>(stack_wrapper);
Please add some pseudocode to explain the calculation logic, e.g.:

// The stack simulates the recursive function invocations:
hive_hash(float) {
  return (int)float;
}

hive_hash(list) {
  int hash = 0;
  for (child : list) {
    // Depth-first traversal
    hash += hive_hash(child);
  }
  return hash;
}

It's better to add a simple example.
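As a hedged, host-side sketch of how such an explicit stack replaces the recursion: Node and Frame below are made-up types, not the kernel's StackElement, and the combine step uses Hive's hash = hash * 31 + child_hash rather than the += in the pseudocode above:

#include <cstdint>
#include <cstdio>
#include <vector>

// Toy nested value: either a leaf int or a list of children.
struct Node {
  bool is_leaf;
  int32_t value;               // used when is_leaf
  std::vector<Node> children;  // used when !is_leaf
};

int32_t hive_hash_iterative(Node const& root)
{
  // One frame per nested value being hashed: which child comes next, and the
  // partial hash of the children folded in so far.
  struct Frame { Node const* node; size_t child_idx; int32_t hash; };
  std::vector<Frame> stack{{&root, 0, 0}};
  int32_t result = 0;
  while (!stack.empty()) {
    Frame& top = stack.back();
    if (top.node->is_leaf) {
      result = top.node->value;  // Hive: hash of an int is the int itself
      stack.pop_back();
    } else if (top.child_idx == top.node->children.size()) {
      result = top.hash;         // every child folded in: this value is done
      stack.pop_back();
    } else {
      // Depth-first: descend into the next child before finishing this frame.
      stack.push_back({&top.node->children[top.child_idx++], 0, 0});
      continue;
    }
    // A frame just finished; fold its hash into its parent, if any.
    if (!stack.empty()) {
      Frame& parent = stack.back();
      parent.hash = parent.hash * 31 + result;
    }
  }
  return result;
}

int main()
{
  // [[1, 2], [3]]
  Node root{false, 0, {Node{false, 0, {Node{true, 1, {}}, Node{true, 2, {}}}},
                       Node{false, 0, {Node{true, 3, {}}}}}};
  std::printf("%d\n", hive_hash_iterative(root));  // prints 31 * 33 + 3 = 1026
  return 0;
}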
Signed-off-by: ustcfy <[email protected]>
…vehash-nested-support
Signed-off-by: ustcfy <[email protected]>
src/main/cpp/src/hive_hash.cu
Outdated
cudf::column_device_view curr_col = col.slice(row_index, 1);
// column_device_view default constructor is deleted, can not allocate column_device_view array directly
// use byte array to wrapper StackElement list
constexpr int len_of_8_StackElement = 8 * sizeof(StackElement);
Why is 8 used here?
src/main/cpp/src/hive_hash.cu
Outdated
template <typename T, CUDF_ENABLE_IF(cudf::is_nested<T>())>
__device__ hive_hash_value_t operator()(cudf::column_device_view const& col,
                                        cudf::size_type row_index) const noexcept
{
  CUDF_UNREACHABLE("Nested type is not supported");
  hive_hash_value_t ret = HIVE_INIT_HASH;
src/main/cpp/src/hive_hash.cu
Outdated
uint8_t stack_wrapper[len_of_8_StackElement];
StackElementPtr stack = reinterpret_cast<StackElementPtr>(stack_wrapper);
I am confused about creating a stack of StackElement. Does the below not work in C++?

StackElement stack[8];
I am confused about creating a stack of StackElement. Does the below not work in C++? StackElement stack[8];

Because if I define an array StackElement stack[8]; without initialization, each element of the array will be constructed using the default constructor of StackElement. However, the default constructor for the member cudf::column_device_view in StackElement is deleted.
OK, but you can restore the default constructor. Any reason it has to be deleted?
uint8_t stack_wrapper[len_of_8_StackElement] will also initialize 8 empty StackElements, the same as what StackElement stack[8] will do.
Any reason it has to be deleted?

I do not know why cuDF marks the default constructor as deleted:

column_device_view() = delete;

uint8_t stack_wrapper[len_of_8_StackElement] will also initialize 8 empty StackElements

For an array like A_Class[8], the compiler first allocates 8 * sizeof(A_Class) bytes of memory, then calls the default constructor to initialize 8 instances. uint8_t stack_wrapper[len_of_8_StackElement] skips those default-constructor calls.
Here is a workaround to skip the call of the default constructor:

uint8_t stack_wrapper[len_of_8_StackElement];
StackElementPtr stack = reinterpret_cast<StackElementPtr>(stack_wrapper);
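As a hedged, self-contained illustration of why the byte-array trick works: Elem below is a made-up stand-in for StackElement, placement new makes the element construction well defined, and the alignas (which the snippet above omits) keeps the buffer suitably aligned:

#include <cstdint>
#include <new>

struct Elem {
  int child_idx;
  Elem() = delete;  // mimics column_device_view() = delete
  explicit Elem(int idx) : child_idx(idx) {}
};

int main()
{
  // Elem stack[8];  // does not compile: deleted default constructor
  alignas(Elem) uint8_t stack_wrapper[8 * sizeof(Elem)];  // no constructor calls here
  Elem* stack = reinterpret_cast<Elem*>(stack_wrapper);
  int stack_size = 0;
  // Construct an element in place only when it is actually pushed.
  new (stack_wrapper + stack_size * sizeof(Elem)) Elem(42);
  ++stack_size;
  return stack[0].child_idx == 42 ? 0 : 1;
}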
Signed-off-by: ustcfy <[email protected]>
src/main/cpp/src/hive_hash.cu
Outdated
@@ -157,7 +157,7 @@ hive_hash_value_t __device__ inline hive_hash_function<cudf::timestamp_us>::oper
 * @tparam hash_function Hash functor to use for hashing elements. Must be hive_hash_function.
 * @tparam Nullate A cudf::nullate type describing whether to check for nulls.
 */
template <template <typename> class hash_function, typename Nullate>
template <template <typename> class hash_function, typename Nullate, int MAX_NESTED_LEN = 8>
Personally, it is not a good idea to hardcode the nested length. It will crash when there are more than 8 nested layers.
If 8 is the upper limit, we need to check the depth and throw an exception (similar to what check_hash_compatibility does) before running into this row hasher, and note this limitation in the doc of this function/class.
Another feasible way: we can calculate this length from the schema (you can get the schema from the input table view), allocate a GPU stack (e.g. rmm::device_vector or rmm::device_buffer), and pass the stack into the hive_device_row_hasher.
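For reference, a minimal sketch of deriving that depth from the schema on the host, assuming cudf::column_view's num_children()/child() accessors; the helper name is hypothetical:

#include <algorithm>
#include <cudf/column/column_view.hpp>
#include <cudf/table/table_view.hpp>

// Hypothetical helper: maximum nesting depth of a column, derived purely from
// the schema; no row data is touched. A primitive column has depth 1.
int max_nested_depth(cudf::column_view const& col)
{
  int depth = 1;
  for (cudf::size_type i = 0; i < col.num_children(); ++i) {
    depth = std::max(depth, 1 + max_nested_depth(col.child(i)));
  }
  return depth;
}

int max_nested_depth(cudf::table_view const& input)
{
  int depth = 0;
  for (auto const& col : input) {
    depth = std::max(depth, max_nested_depth(col));
  }
  return depth;
}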
Another feasible way is we can calculate this length from the schema (you can get the schema from the input table view) and allocate a GPU stack

get-json-object is also using a constant depth. I remember that allocating a GPU stack impacts performance.
I recommend checking the max depth in the CPU C++ code before calling into this kernel and throwing an exception if the depth exceeds 8. In the plugin Java code, add fallback logic.
The GPU stack (actually GPU memory) is allocated with a fixed size calculated from the schema, just like allocating an output column (GPU memory inside). So it should have no big impact on perf, I think.
And personally, this case is different from yours. This max length can be inferred from the schema without accessing any actual row, but get-json-object requires row access to get this info.
src/main/cpp/src/hive_hash.cu
Outdated
  stack_size--;
} else {
  // Push child column to stack
  stack[stack_size++] = StackElement(cudf::detail::structs_column_device_view(curr_col).get_sliced_child(element->child_idx),
                                     element->factor_exp + curr_col.num_child_columns() - 1 - element->child_idx);
Throw an exception if stack_size > 8.
Can we throw an exception in a CUDA kernel?
See the example below; a macro, CUDF_UNREACHABLE, is used to handle this kind of case:
template <typename T, CUDF_ENABLE_IF(not cudf::column_device_view::has_element_accessor<T>())>
__device__ hash_value_type operator()(cudf::column_device_view const&,
cudf::size_type,
Nullate const,
hash_value_type const) const noexcept
{
CUDF_UNREACHABLE("Unsupported type for xxhash64");
}
Throw an exception in CPU code before calling into this kernel.
src/main/cpp/src/hive_hash.cu
Outdated
if (element->child_idx == curr_col.size()) {
  stack_size--;
} else {
  stack[stack_size++] = StackElement(curr_col.slice(element->child_idx, 1),
                                     element->factor_exp + curr_col.size() - element->child_idx - 1);
Throw an exception in CPU code before calling into this kernel.
Signed-off-by: ustcfy <[email protected]>
src/main/cpp/src/hive_hash.cu
Outdated
@@ -157,7 +157,7 @@ hive_hash_value_t __device__ inline hive_hash_function<cudf::timestamp_us>::oper
 * @tparam hash_function Hash functor to use for hashing elements. Must be hive_hash_function.
 * @tparam Nullate A cudf::nullate type describing whether to check for nulls.
 */
template <template <typename> class hash_function, typename Nullate>
template <template <typename> class hash_function, typename Nullate, int MAX_NESTED_LEN>
Nit: rename MAX_NESTED_LEN to MAX_NESTED_DEPTH.
src/main/cpp/src/hive_hash.cu
Outdated
@@ -239,6 +443,10 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
  // Return early if there's nothing to hash
  if (input.num_columns() == 0 || input.num_rows() == 0) { return output; }

  // Nested depth cannot exceed 8
  int const max_nested_len = 8;
constexpr is better than int const.
src/main/cpp/src/hive_hash.cu
Outdated
 * The hash value for the column is computed as:
 *   hash(S1) = hash(S2) * HIVE_HASH_FACTOR + hash(d)
 *            = (hash(i) * HIVE_HASH_FACTOR + hash(f)) * HIVE_HASH_FACTOR + hash(d)
 *            = hash(i) * HIVE_HASH_FACTOR^2 + hash(f) * HIVE_HASH_FACTOR + hash(d)
This transformation is not equivalent when considering overflow.
src/main/cpp/src/hive_hash.cu
Outdated
stack[stack_size++] = StackElement(curr_col, 1);

while (stack_size > 0) {
  StackElementPtr element = &stack[stack_size - 1];
Nit:
StackElement const& element = stack[stack_size - 1];
Nit: StackElement const& element = stack[stack_size - 1];

But I need to modify the child_idx member of the element.
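(If the mutation is the only blocker, a mutable reference would also avoid the raw pointer, e.g.:)

StackElement& element = stack[stack_size - 1];
element.child_idx++;  // mutation is still allowed through a non-const reference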
src/main/cpp/src/hive_hash.cu
Outdated
    check_nested_depth_impl(*child, max_nested_len - 1);
  }
} else {  // Primitive type
  CUDF_EXPECTS(max_nested_len > 0, "The nested depth of the input table exceeds the maximum supported depth.");
Nit: add max_nested_len to the error message to report more details.
Signed-off-by: ustcfy <[email protected]>
Co-authored-by: Liangcai Li <[email protected]>
Co-authored-by: Liangcai Li <[email protected]>
Co-authored-by: Liangcai Li <[email protected]>
Co-authored-by: Liangcai Li <[email protected]>
Co-authored-by: Liangcai Li <[email protected]>
}

@Test
void testHiveHashNestedType() {
Better to have a test verifying the case when the nested depth is larger than 8.
Tests for the list-of-struct type are also missing.
… depth limit Signed-off-by: ustcfy <[email protected]>
Looks Good to Me
Signed-off-by: ustcfy <[email protected]>
Signed-off-by: ustcfy <[email protected]>
From this link, we can see the nested HiveHash computation in the Spark CPU version.

Here is a simple example of calculating the hash value for a List<List<int>> type. Assuming column_device_view col is a List<List<int>> type:

cudf::column_device_view curr_col = col.slice(row_index, 1);

Now curr_col may look like the following:

You can run the following code in the Spark shell to calculate the hash value of curr_col.