
Add HiveHash support for nested types #9

Merged
16 commits merged into wjxiz1992:1010 on Oct 23, 2024

Conversation


@ustcfy ustcfy commented Oct 11, 2024

From this link, we can see the nested HiveHash computation in the Spark CPU version.

Here is a simple example of calculating the hash value for a List<List<int>> type.

Assuming column_device_view col is a List<List<int>> type:

cudf::column_device_view curr_col = col.slice(row_index, 1);
Now curr_col may look like the following:

Index List<List<int>>
0 {{1, 0}, null, {2, null}}

You can run the following code in the Spark shell to calculate the hash value of curr_col.

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.util._

val inType = BoundReference(0, ArrayType(ArrayType(IntegerType)), true)
val hh = HiveHash(Seq(inType))

def toAData(any: Any): ArrayData = if (any == null) null else ArrayData.toArrayData(any)

val data = Seq(
  List(List(1, 0), null, List(2, null))
)

data.map { row =>
  val rr = if (row == null) null else row.map(toAData(_))
  val r = toAData(rr)
  hh.eval(new GenericInternalRow(Array(r.asInstanceOf[Any])))
}
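
For intuition, here is a plain host-side C++ sketch of the same computation (my reconstruction, assuming Hive's rules as implemented in ObjectInspectorUtils.hashCode: an int hashes to its own value, a null hashes to 0, and a list folds its children with hash = 31 * hash + child_hash). It is not the CUDA code from this PR.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Toy nested value: a null, an int leaf, or a list of children.
    struct Node {
      bool is_null = false;
      bool is_list = false;
      int32_t value = 0;
      std::vector<Node> children;
    };

    // Recursive HiveHash; arithmetic is done in uint32_t so overflow wraps
    // like Java's int instead of being undefined behavior.
    int32_t hive_hash(Node const& n) {
      if (n.is_null) return 0;
      if (!n.is_list) return n.value;
      uint32_t h = 0;
      for (auto const& c : n.children) {
        h = 31u * h + static_cast<uint32_t>(hive_hash(c));
      }
      return static_cast<int32_t>(h);
    }

    int main() {
      auto leaf = [](int32_t v) { Node n; n.value = v; return n; };
      auto nul  = []() { Node n; n.is_null = true; return n; };
      auto list = [](std::vector<Node> c) { Node n; n.is_list = true; n.children = std::move(c); return n; };

      // The row from the table above: {{1, 0}, null, {2, null}}
      Node row = list({list({leaf(1), leaf(0)}), nul(), list({leaf(2), nul()})});
      std::cout << hive_hash(row) << '\n';  // prints 29853 under the assumed rules
    }

If those rules match Spark's HiveHash, the spark-shell snippet above should report the same value.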


template <typename T, CUDF_ENABLE_IF(cudf::is_nested<T>())>
template <typename T>

@firestarman firestarman Oct 11, 2024


We'd better keep the two original methods here to separate the non-nested types from the nested types. That avoids some useless "if" condition checks and "accumulate" executions, which would hurt performance when there are millions of rows or more.
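
A minimal illustration of this suggestion in plain C++ (the real code uses CUDF_ENABLE_IF with cudf::is_nested<T>(); the trait and names here are toy stand-ins): splitting the overload set means the non-nested fast path compiles with no traversal state or per-row branch at all.

    #include <type_traits>
    #include <vector>

    template <typename T> struct is_nested : std::false_type {};
    template <typename T> struct is_nested<std::vector<T>> : std::true_type {};  // toy trait

    // Fast path: primitives hash directly, no stack bookkeeping.
    template <typename T, std::enable_if_t<!is_nested<T>::value, int> = 0>
    int hash_element(T const& v) { return static_cast<int>(v); }

    // Slow path: only nested types pay for the traversal.
    template <typename T, std::enable_if_t<is_nested<T>::value, int> = 0>
    int hash_element(T const& v) {
      int h = 0;
      for (auto const& c : v) h = 31 * h + hash_element(c);
      return h;
    }

    int main() { return hash_element(std::vector<int>{1, 0}) == 31 ? 0 : 1; }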

__device__ hive_hash_value_t operator()(cudf::column_device_view const& col,
cudf::size_type row_index) const noexcept
{
CUDF_UNREACHABLE("Nested type is not supported");
cudf::column_device_view curr_col = col.slice(row_index, 1);
stack_element stack[8] = {stack_element(curr_col), stack_element(curr_col), stack_element(curr_col), stack_element(curr_col),


Why is the size "8"?


I cannot understand how the stack approach works. Better to add some comments here.

Author


Why is the size "8"?

It's just a temporary solution, and I will improve it later.

Collaborator


Move the 8 into the template.
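
That is, something along these lines (a sketch only; a later diff in this thread adds exactly such a parameter, MAX_NESTED_LEN, later renamed MAX_NESTED_DEPTH in review):

    // Sketch: the depth becomes a template parameter instead of a literal 8.
    template <int MAX_NESTED_DEPTH = 8>
    struct hive_row_hasher_sketch {
      static_assert(MAX_NESTED_DEPTH > 0, "nested depth must be positive");
      // ... traversal stack storage sized by MAX_NESTED_DEPTH instead of 8 ...
    };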

true, false, null, false, true, null);
ColumnView structs = ColumnView.makeStructView(strings, integers, doubles, floats, bools);
ColumnVector result = Hash.hiveHash(new ColumnView[]{structs});
ColumnVector expected = Hash.hiveHash(new ColumnVector[]{strings, integers, doubles, floats, bools})) {


This cannot test the correctness of the HiveHash computation. It only proves that wrapping the columns in a struct has no effect on the hash result; if the hash algorithm produced output different from Spark's HiveHash, the test would still pass.

ColumnView structs = ColumnView.makeStructView(strings, integers, doubles, floats, bools);
ColumnVector result = Hash.hiveHash(new ColumnView[]{structs});
ColumnVector expected = Hash.hiveHash(new ColumnVector[]{strings, integers, doubles, floats, bools})) {
assertColumnsAreEqual(expected, result);
Collaborator


Test cases are using the same interface to get both the expected and the actual results.
Please leverage the results in testHiveHashStrings, testHiveHashFloats, etc.
The test case would look like:

struct_data = {1.0, 2.3, 5}
actual = Hash.hiveHash(...)
expected = 31 * (31 * hash(1.0) + hash(2.3)) + hash(5)
assert_equals(expected, actual)

Collaborator


It would be better to use the Spark HiveHash interface to get the expected results.
Try to introduce the corresponding Java dependency with test scope:

    <dependency>
      <groupId>xx</groupId>
      <artifactId>xx</artifactId>
      <scope>test</scope>
    </dependency>


I would not suggest doing this in the JNI repo. We can hardcode the expected results, just as the other tests do,
and then test against Spark's HiveHash in the RAPIDS tests.


@thirtiseven thirtiseven left a comment


I tried to read the stack logic; it looks correct, but could we add some more comments?

Also, if it is not much effort on the plugin side, I suggest testing there with an integration test, which has larger coverage and is more convenient to write. OK, it seems that is not so convenient for this case.

__device__ hive_hash_value_t operator()(cudf::column_device_view const& col,
cudf::size_type row_index) const noexcept
{
CUDF_UNREACHABLE("Nested type is not supported");
cudf::column_device_view curr_col = col.slice(row_index, 1);
stack_element stack[8] = {stack_element(curr_col), stack_element(curr_col), stack_element(curr_col), stack_element(curr_col),


Question: Does this mean that we only support nested types whose depth is less than 8?

Collaborator


Yes, the plugin should fall back when it encounters a nested type deeper than 8 levels.

@@ -510,4 +510,164 @@ void testHiveHashMixed() {
assertColumnsAreEqual(expected, result);
}
}

@Test
void testHiveHashStruct() {
Collaborator

@res-life res-life Oct 11, 2024


Would you please test all the types, especially the DICTIONARY32, TIMESTAMP_MILLISECONDS, and DECIMAL32 types.

 switch (dtype.id()) {
    case type_id::INT8:
      return f.template operator()<typename IdTypeMap<type_id::INT8>::type>(
        std::forward<Ts>(args)...);
    case type_id::INT16:
      return f.template operator()<typename IdTypeMap<type_id::INT16>::type>(
        std::forward<Ts>(args)...);
    case type_id::INT32:
      return f.template operator()<typename IdTypeMap<type_id::INT32>::type>(
        std::forward<Ts>(args)...);
    case type_id::INT64:
      return f.template operator()<typename IdTypeMap<type_id::INT64>::type>(
        std::forward<Ts>(args)...);
    case type_id::UINT8:
      return f.template operator()<typename IdTypeMap<type_id::UINT8>::type>(
        std::forward<Ts>(args)...);
    case type_id::UINT16:
      return f.template operator()<typename IdTypeMap<type_id::UINT16>::type>(
        std::forward<Ts>(args)...);
    case type_id::UINT32:
      return f.template operator()<typename IdTypeMap<type_id::UINT32>::type>(
        std::forward<Ts>(args)...);
    case type_id::UINT64:
      return f.template operator()<typename IdTypeMap<type_id::UINT64>::type>(
        std::forward<Ts>(args)...);
    case type_id::FLOAT32:
      return f.template operator()<typename IdTypeMap<type_id::FLOAT32>::type>(
        std::forward<Ts>(args)...);
    case type_id::FLOAT64:
      return f.template operator()<typename IdTypeMap<type_id::FLOAT64>::type>(
        std::forward<Ts>(args)...);
    case type_id::BOOL8:
      return f.template operator()<typename IdTypeMap<type_id::BOOL8>::type>(
        std::forward<Ts>(args)...);
    case type_id::TIMESTAMP_DAYS:
      return f.template operator()<typename IdTypeMap<type_id::TIMESTAMP_DAYS>::type>(
        std::forward<Ts>(args)...);
    case type_id::TIMESTAMP_SECONDS:
      return f.template operator()<typename IdTypeMap<type_id::TIMESTAMP_SECONDS>::type>(
        std::forward<Ts>(args)...);
    case type_id::TIMESTAMP_MILLISECONDS:
      return f.template operator()<typename IdTypeMap<type_id::TIMESTAMP_MILLISECONDS>::type>(
        std::forward<Ts>(args)...);
    case type_id::TIMESTAMP_MICROSECONDS:
      return f.template operator()<typename IdTypeMap<type_id::TIMESTAMP_MICROSECONDS>::type>(
        std::forward<Ts>(args)...);
    case type_id::TIMESTAMP_NANOSECONDS:
      return f.template operator()<typename IdTypeMap<type_id::TIMESTAMP_NANOSECONDS>::type>(
        std::forward<Ts>(args)...);
    case type_id::DURATION_DAYS:
      return f.template operator()<typename IdTypeMap<type_id::DURATION_DAYS>::type>(
        std::forward<Ts>(args)...);
    case type_id::DURATION_SECONDS:
      return f.template operator()<typename IdTypeMap<type_id::DURATION_SECONDS>::type>(
        std::forward<Ts>(args)...);
    case type_id::DURATION_MILLISECONDS:
      return f.template operator()<typename IdTypeMap<type_id::DURATION_MILLISECONDS>::type>(
        std::forward<Ts>(args)...);
    case type_id::DURATION_MICROSECONDS:
      return f.template operator()<typename IdTypeMap<type_id::DURATION_MICROSECONDS>::type>(
        std::forward<Ts>(args)...);
    case type_id::DURATION_NANOSECONDS:
      return f.template operator()<typename IdTypeMap<type_id::DURATION_NANOSECONDS>::type>(
        std::forward<Ts>(args)...);
    case type_id::DICTIONARY32:
      return f.template operator()<typename IdTypeMap<type_id::DICTIONARY32>::type>(
        std::forward<Ts>(args)...);
    case type_id::STRING:
      return f.template operator()<typename IdTypeMap<type_id::STRING>::type>(
        std::forward<Ts>(args)...);
    case type_id::LIST:
      return f.template operator()<typename IdTypeMap<type_id::LIST>::type>(
        std::forward<Ts>(args)...);
    case type_id::DECIMAL32:
      return f.template operator()<typename IdTypeMap<type_id::DECIMAL32>::type>(
        std::forward<Ts>(args)...);
    case type_id::DECIMAL64:
      return f.template operator()<typename IdTypeMap<type_id::DECIMAL64>::type>(
        std::forward<Ts>(args)...);
    case type_id::DECIMAL128:
      return f.template operator()<typename IdTypeMap<type_id::DECIMAL128>::type>(
        std::forward<Ts>(args)...);
    case type_id::STRUCT:
      return f.template operator()<typename IdTypeMap<type_id::STRUCT>::type>(
        std::forward<Ts>(args)...);

Author

@ustcfy ustcfy Oct 11, 2024


@res-life

  • Do you mean, for example, to add fields like TIMESTAMP_MILLISECONDS and DECIMAL32 inside the struct? It seems that HiveHash for these two types has not been implemented yet.

  • As for DICTIONARY, I will consider adding support once the struct and list cases are resolved.

Collaborator


Do you mean, for example, to add fields like TIMESTAMP_MILLISECONDS and DECIMAL32 inside the struct? It seems that HiveHash for these two types has not been implemented yet.

If HiveHash does not support these two types, we should try to find a way to implement them, or we should fall back in the RAPIDS plugin. I assume Spark supports these two types.

As for DICTIONARY, I will consider adding support once the struct and list cases are resolved.

Please help test:

  • DICTIONARY for basic types like int and long
  • DICTIONARY for nested types, though cuDF may not be able to generate this kind of data


Support for more primitive types can be a follow-up. This PR focuses on nested type support.

__device__ hive_hash_value_t operator()(cudf::column_device_view const& col,
cudf::size_type row_index) const noexcept
{
CUDF_UNREACHABLE("Nested type is not supported");
cudf::column_device_view curr_col = col.slice(row_index, 1);
stack_element stack[8] = {stack_element(curr_col), stack_element(curr_col), stack_element(curr_col), stack_element(curr_col),
Collaborator


No need to initialize 8 column_device_view instances.
Consider:

constexpr int len_of_8_stack_element = 8 * sizeof(stack_element);
// column_device_view's default constructor is deleted, so we cannot allocate a
// column_device_view array directly; use a byte array to wrap the stack_element list
uint8_t stack_wrapper[len_of_8_stack_element];
stack_element* stack = reinterpret_cast<stack_element*>(stack_wrapper);
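
Two details the suggestion glosses over (my additions, not from the review): the byte buffer needs alignas so the reinterpret_cast result is suitably aligned, and elements should be created with placement-new rather than assignment, since assigning into raw storage never runs a constructor. A minimal host-side sketch:

    #include <cstdint>
    #include <new>

    struct View { int data; };                     // stand-in for cudf::column_device_view
    struct StackElement {
      View col;
      int child_idx = 0;
      explicit StackElement(View c) : col(c) {}    // no default constructor, like the real type
    };

    int main() {
      constexpr int MAX_DEPTH = 8;
      alignas(StackElement) std::uint8_t storage[MAX_DEPTH * sizeof(StackElement)];
      auto* stack = reinterpret_cast<StackElement*>(storage);
      int top = 0;
      new (&stack[top++]) StackElement(View{42});  // placement-new skips the default ctor
      return stack[0].col.data == 42 ? 0 : 1;
    }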

Collaborator

@res-life res-life Oct 12, 2024


Please add some pseudocode to explain the calculation logic, e.g.:

// The stack simulates the recursive invocations:

hive_hash(float) {
   return (int)float;
}

hive_hash(list) {
  int hash = 0
  for (child : list) {
     // depth-first traversal
     hash = 31 * hash + hive_hash(child)
  }
  return hash
}

It's better to add a simple example.
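
In that spirit, here is a host-side sketch (my reconstruction, not the PR's CUDA kernel) of the traversal the stack simulates: every primitive leaf contributes hash(leaf) * 31^factor_exp, where factor_exp grows by the number of remaining right siblings at each level, so the accumulated sum equals the recursive 31 * hash + child folding.

    #include <cstdint>
    #include <vector>

    struct Node {
      bool is_list = false;
      uint32_t value = 0;                // a leaf's hash (ints hash to themselves, nulls to 0)
      std::vector<Node> children;
    };

    static uint32_t pow31(int e) {
      uint32_t r = 1;
      while (e-- > 0) r *= 31u;          // wraps mod 2^32, matching Java int arithmetic
      return r;
    }

    uint32_t hive_hash_iterative(Node const& root) {
      struct Frame { Node const* node; int child_idx; int factor_exp; };
      Frame stack[8];                    // fixed maximum depth, as in the PR
      int top = 0;
      stack[top++] = {&root, 0, 0};
      uint32_t hash = 0;
      while (top > 0) {
        Frame& f = stack[top - 1];
        if (!f.node->is_list) {          // primitive leaf: add its weighted contribution
          hash += f.node->value * pow31(f.factor_exp);
          --top;
        } else if (f.child_idx == static_cast<int>(f.node->children.size())) {
          --top;                         // all children visited, pop
        } else {                         // push the next child, depth-first
          int i = f.child_idx++;
          int exp = f.factor_exp + static_cast<int>(f.node->children.size()) - 1 - i;
          stack[top++] = {&f.node->children[i], 0, exp};
        }
      }
      return hash;
    }

    int main() {
      auto leaf = [](uint32_t v) { Node n; n.value = v; return n; };
      auto list = [](std::vector<Node> c) { Node n; n.is_list = true; n.children = std::move(c); return n; };
      // {{1, 0}, null, {2, null}}, modeling null as a zero-hash leaf
      Node row = list({list({leaf(1), leaf(0)}), leaf(0), list({leaf(2), leaf(0)})});
      return hive_hash_iterative(row) == 29853u ? 0 : 1;
    }

The "factor_exp + size - 1 - idx" updates in the quoted PR code in this thread follow the same exponent bookkeeping.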

cudf::column_device_view curr_col = col.slice(row_index, 1);
// column_device_view's default constructor is deleted, so we cannot allocate a
// column_device_view array directly; use a byte array to wrap the StackElement list
constexpr int len_of_8_StackElement = 8 * sizeof(StackElement);


Why is 8 used here?

template <typename T, CUDF_ENABLE_IF(cudf::is_nested<T>())>
__device__ hive_hash_value_t operator()(cudf::column_device_view const& col,
cudf::size_type row_index) const noexcept
{
CUDF_UNREACHABLE("Nested type is not supported");
hive_hash_value_t ret = HIVE_INIT_HASH;


Comment on lines 233 to 234
uint8_t stack_wrapper[len_of_8_StackElement];
StackElementPtr stack = reinterpret_cast<StackElementPtr>(stack_wrapper);


I am confused about creating a stack of StackElement this way. Does the following not work in C++?

StackElement stack[8];

Author


I am confused about creating a stack of StackElement this way. Does the following not work in C++?

StackElement stack[8];

Because if I define an array StackElement stack[8]; without initialization, each element of the array is constructed with StackElement's default constructor. However, the default constructor of its member cudf::column_device_view is deleted.


@firestarman firestarman Oct 14, 2024


OK, but you can restore the default constructor. Any reason it has to be deleted?
uint8_t stack_wrapper[len_of_8_StackElement] will also initialize 8 empty StackElements, the same as StackElement stack[8] would.

Collaborator

@res-life res-life Oct 14, 2024


Any reason it has to be deleted?

I do not know why cuDF marks the default constructor as deleted:

column_device_view()                          = delete;

uint8_t stack_wrapper[len_of_8_StackElement] will also initialize 8 empty StackElements

For an array like A_Class[8], the compiler first allocates 8 * sizeof(A_Class) of memory and then calls the default constructor to initialize the 8 instances. uint8_t stack_wrapper[len_of_8_StackElement] skips those default-constructor calls.

Here is a workaround that skips the default-constructor calls:

      uint8_t stack_wrapper[len_of_8_StackElement];
      StackElementPtr stack = reinterpret_cast<StackElementPtr>(stack_wrapper);

@@ -157,7 +157,7 @@ hive_hash_value_t __device__ inline hive_hash_function<cudf::timestamp_us>::oper
* @tparam hash_function Hash functor to use for hashing elements. Must be hive_hash_function.
* @tparam Nullate A cudf::nullate type describing whether to check for nulls.
*/
template <template <typename> class hash_function, typename Nullate>
template <template <typename> class hash_function, typename Nullate, int MAX_NESTED_LEN = 8>

@firestarman firestarman Oct 14, 2024


Personally, I think it is not a good idea to hardcode the nested depth; it will crash when there are more than 8 nested layers.
If 8 is the upper limit, we need to check the depth and throw an exception (similar to what check_hash_compatibility does) before running into this row hasher, and note the limitation in the doc of this function/class.
Another feasible way is to calculate the depth from the schema (you can get the schema from the input table view), allocate a GPU stack (e.g. rmm::device_vector or rmm::device_buffer), and pass the stack into the hive_device_row_hasher.
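
For the check, the maximum nested depth can be computed from the schema alone, without touching row data; a host-side sketch (my illustration; the PR's check_nested_depth walks cudf::column_view children in a similar recursive way):

    #include <algorithm>
    #include <vector>

    struct ColumnType {                       // stand-in for a cudf schema node
      std::vector<ColumnType> children;
    };

    int max_nested_depth(ColumnType const& t) {
      int deepest = 0;
      for (auto const& c : t.children) {
        deepest = std::max(deepest, max_nested_depth(c));
      }
      return 1 + deepest;                     // this node plus its deepest child
    }

    int main() {
      ColumnType leaf;                        // a primitive column has depth 1
      return max_nested_depth(leaf) == 1 ? 0 : 1;
    }

The host code can then compare the result against the supported maximum and throw before launching the kernel, with the plugin falling back on failure.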

Collaborator


Another feasible way is to calculate the depth from the schema (you can get the schema from the input table view) and allocate a GPU stack

get-json-object is also using a constant depth. I remember that allocating a GPU stack impacts performance.
I recommend checking the max depth in the CPU C++ code before calling into this kernel and throwing an exception if it exceeds 8. In the plugin Java code, add fallback logic.


@firestarman firestarman Oct 14, 2024


The GPU stack (actually just GPU memory) is allocated with a fixed size calculated from the schema, just like allocating an output column (which also holds GPU memory inside). So it should have no big impact on perf, I think.

And personally I think this case is different from yours: the max depth here can be inferred from the schema without accessing any actual row, while get-json-object requires row access to get that information.

stack_size--;
} else {
// Push child column to stack
stack[stack_size++] = StackElement(cudf::detail::structs_column_device_view(curr_col).get_sliced_child(element->child_idx), element->factor_exp + curr_col.num_child_columns() - 1 - element->child_idx);
Collaborator

@res-life res-life Oct 14, 2024


Throw an exception if stack_size > 8.


Can we throw an exception in a CUDA kernel?


See the example below; a macro, CUDF_UNREACHABLE, is used to handle this kind of case:

    template <typename T, CUDF_ENABLE_IF(not cudf::column_device_view::has_element_accessor<T>())>
    __device__ hash_value_type operator()(cudf::column_device_view const&,
                                          cudf::size_type,
                                          Nullate const,
                                          hash_value_type const) const noexcept
    {
      CUDF_UNREACHABLE("Unsupported type for xxhash64");
    }

Collaborator


Throw the exception in CPU code before calling into this kernel.

if (element->child_idx == curr_col.size()) {
stack_size--;
} else {
stack[stack_size++] = StackElement(curr_col.slice(element->child_idx, 1), element->factor_exp + curr_col.size() - element->child_idx - 1);
Collaborator

@res-life res-life Oct 14, 2024


Throw the exception in CPU code before calling into this kernel.

@@ -157,7 +157,7 @@ hive_hash_value_t __device__ inline hive_hash_function<cudf::timestamp_us>::oper
* @tparam hash_function Hash functor to use for hashing elements. Must be hive_hash_function.
* @tparam Nullate A cudf::nullate type describing whether to check for nulls.
*/
template <template <typename> class hash_function, typename Nullate>
template <template <typename> class hash_function, typename Nullate, int MAX_NESTED_LEN>
Collaborator


Nit:
Rename MAX_NESTED_LEN to MAX_NESTED_DEPTH

@@ -239,6 +443,10 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
// Return early if there's nothing to hash
if (input.num_columns() == 0 || input.num_rows() == 0) { return output; }

// Nested depth cannot exceed 8
int const max_nested_len = 8;
Collaborator


constexpr is better than int const

* The hash value for the column is computed as:
* hash(S1) = hash(S2) * HIVE_HASH_FACTOR + hash(d)
* = (hash(i) * HIVE_HASH_FACTOR + hash(f)) * HIVE_HASH_FACTOR + hash(d)
* = hash(i) * HIVE_HASH_FACTOR^2 + hash(f) * HIVE_HASH_FACTOR + hash(d)
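
For reference (a note on the math, not from the PR discussion): in wrap-around 32-bit arithmetic the expansion above is an exact identity, since multiplication distributes over addition modulo $2^{32}$:

$\operatorname{hash}(S_1) = \big(h_i \cdot F + h_f\big) \cdot F + h_d \equiv h_i F^2 + h_f F + h_d \pmod{2^{32}}$

The overflow concern raised below presumably refers to C++ signed-integer overflow, which is undefined behavior rather than wrap-around, so the arithmetic must be done in unsigned or otherwise well-defined types for the identity to hold.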
Collaborator


This transformation is not equivalent when considering overflow.

stack[stack_size++] = StackElement(curr_col, 1);

while (stack_size > 0) {
StackElementPtr element = &stack[stack_size - 1];
Collaborator


Nit:
StackElement const& element = stack[stack_size - 1];

Author


Nit: StackElement const& element = stack[stack_size - 1];

But I need to modify the child_idx member of the element.

check_nested_depth_impl(*child, max_nested_len - 1);
}
} else { // Primitive type
CUDF_EXPECTS(max_nested_len > 0, "The nested depth of the input table exceeds the maximum supported depth.");
Collaborator


Nit:
Add max_nested_len to the error message to report more details.

}

@Test
void testHiveHashNestedType() {

@firestarman firestarman Oct 17, 2024


Better to have a test verifying the case where the nested depth is larger than 8.
Tests for the list-of-struct type are also missing.


@firestarman firestarman left a comment


Looks Good to Me

Signed-off-by: ustcfy <[email protected]>
@ustcfy ustcfy marked this pull request as ready for review October 22, 2024 05:37
@ustcfy ustcfy changed the base branch from 0902 to 1010 October 22, 2024 06:46
@wjxiz1992 wjxiz1992 merged commit e3d5769 into wjxiz1992:1010 Oct 23, 2024
1 check passed
@ustcfy ustcfy deleted the hivehash-nested-support branch October 24, 2024 05:59