Improve parquet dictionary encoding #10635
Conversation
Codecov Report
@@            Coverage Diff             @@
##           branch-22.06   #10635      +/-   ##
================================================
+ Coverage        86.15%   86.36%   +0.20%
================================================
  Files              141      140       -1
  Lines            22510    22304     -206
================================================
- Hits             19394    19263     -131
+ Misses            3116     3041      -75
Continue to review full report at Codecov.
rerun tests
uniq_elem_size = [&]() -> size_type {
  if (not is_unique) { return 0; }
  switch (col->physical_type) {
    case Type::INT32: return 4;
    case Type::INT64: return 8;
    case Type::INT96: return 12;
    case Type::FLOAT: return 4;
    case Type::DOUBLE: return 8;
    case Type::BYTE_ARRAY:
      if (data_col.type().id() == type_id::STRING) {
        // Strings are stored as 4 byte length + string bytes
        return 4 + data_col.element<string_view>(val_idx).size_bytes();
      }
    case Type::FIXED_LEN_BYTE_ARRAY:
      if (data_col.type().id() == type_id::DECIMAL128) { return sizeof(__int128_t); }
    default: CUDF_UNREACHABLE("Unsupported type for dictionary encoding");
  }
}
This switch seems redundant with the type_dispatcher. Couldn't the map_insert_fn be made to return the same information and avoid the extra switch?
Specifically, it seems like this could be simplified to:
auto const [is_unique, element_size] = is_valid ? type_dispatcher(...) : {0, 0};
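For illustration, one hypothetical shape for that combined dispatch (a sketch only; the functor name and signature are assumed, not the actual cuDF code):

struct map_insert_fn {
  map_type::device_mutable_view& map;

  // Performs the insert and reports the element size in the same dispatch,
  // so the caller no longer needs a separate switch. Deriving the size from
  // the cudf type T is the naive part of this sketch (see the reply below).
  template <typename T>
  __device__ thrust::pair<bool, size_type> operator()(column_device_view const& col,
                                                      size_type val_idx)
  {
    bool const is_unique = /* existing insert into the hash map */;
    return {is_unique, is_unique ? static_cast<size_type>(sizeof(T)) : 0};
  }
};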
These are parquet types, not cudf types.
I assume the parquet type can be derived from the cudf type?
Even better would be to push the is_valid check inside map_insert_fn. Then I'd write this as:
while (val_idx - block_size < end_value_idx) {
thrust::optional<size_type> unique_element_size = type_dispatcher(...);
...
auto const num_unique = block_reduce(reduce_storage).Sum( unique_element_size.has_value() );
__syncthreads();
auto const unique_data_size = block_reduce(reduce_storage).Sum(unique_element_size.value_or(0));
...
}
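A sketch of how map_insert_fn could surface that optional (hypothetical; it assumes validity, the parquet physical type, and an element-size helper are all reachable inside the functor):

// thrust::nullopt means "invalid or already present", so has_value() counts
// unique entries and value_or(0) sums the unique data size, matching the two
// block reductions above.
struct map_insert_fn {
  template <typename T>
  __device__ thrust::optional<size_type> operator()(column_device_view const& col,
                                                    Type physical_type,
                                                    size_type val_idx,
                                                    bool is_valid)
  {
    if (not is_valid) { return thrust::nullopt; }
    bool const is_unique = /* existing insert into the hash map */;
    if (not is_unique) { return thrust::nullopt; }
    return element_size(physical_type, col, val_idx);  // e.g. the switch shown earlier
  }
};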
I assume the parquet type can be derived from the cudf type?

Not trivially. There can be multiple parquet types associated with a single cuDF type, and which one is used is determined by user-passed metadata that gets passed down to here.
If the user wants timestamps to be encoded as INT96 instead of INT64, that might affect the decision to use a dictionary.
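For example, with the public writer API (a sketch; sink and table stand for an existing cudf::io::sink_info and table_view):

#include <cudf/io/parquet.hpp>

// Opting into INT96 timestamps: the parquet physical type now depends on a
// writer option, not just on the cudf column type.
auto const opts =
  cudf::io::parquet_writer_options::builder(sink, table).int96_timestamps(true).build();
cudf::io::write_parquet(opts);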
Don't benchmark using file output; use void output. The I/O time dominates the kernel running time and hides any actual improvements.
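A minimal sketch of such a void sink, assuming the cudf::io::data_sink interface (host_write / flush / bytes_written) and that sink_info accepts a user-provided sink:

#include <cudf/io/data_sink.hpp>
#include <cudf/io/parquet.hpp>

// Discards every byte it receives, so the benchmark measures the encoding
// kernels rather than file I/O.
class void_sink : public cudf::io::data_sink {
 public:
  void host_write(void const* data, size_t size) override { _bytes += size; }
  void flush() override {}
  size_t bytes_written() override { return _bytes; }

 private:
  size_t _bytes = 0;
};

void_sink sink;
auto const opts =
  cudf::io::parquet_writer_options::builder(cudf::io::sink_info{&sink}, table).build();
cudf::io::write_parquet(opts);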
Looks good, thanks!
@gpucibot merge
This PR includes several changes to improve parquet dictionary encoding:
- Add __launch_bounds__ to the dictionary kernels
- Replace the per-value loop with a while loop

Other ideas tested but not eventually included in this PR due to zero or negative performance impact:
- cg::shfl instead of shared memory + sync to broadcast the insert/find results
- A single reduction pass for num_dict_entries and uniq_data_size
- cg::reduce instead of cub::BlockReduce
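As a generic illustration of the first two items (not the PR's actual kernel):

constexpr int block_size = 256;

// __launch_bounds__ caps threads per block so the compiler can budget
// registers for higher occupancy; the while loop keeps every thread striding
// until the whole range is consumed.
__global__ void __launch_bounds__(block_size)
count_positive(int const* data, int n, int* out)
{
  int idx = blockIdx.x * block_size + threadIdx.x;
  int local = 0;
  while (idx < n) {
    local += (data[idx] > 0);
    idx += gridDim.x * block_size;
  }
  atomicAdd(out, local);
}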
Before: [benchmark results]
Now: [benchmark results]