Improve parquet dictionary encoding #10635

PointKernel · 2022-04-11T22:00:27Z

This PR includes several changes to improve parquet dictionary encoding:

API cleanups: get rid of unused arguments
Remove min block limit in __launch_bounds__
Simplify the grid-stride loop logic by using while
All threads calculate start/end indices instead of one doing the calculation and broadcasting the result (no more shared memory or block-wide sync).

Other ideas tested but not eventually included in this PR due to zero or negative performance impact:

Tuning hash map occupancy
cg::shfl instead of shared memory + sync
CG based insert/find
Relaxed atomic for num_dict_entries and uniq_data_size
cg::reduce instead of cub::BlockReduce

Before:

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------
ParquetWrite/integral_void_output/29/0/1/1/2/manual_time            734 ms          734 ms            1 bytes_per_second=697.128M/s encoded_file_size=530.706M peak_memory_usage=1.7804G
ParquetWrite/integral_void_output/29/1000/1/1/2/manual_time         303 ms          303 ms            2 bytes_per_second=1.65131G/s encoded_file_size=397.998M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/32/1/2/manual_time           734 ms          734 ms            1 bytes_per_second=697.713M/s encoded_file_size=530.706M peak_memory_usage=1.7804G
ParquetWrite/integral_void_output/29/1000/32/1/2/manual_time       61.9 ms         61.9 ms           11 bytes_per_second=8.07721G/s encoded_file_size=159.574M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/1/0/2/manual_time            690 ms          690 ms            1 bytes_per_second=742.205M/s encoded_file_size=531.066M peak_memory_usage=1.3148G
ParquetWrite/integral_void_output/29/1000/1/0/2/manual_time         282 ms          282 ms            2 bytes_per_second=1.76991G/s encoded_file_size=398.712M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/32/0/2/manual_time           690 ms          690 ms            1 bytes_per_second=742.268M/s encoded_file_size=531.066M peak_memory_usage=1.3148G
ParquetWrite/integral_void_output/29/1000/32/0/2/manual_time       59.5 ms         59.5 ms           12 bytes_per_second=8.40878G/s encoded_file_size=199.926M peak_memory_usage=1.49675G

Now:

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------
ParquetWrite/integral_void_output/29/0/1/1/2/manual_time            733 ms          733 ms            1 bytes_per_second=698.24M/s encoded_file_size=530.706M peak_memory_usage=1.7804G
ParquetWrite/integral_void_output/29/1000/1/1/2/manual_time         302 ms          302 ms            2 bytes_per_second=1.65496G/s encoded_file_size=397.998M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/32/1/2/manual_time           733 ms          733 ms            1 bytes_per_second=698.701M/s encoded_file_size=530.706M peak_memory_usage=1.7804G
ParquetWrite/integral_void_output/29/1000/32/1/2/manual_time       61.3 ms         61.3 ms           11 bytes_per_second=8.1533G/s encoded_file_size=159.572M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/1/0/2/manual_time            688 ms          688 ms            1 bytes_per_second=743.71M/s encoded_file_size=531.066M peak_memory_usage=1.3148G
ParquetWrite/integral_void_output/29/1000/1/0/2/manual_time         282 ms          282 ms            2 bytes_per_second=1.7712G/s encoded_file_size=398.712M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/32/0/2/manual_time           688 ms          688 ms            1 bytes_per_second=743.658M/s encoded_file_size=531.066M peak_memory_usage=1.3148G
ParquetWrite/integral_void_output/29/1000/32/0/2/manual_time       58.9 ms         58.9 ms           12 bytes_per_second=8.49093G/s encoded_file_size=199.926M peak_memory_usage=1.49675G

codecov · 2022-04-11T23:12:21Z

Codecov Report

Merging #10635 (48804f4) into branch-22.06 (b50ae82) will increase coverage by 0.20%.
The diff coverage is n/a.

❗ Current head 48804f4 differs from pull request most recent head 7e5d0bb. Consider uploading reports for the commit 7e5d0bb to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10635      +/-   ##
================================================
+ Coverage         86.15%   86.36%   +0.20%     
================================================
  Files               141      140       -1     
  Lines             22510    22304     -206     
================================================
- Hits              19394    19263     -131     
+ Misses             3116     3041      -75

Impacted Files	Coverage Δ
python/cudf/cudf/core/mixins/binops.py	`90.00% <0.00%> (-10.00%)`	⬇️
python/cudf/cudf/testing/testing.py	`81.69% <0.00%> (-2.82%)`	⬇️
python/cudf/cudf/core/indexed_frame.py	`91.77% <0.00%> (-1.21%)`	⬇️
python/cudf/cudf/core/udf/typing.py	`96.72% <0.00%> (-0.82%)`	⬇️
python/cudf/cudf/testing/_utils.py	`93.85% <0.00%> (-0.56%)`	⬇️
python/cudf/cudf/core/_base_index.py	`85.45% <0.00%> (-0.47%)`	⬇️
python/cudf/cudf/core/single_column_frame.py	`96.52% <0.00%> (-0.33%)`	⬇️
python/cudf/cudf/__init__.py	`90.47% <0.00%> (-0.23%)`	⬇️
python/cudf/cudf/core/column/decimal.py	`91.09% <0.00%> (-0.21%)`	⬇️
python/cudf/cudf/io/orc.py	`92.77% <0.00%> (-0.09%)`	⬇️
... and 47 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 018924f...7e5d0bb. Read the comment docs.

PointKernel · 2022-04-12T20:20:02Z

rerun tests

cpp/src/io/parquet/chunk_dict.cu

jrhemstad · 2022-04-12T21:19:57Z

cpp/src/io/parquet/chunk_dict.cu

+      uniq_elem_size = [&]() -> size_type {
+        if (not is_unique) { return 0; }
+        switch (col->physical_type) {
+          case Type::INT32: return 4;
+          case Type::INT64: return 8;
+          case Type::INT96: return 12;
+          case Type::FLOAT: return 4;
+          case Type::DOUBLE: return 8;
+          case Type::BYTE_ARRAY:
+            if (data_col.type().id() == type_id::STRING) {
+              // Strings are stored as 4 byte length + string bytes
+              return 4 + data_col.element<string_view>(val_idx).size_bytes();
+            }
+          case Type::FIXED_LEN_BYTE_ARRAY:
+            if (data_col.type().id() == type_id::DECIMAL128) { return sizeof(__int128_t); }
+          default: CUDF_UNREACHABLE("Unsupported type for dictionary encoding");
+        }


This switch seems redundant with the type_dispatcher. Couldn't the map_insert_fn be made to return the same information and avoid the extra switch?

Specifically, it seems like this could be simplified to:

auto const [is_unique, element_size] = is_valid ? type_dispatcher(...) : {0, 0};

These are parquet types, not cudf types.

I assume the parquet type can be derived the cudf type?

Even better would be to push the is_valid check inside map_insert_fn. Then I'd write this as:

while (val_idx - block_size < end_value_idx) { thrust::optional<size_type> unique_element_size = type_dispatcher(...); ... auto const num_unique = block_reduce(reduce_storage).Sum( unique_element_size.has_value() ); __syncthreads(); auto const unique_data_size = block_reduce(reduce_storage).Sum(unique_element_size.value_or(0)); ... }

I assume the parquet type can be derived the cudf type?

Not trivially. There can be multiple parquet types associated with a cuDF type and the decision to use which one is determined by a user passed metadata which gets to here.

If the user wants time stamp to be encoded with int96 instead of int64 then that might affect the decision to use dictionary.

cpp/src/io/parquet/chunk_dict.cu

devavret · 2022-04-13T10:54:11Z

Don't benchmark using file output. Use void output. The IO time dominates the kernel running time and hides any actual improvements

cpp/src/io/parquet/chunk_dict.cu

hyperbolic2346

Looks good, thanks!

PointKernel · 2022-04-20T15:07:26Z

@gpucibot merge

PointKernel added 5 commits April 8, 2022 14:38

Remove unused arguments

979365f

Remove redundant find

53ee307

size_type instead of size_t

e45bd6b

Use relaxed memory order

c824f4b

Minor cleanups

da2718d

PointKernel added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue tech debt improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 11, 2022

PointKernel self-assigned this Apr 11, 2022

PointKernel added 3 commits April 12, 2022 10:58

Use CG logical to get rid of block-wide sync

e43eccc

Use while logical

68c4520

Revert back to cub reduce

def1c51

PointKernel added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Apr 12, 2022

PointKernel marked this pull request as ready for review April 12, 2022 18:11

PointKernel requested a review from a team as a code owner April 12, 2022 18:11

PointKernel requested review from cwharris, ttnghia and devavret April 12, 2022 18:11