Optimization to decoding of parquet level streams #13203
Conversation
…, it was only 1 warp wide. Now it is block-wide. Only integrated into the gpuComputePageSizes() kernel. gpuDecodePages() will be a followup PR.
Bunch of minor comments, still need to fully understand the core algorithm.
…al with a performance issue introduced in gpuDecodePageData by previously changing them to be pointers instead of hardcoded arrays.
… buffer size from 4096 to 2048. Global scratch memory cost per page is now 8k instead of 32k. This will likely need to be tuned further as this optimization gets applied to the decode kernel.
…incorrectly sized (benign) run_buffer_size constexpr.
… cases, we only need 1 byte to store level information since size of the values is proportional to nesting depth.
std::review << few_small_comments << std::flush
This is good stuff! I've been using it in my own branch and it's really made a big difference. Looking forward to the sequel 😄.
A few const nits, but I can't see any other issues (and have not run into any in testing). My only question is the choice of 512 for `num_rle_stream_decode_threads`. Is there an existing corpus for tuning this value? In my own testing it has been the best overall, but there have been cases where a smaller value is optimal. I'm also curious what this choice means for Spark, where they'll have multiple decodes running concurrently, IIRC.
A first pass, still digesting the changes. 🔥
I've done benchmarking against some internal queries we use. Specifically, a case where we have 4 CPU threads running parquet jobs at the same time on the GPU. This has traditionally been sensitive to occupancy issues, so I kept an eye on that. I do suspect there will be tuning in the future: in particular because I think a useful post-optimization will be to balance out uneven run sizes across the decode warps - that'll certainly affect the # of useful warps.
Found a weird edge case, but otherwise good to go as far as I can see.
…ce condition when computing the number of skipped values during the preprocess step.
Looks great! 🔥
Thanks @nvdbaranec
Looks great, just a few questions/suggestions.
The current Parquet reader decodes string data into a list of {ptr, length} tuples, which are then used in a gather step by `make_strings_column`. This gather step can be time consuming, especially when there are a large number of string columns. This PR addresses this by changing the decode step to write char and offset data directly to the `column_buffer`, which can then be used directly, bypassing the gather step.

The image below compares the new approach to the old. The green arc at the top (82ms) is `gpuDecodePageData`, and the red arc (252ms) is the time spent in `make_strings_column`. The green arc below (25ms) is `gpuDecodePageData`, the amber arc (22ms) is a new kernel that computes string sizes for each page, and the magenta arc (106ms) is the kernel that decodes string columns.

![flat_edited](https://user-images.githubusercontent.com/25541553/236529570-f2d0d8d4-b2b5-4078-93ae-5123fa489c3c.png)

NVbench shows a good speedup for strings as well. There is a jump in time for the INTEGRAL benchmark, but little to no change for other data types. The INTEGRAL time seems to be affected by extra time spent in `malloc` allocating host memory for a `hostdevice_vector`. This `malloc` always occurs, but for some reason in this branch it takes much longer to return. This is comparing to @nvdbaranec's branch for #13203.
```
| data_type | io            | cardinality | run_length | Ref Time  | Cmp Time  | Diff         | %Diff   |
|-----------|---------------|-------------|------------|-----------|-----------|--------------|---------|
| INTEGRAL  | DEVICE_BUFFER | 0           | 1          | 14.288 ms | 14.729 ms | 440.423 us   | 3.08%   |
| INTEGRAL  | DEVICE_BUFFER | 1000        | 1          | 13.397 ms | 13.997 ms | 600.596 us   | 4.48%   |
| INTEGRAL  | DEVICE_BUFFER | 0           | 32         | 11.831 ms | 12.354 ms | 522.485 us   | 4.42%   |
| INTEGRAL  | DEVICE_BUFFER | 1000        | 32         | 11.335 ms | 11.854 ms | 518.791 us   | 4.58%   |
| FLOAT     | DEVICE_BUFFER | 0           | 1          | 8.681 ms  | 8.715 ms  | 34.846 us    | 0.40%   |
| FLOAT     | DEVICE_BUFFER | 1000        | 1          | 8.473 ms  | 8.472 ms  | -0.680 us    | -0.01%  |
| FLOAT     | DEVICE_BUFFER | 0           | 32         | 7.217 ms  | 7.192 ms  | -25.311 us   | -0.35%  |
| FLOAT     | DEVICE_BUFFER | 1000        | 32         | 7.425 ms  | 7.422 ms  | -3.162 us    | -0.04%  |
| STRING    | DEVICE_BUFFER | 0           | 1          | 50.079 ms | 42.566 ms | -7513.004 us | -15.00% |
| STRING    | DEVICE_BUFFER | 1000        | 1          | 16.813 ms | 14.989 ms | -1823.660 us | -10.85% |
| STRING    | DEVICE_BUFFER | 0           | 32         | 49.875 ms | 42.443 ms | -7432.718 us | -14.90% |
| STRING    | DEVICE_BUFFER | 1000        | 32         | 15.312 ms | 13.953 ms | -1358.910 us | -8.87%  |
| LIST      | DEVICE_BUFFER | 0           | 1          | 80.303 ms | 80.688 ms | 385.916 us   | 0.48%   |
| LIST      | DEVICE_BUFFER | 1000        | 1          | 71.921 ms | 72.356 ms | 435.153 us   | 0.61%   |
| LIST      | DEVICE_BUFFER | 0           | 32         | 61.658 ms | 62.129 ms | 471.022 us   | 0.76%   |
| LIST      | DEVICE_BUFFER | 1000        | 32         | 63.086 ms | 63.371 ms | 285.608 us   | 0.45%   |
| STRUCT    | DEVICE_BUFFER | 0           | 1          | 66.272 ms | 61.142 ms | -5130.639 us | -7.74%  |
| STRUCT    | DEVICE_BUFFER | 1000        | 1          | 40.217 ms | 39.328 ms | -888.781 us  | -2.21%  |
| STRUCT    | DEVICE_BUFFER | 0           | 32         | 63.660 ms | 58.837 ms | -4822.647 us | -7.58%  |
| STRUCT    | DEVICE_BUFFER | 1000        | 32         | 38.080 ms | 37.104 ms | -976.133 us  | -2.56%  |
```

May address #13024
~Depends on #13203~

Authors:
- Ed Seidl (https://github.com/etseidl)
- https://github.com/nvdbaranec
- Vukasin Milovanovic (https://github.com/vuule)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Mike Wilson (https://github.com/hyperbolic2346)
- https://github.com/nvdbaranec
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #13302
An optimization to the decoding of the definition and repetition level streams in Parquet files. Previously, we decoded these streams using 1 warp. With this optimization we do it arbitrarily wide (currently set to 512 threads). This gives a dramatic improvement.

The core of the work is in the new file `rle_stream.cuh`, which encapsulates the decoding into an `rle_stream` object. This PR only applies the optimization to the `gpuComputePageSizes` kernel, used for preprocessing list columns and for the chunked read case involving strings or lists. In addition, the `UpdatePageSizes` function has been improved to also work at the block level instead of just using a single warp. Testing with the cudf parquet reader list benchmarks results in as much as a 75% reduction in time in the `gpuComputePageSizes` kernel. Future PRs will apply this to the `gpuDecodePageData` kernel.

Leaving as a draft for the moment - more detailed benchmarks and numbers forthcoming, along with some possible parameter tuning.

Benchmark info: a before/after sample from the `parquet_reader_io_compression` suite on an A5000. The kernel goes from 427 milliseconds to 93 milliseconds. This seems to be a pretty typical situation, although it will definitely be affected by the encoded data (run lengths, etc). The reader benchmarks that involve this kernel yield some great improvements.