[BUG] Address performance hotspot in Parquet decode of "map" strings data #15297
Comments
I was kind of worried about this happening some day... the data is all over the place in terms of string lengths: very short mixed with extremely long. One thing to try is recalculating the average string length for every 32 rows and switching between the string-parallel and char-parallel decoders within a page. That, or just use the max string length rather than the average, since a single long string would dominate. Or maybe a ballot to see how many strings in a warp exceed a threshold.
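For illustration, here is a minimal sketch of the ballot idea, assuming a hypothetical length threshold and vote count (this is not libcudf code; names and parameters are made up): each lane in the warp votes on whether its string exceeds the threshold, and the warp switches to the char-parallel decoder when enough lanes vote yes.

```cpp
// Hypothetical sketch of the warp-ballot heuristic; the threshold and vote
// count are assumed parameters, not values from the cuDF Parquet decoder.
__device__ inline bool warp_prefers_char_parallel(int my_string_len,
                                                  int len_threshold  = 64,
                                                  int min_long_votes = 4)
{
  // Each of the 32 lanes votes on whether its string is "long".
  unsigned votes      = __ballot_sync(0xffffffffu, my_string_len > len_threshold);
  int long_string_cnt = __popc(votes);  // number of lanes holding a long string
  return long_string_cnt >= min_long_votes;
}
```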
Thank you @etseidl for sharing these suggestions. What is the performance impact of always choosing the character-parallel decoder? (code pointer) I'm surprised that we maintained the string-parallel code path. As a starting point, I'm interested in the idea of selecting the decoder based on the max string length instead of the average string length.
Using the char-parallel approach on short strings is bad because then you wind up with many threads doing nothing. Obviously string-per-thread is bad for long strings. I've already tested a few different approaches on the file you attached. The best time so far has been doing avg string length per batch.

Now that #15159 is merged, we can also look at a block-wide approach to the string decoding.
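As a sketch of the per-batch selection mentioned above (the batch size and cutoff are assumptions, not the values used in those experiments), the decoder choice could be driven by the average length of the strings in each batch:

```cpp
// Hypothetical sketch: pick the decoder for one batch of strings based on the
// batch's average string length. Batch size and cutoff are assumed values.
__device__ inline bool batch_prefers_char_parallel(int const* string_lengths,
                                                   int batch_size,        // e.g. 32
                                                   int avg_len_cutoff = 32)
{
  int total_len = 0;
  for (int i = 0; i < batch_size; ++i) { total_len += string_lengths[i]; }
  // Long strings on average -> char-parallel; otherwise string-per-thread.
  return (total_len / batch_size) > avg_len_cutoff;
}
```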
See #15297. The Parquet string decoder can become a bottleneck in the presence of strings of widely varying sizes. This PR is an attempt to address this, at least as a stop-gap solution. A more complete solution may be to rework the string decoder to work in a block-wide fashion, such as the new micro-kernels added in #15159.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #15304
I'm closing this based on the improvement from #15304. We may choose to revisit long strings performance in Parquet decode with a microkernel approach. Before we can kick off more work in this area, we need a more sophisticated long strings performance benchmarking approach.
I don't think the micro-kernel approach really attacks the issue at hand here. The micro-kernels method is just a way of getting rid of excess baggage for things we know don't need to happen in a particular type of data page. This is more of a fundamental, classic issue: parallelizing work of variable sizes. It does make sense to micro-kernel this, but only because it makes sense to micro-kernel everything for the underlying speedups; it won't address the core algorithmic challenges here.
Moves Parquet string decoding from its stand-alone kernel to the templated generic kernel. To optimize performance, the scheme for copying values to the output has changed. The details of this scheme are in `gpuDecodeString()`, but briefly: the block size is 128 threads, and the threads in the block share the copying work such that each thread copies (up to) 4 bytes per memcpy (this showed the best performance). So, for a given batch of strings, the longer the average string size, the more threads work together to copy each string. We cap this at 32 threads per string (a whole warp) for strings longer than 64 bytes (at length 65, 16 threads would each have to copy 5 chars). For short strings we use a minimum of 4 threads per string, which results in at most 32 simultaneous string copies. We can't go beyond 32 simultaneous copies because performance decreases, presumably because the cache line size is 128 bytes on a cache hit and, with so many threads running across the blocks, we run out of room in the cache.

### Benchmark Results (Gaussian-distributed string lengths)

* No dictionary, lengths 0 - 32: no difference.
* No dictionary, larger lengths (32 - 64, 16 - 80, 64 - 128, etc.): 10% - 20% faster.
* Dictionary, cardinality 0: 0% - 15% faster.
* Dictionary, cardinality 1000, lengths 0 - 32: 30% - 35% faster.
* Dictionary, cardinality 1000, larger lengths (32 - 64, 16 - 80, 64 - 128, etc.): 50% - 60% faster.
* Selected customer data: 5% faster.

These performance improvements also hold for [this previous long-string performance issue](#15297). The primary source of the improvement is having all 128 threads in the block help copy the strings, whereas before we were only using one warp to do the copy (due to the caching issues). The non-dictionary and zero-cardinality results are limited because we are bound by the time needed to copy the string data from global memory. For cardinality-1000 dictionary data, the requested strings are often still in the cache, so the full benefit of the better thread utilization can be realized.

Authors:
- Paul Mattione (https://github.com/pmattione-nvidia)

Approvers:
- https://github.com/nvdbaranec
- Vukasin Milovanovic (https://github.com/vuule)

URL: #17286
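A minimal sketch of the thread-assignment rule described above, assuming the per-string thread count is rounded to a power of two so the 128-thread block splits evenly into groups (the function name and exact cutoffs are assumptions; this is not the actual `gpuDecodeString()` code):

```cpp
// Hypothetical sketch: choose how many threads of a 128-thread block cooperate
// on one string so that each thread copies at most ~4 bytes per memcpy.
__device__ inline int threads_per_string(int avg_string_len)
{
  if (avg_string_len <= 16) { return 4; }   // 4 threads: up to 32 strings copied at once
  if (avg_string_len <= 32) { return 8; }   // 8 threads x ~4 bytes each
  if (avg_string_len <= 64) { return 16; }  // 16 threads x ~4 bytes each
  return 32;                                // cap at a full warp for longer strings
}
```

Under this assumed scheme, the number of simultaneous string copies is 128 divided by the returned group size, which is what keeps the count at or below the 32-copy limit mentioned in the description.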
Describe the bug
The Parquet reader includes several specialized kernels that work together to decode strings data and generate columns (`gpuComputeStringPageBounds`, `gpuComputePageStringSizes`, and `gpuDecodeStringPageData`). Generally these kernels process strings data efficiently; however, we found an example where a "map" column formatted as a `LIST<STRUCT<string,string>>` column shows poor decoding throughput in the Parquet reader.

Steps/Code to reproduce bug

I'm attaching `slice1.pq`, a small file that shows a long kernel runtime for `gpuDecodeStringPageData`. The file is 360 KB, so presumably it fits on only a few data pages. Since we use block-per-page processing in `gpuDecodeStringPageData`, we expect poor block utilization on a file like this. However, there appears to be a thread-per-row or thread-per-string step that assigns a few threads 112 ms of decode work. As a result, we observe an unusually low ~35 MB/s throughput (roughly 4 MB / 112 ms) for decoding this 4 MB column.

slice1.zip
Expected behavior
Based on benchmarks reported in #13302, we would expect strings data decoding to show throughput closer to 10-30 GB/s.