[BUG] Address performance hotspot in Parquet decode of "map" strings data #15297
Comments
I was kind of worried about this happening some day... the data is all over the place in terms of string lengths: very short mixed with extremely long. One thing to try is recalculating the average string length for every 32 rows and switching between the string-parallel and char-parallel decoders within a page. That, or just use the max string length rather than the average, since a single long string would dominate. Or maybe a ballot to see how many strings in a warp exceed a threshold.
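For illustration, here is a minimal sketch of the ballot idea, assuming a hypothetical length threshold and vote count (this is not libcudf code; names and parameters are made up): each lane in the warp votes on whether its string exceeds the threshold, and the warp switches to the char-parallel decoder when enough lanes vote yes.

```cpp
// Hypothetical sketch of the warp-ballot heuristic; the threshold and vote
// count are assumed parameters, not values from the cuDF Parquet decoder.
__device__ inline bool warp_prefers_char_parallel(int my_string_len,
                                                  int len_threshold  = 64,
                                                  int min_long_votes = 4)
{
  // Each of the 32 lanes votes on whether its string is "long".
  unsigned votes      = __ballot_sync(0xffffffffu, my_string_len > len_threshold);
  int long_string_cnt = __popc(votes);  // number of lanes holding a long string
  return long_string_cnt >= min_long_votes;
}
```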
Thank you @etseidl for sharing these suggestions. What is the performance impact of always choosing the character-parallel decoder? (code pointer) I'm surprised that we maintained the string-parallel code path. As a starting point, I'm interested in the idea of selecting the decoder based on the max string length instead of the average string length.
Using the char-parallel approach on short strings is bad because then you wind up with many threads doing nothing. Obviously string-per-thread is bad for long strings. I've already tested a few different approaches on the file you attached. The best time so far has been doing avg string length per batch.

Now that #15159 is merged, we can also look at a block-wide approach to the string decoding.
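As a sketch of the per-batch selection mentioned above (the batch size and cutoff are assumptions, not the values used in those experiments), the decoder choice could be driven by the average length of the strings in each batch:

```cpp
// Hypothetical sketch: pick the decoder for one batch of strings based on the
// batch's average string length. Batch size and cutoff are assumed values.
__device__ inline bool batch_prefers_char_parallel(int const* string_lengths,
                                                   int batch_size,        // e.g. 32
                                                   int avg_len_cutoff = 32)
{
  int total_len = 0;
  for (int i = 0; i < batch_size; ++i) { total_len += string_lengths[i]; }
  // Long strings on average -> char-parallel; otherwise string-per-thread.
  return (total_len / batch_size) > avg_len_cutoff;
}
```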
See #15297. The Parquet string decoder can become a bottleneck in the presence of strings of widely varying sizes. This PR is an attempt to address this, at least as a stop-gap solution. A more complete solution may be to rework the string decoder to work in a block-wide fashion, such as the new micro-kernels added in #15159.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #15304
I'm closing this based on the improvement from #15304. We may choose to revisit long strings performance in Parquet decode with a microkernel approach. Before we can kick off more work in this area, we need a more sophisticated long strings performance benchmarking approach.
I don't think the micro-kernel approach really attacks the issue at hand here. The micro-kernels method is just a way of getting rid of excess baggage for things we know don't need to happen in a particular type of data page. This is more of a fundamental, classic issue: parallelizing work of variable sizes. It does make sense to micro-kernel this, but only because it makes sense to micro-kernel everything for the underlying speedups; it won't address the core algorithmic challenges here.
Moves Parquet string decoding from its stand-alone kernel to the templated generic kernel. To optimize performance, the scheme for copying values to the output has changed. The details of this scheme are in `gpuDecodeString()`, but briefly: the block size is 128 threads, and the threads in the block share the copying work such that each thread copies (up to) 4 bytes per memcpy (this showed the best performance). So, for a given batch of strings, the longer the average string size, the more threads work together to copy each string. We cap this at 32 threads per string (a whole warp) for strings longer than 64 bytes (at length 65, 16 threads would each have to copy 5 chars). For short strings we use a minimum of 4 threads per string, which results in at most 32 simultaneous string copies. We can't go beyond 32 simultaneous copies because performance decreases, presumably because the cache line size is 128 bytes on a cache hit and, with so many threads running across the blocks, we run out of room in the cache.

### Benchmark Results (Gaussian-distributed string lengths)

* No dictionary, lengths 0 - 32: no difference.
* No dictionary, larger lengths (32 - 64, 16 - 80, 64 - 128, etc.): 10% - 20% faster.
* Dictionary, cardinality 0: 0% - 15% faster.
* Dictionary, cardinality 1000, lengths 0 - 32: 30% - 35% faster.
* Dictionary, cardinality 1000, larger lengths (32 - 64, 16 - 80, 64 - 128, etc.): 50% - 60% faster.
* Selected customer data: 5% faster.

These performance improvements also hold for [this previous long-string performance issue](#15297). The primary source of the improvement is having all 128 threads in the block help copy the strings, whereas before we were only using one warp to do the copy (due to the caching issues). The non-dictionary and zero-cardinality results are limited because we are bound by the time needed to copy the string data from global memory. For cardinality-1000 dictionary data, the requested strings are often still in the cache, so the full benefit of the better thread utilization can be realized.

Authors:
- Paul Mattione (https://github.com/pmattione-nvidia)

Approvers:
- https://github.com/nvdbaranec
- Vukasin Milovanovic (https://github.com/vuule)

URL: #17286
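A minimal sketch of the thread-assignment rule described above, assuming the per-string thread count is rounded to a power of two so the 128-thread block splits evenly into groups (the function name and exact cutoffs are assumptions; this is not the actual `gpuDecodeString()` code):

```cpp
// Hypothetical sketch: choose how many threads of a 128-thread block cooperate
// on one string so that each thread copies at most ~4 bytes per memcpy.
__device__ inline int threads_per_string(int avg_string_len)
{
  if (avg_string_len <= 16) { return 4; }   // 4 threads: up to 32 strings copied at once
  if (avg_string_len <= 32) { return 8; }   // 8 threads x ~4 bytes each
  if (avg_string_len <= 64) { return 16; }  // 16 threads x ~4 bytes each
  return 32;                                // cap at a full warp for longer strings
}
```

Under this assumed scheme, the number of simultaneous string copies is 128 divided by the returned group size, which is what keeps the count at or below the 32-copy limit mentioned in the description.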
Describe the bug
The Parquet reader includes several specialized kernels that work together to decode strings data and generate columns (`gpuComputeStringPageBounds`, `gpuComputePageStringSizes`, and `gpuDecodeStringPageData`). Generally these kernels process strings data efficiently; however, we found an example where a "map" column formatted as a `LIST<STRUCT<string,string>>` column shows poor decoding throughput in the Parquet reader.

Steps/Code to reproduce bug

I'm attaching `slice1.pq`, a small file that shows a long kernel runtime for `gpuDecodeStringPageData`. The file is 360 KB, so presumably it fits on only a few data pages. Since we use block-per-page processing in `gpuDecodeStringPageData`, we expect poor block utilization on a file like this. However, there appears to be a thread-per-row or thread-per-string step that assigns a few threads 112 ms of decode work. As a result, we observe an unusually low ~35 MB/s throughput (roughly 4 MB / 112 ms) for decoding this 4 MB column.

slice1.zip
Expected behavior
Based on benchmarks reported in #13302, we would expect strings data decoding to show throughput closer to 10-30 GB/s.