Moves parquet string decoding from its stand-alone kernel to the templated generic kernel. To optimize performance, the scheme for copying values to the output has changed. The details of this scheme are in `gpuDecodeString()`, but briefly: the block size is 128 threads, and the threads in the block share the copying work so that each thread copies up to 4 bytes per memcpy (this showed the best performance). So, for a given batch of strings, the longer the average string is, the more threads work together to copy it. This is capped at 32 threads per string (a whole warp) for strings longer than 64 bytes (at length 65, 16 threads would have to copy 5 chars each). For short strings a minimum of 4 threads per string is used, which results in at most 32 simultaneous string copies. We can't go beyond 32 simultaneous copies because performance decreases. This is presumably because on a cache hit the cache line size is 128 bytes, and with so many threads running across the blocks we run out of room in the cache.

### Benchmark Results (Gaussian-distributed string lengths)

* NO dictionary, lengths 0 - 32: no difference
* NO dictionary, larger lengths (32 - 64, 16 - 80, 64 - 128, etc.): 10% - 20% faster
* Dictionary, cardinality 0: 0% - 15% faster
* Dictionary, cardinality 1000, lengths 0 - 32: 30% - 35% faster
* Dictionary, cardinality 1000, larger lengths (32 - 64, 16 - 80, 64 - 128, etc.): 50% - 60% faster
* Selected customer data: 5% faster

These performance improvements also hold for [this previous long-string performance issue](#15297). The primary source of the improvement is having all 128 threads in the block help copy the strings, whereas before only one warp did the copy (due to the caching issues). The non-dictionary and zero-cardinality results are limited because they are bound by the time needed to copy the string data from global memory. For cardinality-1000 dictionary data, the requested strings are often still in the cache, and the full benefit of the better thread utilization can be realized.

Authors:
- Paul Mattione (https://github.com/pmattione-nvidia)

Approvers:
- https://github.com/nvdbaranec
- Vukasin Milovanovic (https://github.com/vuule)

URL: #17286
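The thread-assignment heuristic described above can be sketched as plain host-side C++. This is a hypothetical illustration, not the actual `gpuDecodeString()` code: it assumes the cooperating thread count is a power of two (so a warp subdivides evenly) and picks the smallest count whose 4-bytes-per-thread capacity covers the average string length, clamped to the stated [4, 32] range.

```cpp
#include <cassert>

// Hypothetical sketch of the per-string thread assignment described in the
// commit message. Each thread copies up to 4 bytes per memcpy; the number of
// threads cooperating on one string is a power of two between 4 and 32.
int threads_per_string(int avg_string_bytes) {
  int threads = 4;  // minimum: with a 128-thread block, at most 32 copies run at once
  // Double the thread count until 4 bytes/thread covers the string,
  // capping at a whole warp (32 threads).
  while (threads < 32 && threads * 4 < avg_string_bytes) { threads *= 2; }
  return threads;
}
```

This reproduces the example from the description: a 64-byte string is covered by 16 threads at 4 bytes each, but a 65-byte string would force 5 bytes per thread, so the count doubles to the full-warp cap of 32.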
Commit 834565a (1 parent: e272f1e). Showing 8 changed files with 505 additions and 376 deletions.