[FEA] Add a Parquet reader benchmark that uses multiple CUDA streams #12700
I'll close this issue for now. It's likely that the PQ reader would be a poor choice for our first benchmarks using multiple streams on a single host thread. The reader functions maintain complex host- and device-side data and use internal syncs, making them hard to reason about compared to simpler libcudf APIs. Update (July 2023): I'm reopening this issue and dropping the "single thread" specification.

Example regression for this benchmark to catch: #14167

We have a partial implementation available here: https://github.com/hyperbolic2346/cudf/tree/mwilson/multithreadparquet
Addresses: #12700

Adds multithreaded benchmarks for the parquet reader, with separate benchmarks for the chunked and non-chunked readers. In both cases, the primary cases are 2, 4 and 8 threads running reads at the same time. There is not a ton of variability in the other benchmarking axes. The primary use of this particular benchmark is to see inter-kernel performance (that is, how well our many different kernel types coexist with each other), whereas normal benchmarks tend to be more for intra-kernel performance checking. NVTX ranges are included to help visually group the bundles of reads together in nsight-sys.

I also posted a new issue which would help along these lines: #15575

Update: I've tweaked some of the numbers to demonstrate some mild performance improvements as we go up in thread count, and included 1 thread as a case. Some examples:

```
## parquet_multithreaded_read_decode_mixed

| cardinality | total_data_size | num_threads | num_cols | bytes_per_second |
|-------------|-----------------|-------------|----------|------------------|
| 1000        | 536870912       | 1           | 4        | 28874731473      |
| 1000        | 1073741824      | 1           | 4        | 30564139526      |
| 1000        | 536870912       | 2           | 4        | 29399214255      |
| 1000        | 1073741824      | 2           | 4        | 31486327920      |
| 1000        | 536870912       | 4           | 4        | 27009769400      |
| 1000        | 1073741824      | 4           | 4        | 32234841632      |
| 1000        | 536870912       | 8           | 4        | 24416650118      |
| 1000        | 1073741824      | 8           | 4        | 30841124677      |
```

```
## parquet_multithreaded_read_decode_chunked_string

| cardinality | total_data_size | num_threads | num_cols | bytes_per_second |
|-------------|-----------------|-------------|----------|------------------|
| 1000        | 536870912       | 1           | 4        | 14637004584      |
| 1000        | 1073741824      | 1           | 4        | 16025843421      |
| 1000        | 536870912       | 2           | 4        | 15333491977      |
| 1000        | 1073741824      | 2           | 4        | 17164197747      |
| 1000        | 536870912       | 4           | 4        | 16556300728      |
| 1000        | 1073741824      | 4           | 4        | 17711338934      |
| 1000        | 536870912       | 8           | 4        | 15788371298      |
| 1000        | 1073741824      | 8           | 4        | 17911649578      |
```

In addition, this benchmark clearly shows multi-thread-only regressions. An example case below uses the pageable-error-code regression we've seen in the past.

Example without regression:

```
## parquet_multithreaded_read_decode_chunked_fixed_width

total_data_size | num_threads | bytes_per_second |
----------------|-------------|------------------|
536870912       | 1           | 25681728660      |
1073741824      | 1           | 26281335927      |
536870912       | 2           | 25597258848      |
1073741824      | 2           | 26733626352      |
536870912       | 4           | 25190211717      |
1073741824      | 4           | 28117411682      |
536870912       | 8           | 25805791994      |
1073741824      | 8           | 27788485204      |
```

Example with regression (pageable error-code return values):

```
## parquet_multithreaded_read_decode_chunked_fixed_width

total_data_size | num_threads | bytes_per_second |
----------------|-------------|------------------|
536870912       | 1           | 25660470283      |
1073741824      | 1           | 26146862480      |
536870912       | 2           | 25040145602      |
1073741824      | 2           | 25460591520      |
536870912       | 4           | 22917046969      |
1073741824      | 4           | 24922624784      |
536870912       | 8           | 20529770200      |
1073741824      | 8           | 23333751767      |
```

In both cases, we can see that the single-thread case remains the same, but there's a regression in the multi-thread case, particularly at 4 and 8 threads.

Authors:
- https://github.com/nvdbaranec
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #15585
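The NVTX grouping mentioned above can be illustrated with a small RAII sketch. The real calls are `nvtxRangePushA`/`nvtxRangePop` from `nvToolsExt.h`; they are stubbed here (recording into a vector) so the sketch is self-contained, and `run_read_bundle` is a hypothetical stand-in for issuing a bundle of parquet reads.

```cpp
#include <string>
#include <vector>

// Stand-ins for the real NVTX calls (nvtxRangePushA / nvtxRangePop).
// Stubbed to record events so the sketch compiles without nvToolsExt.h.
std::vector<std::string> events;
void range_push(char const* name) { events.push_back(std::string("push:") + name); }
void range_pop() { events.push_back("pop"); }

// RAII helper: a range opens on construction and closes on destruction,
// so everything inside its scope is grouped under one label in nsight-sys.
struct scoped_range {
  explicit scoped_range(char const* name) { range_push(name); }
  ~scoped_range() { range_pop(); }
};

// Hypothetical bundle of reads: one outer range for the bundle, one
// nested range per read, so the reads group together visually.
void run_read_bundle(int num_reads)
{
  scoped_range bundle{"parquet multithreaded read bundle"};
  for (int i = 0; i < num_reads; ++i) {
    scoped_range read{"read"};
    // ... issue one parquet read here ...
  }
}
```

The RAII form guarantees the range is popped even on early return, which keeps profiler timelines well-nested.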
Is your feature request related to a problem? Please describe.
Our suite of Parquet reader benchmarks includes a variety of data sources, data types, compression formats, and reader options. However, it does not include a benchmark that uses multiple CUDA streams with multiple host threads to read portions of the same dataset and maximize GPU utilization. The Spark-RAPIDS plugin relies on multi-stream parquet reads from host buffers (using per-thread default stream, PTDS) for the data ingest step into libcudf.
Describe the solution you'd like
We should add a libcudf microbenchmark that creates several host threads, each with its own non-default CUDA stream, and then reads a large parquet dataset from host memory into a libcudf table. Currently we haven't exposed a stream in the public API for the parquet reader, but development of the benchmark can begin by using the read_parquet detail API. We could design this benchmark either to read one file per thread or one row group per thread, whichever is more expedient. After the read step, we might want to add a concatenation step to yield a single table. It might be useful to leverage the same generated data as in the other Parquet reader benchmarks, so we have a performance reference when studying the advantage of multi-thread, multi-stream read times.
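The thread-per-stream structure described above could be sketched roughly as follows. This is a structural sketch only, not the actual benchmark: `read_slice_on_stream` is a hypothetical stub standing in for a call to the parquet reader's detail API on a per-thread non-default stream (e.g. one file or one row group per thread), and the stream creation itself is omitted so the skeleton compiles standalone. Summing the per-thread results stands in for the optional concatenation step into a single table.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for reading one portion of the dataset on a
// dedicated CUDA stream via the read_parquet detail API. In the real
// benchmark each thread would own a non-default stream and this would
// return a table; here it just echoes the byte count it was given.
std::size_t read_slice_on_stream(std::size_t slice_bytes)
{
  return slice_bytes;
}

// Launch one host thread per slice, join them all, then combine the
// per-thread results (analogous to concatenating into a single table).
std::size_t multithreaded_read(std::vector<std::size_t> const& slice_sizes)
{
  std::vector<std::size_t> results(slice_sizes.size());
  std::vector<std::thread> threads;
  for (std::size_t i = 0; i < slice_sizes.size(); ++i) {
    threads.emplace_back([&, i] { results[i] = read_slice_on_stream(slice_sizes[i]); });
  }
  for (auto& t : threads) { t.join(); }
  return std::accumulate(results.begin(), results.end(), std::size_t{0});
}
```

Each thread writes only its own slot in `results`, so no locking is needed; the joins form the synchronization point before the combine step.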
Describe alternatives you've considered
The alternative would be to continue using Spark-RAPIDS NDS runs to track performance of libcudf's parquet reader in a multi-threaded, multi-stream use case.