[FEA] Add a Parquet reader benchmark that uses multiple CUDA streams #12700

Closed
GregoryKimball opened this issue Feb 4, 2023 · 3 comments

GregoryKimball commented Feb 4, 2023

Is your feature request related to a problem? Please describe.
Our suite of Parquet reader benchmarks covers a variety of data sources, data types, compression formats, and reader options. However, it does not include a benchmark that uses multiple CUDA streams with multiple host threads to read portions of the same dataset and maximize GPU utilization. The Spark-RAPIDS plugin relies on multi-stream Parquet reads from host buffers (using per-thread default streams, PTDS) for its data ingest step into libcudf.

Describe the solution you'd like
We should add a libcudf microbenchmark that creates several host threads, each with its own non-default CUDA stream, and then reads a large Parquet dataset from host memory into a libcudf table. We haven't yet exposed a stream in the public API for the Parquet reader, but development of the benchmark can begin by using the read_parquet detail API. We could design this benchmark to read either one file per thread or one row group per thread, whichever is more expedient. After the read step, we might want to add a concatenation step to yield a single table. It might be useful to leverage the same generated data as in the other Parquet reader benchmarks, so that we have a performance reference when studying the advantage of multi-threaded, multi-stream reads.
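
As a rough sketch of the shape this could take, the fragment below has each host thread call the public `read_parquet` on its own host buffer. It assumes a libcudf build with per-thread default streams (PTDS) so each thread's work lands on its own stream; the `read_in_parallel` helper and one-file-per-thread buffer layout are illustrative, not part of libcudf or the eventual benchmark:

```
// Illustrative sketch only, not the actual benchmark: each host thread reads one
// Parquet file from a host buffer through the public API. With a PTDS build,
// each thread's read is issued on its own default stream.
#include <cudf/concatenate.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>
#include <thread>
#include <vector>

std::unique_ptr<cudf::table> read_in_parallel(
  std::vector<std::vector<char>> const& host_buffers)  // one serialized file per thread
{
  std::vector<std::unique_ptr<cudf::table>> tables(host_buffers.size());
  std::vector<std::thread> threads;
  for (std::size_t i = 0; i < host_buffers.size(); ++i) {
    threads.emplace_back([&, i] {
      auto const& buf = host_buffers[i];
      auto opts       = cudf::io::parquet_reader_options::builder(
                    cudf::io::source_info{buf.data(), buf.size()})
                    .build();
      tables[i] = cudf::io::read_parquet(opts).tbl;  // stream comes from PTDS
    });
  }
  for (auto& t : threads) { t.join(); }

  // Optional concatenation step to yield a single table.
  std::vector<cudf::table_view> views;
  views.reserve(tables.size());
  for (auto const& t : tables) { views.push_back(t->view()); }
  return cudf::concatenate(views);
}
```

Once the read_parquet detail API (or a stream-accepting public overload) is used instead, the same structure would take an explicit non-default stream per thread, e.g. drawn from a stream pool, rather than relying on PTDS.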

Describe alternatives you've considered
The alternative would be to continue using Spark-RAPIDS NDS runs to track performance of libcudf's parquet reader in a multi-threaded, multi-stream use case.

GregoryKimball added the feature request, 0 - Backlog, libcudf, and cuIO labels on Feb 4, 2023

GregoryKimball commented Apr 2, 2023

I'll close this issue for now. The Parquet reader is likely a poor choice for our first benchmarks using multiple streams on a single host thread: the reader functions maintain complex host- and device-side data and use internal syncs, which makes them hard to reason about compared to simpler libcudf APIs.

Update (July 2023): I'm reopening this issue and dropping the "single thread" specification.

hyperbolic2346 commented

Example regression for this benchmark to catch: #14167

GregoryKimball commented

We have a partial implementation available here: https://github.com/hyperbolic2346/cudf/tree/mwilson/multithreadparquet

rapids-bot pushed a commit referencing this issue on May 21, 2024
Addresses: #12700

Adds multithreaded benchmarks for the Parquet reader, with separate benchmarks for the chunked and non-chunked readers. In both cases, the primary cases are 2, 4, and 8 threads running reads at the same time. There is not much variability in the other benchmarking axes.
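
For illustration only (this is not the PR's source), the axes above could be declared along these lines with nvbench, which libcudf uses for many of its benchmarks; `parquet_multithreaded_read` and its placeholder body are hypothetical:

```
// Hypothetical sketch: declaring the thread-count and data-size axes with nvbench
// and running a bundle of per-thread reads per sample.
#include <nvbench/nvbench.cuh>

#include <thread>
#include <vector>

static void parquet_multithreaded_read(nvbench::state& state)
{
  auto const num_threads = state.get_int64("num_threads");
  [[maybe_unused]] auto const total_data_size =
    state.get_int64("total_data_size");  // would size the generated Parquet input

  state.exec([&](nvbench::launch&) {
    std::vector<std::thread> threads;
    for (auto i = num_threads; i > 0; --i) {
      threads.emplace_back([] { /* per-thread cudf::io::read_parquet(...) call */ });
    }
    for (auto& t : threads) { t.join(); }
  });
}

NVBENCH_BENCH(parquet_multithreaded_read)
  .set_name("parquet_multithreaded_read_decode_mixed")
  .add_int64_axis("num_threads", {1, 2, 4, 8})
  .add_int64_axis("total_data_size", {536870912, 1073741824});
```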

The primary use of this particular benchmark is to see inter-kernel performance (that is, how well our many different kernel types coexist with each other), whereas normal benchmarks tend to focus on intra-kernel performance.

NVTX ranges are included to help visually group the bundles of reads together in Nsight Systems. I also posted a new issue that would help along these lines: #15575
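
As a rough illustration (again, not the PR's code), wrapping each bundle of reads in an NVTX range makes it show up as one labelled span on the Nsight Systems timeline; `run_read_bundle` is a hypothetical placeholder, and the classic `nvToolsExt.h` C API is assumed:

```
// Rough illustration: an NVTX range around a bundle of per-thread reads groups
// them under one label in the Nsight Systems timeline.
#include <nvToolsExt.h>

void run_read_bundle()  // hypothetical: launches and joins the reader threads
{
  // ... spawn the reader threads and join them, as in the earlier sketches ...
}

int main()
{
  nvtxRangePushA("parquet_multithreaded_read");  // opens the labelled range
  run_read_bundle();
  nvtxRangePop();                                // closes the range
  return 0;
}
```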

Update: I've tweaked some of the numbers to demonstrate some mild performance improvements as thread count goes up, and included a 1-thread case. Some examples:

```
## parquet_multithreaded_read_decode_mixed
| cardinality | total_data_size | num_threads | num_cols | bytes_per_second |
|-------------|-----------------|-------------|----------|------------------|
|        1000 |       536870912 |           1 |        4 |      28874731473 |
|        1000 |      1073741824 |           1 |        4 |      30564139526 |
|        1000 |       536870912 |           2 |        4 |      29399214255 |
|        1000 |      1073741824 |           2 |        4 |      31486327920 |
|        1000 |       536870912 |           4 |        4 |      27009769400 |
|        1000 |      1073741824 |           4 |        4 |      32234841632 |
|        1000 |       536870912 |           8 |        4 |      24416650118 |
|        1000 |      1073741824 |           8 |        4 |      30841124677 |
```

```
## parquet_multithreaded_read_decode_chunked_string
| cardinality | total_data_size | num_threads | num_cols | bytes_per_second |
|-------------|-----------------|-------------|----------|------------------|
|        1000 |       536870912 |           1 |        4 |      14637004584 |
|        1000 |      1073741824 |           1 |        4 |      16025843421 |
|        1000 |       536870912 |           2 |        4 |      15333491977 |
|        1000 |      1073741824 |           2 |        4 |      17164197747 |
|        1000 |       536870912 |           4 |        4 |      16556300728 |
|        1000 |      1073741824 |           4 |        4 |      17711338934 |
|        1000 |       536870912 |           8 |        4 |      15788371298 |
|        1000 |      1073741824 |           8 |        4 |      17911649578 |
```

In addition, this benchmark clearly shows multi-thread-only regressions. The example below uses the pageable error-code regression we've seen in the past.

Example without regression:
```
## parquet_multithreaded_read_decode_chunked_fixed_width
total_data_size | num_threads | bytes_per_second |
----------------|-------------|------------------|
      536870912 |           1 |      25681728660 |
     1073741824 |           1 |      26281335927 |
      536870912 |           2 |      25597258848 |
     1073741824 |           2 |      26733626352 |
      536870912 |           4 |      25190211717 |
     1073741824 |           4 |      28117411682 |
      536870912 |           8 |      25805791994 |
     1073741824 |           8 |      27788485204 |
```

Example with regression (pageable error-code return values):

```
## parquet_multithreaded_read_decode_chunked_fixed_width
total_data_size | num_threads | bytes_per_second |
----------------|-------------|------------------|
      536870912 |           1 |      25660470283 |
     1073741824 |           1 |      26146862480 |
      536870912 |           2 |      25040145602 |
     1073741824 |           2 |      25460591520 |
      536870912 |           4 |      22917046969 |
     1073741824 |           4 |      24922624784 |
      536870912 |           8 |      20529770200 |
     1073741824 |           8 |      23333751767 |
```

For both data sizes, the single-thread case remains the same, but there is a regression in the multi-thread cases, particularly at 4 and 8 threads.

Authors:
  - https://github.com/nvdbaranec
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #15585