[FEA] Add a Parquet reader benchmark that uses multiple CUDA streams #12700

Closed
GregoryKimball opened this issue Feb 4, 2023 · 3 comments

GregoryKimball commented Feb 4, 2023

Is your feature request related to a problem? Please describe.
Our suite of Parquet reader benchmarks covers a variety of data sources, data types, compression formats, and reader options. However, it does not include a benchmark that uses multiple CUDA streams with multiple host threads to read portions of the same dataset and maximize GPU utilization. The Spark-RAPIDS plugin relies on multi-stream Parquet reads from host buffers (using per-thread default streams, PTDS) for its data ingest step into libcudf.

Describe the solution you'd like
We should add a libcudf microbenchmark that creates several host threads, each with its own non-default CUDA stream, and then reads a large Parquet dataset from host memory into a libcudf table. We haven't yet exposed a stream in the public API for the Parquet reader, but development of the benchmark can begin by using the read_parquet detail API. We could design this benchmark to read either one file per thread or one row group per thread, whichever is more expedient. After the read step, we might want to add a concatenation step to yield a single table. It might be useful to leverage the same generated data as in the other Parquet reader benchmarks, so that we have a performance reference when studying the advantage of multi-threaded, multi-stream reads.
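
As a rough sketch of the shape this could take, the fragment below has each host thread call the public `read_parquet` on its own host buffer. It assumes a libcudf build with per-thread default streams (PTDS) so each thread's work lands on its own stream; the `read_in_parallel` helper and one-file-per-thread buffer layout are illustrative, not part of libcudf or the eventual benchmark:

```
// Illustrative sketch only, not the actual benchmark: each host thread reads one
// Parquet file from a host buffer through the public API. With a PTDS build,
// each thread's read is issued on its own default stream.
#include <cudf/concatenate.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>
#include <thread>
#include <vector>

std::unique_ptr<cudf::table> read_in_parallel(
  std::vector<std::vector<char>> const& host_buffers)  // one serialized file per thread
{
  std::vector<std::unique_ptr<cudf::table>> tables(host_buffers.size());
  std::vector<std::thread> threads;
  for (std::size_t i = 0; i < host_buffers.size(); ++i) {
    threads.emplace_back([&, i] {
      auto const& buf = host_buffers[i];
      auto opts       = cudf::io::parquet_reader_options::builder(
                    cudf::io::source_info{buf.data(), buf.size()})
                    .build();
      tables[i] = cudf::io::read_parquet(opts).tbl;  // stream comes from PTDS
    });
  }
  for (auto& t : threads) { t.join(); }

  // Optional concatenation step to yield a single table.
  std::vector<cudf::table_view> views;
  views.reserve(tables.size());
  for (auto const& t : tables) { views.push_back(t->view()); }
  return cudf::concatenate(views);
}
```

Once the read_parquet detail API (or a stream-accepting public overload) is used instead, the same structure would take an explicit non-default stream per thread, e.g. drawn from a stream pool, rather than relying on PTDS.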

Describe alternatives you've considered
The alternative would be to continue using Spark-RAPIDS NDS runs to track performance of libcudf's parquet reader in a multi-threaded, multi-stream use case.

GregoryKimball added the feature request, 0 - Backlog, libcudf, and cuIO labels on Feb 4, 2023

GregoryKimball commented Apr 2, 2023

I'll close this issue for now. The Parquet reader is likely a poor choice for our first benchmarks using multiple streams on a single host thread: the reader functions maintain complex host- and device-side data and use internal syncs, which makes them hard to reason about compared to simpler libcudf APIs.

Update (July 2023): I'm reopening this issue and dropping the "single thread" specification.

hyperbolic2346 commented

Example regression for this benchmark to catch: #14167

GregoryKimball commented

We have a partial implementation available here: https://github.com/hyperbolic2346/cudf/tree/mwilson/multithreadparquet

rapids-bot pushed a commit referencing this issue on May 21, 2024
Addresses: #12700

Adds multithreaded benchmarks for the Parquet reader, with separate benchmarks for the chunked and non-chunked readers. In both cases, the primary cases are 2, 4, and 8 threads running reads at the same time. There is not much variability in the other benchmarking axes.
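
For illustration only (this is not the PR's source), the axes above could be declared along these lines with nvbench, which libcudf uses for many of its benchmarks; `parquet_multithreaded_read` and its placeholder body are hypothetical:

```
// Hypothetical sketch: declaring the thread-count and data-size axes with nvbench
// and running a bundle of per-thread reads per sample.
#include <nvbench/nvbench.cuh>

#include <thread>
#include <vector>

static void parquet_multithreaded_read(nvbench::state& state)
{
  auto const num_threads = state.get_int64("num_threads");
  [[maybe_unused]] auto const total_data_size =
    state.get_int64("total_data_size");  // would size the generated Parquet input

  state.exec([&](nvbench::launch&) {
    std::vector<std::thread> threads;
    for (auto i = num_threads; i > 0; --i) {
      threads.emplace_back([] { /* per-thread cudf::io::read_parquet(...) call */ });
    }
    for (auto& t : threads) { t.join(); }
  });
}

NVBENCH_BENCH(parquet_multithreaded_read)
  .set_name("parquet_multithreaded_read_decode_mixed")
  .add_int64_axis("num_threads", {1, 2, 4, 8})
  .add_int64_axis("total_data_size", {536870912, 1073741824});
```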

The primary use of this particular benchmark is to see inter-kernel performance (that is, how well our many different kernel types coexist with each other), whereas normal benchmarks tend to focus on intra-kernel performance.

NVTX ranges are included to help visually group the bundles of reads together in Nsight Systems. I also posted a new issue that would help along these lines: #15575
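
As a rough illustration (again, not the PR's code), wrapping each bundle of reads in an NVTX range makes it show up as one labelled span on the Nsight Systems timeline; `run_read_bundle` is a hypothetical placeholder, and the classic `nvToolsExt.h` C API is assumed:

```
// Rough illustration: an NVTX range around a bundle of per-thread reads groups
// them under one label in the Nsight Systems timeline.
#include <nvToolsExt.h>

void run_read_bundle()  // hypothetical: launches and joins the reader threads
{
  // ... spawn the reader threads and join them, as in the earlier sketches ...
}

int main()
{
  nvtxRangePushA("parquet_multithreaded_read");  // opens the labelled range
  run_read_bundle();
  nvtxRangePop();                                // closes the range
  return 0;
}
```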

Update: I've tweaked some of the numbers to demonstrate some mild performance improvements as thread count goes up, and included a 1-thread case. Some examples:

```
## parquet_multithreaded_read_decode_mixed
| cardinality | total_data_size | num_threads | num_cols | bytes_per_second |
|-------------|-----------------|-------------|----------|------------------|
|        1000 |       536870912 |           1 |        4 |      28874731473 |
|        1000 |      1073741824 |           1 |        4 |      30564139526 |
|        1000 |       536870912 |           2 |        4 |      29399214255 |
|        1000 |      1073741824 |           2 |        4 |      31486327920 |
|        1000 |       536870912 |           4 |        4 |      27009769400 |
|        1000 |      1073741824 |           4 |        4 |      32234841632 |
|        1000 |       536870912 |           8 |        4 |      24416650118 |
|        1000 |      1073741824 |           8 |        4 |      30841124677 |
```

```
## parquet_multithreaded_read_decode_chunked_string
| cardinality | total_data_size | num_threads | num_cols | bytes_per_second |
|-------------|-----------------|-------------|----------|------------------|
|        1000 |       536870912 |           1 |        4 |      14637004584 |
|        1000 |      1073741824 |           1 |        4 |      16025843421 |
|        1000 |       536870912 |           2 |        4 |      15333491977 |
|        1000 |      1073741824 |           2 |        4 |      17164197747 |
|        1000 |       536870912 |           4 |        4 |      16556300728 |
|        1000 |      1073741824 |           4 |        4 |      17711338934 |
|        1000 |       536870912 |           8 |        4 |      15788371298 |
|        1000 |      1073741824 |           8 |        4 |      17911649578 |
```

In addition, this benchmark clearly shows multi-thread-only regressions. The example below uses the pageable error-code regression we've seen in the past.

Example without regression:
```
## parquet_multithreaded_read_decode_chunked_fixed_width
total_data_size | num_threads | bytes_per_second |
----------------|-------------|------------------|
      536870912 |           1 |      25681728660 |
     1073741824 |           1 |      26281335927 |
      536870912 |           2 |      25597258848 |
     1073741824 |           2 |      26733626352 |
      536870912 |           4 |      25190211717 |
     1073741824 |           4 |      28117411682 |
      536870912 |           8 |      25805791994 |
     1073741824 |           8 |      27788485204 |
```

Example with regression (pageable error-code return values):

```
## parquet_multithreaded_read_decode_chunked_fixed_width
total_data_size | num_threads | bytes_per_second |
----------------|-------------|------------------|
      536870912 |           1 |      25660470283 |
     1073741824 |           1 |      26146862480 |
      536870912 |           2 |      25040145602 |
     1073741824 |           2 |      25460591520 |
      536870912 |           4 |      22917046969 |
     1073741824 |           4 |      24922624784 |
      536870912 |           8 |      20529770200 |
     1073741824 |           8 |      23333751767 |
```

For both data sizes, the single-thread case remains the same, but there is a regression in the multi-thread cases, particularly at 4 and 8 threads.

Authors:
  - https://github.com/nvdbaranec
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #15585