[FEA] Add an ORC reader benchmark that uses multiple CUDA streams #15973

zpuller · 2024-06-11T16:47:28Z

Is your feature request related to a problem? Please describe.
This extends #12700 to cover multi-stream ORC reads, which, like multi-stream Parquet reads, are common in Spark-RAPIDS.

Describe the solution you'd like
Again, similar to #12700: a libcudf microbenchmark that creates several host threads, each with its own non-default CUDA stream, and then reads a large ORC dataset from host memory into a libcudf table using the read_orc detail API.
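
A rough sketch of the threading pattern described above, using only the C++ standard library. `read_slice` and `run_multithreaded_read` are hypothetical names, the integer `stream_id` stands in for a non-default CUDA stream, and the even split of the data across threads is an assumption; a real benchmark would call the libcudf ORC reader where the stand-in returns.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-thread ORC read. A real benchmark would
// invoke the libcudf ORC reader on this thread's CUDA stream; here the
// function only reports how many bytes the thread was asked to read.
std::size_t read_slice(int stream_id, std::size_t bytes)
{
  (void)stream_id;  // unused in this stand-in
  return bytes;
}

// Launch one host thread per (assumed) stream, give each an equal share of
// `total_bytes`, and return the number of bytes each thread handled.
std::vector<std::size_t> run_multithreaded_read(std::size_t total_bytes, int num_threads)
{
  std::vector<std::size_t> bytes_read(static_cast<std::size_t>(num_threads));
  std::vector<std::thread> workers;
  workers.reserve(static_cast<std::size_t>(num_threads));

  std::size_t const per_thread = total_bytes / static_cast<std::size_t>(num_threads);
  for (int t = 0; t < num_threads; ++t) {
    // Each thread writes only to its own slot, so no locking is needed.
    workers.emplace_back([&bytes_read, t, per_thread] {
      bytes_read[static_cast<std::size_t>(t)] = read_slice(/*stream_id=*/t, per_thread);
    });
  }
  for (auto& w : workers) { w.join(); }
  return bytes_read;
}
```

Calling `run_multithreaded_read(536870912, 4)` hands each of the four threads 134217728 bytes.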

Describe alternatives you've considered
The alternative would be to continue using Spark-RAPIDS NDS runs to track the performance of libcudf's ORC reader in a multi-threaded, multi-stream use case.

GregoryKimball · 2024-06-11T19:31:52Z

Thank you @zpuller. This sounds like a good first issue and a straightforward extension of #15585.

zpuller · 2024-06-11T19:40:06Z

I'm planning to work on this btw.

Addresses: #15973

Adds multithreaded benchmarks for the ORC reader. Based off of the parquet equivalent in #15585

```
# Benchmark Results

## orc_multithreaded_read_decode_mixed

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|        1000 |       536870912 |           1 |        4 |          8 |    338x | 44.348 ms | 1.18% | 44.343 ms | 1.18% |      12107185968 |       939.341 MiB |        39.557 MiB |
|        1000 |      1073741824 |           1 |        4 |          8 |     80x | 77.634 ms | 0.65% | 77.629 ms | 0.65% |      13831742649 |         1.834 GiB |        79.072 MiB |
|        1000 |       536870912 |           2 |        4 |          8 |    341x | 43.921 ms | 1.20% | 43.916 ms | 1.20% |      12224889363 |       825.333 MiB |        39.568 MiB |
|        1000 |      1073741824 |           2 |        4 |          8 |     80x | 75.418 ms | 0.70% | 75.414 ms | 0.70% |      14237999015 |         1.611 GiB |        79.113 MiB |
|        1000 |       536870912 |           4 |        4 |          8 |     80x | 42.682 ms | 1.18% | 42.678 ms | 1.18% |      12579566132 |       883.436 MiB |        39.587 MiB |
|        1000 |      1073741824 |           4 |        4 |          8 |      9x | 74.056 ms | 0.48% | 74.052 ms | 0.48% |      14499873867 |         1.724 GiB |        79.136 MiB |
|        1000 |       536870912 |           8 |        4 |          8 |     25x | 42.198 ms | 0.50% | 42.194 ms | 0.49% |      12723960975 |       940.562 MiB |        39.600 MiB |
|        1000 |      1073741824 |           8 |        4 |          8 |      8x | 73.933 ms | 0.49% | 73.929 ms | 0.49% |      14524042443 |         1.781 GiB |        79.175 MiB |

## orc_multithreaded_read_decode_fixed_width

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|        1000 |       536870912 |           1 |        4 |          8 |     13x | 40.149 ms | 0.04% | 40.144 ms | 0.04% |      13373482726 |       643.390 MiB |        59.821 MiB |
|        1000 |      1073741824 |           1 |        4 |          8 |    211x | 71.216 ms | 0.67% | 71.211 ms | 0.67% |      15078297784 |         1.257 GiB |       119.650 MiB |
|        1000 |       536870912 |           2 |        4 |          8 |    378x | 39.662 ms | 1.31% | 39.658 ms | 1.31% |      13537590893 |       643.392 MiB |        59.833 MiB |
|        1000 |      1073741824 |           2 |        4 |          8 |    209x | 71.693 ms | 0.71% | 71.688 ms | 0.71% |      14978085376 |         1.257 GiB |       119.642 MiB |
|        1000 |       536870912 |           4 |        4 |          8 |    377x | 39.731 ms | 1.30% | 39.726 ms | 1.30% |      13514305239 |       643.394 MiB |        59.856 MiB |
|        1000 |      1073741824 |           4 |        4 |          8 |      8x | 70.766 ms | 0.08% | 70.761 ms | 0.08% |      15174115364 |         1.030 GiB |       119.665 MiB |
|        1000 |       536870912 |           8 |        4 |          8 |    379x | 39.486 ms | 1.27% | 39.482 ms | 1.27% |      13597888468 |       647.399 MiB |        59.928 MiB |
|        1000 |      1073741824 |           8 |        4 |          8 |    207x | 72.686 ms | 2.04% | 72.681 ms | 2.04% |      14773317833 |         1.143 GiB |       119.711 MiB |

## orc_multithreaded_read_decode_string

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|        1000 |       536870912 |           1 |        4 |          8 |     80x | 22.933 ms | 2.13% | 22.928 ms | 2.13% |      23415352877 |       661.948 MiB |        10.879 MiB |
|        1000 |      1073741824 |           1 |        4 |          8 |    160x | 34.167 ms | 1.41% | 34.162 ms | 1.41% |      31430436877 |         1.293 GiB |        21.757 MiB |
|        1000 |       536870912 |           2 |        4 |          8 |    560x | 22.533 ms | 2.18% | 22.528 ms | 2.18% |      23830839172 |       609.407 MiB |        10.941 MiB |
|        1000 |      1073741824 |           2 |        4 |          8 |     80x | 34.311 ms | 1.54% | 34.307 ms | 1.54% |      31298288990 |         1.188 GiB |        21.758 MiB |
|        1000 |       536870912 |           4 |        4 |          8 |     23x | 22.179 ms | 0.11% | 22.175 ms | 0.11% |      24211151047 |       624.177 MiB |        10.947 MiB |
|        1000 |      1073741824 |           4 |        4 |          8 |     15x | 33.793 ms | 0.08% | 33.789 ms | 0.08% |      31777989791 |         1.190 GiB |        21.881 MiB |
|        1000 |       536870912 |           8 |        4 |          8 |    679x | 22.006 ms | 1.74% | 22.002 ms | 1.74% |      24401381631 |       624.524 MiB |        10.951 MiB |
|        1000 |      1073741824 |           8 |        4 |          8 |    160x | 33.320 ms | 1.57% | 33.316 ms | 1.57% |      32229227026 |         1.207 GiB |        21.894 MiB |

## orc_multithreaded_read_decode_list

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time   | Noise  | GPU Time   | Noise  | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|------------|--------|------------|--------|------------------|-------------------|-------------------|
|        1000 |       536870912 |           1 |        4 |          8 |     96x |  74.437 ms |  0.68% |  74.433 ms |  0.68% |       7212831148 |       600.751 MiB |        60.245 MiB |
|        1000 |      1073741824 |           1 |        4 |          8 |      7x |  80.994 ms |  0.49% |  80.990 ms |  0.49% |      13257745936 |         1.173 GiB |       120.549 MiB |
|        1000 |       536870912 |           2 |        4 |          8 |     80x |  79.234 ms |  4.57% |  79.229 ms |  4.57% |       6776190522 |       600.950 MiB |        60.250 MiB |
|        1000 |      1073741824 |           2 |        4 |          8 |    166x |  90.437 ms | 17.19% |  90.432 ms | 17.19% |      11873413959 |         1.173 GiB |       120.489 MiB |
|        1000 |       536870912 |           4 |        4 |          8 |     80x |  78.613 ms |  2.98% |  78.608 ms |  2.98% |       6829702014 |       602.764 MiB |        60.323 MiB |
|        1000 |      1073741824 |           4 |        4 |          8 |    127x | 118.629 ms | 22.67% | 118.624 ms | 22.67% |       9051644873 |         1.174 GiB |       120.499 MiB |
|        1000 |       536870912 |           8 |        4 |          8 |    112x | 133.950 ms |  4.45% | 133.945 ms |  4.45% |       4008135293 |       603.471 MiB |        60.353 MiB |
|        1000 |      1073741824 |           8 |        4 |          8 |     90x | 167.850 ms | 15.93% | 167.844 ms | 15.93% |       6397248426 |         1.177 GiB |       120.646 MiB |

## orc_multithreaded_read_decode_chunked_mixed

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|        1000 |       536870912 |           1 |        4 |          8 |   671088640 |    671088640 |    333x | 45.009 ms | 1.10% | 45.005 ms | 1.10% |      11929261073 |       939.341 MiB |        39.557 MiB |
|        1000 |      1073741824 |           1 |        4 |          8 |   671088640 |    671088640 |     96x | 81.524 ms | 0.61% | 81.519 ms | 0.61% |      13171640865 |         1.834 GiB |        79.072 MiB |
|        1000 |       536870912 |           2 |        4 |          8 |   671088640 |    671088640 |    339x | 44.183 ms | 0.96% | 44.179 ms | 0.96% |      12152252271 |       825.333 MiB |        39.568 MiB |
|        1000 |      1073741824 |           2 |        4 |          8 |   671088640 |    671088640 |      7x | 79.051 ms | 0.02% | 79.046 ms | 0.02% |      13583676002 |         1.611 GiB |        79.113 MiB |
|        1000 |       536870912 |           4 |        4 |          8 |   671088640 |    671088640 |     12x | 43.276 ms | 0.09% | 43.272 ms | 0.09% |      12407024794 |       883.436 MiB |        39.587 MiB |
|        1000 |      1073741824 |           4 |        4 |          8 |   671088640 |    671088640 |     19x | 78.019 ms | 0.49% | 78.014 ms | 0.49% |      13763433041 |         1.724 GiB |        79.136 MiB |
|        1000 |       536870912 |           8 |        4 |          8 |   671088640 |    671088640 |     80x | 42.803 ms | 1.22% | 42.799 ms | 1.22% |      12543864010 |       911.993 MiB |        39.600 MiB |
|        1000 |      1073741824 |           8 |        4 |          8 |   671088640 |    671088640 |    193x | 77.856 ms | 0.59% | 77.852 ms | 0.59% |      13792063986 |         1.837 GiB |        79.175 MiB |

## orc_multithreaded_read_decode_chunked_fixed_width

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|        1000 |       536870912 |           1 |        4 |          8 |   671088640 |    671088640 |    112x | 40.497 ms | 1.23% | 40.493 ms | 1.23% |      13258480947 |       643.390 MiB |        59.821 MiB |
|        1000 |      1073741824 |           1 |        4 |          8 |   671088640 |    671088640 |      7x | 75.440 ms | 0.09% | 75.435 ms | 0.09% |      14234033611 |         1.648 GiB |       119.651 MiB |
|        1000 |       536870912 |           2 |        4 |          8 |   671088640 |    671088640 |     80x | 39.793 ms | 1.36% | 39.789 ms | 1.36% |      13493067216 |       643.392 MiB |        59.833 MiB |
|        1000 |      1073741824 |           2 |        4 |          8 |   671088640 |    671088640 |     69x | 74.499 ms | 0.50% | 74.494 ms | 0.50% |      14413864845 |         1.336 GiB |       119.642 MiB |
|        1000 |       536870912 |           4 |        4 |          8 |   671088640 |    671088640 |    381x | 39.273 ms | 1.11% | 39.269 ms | 1.11% |      13671742653 |       643.394 MiB |        59.856 MiB |
|        1000 |      1073741824 |           4 |        4 |          8 |   671088640 |    671088640 |    204x | 73.755 ms | 0.60% | 73.751 ms | 0.60% |      14559012350 |         1.648 GiB |       119.665 MiB |
|        1000 |       536870912 |           8 |        4 |          8 |   671088640 |    671088640 |     80x | 39.490 ms | 1.31% | 39.486 ms | 1.31% |      13596333864 |       631.980 MiB |        59.928 MiB |
|        1000 |      1073741824 |           8 |        4 |          8 |   671088640 |    671088640 |    203x | 73.907 ms | 1.34% | 73.903 ms | 1.34% |      14529071322 |         1.454 GiB |       119.711 MiB |

## orc_multithreaded_read_decode_chunked_string

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|        1000 |       536870912 |           1 |        4 |          8 |   671088640 |    671088640 |     80x | 23.022 ms | 1.96% | 23.017 ms | 1.96% |      23324556592 |       661.948 MiB |        10.879 MiB |
|        1000 |      1073741824 |           1 |        4 |          8 |   671088640 |    671088640 |     80x | 37.687 ms | 1.37% | 37.682 ms | 1.37% |      28494755419 |         1.659 GiB |        21.757 MiB |
|        1000 |       536870912 |           2 |        4 |          8 |   671088640 |    671088640 |     80x | 22.703 ms | 2.30% | 22.699 ms | 2.30% |      23652118769 |       609.407 MiB |        10.941 MiB |
|        1000 |      1073741824 |           2 |        4 |          8 |   671088640 |    671088640 |     80x | 37.581 ms | 1.42% | 37.577 ms | 1.42% |      28574723179 |         1.658 GiB |        21.758 MiB |
|        1000 |       536870912 |           4 |        4 |          8 |   671088640 |    671088640 |    544x | 22.296 ms | 1.56% | 22.293 ms | 1.56% |      24082840350 |       631.319 MiB |        10.947 MiB |
|        1000 |      1073741824 |           4 |        4 |          8 |   671088640 |    671088640 |     14x | 36.990 ms | 0.14% | 36.985 ms | 0.14% |      29031484389 |         1.554 GiB |        21.881 MiB |
|        1000 |       536870912 |           8 |        4 |          8 |   671088640 |    671088640 |    676x | 22.114 ms | 1.22% | 22.110 ms | 1.22% |      24281965280 |       627.616 MiB |        10.951 MiB |
|        1000 |      1073741824 |           8 |        4 |          8 |   671088640 |    671088640 |     80x | 37.409 ms | 1.40% | 37.405 ms | 1.40% |      28706077426 |         1.562 GiB |        21.894 MiB |

## orc_multithreaded_read_decode_chunked_list

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time   | Noise  | GPU Time   | Noise  | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|------------|--------|------------|--------|------------------|-------------------|-------------------|
|        1000 |       536870912 |           1 |        4 |          8 |   671088640 |    671088640 |     80x |  74.780 ms |  0.67% |  74.776 ms |  0.67% |       7179747067 |       600.751 MiB |        60.245 MiB |
|        1000 |      1073741824 |           1 |        4 |          8 |   671088640 |    671088640 |    175x |  86.040 ms |  0.56% |  86.035 ms |  0.56% |      12480222210 |         1.576 GiB |       120.549 MiB |
|        1000 |       536870912 |           2 |        4 |          8 |   671088640 |    671088640 |    186x |  80.668 ms |  4.14% |  80.664 ms |  4.14% |       6655685080 |       600.951 MiB |        60.250 MiB |
|        1000 |      1073741824 |           2 |        4 |          8 |   671088640 |    671088640 |    143x | 105.217 ms | 21.56% | 105.212 ms | 21.56% |      10205531345 |         1.576 GiB |       120.489 MiB |
|        1000 |       536870912 |           4 |        4 |          8 |   671088640 |    671088640 |    128x |  80.087 ms |  3.05% |  80.082 ms |  3.05% |       6704042147 |       602.764 MiB |        60.323 MiB |
|        1000 |      1073741824 |           4 |        4 |          8 |   671088640 |    671088640 |    135x | 111.556 ms | 21.88% | 111.551 ms | 21.88% |       9625546746 |         1.489 GiB |       120.499 MiB |
|        1000 |       536870912 |           8 |        4 |          8 |   671088640 |    671088640 |    112x | 134.677 ms |  4.14% | 134.672 ms |  4.14% |       3986513604 |       603.471 MiB |        60.353 MiB |
|        1000 |      1073741824 |           8 |        4 |          8 |   671088640 |    671088640 |     80x | 178.735 ms | 14.17% | 178.730 ms | 14.17% |       6007630497 |         1.520 GiB |       120.646 MiB |
```

Authors:
  - Zach Puller (https://github.com/zpuller)
  - Vukasin Milovanovic (https://github.com/vuule)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - MithunR (https://github.com/mythrocks)

URL: #16009

zpuller added the feature request label Jun 11, 2024

github-actions bot added the External label Jun 11, 2024

Matt711 added the 0 - Backlog, Needs Triage, libcudf, cuIO, and Performance labels Jun 11, 2024

GregoryKimball added this to the Benchmarking milestone Jun 11, 2024

GregoryKimball added this to libcudf Jun 11, 2024

GregoryKimball added the good first issue label and removed the Needs Triage and External labels Jun 11, 2024

jlowe assigned zpuller Jun 11, 2024

zpuller mentioned this issue Jun 12, 2024

orc multithreaded benchmark #16009

Merged

3 tasks

zpuller closed this as completed Jun 14, 2024

GregoryKimball removed this from libcudf Jul 22, 2024
