[FEA] Address performance issue with decimal types in ORC reader #12677

GregoryKimball · 2023-02-02T04:24:54Z

Is your feature request related to a problem? Please describe.
In the ORC reader benchmarks (orc_read_decode), the decimal data_type is much slower than the other fixed-width types. The profiles below used low-cardinality, high-run-length settings, but the slowdown also applies to high-cardinality, low-run-length data. The extra time is spent in the ORC gpuDecodeOrcColumnData kernel.

read_orc takes about 25 ms for signed integers

read_orc takes about 150 ms for decimal

Describe the solution you'd like
I'd like decimal types to read with data ingest throughput that is similar to integer and float types.

Describe alternatives you've considered
n/a

Additional context
Here are the commands I used to generate the profiles:

export WORKSPACE=/raid && export CUDF_BENCHMARK_DROP_CACHE=1 && export B=ORC_READER_NVBENCH
/nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --gpu-metrics-device=0 --output=/nfs/20230201_orc_decimal/p3 cpp/build/benchmarks/$B --devices 0 --benchmark 0 --json /nfs/20230201_orc_decimal/"$B".json --axis data_type=[FLOAT,DECIMAL] --axis run_length=32 --axis cardinality=1000

Running rapids devel image 0022659d9d65 on cudf commit 3c39be5a9d7d69adaad47121359e0c084b76decf, on A100 + AMD Epyc hardware.


vuule · 2023-02-03T00:55:28Z

Looks like there's a bit to unpack here. Local benchmark results with separated decimal types:

| data_type | cardinality | run_length | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|-------------|------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
|    DECIMAL32 |           0 |          1 |     21x | 744.186 ms | 1.21% | 744.205 ms | 1.21% |        721402250 |         1.286 GiB |       599.958 MiB |
|    DECIMAL32 |        1000 |          1 |     20x | 769.859 ms | 0.95% | 769.878 ms | 0.95% |        697345317 |         1.253 GiB |       384.062 MiB |
|    DECIMAL32 |           0 |         32 |      5x | 665.438 ms | 0.25% | 665.452 ms | 0.25% |        806776608 |         1.174 GiB |        55.202 MiB |
|    DECIMAL32 |        1000 |         32 |      5x | 664.632 ms | 0.19% | 664.647 ms | 0.19% |        807753490 |         1.169 GiB |        52.236 MiB |

| data_type | cardinality | run_length | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|-------------|------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
|    DECIMAL64 |           0 |          1 |     31x | 496.289 ms | 1.71% | 496.292 ms | 1.71% |       1081764433 |         1.174 GiB |       571.132 MiB |
|    DECIMAL64 |        1000 |          1 |     30x | 502.897 ms | 0.77% | 502.901 ms | 0.77% |       1067548109 |         1.200 GiB |       268.817 MiB |
|    DECIMAL64 |           0 |         32 |      5x | 426.883 ms | 0.20% | 426.884 ms | 0.20% |       1257649713 |         1.155 GiB |        48.331 MiB |
|    DECIMAL64 |        1000 |         32 |      5x | 426.912 ms | 0.11% | 426.914 ms | 0.11% |       1257563245 |         1.153 GiB |        46.630 MiB |

| data_type | cardinality | run_length | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|-------------|------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
|   DECIMAL128 |           0 |          1 |    120x | 124.971 ms | 0.74% | 124.967 ms | 0.74% |       4296117684 |       595.779 MiB |         2.587 MiB |
|   DECIMAL128 |        1000 |          1 |      5x | 124.656 ms | 0.36% | 124.651 ms | 0.36% |       4306985267 |       591.265 MiB |         2.446 MiB |
|   DECIMAL128 |           0 |         32 |     28x | 122.432 ms | 0.50% | 122.427 ms | 0.50% |       4385230693 |       569.161 MiB |         1.753 MiB |
|   DECIMAL128 |        1000 |         32 |      5x | 121.576 ms | 0.07% | 121.572 ms | 0.07% |       4416080458 |       568.406 MiB |         1.729 MiB |

At first glance, decimal32 and decimal64 are even slower than the mixed-table benchmark suggests. Relatedly, decimal128 looks broken: the files are too small and the throughput is too high to be correct.
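As a rough sanity check on the "files are too small" observation, the sketch below back-computes the decoded data size from the first row of each table and compares it with the encoded file size. It assumes that bytes_per_second is the decoded data size divided by the measured GPU time; that is a reading of the numbers, not something the benchmark output confirms. The helper names are made up for illustration.

```python
MIB = 1024 ** 2

# First row of each table above (cardinality=0, run_length=1):
# (data_type, gpu_time_ms, bytes_per_second, encoded_file_size_mib)
rows = [
    ("DECIMAL32", 744.205, 721402250, 599.958),
    ("DECIMAL64", 496.292, 1081764433, 571.132),
    ("DECIMAL128", 124.967, 4296117684, 2.587),
]


def implied_input_mib(gpu_time_ms, bytes_per_second):
    """Decoded size in MiB, ASSUMING bytes_per_second = decoded bytes / GPU time."""
    return bytes_per_second * (gpu_time_ms / 1000.0) / MIB


def summarize(rows):
    """Map data_type -> (implied decoded MiB, decoded-to-encoded size ratio)."""
    out = {}
    for name, gpu_ms, bps, file_mib in rows:
        size_mib = implied_input_mib(gpu_ms, bps)
        out[name] = (size_mib, size_mib / file_mib)
    return out


summary = summarize(rows)
for name, (size_mib, ratio) in summary.items():
    print(f"{name}: ~{size_mib:.0f} MiB decoded, decoded/encoded ratio ~{ratio:.1f}")
```

Under that assumption, all three types work out to roughly 512 MiB of decoded data, yet the decimal128 file is only a few MiB. A decoded-to-encoded ratio near 200 for data generated with run_length=1 is implausible, which is consistent with the suspicion that the decimal128 path is not writing the intended data.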

GregoryKimball · 2023-07-05T15:55:41Z

Closing in favor of #13251

GregoryKimball added the labels: feature request, 1 - On Deck, libcudf, cuIO, Performance. Feb 2, 2023

GregoryKimball assigned vuule Feb 2, 2023

GregoryKimball added this to libcudf Feb 2, 2023

GregoryKimball changed the title ~~[FEA] Address performance issues decimal types in ORC reader~~ [FEA] Address performance issue with decimal types in ORC reader Feb 2, 2023

GregoryKimball mentioned this issue May 1, 2023

[FEA] Improve ORC reader performance for decimal types #13251

Open

GregoryKimball unassigned vuule Jul 5, 2023

GregoryKimball closed this as not planned Jul 5, 2023

GregoryKimball removed this from libcudf Oct 26, 2023
