Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Address performance issue with decimal types in ORC reader #12677

Closed
GregoryKimball opened this issue Feb 2, 2023 · 2 comments
Closed
Labels
1 - On Deck To be worked on next cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Feb 2, 2023

Is your feature request related to a problem? Please describe.
In the ORC reader benchmarks, the decimal data_type in orc_read_decode is much slower than other fixed width types. The profiles below used low cardinality, high run length settings, but the slowdown applies to high cardinality, low run length data as well. The extra time is being spent in the ORC gpuDecodeOrcColumnData kernel.

read_orc takes about 25 ms for signed integers
image

read_orc takes about 150 ms for decimal
image

Describe the solution you'd like
I'd like decimal types to read with data ingest throughput that is similar to integer and float types.

Describe alternatives you've considered
n/a

Additional context
Here are the commands I used to generate the profiles:

export WORKSPACE=/raid && export CUDF_BENCHMARK_DROP_CACHE=1 && export B=ORC_READER_NVBENCH
/nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --gpu-metrics-device=0 --output=/nfs/20230201_orc_decimal/p3 cpp/build/benchmarks/$B --devices 0 --benchmark 0 --json /nfs/20230201_orc_decimal/"$B".json --axis data_type=[FLOAT,DECIMAL] --axis run_length=32 --axis cardinality=1000

Running rapids devel image 0022659d9d65 on cudf commit 3c39be5a9d7d69adaad47121359e0c084b76decf A100 + AMD Epyc hardware

@GregoryKimball GregoryKimball added feature request New feature or request 1 - On Deck To be worked on next libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Performance Performance related issue labels Feb 2, 2023
@GregoryKimball GregoryKimball changed the title [FEA] Address performance issues decimal types in ORC reader [FEA] Address performance issue with decimal types in ORC reader Feb 2, 2023
@vuule
Copy link
Contributor

vuule commented Feb 3, 2023

Looks like there's a bit to unpack here. Local benchmark results with separated decimal types:

| data_type | cardinality | run_length | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|-------------|------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
|    DECIMAL32 |           0 |          1 |     21x | 744.186 ms | 1.21% | 744.205 ms | 1.21% |        721402250 |         1.286 GiB |       599.958 MiB |
|    DECIMAL32 |        1000 |          1 |     20x | 769.859 ms | 0.95% | 769.878 ms | 0.95% |        697345317 |         1.253 GiB |       384.062 MiB |
|    DECIMAL32 |           0 |         32 |      5x | 665.438 ms | 0.25% | 665.452 ms | 0.25% |        806776608 |         1.174 GiB |        55.202 MiB |
|    DECIMAL32 |        1000 |         32 |      5x | 664.632 ms | 0.19% | 664.647 ms | 0.19% |        807753490 |         1.169 GiB |        52.236 MiB |
| data_type | cardinality | run_length | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|-------------|------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
|    DECIMAL64 |           0 |          1 |     31x | 496.289 ms | 1.71% | 496.292 ms | 1.71% |       1081764433 |         1.174 GiB |       571.132 MiB |
|    DECIMAL64 |        1000 |          1 |     30x | 502.897 ms | 0.77% | 502.901 ms | 0.77% |       1067548109 |         1.200 GiB |       268.817 MiB |
|    DECIMAL64 |           0 |         32 |      5x | 426.883 ms | 0.20% | 426.884 ms | 0.20% |       1257649713 |         1.155 GiB |        48.331 MiB |
|    DECIMAL64 |        1000 |         32 |      5x | 426.912 ms | 0.11% | 426.914 ms | 0.11% |       1257563245 |         1.153 GiB |        46.630 MiB |
| data_type | cardinality | run_length | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|-------------|------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
|    DECIMA128 |           0 |          1 |    120x | 124.971 ms | 0.74% | 124.967 ms | 0.74% |       4296117684 |       595.779 MiB |         2.587 MiB |
|    DECIMA128 |        1000 |          1 |      5x | 124.656 ms | 0.36% | 124.651 ms | 0.36% |       4306985267 |       591.265 MiB |         2.446 MiB |
|    DECIMA128 |           0 |         32 |     28x | 122.432 ms | 0.50% | 122.427 ms | 0.50% |       4385230693 |       569.161 MiB |         1.753 MiB |
|    DECIMA128 |        1000 |         32 |      5x | 121.576 ms | 0.07% | 121.572 ms | 0.07% |       4416080458 |       568.406 MiB |         1.729 MiB |

At first glance - decimal 32 and 64 are even slower than the mixed table benchmark suggests. Related - dec128 looks broken; files are too small and the throughput is too large to be correct.

@GregoryKimball
Copy link
Contributor Author

Closing in favor of #13251

@GregoryKimball GregoryKimball closed this as not planned Won't fix, can't repro, duplicate, stale Jul 5, 2023
@GregoryKimball GregoryKimball removed this from libcudf Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - On Deck To be worked on next cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
None yet
Development

No branches or pull requests

2 participants