
[REVIEW] Fix slow decompression performance #7951

Closed
wants to merge 3 commits into from

Conversation

elstehle
Contributor

Fixes #7114

No performance regression on Turing (T4, CUDA 11.0, driver 450.102.04):

Click here to see environment details
nvprof --print-gpu-trace ./gbenchmarks/PARQUET_READER_BENCH --benchmark_filter="ParquetRead/integral_file_input/28/1000/1/1/0/manual_time" 2>&1

*** WITH PATCH ***
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
3.34085s 9.8091ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [4661]
4.77212s 9.6983ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [9822]
4.88598s 9.8804ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [10833]
5.00339s 9.9037ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [11844]
5.12055s 9.8821ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [12855]
5.23300s 9.8300ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [13866]
5.34499s 9.7688ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [14877]

*** WITHOUT PATCH ***

Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
3.34647s 10.088ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [4661]
4.70031s 9.6986ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [9822]
4.80015s 9.9992ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [10833]
4.90408s 9.8958ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [11844]
5.00237s 9.8824ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [12855]
5.10425s 9.9476ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [13866]
5.20341s 9.8957ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [14877]
5.30279s 9.8980ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [15888]

@elstehle elstehle requested a review from a team as a code owner April 14, 2021 00:38
@elstehle elstehle requested review from ttnghia and rgsl888prabhu and removed request for a team April 14, 2021 00:38
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 14, 2021
@kkraus14 kkraus14 added ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround ! - Release non-breaking Non-breaking change bug Something isn't working Performance Performance related issue cuIO cuIO issue labels Apr 14, 2021
@kkraus14
Collaborator

cc @vuule @devavret @tgravescs

Review comment on cpp/src/io/utilities/block_utils.cuh (outdated, resolved)
@devavret
Contributor

Can you also post the results with CUDA 10.2? The nanosleep issue is in driver versions 450.80.02 to 450.110, but it is only visible with CUDA 10.2 because CUDA 11's nvcc patches it.

@revans2
Contributor

revans2 commented Apr 14, 2021

I tested this against the files that were causing issues on my dev box, which caused me to file #7114. This fixes them. The files that took 1.8 seconds before now finish in 90 ms, and a complete build of the Spark plugin with tests went from 38 mins back to 30.

@ttnghia
Contributor

ttnghia commented Apr 14, 2021

Wait, does this still target branch 0.19? Should branch 0.19 be frozen?

@jrhemstad
Contributor

Wait, does this still target branch 0.19? Should branch 0.19 be frozen?

It's a hotfix.

@kkraus14
Collaborator

We're actually going to target this to branch-0.18 to hotfix 0.18 for this and then make sure it gets forward applied to 0.19 and 0.20.

@kkraus14 kkraus14 changed the base branch from branch-0.19 to branch-0.18 April 14, 2021 18:25
@kkraus14 kkraus14 requested review from a team as code owners April 14, 2021 18:25
@kkraus14 kkraus14 removed the request for review from a team April 14, 2021 18:25
@kkraus14 kkraus14 requested a review from galipremsagar April 14, 2021 18:25
@kkraus14 kkraus14 changed the base branch from branch-0.18 to branch-0.19 April 14, 2021 18:25
@kkraus14
Collaborator

Alright, I tried targeting branch-0.18 and git hell occurred. Let's iron out reviews here and then we'll cherry-pick the changes as needed.

@rgsl888prabhu
Contributor

@elstehle Can you please post the performance after your current commit, where you have removed nanosleep?

@rgsl888prabhu rgsl888prabhu self-requested a review April 14, 2021 21:47
@kkraus14
Collaborator

@revans2 if you could test again with the latest changes, it would be appreciated as well!

@revans2
Contributor

revans2 commented Apr 14, 2021

@revans2 if you could test again with the latest changes, it would be appreciated as well!

I did a quick test with it and it looks good. I don't have the exact numbers right now, but I will try to get them.

@kkraus14 kkraus14 removed request for a team April 14, 2021 22:06
@elstehle
Contributor Author

elstehle commented Apr 14, 2021

@elstehle Can you please post the performance after your current commit where you have removed nanosleep.

This time tested with CUDA 10.2 (T4, driver 450.102.04):

*** PATCH IN PLACE ***
ParquetRead/integral_file_input/28/1000/1/1/0/manual_time        116 ms          116 ms            6 bytes_per_second=4.32857G/s

nvprof --print-gpu-trace ./gbenchmarks/PARQUET_READER_BENCH --benchmark_filter="ParquetRead/integral_file_input/28/1000/1/1/0/manual_time" 2>&1|grep unsnap
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
2.83721s  14.245ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [4657]
4.24138s  13.080ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [9818]
4.35756s  13.359ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [10829]
4.47735s  13.361ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [11840]
4.59230s  13.424ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [12851]
4.70582s  13.407ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [13862]
4.82084s  13.386ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [14873]


*** NO PATCH ***
ParquetRead/integral_file_input/28/1000/1/1/0/manual_time        120 ms          120 ms            5 bytes_per_second=4.15742G/s

nvprof --print-gpu-trace ./gbenchmarks/PARQUET_READER_BENCH --benchmark_filter="ParquetRead/integral_file_input/28/1000/1/1/0/manual_time" 2>&1|grep unsnap
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
2.79920s  17.970ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [4657]
4.22674s  15.256ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [9818]
4.34504s  15.410ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [10829]
4.46671s  16.608ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [11840]
4.58445s  15.099ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [12851]
4.70102s  16.133ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [13862]
4.81767s  16.921ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [14873]

@kkraus14 kkraus14 removed request for a team and galipremsagar April 14, 2021 22:06
@elstehle elstehle changed the title [WIP] Fix slow decompression performance [REVIEW] Fix slow decompression performance Apr 14, 2021
@devavret devavret left a comment
Contributor

@elstehle should probably emphasise that the performance actually improved for non-affected driver versions as well: 16 ms -> 13 ms.

@revans2
Contributor

revans2 commented Apr 14, 2021

Yup, the numbers are about the same: about 93 ms for reading the file that used to take 1.8 seconds, and 31 mins to run the full set of tests instead of 38. I don't know if it is run-to-run variance or not, but either way it is a lot better.

raydouglass pushed a commit that referenced this pull request Apr 15, 2021
@kkraus14
Collaborator

Closing this in favor of #7975 which merges the commits from branch-0.18 into branch-0.19.

@kkraus14 kkraus14 closed this Apr 15, 2021