
[REVIEW] Fix slow decompression performance #7951

Closed
wants to merge 3 commits into from

Conversation

elstehle
Contributor

Fixes #7114

No performance regression on Turing (T4, CUDA 11.0, driver 450.102.04):

Click here to see environment details
nvprof --print-gpu-trace ./gbenchmarks/PARQUET_READER_BENCH --benchmark_filter="ParquetRead/integral_file_input/28/1000/1/1/0/manual_time" 2>&1

*** WITH PATCH ***
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
3.34085s 9.8091ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [4661]
4.77212s 9.6983ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [9822]
4.88598s 9.8804ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [10833]
5.00339s 9.9037ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [11844]
5.12055s 9.8821ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [12855]
5.23300s 9.8300ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [13866]
5.34499s 9.7688ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [14877]

*** WITHOUT PATCH ***

Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
3.34647s 10.088ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [4661]
4.70031s 9.6986ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [9822]
4.80015s 9.9992ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [10833]
4.90408s 9.8958ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [11844]
5.00237s 9.8824ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [12855]
5.10425s 9.9476ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [13866]
5.20341s 9.8957ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [14877]
5.30279s 9.8980ms (954 1 1) (128 1 1) 44 1.5938KB 0B - - - - Tesla T4 (0) 1 7 void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [15888]

@elstehle elstehle requested a review from a team as a code owner April 14, 2021 00:38
@elstehle elstehle requested review from ttnghia and rgsl888prabhu and removed request for a team April 14, 2021 00:38
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 14, 2021
@kkraus14 kkraus14 added ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround ! - Release non-breaking Non-breaking change bug Something isn't working Performance Performance related issue cuIO cuIO issue labels Apr 14, 2021
@kkraus14
Collaborator

cc @vuule @devavret @tgravescs

Review comment on cpp/src/io/utilities/block_utils.cuh (outdated, resolved)
@devavret
Contributor

Can you also post the results with CUDA 10.2? The nanosleep issue is in driver versions 450.80.02 to 450.110, but it is only visible with CUDA 10.2 because CUDA 11's nvcc patches it.

@revans2
Contributor

revans2 commented Apr 14, 2021

I tested this against the files that were causing issues on my dev box, which caused me to file #7114. This fixes them. The files that took 1.8 seconds before now finish in 90 ms, and a complete build of the Spark plugin with tests went from 38 mins back to 30.

@ttnghia
Contributor

ttnghia commented Apr 14, 2021

Wait, does this still target branch 0.19? Should branch 0.19 be frozen?

@jrhemstad
Contributor

Wait, does this still target branch 0.19? Should branch 0.19 be frozen?

It's a hotfix.

@kkraus14
Collaborator

We're actually going to target this to branch-0.18 to hotfix 0.18 for this and then make sure it gets forward applied to 0.19 and 0.20.

@kkraus14 kkraus14 changed the base branch from branch-0.19 to branch-0.18 April 14, 2021 18:25
@kkraus14 kkraus14 requested review from a team as code owners April 14, 2021 18:25
@kkraus14 kkraus14 removed the request for review from a team April 14, 2021 18:25
@kkraus14 kkraus14 requested a review from galipremsagar April 14, 2021 18:25
@kkraus14 kkraus14 changed the base branch from branch-0.18 to branch-0.19 April 14, 2021 18:25
@kkraus14
Collaborator

Alright, I tried targeting branch-0.18 and git hell occurred. Let's iron out reviews here and then we'll cherry-pick the changes as needed.

@rgsl888prabhu
Contributor

@elstehle Can you please post the performance after your current commit, where you have removed nanosleep?

@rgsl888prabhu rgsl888prabhu self-requested a review April 14, 2021 21:47
@kkraus14
Collaborator

@revans2 if you could test again with the latest changes, it would be appreciated as well!

@revans2
Contributor

revans2 commented Apr 14, 2021

@revans2 if you could test again with the latest changes, it would be appreciated as well!

I did a quick test with it and it looks good. I don't have the exact numbers right now, but I will try to get them.

@kkraus14 kkraus14 removed request for a team April 14, 2021 22:06
@elstehle
Contributor Author

elstehle commented Apr 14, 2021

@elstehle Can you please post the performance after your current commit where you have removed nanosleep.

This time tested with CUDA 10.2 (T4, driver 450.102.04):

*** PATCH IN PLACE ***
ParquetRead/integral_file_input/28/1000/1/1/0/manual_time        116 ms          116 ms            6 bytes_per_second=4.32857G/s

nvprof --print-gpu-trace ./gbenchmarks/PARQUET_READER_BENCH --benchmark_filter="ParquetRead/integral_file_input/28/1000/1/1/0/manual_time" 2>&1|grep unsnap
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
2.83721s  14.245ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [4657]
4.24138s  13.080ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [9818]
4.35756s  13.359ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [10829]
4.47735s  13.361ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [11840]
4.59230s  13.424ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [12851]
4.70582s  13.407ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [13862]
4.82084s  13.386ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [14873]


*** NO PATCH ***
ParquetRead/integral_file_input/28/1000/1/1/0/manual_time        120 ms          120 ms            5 bytes_per_second=4.15742G/s

nvprof --print-gpu-trace ./gbenchmarks/PARQUET_READER_BENCH --benchmark_filter="ParquetRead/integral_file_input/28/1000/1/1/0/manual_time" 2>&1|grep unsnap
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
2.79920s  17.970ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [4657]
4.22674s  15.256ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [9818]
4.34504s  15.410ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [10829]
4.46671s  16.608ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [11840]
4.58445s  15.099ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [12851]
4.70102s  16.133ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [13862]
4.81767s  16.921ms            (954 1 1)       (128 1 1)        44  1.5938KB        0B         -           -           -           -     Tesla T4 (0)         1         7  void cudf::io::unsnap_kernel<int=128>(cudf::io::gpu_inflate_input_s*, cudf::io::gpu_inflate_status_s*) [14873]

@kkraus14 kkraus14 removed request for a team and galipremsagar April 14, 2021 22:06
@elstehle elstehle changed the title [WIP] Fix slow decompression performance [REVIEW] Fix slow decompression performance Apr 14, 2021
@devavret devavret left a comment
Contributor

@elstehle should probably emphasise that the performance actually improved for non-affected driver versions as well: 16 ms -> 13 ms.

@revans2
Contributor

revans2 commented Apr 14, 2021

Yup, the numbers are about the same: about 93 ms for reading the file that used to take 1.8 seconds, and 31 mins to run the full set of tests instead of 38. I don't know if it is run-to-run variance or not, but either way it is a lot better.

raydouglass pushed a commit that referenced this pull request Apr 15, 2021
@kkraus14
Collaborator

Closing this in favor of #7975 which merges the commits from branch-0.18 into branch-0.19.

@kkraus14 kkraus14 closed this Apr 15, 2021