[FEA] Improve PTDS performance #533

rongou · 2020-08-08T01:05:01Z

Is your feature request related to a problem? Please describe.
CUDA per-thread default stream (PTDS) is enabled for the plugin, but benchmark results haven't shown a big jump compared to the legacy default stream. Need to figure out why and try to improve the performance.

Describe the solution you'd like
At the moment there is no clear solution, but here are some ideas:

Get rid of cudaDeviceSynchronize calls (see thrust#1255, "reduce cudaDeviceSynchronize calls")
Profile with a smaller dataset on a desktop to better understand the performance bottlenecks
Benchmark with the shuffle manager and UCX enabled to ameliorate the effects of I/O
Fix synchronous cudaMemcpy and cudaMemset calls (see rapidsai/cudf#5913, "[REVIEW] call cudaMemcpyAsync/cudaMemsetAsync in JNI [skip ci]")
Improve CUDA event handling in RMM's pool_memory_resource
Reduce GPU memory fragmentation with multiple streams
Better support spilling, where memory is freed from a different thread than the one that originally allocated it
Fix cudaErrorIllegalAddress errors (see rapidsai/rmm#563, "[BUG] cudaErrorIllegalAddress an illegal memory access was encountered")
(Nice to have) Remove unnecessary cudaStreamSynchronize calls
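
The fragmentation and cross-thread free items above can be illustrated with a toy model. The sketch below is a deliberately simplified, hypothetical allocator (the class `ToyStreamPool` and its methods are made up for illustration and are not RMM's pool_memory_resource). It assumes freed blocks are tracked per stream and that a block freed on one stream cannot be handed to another stream without synchronization, which is the situation that can leave memory stranded when many streams are active.

```python
from collections import defaultdict


class ToyStreamPool:
    """Toy pool allocator that tracks freed blocks per stream (hypothetical model)."""

    def __init__(self, capacity):
        self.unused = capacity                # bytes never handed out so far
        self.free_blocks = defaultdict(list)  # stream id -> sizes of freed blocks

    def allocate(self, stream, size):
        blocks = self.free_blocks[stream]
        # A block freed on the same stream can be reused without synchronization.
        if size in blocks:
            blocks.remove(size)
            return "reused"
        # Otherwise carve the request out of untouched pool memory.
        if self.unused >= size:
            self.unused -= size
            return "fresh"
        # Free memory may exist on other streams, but taking it would require
        # synchronizing with those streams first.
        return "blocked"

    def free(self, stream, size):
        self.free_blocks[stream].append(size)

    def total_free(self):
        return self.unused + sum(sum(sizes) for sizes in self.free_blocks.values())


pool = ToyStreamPool(capacity=100)
first = pool.allocate("stream_a", 60)    # carved from untouched memory
pool.free("stream_a", 60)                # block returns to stream_a's free list
second = pool.allocate("stream_b", 60)   # stream_b cannot see stream_a's block
print(first, second, pool.total_free())  # prints: fresh blocked 100
```

In this model the pool reports 100 bytes free, yet the 60-byte request on `stream_b` is blocked because the only large free block belongs to `stream_a`. A real allocator would have to synchronize and take that block, or fail the request.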

Describe alternatives you've considered
It's possible that some of the benchmark queries are so I/O bound that increasing GPU concurrency does not help reduce the wall clock time.
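
As a rough way to reason about this, an Amdahl-style estimate shows how little an I/O-bound query can gain from faster GPU work. The sketch below is an assumption-laden model, not a measurement: `gpu_fraction` (the share of wall clock time spent in GPU work) and `gpu_speedup` (how much faster that share gets with more concurrency) are hypothetical inputs, and the model assumes I/O and GPU work do not overlap.

```python
def wall_clock_speedup(gpu_fraction, gpu_speedup):
    """Amdahl-style estimate of the overall speedup.

    gpu_fraction: share of wall clock time spent in GPU work (0.0 to 1.0).
    gpu_speedup:  factor by which that GPU share gets faster.
    The remaining (1 - gpu_fraction) is treated as I/O that does not speed up.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    if gpu_speedup <= 0:
        raise ValueError("gpu_speedup must be positive")
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)


# Hypothetical inputs: GPU work gets 2x faster, for three different GPU shares.
for gpu_fraction in (0.2, 0.5, 0.8):
    print(gpu_fraction, round(wall_clock_speedup(gpu_fraction, 2.0), 2))
```

With these made-up inputs, a query that spends only 20% of its time on the GPU gains about 11% overall even if the GPU portion doubles in speed, while one that spends 80% gains about 67%.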

Additional context
Original issue to enable PTDS: #15


harrism · 2020-08-11T02:34:16Z

but benchmark results haven't shown a big jump compared to the legacy default stream.

There's an important possibility you really should consider: a single stream might not be the bottleneck.

I think your first checkbox is crucial (simple standalone benchmark profiling).

JustPlay · 2020-08-11T07:36:40Z

@rongou

"but benchmark results haven't shown a big jump compared to the legacy default stream"

what benchmark? and how much?

thanks

rongou · 2020-08-11T16:17:24Z

We are using TPCx-BB. When I/O is very saturated, PTDS is only slightly faster, and the difference is probably not statistically significant. On a single GPU there are 10-20% improvements. Still investigating.
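
One way to check whether a small difference between two sets of benchmark runs stands out from run-to-run noise is Welch's t statistic. The sketch below uses only the Python standard library. The function name `welch_t` and the timing lists are illustrative placeholders, not measurements from these benchmarks.

```python
import math
import statistics


def welch_t(times_a, times_b):
    """Welch's t statistic for two independent samples of run times.

    Each sample needs at least two runs so a sample variance can be computed.
    """
    mean_diff = statistics.mean(times_a) - statistics.mean(times_b)
    standard_error = math.sqrt(
        statistics.variance(times_a) / len(times_a)
        + statistics.variance(times_b) / len(times_b)
    )
    if standard_error == 0:
        raise ValueError("both samples have zero variance")
    return mean_diff / standard_error


# Placeholder run times in seconds (NOT real results).
legacy_runs = [101.0, 98.0, 103.0, 99.0]
ptds_runs = [99.0, 97.0, 102.0, 98.0]
print(welch_t(legacy_runs, ptds_runs))
```

A statistic close to zero means the gap between the means is small relative to the noise. Turning it into a p-value additionally needs the Welch-Satterthwaite degrees of freedom and a t distribution, which are left out here.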

rongou · 2020-08-13T22:57:58Z

Tried with the shuffle manager enabled. There seems to be more memory pressure/fragmentation, so I had to increase the number of shuffle partitions and reduce GPU concurrency. It seems to perform better than without the shuffle manager, but it's not clear that PTDS gives bigger gains.
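
For readers who want to try similar tuning, the sketch below renders a set of Spark settings as `spark-submit` arguments. It assumes the two knobs mentioned map to `spark.sql.shuffle.partitions` and `spark.rapids.sql.concurrentGpuTasks`. The helper `spark_conf_args` is hypothetical, and the values are placeholders, not the ones used in these runs.

```python
def spark_conf_args(settings):
    """Render a dict of Spark settings as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder values (assumptions for illustration only).
settings = {
    "spark.sql.shuffle.partitions": 400,       # more, smaller shuffle partitions
    "spark.rapids.sql.concurrentGpuTasks": 1,  # fewer tasks sharing the GPU at once
}
print(" ".join(spark_conf_args(settings)))
```

More shuffle partitions shrink the per-task working set, and fewer concurrent GPU tasks lower peak memory demand, at the cost of some GPU concurrency.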

rongou · 2020-11-06T18:42:21Z

This should be considered done.

rongou added the "feature request" and "? - Needs Triage" labels Aug 8, 2020

sameerz removed the "? - Needs Triage" label Aug 10, 2020

sameerz assigned rongou Aug 10, 2020

rongou mentioned this issue Aug 11, 2020

[REVIEW] call cudaMemcpyAsync/cudaMemsetAsync in JNI [skip ci] rapidsai/cudf#5913

Merged

rongou mentioned this issue Aug 18, 2020

[REVIEW] upgrade CUB/Thrust to the latest commit rapidsai/cudf#6015

Merged

rongou mentioned this issue Sep 10, 2020

[REVIEW] Add an arena-based memory resource for PTDS rapidsai/rmm#543

Merged

JustPlay mentioned this issue Sep 28, 2020

[FEA] config option to enable RMM arena memory resource #864

Closed

rongou closed this as completed Nov 6, 2020

sameerz added the "performance" label and removed the "feature request" label Dec 14, 2020

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023

Fixing ansi casting to integer with decimal string values (NVIDIA#533)

fdee291
