[FEA] Improve PTDS performance #533
Comments
There's an important option you really should consider: a single stream might not be the bottleneck. I think your first checkbox is crucial (simple standalone benchmark profiling).
"but benchmark results haven't shown a big jump compared to the legacy default stream" Which benchmark, and how much? Thanks.
We are using TPCx-BB. When I/O is very saturated, PTDS is only slightly faster, and the difference is probably not statistically significant. On a single GPU there are 10-20% improvements. Still investigating.
Tried with the shuffle manager enabled. Looks like there is more memory pressure/fragmentation, so I had to increase the number of shuffle partitions/reduce GPU concurrency. It seems to perform better than without the shuffle manager, but it's not clear PTDS gives bigger gains.
This should be considered done.
Is your feature request related to a problem? Please describe.
CUDA per-thread default stream (PTDS) is enabled for the plugin, but benchmark results haven't shown a big jump compared to the legacy default stream. Need to figure out why and try to improve the performance.
Describe the solution you'd like
At the moment there is no clear solution, but here are some ideas:
- cudaDeviceSynchronize calls (reduce cudaDeviceSynchronize calls thrust#1255)
- cudaMemcpy and cudaMemset calls ([REVIEW] call cudaMemcpyAsync/cudaMemsetAsync in JNI [skip ci] rapidsai/cudf#5913)
- pool_memory_resource cudaErrorIllegalAddress errors (see [BUG] cudaErrorIllegalAddress an illegal memory access was encountered rapidsai/rmm#563)
- cudaStreamSynchronize calls
Describe alternatives you've considered
It's possible some of the benchmark queries are so I/O bound that increasing GPU concurrency does not reduce the wall-clock time.
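As a rough illustration of this argument, here is a minimal sketch that applies Amdahl's law to a query. It assumes the wall-clock time splits cleanly into a GPU-bound fraction, which PTDS could speed up, and an I/O-bound remainder, which it cannot. The function name `overall_speedup`, the 2x GPU speedup, and the fractions are all hypothetical and are not measurements from TPCx-BB.

```python
def overall_speedup(gpu_fraction: float, gpu_speedup: float) -> float:
    """Amdahl's law: whole-query speedup when only the GPU-bound
    fraction of the wall-clock time gets faster."""
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be in [0, 1]")
    if gpu_speedup <= 0.0:
        raise ValueError("gpu_speedup must be positive")
    # The I/O-bound part (1 - gpu_fraction) is unchanged; only the
    # GPU-bound part shrinks by gpu_speedup.
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)


# Hypothetical fractions for illustration only, not benchmark data.
for gpu_fraction in (0.1, 0.5, 0.9):
    print(f"GPU-bound fraction {gpu_fraction:.0%}: "
          f"{overall_speedup(gpu_fraction, 2.0):.2f}x overall")
```

Under these assumptions, a query that spends only 10% of its time on the GPU gains about 5% overall even if the GPU portion runs twice as fast, which would be hard to distinguish from run-to-run noise.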
Additional context
Original issue to enable PTDS: #15