-
Notifications
You must be signed in to change notification settings - Fork 933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use hostdevice_vector
in kernel_error
to avoid the pageable copy
#15140
Use hostdevice_vector
in kernel_error
to avoid the pageable copy
#15140
Conversation
@abellina suspects this might be the case. It's possible that the shared bounce buffer used for pageable copies creates a bottleneck only in multi-threaded use case. This could explain why the regression is specific to spark. I opened the PR to enable @abellina to test the hypothesis. |
@vuule @GregoryKimball we will review/test this, sorry for the delayed response. |
@vuule I have done some testing with this but I need a bit more time. So far I think with the error code checking plus the in This:
Is 10 seconds faster than:
Which is much better than what I had measured before (with pageable we had a 20 seconds or 5% regression). I also want to not do that last |
@vuule here are my updates so far:
|
Correct. The synchronization there is needed because the null count code needs the updated data from the page and nesting info bufs that gets computed in the kernel. The |
But the call to |
Sorry the funny business is in my end. I also didn't call |
What if we change |
Got an extra data point by using Comparing it with a baseline that is the pageable version of this (the code before this PR), it's essentially unchanged overall. Some queries are faster, and some are slower. Query 9 is 9% faster. Details:
|
Re-reading the comment again, I think it makes sense now, so I am not sure we need to reword. Hope we have enough data for this PR. If you need other tests let me know. |
…perf-hd_v-kernel_error
No measurable performance impact on libcudf |
…into perf-hd_v-kernel_error
…perf-hd_v-kernel_error
I cooked the |
LGTM! Thanks! |
…perf-hd_v-kernel_error
*/ | ||
[[nodiscard]] std::string str() const | ||
[[nodiscard]] static std::string to_string(value_type value) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of static, shouldn't this be a free function instead of a member function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be, but I like how it reads at the call site: kernel_error::to_string(error)
and it does not really change much else about the function.
I'm not insisting on this option, so, open to suggestions :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I typically don't like to see if (auto ret = foo(); ret == true)
type stuff, but I'll give you pass on this usage, @vuule
Yeah, it would be pretty pointless if not for the repeated synchronization this avoids. |
/merge |
Description
Issue #15122
The addition of kernel error checking introduced a 5% performance regression in Spark-RAPIDS. It was determined that the pageable copy of the error back to host caused this overhead, presumably because of the CUDA's bounce buffer bottleneck.
This PR aims to eliminate most of the error checking overhead by using
hostdevice_vector
in thekernel_error
class. Thehostdevice_vector
uses pinned memory so the copy is no longer pageable. The PR also removes the redundant sync after we read the error.Checklist