-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add workaround for syevd in CUDA 12.0 #2332
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lowener for the fix. While the fix looks good, I think we should only apply it for the affected CTX versions.
@@ -91,6 +91,15 @@ void eigDC(raft::resources const& handle, | |||
#if CUDART_VERSION < 11010 | |||
eigDC_legacy(handle, in, n_rows, n_cols, eig_vectors, eig_vals, stream); | |||
#else | |||
|
|||
// Use a new stream instead of `cudaStreamPerThread` to avoid cusolver bug # 4580093. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, the cusolver bug is solved in cuda toolkit 12.4.1.003. It would be great if we apply this workaround only when we use an earlier cuda version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple notes on manually creating streams / events.
cudaStream_t stream_new; | ||
cudaEvent_t sync_event; | ||
RAFT_CUDA_TRY(cudaStreamCreate(&stream_new)); | ||
RAFT_CUDA_TRY(cudaEventCreate(&sync_event)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, in this case it would be justified to use the resource::detail::get_cuda_stream_sync_event instead of manually managing the resource.
This event resource is normally intended for synchronization between the streams in the resource pool, but as long as you don't use the stream pool resource at the same time, it's ok to reuse the event.
You can also use the stream from the stream pool resource, but there is a small problem with it, that raft/rmm would create 16 streams by default instead of one :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that we shouldn't be calling detail APIs in any namdespaces outside of the immediate namespace where those detail APIs reside. Please expose this function if it's going to be used publicly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lowener for the update. LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update, LGTM!
/merge |
Bug 4580093 of cusolver is causing an issue with
cusolverDnXsyevd
. This bug has been seen in rapidsai/cuml#5555 and impact PCA and Linear Regression with CUDA 12.0+. Setting the stream to a different value thancudaStreamPerThread
seems to solve it as a workaround.