Replace use of custom CUDA bindings with CUDA-Python #930
Conversation
Seeing all of the code dropped here is great! Thanks for working on this Ashwin! 😄
This is awesome @shwina! Thanks for doing this. I reviewed this primarily to learn more about RMM's Cython API, which I haven't looked at much before. I have one suggestion and one question attached below.
This is so cool. RMM is getting leaner!
Is this breaking or non-breaking?
I'm marking it non-breaking, as the only changes are to non-public APIs. However, this will break downstream libraries that rely on those internal APIs. As part of this PR, I'm going to remove those APIs entirely from RMM.
Also, should we be adding `cuda-python` to the requirements?
Thanks @jakirkham - I just did that. Could you please double check that the constraints are appropriate?
Also python/dev_requirements.txt. Is it intentional that we have pinnings in that file?
Hmm...I'm not seeing them. Did they get pushed?
python/rmm/_cuda/gpu.py:

```python
err, name = cudart.cudaGetErrorName(status)
if err != cudart.cudaError_t.cudaSuccess:
    raise CUDARuntimeError(err.value)
```
It's awkward that we might raise a `CUDARuntimeError` from within the constructor of `CUDARuntimeError`... which got me to looking a little closer.
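For context, here is a minimal sketch of the kind of constructor under discussion (a hypothetical simplification, not RMM's exact source):

```python
from cuda import cudart

class CUDARuntimeError(RuntimeError):
    def __init__(self, status: cudart.cudaError_t):
        self.status = status
        # If these lookups were themselves error-checked by raising
        # CUDARuntimeError, the constructor could recurse into itself:
        _, name = cudart.cudaGetErrorName(status)
        _, msg = cudart.cudaGetErrorString(status)
        super().__init__(f"{name.decode()}: {msg.decode()}")
```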
I saw the suggestion from @jakirkham in #930 (comment) / 8be6c0b but I'm pretty sure this is a spurious error check that illuminates some slightly awkward design in cuda-python (a meaningless `err` value).
For `cudaGetErrorName`, the `err` value is hardcoded as `cudaError_t.cudaSuccess`. It's not likely that this will ever change, because the corresponding runtime API does not generate a CUDA error: `cudaGetErrorName` returns a `const char*` with a special value of `"unrecognized error code"` rather than setting an error code if the parameters are invalid.
https://github.com/NVIDIA/cuda-python/blob/746b773c91e1ede708fe9a584b8cdb1c0f32b51d/cuda/cudart.pyx#L8084
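A quick way to observe the behavior described above (a sketch, assuming `cuda-python` is installed; the name comes back as `bytes` in the versions I've used):

```python
from cuda import cudart

# err is hard-coded to success regardless of the input status:
err, name = cudart.cudaGetErrorName(cudart.cudaError_t.cudaErrorInvalidValue)
print(err)   # cudaError_t.cudaSuccess
print(name)  # b'cudaErrorInvalidValue'
```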
Similarly for `cudaGetErrorString`... sort of:
https://github.com/NVIDIA/cuda-python/blob/746b773c91e1ede708fe9a584b8cdb1c0f32b51d/cuda/cudart.pyx#L8115
This function actually can set an error code via its call to the driver API `cuGetErrorString`, but that error is not returned -- the `cudaGetErrorString` in cuda-python also has `err` hard-coded as a success and the value `"unrecognized error code"` is returned. https://github.com/NVIDIA/cuda-python/blob/746b773c91e1ede708fe9a584b8cdb1c0f32b51d/cuda/_lib/ccudart/ccudart.pyx#L522-L531
In summary, I think it would be safe to revert 8be6c0b and use the previous snippet with a throwaway variable `_, name = ...` for both `cudaGetErrorName` and `cudaGetErrorString`, because the `err` is always a success. As a consequence, we don't need to handle this awkward possibility of raising an error class in its own constructor.
I agree that it is awkward. I would hope this error can't come up, but it might if the error code is new, IOW one that only shows up in later CUDA versions but not in earlier ones. Or it could show up if the error code is malformed.

Maybe a middle ground would be to raise a `RuntimeError` so we don't have a recursive `CUDARuntimeError` definition. Thoughts?
If a new or malformed error code is provided, the returned string from `cudaGetErrorName` or `cudaGetErrorString` will say `"unrecognized error code"`, but no CUDA error will be set. Thus we do not need to check for a CUDA error from those functions. The cuda-python decision to hard-code the `err` value as "success" strikes me as odd, because no `err` should be returned at all.

If we were to need to raise an error, the recursive use of `CUDARuntimeError` is not problematic/incorrect -- just awkward.

However, I don't have strong enough feelings on this to hold back on merging.
Perhaps we should check for that returned string and then error? Having suppressed errors doesn't sound good.

Agree with your assessment that CUDA-Python should be handling this better, but maybe we can have that discussion in a new CUDA-Python issue?
That sounds like a good idea! Raise a `RuntimeError` if we get the unrecognized error code string back, but don't check the value of `err`:

```diff
-err, name = cudart.cudaGetErrorName(status)
-if err != cudart.cudaError_t.cudaSuccess:
-    raise CUDARuntimeError(err.value)
+_, name = cudart.cudaGetErrorName(status)
+if name == "unrecognized error code":
+    raise RuntimeError(name)
```
(and similarly for `cudaGetErrorString`)
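Spelled out, the analogous change for `cudaGetErrorString` might look like this (a sketch following the suggestion above; `status` is assumed to be the code being reported):

```python
# Note: cuda-python returns the message as bytes in the versions I've seen,
# so compare against a bytes literal:
_, msg = cudart.cudaGetErrorString(status)
if msg == b"unrecognized error code":
    raise RuntimeError(msg)
```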
Fix'd!
After a brief sync offline with Bradley, we decided to just not worry about handling an unrecognized error code, because we couldn't figure out how such an error code could be constructed to begin with. `cudaError_t` is an `enum` that cannot be constructed from arbitrary integers.
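That reasoning can be checked directly (a sketch, assuming cuda-python's enums behave like Python `IntEnum`s, which reject undefined values):

```python
from cuda import cudart

cudart.cudaError_t(0)  # cudaError_t.cudaSuccess -- defined values work
try:
    cudart.cudaError_t(99999)  # not a defined CUDA error code
except ValueError as exc:
    print(exc)  # "99999 is not a valid cudaError_t"
```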
LGTM! Thanks @shwina!
rerun tests
It's great to simplify. I will leave the Python review up to Python experts -- what would you like me to review?
I think you had some review comments above, so I was curious if those were addressed from your perspective. If anything else pops out, please let us know :)
Just the one comment about why we are keeping some cudart functions in RMM.
@gpucibot merge
… CUDA Python bindings (#10008)

This PR replaces the custom CUDA bindings provided by RMM with the official CUDA Python bindings. This PR should be merged after the RMM PR rapidsai/rmm#930.

Authors:
- Ashwin Srinath (https://github.com/shwina)

Approvers:
- Jordan Jacobelli (https://github.com/Ethyling)
- Vyas Ramasubramani (https://github.com/vyasr)
- Bradley Dice (https://github.com/bdice)

URL: #10008
…451) As a follow-up to rapidsai/rmm#930, fix RAFT to rely on CUDA Python directly rather than the custom CUDA bindings that RMM provided.

Authors:
- Ashwin Srinath (https://github.com/shwina)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)
- Jordan Jacobelli (https://github.com/Ethyling)

URL: #451
Please do not merge until rapidsai/rmm#930 is merged. For the reasons described in that PR, this API has changed to accept a `cuda.cudart.cudaDeviceAttr` object, `cuda` being the official CUDA Python bindings package.

Authors:
- Ashwin Srinath (https://github.com/shwina)

Approvers:
- Brad Rees (https://github.com/BradReesWork)
- Rick Ratzel (https://github.com/rlratzel)
- AJ Schmidt (https://github.com/ajschmidt8)

URL: #2008
This PR removes many of the custom CUDA bindings we wrote in RMM to support calls to the driver/runtime APIs from Python in downstream libraries (cudf, cuml, cugraph). We should now use CUDA Python instead.

However, the module `rmm._cuda.gpu` is not being removed. It has been converted from an extension module (`.pyx`) to a regular `.py` module. This module contains high-level wrappers around raw CUDA bindings, with some niceties like converting errors to exceptions with the appropriate error message. Reimplementing that functionality in each downstream library would be a bad idea. When CUDA Python rolls its own higher-level API, we can remove the `gpu` module as well.

One API change worth mentioning here is to the function `rmm._cuda.gpu.getDeviceAttribute`. Previously, the API accepted a `cudaDeviceAttr`, a type defined as part of RMM's custom CUDA bindings. The API has now changed to accept a `cudaDeviceAttr` defined in CUDA-Python. This requires changes in downstream libraries that use this API.

I am marking this PR non-breaking as it does not affect the user-facing API. It does cause breakages in downstream libraries that are currently relying on internal APIs (from the `rmm._cuda` module).
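For downstream code, the change to `getDeviceAttribute` might look like this (a sketch; the two-argument `(attr, device)` signature is an assumption based on the description above):

```python
from cuda import cudart
from rmm._cuda import gpu

# Pass the cuda-python cudaDeviceAttr enum instead of RMM's old custom type:
major = gpu.getDeviceAttribute(
    cudart.cudaDeviceAttr.cudaDevAttrComputeCapabilityMajor, 0
)
print(f"Device 0 compute capability major version: {major}")
```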