[FEA] Consider adopting C++20 style Span class abstraction #3118
Comments
Assigning myself to this ticket. TODO: file a pull request to implement a POC for a device-side Span.
It looks like some Span implementations do bounds checking in debug builds but not in release builds, e.g. https://github.com/tcbrindle/span
Note that libcudf also has a […]. OOB accesses for […]. Furthermore, bounds checking from device code is dubious. It looks like XGBoost's […]. Instead of using […]
Interesting. I will try the built-in assertfail.
Even though it is dangerous to go without, bounds checking has performance concerns (as stated by everyone before me). So, for key algos like RF, I'd not want bounds checks in device code, at least not in release builds. Ideally, we should emphasize better unit tests; these, coupled with cuda-memcheck-enabled runs, should catch most of these issues. We have had success catching such issues in ml-prims. Sadly, we didn't spend time after the 0.14 release on enabling this flow for cuML unit tests.
Got it. In that case, we can conditionally enable bounds checks in Debug builds. Even without bounds checks, Span is a useful abstraction. For example, it eliminates the mistake of using the wrong bounds when multiple arrays are passed as function arguments. (RF, for example, has multiple functions with 5+ array arguments; it's easy to use the wrong size info.) Span keeps the array size information next to the array pointer.
I'm in favor of bounds checking. Are there any measurements for performance penalties caused by device-side bounds checking, especially for memory bandwidth-bound problems?
@canonizer I believe it varies between different cards. Also, most of the hot loops have predefined bounds, hence the overhead can be avoided manually in those loops. I'm also in favor of having span: it's not only safer, but also a nice abstraction.
I've updated the title of this issue to separate the two concerns: 1) introducing […]
I too am completely in favor of span. Just a couple of notes: […]
Thank you @jrhemstad. I think […]
Thanks for pointing to […]. cc @dantegd
Hi @jrhemstad, I looked into the […]
I also looked into the host device function […]
Wow, that's definitely wrong and definitely not a real macro. Looks like a mishmash of […]. I corrected it to […]
I'm hoping to ultimately get it into Thrust. That would be the ideal place for it to live.
@trivialfis thank you for pointing out this mistake. Looks like a perfect storm of typos/misunderstanding and a dropped unit test led to this being uncaught for a very long time. I've corrected it here: rapidsai/cudf#6696
I've opened NVIDIA/cccl#752. We'll work with the Thrust team to try and get this added in the medium term. It probably won't be ready in the short term, so if you want something sooner rather than later, copying what is in cudf is probably the best thing to do.
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
This issue has been labeled […]
Hi, I think it will take some time before Thrust adds span, and some more time after that before we can use it. Any chance we can make our own implementation in cuML as a temporary solution?
Being worked on here: rapidsai/raft#399.
@hcho3 It's merged into raft now.
Problem. #3107 was caused by an out-of-bounds access to an array. Many cuML algorithms, such as random forest, use lots of arrays, and it is easy to introduce an out-of-bounds access into the codebase by accident. In addition, an out-of-bounds access in a CUDA kernel is quite difficult to pinpoint and debug: the kernel crashes with a cudaErrorIllegalAddress error, and we'd have to manually run cuda-memcheck, which can take a while.

Proposal. We should adopt a C++20 style Span class to model arrays with defined bounds. The Span class will perform automatic bounds checks, allowing us to quickly detect and fix out-of-bounds access bugs. It also follows the fail-fast principle of software development. Furthermore, it makes a nicer abstraction for packing arrays together with their bounds information. (Think Java-style arrays, where every array has a length field.) Passing arrays between functions becomes less error-prone when we pass them as Span objects.

XGBoost has a device-side Span class (credit to @trivialfis): https://github.com/dmlc/xgboost/blob/2fcc4f2886340e66e988688473a73d2c46525630/include/xgboost/span.h#L412

Possible disadvantages. Bounds checks may introduce a performance penalty due to the extra branching. My opinion is that the benefit of automatic bounds checking (fewer bugs, improved developer productivity) outweighs a slight performance penalty. The performance impact can be mitigated by supplying a data() method on the Span class. For performance-critical loops, we can extract the raw pointer from the Span and use it directly, avoiding the bounds-checking overhead of operator[]. This should be done sparingly, and only for small, tight loops where the bounds of the loop variable are clearly established.