
[FEA] Consider adopting C++20 style Span class abstraction #3118

Closed
hcho3 opened this issue Nov 5, 2020 · 21 comments
Assignees: hcho3
Labels: CUDA / C++ (CUDA issue), feature request (New feature or request)

Comments

@hcho3 (Contributor) commented Nov 5, 2020

Problem.

#3107 was caused by an out-of-bounds access to an array. Many cuML algorithms, such as random forest, use lots of arrays, and it is easy to introduce an out-of-bounds access into the codebase by accident. In addition, an out-of-bounds access in a CUDA kernel is quite difficult to pinpoint and debug: the kernel crashes with a cudaErrorIllegalAddress error, and we then have to manually run cuda-memcheck, which can take a while.

Proposal. We should adopt a C++20-style Span class to model arrays with defined bounds. The Span class will perform automatic bounds checks, allowing us to quickly detect and fix out-of-bounds access bugs; it also follows the fail-fast principle of software development. Furthermore, it is a nicer abstraction because it packs arrays together with their bounds information. (Think Java-style arrays, where every array has a length field.) Passing arrays between functions will become less error-prone when we pass them as Span objects.

XGBoost has a device-side Span class (credit to @trivialfis): https://github.com/dmlc/xgboost/blob/2fcc4f2886340e66e988688473a73d2c46525630/include/xgboost/span.h#L412

Possible disadvantages. Bounds checks may introduce a performance penalty due to the extra branching. My opinion is that the benefit of automatic bounds checking (fewer bugs, improved developer productivity) outweighs a slight performance penalty. The performance impact can be mitigated by supplying a data() method on the Span class: for performance-critical loops, we can extract the raw pointer from the Span and use it directly, avoiding the overhead of bounds checking in operator[]. This should be done sparingly, and only for small tight loops where the bounds of the loop variable are clearly identified.
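
To make the proposal concrete, here is a minimal sketch of what such a device-side Span might look like, assuming debug-only checking via CUDA's device-side assert. The class name and details are hypothetical, not the implementation that was ultimately adopted:

```cpp
// Minimal sketch only -- device_span here is a hypothetical name, not the
// class that cuML/RAFT ultimately adopted. Bounds checking is debug-only,
// via CUDA's device-side assert().
#include <cassert>
#include <cstddef>

template <typename T>
class device_span {
 public:
  __host__ __device__ device_span(T* data, std::size_t size)
      : data_(data), size_(size) {}

  // Checked element access: the assert compiles away when NDEBUG is defined
  // (release builds), so only debug builds pay for the branch.
  __device__ T& operator[](std::size_t idx) const {
    assert(idx < size_ && "device_span: index out of bounds");
    return data_[idx];
  }

  // Escape hatch for performance-critical loops: raw, unchecked access.
  __host__ __device__ T* data() const { return data_; }
  __host__ __device__ std::size_t size() const { return size_; }

 private:
  T* data_;
  std::size_t size_;
};
```

A kernel would index through the span with operator[] in most code and drop down to data() only in the small tight loops mentioned above.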

hcho3 added the "? - Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels on Nov 5, 2020
@hcho3 (Contributor, Author) commented Nov 5, 2020

Assigning myself to this ticket. TODO: file a pull request implementing a proof of concept for a device-side Span class in cuML.

@hcho3 hcho3 self-assigned this Nov 5, 2020
hcho3 added the "CUDA / C++" (CUDA issue) label and removed the "? - Needs Triage" (Need team to review and classify) label on Nov 5, 2020
@JohnZed (Contributor) commented Nov 5, 2020

It looks like some Span implementations do bounds checking in debug builds but not release builds, e.g. https://github.com/tcbrindle/span
Would this be desirable behavior to mitigate performance concerns? Frequent bounds checks do seem expensive to me unless the compiler can do a lot of magic to lift them out of key loops and kernels.

@jrhemstad (Contributor) commented Nov 5, 2020

Note that libcudf also has a device_span class. I'm a big fan of using span-like abstractions. We don't do any bounds checking on accesses (debug or otherwise). Doing checks in debug mode with assert seems reasonable.

OOB accesses for operator[] in std::span is UB: https://en.cppreference.com/w/cpp/container/span/operator_at . UB does mean that it's perfectly acceptable to crash the program with an assert or otherwise.

Furthermore, bounds checking from device code is dubious. It looks like XGBoost's Span type uses a mixture of printf and asm("trap;"). The trap corrupts the CUDA context and is an unrecoverable error. Furthermore, the presence of a printf in a kernel, even if it's never executed, can cause noticeable performance degradation as the runtime has to make the necessary syscalls to setup the staging buffer to receive printf output from threads.

Instead of using printf, using the __assert_fail built-in is preferred. It has the same drawback as asm("trap;") in that it will corrupt the CUDA context, but at least you don't need a printf. See libcudf's usage here: https://github.com/rapidsai/cudf/blob/89b802e6cecffe2425048f1f70cd682b865730b8/cpp/include/cudf/detail/utilities/release_assert.cuh#L33-L36
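
To illustrate the debug-only policy being discussed, a hedged sketch (the macro and helper names are made up; this is not the libcudf macro linked above). As far as I know, on Linux the device-side assert() below lowers to the __device__ overload of __assert_fail, so it reports file/line information and aborts the kernel without needing a user-level printf, though the CUDA context is still unusable afterwards, just as with asm("trap;"):

```cpp
// Hypothetical debug-only check; SPAN_CHECK and checked_element are made-up
// names, not actual libcudf/RAFT identifiers.
#include <cassert>
#include <cstddef>

#ifndef NDEBUG
#define SPAN_CHECK(cond) assert(cond)           // active in debug builds
#else
#define SPAN_CHECK(cond) static_cast<void>(0)   // no branch in release builds
#endif

template <typename T>
__device__ T& checked_element(T* data, std::size_t idx, std::size_t size) {
  SPAN_CHECK(idx < size);  // fires a device-side assert only in debug builds
  return data[idx];
}
```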

@trivialfis (Member)

Interesting. I will try the built-in __assert_fail.

@teju85 (Member) commented Nov 5, 2020

Even though skipping it is dangerous, bounds checking has performance concerns (as stated by everyone before me). So, for key algos like RF, I'd not want bounds checks to be done in device code, at least not in release builds.

Ideally, we should emphasize better unit tests; those, coupled with cuda-memcheck-enabled runs, should catch most of these issues. We have had success catching such issues in ml-prims this way. Sadly, we didn't spend time after the 0.14 release on enabling this flow for the cuML unit tests.

@hcho3 (Contributor, Author) commented Nov 5, 2020

Got it. In that case, we can conditionally enable bounds checks in debug builds.

Even without bounds checks, Span is a useful abstraction. For example, it eliminates the mistake of using the wrong bounds when multiple arrays are passed as function arguments. (RF, for example, has multiple functions with 5+ array arguments; it's easy to use the wrong size information.) Span lets us keep the array size information close to the array pointer.
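
As a hypothetical before/after illustrating the point (made-up signatures, not actual cuML functions; device_span refers to the class sketched earlier in this thread):

```cpp
#include <cstddef>

template <typename T>
class device_span;  // hypothetical span type from the earlier sketch

// Before: every array travels with a separately supplied length, and it is
// easy to pair a pointer with the wrong size argument.
void build_forest(const float* features, std::size_t n_feature_values,
                  const int* labels, std::size_t n_labels,
                  int* node_assignments, std::size_t n_rows);

// After: each span carries its own extent, so pointer and size cannot drift
// apart as arguments are added, removed, or reordered.
void build_forest(device_span<const float> features,
                  device_span<const int> labels,
                  device_span<int> node_assignments);
```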

@canonizer (Contributor)

I'm in favor of bounds checking. Are there any measurements for performance penalties caused by device-side bounds checking, especially for memory bandwidth-bound problems?

@trivialfis (Member) commented Nov 5, 2020

@canonizer I believe it varies between different cards. Also, most of the hot loops have a predefined bound, so the overhead can be avoided manually in those loops (see the sketch below). I'm also in favor of having span: it's not only safer, but also a nice abstraction.
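
For example, a sketch of that manual workaround, assuming the hypothetical device_span from the earlier sketch: the bound is read once up front and the raw pointer is used inside the loop, so the per-element check in operator[] is never paid in the hot path.

```cpp
// Illustrative only; sum_row is a made-up helper.
template <typename T>
__device__ T sum_row(device_span<const T> row) {
  const T* p = row.data();           // unchecked raw pointer
  const std::size_t n = row.size();  // loop bound established once
  T acc = T(0);
  for (std::size_t i = 0; i < n; ++i) {
    acc += p[i];
  }
  return acc;
}
```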

hcho3 changed the title from "[FEA] Consider adopting C++20 style Span class to automate bounds check" to "[FEA] Consider adopting C++20 style Span class abstraction" on Nov 5, 2020
@hcho3 (Contributor, Author) commented Nov 5, 2020

I've updated the title of this issue, to separate the two concerns: 1) introducing Span in general; and 2) performing bounds check in operator[].

@teju85 (Member) commented Nov 6, 2020

I too am completely in favor of span. Just a couple of notes:

  1. Relying on this feature for the safety of our code is not enough (I have made this mistake in the past!). Rigorous testing, as well as regularly running our tests with cuda-memcheck tooling enabled, is a more robust and sure-shot way to achieve safer code.
  2. Since we already have a lot of algos with varying roofline behavior, we'll need to introduce bounds checks with a lot of care. I'd strongly prefer doing this only after we have our ASV-based reporting enabled for all of our C++ benchmarks as well.

Thank you @jrhemstad. I think the device_span class in cudf pretty much covers all our needs in cuML too!

@hcho3 (Contributor, Author) commented Nov 6, 2020

Thanks for pointing to device_span, @jrhemstad. Would it be a good idea to submit a PR to put it in RAFT, so that cuML can also use it? (I don't think cuML has direct access to libcudf's headers.)

cc @dantegd

@trivialfis (Member) commented Nov 6, 2020

Hi @jrhemstad, I looked into the __assertfail you mentioned in your previous comment and have a few questions:

  • Is __CUDACC_ARCH__ a real macro defined by nvcc?
  • Skimming through the CUDA header common_functions.h, __assertfail seems to be defined only for runtime compilation?

I also looked into the host/device function __assert_fail. There might be an nvcc bug that turns it into an undefined identifier in some device functions (4 functions in XGBoost), even though it works in all the other device functions.

@jrhemstad (Contributor)

  • Is __CUDACC_ARCH__ a real macro defined by nvcc?

Wow, that's definitely wrong and definitely not a real macro. Looks like a mishmash of __CUDACC__ and __CUDA_ARCH__. It looks like our code hasn't actually been using __assertfail for a long time.

I corrected it to __CUDA_ARCH__ and confirmed that I'm seeing undefined reference errors. I'm investigating now.

@jrhemstad (Contributor)

Thanks for pointing to device_span, @jrhemstad. Would it be a good idea to submit a PR to put it in RAFT, so that cuML can also use it? (I don't think cuML has direct access to libcudf's headers.)

cc @dantegd

I'm hoping to ultimately get it into Thrust. That would be the ideal place for it to live.

@jrhemstad (Contributor)

@trivialfis thank you for pointing out this mistake. It looks like a perfect storm of typos/misunderstanding and a dropped unit test let this go uncaught for a very long time. I've corrected it here: rapidsai/cudf#6696

@jrhemstad (Contributor)

I've opened NVIDIA/cccl#752. We'll work with the Thrust team to try and get this added in the medium term. It probably won't be ready in the short term, so if you want something sooner than later, copying what is in cudf is probably the best thing to do.

@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@trivialfis (Member)

Hi, I think it will take some time before Thrust adds span, and some more time before we can use it. Any chance we can make our own implementation in cuML as a temporary solution?

@trivialfis (Member)

Being worked on here: rapidsai/raft#399 .

@trivialfis (Member)

@hcho3 It's merged into raft now.

@hcho3 hcho3 closed this as completed Apr 27, 2022