-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cuda::device::barrier_arrive tx #358
Add cuda::device::barrier_arrive tx #358
Conversation
Pull requests from external contributors require approval from a |
Please advise how to fulfill the items on the checklist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am no expert on barrier, but some general libcu++ comments
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
I am not convinced about only supporting this on sm_90+ and on shmem barriers. For a future memcpy_async_tx, we'll just do a fallback and never issue tx-based instructions for those cases, so if arrive_tx is just an arrive there (essentially discarding the tx count), that should be fine, no? I'd like our users to be able to just write the same code everywhere, have it use hardware features where available, and do a fallback everywhere else where possible (so only trap for cases where we can't tell what the correct thing to do is, like in the barrier in cluster shmem case). |
I am loving what @griwes said. Use the new hot stuff when available and fall back to the old behavior when not. Maybe we can add some warning so the user acknowledges it? |
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
904ab27
to
1863492
Compare
Good point. The backwards-compatibility story was not on my mind (yet). I have changed the code to simply arrive on previous architectures. I have updated the PR to take this into account and also updated the tests to (hopefully) compile with NVRTC.. |
Can somebody give a push to start the CI? Is there something I can do to start the CI? |
/ok to test |
@jarmak-nv Could you have a look why the CI is currently stalled? |
1863492
to
8c4f4bb
Compare
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/barrier_arrive_tx.pass.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the tests a tiny bit. The main issue is that in our current test setup we usually only launch with a single thread. Thats why I have split it up into three different tests.
I have fixed one potential unused variable warning out of pure paranoia.
I have also added a feature test macro and a test for it.
@griwes please give it a final review
@jrhemstad @ahendriksen I have only enabled the feature test macro for SM_90 Shout if you have any objections and we should relax that to SM_70
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes @miscco. I have left some comments on the feature flag.
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
But do not support any template parameter that is not equal to the default completion function.
Instead of using the PTX docs as a reference, use the C++ barrier docs as a reference.
8e3a10d
to
41303da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All feedback has been incorporated:
- Docs changes have been implemented
- one NVRTC test has been disabled
- other small changes also done.
libcudacxx/.upstream-tests/test/cuda/barrier/arrive_tx_device.runfail.cpp
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/barrier_arrive_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier.md
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
thus progress the `cuda::barrier` towards the completion of the current phase. | ||
This may complete the current phase. | ||
|
||
### Phase Completion of a `cuda::barrier` with tx-count support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Write "Modify [....linked section of standard....] of ISO/IEC IS 14882 (the C++ Standard) as follows:", then use quote (>
) to quote the standard text, then use bold to highlight the modifications, like it is done everywhere else in the docs:
- https://nvidia.github.io/libcudacxx/extended_api/memory_model.html#data-races
- https://nvidia.github.io/libcudacxx/standard_api/time_library/chrono.html
EDIT: there is no need to invent some new way of documentation; just be consistent with how libcu++ already documents these things elsewhere.
If you want to propose a new way to document these changes, then open an issue, and then it needs to be changed throughout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, lines 19-20 of this file (https://github.com/NVIDIA/cccl/pull/358/files#diff-2f7a92f8ce801caa513c4c823f3982c625d0815166ee448df9788b6f937a46a1R19-R20) currently contradict this..
It has the same interface and semantics as [
cuda::std::barrier
], with the following additional operations
I think this section needs to be moved to the top, and it has to be clarified that cuda::barrier
has the same interface but not the same semantics as cuda::std::barrier
.
It needs to say that,
- if
!(scope == thread_block_scope && __isShared(this))
, then the semantics are the same ascuda::std::barrier
, - otherwise, the semantics are that of "modify [....] of the c++ standard as follows:".
> 1. The _expected count_ is decremented by each call to `arrive`,`arrive_and_drop`**, or `cuda::device::barrier_arrive_tx`**. | ||
> 2. **The _transaction count_ is incremented by each call to `cuda::device::barrier_arrive_tx` and decremented by the completion of transaction-based asynchronous operations such as `cuda::memcpy_async_tx`.** | ||
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`. | ||
> 4. When the completion step finishes, the _expected count_ is reset to what was specified by the `expected` argument to the constructor, possibly adjusted by calls to `arrive_and_drop`, **the _transaction count_ is reset to zero,** and the next phase starts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> 4. When the completion step finishes, the _expected count_ is reset to what was specified by the `expected` argument to the constructor, possibly adjusted by calls to `arrive_and_drop`, **the _transaction count_ is reset to zero,** and the next phase starts. | |
> 4. When the completion step finishes, the _expected count_ is reset to what was specified by the `expected` argument to the constructor, possibly adjusted by calls to `arrive_and_drop`, and the next phase starts. |
Since the transaction count is zero, it does not need to be reset.
> | ||
> 1. The _expected count_ is decremented by each call to `arrive`,`arrive_and_drop`**, or `cuda::device::barrier_arrive_tx`**. | ||
> 2. **The _transaction count_ is incremented by each call to `cuda::device::barrier_arrive_tx` and decremented by the completion of transaction-based asynchronous operations such as `cuda::memcpy_async_tx`.** | ||
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`. | |
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, **`cuda::device::barrier_arrive_tx`**, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`. |
@gonzalobg we decided to merge the PR as is and add a follow up PR for the documentation fixes |
Please follow up with @ahendriksen for the remaining documentation fixes before 12.4 cut-off deadline to avoid having to revert this if fundamental issues are discovered. |
Thanks a lot for that great contribution 🎉 |
Description
Closes #357
This PR adds the
cuda::device::arrive_tx
function for shared memory barriers.Checklist