Add cuda::ptx::* namespace #574

Merged on Nov 3, 2023 (52 commits).
The changes shown below are from 6 of the 52 commits.

Commits:
- 16ad54a: Initial proof-of-concept for PTX header (ahendriksen, Oct 17, 2023)
- 9b31cc8: Add docs (ahendriksen, Oct 17, 2023)
- 229704a: Reformat docs (ahendriksen, Oct 17, 2023)
- dad93de: Use PTX wrapper in internal code (ahendriksen, Oct 17, 2023)
- 220d475: Apply suggestions from code review (ahendriksen, Oct 18, 2023)
- ae1a084: Address review comments (ahendriksen, Oct 18, 2023)
- ecbb6fe: Apply suggestions from code review (ahendriksen, Oct 18, 2023)
- cf19e53: Address review comments (ahendriksen, Oct 18, 2023)
- b159338: Merge branch 'main' into pr/ahendriksen/574 (miscco, Oct 25, 2023)
- 1d57b02: Fix typo (miscco, Oct 25, 2023)
- 21050e8: Add targeting macros and a few more helper functions (ahendriksen, Oct 18, 2023)
- 986d990: Add PTX ISA 8.3 macro (ahendriksen, Oct 25, 2023)
- 82d1b85: Improve code organization (ahendriksen, Oct 25, 2023)
- e356271: Format code (ahendriksen, Oct 25, 2023)
- bb91eb7: Fix test and ifdefs (ahendriksen, Oct 25, 2023)
- b514e2d: Update ptx.md (ahendriksen, Oct 25, 2023)
- e351c79: Use numerical PTX ISA/SM target macros (ahendriksen, Oct 25, 2023)
- 9006317: Move bulk of ptx header into detail/ptx.h (ahendriksen, Oct 25, 2023)
- 42710f9: Rename include guards (ahendriksen, Oct 25, 2023)
- 4144d43: Fix missing includes (ahendriksen, Oct 25, 2023)
- 8a609cd: Remove redundant comment (ahendriksen, Oct 25, 2023)
- 6953ea0: Rename __as_smem_ptr -> __as_ptr_smem for disambiguation (ahendriksen, Oct 25, 2023)
- eae5df6: Use uint32_t (ahendriksen, Oct 25, 2023)
- eda6d93: Update libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barri… (ahendriksen, Oct 25, 2023)
- f262f6c: Apply suggestions from code review (ahendriksen, Oct 25, 2023)
- 5701b9f: Use <nv/target> (ahendriksen, Oct 25, 2023)
- 7a54b19: Reorder PTX ISA target macros (ahendriksen, Oct 25, 2023)
- d4ec10f: Add .op (ahendriksen, Oct 25, 2023)
- dd57648: Improve backward compatibility and docs (ahendriksen, Oct 26, 2023)
- e168815: Use cuda code-blocks for syntax highlighting (ahendriksen, Oct 26, 2023)
- bd967f0: Use backward-compatible PTX spelling (ahendriksen, Oct 27, 2023)
- db33678: Use linker-error trick to enable architecture selection (ahendriksen, Oct 27, 2023)
- 6a1b36e: Use const references (ahendriksen, Oct 27, 2023)
- 594c82f: Do not name unused parameters (ahendriksen, Oct 27, 2023)
- 6b4d380: Add PTX ISA target macros for CUDA 11.X (ahendriksen, Oct 27, 2023)
- 87f300c: Use _CUDA_VPTX in barrier.h (ahendriksen, Oct 27, 2023)
- 3535036: Replace internal use of mbarrier.arrive with cuda::ptx::mbarrier_arrive (ahendriksen, Oct 27, 2023)
- 82db00d: Guard for PTX ISA version in test (ahendriksen, Oct 27, 2023)
- e9abe97: Remove __cccl_ptx_sm targeting macros (ahendriksen, Oct 27, 2023)
- f806ca0: Prevent unused compiler warnings in test (ahendriksen, Oct 27, 2023)
- 6917e60: Use extern "C" error function declaration (ahendriksen, Oct 27, 2023)
- 6a5b423: Fix wrapping of ifdef and NV_IF_TARGET for Windows (ahendriksen, Oct 27, 2023)
- d376cba: Merge branch 'main' into pr/ahendriksen/574 (miscco, Nov 1, 2023)
- 7d6d4d5: Try and fix CI issues (miscco, Nov 1, 2023)
- bd24265: Rename space_shared_cluster -> space_cluster (ahendriksen, Nov 2, 2023)
- 4f26aa2: Ensure PTX test is actually assembled (ahendriksen, Nov 2, 2023)
- 9555532: Rename test (ahendriksen, Nov 2, 2023)
- ffa1f30: Stay closer to original PTX exposure (ahendriksen, Nov 2, 2023)
- 90df5a4: Merge branch 'main' into pr/ahendriksen/574 (miscco, Nov 3, 2023)
- 8b03da3: Address review feedback (miscco, Nov 3, 2023)
- 614326b: Do not require set arch (miscco, Nov 3, 2023)
- 9e9fb70: Do not expose remote mbarrier arrive with .cta scope (ahendriksen, Nov 3, 2023)
@@ -0,0 +1,52 @@
//===----------------------------------------------------------------------===//
//
// Part of libcu++, the C++ Standard Library for your entire system,
// under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
// SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES.
//
//===----------------------------------------------------------------------===//
// UNSUPPORTED: libcpp-has-no-threads
// UNSUPPORTED: pre-sm-90

// <cuda/ptx>

#include <cuda/ptx>

#include <cuda/std/utility>

#include "concurrent_agents.h"
#include "cuda_space_selector.h"
#include "test_macros.h"

int main(int, char**)
{
  NV_DISPATCH_TARGET(
    NV_IS_HOST, (
      // Required by concurrent_agents_launch to know how many we're
      // launching. This can only be an int, because the nvrtc tests use grep
      // to figure out how many threads to launch.
      cuda_thread_count = 1;
    ),
    NV_IS_DEVICE, (
      // Do not execute. Just check if this compiles (that is: assembles) without error.
      if (false) {
        using cuda::ptx::sem_release;
        using cuda::ptx::space_shared_cluster;
        using cuda::ptx::space_shared;
        using cuda::ptx::scope_cluster;
        using cuda::ptx::scope_cta;

        __shared__ uint64_t bar;
        cuda::ptx::mbarrier_arrive_expect_tx(sem_release, scope_cta, space_shared, &bar, 1);
        cuda::ptx::mbarrier_arrive_expect_tx(sem_release, scope_cluster, space_shared, &bar, 1);

        cuda::ptx::mbarrier_arrive_expect_tx(sem_release, scope_cta, space_shared_cluster, &bar, 1);
        cuda::ptx::mbarrier_arrive_expect_tx(sem_release, scope_cluster, space_shared_cluster, &bar, 1);
      }
    )
  );

  return 0;
}
2 changes: 2 additions & 0 deletions libcudacxx/docs/extended_api.md
@@ -21,6 +21,8 @@ nav_order: 3

{% include_relative extended_api/functional.md %}

{% include_relative extended_api/ptx.md %}

[Thread Scopes]: ./extended_api/memory_model.md#thread-scopes
[Thread Groups]: ./extended_api/thread_groups.md

68 changes: 68 additions & 0 deletions libcudacxx/docs/extended_api/ptx.md
@@ -0,0 +1,68 @@
## PTX instructions

The `cuda::ptx` namespace contains functions that map one-to-one to PTX
[instructions](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html). These can be used for maximal control of the generated code, or to
experiment with new hardware features before a high-level C++ API is available.
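
Each wrapper essentially expands to a single inline-PTX (`asm`) statement. As an illustration (a hand-written sketch, not the actual `<cuda/ptx>` implementation; the helper name `my_mbarrier_arrive_expect_tx` is made up for this example), the `mbarrier_arrive_expect_tx` variant documented below for `sem_release`, `scope_cta`, and `space_shared` can be thought of as:

```cuda
#include <cstdint>

// Sketch only: roughly what one cuda::ptx wrapper variant boils down to.
// Assumes compilation for sm_90 with PTX ISA 8.0 or newer.
__device__ inline std::uint64_t my_mbarrier_arrive_expect_tx(
  std::uint64_t* addr, std::uint32_t tx_count)
{
  std::uint64_t state;
  asm volatile(
    "mbarrier.arrive.expect_tx.release.cta.shared::cta.b64 %0, [%1], %2;"
    : "=l"(state)  // returned arrival token
    : "r"(static_cast<std::uint32_t>(__cvta_generic_to_shared(addr))),
      "r"(tx_count)
    : "memory");
  return state;
}
```

In practice, call the `cuda::ptx` overloads instead of writing inline PTX by hand; the `sem`, `scope`, and `space` tag arguments select the instruction variant at compile time, and the header guards each variant with the required PTX ISA and architecture checks.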

### Shared memory barrier (mbarrier)

| Instruction | Compute capability | CUDA Toolkit |
|----------------------------------------|--------------------|--------------|
| `cuda::ptx::mbarrier_arrive_expect_tx` | 9.0 | CTK 12.4 |


#### [`cuda::ptx::mbarrier_arrive_expect_tx`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-arrive)

```cuda
template <dot_scope _Sco>
__device__ inline
uint64_t mbarrier_arrive_expect_tx(sem_release_t sem, scope_t<_Sco> scope, space_shared_t spc, uint64_t* addr, uint32_t tx_count);

template <dot_scope _Sco>
__device__ inline
void mbarrier_arrive_expect_tx(sem_release_t sem, scope_t<_Sco> scope, space_shared_cluster_t spc, uint64_t* addr, uint32_t tx_count);
```

Usage:

```cuda
#include <cuda/ptx>
#include <cuda/barrier>
#include <cooperative_groups.h>

__global__ void kernel() {
  using cuda::ptx::sem_release;
  using cuda::ptx::space_shared_cluster;
  using cuda::ptx::space_shared;
  using cuda::ptx::scope_cluster;
  using cuda::ptx::scope_cta;

  using barrier_t = cuda::barrier<cuda::thread_scope_block>;
  __shared__ barrier_t bar;
  init(&bar, blockDim.x);
  __syncthreads();

  NV_IF_TARGET(NV_PROVIDES_SM_90, (
    // The PTX wrappers take a raw pointer to the uint64_t mbarrier object;
    // obtain it from the cuda::barrier with barrier_native_handle.
    uint64_t* native_bar = cuda::device::barrier_native_handle(bar);

    // Arrive on local shared memory barrier:
    uint64_t token;
    token = cuda::ptx::mbarrier_arrive_expect_tx(sem_release, scope_cta, space_shared, native_bar, 1);
    token = cuda::ptx::mbarrier_arrive_expect_tx(sem_release, scope_cluster, space_shared, native_bar, 1);

    // Get address of remote cluster barrier:
    namespace cg = cooperative_groups;
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int other_block_rank = cluster.block_rank() ^ 1;
    uint64_t* remote_bar = cluster.map_shared_rank(native_bar, other_block_rank);

    // Sync cluster to ensure remote barrier is initialized.
    cluster.sync();

    // Arrive on remote cluster barrier:
    cuda::ptx::mbarrier_arrive_expect_tx(sem_release, scope_cta, space_shared_cluster, remote_bar, 1);
    cuda::ptx::mbarrier_arrive_expect_tx(sem_release, scope_cluster, space_shared_cluster, remote_bar, 1);
  ));
}
```


