Try #494:
bors[bot] authored Feb 1, 2020
2 parents e4cbe2f + 93c77bc commit 200ac58
Showing 10 changed files with 1,213 additions and 2 deletions.
3 changes: 3 additions & 0 deletions .gitlab-ci.yml
@@ -26,6 +26,9 @@ julia:nightly:
     - .test
   tags:
     - nvidia
+    - sm_75
+  variables:
+    CI_THOROUGH: 'true'
   allow_failure: true


1 change: 1 addition & 0 deletions docs/make.jl
@@ -25,6 +25,7 @@ function main()
     ],
     "Device" => [
         "device/cuda.md",
+        "device/wmma.md",
         "device/array.md"
     ]
 ]
178 changes: 178 additions & 0 deletions docs/src/device/wmma.md
@@ -0,0 +1,178 @@
# WMMA

This section details CUDAnative's interface to CUDA's warp matrix multiply-accumulate (WMMA) operations.
This interface enables programmatic access to Tensor Cores, a hardware feature introduced with the Volta architecture that performs mixed-precision matrix multiply-accumulate operations.

Access to WMMA using CUDAnative is available at two levels: low-level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.

Note that to use the WMMA intrinsics, you need a sufficiently recent version of Julia: `v1.4.0-DEV.666` or later.
You can check this by running the following in the REPL:
```julia
VERSION >= v"1.4.0-DEV.666"
```

!!! note

If you're running into any of the following errors while using the WMMA interfaces:
```
LLVM error: Do not know how to split the result of this operator!
```
or
```
CUDA error: a PTX JIT compilation failed (code 218, ERROR_INVALID_PTX)
ptxas application ptx input, line <line>; error : .aligned modifier required for instruction '<instr>'
```
then make sure you are running Julia v1.4.0-DEV.666 or later!

## Terminology

The WMMA operations perform a matrix multiply-accumulate.
More concretely, they calculate ``D = A \cdot B + C``, where ``A`` is an ``M \times K`` matrix, ``B`` is a ``K \times N`` matrix, and ``C`` and ``D`` are ``M \times N`` matrices.

Note that not all values of ``M``, ``N`` and ``K`` are allowed.
The tuple ``(M, N, K)`` is often called the "shape" of the multiply-accumulate operation; the examples in this document all use the shape ``(16, 16, 16)``.

The multiply-accumulate consists of the following steps, which map onto API calls as sketched after this list:
- Load the matrices ``A``, ``B`` and ``C`` from memory to registers using a WMMA load operation.
- Perform the matrix multiply-accumulate of ``A``, ``B`` and ``C`` to obtain ``D`` using a WMMA MMA operation. ``D`` is stored in hardware registers after this step.
- Store the result ``D`` back to memory using a WMMA store operation.
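
A minimal sketch of how these steps map onto the higher-level API documented below (`ptr_a` through `ptr_d`, `stride` and the configuration `conf` are placeholders; full runnable examples follow later in this document):
```julia
a_frag = WMMA.load_a(ptr_a, stride, WMMA.ColMajor, conf)  # step 1: load A, B and C
b_frag = WMMA.load_b(ptr_b, stride, WMMA.ColMajor, conf)
c_frag = WMMA.load_c(ptr_c, stride, WMMA.ColMajor, conf)

d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)           # step 2: D = A * B + C

WMMA.store_d(ptr_d, d_frag, stride, WMMA.ColMajor, conf)  # step 3: store D
```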

Note that WMMA is a warp-wide operation, which means that all threads in a warp must cooperate, and execute the WMMA operations in lockstep.
Failure to do so will result in undefined behaviour.

Each thread in a warp will hold a part of the matrix in its registers.
In WMMA parlance, this part is referred to as a "fragment".
Note that the exact mapping between matrix elements and fragments is unspecified, and subject to change in future versions.

Finally, it is important to note that the resultant ``D`` matrix can be used as a ``C`` matrix for a subsequent multiply-accumulate.
This is useful if one needs to calculate a sum of the form ``\sum_{i=0}^{n} A_i B_i``, where ``A_i`` and ``B_i`` are matrices of the correct dimension.
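
A minimal sketch of a kernel computing such a sum using the high-level API described below; the kernel name, the packed 16 × 16 tile layout and the leading dimension of 16 are assumptions for illustration:
```julia
function accumulate_kernel(a_dev, b_dev, d_dev, n)
    conf = WMMA.Config{16, 16, 16, Float32}

    # Start from an all-zero accumulator fragment
    acc_frag = WMMA.fill_c(Float32(0), conf)

    for i in 0:(n - 1)
        # Load the i-th 16×16 tile of A and B (256 elements per tile)
        a_frag = WMMA.load_a(pointer(a_dev, i * 256 + 1), 16, WMMA.ColMajor, conf)
        b_frag = WMMA.load_b(pointer(b_dev, i * 256 + 1), 16, WMMA.ColMajor, conf)

        # Feed the previous result back in as the C matrix
        acc_frag = WMMA.mma(a_frag, b_frag, acc_frag, conf)
    end

    WMMA.store_d(pointer(d_dev), acc_frag, 16, WMMA.ColMajor, conf)
    return
end
```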

## LLVM Intrinsics

The LLVM intrinsics are accessible through one-to-one Julia wrappers.
The return type of each wrapper is the Julia type that corresponds most closely to the return type of the LLVM intrinsic.
For example, LLVM's `[8 x <2 x half>]` becomes `NTuple{8, NTuple{2, VecElement{Float16}}}` in Julia.
In essence, these wrappers return the SSA values returned by the LLVM intrinsic.
Currently, all intrinsics that are available in LLVM 6, PTX 6.0 and SM 70 are implemented.
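
For example, calling one of the `Float16` load wrappers returns a value of exactly that type (a sketch; `ptr` stands for a device pointer to `Float16` data):
```julia
a_frag = WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16(ptr, 16)
typeof(a_frag)  # NTuple{8, NTuple{2, VecElement{Float16}}}
```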

These LLVM intrinsics are then lowered to the correct PTX instructions by the LLVM NVPTX backend.
For more information about the PTX instructions, please refer to the [PTX Instruction Set Architecture Manual](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions).

The LLVM intrinsics are subdivided into three categories: load, store and multiply-accumulate.
In what follows, each of these will be discussed.

### Load matrix
```@docs
CUDAnative.WMMA.llvm_wmma_load
```

### Perform multiply-accumulate
```@docs
CUDAnative.WMMA.llvm_wmma_mma
```

### Store matrix
```@docs
CUDAnative.WMMA.llvm_wmma_store
```

### Example

````@eval
lines = readlines("../../../examples/wmma/low-level.jl")
start = findfirst(x -> x == "### START", lines) + 1
stop = findfirst(x -> x == "### END", lines) - 1
example = join(lines[start:stop], '\n')
using Markdown
Markdown.parse("""
```julia
$(example)
```
""")
````

## CUDA C-like API

The main difference between the CUDA C-like API and the lower-level wrappers is that the former enforces several constraints when working with WMMA.
For example, it ensures that the ``A`` fragment argument to the MMA instruction was obtained by a `load_a` call, and not by a `load_b` or `load_c`.
Additionally, it makes sure that the data type and storage layout of the load/store operations and the MMA operation match.
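
Because these constraints are encoded in the fragment types, violating them fails at method dispatch rather than silently misbehaving. A hypothetical sketch, with the pointers, `16` arguments and `conf` as in the example at the end of this section:
```julia
b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)

# Passing the B fragment where an A fragment is expected does not
# match any mma method, so the mistake is caught at dispatch time:
WMMA.mma(b_frag, b_frag, c_frag, conf)  # fails: no matching method
```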

The CUDA C-like API heavily uses Julia's dispatch mechanism.
As such, the method names are much shorter than those of the LLVM intrinsic wrappers, as most information is baked into the types of the arguments rather than the method name.


Note that, in CUDA C++, the fragment is responsible for both the storage of intermediate results and the WMMA configuration.
All CUDA C++ WMMA calls are function templates that take the resultant fragment as a by-reference argument.
As a result, the type of this argument can be used during overload resolution to select the correct WMMA instruction to call.

In contrast, the API in Julia separates the WMMA storage ([`WMMA.Fragment`](@ref)) and configuration ([`WMMA.Config`](@ref)).
Instead of taking the resultant fragment by reference, the Julia functions just return it.
This makes the dataflow clearer, but it also means that the type of that fragment cannot be used for selection of the correct WMMA instruction.
Thus, there is still a limited amount of information that cannot be inferred from the argument types, but must nonetheless match for all WMMA operations, such as the overall shape of the MMA.
This is accomplished by a separate "WMMA configuration" (see [`WMMA.Config`](@ref)) that you create once, and then give as an argument to all intrinsics.
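
For example, the configuration for a 16 × 16 × 16 multiply-accumulate that accumulates in `Float32`, as used by the example at the end of this section, is created as follows:
```julia
conf = WMMA.Config{16, 16, 16, Float32}
```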

### Fragment
```@docs
CUDAnative.WMMA.FragmentLayout
CUDAnative.WMMA.RowMajor
CUDAnative.WMMA.ColMajor
CUDAnative.WMMA.Unspecified
CUDAnative.WMMA.Fragment
```

### WMMA configuration
```@docs
CUDAnative.WMMA.Config
```

### Load matrix
```@docs
CUDAnative.WMMA.load_a
CUDAnative.WMMA.load_b
CUDAnative.WMMA.load_c
```

### Perform multiply-accumulate
```@docs
CUDAnative.WMMA.mma
```

### Store matrix
```@docs
CUDAnative.WMMA.store_d
```

### Fill fragment
```@docs
CUDAnative.WMMA.fill_c
```

### Element access and broadcasting

Similar to the CUDA C++ WMMA API, [`WMMA.Fragment`](@ref)s have an `x` member that can be used to access individual elements.
Note that, in contrast to the values returned by the LLVM intrinsics, the `x` member is flattened.
For example, while the `Float16` variants of the `load_a` intrinsics return `NTuple{8, NTuple{2, VecElement{Float16}}}`, the `x` member has type `NTuple{16, Float16}`.
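
For example, for an ``A`` fragment loaded from `Float16` data (a sketch, with `a_frag` as in the high-level example below):
```julia
a_frag.x     # ::NTuple{16, Float16}: the flattened per-thread data
a_frag.x[1]  # the first element held by this thread
```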

Typically, you will only need to access the `x` member to perform elementwise operations.
This can be more succinctly expressed using Julia's broadcast mechanism.
For example, to double each element in a fragment, you can simply use:
```julia
frag = 2.0f0 .* frag
```

### Example

````@eval
lines = readlines("../../../examples/wmma/high-level.jl")
start = findfirst(x -> x == "### START", lines) + 1
stop = findfirst(x -> x == "### END", lines) - 1
example = join(lines[start:stop], '\n')
using Markdown
Markdown.parse("""
```julia
$(example)
```
""")
````
46 changes: 46 additions & 0 deletions examples/wmma/high-level.jl
@@ -0,0 +1,46 @@
# Need https://github.com/JuliaLang/julia/pull/33970
# and https://github.com/JuliaLang/julia/pull/34043
if VERSION < v"1.4.0-DEV.666"
exit()
end

using CUDAnative
if CUDAnative.current_capability() < v"7.0"
exit()
end

### START
using CUDAnative
using CuArrays
using Test

# Generate input matrices
a = rand(Float16, (16, 16))
b = rand(Float16, (16, 16))
c = rand(Float32, (16, 16))

# Upload the inputs and allocate space for the result
a_dev = CuArray(a)
b_dev = CuArray(b)
c_dev = CuArray(c)
d_dev = similar(c_dev)

# Matrix multiply-accumulate kernel (D = A * B + 0.5 * C)
function kernel(a_dev, b_dev, c_dev, d_dev)
    conf = WMMA.Config{16, 16, 16, Float32}

    # Load the inputs into fragments
    a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)

    # Scale the C fragment elementwise using broadcast
    c_frag = 0.5f0 .* c_frag

    # Perform the multiply-accumulate
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    # Store the result
    WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)

    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
d = Array(d_dev)

@test all(isapprox.(a * b + 0.5 * c, d; rtol=0.01))
### END
42 changes: 42 additions & 0 deletions examples/wmma/low-level.jl
@@ -0,0 +1,42 @@
# Need https://github.com/JuliaLang/julia/pull/33970
# and https://github.com/JuliaLang/julia/pull/34043
if VERSION < v"1.4.0-DEV.666"
exit()
end

using CUDAnative
if CUDAnative.current_capability() < v"7.0"
exit()
end

### START
using CUDAnative
using CuArrays
using Test

# Generate input matrices
a = rand(Float16, (16, 16))
a_dev = CuArray(a)
b = rand(Float16, (16, 16))
b_dev = CuArray(b)
c = rand(Float32, (16, 16))
c_dev = CuArray(c)

# Allocate space for result
d_dev = similar(c_dev)

# Matrix multiply-accumulate kernel (D = A * B + C)
function kernel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k16_stride_f16(pointer(b_dev), 16)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f32(pointer(c_dev), 16)

    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)

    WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32(pointer(d_dev), d_frag, 16)
    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
@test all(isapprox.(a * b + c, Array(d_dev); rtol=0.01))
### END
1 change: 1 addition & 0 deletions src/device/cuda.jl
@@ -11,6 +11,7 @@ include("cuda/assertion.jl")
include("cuda/memory_dynamic.jl")
include("cuda/atomics.jl")
include("cuda/misc.jl")
include("cuda/wmma.jl")

# functionality from libdevice
#
5 changes: 3 additions & 2 deletions src/device/cuda/memory_shared.jl
@@ -83,8 +83,9 @@ end
         initializer!(gv, null(gv_typ))
     end
     # by requesting a larger-than-datatype alignment, we might be able to vectorize.
-    # we pick 16 bytes since this is the largest transaction size as supported by PTX.
-    alignment!(gv, Base.max(16, datatype_align(T)))
+    # we pick 32 bytes here, since WMMA instructions require 32-byte alignment.
+    # TODO: Make the alignment configurable
+    alignment!(gv, Base.max(32, datatype_align(T)))

# generate IR
Builder(JuliaContext()) do builder