extensions/khr/GLSL_KHR_cooperative_matrix.txt

Name
GL_KHR_cooperative_matrix

Contact
Jeff Bolz, NVIDIA (jbolz 'at' nvidia.com)

Contributors
Kevin Petit, Arm (kevin.petit 'at' arm.com)
Neil Hickey, Arm (neil.hickey 'at' arm.com)

Notice
Copyright (c) 2022-2023 The Khronos Group Inc. Copyright terms at http://www.khronos.org/registry/speccopyright.html

Status
Complete.

Version
Last Modified: July 21, 2023
Revision: 1


Dependencies
This extension can be applied to OpenGL GLSL versions 4.50
(#version 450) and higher.
This extension can be applied to OpenGL ES ESSL versions 3.20
(#version 320) and higher.
All these versions map GLSL/ESSL semantics to the same SPIR-V 1.5 semantics (approximating the most recent versions of GLSL/ESSL).

This extension interacts with physical_storage_buffer,EXT_shader_explicit_arithmetic_types

Overview

This extension adds a new set of types known as "cooperative matrix" types,
where the storage for and computations performed on the matrix are spread
across a set of invocations such as a subgroup. These types give the
implementation freedom in how to optimize matrix multiply operations.

This extension introduces the types and built-in functions, but does not
specify rules about what sizes/combinations are valid. This is left to
the Vulkan extension specifications, and it is expected that different
implementations may support different sizes. To help accommodate this,
the dimensions of the cooperative types are parameterized and can be
specialized via specialization constants.

This extension introduces limited support for parameterized types, with
the parameters specified as in C++ template syntax. The new built-in type
"coopmat" is the only type that can be parameterized, and its parameters
are a scalar type for the component type, an integer value that controls
the scope of the type, the number of rows and columns in the matrix, and
which matrix accumulation argument the matrix can be.

Cooperative matrix types are only supported in certain shader stages, and
the supported stages can be queried from the API. There are no compile-time
checks to disallow cooperative matrix types in any shader stage.

Mapping to SPIR-V
-----------------

For informational purposes (non-normative), the following is an
expected way for an implementation to map GLSL constructs to SPIR-V
constructs:

coopmat -> OpTypeCooperativeMatrixKHR
coopmat constructor from scalar value -> OpConstantComposite
coopmat constructor from coopmat -> Op*Convert, OpConvert*To*
coopmat.length() -> OpCooperativeMatrixLength
coopmat[i] -> OpCompositeExtract/OpCompositeInsert/OpAccessChain
+, -, *, / -> OpFAdd, OpFNegate/OpFSub, OpFMul/OpMatrixTimesScalar, OpFDiv
              (similarly for integer types)
coopmatLoad -> OpCooperativeMatrixLoadKHR
coopmatStore -> OpCooperativeMatrixStoreKHR
coopmatMulAdd -> OpCooperativeMatrixMulAddKHR

Modifications to the OpenGL Shading Language Specification, Version 4.60

Including the following line in a shader can be used to control the
language features described in this extension:
#extension GL_KHR_cooperative_matrix : <behavior>
where <behavior> is as specified in section 3.3.
New preprocessor #defines are added to the OpenGL Shading Language:

#define GL_KHR_cooperative_matrix 1

Modify Section 3.6, Keywords
(add to list of keywords)
coopmat

Add a new Section 4.1.X, Cooperative Matrix Types

Cooperative matrix types are matrix types where the storage for, and
computations performed on, the matrix are spread across a set of
invocations such as a subgroup. These types give the implementation
freedom in how to optimize matrix multiply operations.

Cooperative matrices (coopmat) are supported in
the language, and are parameterized by five type parameters: type
per component, scope, rows, columns, and use. The parameters are specified
in order between angle brackets ('<' and '>') and are comma-separated.
The scope, rows, and columns parameters can be constant expressions or
specialization constant expressions, and no error checking is performed
on their values at compile time. It is left to the Vulkan specification
to define what combinations of values are valid post-specialization.
The type per component must be a scalar numerical type.

Example cooperative matrix declarations:

coopmat<float32_t, gl_ScopeSubgroup, 8, 8, gl_MatrixUseAccumulator> mat1; // float32, subgroup scope, 8 rows, 8 columns, Accumulator operand
coopmat<float16_t, gl_ScopeSubgroup, 16, 8, gl_MatrixUseA> mat2; // float16, subgroup scope, 16 rows, 8 columns, A operand
layout(constant_id = 0) const int scope = 0;
layout(constant_id = 1) const int rows = 0;
layout(constant_id = 2) const int cols = 0;
coopmat<float16_t, scope, rows, cols, gl_MatrixUseB> mat3; // scope/rows/columns specified at pipeline creation time

Cooperative matrix types can be used as global variables, local
variables, function parameters, and function return values. They must not
be used in uniform, buffer, or shared memory, or in input/output storage
classes.

There are no implicit type conversions between cooperative matrix types.

gl_MatrixUse* are constant integer values which can be used for the MatrixUse template
parameter in cooperative matrix types. These control which operand of coopMatMulAdd the
type can be used as.

    const int gl_MatrixUseA             = 0;
    const int gl_MatrixUseB             = 1;
    const int gl_MatrixUseAccumulator   = 2;

Add a new Section 5.4.X, Cooperative Matrix Type Constructors

Cooperative matrices can be constructed from a single scalar value whose
type matches the matrix's component type (or any value that can be
implicitly converted to that type). This initializes all components of the
matrix to that same value.

Cooperative matrices can be constructed from another cooperative matrix
type with the same scope, number of rows, number of columns, and use.
This performs a component-wise type conversion to initialize the new cooperative matrix.

Add a new Section 5.X, Cooperative Matrix Components

The components of a cooperative matrix are spread across the invocations
in its scope, in an implementation-dependent manner. The components owned
by a given invocation can be accessed using array subscripting syntax,
and the number of components owned by each invocation can be queried
using the *length* method. The type returned by *length* is an int.
There is no compile-time bounds checking of array indices.

This can be used, for example, to perform component-wise operations on
all components of a cooperative matrix:

coopmat<float16_t, gl_ScopeSubgroup, 16, 8, gl_MatrixUseAccumulator> m;
...
for (int i = 0; i < m.length(); ++i) {
	m[i] = f(m[i]);
}

Modify Section 5.9, Expressions

The arithmetic binary operators add (+), subtract (-), multiply (*), and
divide (/) operate on cooperative matrix types and perform the operation
component-wise. The operands must have identical types.

The arithmetic binary operator multiply (*) also operates on a cooperative
matrix type and a scalar (in either order) and performs the multiply
component-wise. The scalar type must match the component type of the
matrix.

The arithmetic unary operator negate (-) operates on cooperative matrix
types and performs the operation component-wise.

Conversions are allowed between cooperative matrix types assuming the
element types, scope, row size, column size, and use are the same.

Add a new Section 8.X, Cooperative Matrix Functions

The following functions are used to load and store cooperative matrix
values from and to memory. In memory, the matrices are stored as arrays
of their element type and size. In the following functions, the generic type
"coopmat" can accept a coopmat type with any type parameters. The
"buf" arrays must be in either buffer storage or shared storage, and the
array that is passed in can be sized or unsized.

For all of these functions, for a given dynamic instance of the function
call, all function parameters must be the same for all invocations in a
given scope instance (where the scope is the scope the cooperative matrix
type(s) were created with). All invocations in a given scope instance must
be active or all must be inactive.

void coopMatLoad(out coopmat m, volatile coherent int8_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent int16_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent int32_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent int64_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent uint8_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent uint16_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent uint32_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent uint64_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent float16_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent float[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent float64_t[] buf, uint element, uint stride, int matrixLayout);

void coopMatLoad(out coopmat m, volatile coherent i8vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent i16vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent i32vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent i64vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent u8vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent u16vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent u32vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent u64vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent f16vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent f32vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent f64vec2[] buf, uint element, uint stride, int matrixLayout);

void coopMatLoad(out coopmat m, volatile coherent i8vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent i16vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent i32vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent i64vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent u8vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent u16vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent u32vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent u64vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent f16vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent f32vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatLoad(out coopmat m, volatile coherent f64vec4[] buf, uint element, uint stride, int matrixLayout);

Description: Load a cooperative matrix from buf. matrixLayout indicates
the layout of the matrix values in memory.

gl_CooperativeMatrixLayout* are constant integer values which can be used for
the matrixLayout parameter in the load/store functions.

    const int gl_CooperativeMatrixLayoutRowMajor             = 0;
    const int gl_CooperativeMatrixLayoutColumnMajor          = 1;

If matrixLayout is gl_CooperativeMatrixLayoutRowMajor, then elements (row,*) of the result are taken in
order from contiguous locations starting at buf[element + row*stride].
If matrixLayout is gl_CooperativeMatrixLayoutColumnMajor is true, then elements (*,col) of the result are taken in
order from contiguous locations starting at buf[element + col*stride].
The memory locations for other layouts are defined by those extensions that introduce the layouts.

void coopMatStore(coopmat m, volatile coherent out int8_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out int16_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out int32_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out int64_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out uint8_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out uint16_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out uint32_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out uint64_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out float16_t[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out float[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out float64_t[] buf, uint element, uint stride, int matrixLayout);

void coopMatStore(coopmat m, volatile coherent out i8vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out i16vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out i32vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out i64vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out u8vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out u16vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out u32vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out u64vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out f16vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out f32vec2[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out f64vec2[] buf, uint element, uint stride, int matrixLayout);

void coopMatStore(coopmat m, volatile coherent out i8vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out i16vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out i32vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out i64vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out u8vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out u16vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out u32vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out u64vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out f16vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out f32vec4[] buf, uint element, uint stride, int matrixLayout);
void coopMatStore(coopmat m, volatile coherent out f64vec4[] buf, uint element, uint stride, int matrixLayout);

Description: Store a cooperative matrix to buf. matrixLayout indicates
the layout of the matrix values in memory.

If matrixLayout is gl_CooperativeMatrixLayoutRowMajor, then elements (row,*) of m are stored in order to
contiguous locations starting at buf[element + row*stride].
If matrixLayout is gl_CooperativeMatrixLayoutColumnMajor, then elements (*,col) of m are stored in order to
contiguous locations starting at buf[element + col*stride].
The memory locations for other layouts are defined by those extensions that introduce the layouts.

coopmat coopMatMulAdd(coopmat A, coopmat B, coopmat C, int matrixOperands = 0);

Description: Linear-algebraic matrix multiply of A by B and then
component-wise add C. The order of the operations is implementation
dependent. The internal precision of the operations is defined by the
Vulkan specification.

The dimensions of A, B, and C, must form a valid matrix multiply (e.g.
the number of columns of A must match the number of rows of B). A, B,
and C must have the same scope. A's type must use gl_MatrixUseA.
B's type must use gl_MatrixUseB. C's type must use gl_MatrixUseAccumulator.
The type of the result matches the type of C.

gl_MatrixOperands* are constant integer values which can be used for the matrixOperands
parameter in coopMatMulAdd.

    const int gl_MatrixOperandsSaturatingAccumulation   = 0x10;

The behavior of gl_MatrixOperandsSaturatingAccumulation and whether it is supported
is documented in the Vulkan specification.

Modify Section 9, Shading Language Grammar for Core Profile
(Add to tokens list)
COOPMAT

(modify type_specifier to add type_parameter_specifier_opt)

type_specifier:
type_specifier_nonarray type_parameter_specifier_opt
type_specifier_nonarray type_parameter_specifier_opt array_specifier

(new rules)

type_parameter_specifier_opt:
type_parameter_specifier
/*empty*/

type_parameter_specifier:
LEFT_ANGLE type_parameter_specifier_list RIGHT_ANGLE

type_parameter_specifier_element:
type_specifier
unary_expression

type_parameter_specifier_list:
type_parameter_specifier_element
type_parameter_specifier_list COMMA type_parameter_specifier_element


Interactions with GL_EXT_shader_explicit_arithmetic_types

If GL_EXT_shader_explicit_arithmetic_types_float16 is not supported,
remove the coopMatLoad/coopMatStore overloads that use float16_t.

If GL_EXT_shader_explicit_arithmetic_types_float64 is not supported,
remove the coopMatLoad/coopMatStore overloads that use float64_t.

If GL_EXT_shader_explicit_arithmetic_types_int8 is not supported,
remove the coopMatLoad/coopMatStore overloads that use int8_t or uint8_t.

If GL_EXT_shader_explicit_arithmetic_types_int16 is not supported,
remove the coopMatLoad/coopMatStore overloads that use int16_t or uint16_t.

If GL_EXT_shader_explicit_arithmetic_types_int64 is not supported,
remove the coopMatLoad/coopMatStore overloads that use int64_t or uint64_t.

Issues

(1) What are the grammar rules for type parameters?

DISCUSSION: C++ template syntax has a parsing problem, because the
rules allow a "conditional_expression" for the template parameters,
which creates an ambiguity (shift/reduce conflict) where the parser
can't easily tell whether a '>' is a greater-than operator or the end
of the type parameter list. This means it's hard to parse something
like

coopmat<float16, gl_ScopeSubgroup, 16, A>B?16:8, gl_MatrixUseAccumulator>

because it's unclear that the columns parameter is a ternary expression
without looking ahead. The obvious way to make this example more clear
is to add parentheses:

coopmat<float16, gl_ScopeSubgroup, 16, (A>B?16:8), gl_MatrixUseAccumulator>

This can be parsed as a "unary_expression" rather than
"conditional_expression", and doesn't really lose any flexibility
because unary_expression indirectly includes the pretty general

"LEFT_PAREN expression RIGHT_PAREN" rule.

RESOLVED: We diverge from the C++ grammar and use unary_expression
for type parameters rather than conditional_expression.

(2) What alignment rules should we have for buf/element/stride parameters
in the load/store built-in functions?

RESOLVED: The Vulkan SPIR-V environment appendix is responsible for
documenting this. To summarize, the start of the matrix and the stride
must be at least as aligned as the smaller of 16B or the size of a
row/column of the matrix.

(3) For the load/store functions, can the component type mismatch the array
element type?

RESOLVED: Yes, this makes it easier to efficiently load matrix data into
shared memory. The stride parameter is interpreted in units of the
pointed-to type, not in units of the matrix's component type. This
extension includes overloads for 8 through 64-bit integers, and
uvec2/uvec4.


Revision History

Revision 1

- Internal revisions.