Name
GL_NV_cooperative_vector
Contact
Jeff Bolz, NVIDIA (jbolz 'at' nvidia.com)
Contributors
Karthik Vaidyanathan, NVIDIA
Yury Uralsky, NVIDIA
Sean Treichler, NVIDIA
Eric Werness, NVIDIA
Status
Complete
Version
Last Modified: January 30, 2025
Revision: 1
Dependencies
This extension can be applied to OpenGL GLSL versions 4.50
(#version 450) and higher.
This extension can be applied to OpenGL ES ESSL versions 3.20
(#version 320) and higher.
All these versions map GLSL/ESSL semantics to the same SPIR-V 1.5 semantics (approximating the most recent versions of GLSL/ESSL).
This extension interacts with physical_storage_buffer,
EXT_shader_explicit_arithmetic_types, and GL_KHR_cooperative_matrix.
Overview
This extension adds a new set of types known as "cooperative vector" types.
Unlike cooperative matrix types, a variable with a cooperative vector type
is logically stored in the invocation it belongs to, but invocations can
opportunistically cooperate behind the scenes when performing matrix-vector
multiplies. Cooperative
vectors don't require a fully occupied subgroup or uniform control flow like
cooperative matrices, although these do increase the likelihood of being on
the fast path. And unlike normal vector types, they have arbitrary length
and support a relatively limited set of operations. These types are intended
to help accelerate the evaluation of small neural networks, where each
invocation is performing its own independent evaluation of the network.
This extension introduces the types and built-in functions, but does not
specify rules about what types are supported. This is left to the Vulkan
extension specifications, and it is expected that different implementations
may support different types.
This extension relies on the parameterized type support from
GL_KHR_cooperative_matrix, but cooperative vectors and cooperative matrices
don't directly interact with each other, and cooperative matrix support is
not required for this extension. The new built-in type "coopvecNV" can be
parameterized, and its parameters are a scalar type for the component type
and an integer number of elements in the vector.
Cooperative vector types may only be supported in certain shader stages, and
the supported stages can be queried from the API. There are no compile-time
checks to disallow cooperative vector types in any shader stage.
Mapping to SPIR-V
-----------------
For informational purposes (non-normative), the following is an
expected way for an implementation to map GLSL constructs to SPIR-V
constructs:
coopvecNV -> OpTypeCooperativeVectorNV
coopvecNV constructor from components -> OpConstantComposite, OpCompositeConstruct
coopvecNV constructor from coopvecNV -> Op*Convert, OpConvert*To*
coopvecNV.length() -> OpConstant
coopvecNV[i] -> OpCompositeExtract/OpCompositeInsert/OpAccessChain
+, -, *, / -> OpFAdd, OpFNegate/OpFSub, OpFMul/OpVectorTimesScalar, OpFDiv
coopVecMatMulNV -> OpCooperativeVectorMatrixMulNV
coopVecMatMulAddNV -> OpCooperativeVectorMatrixMulAddNV
coopVecOuterProductAccumulateNV -> OpCooperativeVectorOuterProductAccumulateNV
coopVecReduceSumAccumulateNV -> OpCooperativeVectorReduceSumAccumulateNV
coopVecLoadNV -> OpCooperativeVectorLoadNV
coopVecStoreNV -> OpCooperativeVectorStoreNV
Modifications to the OpenGL Shading Language Specification, Version 4.60
Including the following line in a shader can be used to control the
language features described in this extension:
#extension GL_NV_cooperative_vector : <behavior>
where <behavior> is as specified in section 3.3.
New preprocessor #defines are added to the OpenGL Shading Language:
#define GL_NV_cooperative_vector 1
Modify Section 3.6, Keywords
(add to list of keywords)
coopvecNV
Add a new Section 4.1.X, Cooperative Vector Types
Cooperative vector types are vector types with an arbitrary number of
components, and are optimized for matrix-vector multiplies.
Cooperative vectors (coopvecNV) are supported in the language, and are
parameterized by two type parameters: type per component, and number
of components. The parameters are specified in order between angle
brackets ('<' and '>') and are comma-separated. The number of
components can be a constant expression or specialization constant
expression, and no error checking is performed on the value at
compile time. The type per component must be a scalar numerical type.
It is left to the Vulkan specification to define which types are valid.
Example cooperative vector declaration:
coopvecNV<float16_t, 6> vec1; // float16, 6 components
Cooperative vector types can be used as global variables, local
variables, function parameters, and function return values. They must not
be used in uniform, buffer, or shared memory, or in input/output storage
classes.
There are no implicit type conversions between cooperative vector types.
Add a new Section 5.4.X, Cooperative Vector Type Constructors
Cooperative vectors can be constructed from a single scalar value whose
type matches the vector's component type (or any value that can be
implicitly converted to that type). This initializes all components of the
vector to that same value.
Cooperative vectors can be constructed from another cooperative vector
type with the same number of components. This performs a component-wise
type conversion to initialize the new cooperative vector.
Cooperative vectors can be constructed from a set of scalars, vectors,
or matrices, with rules similar to vector constructors, as defined in
Section 5.4.2. Vector and Matrix Constructors.
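The three constructor forms above can be modelled on the CPU. The following is a minimal, non-normative Python sketch (not GLSL) in which a cooperative vector is represented as a plain list. The helper names splat, convert, and from_components are hypothetical, and the requirement that the arguments supply exactly n components is an assumption of this sketch rather than a statement of the Section 5.4.2 rules.

```python
def splat(value, n):
    """Scalar constructor: every one of the n components gets the same value."""
    return [value] * n


def convert(vec, component_type):
    """Constructor from another cooperative vector: component-wise conversion.

    The component count is unchanged, as required for this constructor form.
    """
    return [component_type(x) for x in vec]


def from_components(n, *args):
    """Constructor from scalars and vectors (vectors modelled as lists/tuples).

    Arguments are consumed left to right and flattened. Assumption of this
    sketch: they must supply exactly n components.
    """
    flat = []
    for arg in args:
        if isinstance(arg, (list, tuple)):
            flat.extend(arg)
        else:
            flat.append(arg)
    if len(flat) != n:
        raise ValueError(f"expected {n} components, got {len(flat)}")
    return flat
```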
Add a new Section 5.X, Cooperative Vector Components
The components of a cooperative vector are logically owned by the
invocation that declared it, and invocations cannot directly observe each
other's cooperative vector values. The components can be accessed using
array subscripting syntax, and the number of components in the vector can
be queried using the *length* method. The type returned by *length* is an
int. There is no compile-time bounds checking of array indices.
This can be used, for example, to perform component-wise operations on
all components of a cooperative vector:
coopvecNV<float16_t, 6> v;
...
for (int i = 0; i < v.length(); ++i) {
    v[i] = f(v[i]);
}
Note that component-wise access may be suboptimal, and performing vector
operations on the cooperative vector type is strongly preferred.
Modify Section 5.9, Expressions
The arithmetic binary operators add (+), subtract (-), multiply (*), and
divide (/) operate on cooperative vector types and perform the operation
component-wise. The operands must have identical types.
The arithmetic binary operator multiply (*) also operates on a cooperative
vector type and a scalar (in either order) and performs the multiply
component-wise. The scalar type must match the component type of the
vector.
The arithmetic unary operator negate (-) operates on cooperative vector
types and performs the operation component-wise.
The bitwise operators &, ^, |, and ~, and the shift operators << and >>
are supported for integer cooperative vector types, and perform the
operation component-wise.
Conversions are allowed between cooperative vector types with the same
number of components.
The built-in functions fma, exp, log, tanh, atan, min, max, clamp, and step
are supported for a cooperative vector type if the function is supported
for the vector's component type. The operation is performed component-wise.
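The component-wise rules in this section can be illustrated with a short CPU-side model. This Python sketch is non-normative and is not GLSL. Cooperative vectors are modelled as lists, the helper names are hypothetical, and "identical types" is approximated by checking that both operands have the same length.

```python
import math
import operator


def componentwise(op, a, b):
    """Apply a binary operator to matching components of two vectors."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of components")
    return [op(x, y) for x, y in zip(a, b)]


def scale(s, v):
    """Scalar times vector (either order in GLSL): scale every component."""
    return [s * x for x in v]


def negate(v):
    """Unary minus, applied per component."""
    return [-x for x in v]


def map_builtin(fn, v):
    """Built-ins such as exp, log, or tanh act on each component independently."""
    return [fn(x) for x in v]
```

For example, componentwise(operator.add, a, b) models a + b, and componentwise(operator.and_, a, b) models a & b for integer component types.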
Add a new Section 8.X, Cooperative Vector Functions
The following functions perform a matrix-vector multiplication using a matrix
loaded from memory and a vector passed as a parameter. The input vector has K logical
components and is left-multiplied by an MxK matrix to produce a result with
M components that is stored in the output parameter 'result'. One function
also loads a 'bias' vector with M components from memory, which is added to
the product before it is stored in 'result'.
void coopVecMatMulAddNV(out coopvecNV<ResultTy, ResultComps> result,
                        coopvecNV<InputTy, InputComps> input,
                        int inputInterpretation,
                        const MatrixTy[] matrix,
                        uint matrixOffset,
                        int matrixInterpretation,
                        const BiasTy[] bias,
                        uint biasOffset,
                        int biasInterpretation,
                        uint M,
                        uint K,
                        int matrixLayout,
                        bool transpose,
                        uint matrixStride);
void coopVecMatMulNV(out coopvecNV<ResultTy, ResultComps> result,
                     coopvecNV<InputTy, InputComps> input,
                     int inputInterpretation,
                     const MatrixTy[] matrix,
                     uint matrixOffset,
                     int matrixInterpretation,
                     uint M,
                     uint K,
                     int matrixLayout,
                     bool transpose,
                     uint matrixStride);
Description: Linear-algebraic matrix multiply of an MxK matrix by a
K-component column vector input. For coopVecMatMulAddNV, the bias is added
to the result. The order of the operations is implementation-dependent.
The internal precision of the operations is defined by the Vulkan
specification.
The input, matrix, and bias each have a physical storage type (InputTy,
MatrixTy, BiasTy) and an "interpretation" parameter that specifies how the
values are interpreted. The interpretation parameters take gl_ComponentType*
values, and the behavior and interactions between physical types and
interpretations is as specified below.
ResultTy is the actual type of the result (no reinterpretation), and must be
float16_t, float32_t, uint32_t, or int32_t. ResultComps must equal M.
The input vector is converted to the type indicated by inputInterpretation.
This conversion step allows the input type to be converted to a smaller type
that the shading language may not natively support. Non-"Packed" types are
used to request arithmetic conversions. "Packed" types are used to request
a bitcast conversion, e.g. if the shader wants to convert to the smaller
type manually before the call.
If the inputInterpretation is not a Packed enum value, then the conversion
is an arithmetic conversion. InputTy must be float16_t, float32_t,
uint32_t, or int32_t. Integer to integer conversion saturates. Float to
float conversion is implementation-dependent but preserves the value as
accurately as reasonably possible. Float to integer conversion is
round-to-nearest-even and saturating. Integer to float conversion is
round-to-nearest-even.
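The integer-related conversion rules above can be written out as a small CPU-side model. This Python sketch is illustrative only: the function names are hypothetical, float to float conversion (which is implementation-dependent) is not modelled, and NaN or infinite inputs are outside its scope.

```python
def int_range(bits, signed):
    """Return (lo, hi) for a two's-complement or unsigned integer type."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def int_to_int(value, bits, signed):
    """Integer to integer: clamp (saturate) to the destination range."""
    lo, hi = int_range(bits, signed)
    return max(lo, min(hi, value))


def float_to_int(value, bits, signed):
    """Float to integer: round to nearest, ties to even, then saturate.

    Python's built-in round() already rounds ties to even for floats.
    """
    return int_to_int(round(value), bits, signed)
```

With these definitions, float_to_int(2.5, 8, True) yields 2 and float_to_int(3.5, 8, True) yields 4, showing the ties-to-even behaviour, while float_to_int(1000.7, 8, False) saturates to 255.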
If the inputInterpretation is a Packed enum value, then the conversion is a
bitcast where element(s) of InputTy are bitcast to element(s) of the type
described by the enum. InputTy must be uint32_t.
The input vector must have enough components to hold K values of the packed
type. If the size of the packed type is not a power-of-two number of bits, then the
extension that introduces the enum defines how bits are packed. Packed
types with a power of two number of bits are tightly packed with lower
numbered components stored in lower bits.
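As an illustration of the tight packing rule, the following Python sketch packs signed 8-bit values four to a 32-bit word, as a shader might do before passing a uint32_t input vector with gl_ComponentTypeSignedInt8PackedNV. The function names are hypothetical, and zero-filling the unused lanes of the last word is an assumption of this sketch.

```python
def pack_int8x4(values):
    """Pack signed 8-bit values into 32-bit words, four per word.

    Component 4*i + j lands in bits [8*j, 8*j + 7] of word i, so lower
    numbered components occupy lower bits. Unused trailing lanes are zero.
    """
    words = [0] * ((len(values) + 3) // 4)
    for index, value in enumerate(values):
        if not -128 <= value <= 127:
            raise ValueError("value does not fit in a signed 8-bit integer")
        words[index // 4] |= (value & 0xFF) << (8 * (index % 4))
    return words


def unpack_int8x4(words, count):
    """Inverse of pack_int8x4: recover `count` signed 8-bit values."""
    out = []
    for index in range(count):
        byte = (words[index // 4] >> (8 * (index % 4))) & 0xFF
        out.append(byte - 256 if byte >= 128 else byte)
    return out
```

For K = 6 packed 8-bit values, the input vector therefore needs at least two uint32_t components.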
MatrixTy can be any scalar type, and is ignored. The matrix is loaded
starting from a byte offset of matrixOffset from the start of the array,
and raw data is loaded according to matrixInterpretation. No conversion
is performed.
BiasTy can be any scalar type, and is ignored. The bias is loaded starting
from a byte offset of biasOffset from the start of the array, and raw data
is loaded according to biasInterpretation. No conversion is performed.
matrixOffset must be 64B aligned. biasOffset must be 16B aligned.
These alignment requirements also apply to the base of the array and the
buffer that contains it.
The matrix array must be in buffer storage, and the array that is passed
in can be sized or unsized. If the matrixLayout is RowMajorNV or
ColumnMajorNV, then matrixStride is the number of bytes to add to the
pointer to go from one row or column to the next, and must be a multiple of
16B. For optimal layouts, matrixStride is ignored unless otherwise
specified.
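The alignment and stride rules above lend themselves to a host-side sanity check before a shader is dispatched. The Python sketch below is a hypothetical helper, not part of any API. The layout constants mirror the gl_CooperativeVectorMatrixLayout* values, and rounding a row or column up to the next multiple of 16 bytes is just one valid way to choose a stride.

```python
ROW_MAJOR, COLUMN_MAJOR, INFERENCING_OPTIMAL, TRAINING_OPTIMAL = 0, 1, 2, 3


def padded_stride_bytes(elements_per_line, element_size):
    """Smallest multiple of 16 bytes that holds one row or column."""
    raw = elements_per_line * element_size
    return (raw + 15) // 16 * 16


def check_matmul_addressing(matrix_offset, bias_offset, layout, stride):
    """Return a list of violated rules (an empty list means none found)."""
    problems = []
    if matrix_offset % 64 != 0:
        problems.append("matrixOffset must be 64B aligned")
    if bias_offset % 16 != 0:
        problems.append("biasOffset must be 16B aligned")
    # The stride is only checked for the two explicit layouts.
    if layout in (ROW_MAJOR, COLUMN_MAJOR) and stride % 16 != 0:
        problems.append("matrixStride must be a multiple of 16B")
    return problems
```

For a row-major float16 matrix with K = 6, a row occupies 12 bytes, so padded_stride_bytes(6, 2) gives a 16-byte stride.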
Similarly, the bias is loaded from memory starting at the requested offset
in "bias". M consecutive elements are loaded. The bias array must be in
buffer storage, and the array that is passed in can be sized or unsized.
The Vulkan implementation advertises supported combinations of ResultTy,
inputInterpretation, matrixInterpretation, and biasInterpretation.
M is the output vector size and K is the logical input vector size. The
matrix is MxK if transpose is false and KxM (before transposing) if
transpose is true.
gl_ComponentType* are constant integer values which can be used for the
inputInterpretation, matrixInterpretation, and biasInterpretation
parameters in coopVecMatMulAddNV and coopVecMatMulNV. Values match
the VkComponentTypeKHR enum.
const int gl_ComponentTypeFloat16NV = 0;
const int gl_ComponentTypeFloat32NV = 1;
const int gl_ComponentTypeFloat64NV = 2;
const int gl_ComponentTypeSignedInt8NV = 3;
const int gl_ComponentTypeSignedInt16NV = 4;
const int gl_ComponentTypeSignedInt32NV = 5;
const int gl_ComponentTypeSignedInt64NV = 6;
const int gl_ComponentTypeUnsignedInt8NV = 7;
const int gl_ComponentTypeUnsignedInt16NV = 8;
const int gl_ComponentTypeUnsignedInt32NV = 9;
const int gl_ComponentTypeUnsignedInt64NV = 10;
const int gl_ComponentTypeSignedInt8PackedNV = 1000491000;
const int gl_ComponentTypeUnsignedInt8PackedNV = 1000491001;
const int gl_ComponentTypeFloatE4M3NV = 1000491002;
const int gl_ComponentTypeFloatE5M2NV = 1000491003;
The transpose parameter indicates that the matrix is transposed before
performing the multiply. Transposing is not supported for the
RowMajorNV/ColumnMajorNV layouts. Not all component types support transposing.
It is left to the Vulkan specification to define which types support
transposing.
gl_CooperativeVectorMatrixLayout* are constant integer values which can be used for
the matrixLayout parameter in coopVecMatMulAddNV, coopVecMatMulNV, and
coopVecOuterProductAccumulateNV.
const int gl_CooperativeVectorMatrixLayoutRowMajorNV = 0;
const int gl_CooperativeVectorMatrixLayoutColumnMajorNV = 1;
const int gl_CooperativeVectorMatrixLayoutInferencingOptimalNV = 2;
const int gl_CooperativeVectorMatrixLayoutTrainingOptimalNV = 3;
If matrixLayout is gl_CooperativeVectorMatrixLayoutRowMajorNV, then
contiguous ranges of K elements in memory form the row vectors of the
matrix that are dotted with the input vector. That is,
result[j] = sum_{k<K} input[k] * matrix[matrixOffsetInElements + strideInElements*j + k].
If matrixLayout is gl_CooperativeVectorMatrixLayoutColumnMajorNV, then
contiguous ranges of M elements in memory form the column vectors of the
matrix. That is,
result[j] = sum_{k<K} input[k] * matrix[matrixOffsetInElements + strideInElements*k + j].
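The two indexing formulas above can be checked with a small CPU-side model. The following Python sketch is illustrative and non-normative. The function name matvec and its optional bias argument are assumptions made for this example, offsets and strides are expressed in elements rather than bytes, and Python floats stand in for whatever internal precision the Vulkan specification defines.

```python
ROW_MAJOR, COLUMN_MAJOR = 0, 1


def matvec(matrix, offset, stride, layout, m, k, vec, bias=None):
    """Compute result[j] = sum_k vec[k] * A[j][k] (+ bias[j]) from flat storage.

    `offset` and `stride` play the roles of matrixOffsetInElements and
    strideInElements in the formulas above.
    """
    if len(vec) != k:
        raise ValueError("input must have K components")
    result = []
    for j in range(m):
        acc = 0.0
        for kk in range(k):
            if layout == ROW_MAJOR:
                index = offset + stride * j + kk
            elif layout == COLUMN_MAJOR:
                index = offset + stride * kk + j
            else:
                raise ValueError("optimal layouts are opaque and not modelled")
            acc += vec[kk] * matrix[index]
        if bias is not None:
            acc += bias[j]
        result.append(acc)
    return result
```

The same 2x3 matrix stored row-major with a padded stride of 4 elements, or column-major with a stride of 2 elements, produces the same result vector.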
Optimal matrix layouts use an implementation-dependent layout that may not
be publicly documented. The Vulkan extension specification offers commands to
convert a matrix into an optimal layout on the host or device, and to compute the
size of a matrix in an optimal layout. This allows applications to reserve
the appropriate amount of memory in buffers. All optimal layouts have the
property that initializing the whole region of memory to zero is equivalent
to initializing all elements with a bit pattern of zero. This conversion
command can also perform type conversions, to fill a matrix in a type
matching the matrixInterpretation that will be used in the shader.
The inputInterpretation, matrixInterpretation, biasInterpretation, M, K,
matrixLayout, and transpose parameters must be constant expressions.
Memory loads performed by these functions are performed as if the memory
were private, readonly, and restrict. This means the matrix and bias values
must not be modified while a shader might be using them.
The following function loads a cooperative vector from memory:
void coopVecLoadNV(out coopvecNV<VectorElemTy, NumComps> v, volatile coherent ArrayElemTy[] buf, uint offset);
ArrayElemTy can be any scalar or vector type, and is ignored. The vector is
loaded starting from a byte offset of 'offset' from the start of the 'buf'
array. No conversion is performed. The conditions under which bounds
checking is performed are left to the Vulkan extension.
The following function stores a cooperative vector to memory:
void coopVecStoreNV(coopvecNV<VectorElemTy, NumComps> v, volatile coherent ArrayElemTy[] buf, uint offset);
ArrayElemTy can be any scalar or vector type, and is ignored. The vector is
stored starting at a byte offset of 'offset' from the start of the 'buf'
array. No conversion is performed. The conditions under which bounds
checking is performed are left to the Vulkan extension.
For both coopVecLoadNV and coopVecStoreNV, 'v' can be a cooperative vector
type with any supported type parameters. The 'buf' arrays must be in either
buffer or shared storage, and the array that is passed in can be sized or
unsized. The load or store is done using memory qualifiers taken from the
declaration of 'buf'. offset must be a multiple of 16. For buffer storage,
the start of 'buf' must be 16B aligned.
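The byte-offset addressing shared by coopVecLoadNV and coopVecStoreNV can be modelled on the CPU with an untyped byte buffer, which mirrors the fact that ArrayElemTy is ignored. This Python sketch is non-normative. The function names, the float16 component type, and the little-endian byte order are assumptions of the example, and memory qualifiers and the base alignment of the buffer are not modelled.

```python
import struct


def coopvec_load_f16(buf, offset, num_comps):
    """Read num_comps little-endian float16 values starting at byte `offset`."""
    if offset % 16 != 0:
        raise ValueError("offset must be a multiple of 16")
    return list(struct.unpack_from(f"<{num_comps}e", buf, offset))


def coopvec_store_f16(vec, buf, offset):
    """Write the components of `vec` as float16 starting at byte `offset`."""
    if offset % 16 != 0:
        raise ValueError("offset must be a multiple of 16")
    struct.pack_into(f"<{len(vec)}e", buf, offset, *vec)
```

Storing three components at byte offset 16 touches bytes 16 through 21 only, and no conversion takes place beyond the float16 encoding itself.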
The following function computes the outer product between column vectors v1
and v2, i.e. v1*transpose(v2), and the resulting MxN matrix is atomically
(with device scope) accumulated in memory.
void coopVecOuterProductAccumulateNV(const coopvecNV<T, M> v1, const coopvecNV<T, N> v2,
                                     T[] buf, uint offset, uint stride,
                                     int matrixLayout, int matrixInterpretation);
The "buf" array must be in buffer storage, and the array that is passed in
can be sized or unsized. The starting offset of the buf array must be 16B
aligned, and the offset must be a multiple of 16.
coopVecOuterProductAccumulateNV is only supported for certain
component types, as defined by the Vulkan extension specification.
If the matrixLayout is RowMajorNV or ColumnMajorNV, then stride is the
number of bytes to add to the pointer to go from one row or column to
the next, and must be a multiple of 16. For optimal layouts, the
stride is ignored unless otherwise specified.
The matrixLayout parameter must be a constant expression.
matrixInterpretation selects the type used for the accumulation (i.e. the
type of elements of the outer product matrix).
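A single-threaded CPU model may help make the accumulation concrete. The Python sketch below is non-normative. The function name is hypothetical, offset and stride are given in elements rather than bytes, and the element addressing is assumed to mirror the row-major and column-major formulas given for the matrix multiply functions. Atomicity and matrixInterpretation are not modelled.

```python
ROW_MAJOR, COLUMN_MAJOR = 0, 1


def outer_product_accumulate(v1, v2, buf, offset, stride, layout):
    """Add v1 * transpose(v2), an M x N matrix, into flat storage `buf`."""
    for i, a in enumerate(v1):        # i indexes rows, M = len(v1)
        for j, b in enumerate(v2):    # j indexes columns, N = len(v2)
            if layout == ROW_MAJOR:
                index = offset + stride * i + j
            elif layout == COLUMN_MAJOR:
                index = offset + stride * j + i
            else:
                raise ValueError("optimal layouts are opaque and not modelled")
            buf[index] += a * b
```

Calling the function twice with the same vectors doubles every stored element, which is the accumulate behaviour that repeated invocations rely on.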
The following function component-wise atomically (with device scope) adds
components of the vector v to the corresponding elements of an array in
memory.
void coopVecReduceSumAccumulateNV(const coopvecNV<VectorElemTy, NumComps> v,
                                  T[] buf, uint offset);
The "buf" array must be in buffer storage, and the array that is passed in
can be sized or unsized. The starting offset of the buf array must be 16B
aligned, and offset must be a multiple of 16.
coopVecReduceSumAccumulateNV is only supported for certain component types,
as defined by the Vulkan extension specification.
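The effect of several invocations calling coopVecReduceSumAccumulateNV on the same destination can be sketched on the CPU. This Python model is non-normative. The function names are hypothetical, the offset is in elements rather than bytes, and the device-scope atomicity of the real function is replaced by a sequential loop.

```python
def reduce_sum_accumulate(v, buf, offset):
    """Add each component of v to buf[offset + i] (offset in elements here)."""
    for i, x in enumerate(v):
        buf[offset + i] += x


def accumulate_all(invocation_vectors, num_comps):
    """Sum the vectors contributed by several invocations into one array."""
    buf = [0.0] * num_comps
    for v in invocation_vectors:
        reduce_sum_accumulate(v, buf, 0)
    return buf
```

Because addition into memory is the only operation, the destination is typically zero-initialized before the first contribution, as issue (11) discusses for the outer product case.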
All functions in this section are supported in all shader stages (subject
to API-specific limitations) and don't require uniform control flow or
fully occupied subgroups.
Modify Section 9, Shading Language Grammar for Core Profile
(Add to tokens list)
COOPVECNV
Issues
(1) Do we really need "cooperative vectors" or can they just be "vectors"
and we happen to optimize around the matrix multiply function? Type name
could just be vec<T, K>.
RESOLVED: Calling them generic vectors would be more generally useful,
and may require fewer new types and instructions. But having dedicated
types makes it easier to limit the scope of this functionality, to have
an implementation designed around matrix multiplies, and to avoid having
to deal with the full generality of spec interactions.
This extension uses a dedicated type, but this may be generalized in the
future.
(2) Do we need special functions to load/store cooperative vectors from
buffer/shared storage? Options include: (A) functions to load/store a
cooperative vector from an array, (B) use normal
loads and then construct the cooperative vector (and use component access
and normal stores to store to memory), (C) allow cooperative vector types
to be placed in buffer/shared storage.
RESOLVED: Use option (A). We expect some performance benefits
from being able to do larger loads, and for loads directly from shared
memory.
(3) Should we have functions to convert to/from a temporary array, or
use constructor syntax and component access as with normal vectors?
RESOLVED: Be consistent with vector types.
(4) Should we require cooperative vector support in all shader stages?
RESOLVED: The types and matrix multiply should be supported in all stages,
but leave it to the API to advertise which stages support it, in case some
implementations have unforeseen limitations.
(5) Should we allow specialization constants for the number of components?
RESOLVED: Yes.
(6) Should we reuse gl_CooperativeMatrixLayout values or create a new enum
type?
RESOLVED: Use separate enums.
(7) Does coopVecMatMulAddNV need to support mixing component types?
RESOLVED: Yes, we should not preclude different input and output types.
Particularly with low precision weights, we shouldn't prevent the shader
from seeing a higher precision result that it can condition before reducing
the precision.
(8) How should we support transposing matrices?
RESOLVED: Add a transpose parameter to coopVecMatMulAddNV, only supported for
"easy" cases (fp16 for NVIDIA). This works for optimal layouts, whereas
spoofing the layout in memory (swapping K and M, and swapping row/column
major) would not.
(9) Should we support comparisons and boolean vectors?
RESOLVED: In the long run it's desirable to support these (see issue 1),
but it significantly increases the testing surface so for now these are not
supported. Many common activation functions that include comparisons are
more efficiently implemented using max or min, or failing that the 'step'
function can be used to compare floating point numbers, which can be used
to emulate booleans. Reverse activations (activation function derivatives)
for piecewise continuous activation functions could benefit from a builtin
function to select (component-wise) between two values based on a comparison.
(10) With what scope is coopVecOuterProductAccumulateNV atomic? Does it need a
scope operand to select the scope?
RESOLVED: Assume device scope.
(11) Does there need to be a way to zero-initialize the storage for
coopVecOuterProductAccumulateNV?
RESOLVED: These can generally be zero-initialized outside of the shader.
(12) How should coopVecReduceSumAccumulateNV work? Should it accumulate into
a register, or accumulate directly into memory?
RESOLVED: The reduced value is accumulated directly into memory.
(13) Can the training functions support only TrainingOptimal layout? If so,
we can remove the layout and stride parameters.
RESOLVED: No restriction on layout, at least not at the language/IR level.
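Issue (9) notes that activations can be written with max, min, or step instead of comparisons. The following CPU-side Python sketch illustrates that idea for ReLU and its reverse pass. It is non-normative, the function names are hypothetical, and treating the derivative at exactly zero as 1 is a choice made by this example.

```python
def step(edge, x):
    """GLSL step(): 0.0 where x < edge, otherwise 1.0 (per component)."""
    return [0.0 if xi < e else 1.0 for e, xi in zip(edge, x)]


def relu(x):
    """ReLU written with max, so no boolean vector is ever needed."""
    return [max(0.0, xi) for xi in x]


def relu_backward(x, grad_out):
    """Reverse ReLU: pass the gradient where x >= 0, using step as a 0/1 mask."""
    mask = step([0.0] * len(x), x)
    return [m * g for m, g in zip(mask, grad_out)]
```

The mask produced by step plays the role of a boolean vector, and multiplying by it selects between the incoming gradient and zero.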
Example syntax:
restrict buffer {
    float16_t matrixData[];
} matrixBuf;
const int inputDim = 6;
coopvecNV<float16_t, inputDim> inputVec = coopvecNV<float16_t, inputDim>(materialstate, shininess, ... );
const int MLPDim = 32;
coopvecNV<float16_t, MLPDim> mlpVec;
coopVecMatMulNV(mlpVec, inputVec, gl_ComponentTypeFloat16NV, matrixBuf.matrixData, offset1, gl_ComponentTypeFloat16NV, MLPDim, inputDim, gl_CooperativeVectorMatrixLayoutRowMajorNV, false, MLPDim*sizeof(float16_t));
// ReLU activation
mlpVec = max(coopvecNV<float16_t, MLPDim>(0), mlpVec);
coopVecMatMulNV(mlpVec, mlpVec, gl_ComponentTypeFloat16NV, matrixBuf.matrixData, offset2, gl_ComponentTypeFloat16NV, MLPDim, MLPDim, gl_CooperativeVectorMatrixLayoutRowMajorNV, false, MLPDim*sizeof(float16_t));
// tanh activation
mlpVec = tanh(mlpVec);
const int resultDim = 8;
coopvecNV<float16_t, resultDim> resultVec;
// each row-major row holds K = MLPDim elements, so the stride spans MLPDim elements
coopVecMatMulNV(resultVec, mlpVec, gl_ComponentTypeFloat16NV, matrixBuf.matrixData, offset3, gl_ComponentTypeFloat16NV, resultDim, MLPDim, gl_CooperativeVectorMatrixLayoutRowMajorNV, false, MLPDim*sizeof(float16_t));
// use resultVec[...]
Revision History
Revision 1
- Internal revisions.