-
Could you speak more about what happens to this short-term solution in the long term? I.e., after we have sparse types, would the …
Have you considered using a custom_call instead? Some thoughts on …
-
Indeed, if the proposed operation is only available on NVIDIA hardware and nowhere else, custom_call seems like a better fit; it is normally used for backend-specific operations.
-
RFC: Add 2:4 structured sparsity support to XLA:GPU
This RFC proposes integrating support for NVIDIA's 2:4 structured sparsity into the XLA:GPU compiler. The objective is to speed up matrix multiplications in which one operand is, or can be, pruned into a format where every four consecutive elements contain exactly two nonzeros and two zeros, using the hardware acceleration that A100 and H100 GPUs provide for this format. The goal of this RFC is to make 2:4 structured sparsity available to model writers very rapidly by means of special operations that deal with the sparsity explicitly (in contrast with the longer-term implicit sparsity proposal that introduces sparse tensor types, described in the StableHLO sparsity RFC). Note that, although this RFC focuses on 2:4 structured sparsity, with minor changes it can also be generalized to block sparsity.
2:4 Structured Sparsity Storage Format
The NVIDIA 2:4 structured sparsity format is described in the data type section of the cuSPARSELt library documentation. It consists of two buffers, i.e. an array with compressed 2-bit indices and an array with the nonzero values, so every sparse matrix maps to exactly two metadata buffers. This RFC proposes a simple design that lets model writers quickly speed up 2:4 structured sparse matrix multiplications with explicit sparsity, i.e. the user is aware of all the metadata for sparse matrices and inserts a special operation for the matrix multiplication. The drawbacks of this approach are that users must be aware of all details of the sparsity format and keep the metadata together, users must explicitly introduce the accelerated matrix multiplication through special operations, and the compiler provides almost no safety checks on the metadata. The advantage of the explicit sparsity approach is that it puts accelerated sparse performance in the hands of model writers much more quickly, with minimal changes to the XLA infrastructure (IR and ops).
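To make the storage format concrete, the sketch below shows how a 2:4 sparse row splits into a values buffer and a per-element 2-bit index buffer. This is a conceptual illustration only; the exact bit packing used by cuSPARSELt differs.

```python
import numpy as np

# Conceptual 2:4 compression: every group of four consecutive elements
# keeps its two nonzero values plus their positions (0..3) within the group.
row = np.array([0., 3., 0., 5., 7., 0., 0., 2.])  # already 2:4 sparse

values, indices = [], []
for group in row.reshape(-1, 4):
    nz = np.flatnonzero(group)           # exactly two positions per group
    values.extend(group[nz].tolist())    # compressed nonzero values
    indices.extend(nz.tolist())          # 2-bit index per kept value

print(values)   # [3.0, 5.0, 7.0, 2.0]
print(indices)  # [1, 3, 0, 3]
```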
JAX Changes
This proposal completely avoids sparse tensor types at the JAX level (as was the approach taken in jax.experimental.sparse), since such types complicate ABI requirements (a sparse tensor maps to more than one buffer). Instead, a few lower-level primitives expose the metadata of the sparse storage format, together with a special matrix multiplication operation for 2:4 structured sparsity that operates directly on the metadata, as shown below.
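A minimal sketch of what this explicit API could look like follows. The names (prune_2to4, compress_2to4, sp_matmul_2to4), signatures, and layouts are illustrative assumptions rather than the final API, and the matmul is emulated here by decompressing to dense so that its reference semantics are clear; the proposed primitive would instead lower to the hardware-accelerated sparse matmul.

```python
import jax.numpy as jnp

def prune_2to4(a):
    # Keep the two largest-magnitude entries in every group of four, zero the rest.
    groups = a.reshape(-1, 4)
    keep = jnp.argsort(jnp.abs(groups), axis=-1)[:, 2:]
    rows = jnp.arange(groups.shape[0])[:, None]
    mask = jnp.zeros_like(groups).at[rows, keep].set(1.0)
    return (groups * mask).reshape(a.shape)

def compress_2to4(a):
    # Split a pruned (m, n) matrix into values (m, n // 2) and indices (m, n // 2).
    groups = a.reshape(-1, 4)
    idx = jnp.argsort(groups == 0.0, axis=-1)[:, :2]   # positions of the two nonzeros
    vals = jnp.take_along_axis(groups, idx, axis=-1)
    m = a.shape[0]
    return vals.reshape(m, -1), idx.reshape(m, -1).astype(jnp.uint8)

def sp_matmul_2to4(values, indices, n, b):
    # Reference semantics only: decompress to dense and multiply. The proposed
    # primitive would instead map to the hardware-accelerated 2:4 sparse matmul.
    m = values.shape[0]
    rows = jnp.arange(m * n // 4)[:, None]
    idx = indices.reshape(-1, 2).astype(jnp.int32)
    dense = jnp.zeros((m * n // 4, 4), values.dtype).at[rows, idx].set(values.reshape(-1, 2))
    return dense.reshape(m, n) @ b

# Intended usage (a: (m, n) with n divisible by 4, b: (n, k)):
#   a = prune_2to4(a)
#   values, indices = compress_2to4(a)
#   c = sp_matmul_2to4(values, indices, a.shape[1], b)   # equals a @ b
```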
Initially, the first two operations (prune and compress) can remain purely in JAX as a library; only the matmul primitive is recognized as a custom operation that maps to a special accelerated operation. Over time, accelerated versions of the prune and compress steps that run on the GPU can be made available as well.
An element-wise operation on the original sparse matrix can simply be applied to the values buffer, as shown below. At every point, the model writer must be fully aware of which metadata represents which matrix and is fully responsible for preserving the integrity of the metadata (e.g. scaling the indices array would corrupt the sparse storage scheme; the compiler provides no safety net against such corruption).
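For instance, a ReLU or a uniform scaling of the sparse matrix touches only the values buffer (continuing the illustrative metadata layout from the sketch above):

```python
# Element-wise ops act on the values buffer only; the indices buffer must
# not be modified, or the storage scheme is corrupted.
values = jnp.maximum(values, 0.0)   # ReLU applied to the sparse matrix
values = 2.0 * values               # uniform scaling is equally safe
# indices is left untouched
```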
To the backend, these operations manifest themselves as dense operations, which fully enables the typical optimizations such as fusion (the special accelerated matrix multiplication operation, however, needs some extra attention).
Note that it is probably a good idea to follow up the initial implementation with some additional JAX work to “hide” the metadata in a single JAX construct: not a full sparse tensor type, but a wrapper that hides some of the details from the user, e.g. a struct with values, indices, m, and n, together with zero-overhead setter and getter methods.
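One possible shape for such a wrapper is sketched below; the name Sparse24 and its fields are hypothetical. A NamedTuple is already treated as a pytree by JAX, so the wrapper adds no overhead and keeps all metadata of one sparse matrix together.

```python
from typing import NamedTuple
import jax.numpy as jnp

class Sparse24(NamedTuple):
    # Bundles the explicit metadata of one 2:4 sparse matrix; being a
    # NamedTuple, it flattens to its fields with zero overhead under jit.
    values: jnp.ndarray    # (m, n // 2) nonzero values
    indices: jnp.ndarray   # (m, n // 2) 2-bit positions (stored as uint8 here)
    m: int
    n: int

# e.g. a_sp = Sparse24(values, indices, m, n)
#      c = sp_matmul_2to4(a_sp.values, a_sp.indices, a_sp.n, b)
```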
HLO and StableHLO Changes
Since the user deals with the operations on the metadata explicitly, the only addition to the IR is a special sparse matrix multiplication operation. As mentioned earlier, an audit of the “dense” fusion optimizations is required to ensure that this special sparse operation fuses with dense operations, so that no performance is lost (element-wise operations manifest themselves as dense operations on the values part of the metadata, so they need no special attention).
For a very first exploration of adding 2:4 support (e.g. in a private fork), StableHLO does not necessarily need to change, because there is a process for exposing HLO/MHLO features to JAX and other frameworks without going through StableHLO. However, before the 2:4 code is submitted to the XLA repository, the special operation needs to be introduced to StableHLO as well; this will have to be proposed as an RFC and, once approved by the governance body, added to the StableHLO repository with compliance tests.
Code Generation for GPU
The objective of introducing the special sparse matrix multiplication operation is ultimately to map it onto efficient usage of the mma.sp instruction. The following three approaches to generating code for the special operation are possible, listed in increasing order of difficulty.
Implementation Status
This RFC is meant to solicit early feedback and also acts as a call for volunteers interested in contributing to this work. The JAX operations have already been implemented as a small library, but the proper connection with HLO and StableHLO, as well as all the backend work, still has to start.