Replies: 22 comments 41 replies
-
I believe many hardware vendors will want to allow matmuls to be performed directly in FP8, with hardware support for scaling. This is what enables users to benefit from the FLOPs improvement of FP8 (in contrast to the memory savings). That is, there will be an operation

```python
F = float32  # Or, maybe more likely as discussed above, a float16

def matmul_scaled(X : Tensor[float8], Y : Tensor[float8], scale : float) -> Tensor[F]:
    """
    Returns (X @ Y) * scale
    """
    ...
```

in terms of which a version of the quantized matmul can be written:

```python
def quantized_matmul_fast(quantized_x : Tensor[float8], quantized_y : Tensor[float8], x_scale, y_scale, z_scale):
    # Do the matmul, and divide by the old scale
    z = matmul_scaled(quantized_x, quantized_y, x_scale * y_scale / z_scale)
    # Compute the new scale
    new_z_scale = fn(z)
    # Quantize the matmul output (already scaled by the old scale)
    return cast(z, fp8_e4m3), new_z_scale
```
-
I'm not sure what exactly the formula for the "zero point" is, so it would be useful to add it.
-
The approach proposed here involves what's often known as "fake quantization" around the matmul in HLO that's then pattern matched into a true quantized fp8 matmul. I'd like to propose an alternative that involves a direct fp8 matmul in HLO. JAX code that's 1:1 with the two HLO encodings:

```python
# RFC approach with fake quantization
def matmul_fp8_rfc(x_bf16, y_bf16, x_amax, y_amax, z_amax):
    x_rounded = unscale_to_bf16(scale_to_fp8(x_bf16, x_amax), x_amax)
    y_rounded = unscale_to_bf16(scale_to_fp8(y_bf16, y_amax), y_amax)
    z_bf16 = jnp.dot(x_rounded, y_rounded)
    new_z_amax = amax(z_bf16)
    return z_bf16, new_z_amax

# Alternative approach that directly represents what cuBLASLt is doing
def matmul_fp8_alt(x_bf16, y_bf16, x_amax, y_amax, z_amax):
    x_fp8_scaled = fp8(scale(x_bf16, x_amax))
    y_fp8_scaled = fp8(scale(y_bf16, y_amax))
    z_bf16_scaled_x_y = jnp.dot(x_fp8_scaled, y_fp8_scaled, precision='fp8', preferred_element_type=jnp.bfloat16)
    z_bf16 = unscale(z_bf16_scaled_x_y, x_amax, y_amax)
    new_z_amax = amax(z_bf16)
    return z_bf16, new_z_amax
```

The advantages of the RFC approach in a high-level frontend are:
But the RFC is proposing an implementation for an IR, not a high-level frontend. The XLA GPU backend is not the only consumer of HLO IR, and the more that HLO IR diverges from the operational intent of the user (even if it continues to have about the same numerical results), the more obligatory transformations a backend has to implement before it's operationally correct. A few concrete examples of things that would be easier with the alternative approach:
I was hoping that if the approach in the RFC is implemented, we in JAX would still be able to expose the alternative approach to our users (including library authors), many of whom are likely to value clear operational semantics ("the type you write is the type you get") more than maximal similarity to unquantized code. But I think that would require us to do a pattern match, since the transformation wouldn't be local to the matmul lowering (we need to introduce an unscale, not just a cast to bf16), and we don't really want to/aren't really able to do pattern matches in JAX (hence the desire for jaxpr to stay fairly close to HLO, with only local transformations during lowering). I recognize that the same need for frontend -> IR pattern matching is true in the RFC -> alt direction, although TF does have the tooling to implement nonlocal lowerings to HLO; if TF would like to generate code with the RFC approach then I'm advocating that XLA support matching both patterns. The alternative approach is also already the approach used by JAX library authors to implement int8 quantization-aware training (e.g. AQT), so having to follow the RFC approach for fp8 would be inconsistent with our approach to integer quantization. |
-
As the author of the quantized types many years ago, I definitely want to review closely any use of them which introduces symbols as this RFC suggests -- that was listed as future so I reserved comment. Imo, for a dialect like StableHLO, the only valid use of quantized types is as a higher level "sugaring" for the sake of some of the traditional frontends/tools that reason in those terms: they should be reducible to concrete IR in StableHLO that is correct and performant on mainline platforms (which may require additional ops to express in a hardware aligned way). That is also what the principle we were discussing means: the result of any pattern matching or desugaring should be expressible in StableHLO itself. I would probably go a step further and suggest that it should be expressible without information loss that will necessitate further pattern matching to recover (but that is quite subjective and needs to be evaluated case by case). |
-
Given that future hardware implementations of FP8 are also in mind, what is the plan to extend this?
-
"XLA will initially only support NVIDIA's proposed FP8 types, since their hardware supporting FP8 will be among the first to be released." Graphcore already has FP8 hardware available. It seems very shortsighted not to support the GC/AMD variety too.
-
(I also raised this during the OpenXLA meeting.) We should proactively think about ways to alert the user in case their usage of the contracts introduced in the RFC fails to pattern-match to genuine FP8 calls, perhaps because the user did not strictly follow the steps put forth and/or the pattern-matching heuristics are not encompassing enough.
-
This is a discussion for the framework front-end design, I realize, but it would be nice if we could come up with a clever bookkeeping abstraction that encapsulates the FP8 tensor values along with their respective scales, so that the user doesn't have to keep dragging the two components around separately if they don't want to.
-
Last Tuesday (Nov 17, 2022), I presented this RFC at the XLA community meeting. At the end of my presentation, there was a Q&A section. I'll summarize the questions and answers here.

Q: Is there work in supporting FP8 in frameworks like TensorFlow and JAX?

Q: The FP8 RFC proposes adding a lot of additional scaling ops, like multiply/divide ops. Won't this increase the size of the IR?

Q: The example shown in the presentation has a matmul multiplying an input with itself, which results in a large value, which can overflow. (In the RFC, the example is the first example in the "Scaling" section.) Will this cause problems in practice?

Q: How well supported are StableHLO's quantized types and ops?

Q: If the user writes a pattern to do a scaled matmul but the pattern matching fails, will the user be alerted in any way?
-
Please note that this RFC will be open for review until December 9, 2022. I also updated the intro of the RFC with this information. I'm also adding a comment here because notifications are sent out for comments but not for RFC edits.
-
I think the line
-
For the scaling factor determination, I think NVIDIA's logic is a little more sophisticated than a static hypothesis such as
-
It would still be necessary to insert a flag for selecting the conversion mode, either RNE (default) or SR, from the XLA perspective, or at least from its immediate upper layer. We just don't want to lack this option when dealing with bulk data (tensor) conversions with FP8 as the destination type, in the context of the main FP8 fast path (vs. casual/low-performance/low-volume or non-fusible SR conversions).
-
A technical note: we may want to use F8_MAX (system definable) in place of 448 to be more vendor-friendly and generic. Different float8 implementations may come with different ranges (OpenAI/Triton's, for example).
-
Some additional notes regarding delayed scaling, amax, and some general thoughts. The notes are:
-
I have a minor comment regarding the patterns that will be matched in XLA for the scales, i.e.:
The division can be expressed as multiplication by the reciprocal of the actual scale. Since the actual implementation can vary in TF or JAX, I think it'd be better to match patterns with both multiplication and division for scaling.
-
"When converting to FP8, XLA will use the typical round-to-even behavior as used in other floating-point dtypes. However, in practice, FP8 should saturate on overflow, because the scale might end up being slightly too large." FP8 should saturate on overflow unless the conversion is to E5M2 and loss scaling is enabled. |
-
What's proposed in this RFC follows the Transformer Engine approach, where per-tensor scaling is used and the scaling factor is determined by the amax of the previous iteration. While this is one interesting approach, we think it is important to be inclusive of other approaches developed by other vendors (AMD, Graphcore, etc.) to give the user more options.
-
The RFC is now approved, since the review period ended December 9 and there have been no major objections to the design. The RFC originally stated:
I replaced that paragraph with:
Despite being closed for review, if you have any more questions or feedback, please comment. Although it is too late to make any major changes to the design for the initial FP8 implementation, we are still interested in your comments, as we may evolve the design in the future (such as using StableHLO's quantized ops/types).
-
@reedwm Hi reedwm, sorry for the late reply. Since FP8 SE4M3FUZ has been merged (3b96f8f), I am really curious about your arithmetic assumptions for FP8:
In Graphcore hardware, you can do FP8 multiplication directly to produce an FP16 output.
If you cast the data back to FP32, that means the datatype is simulated, not supported; it is perhaps only useful to reduce model size and I/O burdens. Truly supported FP8 means double the speed in both throughput (larger) and latency (faster).
-
Dr. Sergio Perez, in a recently accepted paper (open-review version: https://openreview.net/pdf?id=nErbvDkucY), has demonstrated that Graphcore-AMD-Qualcomm's FP8 (e.g. SE4M3FUZ) can be effectively used in large-model inference and training without per-channel scaling for weights! You can see that about 1% error is observed in 70B Llama inference and 13B GPT finetuning tasks. Graphcore-AMD-Qualcomm's format actually works well with scaling.
-
RFC: FP8 in XLA
Overview
NVIDIA is introducing support for new 8-bit floating-point formats, collectively referred to as FP8, in their upcoming Hopper GPUs. FP8 results in a 1.2x to 1.5x end to end speedup vs 16-bit training for large language models. According to NVIDIA, there is no degradation in accuracy for most image classification, image detection, GAN, and NLP models. This RFC proposes a design for adding FP8 support to XLA.
Our goal is to have initial FP8 XLA support for Hopper GPUs by the end of 2022.
This RFC has been approved and therefore closed for review on December 9, 2022. You are still free to comment with any questions or thoughts on this design.
/CC @burmako @choucc34 @d0k @hawkinsp @abattery @stellaraccident @nluehr
Summary
Background Summary
Hopper supports two FP8 data types: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). Both will be supported in XLA. Other companies are also proposing their own FP8 E4M3 and E5M2 data types, but they differ in minor details such as the NaN encoding. XLA will initially only support NVIDIA's proposed FP8 types, since their hardware supporting FP8 will be among the first to be released.
FP8 has a low dynamic range and is prone to underflow and overflow. Therefore, NVIDIA recommends each FP8 tensor has an associated scale, where the true value of the tensor is the FP8 tensor multiplied by the scale. This type of scaling is a form of symmetric quantization.
The scale is dynamically computed during training. For performance reasons, it is impossible to compute the optimal scale and use it during the same step. Therefore NVIDIA recommends that each step uses the scale from the previous step and computes the scale for the next step.
Design summary
In HLO, MHLO, and StableHLO, two new dtype enum values will be added: `f8E5M2` and `f8E4M3`, corresponding to the NVIDIA dtypes supported in Hopper. This is the only change made to the HLO, MHLO, and StableHLO format.
Scaling will be represented using existing multiply and divide HLO instructions. In general, to run an op such as Dot with FP8 and scaling, the FP8 inputs will be cast to FP16, then multiplied by the input scales. Then the Dot will be run with FP16 inputs and outputs. A Reduce op will calculate the maximum value of the FP16 Dot output, which is used to compute the new scale for the next step. Then, the FP16 outputs are divided by the output scale and cast back to FP8. This whole process will be fused, so we don't actually pay the cost of running the Dot in FP16. (See section "Scaling" for details.)
StableHLO has special quantized types and ops, which could represent FP8 scaling. For now, we choose not to use them, since these types/ops do not yet support dynamic scales and are not yet supported in HLO. We will consider using them in the future. (See section "StableHLO quantization types and ops" for details.)
cuBLAS/cuDNN directly supports scaling for matmuls and convolutions. XLA will use pattern matching to rewrite Dot and Convolution ops with scaling via Multiply/Divide ops into cuBLAS/cuDNN calls. For non-matmul non-convolution ops with scaling, XLA will fuse them. (See section "XLA GPU codegen" for details.)
FP8 convergence and performance will be tested by training a ResNet50 and BERT model in FP8. (See section "Testing plan" for details.)
Background
Hopper supports two FP8 data types: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits).
E4M3 has more precision but less dynamic range than E5M2. E5M2 is similar to FP16, the only difference being that E5M2 has 8 fewer mantissa bits. This is similar to how bfloat16 is identical to FP32 except that it has 16 fewer mantissa bits. E4M3 is more unusual in that it doesn't support infinities and only has two representations for NaN. For more details on these two formats, see this whitepaper.
NVIDIA, ARM, and Intel are working towards standardizing these two FP8 data types, as described in this blog post. Other companies are proposing slightly different versions of FP8, however. These proposals also have an E4M3 and an E5M2 data type, but differ in details such as support for infinities, NaN, and negative zero. For example, while both the NVIDIA types support negative zero, GraphCore and AMD are proposing an FP8 standard where neither E4M3 nor E5M2 support negative zero. Tesla proposed an FP8 format where neither E4M3 nor E5M2 has Inf or NaN, but both have negative zero.
In XLA, we plan on initially supporting the dtypes proposed by NVIDIA, ARM, and Intel, because NVIDIA Hopper GPUs will likely be very popular and we want XLA to have optimal performance on such hardware in the short term. In the future, we will consider supporting other vendors' FP8 data types.
FP8 in machine learning
During training, NVIDIA found that E4M3 should be used on the forward pass, and E5M2 on the backward pass for models to converge to good quality. The forward pass requires the extra bit of precision, while the backward pass requires the increased dynamic range.
Since these types have very little precision and reduced dynamic range, they are particularly prone to overflow and underflow, especially E4M3. To address this, NVIDIA recommends each tensor have a scale factor, similar to how integer quantization typically uses a scale and offset (although FP8 only needs a scale according to NVIDIA, not an offset). Given an FP32 value, the quantized value is obtained by dividing by the scale and casting to FP8. Given an FP8 quantized value, the FP32 non-quantized value is obtained by casting to FP32 and multiplying by the scale.
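In Python-like terms, a minimal sketch of this relationship (the helper names and the use of JAX and `jnp.float8_e4m3fn` here are illustrative, not part of the RFC):

```python
import jax.numpy as jnp

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def quantize_to_fp8(x_fp32, scale):
    # Divide by the scale, saturate, then cast to FP8.
    return jnp.clip(x_fp32 / scale, -E4M3_MAX, E4M3_MAX).astype(jnp.float8_e4m3fn)

def dequantize_from_fp8(x_fp8, scale):
    # Cast back to FP32 and multiply by the scale.
    return x_fp8.astype(jnp.float32) * scale
```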
Tensor scaling allows values in the tensor to be brought into a representable range.
Most tensors on the forward pass and many on the backward pass require their own scale. The optimal value of the scale is such that it causes the tensor to barely not overflow. In other words, the optimal value is `max(fp32_tensor) / max_representable_fp8_value`. `max(fp32_tensor)` refers to the maximum absolute value of the tensor, and is often referred to as "amax".

For example, suppose a full precision 3-element tensor has values `[2^-14, 2, 7]`, and that we would like to represent it as an E4M3 tensor. The max E4M3 value is 448, and so the optimal scale is `7 / 448 = 1/64 ≈ 0.016`. This means the FP8 tensor is represented by `[2^-8, 128, 448]`, which barely does not overflow and brings the first element to the representable value of `2^-8` (the minimum positive E4M3 number is `2^-9`). Typically scales will be less than one, and so FP8 tensors will have larger values than the corresponding full precision tensors.

Scaling in this way is a form of symmetric quantization, which historically has been done on integer tensors, not floating-point tensors. A significant difference between FP8 quantization and integer quantization is that FP8 quantization is done during training, which means we must both compute the scale and use the scale to quantize tensors for each training step. Integer quantization aware training is also done during training, but unlike FP8 quantization, quantization aware training typically does not significantly improve per-step training performance.
Unfortunately, during FP8 training, it is not feasible to efficiently compute `tensor` and `max(tensor)`, then use `max(tensor)` to quantize `tensor` to FP8. This is because computing `max(tensor)` requires iterating over `tensor` in a wider precision, but we want to start quantizing certain elements of `tensor` before computing other elements of `tensor`, to avoid storing all of `tensor` in the wider precision. The solution suggested by NVIDIA is that we use `max(tensor)` to compute the scale for the next step, not the current step. For example, a quantized matmul would numerically be done in the following way during training:
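Sketched in JAX-style Python (illustrative only; the function and argument names follow the discussion below, while the availability of `jnp.float8_e4m3fn` and the 1.1 default slack value are assumptions):

```python
import jax.numpy as jnp

E4M3_MAX = 448.0

def quantized_matmul(quantized_x, quantized_y, x_scale, y_scale, z_scale, slack=1.1):
    # Dequantize: cast the FP8 inputs to a wider type and multiply by their scales.
    x = quantized_x.astype(jnp.float32) * x_scale
    y = quantized_y.astype(jnp.float32) * y_scale
    # Run the matmul in the wider type.
    z = jnp.dot(x, y)
    # Compute the scale for the *next* step from this step's amax.
    new_z_scale = slack * jnp.max(jnp.abs(z)) / E4M3_MAX
    # Quantize the output with the *current* step's scale, saturating on overflow.
    quantized_z = jnp.clip(z / z_scale, -E4M3_MAX, E4M3_MAX).astype(jnp.float8_e4m3fn)
    return quantized_z, new_z_scale
```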
The inputs and outputs of `quantized_matmul` are FP8, and the three scales are taken in as inputs. The function returns the quantized output and the new scale. `new_z_scale` will be the scale of `z` for the next step. `slack` is used to increase the scale slightly, in case the matmul output is slightly higher in the next step. If the matmul output is significantly higher in the next step, it may overflow in the next step, and so FP8 should use saturation on overflow, which results in the max FP8 value on overflow instead of Inf.

In this example, the wider precision that `quantized_x` and `quantized_y` are cast to is FP32, but it can also be FP16 or BF16. The choice is up to the user, although frameworks like TensorFlow, Keras, and JAX may be opinionated on the wider type. BF16 may be preferred since BF16 arithmetic is faster than FP32 on most backends and, unlike FP16, BF16 has significantly higher dynamic range than E5M2.

NVIDIA recommends the new scale be calculated based on the maximum amax value over a window of the past N steps, to ensure that a step with an unusually low amax value does not negatively affect the next step. The choice of how to compute the scale is up to the user and does not affect the compiler design. This RFC describes the new scale being solely a function of the amax value of the previous step, since it makes the examples simpler.
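For illustration, a small amax-history helper along these lines (the class itself, the window size, and the slack value are illustrative, not something the RFC specifies):

```python
from collections import deque
import jax.numpy as jnp

E4M3_MAX = 448.0

class AmaxHistory:
    """Tracks the last N amax values and derives the next scale from their maximum."""

    def __init__(self, window_size=16, slack=1.1):
        # window_size and slack are illustrative defaults.
        self.history = deque(maxlen=window_size)
        self.slack = slack

    def update(self, tensor):
        self.history.append(float(jnp.max(jnp.abs(tensor))))

    def next_scale(self):
        # Scale so that the largest amax seen in the window barely fits in E4M3.
        return self.slack * max(self.history) / E4M3_MAX
```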
The `quantized_matmul` example above only shows what is numerically done, not what the hardware executes in practice.
The use of symmetric quantization for FP8 is recommended by NVIDIA and directly supported by the cuBLAS and cuDNN libraries, but other researchers have experimented with using FP8 without scaling. For example this paper from GraphCore achieves good results without scaling (although it does use different E4M3 exponent biases for weights vs activations). Ultimately, it will be up to users to decide, although higher level frameworks may be opinionated on how FP8 should be used.
Design
In HLO, MHLO, and StableHLO, two new dtype enum values will be added: `f8E5M2` and `f8E4M3`. These will correspond to the dtypes as proposed by NVIDIA, Intel, and ARM and supported by Hopper.

E5M2 was recently added to MLIR and the LLVM helper class APFloat, and E4M3 will follow, although these dtypes are not (yet) being added to LLVM IR. See the LLVM RFC here. MLIR support for these dtypes is a prerequisite for adding them to StableHLO and XLA.

The naming convention of E5M2 and E4M3 in HLO will follow MLIR's naming convention. E5M2 was already added to MLIR as `Float8E5M2`, and so the HLO/MHLO/StableHLO type will similarly be named `f8E5M2`.

The E4M3 type hasn't been added to MLIR yet. Since the dtype has unusual non-IEEE-compliant semantics, it may have a more NVIDIA-specific name. The dtype name is referred to as `f8E4M3` in this RFC but will likely be different in practice. Because the dtype name for E4M3 will be decided by MLIR and not XLA, it is not further discussed in this RFC.

If other vendors wish to support their own FP8 dtypes, they should first propose adding them to MLIR. Once accepted and implemented, we can consider supporting such types in StableHLO and XLA on a case-by-case basis. For now, our focus is on the two FP8 dtypes supported by Hopper GPUs.
Scaling
As stated in the background section, NVIDIA recommends FP8 be used with symmetric quantization. The scaling for FP8 symmetric quantization will be represented using normal multiply and divide ops in HLO and StableHLO.
There are several possibilities for how to represent scaling using multiply and divide ops. In this section, we first present a generic approach that will work with any op. The subsection "Alternative way to scale" will present an alternative representation for `Dot` and `Conv` which closely matches what Hopper hardware (and likely other FP8 hardware) executes in practice.

With the generic approach to scaling, running an op, such as `Dot` or `Add`, when training an FP8 model will be represented by the following steps in HLO and StableHLO:

1. Cast the FP8 inputs to FP16.
2. Multiply the FP16 inputs by their input scales.
3. Run the op, such as `Dot`, on the FP16 inputs, getting an FP16 output.
4. Compute the maximum absolute value (amax) of the FP16 output with a Reduce.
5. Divide the FP16 output by the output scale.
6. Cast the result to FP8.
7. Compute the new output scale for the next step from the amax.

Here is an abridged example of how an FP8 matmul which multiplies an input with itself would look like in StableHLO during training. Some lengthy sections of code are replaced by "..." for brevity.
Note that the input is unscaled after being cast to FP16, and the output is scaled before being cast to FP8. There should never be an unscaled FP8 tensor, because otherwise the FP8 tensor may underflow or overflow.
In this example, the new scale is computed as 1.1 * (z_max / 448). The (z_max / 448) part is to create a scale that will cause the FP8 tensor to barely not overflow, since 448 is the maximum representable E4M3 value. The scale is multiplied by 1.1, a "slack" value, in case the tensor during the next step has a slightly higher maximum value. See the "Background" section for details.
On NVIDIA Hopper GPUs, steps (1)-(6) can all be done by a single cuBLAS function call, but this requires a minor modification to step (5). cuBLAS requires the inverse output scale (i.e. `1/z_scale`) to be passed instead of the output scale itself. Instead of dividing the output by the output scale, cuBLAS multiplies the output by the inverse output scale. This is mathematically equivalent, but requires XLA to compute `1/z_scale` before passing it to cuBLAS. The way XLA GPU will handle this is described in the "XLA GPU codegen" section.

When doing inference with a static scale, steps (4) and (7) are not needed since the scale is not updated. The other five steps are identical to the training case.
For dynamic range inference quantization, where the scale is dynamic during inference, steps (1)-(7) can be done similarly to training. Traditionally dynamic range quantization computes the scale for the given step and uses it the same step, instead of computing the scale for the next step. The example above can be modified to do this by running Step 7 before Step 5 and using the newly computed scale in Step 7 to scale the output in Step 5. But on Hopper GPUs, this will be significantly slower, likely reducing performance to be worse than even FP16 performance.
Alternative way to scale
In the above section, ops like Dot have inputs and outputs in FP16 (or BF16 or FP32). However, Hopper hardware directly supports matmuls with FP8 inputs and FP16/BF16 outputs. To better have HLO match what hardware supports, we can represent Dot and Conv ops in HLO with FP8 inputs and FP16/BF16 outputs as well. This approach does not work for arbitrary ops such as Add, however.
To show how this alternative approach can be done, we show the generic scaling example in the above section using Python-like pseudocode, which has a Dot with FP16 inputs and outputs. We then show a new equivalent example that instead has a Dot with FP8 inputs and FP16 outputs.
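A sketch of the two variants in JAX-flavored Python (illustrative: it assumes a recent JAX where `jnp.dot` accepts FP8 inputs with a `preferred_element_type`, and it reuses the 448 maximum and 1.1 slack from the earlier example):

```python
import jax.numpy as jnp

E4M3_MAX = 448.0

def quantized_dot_generic(x_fp8, y_fp8, x_scale, y_scale, z_scale):
    # Generic approach: the dot runs on FP16 inputs that were already unscaled.
    z_fp16 = jnp.dot(x_fp8.astype(jnp.float16) * x_scale,
                     y_fp8.astype(jnp.float16) * y_scale)
    new_z_scale = 1.1 * jnp.max(jnp.abs(z_fp16)) / E4M3_MAX
    z_fp8 = jnp.clip(z_fp16 / z_scale, -E4M3_MAX, E4M3_MAX).astype(jnp.float8_e4m3fn)
    return z_fp8, new_z_scale

def quantized_dot_alt(x_fp8, y_fp8, x_scale, y_scale, z_scale):
    # Alternative approach: the dot runs directly on FP8 inputs, producing an
    # FP16 output, which is then unscaled using the input scales.
    z_fp16 = jnp.dot(x_fp8, y_fp8,
                     preferred_element_type=jnp.float16) * (x_scale * y_scale)
    new_z_scale = 1.1 * jnp.max(jnp.abs(z_fp16)) / E4M3_MAX
    z_fp8 = jnp.clip(z_fp16 / z_scale, -E4M3_MAX, E4M3_MAX).astype(jnp.float8_e4m3fn)
    return z_fp8, new_z_scale
```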
Note that both functions are identical except for their first lines (the computation of `z_fp16`).
Both functions are mathematically and roughly numerically equivalent. The former function unscales the inputs then runs the dot. The latter function runs the dot on the scaled inputs, resulting in a scaled output, then unscales the output using the input scales. This relies on the fact that it's equivalent to scale the inputs or the output. That is, we have the property that given any matrices `x` and `y` and any scalars `xs` and `ys`, we have `dot(x * xs, y * ys) == dot(x, y) * xs * ys`.

This property is true only for a limited set of ops, notably Dot and Conv. Therefore, this alternative representation of scaling cannot be used for arbitrary ops.
Hopper supports FP8 arithmetic through matmuls with FP8 inputs and FP16/BF16 outputs. Therefore, the alternative representation closely matches what Hopper hardware executes. Other FP8 hardware will likely also execute FP8 matmuls similarly. This will make it easier for compiler backends to emit code for Dot and Conv instructions (Conv is typically implemented via matmuls).
Currently in XLA GPU, we plan to only support FP8 Dot and Conv instructions through cuBLAS and cuDNN, and neither representation makes this easier than the other. But the alternative representation will potentially make emitting Dot and Conv instructions easier on other backends, as well as on the XLA GPU backend if it ever chooses to generate its own Dot and Conv code instead of going through cuBLAS and cuDNN. XLA GPU will pattern match both representations to cuBLAS/cuDNN calls.
StableHLO quantization types and ops
StableHLO has special quantized types and ops, which support both symmetric quantization and asymmetric quantization. This quantization support would allow us to directly represent the scaling done by FP8 symmetric quantization without needing to use explicit multiply and divide ops. This would make pattern matching to cuBLAS calls or other backend-specific ops much simpler.
However, we will initially still use multiply and divide ops to represent scaling instead of the quantized types and ops, using either the original scaling representation or the alternative. The primary reason is that fully supporting the quantized types/ops will take a considerable amount of work, and so relying on them would not allow us to support FP8 on Hopper GPUs by the end of the year. In particular, using the quantized types/ops for FP8 requires, among other things, support for dynamic scales: the `UniformQuantizedType` type, which StableHLO's quantized types use, only supports compile-time constant scales, while FP8 training requires the scale to change each step.

In the future, we will consider using StableHLO's quantized types and ops. This RFC makes no recommendation on whether to switch to these types and ops in the future.
The appendix has more information on how FP8 can be represented with StableHLO's quantized types and ops in the future.
When scaling: multiply vs. divide
Recall from the "Background" section that when quantizing from FP32 to FP8, the FP32 value is divided by the scale. When dequantizing from FP8 to FP32, the FP8 value is multiplied by the scale.
This is consistent with integer symmetric quantization, where the FP32 value is converted to INT8 by dividing by the scale.
However, FP16 loss scaling traditionally takes the opposite approach: an FP32 value is multiplied by the scale to convert to FP16, and an FP16 value is divided by the scale to convert to FP32. Unlike both FP8 quantization and integer quantization, the wider type is multiplied by the scale to get to the narrower type.
In this doc, we choose to express FP8 quantization similarly to symmetric quantization, in that the wider type is divided by the scale to get the narrower type. Despite the fact that both FP8 and FP16 are floating-point types, FP8 scaling is more similar to integer quantization than to FP16 loss scaling, and so it makes more sense to follow the integer quantization convention.
Some models may use both loss scaling and FP8 quantization. In this case, converting from FP32 to FP8 involves multiplying by the loss scale and dividing by the tensor scale (in practice, intermediate tensors are almost never multiplied or divided by the loss scale, however).
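As an illustration (a hypothetical helper, not from the RFC; assumes a recent JAX with `jnp.float8_e4m3fn`):

```python
import jax.numpy as jnp

E4M3_MAX = 448.0

def to_fp8_with_loss_scale(x_fp32, loss_scale, tensor_scale):
    # Loss scaling multiplies by its scale; FP8 tensor scaling divides by its scale.
    scaled = x_fp32 * loss_scale / tensor_scale
    return jnp.clip(scaled, -E4M3_MAX, E4M3_MAX).astype(jnp.float8_e4m3fn)
```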
Note that cuBLAS uses one convention for the inputs and the other for the outputs. In particular, the FP8 inputs are multiplied by a value to convert them to a wider type, but the outputs are also multiplied by a value to convert them to FP8. By the convention of this doc, we say the FP8 inputs are multiplied by the input scales, and the outputs of a wider type are multiplied by the inverse output scale.
The choice of convention of whether to multiply or divide does not significantly affect the compiler design itself, but mostly affects how scaling is described in this doc. When pattern matching to cuBLAS calls, XLA will support any combination of multiplying or dividing the inputs by a scale and multiplying or dividing the output by a scale. The choice of convention will have a significant impact on high-level ML frameworks which support quantization, such as Keras.
FP8 arithmetic
The "Scaling" section has shown that when scaling is used, no arithmetic ops such as Dot should ever run with both non-quantized FP8 inputs and non-quantized FP8 outputs. This is because every non-quantized FP8 tensor should represent a scaled tensor when scaling is used, but arithmetic operations do not take in a scale.
Running arithmetic ops like Dot with FP8 inputs and outputs will still be supported, however. For example, the following will be allowed:
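Expressed at the JAX level rather than in HLO (illustrative; assumes a recent JAX with FP8 dtype support, and the corresponding HLO would be an add with `f8E4M3` operands and result):

```python
import jax.numpy as jnp

# FP8 inputs, FP8 output, no scales involved.
x = jnp.array([1.0, 2.0, 3.0], dtype=jnp.float8_e4m3fn)
y = jnp.array([0.5, 0.25, 4.0], dtype=jnp.float8_e4m3fn)
z = x + y
```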
NVIDIA recommends the use of scaling in all cases. Other companies, such as GraphCore, have successfully used FP8 without scaling. Users of XLA can ultimately choose whether to scale or not. If scaling ends up being highly recommended in many cases, frameworks such as Keras and JAX can choose, if they want, to warn when FP8 is used without scaling.
NVIDIA GPUs do not directly support arithmetic operations on FP8 values, but a pass similar to bf16-normalization can upcast the inputs to arithmetic ops to get numerically equivalent results.
If FP8 arithmetic was not directly supported in HLO and StableHLO, it could still be emulated by converting the input tensors to FP16, running the arithmetic instruction in FP16, then converting the output back to FP8. This is effectively running steps (1)-(6) above with a scale of 1. But doing this would be tricky for users compared to directly using FP8 tensors and would make FP8 inconsistent with other floating-point dtypes, which is why FP8 arithmetic will be supported.
Rounding
When converting to FP8, XLA will use the typical round-to-even behavior as used in other floating-point dtypes. However, in practice, FP8 should saturate on overflow, because the scale might end up being slightly too large. Therefore, when casting to FP8, frameworks like TensorFlow and JAX should emit a Clamp instruction, to clamp to the highest possible FP8 value, before emitting the Convert instruction.
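For example, a framework-side saturating cast might look like this (illustrative; assumes `jnp.float8_e4m3fn` and the E4M3 maximum of 448):

```python
import jax.numpy as jnp

E4M3_MAX = 448.0

def saturating_cast_to_e4m3(x):
    # Clamp to the largest finite E4M3 magnitude, then convert with round-to-even.
    return jnp.clip(x, -E4M3_MAX, E4M3_MAX).astype(jnp.float8_e4m3fn)
```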
Note that with FP8 arithmetic, as described in the above section, there will be no option to clamp FP8 outputs of arithmetic ops.
The ReducePrecision instruction can model the E5M2 type but not the E4M3 type, since E4M3 lacks Inf values and ReducePrecision assumes normal IEEE-like semantics. It is currently unclear how to extend ReducePrecision to support FP8 types, since the existing FP8 implementations differ in terms of Inf, NaN, and -0 representations. Future FP8 types may differ in other ways. Therefore, we will wait until FP8 becomes available on more hardware before deciding whether and how to add E4M3 support to ReducePrecision.
Stochastic rounding has been shown in many cases to result in better model quality compared to round-to-even, especially for low-precision dtypes such as FP8. A StochasticConvertType instruction was recently added, and support for this instruction is being added to XLA backends. Since stochastic rounding is not FP8-specific, it is not further considered in the FP8 design, although it may be important in achieving optimal model quality.
XLA GPU codegen
cuBLAS directly supports scaling and computing the maximum output value for matmuls. See the documentation for details. As stated before, steps (1)-(6) in the "Scaling" section above can be run with a single cuBLAS function call, with a minor modification: for the output scale, cuBLAS requires the inverse output scale to be passed into the matmul function, and cuBLAS multiplies the output by the inverse output scale, instead of dividing the output by the output scale. This allows cuBLAS to avoid many costly divisions, and the caller only has to pay the cost of a single scalar reciprocal.
XLA will pattern match steps (1)-(6) into a cuBLAS call. This rewrite will be done in the gemm-rewriter pass. Because these steps divide the output by the output scale and cuBLAS takes in the inverse output scale, XLA will additionally insert a divide instruction on the output scale before passing it to the cuBLAS call. A horizontal fusion pass can later fuse these scalar divisions into a single kernel.
XLA will also pattern match a version of steps (1)-(6) where the output is multiplied by the inverse scale instead of divided by the scale. In this case, the gemm-rewriter pass does not need to insert a divide instruction to compute the reciprocal. NVIDIA recommends frameworks like TensorFlow do this by computing the scale and inverse scale at the same time. However, this approach will make using the current form of the StableHLO quantized types and ops more awkward, as the `stablehlo.uniform_quantize` op expects a scale, not an inverse scale.

Additionally, XLA will pattern match the pattern described in the section "Alternative way to scale", which is numerically equivalent to steps (1)-(6).
cuDNN also supports FP8 with scaling for convolutions, but currently only on the forward pass. As with matmuls in cuBLAS, we will rewrite FP8 convolutions to cuDNN calls.
When emitting LLVM IR, XLA will represent FP8 as int8, using the NVIDIA Hopper hardware instructions to convert to wider types to do arithmetic. The PTX cvt instruction to convert types currently only supports converting a vector of two FP8 values at a time; since this is difficult to support, XLA codegen will initially convert only a single FP8 value at a time, passing an unused placeholder value for the second input.
When the HLO directly does FP8 arithmetic, a pass similar to bf16-normalization will upcast tensors so that no FP8 arithmetic is done.
As of commit 72eb5d2b, XLA supports fusing steps (1)-(6) in the "Scaling" section (FP8 is not yet supported, but XLA can fuse (1)-(6) when higher-precision dtypes are used). For instructions other than convolutions and dots, fusing steps (1)-(6) is not done by an FP8-specific pass but instead is done as part of the general fusion passes.
Testing plan
Unit tests will be added to XLA that test FP8 correctness.
Convergence and performance testing on Hopper GPUs will be done through TensorFlow and JAX. Both frameworks plan on adding FP8 support by the end of 2022, although FP8 will not necessarily be easy to use at first. We will find a TensorFlow or JAX ResNet50 model and a TensorFlow or JAX BERT model, fork each, add FP8 support, and then run performance and convergence tests.
Unfortunately, we do not have a baseline for FP8 performance. If, say, we find BERT is 20% faster on Hopper using FP8 compared to FP16, we will not know if the FP8 speedup is close to optimal. We will work with NVIDIA to determine if our performance results are acceptable.
As a stretch goal, we will also port FP8 to a GPT-3-like model and test convergence and performance.
Unresolved issue: When to scale
This section describes an unresolved issue that affects the boundary between frameworks such as TensorFlow/JAX and XLA. Suppose a user writes the following function (using TensorFlow notation):
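For concreteness, a stand-in function with the shape the discussion below assumes (an elementwise multiply feeding an add; the constants are made up):

```python
import tensorflow as tf

def f(x):
    # Placeholder ops: the discussion below only relies on a multiply feeding an add.
    y = x * 2.0  # elementwise multiply
    z = y + 3.0  # elementwise add
    return z
```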
To convert to FP8, the user needs to add casting, scaling, and reduce_max computations. Suppose the user chooses to represent `x` and `z` as FP8 tensors, keeping `y` in BF16.

The user starts by converting `x` into a BF16 tensor. After computing `y` and `z` in BF16, the user computes the max of `z` and converts the result back to FP8.
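A sketch of that FP8 version (illustrative; assumes a TensorFlow build that exposes `tf.float8_e4m3fn`, and reuses the 448 maximum and 1.1 slack from the "Scaling" section):

```python
import tensorflow as tf

E4M3_MAX = 448.0

def f_fp8(x_fp8, x_scale, z_scale):
    # Dequantize the FP8 input into BF16.
    x = tf.cast(x_fp8, tf.bfloat16) * tf.cast(x_scale, tf.bfloat16)
    y = x * 2.0                        # y stays in BF16
    z = y + 3.0
    # Compute the amax of z, used for the next step's scale.
    z_max = tf.cast(tf.reduce_max(tf.abs(z)), tf.float32)
    new_z_scale = 1.1 * z_max / E4M3_MAX
    # Quantize z back to FP8 with the current scale, saturating on overflow.
    z_scaled = z / tf.cast(z_scale, tf.bfloat16)
    z_fp8 = tf.cast(tf.clip_by_value(z_scaled, -E4M3_MAX, E4M3_MAX), tf.float8_e4m3fn)
    return z_fp8, new_z_scale
```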
Why did the user keep `y` in BF16? The reason is that this is necessary for FP8 to have optimal performance with a compiler. In the original BF16 function, the compiler will fuse the multiplication and addition. By making `x` and `z` FP8, the compiler must additionally fuse the casts, scalings, and `reduce_max` into the computation.

However, if `y` were additionally made FP8, more scaling ops and another `reduce_max` would be added, to compute the max of `y`. This would lead to an unnecessary performance loss, and would likely split the fusion into multiple fusions. In general, there is no reason to use FP8 within a fusion. Outside convolutions and matmuls (which are handled by cuBLAS/cuDNN), the main purpose of FP8 is that it takes less memory. However, within a fusion, intermediate values are kept in registers, not GPU memory.

The big question is this: how does the user or framework know what tensors should be in FP8? The user/framework does not know what the compiler will fuse ahead of time, so it cannot ensure that tensors within fusions are BF16 and inputs/outputs to fusions are FP8.
There are no plans to address this initially by the end of 2022. TensorFlow/JAX users will have to be roughly aware of what is fused in order to get performance benefits in FP8. XLA has a flag, `xla_allow_excess_precision`, allowing it to increase the precision of tensors, but this doesn't allow it to skip the computation of the scaling ops or the `reduce_max` call. One solution for the long term is to develop a mechanism where XLA can skip the scaling and the `reduce_max` call if it increases the precision of the corresponding tensor.

Appendix: Details on StableHLO quantization types and ops
The "Scaling" section briefly described how StableHLO has special quantized types and ops, but that these will not be initially used for FP8 symmetric quantization. This section describes how they could potentially be used with FP8 in the future.
We start by giving an example of how to quantize a tensor using StableHLO quantized types/ops:
`%qx` is the quantized version of the floating-point tensor `%x`. Let's start by examining `%qx`'s element type, which is `!quant.uniform<i8:f32, 2.0:1>`. The `!quant.uniform` type is defined in the MLIR repository itself and in this example is parameterized with `<i8:f32, 2.0:1>`. The `i8:f32` parameters mean the storage type is `i8` while the expressed type, which is the type the tensor is approximating, is `f32`. The `2.0:1` parameters mean the scale is 2 and the zero point is 1, which indicate how to convert between the quantized and real values: the real value is `(quantized_value - zero_point) * scale`, and the quantized value is `real_value / scale + zero_point` (rounded to the storage type).
The `stablehlo.uniform_quantize` op converts a floating-point tensor to an integer quantized tensor using the formula above. Because `%x` is `[[2., 4.], [6., 8.]]`, the integer representation of `%qx` is `x / 2 + 1 = [[2, 3], [4, 5]]`.

Quantized types can be directly passed to ops. Let's continue the example above by passing `%qx` to the `stablehlo.add` op (adding it to itself). When quantized types are passed directly to ops, the op takes into account the scale and zero point. In the above example, the addition does not just add the two integer tensors as if they were non-quantized. Instead, it does the equivalent of dequantizing the inputs into floating-point tensors, running the floating-point addition, then quantizing the output back to an integer format. Therefore, the result of the addition, `%qy`, represents the FP32 tensor `2 * x = [[4., 8.], [12., 16.]]`, and its quantized integer representation in memory is `(2 * x) / 2 + 1 = [[3, 5], [7, 9]]`.
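The quantized-add semantics can be checked numerically with plain NumPy (this snippet is illustrative, not StableHLO):

```python
import numpy as np

def quantized_add(qx, qy, scale, zero_point):
    # Dequantize both inputs, add in float, then requantize the result.
    x = (qx.astype(np.float32) - zero_point) * scale
    y = (qy.astype(np.float32) - zero_point) * scale
    z = x + y
    return np.round(z / scale + zero_point).astype(np.int8)

qx = np.array([[2, 3], [4, 5]], dtype=np.int8)  # represents [[2., 4.], [6., 8.]] with scale=2, zero_point=1
print(quantized_add(qx, qx, scale=2.0, zero_point=1))  # [[3 5] [7 9]]
```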
For FP8 training, the scale should not be a compile-time constant, but instead should be dynamically computed and updated at runtime. Unfortunately, this is not yet possible with the `!quant.uniform` type, which is one of the reasons why FP8 quantization will not initially use the quantized types, but support for runtime scales may be added in the future.

Once (or if) there is support for dynamic scales in MLIR's `!quant.uniform`, running an op such as Dot when training an FP8 model can be represented in StableHLO using the quantized types and ops, with the op's output quantized via `stablehlo.uniform_quantize`.

Here is an abridged example of how an FP8 matmul which multiplies an input with itself would look like in StableHLO during training.
The `"stablehlo.dot"` operation returns an FP16 output instead of a quantized FP8 output because the maximum value of the FP16 tensor first needs to be computed before the tensor is quantized.

As stated earlier, this RFC does not make a recommendation on whether the StableHLO quantized ops and types will be used in the future. Initially, they will not be used, as scaling will be represented using multiply and divide ops.