
Support FP8 constant #4222

Merged: htyu merged 4 commits into triton-lang:main on Jun 28, 2024
Conversation

htyu (Collaborator) commented Jun 27, 2024

To unblock the compilation of kernels like the one below, which doesn't operate arithmetically on FP8.

```
@triton.jit
def triton_poi_fused__scaled_mm__to_copy_constant_pad_nd_lift_fresh_2(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 400624
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 784
    x1 = (xindex // 784)
    x2 = xindex
    tmp0 = x0
    tmp1 = tl.full([1], 769, tl.int64)
    tmp2 = tmp0 < tmp1
    tmp3 = tl.load(in_ptr0 + (x0 + (769*x1)), tmp2 & xmask, other=0.0)
    tmp4 = tmp3.to(tl.float8e4nv)
    tmp5 = tl.full(tmp4.shape, 0.0, tmp4.dtype)
    tmp6 = tl.where(tmp2, tmp4, tmp5)
    tl.store(out_ptr0 + (x2), tmp6, xmask)
```
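
For context, here is a minimal standalone sketch (not part of the PR) that distills the failing pattern: an FP8 constant produced by `tl.full` feeding `tl.where`, with no FP8 arithmetic. It assumes a CUDA device with Triton FP8 support and a PyTorch build that provides `torch.float8_e4m3fn`.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fp8_constant_kernel(in_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(in_ptr + offs, mask=mask, other=0.0)
    x_fp8 = x.to(tl.float8e4nv)                    # fp32 -> fp8 conversion
    zero = tl.full(x_fp8.shape, 0.0, x_fp8.dtype)  # fp8 constant, previously rejected
    tl.store(out_ptr + offs, tl.where(mask, x_fp8, zero), mask=mask)


x = torch.randn(1024, device="cuda")
out = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
fp8_constant_kernel[(1,)](x, out, 1024, BLOCK=1024)
```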

htyu requested a review from ptillet as a code owner on June 27, 2024 at 17:29.
htyu (Collaborator, Author) commented Jun 27, 2024

I'll add a lit test if this sounds like a reasonable fix.

htyu (Collaborator, Author) commented Jun 27, 2024

I was hitting

```
%2 = "llvm.mlir.constant"() <{value = 0.000000e+00 : f8E4M3FNUZ}> : () -> f8E4M3FNUZ loc(#loc1)
```

error: 'llvm.mlir.constant' op result #0 must be LLVM dialect-compatible type, but got 'f8E4M3FNUZ'

ThomasRaoux (Collaborator) commented:

> I was hitting
>
> ```
> %2 = "llvm.mlir.constant"() <{value = 0.000000e+00 : f8E4M3FNUZ}> : () -> f8E4M3FNUZ loc(#loc1)
> ```
>
> error: 'llvm.mlir.constant' op result #0 must be LLVM dialect-compatible type, but got 'f8E4M3FNUZ'

You're right, makes sense. Does the conversion to the LLVM dialect work for scalar constants?

htyu (Collaborator, Author) commented Jun 27, 2024

> You're right, makes sense. Does the conversion to the LLVM dialect work for scalar constants?

Good point. Probably not. Actually, the conversion being changed here is for scalar constants, which are then broadcast to a tensor in registers.

htyu (Collaborator, Author) commented Jun 27, 2024

Oh, I see your point. The original constant in TTGIR is a tensor:

```
%cst = arith.constant dense<0.000000e+00> : tensor<1024xf8E4M3FNUZ, #blocked> loc(#loc1)
```

I'll add support for scalar constants.

ThomasRaoux (Collaborator) commented:

> Good point. Probably not. Actually, the conversion being changed here is for scalar constants, which are then broadcast to a tensor in registers.

I believe we rely on the upstream pattern for scalar `arith.constant`. It would be worth checking whether it works; maybe we need to upstream a fix.

htyu (Collaborator, Author) commented Jun 27, 2024

Confirmed that scalar constant lowering to LLVM works:

```
%cst = arith.constant 0.000000e+00 : f8E4M3FNUZ
```

=>

```
%0 = llvm.mlir.constant(0.000000e+00 : f8E4M3FNUZ) : i8
```

Inline review comment on the diff changing the `binary_op_type_checking_impl` call:

```diff
-x, y = binary_op_type_checking_impl(x, y, builder, True, True)
+# Bypass arithmetic type check for FP8 types where they are not supported.
+is_fp8 = x.type == y.type and x.type.is_fp8() and y.type.is_fp8()
+x, y = binary_op_type_checking_impl(x, y, builder, True, True, not is_fp8)
```
Review comment (Collaborator):

So fp8 would not get auto-promotion but other types would? That seems a bit odd. I would add support for fp8 in `binary_op_type_checking_impl` instead.

Reply (Collaborator, Author):

The current logic does not support fp8 because fp8 arithmetic is not available on the hardware (except for dot). Do I understand that correctly?

Also, I'm not sure we need promotion here: in the example kernel, fp8 is not used for arithmetic; rather, the kernel loads fp32, converts it to fp8, and conditionally stores it out.

Reply (Collaborator, Author):

Or do you think it's safe to not promote anything when `x.type == y.type`?

Reply (Collaborator):

I meant that, in general, if we have a mixed-mode op we promote to the highest format. We could do that for fp8 as well, right?

Reply (Collaborator, Author):

Oh yeah we should do that.
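
For illustration only, here is a tiny toy sketch of the "promote to the highest format" idea extended to fp8; the names and ranks below are made up, and the real logic lives in Triton's `binary_op_type_checking_impl`.

```python
# Toy promotion rule with hypothetical ranks: fp8 < fp16/bf16 < fp32 < fp64.
_FP_RANK = {"fp8e4nv": 0, "fp8e5": 0, "fp16": 1, "bf16": 1, "fp32": 2, "fp64": 3}

def promote(lhs: str, rhs: str) -> str:
    """Return the wider of two float formats; equal ranks keep the left-hand type."""
    return lhs if _FP_RANK[lhs] >= _FP_RANK[rhs] else rhs

assert promote("fp8e4nv", "fp32") == "fp32"  # mixed-mode op promotes upward
assert promote("fp8e5", "fp8e5") == "fp8e5"  # same-typed fp8 stays fp8, no promotion needed
```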

htyu (Collaborator, Author) commented Jun 27, 2024

Bypass arithmetic type check when inputs are same-typed.
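
Below is a rough, hypothetical sketch of that bypass, based only on the review excerpt and this description; it is not the actual committed diff, and the wrapper name is made up.

```python
# Hypothetical wrapper around the helper shown in the review excerpt above.
# Only request the arithmetic type check (and the implicit promotion it does)
# when the operand types differ; same-typed operands, including fp8, are
# passed through unchanged.
def check_where_operands(x, y, builder):
    arithmetic_check = x.type != y.type
    return binary_op_type_checking_impl(x, y, builder, True, True, arithmetic_check)
```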

htyu merged commit 938e388 into triton-lang:main on Jun 28, 2024. 6 checks passed.
Jokeren pushed a commit that referenced this pull request Jul 1, 2024
Jokeren added a commit that referenced this pull request Jul 3, 2024
Add a more meaningful check to make sure we are not merging blocks (#4186)

This is a follow-up to #4176 (comment).

I am now counting the number of blocks with (17) and without (31) block merging. I double-checked to make sure this does not pass when we use an aggressive region simplification strategy.

[AMD] Skip mfma layout in maybeDuplicate (#4170)

The workaround introduced in #4048 "forgot" to skip the mfma layout.

[TEST] Merge duplicate `max_num_imprecise_acc` tests and improve code (#4191)

[DOCS][NFC] Fix doc formatting problems (#4195)

1. f-strings cannot be used as docstrings in Python.
2. URLs should follow the reStructuredText format.
3. Code snippets in a code block should be indented.

Tested and passed on a local machine.

[BACKEND] Fix regression in pipeliner pre-checks. (#4196)

During some previous refactoring we changed the logic and started pipelining cases that had incompatible shared encodings. This was missed because one of the lit tests had not been updated :(

Remove tl.multiple_of call from tma persistent kernel (#4198)

[AMD] Guard against null in `BypassEpilogueSMEM` (#4203)

`val.getDefiningOp()` can return `nullptr`. In this case, we must fail
the `BypassEpilogueSMEM` rewrite pass for the given op. This prevents
run-time crashes.

[FRONTEND][NFC] Fix type checking, conditional logic, and loop structures for improved readability and performance (#4208)

Document TRITON_HOME (#4210)

Document the existence of `TRITON_HOME` environment variable.

The `TRITON_HOME` variable controls the location of the `.triton`
directory that stores, among other things, the files downloaded during a
`pip install -e python` virtualenv build. By default, this is located in
the user's home directory, at `~/.triton`.

I was trying to build Triton on my system on a large local disk, but with limited network home directory space, and the `pip` command kept failing with out-of-disk-space errors. It turned out that during installation, large files were downloaded to the `~/.triton` directory, causing the failure.

After checking that it was not `pip` doing this, I found the `TRITON_HOME` variable, which allowed me to work around the issue and build Triton successfully. After seconding #4007, I decided to contribute this documentation fix.

Co-authored-by: sree <sree@buckyball>

[BACKEND] Fix regression in i1 reduction (#4215)

Recent refactoring broke i1 shared memory load.

[BUILD] update URL for LLVM tarballs (#4216)

[BACKEND] Fix divisibility analysis for shift ops (#4221)

Divisibility does not ensure that a value is not 0; therefore we cannot use divisibility as a minimum for shifted values.

Support FP8 constant (#4222)


[INTERPRETER] Implement implicit tensor conversion for assignment operators (#4214)

ZzEeKkAa pushed a commit to ZzEeKkAa/triton that referenced this pull request Aug 16, 2024
y-sq added a commit to y-sq/ao that referenced this pull request Sep 9, 2024
Summary:
Add test cases to verify that compiling with inner-padding works with the Triton PR triton-lang/triton#4222.

Before the Triton PR, the inductor code-gen kernel fails at
```
tmp10 = tl.where(tmp6, tmp8, tmp9)

TypeError: unexpected type fp8e5 and fp8e5
```

Reviewed By: irobert0126

Differential Revision: D62003827
y-sq added a commit to y-sq/ao that referenced this pull request Sep 10, 2024
y-sq added a commit to y-sq/ao that referenced this pull request Sep 10, 2024
y-sq added a commit to y-sq/ao that referenced this pull request Sep 16, 2024
Summary:
Pull Request resolved: pytorch#858

The diff modifies the `padding` option and adds tests with `compile`:

* For a scaled_mm of shape MxKxN, the current `inner_padding` option only pads the `K` dimension. However, if `N` is not divisible by 16, we also get the error
```
E       RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Ddesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)`
```
So, the `pad_inner` option was modified to also pad the `K` dimensions.

-----
* The compile of inner-padding only works with the Triton PR triton-lang/triton#4222.

Before the Triton PR, the inductor code-gen kernel fails at
```
tmp10 = tl.where(tmp6, tmp8, tmp9)

TypeError: unexpected type fp8e5 and fp8e5
```

Reviewed By: irobert0126

Differential Revision: D62003827
quanta42 pushed a commit to quanta42/triton that referenced this pull request Nov 22, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024