
rmsnorm optimization for M = 1 #668

Merged · 4 commits into main_perf on Dec 2, 2024
Conversation

xiaohuguo2023 (Member) commented on Nov 27, 2024:

rmsnorm kernel optimization:

  • enable buffer load/store
  • change grid size to a tl.constexpr
  • add autotuning configs with waves_per_eu = 0
  • move memory allocation outside of the wrapper to reduce autotuning overhead
  • fix an issue in the no_benchmark case
  • on average, an 88% performance improvement over the base version (see the sketch below)
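For context, a minimal sketch of the persistent-kernel structure these bullets describe. It is illustrative rather than the merged code: the kernel name, signature, and eps handling are assumptions, while NUM_PRGMS, BLOCK_SIZE, and the tl.assume hints follow the snippets quoted in the review below.

```python
import triton
import triton.language as tl

@triton.jit
def rms_kernel(output_ptr, input_ptr, g_ptr, input_row_stride,
               output_row_stride, n_rows, n_cols, eps,
               BLOCK_SIZE: tl.constexpr, NUM_PRGMS: tl.constexpr):
    row_start = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    # Non-negative stride hints help the compiler emit buffer load/store ops.
    tl.assume(input_row_stride >= 0)
    tl.assume(output_row_stride >= 0)
    # Persistent loop: NUM_PRGMS is a tl.constexpr, so the grid size is a
    # compile-time constant and each program strides over the rows.
    for row_idx in tl.range(row_start, n_rows, NUM_PRGMS):
        row = tl.load(input_ptr + row_idx * input_row_stride + col_offsets,
                      mask=mask, other=0.0).to(tl.float32)
        g = tl.load(g_ptr + col_offsets, mask=mask, other=0.0)
        rms = tl.sqrt(tl.sum(row * row, axis=0) / n_cols + eps)
        tl.store(output_ptr + row_idx * output_row_stride + col_offsets,
                 row / rms * g, mask=mask)
```

With the input, output, and g buffers allocated up front (the "memory allocation outside of the wrapper" bullet), the launch is a 1-D grid of NUM_PRGMS programs, so autotuning reruns only the kernel, not the allocations.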

-    for row_idx in tl.range(row_start, n_rows, row_step):
+    tl.assume(input_row_stride >= 0)
+    tl.assume(output_row_stride >= 0)
+    for row_idx in tl.range(row_start, n_rows, NUM_PRGMS):
Can we move the line g = tl.load(g_ptr + col_offsets, mask=mask, other=0.0) before the loop, so we do not have to load it on each iteration? Or can the compiler do that for us?
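For reference, the hoisting being asked about would look roughly like this (a sketch only; the surrounding kernel is as in the snippet above):

```python
# Hypothetical hoisted variant: g depends only on col_offsets, not on
# row_idx, so in principle it can be loaded once before the loop.
g = tl.load(g_ptr + col_offsets, mask=mask, other=0.0)
for row_idx in tl.range(row_start, n_rows, NUM_PRGMS):
    # ... load the row, compute rms, scale by the pre-loaded g ...
    pass
```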

xiaohuguo2023 (Member, Author) replied on Dec 2, 2024:

Not sure why, but keeping it in the loop gives slightly better performance on average:

|      N   | Triton (Old) | Triton (New) | Improvement (%) |
|----------|--------------|--------------|-----------------|
|   8192.0 |      2.822   |     3.035    |      7.55       |
|   9216.0 |      3.776   |     4.326    |     14.57       |
|  10240.0 |      4.165   |     4.734    |     13.67       |
|  11264.0 |      4.599   |     5.690    |     23.71       |
|  12288.0 |      5.235   |     5.265    |      0.57       |
|  13312.0 |      5.541   |     5.952    |      7.41       |
|  14336.0 |      6.304   |     5.941    |     -5.77       |
|  15360.0 |      7.544   |     7.380    |     -2.18       |
|  16384.0 |      7.069   |     7.664    |      8.43       |
|  17408.0 |      7.652   |     8.269    |      8.07       |
|  18432.0 |      8.110   |     8.330    |      2.71       |
|  19456.0 |      8.712   |     9.441    |      8.37       |
|  20480.0 |      8.915   |     9.488    |      6.43       |
|  21504.0 |     10.047   |    10.324    |      2.76       |
|  22528.0 |      9.858   |    10.207    |      3.54       |
|  23552.0 |     10.062   |     9.712    |     -3.48       |
|  24576.0 |     11.465   |    10.408    |     -9.23       |
|  25600.0 |     10.968   |    10.732    |     -2.15       |
|  26624.0 |     12.666   |    10.422    |    -17.72       |
|  27648.0 |     11.786   |    12.288    |      4.26       |
|  28672.0 |     13.457   |    11.591    |    -13.89       |
|  29696.0 |     12.321   |    13.150    |      6.73       |
|  30720.0 |     12.698   |    13.575    |      6.91       |
|  31744.0 |     14.271   |    15.549    |      8.95       |


Oh, really? If that is the case, we can keep it in the loop.

triton.Config({'waves_per_eu': 4}, num_warps=4, num_stages=1),
triton.Config({'waves_per_eu': 4}, num_warps=8, num_stages=1),
triton.Config({'waves_per_eu': 4}, num_warps=16, num_stages=1),
triton.Config({'waves_per_eu': 0}, num_warps=4, num_stages=2),

I think this can be simplified, if you want, to:
[triton.Config({'waves_per_eu': we}, num_warps=nw, num_stages=2) for (we, nw) in itertools.product([0, 1, 2, 4], [4, 8, 16])]
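Spelled out with its import (the get_autotune_config wrapper name here is illustrative), that expands to:

```python
import itertools
import triton

# Hypothetical wrapper name; the comprehension is the suggestion above,
# generating 4 x 3 = 12 configs over waves_per_eu and num_warps.
def get_autotune_config():
    return [
        triton.Config({'waves_per_eu': we}, num_warps=nw, num_stages=2)
        for (we, nw) in itertools.product([0, 1, 2, 4], [4, 8, 16])
    ]
```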

xiaohuguo2023 (Member, Author):

done

vgokhale (Collaborator) commented on Dec 2, 2024:

Can you add perf before / after this PR to the description?

x = torch.randn(args.M_start, args.N_start, device='cuda')
y = torch.zeros_like(x, device='cuda')
n_rows, n_cols = x.shape
blk_size = triton.next_power_of_2(n_cols)


@xiaohuguo2023 Did you notice any big performance drop if blk_size > 65k?

xiaohuguo2023 (Member, Author):

Yes, with blk_size > 65k we start to see VGPR spills.


My next PR will address this issue.

vgokhale merged commit fc558e7 into main_perf on Dec 2, 2024
4 checks passed