add blocked version to address performance issue when N is large #672
Conversation
python/perf-kernels/rmsnorm.py
Outdated
def triton_rmsnorm(x, y, g, n_rows, n_cols, blk_size, epsilon=1e-6):
    BLOCK_SIZE = blk_size
    # Use blocked approach if BLOCK_SIZE > 65536
    USE_BLOCKED = BLOCK_SIZE > 31743
One thing I noticed is that the layernorm tutorial also uses the dtype to determine whether to use the blocked or non-blocked path. I think what matters is not the actual number of elements, but their total size in bytes.
You are right, line 131 should be:
if n_cols > 65535
done
Actually, I was suggesting something more like this:
https://github.com/ROCm/triton/blob/main_perf/python/perf-kernels/layernorm.py#L118
Revised as suggested.
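For reference, the pattern in the linked layernorm.py presumably looks like the sketch below. This is an illustration, not code from this PR: pick_block_size is a hypothetical helper name, and the 64 KiB budget is taken from the diff further down in this thread.

    import torch
    import triton

    def pick_block_size(x: torch.Tensor, n_cols: int):
        # Cap the block at a 64 KiB per-row budget, expressed in elements
        # of x's dtype, so the threshold scales with element size.
        MAX_FUSED_SIZE = 65536 // x.element_size()
        blk_size = min(MAX_FUSED_SIZE, triton.next_power_of_2(n_cols))
        use_blocked = n_cols > blk_size  # long rows take the blocked path
        return blk_size, use_blocked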
Can you please add a performance comparison with and without this change?
See above.
python/perf-kernels/rmsnorm.py
Outdated
    mask = col_offsets < n_cols
    tl.assume(input_row_stride >= 0)
    tl.assume(output_row_stride >= 0)
    for row_idx in tl.range(row_start, n_rows, NUM_PRGMS):
Is this loop needed? NUM_PRGMS = n_rows in the caller, so it will never execute more than once?
Yes, if N is small and M > 304, we will need this persistent loop.
In a persistent kernel, the grid is sized according to the number of CUs. On lines 134 and 135, the grid is set to the number of rows; it is agnostic to the CU count. How is this kernel persistent?
You are correct, this is a fake persistent loop. I will remove it; making the kernel truly persistent will need a new PR.
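For context, a truly persistent launch would look roughly like the sketch below. This is an assumption-laden illustration, not code from this PR; the 304 mentioned above matches the MI300X CU count.

    import torch

    # Size the grid by the number of CUs, not by n_rows; each program then
    # strides over rows inside the kernel, so the tl.range loop over
    # (row_start, n_rows, NUM_PRGMS) does real work when n_rows > NUM_PRGMS.
    NUM_CUS = torch.cuda.get_device_properties(0).multi_processor_count
    NUM_PRGMS = min(n_rows, NUM_CUS)
    grid = (NUM_PRGMS, )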
@@ -153,7 +217,8 @@ def benchmark(M, N, provider):
    x = torch.randn(M, N, device='cuda', dtype=dtype)
    y = torch.zeros_like(x, device='cuda')
    n_rows, n_cols = x.shape
    blk_size = triton.next_power_of_2(n_cols)
    MAX_FUSED_SIZE = 65536 // x.element_size()
Can you add some comments on this magic number?
Sorry, just saw your messages. Anything greater than MAX_FUSED_SIZE and we start to see register spills.
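Based on that reply, the requested comment might read something like this (a sketch, paraphrasing the author's explanation above):

    # 65536 bytes (64 KiB) is the per-row footprint beyond which the
    # non-blocked kernel starts to spill registers; dividing by the element
    # size turns that byte budget into a max number of elements per block.
    MAX_FUSED_SIZE = 65536 // x.element_size()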
LGTM after outstanding comments are addressed.
        input_ptrs = tl.multiple_of(input_ptrs, (16, ))
        g_ptrs = g_ptr + cols
        output_ptrs = row_output_ptr + cols
        x = tl.load(input_ptrs, mask=mask, other=0.0, cache_modifier=".cg")
Have you tried peeling the last iteration? Is it worth trying? Can you add a TODO to try that as part of your next PR?
Loop peeling only applies to the blocked version, not this one. Sure, I will try that.
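For the blocked version, peeling the last iteration might look like the sketch below. It is illustrative only: the kernel name and arguments are hypothetical, and it only shows the sum-of-squares phase of RMSNorm for a single row.

    import triton
    import triton.language as tl

    @triton.jit
    def sum_squares_peeled(input_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
        # Accumulate the sum of squares for one row, with the last (possibly
        # partial) block peeled out so the main loop needs no mask.
        acc = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
        num_full = n_cols // BLOCK_SIZE
        for b in tl.range(0, num_full):
            cols = b * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
            x = tl.load(input_ptr + cols, cache_modifier=".cg")  # full block: no mask
            acc += x * x
        # Peeled tail: only this load pays for the mask.
        cols = num_full * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = cols < n_cols
        x = tl.load(input_ptr + cols, mask=mask, other=0.0, cache_modifier=".cg")
        acc += x * x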
blocked rmsnorm implementation