This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
LuxLib Benchmarks
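The Ratio column in the table below is consistent with Current divided by Previous (for example, 5042 ns / 5375 ns ≈ 0.94), so a value below 1 suggests the current commit is faster on that row. The following is a minimal Python sketch of that reading, not the benchmark tooling's own code; `ratio` and `percent_change` are hypothetical helper names introduced only for illustration.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time, rounded to two decimals.

    Assumption: this mirrors how the Ratio column appears to be derived.
    """
    return round(current_ns / previous_ns, 2)


def percent_change(current_ns: float, previous_ns: float) -> float:
    """Relative change of the current timing versus the previous one, in percent.

    Negative means the current commit is faster; positive means it is slower.
    """
    return (current_ns - previous_ns) / previous_ns * 100.0


if __name__ == "__main__":
    # (current ns, previous ns) pairs copied from rows of the table.
    samples = {
        "layernorm gelu (4 x 32) forward, CPU 2 threads": (5042, 5375),
        "layernorm gelu (4 x 32) forward, AMDGPU": (418904, 601544),
        "dense identity (512 x 128) zygote, oneAPI": (10334858, 7687949),
    }
    for name, (cur, prev) in samples.items():
        print(f"{name}: ratio={ratio(cur, prev):.2f}, "
              f"change={percent_change(cur, prev):+.1f}%")
```

Many of the CPU rows are in the sub-microsecond to few-microsecond range, where run-to-run noise can easily move the ratio by several percent, so small deviations from 1.00 are probably not meaningful on their own.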
Benchmark suite | Current: 9b7286d | Previous: 604783f | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5042 ns | 5375 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5125 ns | 5250 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6583 ns | 7708.5 ns | 0.85 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5125 ns | 5416 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 106934 ns | 113361 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2903264 ns | 2795172 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 418904 ns | 601544 ns | 0.70 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 9729.5 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9875 ns | 9938 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10333 ns | 10167 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9750 ns | 11063 ns | 0.88 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 509431 ns | 544547 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18145309 ns | 17852957 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 664847 ns | 629346 ns | 1.06 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1312.5 ns | 1500 ns | 0.88 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1604 ns | 1458 ns | 1.10 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1917 ns | 1771 ns | 1.08 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1312.5 ns | 1583 ns | 0.83 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 20523 ns | 20770 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1336257 ns | 1342503 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 31420 ns | 30997 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 3521 ns | 4104 ns | 0.86 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3959 ns | 4500 ns | 0.88 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4333 ns | 4500 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 3833.5 ns | 4333 ns | 0.88 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 136761.5 ns | 134970 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 9485308 ns | 8677498 ns | 1.09 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 147442 ns | 138579 ns | 1.06 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57417 ns | 57666.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47541 ns | 46875 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47542 ns | 47125 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80541 ns | 81458 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37101 ns | 36587 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 559661 ns | 582336 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 79531 ns | 69420 ns | 1.15 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2034750 ns | 2030375 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2084521.5 ns | 2088625 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2070833 ns | 2086625 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2000729 ns | 1998562 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 219850 ns | 217216 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 8230111 ns | 8077777 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1184482 ns | 930850 ns | 1.27 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 148042 ns | 175083 ns | 0.85 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 169209 ns | 147291 ns | 1.15 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 175125 ns | 150021 ns | 1.17 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 147208 ns | 151750 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167624 ns | 166825 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8071401.5 ns | 7358467.5 ns | 1.10 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 187542 ns | 262570 ns | 0.71 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1108854.5 ns | 1115103.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1109542 ns | 1110771 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1126333 ns | 1113771 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1112291 ns | 1136250 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 648435 ns | 639845.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 34776278 ns | 33057102 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1012710 ns | 864075 ns | 1.17 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4917 ns | 3792 ns | 1.30 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4812.5 ns | 4479 ns | 1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5666.5 ns | 6583 ns | 0.86 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3875 ns | 6375 ns | 0.61 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 84692.5 ns | 85209.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5745043.5 ns | 5875726.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 61061 ns | 59531 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8666 ns | 8417 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8708 ns | 8750 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9041 ns | 9042 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8958 ns | 0.95 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 567531 ns | 557500.5 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 36159388.5 ns | 34838164 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 390569 ns | 370833 ns | 1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17125 ns | 17958 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17542 ns | 16458 ns | 1.07 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19917 ns | 21125 ns | 0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17437 ns | 17292 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 63455.5 ns | 63776.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3109889.5 ns | 2927491.5 ns | 1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 75491 ns | 82870 ns | 0.91 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 218459 ns | 212625 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212021 ns | 213042 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 224041.5 ns | 212771 ns | 1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214917 ns | 212291 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 330824 ns | 329859 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 14581752 ns | 12611094 ns | 1.16 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 471355 ns | 405232 ns | 1.16 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 583 ns | 667 ns | 0.87 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 666 ns | 625 ns | 1.07 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 916 ns | 875 ns | 1.05 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 583 ns | 709 ns | 0.82 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 19184 ns | 19101 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1196032.5 ns | 1145778 ns | 1.04 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 30790 ns | 26409 ns | 1.17 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1417 ns | 1458 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1375 ns | 1334 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1583 ns | 1583 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1375 ns | 1375 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 118015 ns | 117126.5 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 8426159.5 ns | 8850213 ns | 0.95 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 123561 ns | 115676 ns | 1.07 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7416.5 ns | 7375 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 6041 ns | 1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6084 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9875 ns | 9958 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24050 ns | 23587 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1274308.5 ns | 1261233 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 47100 ns | 52723 ns | 0.89 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 230291 ns | 229167 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232000 ns | 230667 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 266667 ns | 267875 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212979 ns | 257458 ns | 0.83 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 190269 ns | 182744 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 30917862 ns | 32590762.5 ns | 0.95 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 603736 ns | 548449.5 ns | 1.10 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23276 ns | 22860 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 2010135 ns | 1933593 ns | 1.04 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 48340 ns | 39504 ns | 1.22 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16833 ns | 17042 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16875 ns | 16875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17209 ns | 17083 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 17166 ns | 16875 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 187223 ns | 185787.5 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 10116574 ns | 10029430 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 177062 ns | 162052 ns | 1.09 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 491333 ns | 491583 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 385750 ns | 385625 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 385833 ns | 386458 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 847083.5 ns | 844083 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113452 ns | 113763 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 401173 ns | 418213 ns | 0.96 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 244572 ns | 388657 ns | 0.63 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2152291.5 ns | 2155583 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1860958 ns | 1863374.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1862458 ns | 1865167 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3118353.5 ns | 3377520.5 ns | 0.92 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 230368 ns | 229580 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 10676380 ns | 9922983 ns | 1.08 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 743078 ns | 610962 ns | 1.22 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6354 ns | 6500 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6437.5 ns | 5500 ns | 1.17 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6958 ns | 7667 ns | 0.91 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5167 ns | 1.14 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 84550.5 ns | 84720.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5551475 ns | 5300415 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 59440 ns | 59932 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11459 ns | 11229 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11666.5 ns | 11395.5 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12000 ns | 12334 ns | 0.97 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11438 ns | 10667 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 610257 ns | 602168 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 37918559 ns | 38613143.5 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 410225 ns | 383917 ns | 1.07 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 541 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23314 ns | 23328 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2245583 ns | 2178076 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 49110 ns | 41367 ns | 1.19 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2084 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2125 ns | 2166 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2167 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 ns | 2084 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 229615 ns | 228927.5 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 11471919 ns | 11774524 ns | 0.97 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 180602 ns | 165900 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8083 ns | 9584 ns | 0.84 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9000 ns | 8333 ns | 1.08 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10771 ns | 9895.5 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8833 ns | 8542 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 109160 ns | 105241 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3234020 ns | 3103348.5 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 72921 ns | 71955 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 18542 ns | 17688 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17250 ns | 16666.5 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18770.5 ns | 18708 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 19146 ns | 17562 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 601629.5 ns | 595171 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 18001662.5 ns | 16252508 ns | 1.11 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 380179 ns | 358129 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 542 ns | 0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 458 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 583 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 458 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 34903 ns | 34578 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1280813.5 ns | 1237584 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47631 ns | 41387 ns | 1.15 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8958 ns | 9229 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8812 ns | 8958.5 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9937.5 ns | 9750 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9250 ns | 8104 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 260309 ns | 257823 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20391796 ns | 18331589 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 367584 ns | 349944 ns | 1.05 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396708 ns | 397270.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288333 ns | 288083 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288042 ns | 288666.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 750541 ns | 751792 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111691.5 ns | 112022 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 332135 ns | 349915 ns | 0.95 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 74831 ns | 74609 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1459375 ns | 1454270.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1133375 ns | 1130500 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1131542 ns | 1131583 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2439208 ns | 2437959 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 201195 ns | 200057 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10334858 ns | 7687949 ns | 1.34 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 324153 ns | 302285 ns | 1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6917 ns | 7750 ns | 0.89 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6875 ns | 7083.5 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7542 ns | 8312.5 ns | 0.91 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6667 ns | 6687.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 134405 ns | 139766 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5807519.5 ns | 5685169 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 59841 ns | 60383 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14750 ns | 13479.5 ns | 1.09 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14500 ns | 12750 ns | 1.14 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15625 ns | 15125 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15708 ns | 14625.5 ns | 1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 889477 ns | 923489 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 43796054 ns | 42519536.5 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 430555 ns | 407432 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 23625 ns | 25625 ns | 0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24834 ns | 23666 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25000 ns | 29417 ns | 0.85 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23896 ns | 24041 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 185493.5 ns | 186240.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7687837.5 ns | 7554376 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 115711.5 ns | 120505 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 153209 ns | 152187 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 145583.5 ns | 145250 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 153291 ns | 146917 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 104333 ns | 103958 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1016908 ns | 1013659 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42426309 ns | 44493070 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 578946 ns | 535240 ns | 1.08 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74292 ns | 74583 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 74354.5 ns | 79584 ns | 0.93 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 78334 ns | 76791.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74375 ns | 76083 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 191164.5 ns | 190594.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7963971 ns | 7364811 ns | 1.08 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 123771 ns | 121316.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 319458 ns | 273562.5 ns | 1.17 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 311000 ns | 304084 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 304750 ns | 303333 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 274792 ns | 307583 ns | 0.89 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1022252 ns | 1045024 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 41781714 ns | 39473308 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 691217 ns | 624192 ns | 1.11 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 13104.5 ns | 12417 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 13020.5 ns | 12896 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14062.5 ns | 14000 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 13375 ns | 12500 ns | 1.07 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 137424.5 ns | 138416 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5742536 ns | 5479910 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 235003 ns | 226152 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26520.5 ns | 27792 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 25209 ns | 26458 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26625 ns | 28437.5 ns | 0.94 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27062.5 ns | 33937.5 ns | 0.80 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 932532.5 ns | 924126.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41660148 ns | 42086872 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 697632 ns | 610976 ns | 1.14 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11875 ns | 11124.5 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 10584 ns | 10333 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12791 ns | 12479.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11125 ns | 11125 ns | 1 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 118662.5 ns | 118543.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3679682 ns | 3443799.5 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 237602 ns | 233176 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 23209 ns | 22291.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 23041 ns | 22417 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 23208.5 ns | 24167 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 22084 ns | 28562.5 ns | 0.77 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 677613.5 ns | 668341 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 22594824 ns | 21034051 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 677387 ns | 569113 ns | 1.19 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 63854.5 ns | 68709 ns | 0.93 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 64000 ns | 62750 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 67208 ns | 67520.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 62958 ns | 64417 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 102574 ns | 102389 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3618527 ns | 3441143 ns | 1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 236212 ns | 230751 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 466729 ns | 506375 ns | 0.92 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 477417 ns | 510167 ns | 0.94 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 503895.5 ns | 475209 ns | 1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 474459 ns | 647896 ns | 0.73 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 492357 ns | 492781 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 21057710 ns | 20664230 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 717612.5 ns | 593680 ns | 1.21 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7437.5 ns | 7958 ns | 0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7291 ns | 6750 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8500 ns | 8208 ns | 1.04 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7187 ns | 7562.5 ns | 0.95 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 136181.5 ns | 137965 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5536801 ns | 5508177.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 59160 ns | 62687 ns | 0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13125 ns | 16125 ns | 0.81 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13542 ns | 16250 ns | 0.83 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15625 ns | 16250 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15041 ns | 14833 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 898966 ns | 900927 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 39073495 ns | 39349971 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 401324 ns | 388286 ns | 1.03 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6152896 ns | 6150354 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6373625 ns | 6368167 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6374875 ns | 6373937.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11911750 ns | 11915167 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 348505 ns | 345749 ns | 1.01 |
batchedmm(512, Bsize=4)/forward/GPU/oneAPI | 52576961 ns | 49052559 ns | 1.07 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 305468.5 ns | 388426 ns | 0.79 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19051437.5 ns | 19083437.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 19952875 ns | 19960479.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 20023416.5 ns | 19966834 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36511083.5 ns | 37142104 ns | 0.98 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1017592 ns | 1072087 ns | 0.95 |
batchedmm(512, Bsize=4)/zygote/GPU/oneAPI | 77843031 ns | 78467188 ns | 0.99 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1167942 ns | 1035750.5 ns | 1.13 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 959 ns | 958 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1041 ns | 1000 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1000 ns | 1042 ns | 0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 959 ns | 958 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23689.5 ns | 23415 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2126340 ns | 2079171 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 210263 ns | 200906 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3958 ns | 3917 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4083 ns | 4000 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
4125 ns |
4041 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
3917 ns |
5458 ns |
0.72 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
284075 ns |
270573.5 ns |
1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
10799500.5 ns |
10484095 ns |
1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
625312 ns |
486775 ns |
1.28 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7583 ns |
8687 ns |
0.87 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
7750.5 ns |
7459 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
9500 ns |
9334 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7562.5 ns |
7834 ns |
0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
117941 ns |
116220 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3465578 ns |
3435001.5 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
66261 ns |
71133 ns |
0.93 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
13125 ns |
12125 ns |
1.08 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
11834 ns |
11958 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
13208 ns |
13000 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
12229.5 ns |
11750 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
624724 ns |
609643.5 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
22359924 ns |
21784602 ns |
1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
355344 ns |
341729 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
333 ns |
291 ns |
1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
22660.5 ns |
22413 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI |
2090808 ns |
2035110 ns |
1.03 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU |
46931 ns |
44053 ns |
1.07 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
3000 ns |
3000 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
3042 ns |
2917 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
3334 ns |
3208 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
3167 ns |
2916 ns |
1.09 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
202209.5 ns |
194923.5 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI |
9381105 ns |
9225861.5 ns |
1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
156911.5 ns |
154488.5 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
11917 ns |
11625 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
11708 ns |
10500 ns |
1.12 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
12583 ns |
12875 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
11542 ns |
11875 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
117102 ns |
115370 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3369882 ns |
3433218 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
234642.5 ns |
231793 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
22229 ns |
22667 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
21333 ns |
22104.5 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
24542 ns |
23625 ns |
1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
21334 ns |
26729 ns |
0.80 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
575582 ns |
555861 ns |
1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19911746 ns |
20482208 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
654527 ns |
545740 ns |
1.20 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4208 ns |
4334 ns |
0.97 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4209 ns |
4333 ns |
0.97 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4291 ns |
4208 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4208 ns |
4250 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
24983 ns |
23923 ns |
1.04 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI |
2175425.5 ns |
2205811 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU |
48880 ns |
44864 ns |
1.09 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16250 ns |
16500 ns |
0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16250 ns |
16333 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16250 ns |
16166 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16083 ns |
16292 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
326969 ns |
319806 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI |
12126578 ns |
10190777 ns |
1.19 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
210562 ns |
186077 ns |
1.13 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
2000 ns |
2125 ns |
0.94 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
2083 ns |
2084 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
2167 ns |
2209 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
2125 ns |
2000 ns |
1.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
36547 ns |
35327 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1188039 ns |
1213779 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
208582 ns |
199242 ns |
1.05 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
18500 ns |
17104 ns |
1.08 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
16624.5 ns |
20167 ns |
0.82 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
17354.5 ns |
19000 ns |
0.91 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
19688 ns |
23083.5 ns |
0.85 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
291021.5 ns |
284984 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19477895 ns |
18211018 ns |
1.07 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
687748 ns |
583431 ns |
1.18 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
59437.5 ns |
59458 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
64125 ns |
65666 ns |
0.98 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
66375 ns |
66125 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
51125 ns |
52833 ns |
0.97 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
66729 ns |
66304 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/GPU/oneAPI |
84867273.5 ns |
87707222.5 ns |
0.97 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU |
98541 ns |
110241 ns |
0.89 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
132875 ns |
153041 ns |
0.87 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
110729.5 ns |
155229 ns |
0.71 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
163458.5 ns |
130209 ns |
1.26 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
232750 ns |
286334 ns |
0.81 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
214764 ns |
210129.5 ns |
1.02 |
batchedmm(16, Bsize=512)/zygote/GPU/oneAPI |
145741629.5 ns |
149924497 ns |
0.97 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU |
552656 ns |
511145 ns |
1.08 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
81917 ns |
106521 ns |
0.77 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
86646 ns |
78958 ns |
1.10 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
89250 ns |
84042 ns |
1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
81916 ns |
115521 ns |
0.71 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
193747 ns |
191513.5 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5265023 ns |
5334020 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
169752 ns |
267630 ns |
0.63 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1909063 ns |
1894896 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1910500 ns |
1902375 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1920458 ns |
1878334 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1930417 ns |
1895250 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
515414.5 ns |
507442 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
26285784 ns |
28152566.5 ns |
0.93 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
918784.5 ns |
825763 ns |
1.11 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
291 ns |
291 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
22081 ns |
21516 ns |
1.03 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI |
2134831.5 ns |
2100524 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU |
40930 ns |
35507 ns |
1.15 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1833 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1875 ns |
1875 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1834 ns |
1834 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1792 ns |
1833 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
252925 ns |
245735 ns |
1.03 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI |
10692380.5 ns |
9780504 ns |
1.09 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
182692 ns |
164548 ns |
1.11 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
9458 ns |
10916 ns |
0.87 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
9188 ns |
8291 ns |
1.11 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10792 ns |
11146 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
8417 ns |
9500 ns |
0.89 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
117162 ns |
114788 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3311573.5 ns |
3351587 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
237482 ns |
232004 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9458 ns |
8916 ns |
1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8875 ns |
8854.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
11833 ns |
10917 ns |
1.08 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8584 ns |
9583 ns |
0.90 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
505476 ns |
491693 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
19448705 ns |
19969043 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
634061.5 ns |
536332 ns |
1.18 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57667 ns |
57958 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
47167 ns |
46625 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46458 ns |
46750 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
81542 ns |
83166 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
39572.5 ns |
38476.5 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1348881 ns |
1460287 ns |
0.92 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
77771 ns |
71814 ns |
1.08 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1918812 ns |
1905145.5 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1974208.5 ns |
1949542 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1982521 ns |
1958500 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1878771 ns |
1874958 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
218975.5 ns |
212675 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31988008.5 ns |
33332615 ns |
0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1006995.5 ns |
968925.5 ns |
1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
267167 ns |
267500 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
270042 ns |
271479.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
270125 ns |
271209 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
267500 ns |
268209 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
199236.5 ns |
194219.5 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7576739 ns |
7638787 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
283863 ns |
271267 ns |
1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
601875 ns |
585333.5 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
592188 ns |
600292 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
688500 ns |
671042 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
585896 ns |
845604.5 ns |
0.69 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1011793 ns |
991966 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
43904369 ns |
42952243 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
901014.5 ns |
831153 ns |
1.08 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2191333 ns |
2211666 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2189750 ns |
2203958 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2188208 ns |
2229083 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2218500 ns |
2173792 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
156704.5 ns |
161646 ns |
0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8448752 ns |
8668502.5 ns |
0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
408114.5 ns |
470965 ns |
0.87 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5478916 ns |
5493104.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5506458.5 ns |
5515875 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5476312.5 ns |
5526542 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5505042 ns |
6852458 ns |
0.80 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
932002 ns |
959137 ns |
0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
50568026 ns |
49532486 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1719777 ns |
1437405 ns |
1.20 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
478417 ns |
478292 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
346417 ns |
345625 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
346833 ns |
346750 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
909416 ns |
908542 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
45620.5 ns |
46909 ns |
0.97 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI |
849567 ns |
871386 ns |
0.97 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
245423 ns |
393175 ns |
0.62 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
2168937.5 ns |
2137500 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1865604 ns |
1869334 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1853875 ns |
1859271 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
3118125 ns |
3380209 ns |
0.92 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
258815 ns |
264095.5 ns |
0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
15172873 ns |
13390420 ns |
1.13 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
774888 ns |
632907.5 ns |
1.22 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57916.5 ns |
57458 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46625 ns |
46166 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46417 ns |
46250 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83083 ns |
78667 ns |
1.06 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
28412 ns |
28560 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1440678 ns |
1394875.5 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
77121 ns |
73147 ns |
1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2032479.5 ns |
2029292 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2090834 ns |
2078187.5 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2093437.5 ns |
2063250 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1970145.5 ns |
1963958 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
225176 ns |
230846.5 ns |
0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
36084585 ns |
36347331 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1040890.5 ns |
980522 ns |
1.06 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58000 ns |
58083.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
47250 ns |
46584 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47042 ns |
46917 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82500 ns |
79958 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
47929 ns |
48944 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
802274 ns |
829446 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
68761 ns |
71428.5 ns |
0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1925083 ns |
1871729 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1980416.5 ns |
1973604 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1983458.5 ns |
1944167 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1899708 ns |
1876792 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
232135.5 ns |
238010 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
17785153 ns |
18705710.5 ns |
0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
917619.5 ns |
881607.5 ns |
1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
417 ns |
375 ns |
1.11 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
291 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
34988 ns |
34878 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
1197187 ns |
1190778.5 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
48631 ns |
47028 ns |
1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7166 ns |
6270.5 ns |
1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6229 ns |
6187.5 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7375 ns |
7375 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6833 ns |
6125 ns |
1.12 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
204201 ns |
211705.5 ns |
0.96 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20381663 ns |
20119098 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
368334 ns |
332741 ns |
1.11 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
250 ns |
250 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
291 ns |
291 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
291 ns |
250 ns |
1.16 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
31750 ns |
32902 ns |
0.96 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI |
1243003 ns |
1224139 ns |
1.02 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU |
37010.5 ns |
36327 ns |
1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
2750 ns |
2667 ns |
1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
2708 ns |
2667 ns |
1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
3458 ns |
4292 ns |
0.81 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
3292 ns |
3167 ns |
1.04 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
184588.5 ns |
187662.5 ns |
0.98 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI |
8227541 ns |
5673429 ns |
1.45 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
149351 ns |
136635 ns |
1.09 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
495354 ns |
467208 ns |
1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
492666 ns |
469417 ns |
1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
465708 ns |
466875 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
459375 ns |
464979.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
135301 ns |
137312 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5838247 ns |
5812904.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
324193 ns |
361475 ns |
0.90 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4083792 ns |
4027749.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4070187.5 ns |
4071500 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4078000 ns |
4067417 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4075916.5 ns |
5516750 ns |
0.74 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
676920 ns |
690445 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
32374547 ns |
32063716 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1471415 ns |
1091915 ns |
1.35 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
49815292 ns |
49879250 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
35545937.5 ns |
35487583 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
35523625 ns |
35512833.5 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
96973125 ns |
96974083 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1622419.5 ns |
1622377 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/oneAPI |
55438113 ns |
55868634.5 ns |
0.99 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU |
1055850.5 ns |
1579230 ns |
0.67 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
154508083.5 ns |
154423062.5 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
112410479 ns |
112364750 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
112319041 ns |
112377416 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
295571917 ns |
299989812 ns |
0.99 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6490301 ns |
6468945 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/oneAPI |
129703825.5 ns |
126761495 ns |
1.02 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU |
5576083 ns |
7230228 ns |
0.77 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
18188 ns |
19104.5 ns |
0.95 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
17500 ns |
18375 ns |
0.95 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
17041 ns |
17375.5 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
15083.5 ns |
15083 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
19789 ns |
19621 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI |
1104141 ns |
1223248 ns |
0.90 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU |
26210 ns |
28854 ns |
0.91 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
10875.5 ns |
11062.5 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
8791 ns |
8833 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
9000 ns |
9291 ns |
0.97 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
17291.5 ns |
17667 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
247103 ns |
252067.5 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI |
10458746 ns |
9844493 ns |
1.06 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU |
146071.5 ns |
138484 ns |
1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
9375 ns |
7937.5 ns |
1.18 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8875 ns |
8125 ns |
1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
9979.5 ns |
10375 ns |
0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
8291.5 ns |
8708 ns |
0.95 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
118486 ns |
120230.5 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3549725 ns |
3557828.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
236803 ns |
235119 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10771 ns |
9708 ns |
1.11 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9312.5 ns |
9084 ns |
1.03 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10125 ns |
9792 ns |
1.03 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10208.5 ns |
10667 ns |
0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
587036 ns |
599437 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
22177824 ns |
22720103 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
658231.5 ns |
557070 ns |
1.18 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
9229 ns |
9291.5 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
9791 ns |
8812.5 ns |
1.11 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
10416 ns |
9917 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
9208 ns |
8958.5 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
114121 ns |
118821 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3396921 ns |
3465548.5 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
71690 ns |
71593 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
15145.5 ns |
13687.5 ns |
1.11 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
13853.5 ns |
13604.5 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
16167 ns |
14395.5 ns |
1.12 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
14584 ns |
14750 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
559358 ns |
570663 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19846391.5 ns |
20121784.5 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
345674 ns |
323504 ns |
1.07 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
542 ns |
542 ns |
1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
583 ns |
625 ns |
0.93 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
667 ns |
584 ns |
1.14 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
541 ns |
500 ns |
1.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
34284 ns |
35088 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1180662 ns |
1218149.5 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
209633 ns |
203871 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8520.5 ns |
7562.5 ns |
1.13 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7625 ns |
7667 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8167 ns |
7875 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
10333.5 ns |
8520.5 ns |
1.21 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
228173.5 ns |
227876 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
24272825 ns |
22566032 ns |
1.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
665251.5 ns |
569945 ns |
1.17 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
15833 ns |
16458 ns |
0.96 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
15792 ns |
17041 ns |
0.93 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
14250 ns |
16209 ns |
0.88 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
10604 ns |
10979 ns |
0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
20654 ns |
20941 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1114936.5 ns |
1150830 ns |
0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
189192 ns |
182992 ns |
1.03 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
35459 ns |
35666 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
35708.5 ns |
35167 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
35917 ns |
36000 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
35583 ns |
57833 ns |
0.62 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
262047 ns |
265749 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11306916 ns |
12188303 ns |
0.93 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
587246 ns |
534293 ns |
1.10 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
453583 ns |
447500 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
451792 ns |
488042 ns |
0.93 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
456541.5 ns |
455709 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
443395.5 ns |
496916 ns |
0.89 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
194459 ns |
195513 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6038382.5 ns |
5997948.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
348194 ns |
328714 ns |
1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4066000 ns |
4024209 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4050916 ns |
4055021 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4067083.5 ns |
4053917 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4058792 ns |
5501562.5 ns |
0.74 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
511129.5 ns |
521631.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
27375780 ns |
27256015 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1358564 ns |
1059038 ns |
1.28 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
828745583 ns |
836727208 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
540680709 ns |
553913292 ns |
0.98 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
540688459 ns |
540736625 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
1504191667 ns |
1517196875 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22557615 ns |
22767789 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/GPU/oneAPI |
174343277 ns |
174930068 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU |
14523230.5 ns |
10331681 ns |
1.41 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
3007511084 ns |
3773348667 ns |
0.80 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
2962783167 ns |
1782084291 ns |
1.66 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
1774752667 ns |
1780399750 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
4717186625 ns |
4786718666 ns |
0.99 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
119117244 ns |
118657187 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/oneAPI |
944339444.5 ns |
1332561794 ns |
0.71 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU |
87286346 ns |
67063298 ns |
1.30 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
76833 ns |
76542 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
80875 ns |
76584 ns |
1.06 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
79667 ns |
79583 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
76416 ns |
76708.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
193222 ns |
195943.5 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7583823.5 ns |
5455658.5 ns |
1.39 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
108921 ns |
123300.5 ns |
0.88 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
193250 ns |
191292 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
280041.5 ns |
252042 ns |
1.11 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
281104 ns |
199562.5 ns |
1.41 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
198375 ns |
225542 ns |
0.88 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
995796 ns |
1004442 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
43132049 ns |
43458500 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
631361 ns |
590764 ns |
1.07 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
199538562.5 ns |
199694520.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
139125625 ns |
138856500 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
139078542 ns |
139241166 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
394211084 ns |
393790959 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5836169 ns |
5842492 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/oneAPI |
77966597 ns |
78913006.5 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU |
3607862.5 ns |
4746717.5 ns |
0.76 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
618106583 ns |
617676375.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
439744666 ns |
439446917 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
438674541.5 ns |
439765166.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
1195937583 ns |
1174222000 ns |
1.02 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
26625423 ns |
26723523 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/oneAPI |
274698573 ns |
276392509 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU |
21854587 ns |
15854720 ns |
1.38 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7334 ns |
7292 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6208 ns |
6125 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6041 ns |
5959 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9833 ns |
9834 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
26860 ns |
26896.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1237975 ns |
1173091 ns |
1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
47461 ns |
55173 ns |
0.86 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
220166.5 ns |
213041.5 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
219875 ns |
227729 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
229604 ns |
220416.5 ns |
1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
205791 ns |
206125 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
215811 ns |
219868 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
20823014 ns |
20153337 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
524725 ns |
541982 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
8459 ns |
8521 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
9667 ns |
7458 ns |
1.30 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10167 ns |
11167 ns |
0.91 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
6875 ns |
9250 ns |
0.74 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
112059.5 ns |
115361 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3481104 ns |
3392154.5 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
70541 ns |
74069 ns |
0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7395.5 ns |
7562.5 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8000 ns |
7958 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
10208 ns |
8167 ns |
1.25 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7583 ns |
7395.5 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
489162 ns |
495697 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
18742318 ns |
20965461 ns |
0.89 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
313343 ns |
309298 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
417 ns |
417 ns |
1 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
500 ns |
459 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
583 ns |
500 ns |
1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
25812.5 ns |
26124 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
1173150 ns |
1243719 ns |
0.94 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
46821 ns |
45334 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
8375 ns |
9584 ns |
0.87 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9500 ns |
9062.5 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
11500 ns |
9792 ns |
1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9208 ns |
9542 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
247400 ns |
247606 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
22760470 ns |
24899790.5 ns |
0.91 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
393455 ns |
382304 ns |
1.03 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
112333 ns |
112312.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
103729.5 ns |
103229 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
104458 ns |
104104.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
154666 ns |
155083 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
23375 ns |
23501 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI |
817681 ns |
811475 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
189652 ns |
192539 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
586396 ns |
536562 ns |
1.09 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
583374.5 ns |
554250 ns |
1.05 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
585896 ns |
535291.5 ns |
1.09 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
534334 ns |
910854 ns |
0.59 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
218992 ns |
221242 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
11923370.5 ns |
11751092 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
607266 ns |
560216.5 ns |
1.08 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
5333 ns |
5416.5 ns |
0.98 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
6625 ns |
6208.5 ns |
1.07 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
7625 ns |
6021 ns |
1.27 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
6166.5 ns |
4000 ns |
1.54 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
16902 ns |
17520 ns |
0.96 |
batchedmm(16, Bsize=32)/forward/GPU/oneAPI |
72702442 ns |
72849606 ns |
1.00 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU |
78091 ns |
73648 ns |
1.06 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
12479 ns |
11562.5 ns |
1.08 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
11166.5 ns |
11062 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
11625 ns |
11000 ns |
1.06 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
16625 ns |
16666 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
203795.5 ns |
207455.5 ns |
0.98 |
batchedmm(16, Bsize=32)/zygote/GPU/oneAPI |
98168072 ns |
97442684 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU |
377594 ns |
330387 ns |
1.14 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
39833 ns |
39667 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
51500 ns |
51291 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
52563 ns |
52958.5 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
13583 ns |
13625 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
19687 ns |
20356 ns |
0.97 |
batchedmm(16, Bsize=128)/forward/GPU/oneAPI |
76051886 ns |
76663129 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU |
87971 ns |
98364 ns |
0.89 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
36458 ns |
36375.5 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
32395.5 ns |
31417 ns |
1.03 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
32104 ns |
31229.5 ns |
1.03 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
57167 ns |
57000 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
181544 ns |
184178 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/GPU/oneAPI |
111712645 ns |
111708023 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU |
408524.5 ns |
355254 ns |
1.15 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
1854 ns |
1750 ns |
1.06 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
1875 ns |
2042 ns |
0.92 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
2250 ns |
2208 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
1667 ns |
1875 ns |
0.89 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
19104 ns |
19575 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI |
1153845 ns |
1219758.5 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
27270 ns |
29099.5 ns |
0.94 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
2208 ns |
2208 ns |
1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
2291 ns |
2167 ns |
1.06 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
2375 ns |
2375 ns |
1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
2291 ns |
2208 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
196068 ns |
198996.5 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI |
9312847 ns |
8766738.5 ns |
1.06 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
136721 ns |
128571 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5375 ns |
4583 ns |
1.17 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4750 ns |
4417 ns |
1.08 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6250 ns |
6729 ns |
0.93 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4500 ns |
3958 ns |
1.14 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
140814.5 ns |
143699.5 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
5579238.5 ns |
5704411.5 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
59971 ns |
61955.5 ns |
0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8604.5 ns |
8334 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8417 ns |
8083.5 ns |
1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9250 ns |
8709 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8417 ns |
8583 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
827700 ns |
836045.5 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
40495109.5 ns |
39725172 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
387884 ns |
364891 ns |
1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
55208 ns |
54833 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
55916 ns |
55833 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
55708 ns |
55583 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
56375 ns |
56000 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
36532 ns |
36570 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1228814.5 ns |
1345223 ns |
0.91 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
206372 ns |
202568 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
523646 ns |
476729 ns |
1.10 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
535916.5 ns |
494500 ns |
1.08 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
504166.5 ns |
494208 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
460541 ns |
641625 ns |
0.72 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
256291 ns |
259886 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
27542751 ns |
28017517.5 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
802518 ns |
705894 ns |
1.14 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
3314917 ns |
3310333 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
2332896 ns |
2334062.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
2330833 ns |
2333375 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
6314521 ns |
6300479 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
204455 ns |
204581.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/oneAPI |
76381388 ns |
77398976 ns |
0.99 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU |
216403 ns |
373097 ns |
0.58 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
11460833 ns |
11459729 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
8303937.5 ns |
8305729.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
8310083.5 ns |
8342854 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
21101249.5 ns |
21088292 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
734685 ns |
744676 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/GPU/oneAPI |
119951330 ns |
121497637 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU |
1068321.5 ns |
1994797.5 ns |
0.54 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5250 ns |
4833 ns |
1.09 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5146 ns |
4646 ns |
1.11 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6958.5 ns |
7520.5 ns |
0.93 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3917 ns |
4917 ns |
0.80 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
132062.5 ns |
133339 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
5633249 ns |
5450569.5 ns |
1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
55811 ns |
61520 ns |
0.91 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7479.5 ns |
7083 ns |
1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7375 ns |
7291.5 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7333 ns |
7500 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7187.5 ns |
7416.5 ns |
0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
726722 ns |
725863 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
36533131 ns |
33872141 ns |
1.08 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
371459 ns |
353680 ns |
1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
122208 ns |
100459 ns |
1.22 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
98583 ns |
123042 ns |
0.80 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
102750 ns |
102417 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
95458 ns |
121458.5 ns |
0.79 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
149588 ns |
151940.5 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5726109 ns |
5695179 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
185537 ns |
233346 ns |
0.80 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2029833 ns |
2033271 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2026979 ns |
2026417 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2027208 ns |
1997458.5 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2030625 ns |
2041833 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
673368 ns |
678763 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31179443 ns |
31810809 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1113991.5 ns |
931831 ns |
1.20 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
32417 ns |
32666 ns |
0.99 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
35792 ns |
36562.5 ns |
0.98 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
34917 ns |
36167 ns |
0.97 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
500 ns |
667 ns |
0.75 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
14705 ns |
15627 ns |
0.94 |
batchedmm(2, Bsize=4)/forward/GPU/oneAPI |
72130343.5 ns |
72187220 ns |
1.00 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU |
85381 ns |
70121 ns |
1.22 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
2625 ns |
2604.5 ns |
1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
2917 ns |
2958 ns |
0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
3042 ns |
2937.5 ns |
1.04 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
2209 ns |
2167 ns |
1.02 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
137484 ns |
139744 ns |
0.98 |
batchedmm(2, Bsize=4)/zygote/GPU/oneAPI |
92342967 ns |
92749943 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU |
354874 ns |
289641 ns |
1.23 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7250 ns |
7208 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6000 ns |
6000 ns |
1 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5916 ns |
5916 ns |
1 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9875 ns |
9917 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
35660 ns |
35855 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1282087 ns |
1252207 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
49651 ns |
53911 ns |
0.92 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213312.5 ns |
212958.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
252999.5 ns |
222708 ns |
1.14 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
231666.5 ns |
219917 ns |
1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
205458 ns |
206209 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
240672 ns |
243430 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26115627.5 ns |
27468024.5 ns |
0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
512115 ns |
513269 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3750 ns |
3750 ns |
1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3750 ns |
1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3750 ns |
1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3750 ns |
3791 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
21665 ns |
21959 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI |
2106682 ns |
2194149 ns |
0.96 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU |
43630 ns |
35557 ns |
1.23 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14500 ns |
14500 ns |
1 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14542 ns |
14500 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14584 ns |
14500 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14458 ns |
14459 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
301039 ns |
302419 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI |
11172042.5 ns |
11036089 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
195452 ns |
179841 ns |
1.09 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
117271 ns |
128041 ns |
0.92 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
105312.5 ns |
144417 ns |
0.73 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
105666 ns |
106917 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
99375 ns |
151959 ns |
0.65 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
150300.5 ns |
140874 ns |
1.07 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6047278 ns |
5963081 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
171362 ns |
236762 ns |
0.72 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1912167 ns |
1924583 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1918167 ns |
1920500 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1920208.5 ns |
1914229.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1931333 ns |
1928875 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
660598 ns |
673452 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
29570117 ns |
29935915 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1219983 ns |
899671 ns |
1.36 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17875 ns |
17333 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
17645.5 ns |
17354.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21583.5 ns |
21208 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
18500 ns |
17375 ns |
1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
104806 ns |
108833.5 ns |
0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3524106 ns |
3415955 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
76655.5 ns |
91100 ns |
0.84 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
229084 ns |
216917 ns |
1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
249729.5 ns |
252646 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
225104.5 ns |
222166 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
216374.5 ns |
229125 ns |
0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
497993 ns |
508535.5 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
19805584 ns |
19323488.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
477940 ns |
419764 ns |
1.14 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
23708.5 ns |
24271 ns |
0.98 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
30208 ns |
30791.5 ns |
0.98 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
28625 ns |
29437.5 ns |
0.97 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
1250 ns |
1584 ns |
0.79 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
15640 ns |
16398 ns |
0.95 |
batchedmm(16, Bsize=4)/forward/GPU/oneAPI |
71882264.5 ns |
72518390 ns |
0.99 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU |
87081 ns |
76093 ns |
1.14 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
4771 ns |
4500 ns |
1.06 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
5042 ns |
4916 ns |
1.03 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
5375 ns |
5125 ns |
1.05 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
4791 ns |
4625 ns |
1.04 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
201063.5 ns |
204364 ns |
0.98 |
batchedmm(16, Bsize=4)/zygote/GPU/oneAPI |
93151103.5 ns |
94073985 ns |
0.99 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU |
390274 ns |
331675 ns |
1.18 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
220875 ns |
222666 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
223291 ns |
220666.5 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
224542 ns |
225667 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
221042 ns |
220583 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
218652.5 ns |
222506.5 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7853304 ns |
7881934.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
274613 ns |
267871 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
507333.5 ns |
495084 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
513291.5 ns |
511812.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
517583.5 ns |
500854 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
495500 ns |
675750 ns |
0.73 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1033086 ns |
1053634 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
43068906 ns |
42862742 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
851254 ns |
780999 ns |
1.09 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
20125 ns |
20375 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
22271 ns |
20000 ns |
1.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
23084 ns |
23875 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
19500 ns |
18792 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
112207.5 ns |
114286 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3446725.5 ns |
3510843 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
75731 ns |
89858 ns |
0.84 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
221542 ns |
212375 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
218750 ns |
213041 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
223104 ns |
214458 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
213312 ns |
212541 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
713841 ns |
727333.5 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
25836932 ns |
24570511 ns |
1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
533536 ns |
469036 ns |
1.14 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6875 ns |
6666 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
6625 ns |
6604.5 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
7458 ns |
8750.5 ns |
0.85 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
6084 ns |
6208 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
133356 ns |
137142 ns |
0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
5813011 ns |
5605207 ns |
1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
67851 ns |
60974 ns |
1.11 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10583 ns |
9791 ns |
1.08 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
11166.5 ns |
10084 ns |
1.11 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10625 ns |
10750 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10500 ns |
10750 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
791888 ns |
794651.5 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
38204864 ns |
37034174 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
384539 ns |
370101.5 ns |
1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5000 ns |
4666 ns |
1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5000 ns |
4708 ns |
1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6167 ns |
7437.5 ns |
0.83 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4500 ns |
4917 ns |
0.92 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
135679.5 ns |
138544.5 ns |
0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
5370478.5 ns |
5520602 ns |
0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
59130 ns |
59692 ns |
0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7792 ns |
7458 ns |
1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7542 ns |
7166 ns |
1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7833 ns |
7791 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7250 ns |
7708 ns |
0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
755166.5 ns |
755761 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
39315379 ns |
37179182 ns |
1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
397459.5 ns |
376523 ns |
1.06 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
14342083 ns |
14498417 ns |
0.99 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
10137958 ns |
10124125 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
10102833 ns |
10094833 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
27729333 ns |
27748583.5 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
547150.5 ns |
532665 ns |
1.03 |
batchedmm(128, Bsize=512)/forward/GPU/oneAPI |
94505718 ns |
94795139 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU |
394839 ns |
866850 ns |
0.46 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
46201374.5 ns |
46333437 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
33505729.5 ns |
33447541.5 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
33506833 ns |
33510458 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
85350458 ns |
85445667 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2644948 ns |
2636151 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/oneAPI |
192552360 ns |
192783631 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU |
3310165 ns |
5189385.5 ns |
0.64 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
66875 ns |
66458 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
67062.5 ns |
65687.5 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
70562.5 ns |
70500 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
66292 ns |
66500 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
117903.5 ns |
118172.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3543442.5 ns |
3662360 ns |
0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
234013 ns |
237313 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
480625 ns |
467958 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
471875 ns |
480333.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
479792 ns |
474916.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
467417 ns |
686583.5 ns |
0.68 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
702486.5 ns |
715446 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26280981 ns |
26609747 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
792918 ns |
655875 ns |
1.21 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
500 ns |
542 ns |
0.92 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
584 ns |
625 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
583 ns |
1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
500 ns |
500 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
33017 ns |
32877 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1183373 ns |
1227269 ns |
0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
47820 ns |
47579 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
8708 ns |
8750 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9771 ns |
9208 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9208 ns |
9104.5 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
8167 ns |
9750 ns |
0.84 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
282049 ns |
280778.5 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
20818689 ns |
21881943 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
378759 ns |
355484 ns |
1.07 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
9458 ns |
9500 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
9500 ns |
9500 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
9459 ns |
9500 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
9459 ns |
9500 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
23093.5 ns |
23273 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1930074 ns |
1862112.5 ns |
1.04 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
210972 ns |
200655 ns |
1.05 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
50417 ns |
50209 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
50209 ns |
50250 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
51166 ns |
50500 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
50292 ns |
72375 ns |
0.69 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
276253 ns |
278469.5 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
13609571 ns |
13204061 ns |
1.03 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
610356 ns |
491037 ns |
1.24 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
55167 ns |
54917 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
55875 ns |
55667 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
55459 ns |
55584 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
55500 ns |
56000 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
27847 ns |
28169 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1197101 ns |
1174691 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
206462 ns |
203240 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
479083 ns |
518854 ns |
0.92 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
504375 ns |
500625 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
537333 ns |
497750 ns |
1.08 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
461667 ns |
643417 ns |
0.72 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
236634 ns |
238777 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
33673119 ns |
31628121.5 ns |
1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
841708 ns |
758938 ns |
1.11 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
643604.5 ns |
655042 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
591417 ns |
613083 ns |
0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
647458 ns |
652541 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
653625 ns |
678416.5 ns |
0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
206246.5 ns |
192069 ns |
1.07 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8254396 ns |
8140636 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
266572 ns |
269704 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2241792 ns |
2167104.5 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2237167 ns |
2233125 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2233375 ns |
2241292 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2235000 ns |
2230208.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
932062 ns |
929752.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
48393957 ns |
55073105 ns |
0.88 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1360079 ns |
1217770.5 ns |
1.12 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
19125 ns |
19500 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
24979 ns |
19208.5 ns |
1.30 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
23437.5 ns |
23542 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
18583 ns |
20000 ns |
0.93 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
111398.5 ns |
111306 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3483614 ns |
3589059.5 ns |
0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
79426 ns |
91551 ns |
0.87 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
225521 ns |
220459 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
233166.5 ns |
226458 ns |
1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
229166 ns |
223104.5 ns |
1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
219708 ns |
219708 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
719263 ns |
714110 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
25222149 ns |
26626181 ns |
0.95 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
554326 ns |
487481 ns |
1.14 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
500 ns |
625 ns |
0.80 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
583 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
584 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
459 ns |
500 ns |
0.92 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
24044 ns |
23491 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI |
1223409 ns |
1232519 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
47911 ns |
43771 ns |
1.09 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9750 ns |
9417 ns |
1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9667 ns |
9291.5 ns |
1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
10417 ns |
9708 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
10042 ns |
9646 ns |
1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
267747 ns |
261581 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
25715090 ns |
23734390 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
404304.5 ns |
381618 ns |
1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
7417 ns |
8917 ns |
0.83 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
9000.5 ns |
7583 ns |
1.19 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10458 ns |
11854.5 ns |
0.88 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
7959 ns |
9042 ns |
0.88 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
116633.5 ns |
115935.5 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
3300094 ns |
3441325 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
69880 ns |
70456.5 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7666 ns |
8125 ns |
0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7500 ns |
7542 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8375 ns |
8000 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7750 ns |
7292 ns |
1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
492259.5 ns |
484010 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
17197317 ns |
17813154.5 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
321574 ns |
302215 ns |
1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1416 ns |
1417 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1583 ns |
1667 ns |
0.95 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
2083 ns |
1959 ns |
1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1542 ns |
1500 ns |
1.03 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
20671 ns |
20030 ns |
1.03 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI |
1166362.5 ns |
1146657 ns |
1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
191412 ns |
184144 ns |
1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
3459 ns |
3708 ns |
0.93 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
3667 ns |
3625 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
3750 ns |
3833 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
3542 ns |
4917 ns |
0.72 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
216790 ns |
213101.5 ns |
1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
10825139.5 ns |
10511562.5 ns |
1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
581136 ns |
524324.5 ns |
1.11 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
147979 ns |
148729 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
128271 ns |
128917 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
130625 ns |
129917 ns |
1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
225209 ns |
235541 ns |
0.96 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
23166.5 ns |
22778 ns |
1.02 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI |
1193615 ns |
1179919.5 ns |
1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU |
36441 ns |
46868 ns |
0.78 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
143209 ns |
143645.5 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
125792 ns |
130875 ns |
0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
130479.5 ns |
138417 ns |
0.94 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
251104.5 ns |
290021 ns |
0.87 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
214906.5 ns |
211960 ns |
1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI |
10713541 ns |
10741797 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU |
231632 ns |
223578 ns |
1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7500 ns |
7167 ns |
1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5959 ns |
5958 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5875 ns |
5958.5 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10083 ns |
10000 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
34265 ns |
33236 ns |
1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1203594 ns |
1203805 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
49981 ns |
57207 ns |
0.87 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
225604 ns |
221249.5 ns |
1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
267167 ns |
238542 ns |
1.12 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
236917 ns |
264500 ns |
0.90 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
212812.5 ns |
213250 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
263010 ns |
259447 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
28448372 ns |
27707385 ns |
1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
521925 ns |
530542 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
12416 ns |
13209 ns |
0.94 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
11958 ns |
12166 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
13291.5 ns |
13584 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
12416 ns |
12667 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
135191 ns |
135078 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
5177938 ns |
5685986 ns |
0.91 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
235942 ns |
227730.5 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
24458.5 ns |
23917 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
24750 ns |
24083.5 ns |
1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25020.5 ns |
24750 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
23792 ns |
30146 ns |
0.79 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
844672 ns |
833527 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
39036318 ns |
39963084.5 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
684117 ns |
615374.5 ns |
1.11 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
8708 ns |
9271 ns |
0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
9687.5 ns |
9541 ns |
1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
10438 ns |
10375 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
8541 ns |
9250 ns |
0.92 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
119636.5 ns |
119628 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
3564984 ns |
3356719.5 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
70291 ns |
74940 ns |
0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
13833 ns |
14041 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
14083 ns |
13958 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14375 ns |
14750 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14250 ns |
13459 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
646842.5 ns |
638262 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
22010471 ns |
22466836 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
363989 ns |
344824 ns |
1.06 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
8875 ns |
9666.5 ns |
0.92 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
9042 ns |
9208 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
10792 ns |
10959 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
9000 ns |
9083.5 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
119111 ns |
118521 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
3329855.5 ns |
3571671.5 ns |
0.93 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
72351 ns |
79399 ns |
0.91 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
12541.5 ns |
13416 ns |
0.93 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
12958 ns |
12416 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13625 ns |
13479.5 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
13270.5 ns |
12708 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
534452 ns |
530027 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19320932 ns |
19360325 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
339643 ns |
317163 ns |
1.07 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) |
29104 ns |
30896 ns |
0.94 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) |
34250 ns |
33813 ns |
1.01 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) |
31333 ns |
32249.5 ns |
0.97 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) |
1667 ns |
1875 ns |
0.89 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA |
16403 ns |
16425 ns |
1.00 |
batchedmm(2, Bsize=128)/forward/GPU/oneAPI |
77474262 ns |
76985679 ns |
1.01 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU |
78711 ns |
76663 ns |
1.03 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) |
5313 ns |
5417 ns |
0.98 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) |
5187.5 ns |
5000 ns |
1.04 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) |
5542 ns |
5479.5 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) |
6334 ns |
6270.5 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA |
139944.5 ns |
138278 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/oneAPI |
109880855.5 ns |
109824422.5 ns |
1.00 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU |
385874 ns |
340566 ns |
1.13 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
291 ns |
333 ns |
0.87 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
291 ns |
291 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
26331 ns |
25574 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
1212809 ns |
1142450 ns |
1.06 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
50001 ns |
45666 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6375 ns |
6458 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6520.5 ns |
6375 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6833.5 ns |
6791.5 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6250 ns |
6458.5 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
189017.5 ns |
185923.5 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
24086071.5 ns |
22900684.5 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
393884 ns |
365402.5 ns |
1.08 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
2000 ns |
2084 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
2084 ns |
2084 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
2166 ns |
2083 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
2000 ns |
2000 ns |
1 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
26947 ns |
26453 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI |
1232598.5 ns |
1207656 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
209912 ns |
203645.5 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
17271 ns |
18041 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
17958 ns |
17166.5 ns |
1.05 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
18437.5 ns |
17750 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
17416.5 ns |
23458.5 ns |
0.74 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
273533.5 ns |
268326 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
25687047 ns |
24994377.5 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
706927 ns |
600702.5 ns |
1.18 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
174834 ns |
147875 ns |
1.18 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
177583 ns |
155437.5 ns |
1.14 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
154000 ns |
155125 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
147938 ns |
151708 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
193431 ns |
190890.5 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
7836168 ns |
7974634 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
191502 ns |
271146.5 ns |
0.71 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1322708.5 ns |
1321937.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1324354.5 ns |
1330625 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1323396 ns |
1308375 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1323166 ns |
1285166 ns |
1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
871146 ns |
867140 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
46559834.5 ns |
45331705.5 ns |
1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
992400 ns |
1006962 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
24521 ns |
25500 ns |
0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
25208 ns |
23542 ns |
1.07 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
27666 ns |
28708.5 ns |
0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
24333 ns |
24416.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
230336.5 ns |
226899 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7412823 ns |
7680667 ns |
0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
115211 ns |
128029 ns |
0.90 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
119395.5 ns |
125062.5 ns |
0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
178146.5 ns |
165729.5 ns |
1.07 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
128749.5 ns |
125854.5 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
117812.5 ns |
180062 ns |
0.65 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
999409 ns |
998018.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
45419603 ns |
44411227 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
590316 ns |
568743 ns |
1.04 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
375 ns |
0.78 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
334 ns |
375 ns |
0.89 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
250 ns |
250 ns |
1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
23826 ns |
23453 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1180276 ns |
1190116 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
49221 ns |
44533 ns |
1.11 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6458 ns |
6895.5 ns |
0.94 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6667 ns |
6458 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7000 ns |
6958 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6625 ns |
6520.5 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
205667.5 ns |
201834 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
24604250 ns |
23542895 ns |
1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
395509.5 ns |
372536 ns |
1.06 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6292 ns |
5645.5 ns |
1.11 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6083 ns |
5375 ns |
1.13 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6875 ns |
7979 ns |
0.86 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5834 ns |
5166 ns |
1.13 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
139751.5 ns |
139838.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
5712361.5 ns |
5619575.5 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
236162 ns |
229750 ns |
1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10354.5 ns |
9958 ns |
1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10167 ns |
10042 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10209 ns |
10417 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10125 ns |
10854.5 ns |
0.93 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
879601 ns |
866511 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
40717350 ns |
43130156 ns |
0.94 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
675267 ns |
603858 ns |
1.12 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
709 ns |
708 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
709 ns |
708 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
708 ns |
750 ns |
0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
708 ns |
667 ns |
1.06 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
23026.5 ns |
22827 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI |
2088235 ns |
2079377 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
210772 ns |
202368 ns |
1.04 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
4917 ns |
4834 ns |
1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4875 ns |
4833 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
5209 ns |
5125 ns |
1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4833 ns |
6291 ns |
0.77 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
226029 ns |
222098 ns |
1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
10810104.5 ns |
9952955 ns |
1.09 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
586506 ns |
471721 ns |
1.24 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
8146 ns |
8750 ns |
0.93 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8625 ns |
7834 ns |
1.10 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
9270.5 ns |
9375 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7291.5 ns |
7646 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
118631.5 ns |
117939.5 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
3512039 ns |
3568146 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
70461 ns |
74409 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8125.5 ns |
8792 ns |
0.92 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8583 ns |
8583 ns |
1 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9250 ns |
8875 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8542 ns |
8083 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
573498 ns |
568724.5 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
20841257.5 ns |
20842961 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
343184 ns |
335106 ns |
1.02 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
126249.5 ns |
126042 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
128834 ns |
129208 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
131000 ns |
129542 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
181395.5 ns |
180792 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
46573 ns |
46423 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/GPU/oneAPI |
70147077 ns |
72616088 ns |
0.97 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU |
95481 ns |
101850 ns |
0.94 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
338333 ns |
315875 ns |
1.07 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
349000 ns |
334166.5 ns |
1.04 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
337062.5 ns |
323291.5 ns |
1.04 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
567833.5 ns |
609395.5 ns |
0.93 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
189615.5 ns |
187684 ns |
1.01 |
batchedmm(128, Bsize=4)/zygote/GPU/oneAPI |
92008312 ns |
93899553 ns |
0.98 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU |
440594.5 ns |
405833.5 ns |
1.09 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
397084 ns |
397500 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
288708 ns |
287979.5 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
288709 ns |
288375 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
756250 ns |
756000 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA |
43906 ns |
43964 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI |
1378630 ns |
1424885 ns |
0.97 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU |
81921 ns |
79439 ns |
1.03 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
1460000 ns |
1461000 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
1133854 ns |
1133834 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
1133374.5 ns |
1129645.5 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
2441854.5 ns |
2449292 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA |
258953 ns |
254140 ns |
1.02 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI |
11022577.5 ns |
11042616 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU |
350313 ns |
254646 ns |
1.38 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
652042 ns |
626500 ns |
1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
655770.5 ns |
657208.5 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
649874.5 ns |
649750.5 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
689167 ns |
642417 ns |
1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
182902 ns |
185720.5 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8018912 ns |
8332264.5 ns |
0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
263233 ns |
264649 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2443292 ns |
2452625 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2446500 ns |
2465208.5 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2444562.5 ns |
2459375 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2453167 ns |
2376375 ns |
1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
954882.5 ns |
949649 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
49210806 ns |
53455476.5 ns |
0.92 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1318509 ns |
1323598 ns |
1.00 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) |
33250.5 ns |
32458 ns |
1.02 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) |
35646 ns |
36521 ns |
0.98 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) |
34312 ns |
34833 ns |
0.99 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) |
792 ns |
959 ns |
0.83 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA |
15879 ns |
15902 ns |
1.00 |
batchedmm(2, Bsize=32)/forward/GPU/oneAPI |
67611192 ns |
73782106 ns |
0.92 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU |
85861 ns |
74499.5 ns |
1.15 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) |
3000 ns |
3125 ns |
0.96 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) |
3125 ns |
3250 ns |
0.96 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) |
3542 ns |
3375 ns |
1.05 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) |
3084 ns |
3062.5 ns |
1.01 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA |
139105.5 ns |
137187.5 ns |
1.01 |
batchedmm(2, Bsize=32)/zygote/GPU/oneAPI |
91963406 ns |
98822060.5 ns |
0.93 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU |
349913.5 ns |
314258 ns |
1.11 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
441375 ns |
436500 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
439792 ns |
438625 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
440208 ns |
438791 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
450041 ns |
445917 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
43441 ns |
42826 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1439826.5 ns |
1503651 ns |
0.96 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
239472 ns |
374379.5 ns |
0.64 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4142666 ns |
4140000 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4279125 ns |
4271375 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4266750 ns |
4270687.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4032604 ns |
5468750 ns |
0.74 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
238139 ns |
236201.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
36792447.5 ns |
36248116 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1233123 ns |
1135862 ns |
1.09 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3791 ns |
3750 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3792 ns |
3791 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3750 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3709 ns |
3709 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
34461 ns |
34158 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI |
1301757 ns |
1274307 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU |
40721 ns |
41117 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15292 ns |
15375 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15375 ns |
15334 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15625 ns |
15500 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15333 ns |
15250 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
259750 ns |
255579 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI |
9866512.5 ns |
8309435 ns |
1.19 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
168172 ns |
158606 ns |
1.06 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
404291 ns |
404792 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
296209 ns |
295917 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
294584 ns |
295958 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
760542 ns |
759750 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
113769 ns |
113245 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI |
1008763 ns |
1043498 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU |
89651 ns |
91962 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1484624.5 ns |
1482854 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1157750 ns |
1158625 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
1145000 ns |
1150334 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2462750 ns |
2466708 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
238527 ns |
236768.5 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI |
12433376 ns |
9725420.5 ns |
1.28 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
352393.5 ns |
298578 ns |
1.18 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
500 ns |
584 ns |
0.86 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
625 ns |
625 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
584 ns |
584 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
541 ns |
542 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
26037 ns |
25569 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1255059.5 ns |
1198679 ns |
1.05 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
210442 ns |
202679 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
7541 ns |
8083 ns |
0.93 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7708 ns |
7792 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8084 ns |
8375 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7583 ns |
8437.5 ns |
0.90 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
211417.5 ns |
207068.5 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
24400994.5 ns |
25228707 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
693867 ns |
593474 ns |
1.17 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
829271.5 ns |
829375 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
616604 ns |
617667 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
618458 ns |
618667 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
1541041.5 ns |
1544417 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
131087.5 ns |
130866 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/GPU/oneAPI |
67626584 ns |
74874331.5 ns |
0.90 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU |
167721 ns |
211214 ns |
0.79 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
2681083 ns |
2686104.5 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1986250 ns |
1994542 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1998541.5 ns |
1998375 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
4921791 ns |
4960479 ns |
0.99 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
235041 ns |
234509 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI |
96224859 ns |
102181218 ns |
0.94 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU |
859218 ns |
831293.5 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
333 ns |
292 ns |
1.14 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
250 ns |
250 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
33342 ns |
32562 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1243213 ns |
1276503 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
49520 ns |
48691 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6312.5 ns |
6333 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6375 ns |
6375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6917 ns |
6667 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6250 ns |
6104.5 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
226628 ns |
227701 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20936439.5 ns |
21756022 ns |
0.96 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
365763 ns |
346728 ns |
1.05 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1723875 ns |
1760625 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1762479 ns |
1749875 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1732458.5 ns |
1744292 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1731875 ns |
1755166 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
190679 ns |
189332 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8143920 ns |
7765672 ns |
1.05 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
366039 ns |
413433 ns |
0.89 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4353396 ns |
4360416 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4360791.5 ns |
4366917 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4367521 ns |
4349104 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4360208 ns |
5705104 ns |
0.76 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
855902 ns |
849205 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
46724354.5 ns |
48802559 ns |
0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1250493 ns |
1205562.5 ns |
1.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
13708 ns |
9604 ns |
1.43 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
7166 ns |
6916 ns |
1.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7834 ns |
8208 ns |
0.95 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
7229 ns |
6854 ns |
1.05 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
22753 ns |
22924.5 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI |
1051464 ns |
1184238.5 ns |
0.89 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU |
37990 ns |
46437 ns |
0.82 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
45958.5 ns |
50604.5 ns |
0.91 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
50187.5 ns |
52166 ns |
0.96 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
65249.5 ns |
45458.5 ns |
1.44 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
68708 ns |
33312.5 ns |
2.06 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
213762.5 ns |
211538 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI |
10164364 ns |
10576796.5 ns |
0.96 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
234103 ns |
226508 ns |
1.03 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
20791.5 ns |
21646 ns |
0.96 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
25208 ns |
26083.5 ns |
0.97 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
24229 ns |
24958.5 ns |
0.97 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
5208 ns |
5291.5 ns |
0.98 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
18045 ns |
18121 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/GPU/oneAPI |
82328327.5 ns |
88732630 ns |
0.93 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU |
90131 ns |
73668 ns |
1.22 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
12312.5 ns |
12125 ns |
1.02 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
10500 ns |
10667 ns |
0.98 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
10875 ns |
10833 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
18125 ns |
18042 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
222607.5 ns |
221707 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI |
143966308.5 ns |
148404121 ns |
0.97 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU |
383434 ns |
322703 ns |
1.19 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
405834 ns |
405917 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
297333 ns |
296791.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
296750 ns |
297167 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
762541 ns |
756709 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
46967 ns |
46696 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI |
1397676 ns |
1393570.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU |
90631 ns |
90770 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1478916 ns |
1487375 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1164458 ns |
1163500 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
1157937.5 ns |
1157209 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2468458 ns |
2472417 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
284577.5 ns |
283340.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI |
14188453 ns |
11947586 ns |
1.19 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
375254 ns |
269032 ns |
1.39 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
438667 ns |
436458 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
439125 ns |
443270.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
440208 ns |
440750 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
447917 ns |
449000 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
54562 ns |
53940 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
999880.5 ns |
1027722 ns |
0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
236502.5 ns |
323133 ns |
0.73 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4148000 ns |
4138541 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4270667 ns |
4268354.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4263812.5 ns |
4258750 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4038187.5 ns |
5475229.5 ns |
0.74 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
258729.5 ns |
255597 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31613962 ns |
31502698.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1211463 ns |
1132896.5 ns |
1.07 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
9292 ns |
9333 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
8000 ns |
8000 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
8000 ns |
8000 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
13250 ns |
13250 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
24006 ns |
23885 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2083029 ns |
1973050 ns |
1.06 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
213222 ns |
202528 ns |
1.05 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
49667 ns |
49625 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
49583 ns |
49667 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
50041 ns |
49583 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
49416 ns |
71667 ns |
0.69 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
344384 ns |
336641 ns |
1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
12740327 ns |
13058534 ns |
0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
656211.5 ns |
508895.5 ns |
1.29 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
83063 ns |
108270.5 ns |
0.77 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
82958 ns |
86167 ns |
0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
87395.5 ns |
86500 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
107312.5 ns |
146083 ns |
0.73 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
192844 ns |
192063 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6092023.5 ns |
5750624 ns |
1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
200502 ns |
267851 ns |
0.75 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2016750 ns |
2018917 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2018396 ns |
2016937.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2015124.5 ns |
2011375 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2019500 ns |
2024000.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
519972 ns |
511598 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
27686183 ns |
30563079 ns |
0.91 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1092442 ns |
860237 ns |
1.27 |
This comment was automatically generated by workflow using github-action-benchmark.
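For readers scanning the table, the Ratio column is consistent with Current divided by Previous, rounded to two decimals, so values above 1 mean the current commit was slower. A minimal sketch of that arithmetic follows, using three rows copied from the table above. The helper names `ratio` and `flag_regressions` and the `1.25` threshold are hypothetical and chosen only for illustration; they are not part of github-action-benchmark or LuxLib.

```python
from typing import List, Tuple

# (benchmark name, current ns, previous ns), copied from the table above
Row = Tuple[str, float, float]


def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current / previous, rounded to two decimals like the Ratio column."""
    return round(current_ns / previous_ns, 2)


def flag_regressions(rows: List[Row], threshold: float = 1.25) -> List[Tuple[str, float]]:
    """Return (name, ratio) for rows whose ratio exceeds the threshold.

    The threshold is an assumption for this sketch, not a value taken
    from the workflow configuration.
    """
    flagged = []
    for name, current_ns, previous_ns in rows:
        r = ratio(current_ns, previous_ns)
        if r > threshold:
            flagged.append((name, r))
    return flagged


rows: List[Row] = [
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)", 68708, 33312.5),
    ("dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU", 350313, 254646),
    ("batchedmm(2, Bsize=128)/zygote/GPU/CUDA", 139944.5, 138278),
]

for name, r in flag_regressions(rows):
    print(f"{name}: {r:.2f}x slower than previous")
```

With this threshold only the first two rows are reported. Many of the larger swings in the table sit on sub-microsecond CPU timings or on the oneAPI and AMDGPU runners, so a single ratio is probably better read as a prompt to rerun than as proof of a regression.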
Don't Merge triggering build with new Enzyme release