This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
perf: rework CPU groupnorm implementation #134
Merged
avik-pal force-pushed the ap/gn_perf_fix branch 2 times, most recently from 7db5c7d to dad57d8 (August 17, 2024 21:43)
LuxLib Benchmarks
Benchmark suite | Current: 835e49c | Previous: c1fafb0 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6042 ns | 5437.5 ns | 1.11 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6208 ns | 6395.5 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7625 ns | 7833 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5958 ns | 6500 ns | 0.92 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 128632 ns | 119634 ns | 1.08 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2508335 ns | 2512796 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 885333 ns | 721042 ns | 1.23 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 423465 ns | 431465 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10041 ns | 9917 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9834 ns | 9834 ns | 1 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9833 ns | 9958 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9833.5 ns | 9958.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 540778 ns | 541539 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17733741 ns | 18802089 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 2391125 ns | 2451375 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 701457 ns | 692737 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1833 ns | 3000 ns | 0.61 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1541.5 ns | 1750 ns | 0.88 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1667 ns | 1708.5 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1708 ns | 1958 ns | 0.87 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 21581 ns | 21726 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1321599 ns | 1355600 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 217750 ns | 217375 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 31260 ns | 36941 ns | 0.85 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 3354.5 ns | 4291.5 ns | 0.78 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3833 ns | 3666 ns | 1.05 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4125 ns | 4333 ns | 0.95 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 3458 ns | 4167 ns | 0.83 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 145164.5 ns | 144722 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 9880239.5 ns | 8880529 ns | 1.11 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 1628813 ns | 1648666.5 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 151966.5 ns | 151411.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57958 ns | 57834 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46333.5 ns | 46729.5 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46000 ns | 46291 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84395.5 ns | 84333 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37356 ns | 37352 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 589953 ns | 561155 ns | 1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1085187.5 ns | 1107020.5 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 79351 ns | 85191 ns | 0.93 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2024854.5 ns | 2030042 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2083145.5 ns | 2097625 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2096895.5 ns | 2087958 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2011750 ns | 2016750 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 231392.5 ns | 232273.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 8445771 ns | 8341437 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 5641416 ns | 7266041 ns | 0.78 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1521466 ns | 1232312 ns | 1.23 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 146709 ns | 178084 ns | 0.82 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 173083 ns | 147125 ns | 1.18 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 148833 ns | 149875 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 150000 ns | 147833.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165553 ns | 165763.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8451979 ns | 8166656 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1654417 ns | 1617708.5 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 174042 ns | 173962 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1112833 ns | 1106708 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1108542 ns | 1119084 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1139375 ns | 1111167 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1132833 ns | 1119063 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 691332 ns | 689246 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35667293.5 ns | 35815575 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 5940750 ns | 6026687.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1039131 ns | 1035291 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4959 ns | 4999.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4375 ns | 4000 ns | 1.09 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5833.5 ns | 5042 ns | 1.16 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4417 ns | 4875 ns | 0.91 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 93077 ns | 91758 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5569355 ns | 5740227 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 668313 ns | 730666.5 ns | 0.91 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 59811 ns | 62971 ns | 0.95 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8584 ns | 8750 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8354.5 ns | 8708 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9000 ns | 9125 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8875 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 598095 ns | 599371 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 34936414 ns | 35003908 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5813042 ns | 6751208 ns | 0.86 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 388894 ns | 390174 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18500.5 ns | 19209 ns | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18459 ns | 19062.5 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21458.5 ns | 21666.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18333 ns | 17584 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 67113.5 ns | 66535 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3224484 ns | 3239099 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1352625 ns | 1353167 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 72711 ns | 72651 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223000 ns | 116792 ns | 1.91 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220917 ns | 116375 ns | 1.90 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 212500 ns | 117666 ns | 1.81 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218416 ns | 119084 ns | 1.83 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 353362 ns | 351956 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 15738065 ns | 14239465 ns | 1.11 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5754750 ns | 5649000 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 477395 ns | 474255 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 833 ns | 0.75 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 584 ns | 750 ns | 0.78 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 875 ns | 792 ns | 1.10 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 750 ns | 666 ns | 1.13 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 20409 ns | 20726 ns | 0.98 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1264144.5 ns | 1170656.5 ns | 1.08 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 298416 ns | 292250 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 31760 ns | 32761 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1437.5 ns | 1334 ns | 1.08 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1416 ns | 1458 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1625 ns | 1666 ns | 0.98 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1500 ns | 1375 ns | 1.09 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 124808 ns | 124916 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 8831021 ns | 8928265 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 1625833 ns | 1698771 ns | 0.96 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 128451.5 ns | 125781 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7334 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6125 ns | 1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6166 ns | 6083 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10125 ns | 1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24078 ns | 23941 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1330146 ns | 1217417.5 ns | 1.09 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 438791 ns | 680500 ns | 0.64 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 47170 ns | 47021 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 235000.5 ns | 232875 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 265209 ns | 230291 ns | 1.15 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230083 ns | 241958 ns | 0.95 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 224833 ns | 223792 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 184070 ns | 187237 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 29584191 ns | 33020508 ns | 0.90 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8711646 ns | 8706959 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 612956 ns | 613946 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4125 ns | 4166 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4125 ns | 4167 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4167 ns | 4125 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4125 ns | 4125 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23455 ns | 22984 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 2048672 ns | 2096362 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 218812.5 ns | 221520.5 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 49110 ns | 50180 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16791 ns | 16834 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16833 ns | 17125 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17667 ns | 16958 ns | 1.04 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16916 ns | 16750 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 198509 ns | 195722.5 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 10575947 ns | 10651426 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 909375 ns | 946584 ns | 0.96 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 179231.5 ns | 180062 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 510083 ns | 510750 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 405334 ns | 404750 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 405292 ns | 404250 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 865625 ns | 864833 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113245 ns | 113177 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 396869 ns | 401031 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 454459 ns | 453604.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 248983 ns | 249693 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2307416 ns | 2323208 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2039645.5 ns | 2033333.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2032167 ns | 2026228.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3202437 ns | 3193583.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 242211.5 ns | 240360.5 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 12568672 ns | 9442101 ns | 1.33 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 1984042 ns | 1968896 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 762058 ns | 758862.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7021 ns | 6270.5 ns | 1.12 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6292 ns | 6666.5 ns | 0.94 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6750 ns | 7687.5 ns | 0.88 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6250 ns | 6334 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 93474.5 ns | 93492.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5518295 ns | 5508786.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 803166.5 ns | 864625 ns | 0.93 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 60690 ns | 60751 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11875 ns | 10083 ns | 1.18 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12125 ns | 12167 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12083.5 ns | 12750 ns | 0.95 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10916 ns | 11584 ns | 0.94 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 649108 ns | 662173.5 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 38994231.5 ns | 39475440.5 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5594063 ns | 5972771 ns | 0.94 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 415844.5 ns | 403374 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23495 ns | 23307 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2211478 ns | 2179447 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 227708.5 ns | 328812.5 ns | 0.69 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 54380 ns | 51631 ns | 1.05 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2083 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2208 ns | 2208 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2208 ns | 2083 ns | 1.06 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 ns | 2084 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 242428.5 ns | 233425 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 11060678 ns | 11214097 ns | 0.99 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 2013125 ns | 2043458.5 ns | 0.99 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 190522 ns | 180732 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8375.5 ns | 11541 ns | 0.73 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9666 ns | 11458 ns | 0.84 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11604.5 ns | 12333 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8750 ns | 11000 ns | 0.80 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 112442 ns | 109091 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3290927 ns | 3254189.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 706834 ns | 740875 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 75201 ns | 75381 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 18521 ns | 32812 ns | 0.56 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 19083 ns | 35583 ns | 0.54 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18709 ns | 33167 ns | 0.56 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17854 ns | 33583.5 ns | 0.53 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 642801 ns | 622088 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 17172659 ns | 16800283 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 4542667 ns | 4505292 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 388784 ns | 385519 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 708 ns | 0.82 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 542 ns | 1.15 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 35677 ns | 35978 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1253475 ns | 1178150 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 289812.5 ns | 459291.5 ns | 0.63 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 45940 ns | 46000 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 9542 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10833.5 ns | 12041.5 ns | 0.90 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11062.5 ns | 11334 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10209 ns | 9833 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 262682 ns | 265870.5 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18628091 ns | 18598955 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4787791.5 ns | 5290625 ns | 0.90 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 373353 ns | 374714 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397042 ns | 397625 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288084 ns | 287584 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288042 ns | 287791 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756667 ns | 756375 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111881 ns | 111994 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 330206.5 ns | 329207.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 489542 ns | 468416 ns | 1.05 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 77540 ns | 76521 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1442583 ns | 1448895.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1134209 ns | 1136209 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1133750 ns | 1131396 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2360208 ns | 2356083 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 208137.5 ns | 207006.5 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 11262758 ns | 9481058 ns | 1.19 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1637000.5 ns | 1625208.5 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 323273 ns | 321553 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7021 ns | 7312 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7334 ns | 7541 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8000 ns | 8062.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7333 ns | 7333.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 148568.5 ns | 153081 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5928806 ns | 5776534 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 451125 ns | 470938 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 60270 ns | 61150 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15833.5 ns | 16854.5 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13500.5 ns | 15708.5 ns | 0.86 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15104 ns | 15542 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13541 ns | 12479.5 ns | 1.09 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 993776 ns | 1021250 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 42935151 ns | 41825159 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5915083 ns | 6411500 ns | 0.92 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 436475 ns | 426474 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25000 ns | 28083.5 ns | 0.89 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 26292 ns | 26625 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27770.5 ns | 29208 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 30125 ns | 25208 ns | 1.20 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 227462 ns | 222882.5 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8039495.5 ns | 7852868 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 587166 ns | 621583.5 ns | 0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 117001 ns | 117831 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 154292 ns | 142916.5 ns | 1.08 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 146604 ns | 114958 ns | 1.28 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 153667 ns | 149541.5 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 104208 ns | 151500 ns | 0.69 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1193735 ns | 1187333.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42952167 ns | 44743732 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5910000 ns | 5970583 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 599656 ns | 597955 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 77499.5 ns | 78166.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75083 ns | 76333 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 85042 ns | 77042 ns | 1.10 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74459 ns | 76396 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 234612 ns | 233021 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7732168 ns | 7690691.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 541187 ns | 535250 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 126292 ns | 124351.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 291750 ns | 283125 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 301271 ns | 315833.5 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 291042 ns | 293375 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 294709 ns | 299584 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1211278 ns | 1223336.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42220288 ns | 41588499 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6618416.5 ns | 6484750 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 698427 ns | 698582 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16583.5 ns | 16646 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 16875 ns | 17500 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 18542 ns | 17667 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 16604 ns | 16583 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 165607 ns | 164724.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5776171 ns | 5915268.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 724479 ns | 443666.5 ns | 1.63 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 239272 ns | 237842 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 28625 ns | 26583 ns | 1.08 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27541 ns | 26625 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 29042 ns | 27750 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27334 ns | 29875 ns | 0.91 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 1045694.5 ns | 1040078 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41576377 ns | 42774820 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5989729.5 ns | 6352728.5 ns | 0.94 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 704457 ns | 699307 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 10833 ns | 12687.5 ns | 0.85 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11437.5 ns | 13166.5 ns | 0.87 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13104.5 ns | 13479.5 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11000 ns | 13687.5 ns | 0.80 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 140187 ns | 139794.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3519988 ns | 3546983 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 762167 ns | 908479 ns | 0.84 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 242183 ns | 239582 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21958.5 ns | 37750 ns | 0.58 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 22646 ns | 37500 ns | 0.60 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22375 ns | 36959 ns | 0.61 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 23083 ns | 37125 ns | 0.62 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 756953 ns | 753503 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 22864881 ns | 22091236.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5241812.5 ns | 5265708.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 689067.5 ns | 684491.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 64271 ns | 63479 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 63500 ns | 64042 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 66042 ns | 65666.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 67250.5 ns | 63021 ns | 1.07 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 122269.5 ns | 119325.5 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3529561 ns | 3429402.5 ns | 1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1376896 ns | 1372375 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 239082 ns | 237212.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 487542 ns | 394042 ns | 1.24 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 450354 ns | 390833 ns | 1.15 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 444375 ns | 356104 ns | 1.25 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 438791.5 ns | 339562.5 ns | 1.29 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 554989 ns | 553205 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 21251546 ns | 20941175 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6388250 ns | 5998187.5 ns | 1.07 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 721452.5 ns | 721532.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7833 ns | 7334 ns | 1.07 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7000 ns | 7667 ns | 0.91 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8417 ns | 7541 ns | 1.12 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7417 ns | 7042 ns | 1.05 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 163171 ns | 160853.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5828490 ns | 5800036 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 475875 ns | 431041 ns | 1.10 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 59641 ns | 59401 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15333 ns | 15583.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15437.5 ns | 15229.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14875 ns | 15375 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13416 ns | 14375 ns | 0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1017848.5 ns | 1015035 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 39985725 ns | 38865855 ns | 1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5602875 ns | 5628792 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 407364.5 ns | 407024 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6152250 ns | 6151583 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6373750 ns | 6376417 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6381167 ns | 6376250 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11917750 ns | 11912312.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301774 ns | 302602.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 302353.5 ns | 298133 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19110187 ns | 19016604 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 20002875 ns | 19946812.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19995333 ns | 19945334 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36566937.5 ns | 36494333 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1017432 ns | 1019494 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1158812 ns | 1164692 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 958 ns | 917 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1000 ns | 1000 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 959 ns | 917 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1000 ns | 917 ns | 1.09 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23513 ns | 23519 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2124512 ns | 2109375.5 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 232500 ns | 324041.5 ns | 0.72 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 217342 ns | 214402 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3667 ns | 3709 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3750 ns | 3750 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 ns | 3666 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3667 ns | 3667 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 297403 ns | 298735 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 11619942.5 ns | 11135634 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2171395.5 ns | 2209459 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 641337 ns | 645226 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8166 ns | 8896 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8354 ns | 8917 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9604.5 ns | 10041.5 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8541 ns | 8500 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 136618 ns | 136574 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3587474 ns | 3620427 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 795083.5 ns | 861959 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 67631 ns | 67761 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 12250 ns | 14187.5 ns | 0.86 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 13104.5 ns | 14604 ns | 0.90 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12708 ns | 13666 ns | 0.93 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 12417 ns | 13500 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 718472 ns | 714394 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22799808 ns | 22432282 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5108167 ns | 5400166 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 364604 ns | 362444 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 334 ns | 0.87 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23005 ns | 22401 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2119128 ns | 2057885 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 224458 ns | 322875 ns | 0.70 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 51830 ns | 51141 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2875 ns | 2709 ns | 1.06 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2959 ns | 2958 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3542 ns | 2792 ns | 1.27 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2959 ns | 2792 ns | 1.06 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 213984.5 ns | 211494.5 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9979577.5 ns | 9841082 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1641188 ns | 1694625 ns | 0.97 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 161492 ns | 161162 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11916.5 ns | 14229 ns | 0.84 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11833.5 ns | 14125 ns | 0.84 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12604.5 ns | 15042 ns | 0.84 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11666 ns | 13166.5 ns | 0.89 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 138131.5 ns | 138230.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3454696 ns | 3441989 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 884249.5 ns | 904000 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 238223 ns | 239462 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 22625 ns | 30792 ns | 0.73 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21334 ns | 31666.5 ns | 0.67 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21604.5 ns | 31083 ns | 0.70 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 22542 ns | 31083.5 ns | 0.73 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 649586.5 ns | 650508.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20411511 ns | 22682431.5 ns | 0.90 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4565459 ns | 4794458 ns | 0.95 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 654817 ns | 652296.5 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4375 ns | 4417 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4416 ns | 4417 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4375 ns | 4334 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4416 ns | 4375 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24134 ns | 24503 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2113270 ns | 2175697 ns | 0.97 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 223770.5 ns | 225542 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 52760 ns | 52080 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16583.5 ns | 16709 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16583 ns | 16750 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16500 ns | 16542 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16542 ns | 16500 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 354357.5 ns | 351126 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12701524 ns | 12669321.5 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1070167 ns | 1080541 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 213352.5 ns | 211892 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 1959 ns | 2083 ns | 0.94 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2042 ns | 2042 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2167 ns | 1959 ns | 1.11 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2042 ns | 1958 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 36744 ns | 36935 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1245549.5 ns | 1233787 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 284812.5 ns | 461250 ns | 0.62 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 207412 ns | 206092 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 17000 ns | 17917 ns | 0.95 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 22999.5 ns | 18000 ns | 1.28 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 19792 ns | 17500 ns | 1.13 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 18708 ns | 16709 ns | 1.12 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 303916 ns | 304287 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21298274 ns | 20764724.5 ns | 1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5237542 ns | 5344187.5 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 700467 ns | 699867 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59083 ns | 59916 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 65646 ns | 65292 ns | 1.01 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 65437.5 ns | 64708 ns | 1.01 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 66708 ns | 51250 ns | 1.30 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66566 ns | 66521 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 95451 ns | 96421 ns | 0.99 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 202000 ns | 164541 ns | 1.23 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 165375 ns | 113937.5 ns | 1.45 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 170000 ns | 141875 ns | 1.20 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 312729 ns | 312750 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 231402.5 ns | 229949 ns | 1.01 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 583327 ns | 584276 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82916.5 ns | 111000 ns | 0.75 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 122000 ns | 108542 ns | 1.12 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84875 ns | 110729.5 ns | 0.77 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81479.5 ns | 145625 ns | 0.56 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193530.5 ns | 193372 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5543361 ns | 5531442 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2086125 ns | 2074479 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 171071 ns | 169392 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1886228.5 ns | 1119437.5 ns | 1.68 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1899084 ns | 1129458 ns | 1.68 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1928250 ns | 1170417 ns | 1.65 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1913125 ns | 1244625.5 ns | 1.54 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 571942 ns | 570757 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 27732022 ns | 26295440 ns | 1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9246958 ns | 9147146 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1081981 ns | 1078351 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21729 ns | 21759 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 1940449 ns | 2164251 ns | 0.90 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 337667 ns | 367833 ns | 0.92 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 45551 ns | 45110 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1875 ns | 0.96 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1875 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1834 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 268734 ns | 266995 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 9968011 ns | 9594583 ns | 1.04 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1106271 ns | 1574833 ns | 0.70 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 187482 ns | 189592 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10875 ns | 9625 ns | 1.13 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9895.5 ns | 10292 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11500 ns | 10312.5 ns | 1.12 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 10479 ns | 8750 ns | 1.20 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 135713 ns | 136132.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3680004 ns | 3223474 ns | 1.14 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 895479 ns | 878166 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 240337.5 ns | 239692 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10417 ns | 10104.5 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10145.5 ns | 10625 ns | 0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9708 ns | 10145.5 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10834 ns | 10042 ns | 1.08 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 573950.5 ns | 569140 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19676673.5 ns | 20087444 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4463792 ns | 4786083 ns | 0.93 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 651017 ns | 650531.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57916 ns | 57708 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46625 ns | 46917 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46125 ns | 46667 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84042 ns | 84292 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40444 ns | 39840 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1435250 ns | 1317862 ns | 1.09 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1154417 ns | 1152833 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 75161 ns | 78095.5 ns | 0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1874959 ns | 1897125 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1976312.5 ns | 1927479 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1964208 ns | 1973791.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1881292 ns | 1896417 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 234207 ns | 236470 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31957727 ns | 32730804 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11022416.5 ns | 10909500 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1030711 ns | 1022350.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 419042 ns | 417791.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 419375 ns | 416833.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 420208 ns | 418771 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 420542 ns | 418062 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 236036 ns | 235229 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8282465 ns | 7864869 ns | 1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 542041 ns | 534854.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 288253 ns | 286883 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 783792 ns | 768334 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 778938 ns | 735375 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 747250 ns | 762250.5 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 782750 ns | 674042 ns | 1.16 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1135371 ns | 1132260.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45251960.5 ns | 47588718 ns | 0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6497188 ns | 6440750 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 921490 ns | 918239 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3450666.5 ns | 3464500 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3366250 ns | 3441791 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3434042 ns | 3444833 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3445521 ns | 3457959 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 179244 ns | 175447 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8695448.5 ns | 8231556 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1415312.5 ns | 1413084 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 410215 ns | 431194 ns | 0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6122020.5 ns | 6209167 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6209458 ns | 6208292 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6206021 ns | 6192917 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6177041.5 ns | 6202041.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1069713 ns | 1069795 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51379719 ns | 50322190 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7520979 ns | 7331875 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1567331 ns | 1565490.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 471687.5 ns | 471791 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 341750 ns | 342062.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 342334 ns | 342250 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 904375 ns | 901625 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46351 ns | 46247 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 859225 ns | 843184 ns | 1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 543833 ns | 516354 ns | 1.05 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 251242.5 ns | 249962 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2329250 ns | 2340875 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2036125 ns | 2041666.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2037791 ns | 2038042 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3207625 ns | 3197458 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 258561 ns | 253522 ns | 1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 14028982 ns | 12874338 ns | 1.09 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2234208 ns | 2210458 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 790883 ns | 788063 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57833 ns | 57292 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 45833 ns | 46500 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 45875 ns | 46417 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83792 ns | 84125 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28439.5 ns | 28420 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1363713 ns | 1377166 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1172375 ns | 1162021 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 76791 ns | 76810 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2024708 ns | 2044292 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1952958 ns | 2076458.5 ns | 0.94 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2099229 ns | 2089958 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1979459 ns | 1998583.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 239189 ns | 240601 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 36882348 ns | 35733583 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11484625 ns | 11180416 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1049721 ns | 1043950 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57833 ns | 58125 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46208 ns | 46583 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46292 ns | 46542 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84250 ns | 84000 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 50347 ns | 49930 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 808260.5 ns | 777391 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1121187.5 ns | 1106417 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 74401 ns | 72511 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924917 ns | 1889916 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1970250 ns | 1973000 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1977791.5 ns | 1972479 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1857458 ns | 1872625 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 245621.5 ns | 246746 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 17757586 ns | 18203740 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9925583.5 ns | 9664041 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 923799.5 ns | 928509 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 459 ns | 292 ns | 1.57 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 34804 ns | 35428 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1218893 ns | 1198883.5 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 397375 ns | 429645.5 ns | 0.92 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 48010 ns | 48131 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 6917 ns | 1.08 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7416 ns | 7791.5 ns | 0.95 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8520.5 ns | 7208.5 ns | 1.18 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7708 ns | 7209 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 215607.5 ns | 211219 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21091113 ns | 21407176 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5195958 ns | 5360666 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 372224 ns | 376479 ns | 0.99 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
291 ns |
291 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
291 ns |
292 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
250 ns |
1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
250 ns |
1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
32770 ns |
32448 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI |
1215240.5 ns |
1275807 ns |
0.95 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal |
256917 ns |
259000 ns |
0.99 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU |
42270 ns |
39671 ns |
1.07 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
3000 ns |
2625 ns |
1.14 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
2875 ns |
2875 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
3167 ns |
2667 ns |
1.19 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
3209 ns |
2625 ns |
1.22 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
201962.5 ns |
200262.5 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI |
8332092.5 ns |
7622969.5 ns |
1.09 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal |
1059833 ns |
979750 ns |
1.08 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
160791.5 ns |
155666.5 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
455375 ns |
449792 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
459396 ns |
478708 ns |
0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
457834 ns |
452229 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
424000 ns |
474583.5 ns |
0.89 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
141718 ns |
140930 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6004284 ns |
6179425.5 ns |
0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2282458 ns |
2484187.5 ns |
0.92 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
354068.5 ns |
361833 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3807666 ns |
3176354 ns |
1.20 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3805312 ns |
3261062.5 ns |
1.17 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3799042 ns |
3262333.5 ns |
1.16 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3816667 ns |
3221209 ns |
1.18 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
775227 ns |
771390 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
34318701 ns |
34655170.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11702250 ns |
11378833 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1497905 ns |
1317413 ns |
1.14 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
49826958 ns |
49865792 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
35532875 ns |
35514646 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
35542209 ns |
35524708 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
97217916.5 ns |
97122291.5 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1598964 ns |
1600652 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU |
1015720 ns |
1009800 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
154448729 ns |
154457333 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
112325083.5 ns |
112547249.5 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
112474459 ns |
112457583 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
295795583.5 ns |
295326229 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6481568 ns |
6539911 ns |
0.99 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU |
5536597 ns |
5525076 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
19167 ns |
19208 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
18666.5 ns |
19021 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
17000 ns |
16583 ns |
1.03 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
15459 ns |
14958 ns |
1.03 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
20686 ns |
20510 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI |
1139455 ns |
1178878.5 ns |
0.97 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal |
223292 ns |
225334 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU |
27480 ns |
25670 ns |
1.07 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
10771 ns |
10959 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
9125 ns |
9292 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
9375 ns |
9250 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
17291 ns |
17229 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
295440.5 ns |
293160 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI |
10592984.5 ns |
10236863.5 ns |
1.03 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal |
1611333 ns |
1622291 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU |
155402 ns |
153751 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
9708 ns |
9167 ns |
1.06 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8771 ns |
9417 ns |
0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
10167 ns |
10833 ns |
0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
8541.5 ns |
8833 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
141417 ns |
138920 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3531192 ns |
3603346 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
816542 ns |
839250 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
240452 ns |
238792 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9042 ns |
11875 ns |
0.76 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9792 ns |
12292 ns |
0.80 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10041 ns |
11854.5 ns |
0.85 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9750 ns |
11917 ns |
0.82 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
699140.5 ns |
698758 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
23822389 ns |
24631890 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
4852938 ns |
5382334 ns |
0.90 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
674987 ns |
668936 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
10542 ns |
12458 ns |
0.85 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
9375 ns |
13042 ns |
0.72 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
10687.5 ns |
13458 ns |
0.79 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
9375 ns |
12125 ns |
0.77 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
138047 ns |
137317.5 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3390144 ns |
3499319.5 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
909771 ns |
896354.5 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
71791 ns |
69695.5 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13042 ns |
24541.5 ns |
0.53 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
13709 ns |
25229 ns |
0.54 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13979 ns |
24500 ns |
0.57 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
13875.5 ns |
24042 ns |
0.58 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
648061 ns |
646856 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
20050250 ns |
23295762.5 ns |
0.86 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
4504354 ns |
4815416 ns |
0.94 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
356714 ns |
351994 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
459 ns |
584 ns |
0.79 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
583 ns |
583 ns |
1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
583 ns |
500 ns |
1.17 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
584 ns |
459 ns |
1.27 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
35556 ns |
36062 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1229644.5 ns |
1169666 ns |
1.05 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
286417 ns |
452041 ns |
0.63 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
208252 ns |
206923 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8000 ns |
8792 ns |
0.91 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8458 ns |
8875 ns |
0.95 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8916 ns |
8250 ns |
1.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8417 ns |
8375 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
235548 ns |
235191.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
21701037 ns |
22413504.5 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
5360334 ns |
5654041 ns |
0.95 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
672887 ns |
673696.5 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
16979 ns |
17583 ns |
0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
16416.5 ns |
16083 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
14520.5 ns |
14333 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
10458 ns |
11333 ns |
0.92 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
21463 ns |
21821 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1174474 ns |
1137736 ns |
1.03 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal |
211958 ns |
211417 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
189952 ns |
188972 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
31958 ns |
32104.5 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
32167 ns |
31729.5 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
32208 ns |
32167 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
32208 ns |
32125 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
313430 ns |
310996.5 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11266559 ns |
11938483 ns |
0.94 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal |
1726583 ns |
1824500 ns |
0.95 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
609127 ns |
604126 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
477020.5 ns |
486833.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
445875.5 ns |
508541 ns |
0.88 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
482750 ns |
486625 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
475395.5 ns |
454125 ns |
1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
195221 ns |
194584 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6242919.5 ns |
5926149 ns |
1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2184791 ns |
2088042 ns |
1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
352983 ns |
352254 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3824937 ns |
3061708 ns |
1.25 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3830041.5 ns |
3216520.5 ns |
1.19 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3824333 ns |
3216333 ns |
1.19 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3826021 ns |
3168521 ns |
1.21 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
581313.5 ns |
577264 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
29006462 ns |
29149705 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10014250 ns |
9970042 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1393034.5 ns |
1384533.5 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
787158958 ns |
784523458 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
542549542 ns |
544902000 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
543228083 ns |
543012875 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
1564693291.5 ns |
1509420062.5 ns |
1.04 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22779955 ns |
22536899 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU |
14035957 ns |
14041441 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
3021582792 ns |
2510925417 ns |
1.20 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
3018211250 ns |
1803839541 ns |
1.67 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
2290281375 ns |
1795547625 ns |
1.28 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
4758106584 ns |
4752835000 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
367725794 ns |
307750578 ns |
1.19 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU |
88501294 ns |
88099937 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
76083 ns |
77917 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
79333.5 ns |
77875 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
79333 ns |
79042 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
83042 ns |
75791 ns |
1.10 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
235362 ns |
233766.5 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
8035484 ns |
8078290 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
539395.5 ns |
530146 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
110001 ns |
109531 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
243833 ns |
192667 ns |
1.27 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
277208 ns |
291792 ns |
0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
208708 ns |
273687.5 ns |
0.76 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
248583 ns |
256646 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1123869 ns |
1119007 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
62288144 ns |
42599415 ns |
1.46 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6396917 ns |
6376104 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
644607 ns |
642377 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
199874958.5 ns |
199939334 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
138581125 ns |
139299000 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
139276834 ns |
139327708 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
389347000 ns |
388944583 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5835227 ns |
5821793 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU |
3447300.5 ns |
3425144 ns |
1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
619507958.5 ns |
617991979 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
440048792 ns |
440381666 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
441312104 ns |
440103937.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
1181437458 ns |
1183443041 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
26630511 ns |
26658646.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU |
21849307 ns |
21776749 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7166 ns |
7375 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6000 ns |
6333 ns |
0.95 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6000 ns |
6042 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10084 ns |
9959 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
27869 ns |
27817 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1499999.5 ns |
1182446 ns |
1.27 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
607271 ns |
639250 ns |
0.95 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
47951 ns |
47461 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
220187.5 ns |
226021 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
228250 ns |
221917 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
223562.5 ns |
221000 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
207875 ns |
207292 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
238148 ns |
239313.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
47009123 ns |
32659665 ns |
1.44 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8821687.5 ns |
8880750 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
541335 ns |
531505 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
10334 ns |
10000 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
9270.5 ns |
10834 ns |
0.86 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10042 ns |
10500 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
8791 ns |
9041 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
134453 ns |
134348 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
4752172 ns |
3385012 ns |
1.40 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
902562.5 ns |
900583 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
73011 ns |
75651 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7541 ns |
9542 ns |
0.79 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7917 ns |
9625 ns |
0.82 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8125 ns |
9000 ns |
0.90 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8292 ns |
9083.5 ns |
0.91 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
573820 ns |
572551.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
23140180.5 ns |
19627467.5 ns |
1.18 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
4653750 ns |
4779292 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
337993.5 ns |
324513.5 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
458 ns |
500 ns |
0.92 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
500 ns |
500 ns |
1 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
583 ns |
500 ns |
1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
583 ns |
500 ns |
1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
26568 ns |
26716 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
1525380 ns |
1199625 ns |
1.27 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
453000 ns |
463625 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
48711 ns |
50855.5 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9729 ns |
10625 ns |
0.92 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
10209 ns |
10687.5 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
10937.5 ns |
10375 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9584 ns |
10479.5 ns |
0.91 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
270217 ns |
270531 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
35388163 ns |
23743488 ns |
1.49 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5811500 ns |
5988708 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
393654 ns |
394399 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
106604 ns |
106500 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
99562.5 ns |
99666 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
100458 ns |
99312.5 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
146417 ns |
146625 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
24179 ns |
24426 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI |
1634225.5 ns |
1181882 ns |
1.38 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal |
274833 ns |
266166 ns |
1.03 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
190671.5 ns |
190032 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
504916.5 ns |
502958 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
503833 ns |
482375 ns |
1.04 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
478583 ns |
504042 ns |
0.95 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
501833.5 ns |
514333.5 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
252696 ns |
251300 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
14907856 ns |
11770034 ns |
1.27 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal |
2190209 ns |
2216125.5 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
624282 ns |
620221 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
5125 ns |
6354 ns |
0.81 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
7041.5 ns |
6666 ns |
1.06 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
7292 ns |
6167 ns |
1.18 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
4042 ns |
4292 ns |
0.94 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
16750 ns |
17553 ns |
0.95 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU |
73431 ns |
73251 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
11396 ns |
11562.5 ns |
0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
10916.5 ns |
11125 ns |
0.98 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
11000 ns |
10792 ns |
1.02 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
16875 ns |
16417 ns |
1.03 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
232583 ns |
231268 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU |
373234 ns |
372348.5 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
39166.5 ns |
39396 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
46166.5 ns |
50709 ns |
0.91 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
52375 ns |
51500 ns |
1.02 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
13542 ns |
13583 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
20275 ns |
20068 ns |
1.01 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU |
81410.5 ns |
79491 ns |
1.02 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
37145.5 ns |
36312.5 ns |
1.02 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
30937.5 ns |
31833 ns |
0.97 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
32000 ns |
30709 ns |
1.04 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
63000 ns |
57166.5 ns |
1.10 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
209011 ns |
207413 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU |
402554 ns |
420690 ns |
0.96 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
1667 ns |
1875 ns |
0.89 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
1916 ns |
1875 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
2125 ns |
2000 ns |
1.06 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
1916 ns |
1708 ns |
1.12 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
20146 ns |
20504.5 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI |
1486576.5 ns |
1119812.5 ns |
1.33 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal |
319520.5 ns |
322917 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
29000 ns |
28660 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
2084 ns |
2250 ns |
0.93 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
2167 ns |
2167 ns |
1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
2375 ns |
2167 ns |
1.10 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
2375 ns |
2084 ns |
1.14 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
222775.5 ns |
220155.5 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI |
12262903.5 ns |
9008904 ns |
1.36 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal |
1557792 ns |
1683062 ns |
0.93 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
139472 ns |
138171 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6417 ns |
4937.5 ns |
1.30 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4750 ns |
5417 ns |
0.88 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
5584 ns |
5875 ns |
0.95 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4500 ns |
4104.5 ns |
1.10 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
161768 ns |
160460 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
5903240.5 ns |
5609681 ns |
1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
448729 ns |
442896 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
64810 ns |
61631 ns |
1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8334 ns |
8250 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8458 ns |
8500 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8625 ns |
8375 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8459 ns |
8208 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
951965.5 ns |
947221 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
38542963.5 ns |
38843143 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
5697834 ns |
6350958 ns |
0.90 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
391494 ns |
388709 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56708 ns |
57375 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
57459 ns |
58250 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
57792 ns |
58000 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
58209 ns |
58250 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
38079 ns |
38316.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1233354 ns |
1212852.5 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
391875 ns |
617562.5 ns |
0.63 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
206592 ns |
206812 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
449437 ns |
477416 ns |
0.94 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
476791 ns |
507791.5 ns |
0.94 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
466875 ns |
472666.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
461875 ns |
444792 ns |
1.04 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
274183.5 ns |
276598 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
28394052 ns |
26668340 ns |
1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8043958 ns |
7961291 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
802558 ns |
797578 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3312667 ns | 3332792 ns | 0.99 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 2336249.5 ns | 2334417 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2342875 ns | 2333542 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6347375 ns | 6313375 ns | 1.01 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 204537 ns | 205221 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 212302 ns | 202667 ns | 1.05 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11415541.5 ns | 11461208 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 8328417 ns | 8317917 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8327584 ns | 8320916 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21174250 ns | 21185000 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 735759 ns | 735643 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1061331 ns | 1062746 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4958 ns | 5000 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4875 ns | 5396 ns | 0.90 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6666 ns | 6291.5 ns | 1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4750 ns | 4645.5 ns | 1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 154138 ns | 153484 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5561590 ns | 5666549 ns | 0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 804458 ns | 812625.5 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56461 ns | 56320 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7250 ns | 7250 ns | 1 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7292 ns | 7270.5 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7334 ns | 7500 ns | 0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7541 ns | 7208.5 ns | 1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 807821.5 ns | 801828 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 35003795 ns | 35498019 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5490583.5 ns | 5867167 ns | 0.94 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 378974 ns | 378564 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 91396.5 ns | 131459 ns | 0.70 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 102583 ns | 142167 ns | 0.72 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 127792 ns | 142146 ns | 0.90 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 97645.5 ns | 164292 ns | 0.59 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 157957 ns | 149293 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5977195 ns | 5868037 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2236250 ns | 2917812.5 ns | 0.77 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 186092 ns | 186522 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2024312 ns | 1341416.5 ns | 1.51 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1994291 ns | 1357249.5 ns | 1.47 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2024250 ns | 1356499.5 ns | 1.49 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2017625 ns | 1449166.5 ns | 1.39 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 770062 ns | 771014 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32930857 ns | 32793983.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11041979.5 ns | 11204625 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1258193 ns | 1252157 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 34583.5 ns | 34437 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 36312.5 ns | 35083 ns | 1.04 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 35209 ns | 35333 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15885 ns | 15590 ns | 1.02 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 72280 ns | 71891 ns | 1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2604.5 ns | 2500 ns | 1.04 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 3000 ns | 2959 ns | 1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3000 ns | 2792 ns | 1.07 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2333 ns | 2145.5 ns | 1.09 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 149559 ns | 148484 ns | 1.01 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 352028.5 ns | 350923.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7250 ns | 1 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 6083 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6084 ns | 5958 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10250 ns | 10000 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37142 ns | 37340 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1251206 ns | 1174226.5 ns | 1.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 616812.5 ns | 360000 ns | 1.71 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48800 ns | 48850 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215687.5 ns | 214187.5 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228792 ns | 231708.5 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222646 ns | 228771 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 207791.5 ns | 206750 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 251808 ns | 256068 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26763071 ns | 26779708 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7803770.5 ns | 7835125.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 518295 ns | 515070 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3959 ns | 3917 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3917 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3959 ns | 3916 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21823 ns | 22329 ns | 0.98 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2050078 ns | 2087967 ns | 0.98 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 247063 ns | 247041 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 47650 ns | 47341 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14917 ns | 15000 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14958 ns | 15000 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14958 ns | 14875 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14750 ns | 14791 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 338791 ns | 341001 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 11385289.5 ns | 11479170 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1006250 ns | 1027583 ns | 0.98 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 204762 ns | 206562 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 102208.5 ns | 109209 ns | 0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 104917 ns | 138417 ns | 0.76 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 129750 ns | 111250 ns | 1.17 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 104021 ns | 147334 ns | 0.71 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 159126.5 ns | 160787 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5714397.5 ns | 5651486 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2300354 ns | 2194562.5 ns | 1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 185907 ns | 184382 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1922334 ns | 1235958 ns | 1.56 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1913292 ns | 1245041.5 ns | 1.54 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1926521 ns | 1239895.5 ns | 1.55 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1902937.5 ns | 1331583 ns | 1.43 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 764317.5 ns | 762378 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 29980716 ns | 31522616.5 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10895437.5 ns | 10791938 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1074156 ns | 1076411 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19042 ns | 21604 ns | 0.88 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19167 ns | 21479.5 ns | 0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20250 ns | 23312.5 ns | 0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18354.5 ns | 20521 ns | 0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 125836 ns | 125219.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3384577.5 ns | 3275413 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1379375 ns | 1406750 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81551 ns | 81081 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 218750 ns | 131166.5 ns | 1.67 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 226250 ns | 141083.5 ns | 1.60 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 217229 ns | 160916 ns | 1.35 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 229334 ns | 123166.5 ns | 1.86 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 571424 ns | 566379.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20023693.5 ns | 19392377 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6043770.5 ns | 6064791 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 480905 ns | 477160 ns | 1.01 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 24937 ns | 23541 ns | 1.06 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 31958 ns | 31042 ns | 1.03 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 28250 ns | 29374.5 ns | 0.96 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1375 ns | 1520.5 ns | 0.90 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16933 ns | 16877.5 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 83400 ns | 83491 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 5062.5 ns | 4396 ns | 1.15 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5229 ns | 5458 ns | 0.96 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 4958 ns | 5104.5 ns | 0.97 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4292 ns | 4958 ns | 0.87 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 229400 ns | 228917.5 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 383374 ns | 372328 ns | 1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 305459 ns | 306604.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 305916.5 ns | 306333 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 307625 ns | 307917 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 306333 ns | 304979 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 257756.5 ns | 258687 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7522370.5 ns | 7844322 ns | 0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 555271 ns | 573771 ns | 0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 279323 ns | 277933 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 581709 ns | 542250 ns | 1.07 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 540500 ns | 543479 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 595667 ns | 585709 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 591291.5 ns | 538604 ns | 1.10 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1191373.5 ns | 1187647 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45474161 ns | 43445666 ns | 1.05 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6084625 ns | 6112625 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 873299 ns | 870608 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19166.5 ns | 21125 ns | 0.91 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20750 ns | 21749.5 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21459 ns | 21812.5 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21916 ns | 18020.5 ns | 1.22 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 132926.5 ns | 132203 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3790040 ns | 3739210.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1493395.5 ns | 1504000 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 77091 ns | 78030.5 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219437.5 ns | 167333 ns | 1.31 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219583.5 ns | 134500 ns | 1.63 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220812.5 ns | 159042 ns | 1.39 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214062.5 ns | 124209 ns | 1.72 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 872731 ns | 880193 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25269894 ns | 25393029 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7298812.5 ns | 7251645.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 548566 ns | 543615 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6354.5 ns | 6520.5 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7125 ns | 6896 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7584 ns | 7500 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6791 ns | 6084 ns | 1.12 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 156211.5 ns | 156999 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5636934.5 ns | 5793375.5 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 787166 ns | 870687 ns | 0.90 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 69301 ns | 68881 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10104.5 ns | 9583.5 ns | 1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10417 ns | 10687.5 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10437.5 ns | 10958 ns | 0.95 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9979.5 ns | 9604 ns | 1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 883875 ns | 885668 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 37160740 ns | 37904917 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5369959 ns | 5829833 ns | 0.92 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 397264 ns | 393169 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5041 ns | 1.17 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6333 ns | 5520.5 ns | 1.15 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6708.5 ns | 6479.5 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5125 ns | 6729 ns | 0.76 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 159436 ns | 159186.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5501137.5 ns | 5729090.5 ns | 0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 793167 ns | 876750 ns | 0.90 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 60120 ns | 60940 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7479.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7584 ns | 7875 ns | 0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7792 ns | 7625 ns | 1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7604.5 ns | 7292 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 829204 ns | 832069 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 37986219 ns | 38659149 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5619458.5 ns | 6332125 ns | 0.89 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 401834 ns | 393804 ns | 1.02 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14497458 ns | 14495625 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10132666 ns | 10140833 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10124562.5 ns | 10106521 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27906167 ns | 27875104.5 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 530686 ns | 529047 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 398124 ns | 386994 ns | 1.03 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46170625 ns | 46373458 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33506291.5 ns | 33487354.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33427292 ns | 33494417 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85895667 ns | 85793667 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2656682 ns | 2653722 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3278943 ns | 3282222 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 67166.5 ns | 67187.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 66292 ns | 66875 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 67958.5 ns | 67708 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 67104 ns | 64833 ns | 1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 133780.5 ns | 135830.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3478485 ns | 3619962.5 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1540000 ns | 1520104 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 228942 ns | 228172.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 440145.5 ns | 364542 ns | 1.21 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 483083 ns | 375833 ns | 1.29 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 453417 ns | 409937.5 ns | 1.11 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 466041.5 ns | 354250 ns | 1.32 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 792134 ns | 800863 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25961291 ns | 26204351 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7763375 ns | 7686729.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 805429 ns | 805248 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 625 ns | 0.87 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 583 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 542 ns | 1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 542 ns | 1.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32596 ns | 33339 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1178603 ns | 1169338.5 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 288000 ns | 468292 ns | 0.62 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 49240 ns | 49541 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9333.5 ns | 9417 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10292 ns | 10042 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11042 ns | 10334 ns | 1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9750 ns | 9791.5 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 297797 ns | 301951 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 24618078 ns | 22325214.5 ns | 1.10 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5354667 ns | 5516417 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 388694 ns | 390349 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9833 ns | 9875 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9875 ns | 9875 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9792 ns | 9792 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9792 ns | 9750 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23426 ns | 23482 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2048844 ns | 2052000 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 224416 ns | 223708 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 216872 ns | 216553 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 46208 ns | 46125 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 46416 ns | 46750 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46209 ns | 46209 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 46083 ns | 45709 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 308866 ns | 309105 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11831397 ns | 11209966.5 ns | 1.06 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 968625 ns | 958416.5 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 622956.5 ns | 624786 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56166 ns | 56500 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57167 ns | 57333 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57083 ns | 57041 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57917 ns | 57875 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 29168 ns | 29824 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1167578 ns | 1164396.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 411062.5 ns | 641250 ns | 0.64 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 206532 ns | 204527 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 452416 ns | 454458 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 509583.5 ns | 478167 ns | 1.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 476416 ns | 478521 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 444959 ns | 446521 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 254124 ns | 259076 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31789996 ns | 33336521.5 ns | 0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9426125 ns | 9171667 ns | 1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 849939 ns | 842558 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 640625 ns | 637625 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 619541.5 ns | 644625 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 630458 ns | 639750 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 662667 ns | 661334 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 224984.5 ns | 225397 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7955006 ns | 8285194.5 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1362896 ns | 1367250 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 251843 ns | 241067 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2230750 ns | 2232333 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2232854.5 ns | 2248792 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2228750 ns | 2225834 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2273000 ns | 2254541.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1043963.5 ns | 1060393 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 47501673.5 ns | 47792349 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8542916 ns | 9665042 ns | 0.88 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1385645 ns | 1382969 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20937.5 ns | 23229 ns | 0.90 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20333 ns | 22666.5 ns | 0.90 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21604.5 ns | 24875 ns | 0.87 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19770.5 ns | 22250 ns | 0.89 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 127036.5 ns | 127723 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3566947 ns | 3567777 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1521478.5 ns | 1513396 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76451 ns | 75991 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 242292 ns | 169709 ns | 1.43 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 258208.5 ns | 136646.5 ns | 1.89 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 233313 ns | 153833.5 ns | 1.52 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220271 ns | 139917 ns | 1.57 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 826844.5 ns | 837422 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26518846 ns | 25105631 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7525209 ns | 7698542 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 565015.5 ns | 564226 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 583 ns | 0.86 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 583 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 500 ns | 1.17 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 542 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23432 ns | 23760 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1184152 ns | 1214816 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 488395.5 ns | 475542 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 53071 ns | 50011 ns | 1.06 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10771 ns | 11125 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10854.5 ns | 11000 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 11187.5 ns | 11292 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10541.5 ns | 10125 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 275712 ns | 279802.5 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24826970.5 ns | 25720935 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6139916.5 ns | 6173209 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 406834 ns | 412239 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8083 ns | 9313 ns | 0.87 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9625 ns | 10187.5 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10625 ns | 11042 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8791.5 ns | 9396 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 135624.5 ns | 137005 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3326426 ns | 3441963 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 887166.5 ns | 884667 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 68240.5 ns | 68471 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7292 ns | 9542 ns | 0.76 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8125 ns | 9500 ns | 0.86 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8000 ns | 9083.5 ns | 0.88 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8000 ns | 8834 ns | 0.91 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 551032.5 ns | 555464 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17379523 ns | 17569095 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4314688 ns | 4514000 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 336788 ns | 331008 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1333 ns | 1479.5 ns | 0.90 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1708.5 ns | 1792 ns | 0.95 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1958 ns | 1792 ns | 1.09 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1646 ns | 1500 ns | 1.10 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 20952 ns | 21494 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1135875.5 ns | 1185348 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 306459 ns | 314000 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 192772 ns | 191562 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3209 ns | 3334 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3437.5 ns | 3333 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3541 ns | 3250 ns | 1.09 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3416.5 ns | 3208 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 239311.5 ns | 241554 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10237973 ns | 10425050 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1678250 ns | 1821521 ns | 0.92 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 596836 ns | 592756 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 149416 ns | 149166 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 129833 ns | 130208.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 129499.5 ns | 128645.5 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225958.5 ns | 225834 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 24017 ns | 24389 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1143847 ns | 1201989 ns | 0.95 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 301229 ns | 302583 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 36430 ns | 36510 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 143250 ns | 143708 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 123813 ns | 111020.5 ns | 1.12 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 127875 ns | 134437.5 ns | 0.95 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 255500 ns | 262958 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 238432 ns | 239921 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10471458 ns | 10686272 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2041416 ns | 2070938 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 221102 ns | 224742 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7250 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 6083 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6041 ns | 6083 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10167 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33217 ns | 33902 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1230067 ns | 1196982 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 352792 ns | 714417 ns | 0.49 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50240 ns | 50290 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220625 ns | 263208 ns | 0.84 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229604 ns | 231250 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 239229.5 ns | 241146 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214270.5 ns | 213791 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 268350 ns | 274134 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 28209273 ns | 27340584 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8153604 ns | 7954750 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 529575 ns | 527115 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15437.5 ns | 14979 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15208 ns | 15020.5 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 15875 ns | 16209 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15166.5 ns | 14812.5 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 156499 ns | 156749.5 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5474853 ns | 5690134 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 805708 ns | 864916 ns | 0.93 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 239243 ns | 238262 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23812.5 ns | 22625 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 22875 ns | 24458.5 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24041 ns | 24354.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23312 ns | 23250 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 922540.5 ns | 928671 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39041185.5 ns | 41299107 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5759833 ns | 5736208.5 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 693637 ns | 690377 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9562.5 ns | 10917 ns | 0.88 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9625 ns | 11854.5 ns | 0.81 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10709 ns | 12167 ns | 0.88 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10333 ns | 11666.5 ns | 0.89 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 140292 ns | 141298.5 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3529937.5 ns | 3563617 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 797604 ns | 843146 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 71605.5 ns | 70831 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13459 ns | 29354 ns | 0.46 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14353.5 ns | 30000 ns | 0.48 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14375 ns | 29083 ns | 0.49 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14334 ns | 28771 ns | 0.50 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 760819 ns | 768813 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 20915172 ns | 20417695 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5000417 ns | 5157104.5 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 373884 ns | 372409 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8812.5 ns | 13333 ns | 0.66 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9625 ns | 12791.5 ns | 0.75 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10042 ns | 13208 ns | 0.76 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10583.5 ns | 12146 ns | 0.87 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 138903.5 ns | 139709 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3466863.5 ns | 3557647.5 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 883500 ns | 909917 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 70570 ns | 73631 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12563 ns | 23750 ns | 0.53 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13083.5 ns | 23833 ns | 0.55 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13124.5 ns | 24208 ns | 0.54 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12729.5 ns | 23458 ns | 0.54 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 622007 ns | 626538 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19686555 ns | 19300609 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4122375 ns | 4583187 ns | 0.90 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 346024 ns | 344169 ns | 1.01 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 31584 ns | 31375 ns | 1.01 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 35208 ns | 33396 ns | 1.05 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 31209 ns | 31208 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1917 ns | 2021 ns | 0.95 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16552 ns | 16930 ns | 0.98 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 74491 ns | 74830 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5208 ns | 5125 ns | 1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5500 ns | 5354.5 ns | 1.03 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5333.5 ns | 5166 ns | 1.03 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6541.5 ns | 6375 ns | 1.03 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 149130 ns | 150124.5 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 374329 ns | 370864 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 375 ns | 0.67 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 291 ns | 1.29 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 25816 ns | 26814 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1207741.5 ns | 1225504 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 293417 ns | 454292 ns | 0.65 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48471 ns | 48391 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7083.5 ns | 7500 ns | 0.94 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7750 ns | 7833 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7875 ns | 7250 ns | 1.09 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7667 ns | 7208.5 ns | 1.06 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 197623 ns | 201565.5 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22789021.5 ns | 23235084.5 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5867521 ns | 5935021.5 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 392354 ns | 390884 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 1959 ns | 2083 ns | 0.94 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2042 ns | 2042 ns | 1 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2041 ns | 1958 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2041 ns | 2000 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 27291 ns | 28025 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1215403 ns | 1226218.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 313396 ns | 471291 ns | 0.66 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 209502 ns | 207862 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17333.5 ns | 17500 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17749.5 ns | 17854 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17916 ns | 17958 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 18000 ns | 17083 ns | 1.05 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 282695.5 ns | 285936.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25744566.5 ns | 25228181.5 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5995917 ns | 6216792 ns | 0.96 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 717467 ns | 714727 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 192292 ns | 148333 ns | 1.30 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 150833 ns | 176416.5 ns | 0.85 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 175500 ns | 152625 ns | 1.15 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 148042 ns | 170125 ns | 0.87 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 218870 ns | 222295.5 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8000798 ns | 7715073.5 ns | 1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1455667 ns | 1456521 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 210472 ns | 210297 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1322708.5 ns | 1320395.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1316291.5 ns | 1319125 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1325604 ns | 1317292 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1338875 ns | 1294854 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 979638 ns | 992319 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 45361365.5 ns | 44464195.5 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6376521 ns | 6790917 ns | 0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1017855.5 ns | 1121586.5 ns | 0.91 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 23854 ns | 26396 ns | 0.90 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 27083.5 ns | 25667 ns | 1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28459 ns | 26916.5 ns | 1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25563 ns | 25375 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 264500.5 ns | 268095.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7809214 ns | 8204793 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 895084 ns | 733937.5 ns | 1.22 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 118461 ns | 119821 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 119708 ns | 173375 ns | 0.69 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 177541 ns | 155792 ns | 1.14 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 143167 ns | 131333 ns | 1.09 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 168708 ns | 178833 ns | 0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1198851 ns | 1210208.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45600571 ns | 43840934 ns | 1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6093687.5 ns | 6272209 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 605316.5 ns | 603516 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 375 ns | 0.67 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 375 ns | 0.89 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 292 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23314 ns | 23508 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1228429.5 ns | 1238248 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 394000 ns | 445167 ns | 0.89 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 51920 ns | 48931 ns | 1.06 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7209 ns | 7875 ns | 0.92 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7667 ns | 7709 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8167 ns | 7500 ns | 1.09 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7875 ns | 7542 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 203859 ns | 207810 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25563216.5 ns | 24656260.5 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5757500 ns | 6158292 ns | 0.93 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 399854 ns | 393874 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6646 ns | 5875 ns | 1.13 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6916 ns | 0.89 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6937.5 ns | 7125 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5625 ns | 5499.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 166454.5 ns | 167316.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5809274 ns | 5777197 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 708770.5 ns | 447625 ns | 1.58 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 238783 ns | 237783 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10167 ns | 10166.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9959 ns | 10166 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10041 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9833 ns | 9666 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 967413.5 ns | 970034.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 41785539 ns | 42782054 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5925645.5 ns | 6418791.5 ns | 0.92 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 679277 ns | 681416.5 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 667 ns | 666 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 667 ns | 667 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 666 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 625 ns | 1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22759 ns | 22878 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2139858 ns | 2049585 ns | 1.04 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 226937.5 ns | 324604 ns | 0.70 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 216062 ns | 215392 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4625 ns | 4583 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4875 ns | 4833 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4791 ns | 4584 ns | 1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4666 ns | 4583 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 236640.5 ns | 238765.5 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9553141 ns | 9762790 ns | 0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1652125 ns | 1655625 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 600776 ns | 596696 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 9167 ns | 8937.5 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8500 ns | 8875 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9520.5 ns | 10958 ns | 0.87 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8521 ns | 8667 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 137766 ns | 139144 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3877362 ns | 3594360 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 786312.5 ns | 862042 ns | 0.91 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 69650.5 ns | 71351 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8125 ns | 11625 ns | 0.70 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9062.5 ns | 11833 ns | 0.77 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8708 ns | 11250 ns | 0.77 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8875 ns | 11062.5 ns | 0.80 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 669732 ns | 677180 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 21485053 ns | 20723580 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 4919771 ns | 5232500 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 353474 ns | 352763 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 126458 ns | 127209 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 129000 ns | 129333.5 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 129520.5 ns | 128479.5 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 181166.5 ns | 183250 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46508 ns | 46780 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 93291 ns | 93411 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 302834 ns | 339917 ns | 0.89 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 343749.5 ns | 329041 ns | 1.04 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 334792 ns | 341708 ns | 0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 592625 ns | 609125 ns | 0.97 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 206983 ns | 208062.5 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 491556 ns | 486820 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 398125 ns | 397604.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288125 ns | 288333 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288125 ns | 288250 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 757209 ns | 756875 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43574.5 ns | 43764 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1416178 ns | 1410294 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 500459 ns | 493709 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 83591 ns | 83675.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1457375 ns | 1466291.5 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1126667 ns | 1136791 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1132875 ns | 1135375 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2362562.5 ns | 2361708 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 263463.5 ns | 265609 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 13311016 ns | 11700873 ns | 1.14 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1816292 ns | 1792250 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 354288 ns | 353883.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 547708 ns | 599750 ns | 0.91 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 649084 ns | 646541 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 652396 ns | 642354.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 654334 ns | 651750.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 201982 ns | 222052.5 ns | 0.91 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8255620.5 ns | 8001504 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1373250 ns | 1368083.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 253218 ns | 250553 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2444583 ns | 2433792 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2438958 ns | 2453333 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2456021 ns | 2450084 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2465104 ns | 2446146 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1059434 ns | 1076593 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 49903929 ns | 49144651 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9206000 ns | 9471917 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1486386 ns | 1481145 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 33750 ns | 33792 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 35833 ns | 35646 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 35250 ns | 33916.5 ns | 1.04 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 708.5 ns | 979.5 ns | 0.72 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16085 ns | 16163 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 73280 ns | 73271 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3000 ns | 3041 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3208 ns | 3416.5 ns | 0.94 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3500 ns | 3208 ns | 1.09 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3208 ns | 3042 ns | 1.05 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 147479 ns | 149087.5 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 345454 ns | 347298 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 406479.5 ns | 406917 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 408083 ns | 409000 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 409084 ns | 408542 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 421791 ns | 421709 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 43788.5 ns | 44262 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1374390 ns | 1423580 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1159229.5 ns | 1169583 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 242877.5 ns | 240142.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3872667 ns | 3869979.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3987437 ns | 3995917 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3999542 ns | 3987625 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3767021 ns | 3783042 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 248367 ns | 254172 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35969162.5 ns | 36698272 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11556812.5 ns | 11782500.5 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1239688 ns | 1239597.5 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3916 ns | 3917 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3917 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3916 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 34416 ns | 34650 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1254522 ns | 1238882 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 183083.5 ns | 261875 ns | 0.70 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 43435.5 ns | 42510 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15792 ns | 15833 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 16000 ns | 16041 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15958 ns | 15750 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15541 ns | 15459 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 272573.5 ns | 274536 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 9029094 ns | 9047424.5 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 888333 ns | 874416.5 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 173932 ns | 179351 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404292 ns | 403937.5 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 295917 ns | 295833 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 295562.5 ns | 295583 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 761375 ns | 760750 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113129.5 ns | 113238 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI | 1005696.5 ns | 1008963 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 442209 ns | 409875 ns | 1.08 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 91631 ns | 90691 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1471792 ns | 1492104.5 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1150417 ns | 1161916 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1155729.5 ns | 1160479 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2383583 ns | 2384834 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 247677 ns | 255747 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI | 9605874 ns | 12021719 ns | 0.80 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 1880667 ns | 1920937.5 ns | 0.98 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 355334 ns | 355564 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 500 ns | 625 ns | 0.80 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 583 ns | 583 ns | 1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 583 ns | 500 ns | 1.17 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 583 ns | 500 ns | 1.17 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 26332 ns | 26941 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1258812 ns | 1191246 ns | 1.06 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 457667 ns | 444000 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 211573 ns | 211532 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8062.5 ns | 8688 ns | 0.93 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8666.5 ns | 9083 ns | 0.95 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9083 ns | 8375 ns | 1.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8791.5 ns | 8500 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 216840 ns | 220821.5 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 24545217 ns | 25176337 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5775000 ns | 6029000 ns | 0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 704998 ns | 698797 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 834604 ns | 832791.5 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 617104 ns | 619250 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 621854.5 ns | 616458 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1542333.5 ns | 1546542 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 131980 ns | 131009 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 163901 ns | 163946.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2694104 ns | 2694041.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 2004583.5 ns | 2004458 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 2005604 ns | 2011312.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4957938 ns | 4947688 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 252564.5 ns | 250909 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 882654.5 ns | 881169 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 375 ns | 0.89 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32204 ns | 32264 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1208893 ns | 1169718 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 431375 ns | 445458 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 50650 ns | 48990 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7041.5 ns | 7604.5 ns | 0.93 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7416 ns | 7667 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7791 ns | 7417 ns | 1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7688 ns | 7334 ns | 1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 226380.5 ns | 228776.5 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21316304 ns | 21955889 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4812583 ns | 5605854.5 ns | 0.86 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 372278.5 ns | 371843.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2369521 ns | 2428708 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2394479 ns | 2407042 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2397750 ns | 2383833 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2391083.5 ns | 2399125 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 197960 ns | 216906.5 ns | 0.91 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8220871.5 ns | 7952502 ns | 1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1425417 ns | 1453042 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 360798.5 ns | 358813.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4620416 ns | 4653958 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4653625 ns | 4663000 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4665708.5 ns | 4642292 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4681167 ns | 4666271 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 967988.5 ns | 972209 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 47578796 ns | 46330123 ns | 1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6898542 ns | 6641750 ns | 1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1260673 ns | 1413054.5 ns | 0.89 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6729 ns | 7083.5 ns | 0.95 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7104 ns | 7375 ns | 0.96 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7209 ns | 7145.5 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 7333 ns | 7583 ns | 0.97 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 23200 ns | 23810 ns | 0.97 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI | 1027845 ns | 1108764.5 ns | 0.93 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 272042 ns | 271104 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 39230 ns | 38020 ns | 1.03 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 53333 ns | 69417 ns | 0.77 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 33562 ns | 32917 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 48645.5 ns | 49958 ns | 0.97 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 49125 ns | 64167 ns | 0.77 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 237637 ns | 236370 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10782045 ns | 10552299 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 2087041.5 ns | 2069000 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 222347 ns | 236452 ns | 0.94 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 21792 ns | 21875 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 25875 ns | 25479 ns | 1.02 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 24334 ns | 24583.5 ns | 0.99 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5271 ns | 5292 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18207 ns | 18099 ns | 1.01 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 85991 ns | 85341 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 12000 ns | 12042 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 10500 ns | 10500.5 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 11000 ns | 10375 ns | 1.06 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 18437.5 ns | 18292 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 246951.5 ns | 245984.5 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 378013 ns | 375823.5 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 405625 ns | 405750 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 297042 ns | 296958 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 296542 ns | 296709 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 763333 ns | 762834 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46048 ns | 46525 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI | 1384674.5 ns | 1383371.5 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 407292 ns | 429125 ns | 0.95 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89421 ns | 89211 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1487979 ns | 1492083 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1159229 ns | 1167583 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1164583 ns | 1165979.5 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2387791.5 ns | 2388333 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 298677.5 ns | 295648 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI | 14095391.5 ns | 12300838 ns | 1.15 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2104333 ns | 2096021 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 380349 ns | 377223 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 433750 ns | 435625 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 436520.5 ns | 438500 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 436541.5 ns | 438792 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 448583 ns | 448750 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 55246 ns | 55236 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1030654 ns | 1047143 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1162770.5 ns | 1128750 ns | 1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 237662 ns | 236028 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3862458 ns | 3892667 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4025771 ns | 4033792 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4032500 ns | 4016916.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3818750 ns | 3818500 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 268759 ns | 271864 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31040419 ns | 31422713 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10498666 ns | 10635458 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1247197.5 ns | 1232417 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8750 ns | 8709 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 7667 ns | 7667 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7667 ns | 7625 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12459 ns | 12375 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24118 ns | 23822 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2224323 ns | 2169618 ns | 1.03 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 225292 ns | 226458 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 218122 ns | 217182 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45125 ns | 45208 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45167 ns | 45792 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45666 ns | 45167 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 44917 ns | 45042 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 364860.5 ns | 362753 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 13727353.5 ns | 11448473 ns | 1.20 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 1837020.5 ns | 1763770.5 ns | 1.04 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 675127 ns | 670177 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 86208 ns | 160583 ns | 0.54 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 87833 ns | 135792 ns | 0.65 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 124750 ns | 138250 ns | 0.90 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 91625 ns | 122542 ns | 0.75 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190737.5 ns | 190208 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5938165.5 ns | 5709657 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2108667 ns | 2138667 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 185242 ns | 206202 ns | 0.90 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2002125 ns | 1263083 ns | 1.59 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2012583.5 ns | 1291854 ns | 1.56 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2000062 ns | 1283542 ns | 1.56 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2024270.5 ns | 1348667 ns | 1.50 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 574437 ns | 573721 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 28287780 ns | 27951504.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9414709 ns | 9876750 ns | 0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1110142 ns | 954730 ns | 1.16 |
This comment was automatically generated by workflow using github-action-benchmark.
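
The Ratio column appears to be the current time divided by the previous time, so values below 1 mean the current commit is faster. A minimal sketch that recomputes it for a few of the groupnorm CPU rows; the timings are copied from the table, while the `ratio` helper and the "faster"/"slower" labels are illustrative assumptions and not part of github-action-benchmark:

```python
# Timings in ns copied from the groupnorm(4, act=relu, affine=false) rows:
# (current 835e49c, previous c1fafb0).
rows = {
    "forward/CPU/2 thread(s)": (86208, 160583),
    "forward/CPU/1 thread(s)": (91625, 122542),
    "zygote/CPU/2 thread(s)": (2002125, 1263083),
    "zygote/CPU/1 thread(s)": (2024270.5, 1348667),
}


def ratio(current_ns, previous_ns):
    """Assumed definition of the Ratio column: current / previous, 2 decimals."""
    return round(current_ns / previous_ns, 2)


ratios = {name: ratio(cur, prev) for name, (cur, prev) in rows.items()}

for name, r in ratios.items():
    # Illustrative labels only: below 1 means the current commit took less time.
    label = "faster" if r < 1 else "slower"
    print(f"{name}: {r:.2f} ({label})")
```

On these rows the forward pass on CPU takes less time than before, while the Zygote (gradient) benchmark on CPU takes roughly 1.5x longer.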
avik-pal force-pushed the ap/gn_perf_fix branch 3 times, most recently from 1409228 to 496707b (August 17, 2024 23:45)
avik-pal force-pushed the ap/gn_perf_fix branch 2 times, most recently from 2d8396a to e8493fb (August 18, 2024 00:25)
avik-pal force-pushed the ap/gn_perf_fix branch from e8493fb to 835e49c (August 18, 2024 00:46)