This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
perf: fusing activation functions and other misc perf improvements #126
Merged
avik-pal force-pushed the ap/act_fuse2 branch 2 times, most recently from e1ccc36 to e47271f (August 12, 2024 00:52)
20 tasks
avik-pal force-pushed the ap/act_fuse2 branch 7 times, most recently from 3148a89 to a345fe1 (August 13, 2024 07:33)
LuxLib Benchmarks
Benchmark suite | Current: 19be2ca | Previous: 6426043 | Ratio (Current / Previous) |
---|---|---|---|
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56875 ns |
35792 ns |
1.59 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
57791 ns |
29709 ns |
1.95 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
57792 ns |
29583 ns |
1.95 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
58250 ns |
55000 ns |
1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
83410 ns |
39146 ns |
2.13 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
728971 ns |
1178312 ns |
0.62 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
506125 ns |
661250 ns |
0.77 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
356904 ns |
206667.5 ns |
1.73 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
450459 ns |
234833 ns |
1.92 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
465709 ns |
249584 ns |
1.87 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
498584 ns |
202541 ns |
2.46 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
434625 ns |
345937.5 ns |
1.26 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
215832 ns |
280223 ns |
0.77 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
13199361 ns |
27356049 ns |
0.48 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
4192500 ns |
7836709 ns |
0.53 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
817938 ns |
808758 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
542 ns |
0.54 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
667 ns |
0.56 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
292 ns |
667 ns |
0.44 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
667 ns |
0.44 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
26705 ns |
26727 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
1504569 ns |
1153992 ns |
1.30 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
340250 ns |
454437 ns |
0.75 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
49601 ns |
48591 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6709 ns |
10000 ns |
0.67 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7834 ns |
10375 ns |
0.76 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7125 ns |
10667 ns |
0.67 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7167 ns |
10083 ns |
0.71 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
164243.5 ns |
201412.5 ns |
0.82 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
10082424 ns |
24286330 ns |
0.42 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
3891979.5 ns |
5574313 ns |
0.70 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
459015 ns |
400079.5 ns |
1.15 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
55583 ns |
94583 ns |
0.59 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46542 ns |
97125 ns |
0.48 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46708 ns |
95625 ns |
0.49 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82791 ns |
135875 ns |
0.61 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
40421 ns |
41104 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1487630 ns |
1318240 ns |
1.13 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1157479.5 ns |
1152708 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
84251 ns |
79551 ns |
1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1922792 ns |
1216750 ns |
1.58 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1974166.5 ns |
1149500 ns |
1.72 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1956146 ns |
1236292 ns |
1.58 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1900375 ns |
1131000 ns |
1.68 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
178262.5 ns |
239669 ns |
0.74 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
32212056 ns |
32036283 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11372708 ns |
10989604 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1021771 ns |
1025085.5 ns |
1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
1458 ns |
14875 ns |
0.098 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
1333.5 ns |
15875 ns |
0.084 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
2042 ns |
15458 ns |
0.13 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
1750 ns |
11042 ns |
0.16 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA |
21663 ns |
22126 ns |
0.98 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI |
1152516.5 ns |
1470687 ns |
0.78 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal |
190000 ns |
207208 ns |
0.92 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU |
31531 ns |
31530 ns |
1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
4333 ns |
14875 ns |
0.29 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
3687.5 ns |
14792 ns |
0.25 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
4250 ns |
14958 ns |
0.28 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
4458 ns |
14625 ns |
0.30 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA |
142909.5 ns |
148820 ns |
0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI |
9336728 ns |
8661230 ns |
1.08 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal |
1457791 ns |
1532875 ns |
0.95 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
146772 ns |
151851 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
55333 ns |
82708.5 ns |
0.67 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46375 ns |
81958 ns |
0.57 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46250 ns |
81500 ns |
0.57 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82583.5 ns |
136166 ns |
0.61 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
37284 ns |
38471 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
890940 ns |
567295 ns |
1.57 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1140958 ns |
1074604 ns |
1.06 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
78811 ns |
85291 ns |
0.92 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2024874.5 ns |
1238583 ns |
1.63 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2091187.5 ns |
1220375 ns |
1.71 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2081375 ns |
1219041.5 ns |
1.71 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1990834 ns |
1403875 ns |
1.42 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
210677 ns |
239045 ns |
0.88 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
28000217 ns |
8329149 ns |
3.36 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11050834 ns |
4579959 ns |
2.41 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
955840 ns |
1418844 ns |
0.67 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7208 ns |
16083 ns |
0.45 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6167 ns |
16209 ns |
0.38 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6125 ns |
16084 ns |
0.38 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10208 ns |
16125 ns |
0.63 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
37274 ns |
37816 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1282363.5 ns |
1234872.5 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
358750 ns |
390292 ns |
0.92 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
46900 ns |
48940 ns |
0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
242041 ns |
116667 ns |
2.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
236187.5 ns |
127250 ns |
1.86 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
249417 ns |
117375 ns |
2.12 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
246000 ns |
112125 ns |
2.19 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
186212.5 ns |
258303 ns |
0.72 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
25577862.5 ns |
26347832 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7941042 ns |
7806875 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
492325 ns |
517565 ns |
0.95 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
1875 ns |
3667 ns |
0.51 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
1958 ns |
3791 ns |
0.52 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
1958 ns |
3792 ns |
0.52 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
1834 ns |
4125 ns |
0.44 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
27186 ns |
27825 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI |
1249457 ns |
1170474 ns |
1.07 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal |
310625 ns |
467792 ns |
0.66 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
209702 ns |
209383 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
16750 ns |
28167 ns |
0.59 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
18375 ns |
28646 ns |
0.64 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
18000 ns |
29021 ns |
0.62 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
17417 ns |
30021 ns |
0.58 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
195070 ns |
286300.5 ns |
0.68 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
24793786 ns |
23897802 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal |
5385125 ns |
5772208 ns |
0.93 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
708967 ns |
718348 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
55917 ns |
82625 ns |
0.68 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46167 ns |
82020.5 ns |
0.56 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46125 ns |
81625 ns |
0.57 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82750 ns |
136209 ns |
0.61 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
28302 ns |
28938 ns |
0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1451723 ns |
999533 ns |
1.45 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1147500.5 ns |
1151208 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
77140 ns |
78011 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2020520.5 ns |
1323604.5 ns |
1.53 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2093458.5 ns |
1313958 ns |
1.59 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2095271 ns |
1307041 ns |
1.60 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1996708 ns |
1534208.5 ns |
1.30 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
191196 ns |
244351 ns |
0.78 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
35662214 ns |
36964728 ns |
0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11437833 ns |
11304583 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1046990 ns |
1046480 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
542 ns |
0.54 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
667 ns |
0.56 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
666 ns |
0.56 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
625 ns |
0.47 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
22574 ns |
23144 ns |
0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1305770.5 ns |
1208983 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
446333 ns |
342749.5 ns |
1.30 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
50210 ns |
49360 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
7125 ns |
10708 ns |
0.67 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8292 ns |
11542 ns |
0.72 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7917 ns |
11874.5 ns |
0.67 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7583 ns |
10250 ns |
0.74 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
167233.5 ns |
204673 ns |
0.82 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
23772286.5 ns |
24111144 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
5480125 ns |
6107958 ns |
0.90 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
392159 ns |
403004 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
56083 ns |
98375 ns |
0.57 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46416 ns |
95062.5 ns |
0.49 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46479.5 ns |
94875 ns |
0.49 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82709 ns |
136292 ns |
0.61 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
49641 ns |
51489 ns |
0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
636733 ns |
785195 ns |
0.81 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1102375 ns |
1104458 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
80306 ns |
69496 ns |
1.16 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1909916 ns |
1136375 ns |
1.68 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1972708 ns |
1124416 ns |
1.75 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1974687.5 ns |
1151875 ns |
1.71 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1890959 ns |
1203396 ns |
1.57 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
194221 ns |
252720 ns |
0.77 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
30315655 ns |
18838190 ns |
1.61 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9917000 ns |
9662667 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
921639 ns |
929214.5 ns |
0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
584 ns |
1666 ns |
0.35 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
812.5 ns |
1750 ns |
0.46 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
1042 ns |
2167 ns |
0.48 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
625 ns |
2042 ns |
0.31 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA |
20990 ns |
20791 ns |
1.01 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI |
1243750 ns |
1175759 ns |
1.06 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal |
291979 ns |
292959 ns |
1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU |
33800 ns |
33140 ns |
1.02 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
1458.5 ns |
2125 ns |
0.69 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
1500 ns |
2458 ns |
0.61 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
1583 ns |
2417 ns |
0.65 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
1458 ns |
2000 ns |
0.73 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA |
128883 ns |
127770.5 ns |
1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI |
9448663 ns |
8845531 ns |
1.07 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal |
1694437 ns |
1561146 ns |
1.09 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
132216.5 ns |
128606.5 ns |
1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
291 ns |
583 ns |
0.50 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
417 ns |
625 ns |
0.67 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
666 ns |
0.56 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
542 ns |
0.54 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
35059 ns |
36022 ns |
0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
1298446.5 ns |
1250216.5 ns |
1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
284375 ns |
377395.5 ns |
0.75 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
48351 ns |
48550 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7208 ns |
9083 ns |
0.79 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8333 ns |
9000 ns |
0.93 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7541 ns |
9666 ns |
0.78 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7479.5 ns |
8812.5 ns |
0.85 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
174785 ns |
214920 ns |
0.81 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20938075 ns |
20414049.5 ns |
1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
5002771 ns |
4631333.5 ns |
1.08 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
378793 ns |
375684 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7167 ns |
16500 ns |
0.43 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6125 ns |
16584 ns |
0.37 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6125 ns |
16459 ns |
0.37 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9958 ns |
16500 ns |
0.60 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
23823 ns |
24522 ns |
0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1334936 ns |
1265913 ns |
1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
604375 ns |
645041.5 ns |
0.94 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
49355.5 ns |
47030 ns |
1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
262667 ns |
143416.5 ns |
1.83 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
265042 ns |
173208 ns |
1.53 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
269000 ns |
137208 ns |
1.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
259417 ns |
147333 ns |
1.76 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
188465.5 ns |
189286 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
35995208 ns |
28874599.5 ns |
1.25 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8975291.5 ns |
8789250 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
567896 ns |
615281 ns |
0.92 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
14250 ns |
19791 ns |
0.72 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
17375 ns |
19020.5 ns |
0.91 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
17292 ns |
18520.5 ns |
0.93 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
15083 ns |
20791.5 ns |
0.73 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
21895 ns |
20764 ns |
1.05 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI |
1263662.5 ns |
1131585 ns |
1.12 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal |
221125 ns |
220479 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU |
31490 ns |
26310 ns |
1.20 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
10917 ns |
18916 ns |
0.58 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
9000 ns |
18250 ns |
0.49 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
9250 ns |
18292 ns |
0.51 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
17167 ns |
23770.5 ns |
0.72 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
175236.5 ns |
298346 ns |
0.59 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI |
10227843.5 ns |
9570516 ns |
1.07 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal |
1611666.5 ns |
1582250 ns |
1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU |
153642 ns |
154912 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
405875 ns |
225084 ns |
1.80 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
409042 ns |
179542 ns |
2.28 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
408458 ns |
178875 ns |
2.28 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
421041 ns |
413125 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
43509.5 ns |
44222 ns |
0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1529291 ns |
1389820 ns |
1.10 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1176145.5 ns |
1171416 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
243032.5 ns |
240732 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3868354.5 ns |
2228062.5 ns |
1.74 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3997334 ns |
1923687.5 ns |
2.08 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3989791 ns |
1926520.5 ns |
2.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3776125 ns |
3170083 ns |
1.19 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
237696 ns |
250699.5 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
34681235 ns |
36821504.5 ns |
0.94 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11474291 ns |
11786563 ns |
0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1243523 ns |
1238162.5 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
416 ns |
2916 ns |
0.14 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
541 ns |
2916 ns |
0.19 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
500 ns |
3000 ns |
0.17 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
458 ns |
2875 ns |
0.16 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
35793 ns |
36899 ns |
0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
1251727 ns |
1166121 ns |
1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
284417 ns |
277167 ns |
1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
49510 ns |
46391 ns |
1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9416 ns |
18666.5 ns |
0.50 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10125 ns |
19459 ns |
0.52 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10416.5 ns |
21333 ns |
0.49 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
9458 ns |
19250.5 ns |
0.49 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
253222 ns |
269415 ns |
0.94 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
22069562 ns |
17400566.5 ns |
1.27 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
5234959 ns |
5085437 ns |
1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
377774 ns |
377589 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
417 ns |
3042 ns |
0.14 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
542 ns |
3167 ns |
0.17 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
542 ns |
3208 ns |
0.17 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
458 ns |
2917 ns |
0.16 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
31907 ns |
33391.5 ns |
0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1262816 ns |
1149211 ns |
1.10 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
352604 ns |
291209 ns |
1.21 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
47470 ns |
49490 ns |
0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10375 ns |
19791 ns |
0.52 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9875 ns |
21125 ns |
0.47 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
11250 ns |
21875 ns |
0.51 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10542 ns |
19208 ns |
0.55 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
276016 ns |
302281 ns |
0.91 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
22141088 ns |
22137697 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
5497041.5 ns |
5389542 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
394104 ns |
393329 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
11750 ns |
17416.5 ns |
0.67 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
17167 ns |
17562.5 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
15916 ns |
17458 ns |
0.91 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
10375 ns |
16208 ns |
0.64 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
21529 ns |
22007 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1147174 ns |
1195107 ns |
0.96 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal |
214812.5 ns |
212083 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
190712 ns |
190302 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
32083 ns |
27750 ns |
1.16 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
31750 ns |
24209 ns |
1.31 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
31959 ns |
24417 ns |
1.31 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
31708 ns |
40479 ns |
0.78 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
287170.5 ns |
314835.5 ns |
0.91 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11042817 ns |
11155150.5 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal |
1824374.5 ns |
1711854 ns |
1.07 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
606726 ns |
608191.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
500 ns |
708 ns |
0.71 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
584 ns |
792 ns |
0.74 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
583 ns |
792 ns |
0.74 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
500 ns |
792 ns |
0.63 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
35515 ns |
36290 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1229283 ns |
1204951 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
275209 ns |
278645.5 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
209282 ns |
208802 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8541 ns |
10208 ns |
0.84 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9270.5 ns |
10667 ns |
0.87 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8958 ns |
10959 ns |
0.82 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8000 ns |
10042 ns |
0.80 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
235494 ns |
239441 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
21421544.5 ns |
21307139.5 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
4941770.5 ns |
4985250 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
681512 ns |
675322 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
459 ns |
750 ns |
0.61 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
584 ns |
833 ns |
0.70 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
583 ns |
792 ns |
0.74 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
459 ns |
792 ns |
0.58 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
26600 ns |
26516 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1249818.5 ns |
1204061.5 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
459896 ns |
406521 ns |
1.13 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
207602 ns |
209672 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8229.5 ns |
11250 ns |
0.73 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9208 ns |
12000 ns |
0.77 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9041 ns |
12458 ns |
0.73 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8541 ns |
11583 ns |
0.74 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
213419 ns |
212072 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
25815525 ns |
25515071 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
6065250 ns |
5682813 ns |
1.07 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
712268 ns |
706477 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56500 ns |
35333 ns |
1.60 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
57166 ns |
29500 ns |
1.94 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
57125 ns |
29333 ns |
1.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
57916 ns |
54500 ns |
1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
29232 ns |
29741 ns |
0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1182325 ns |
1192558 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
683834 ns |
675459 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
206402 ns |
206787.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
488792 ns |
265687.5 ns |
1.84 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
515042 ns |
241187.5 ns |
2.14 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
473542 ns |
218666.5 ns |
2.17 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
470250.5 ns |
410125 ns |
1.15 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
253111 ns |
259339 ns |
0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
35487764 ns |
31700012 ns |
1.12 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9155875 ns |
9604021 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
861719 ns |
854999 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
250 ns |
542 ns |
0.46 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
417 ns |
625 ns |
0.67 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
625 ns |
0.60 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
625 ns |
0.47 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
32164 ns |
32274 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1256267 ns |
1211786 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
326770.5 ns |
420709 ns |
0.78 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
49541 ns |
52000 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7417 ns |
9500 ns |
0.78 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7959 ns |
10312.5 ns |
0.77 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7875 ns |
10687 ns |
0.74 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7625 ns |
9291.5 ns |
0.82 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
227699 ns |
229200 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
22647102 ns |
21880902 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
4872083.5 ns |
5109250 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
371894 ns |
376694 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
7166 ns |
181375 ns |
0.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
7250 ns |
187729 ns |
0.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7416 ns |
188875 ns |
0.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
6750 ns |
143459 ns |
0.05 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
24162 ns |
23468 ns |
1.03 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI |
1203777 ns |
1196290 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal |
301979 ns |
273792 ns |
1.10 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU |
33170 ns |
39211 ns |
0.85 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
33625 ns |
193458 ns |
0.17 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
66396 ns |
191708.5 ns |
0.35 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
33500 ns |
204937.5 ns |
0.16 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
33000 ns |
226500 ns |
0.15 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
233166.5 ns |
238377 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI |
10575996 ns |
10900737.5 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal |
2017104 ns |
2034417 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
234692.5 ns |
226092 ns |
1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7125 ns |
15792 ns |
0.45 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6083 ns |
16209 ns |
0.38 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6042 ns |
15834 ns |
0.38 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10229.5 ns |
15667 ns |
0.65 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
28701.5 ns |
28554 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1196816.5 ns |
1218212.5 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
592416.5 ns |
596542 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
49390.5 ns |
50301 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
219000 ns |
124792 ns |
1.75 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
225084 ns |
126229 ns |
1.78 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
224750 ns |
127334 ns |
1.77 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
246625 ns |
175500.5 ns |
1.41 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
230047 ns |
242364.5 ns |
0.95 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
32871289 ns |
30677520 ns |
1.07 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9026938 ns |
9028500 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
555871 ns |
534385 ns |
1.04 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
432917 ns |
240250 ns |
1.80 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
436750 ns |
190333 ns |
2.29 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
436334 ns |
190709 ns |
2.29 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
446875 ns |
442042 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
54568 ns |
55862 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
635923 ns |
1016111.5 ns |
0.63 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1100979.5 ns |
1126667 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
232652 ns |
237223 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3899020.5 ns |
2144084 ns |
1.82 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4028958 ns |
1859750 ns |
2.17 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4022646 ns |
1848416 ns |
2.18 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3809125 ns |
2959083.5 ns |
1.29 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
269010 ns |
270815 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31746518 ns |
31291796.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10010333 ns |
10183875 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1235103 ns |
1241163 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
458 ns |
2833 ns |
0.16 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
542 ns |
2916 ns |
0.19 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
459 ns |
2917 ns |
0.16 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
458 ns |
2917 ns |
0.16 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
26952 ns |
27379 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
1263081 ns |
1192304 ns |
1.06 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
466396.5 ns |
454792 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
50560 ns |
48561 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
10250 ns |
23625.5 ns |
0.43 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
10125 ns |
23375 ns |
0.43 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
10208 ns |
23750 ns |
0.43 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
10250 ns |
23083.5 ns |
0.44 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
269464 ns |
273919 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
24949123 ns |
22736189.5 ns |
1.10 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5831896 ns |
5829834 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
390684 ns |
399054 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
104334 ns |
212479.5 ns |
0.49 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
99271 ns |
214125 ns |
0.46 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
100542 ns |
214333 ns |
0.47 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
146750 ns |
221166.5 ns |
0.66 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
24354 ns |
24897 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI |
1176994 ns |
1188077 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal |
306874.5 ns |
265291 ns |
1.16 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
190281.5 ns |
190962 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
478687.5 ns |
417917 ns |
1.15 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
527979 ns |
341625 ns |
1.55 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
520333 ns |
334687.5 ns |
1.55 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
529708.5 ns |
602250 ns |
0.88 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
251460.5 ns |
255547 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
11632878 ns |
12372769.5 ns |
0.94 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal |
2160750 ns |
2125916.5 ns |
1.02 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
591085 ns |
625836 ns |
0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
500 ns |
3083 ns |
0.16 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
583 ns |
3208 ns |
0.18 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
583 ns |
3209 ns |
0.18 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
459 ns |
3000 ns |
0.15 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
22776 ns |
23558 ns |
0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI |
1211728 ns |
1247062 ns |
0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal |
473270.5 ns |
331458 ns |
1.43 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
50160 ns |
50101 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
10334 ns |
24625 ns |
0.42 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
11625 ns |
25333 ns |
0.46 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
11042 ns |
26791 ns |
0.41 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
10937.5 ns |
23854.5 ns |
0.46 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
275538 ns |
279705 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
28784070 ns |
24035981 ns |
1.20 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal |
6178625 ns |
5959937.5 ns |
1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
400298 ns |
416944 ns |
0.96 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1416 ns |
2208.5 ns |
0.64 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1584 ns |
2292 ns |
0.69 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1875 ns |
2500 ns |
0.75 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1417 ns |
2250 ns |
0.63 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
20996 ns |
21285 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI |
1185953 ns |
1169943.5 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal |
321396 ns |
299791 ns |
1.07 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
192441 ns |
192672 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
3250 ns |
3250 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
3417 ns |
3292 ns |
1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
3458 ns |
3250 ns |
1.06 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
3291.5 ns |
3750 ns |
0.88 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
240693 ns |
241487 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
10013228.5 ns |
10275500 ns |
0.97 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal |
1819625 ns |
1643396 ns |
1.11 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
596385 ns |
596106 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
148709 ns |
250417 ns |
0.59 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
128521 ns |
243208.5 ns |
0.53 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
129896 ns |
239875 ns |
0.54 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
241333 ns |
287979 ns |
0.84 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
24268 ns |
24789 ns |
0.98 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI |
1208779 ns |
1167987 ns |
1.03 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal |
304437.5 ns |
269958 ns |
1.13 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU |
33520 ns |
36631 ns |
0.92 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
143459 ns |
265312.5 ns |
0.54 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
126458 ns |
242729 ns |
0.52 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
110708 ns |
240334 ns |
0.46 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
270187.5 ns |
353541 ns |
0.76 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
232829 ns |
241464 ns |
0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI |
10416372 ns |
10648580.5 ns |
0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal |
2083666 ns |
1971396 ns |
1.06 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU |
235677 ns |
223693 ns |
1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7167 ns |
16792 ns |
0.43 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6125 ns |
16750 ns |
0.37 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6000 ns |
16750 ns |
0.36 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10208 ns |
16208 ns |
0.63 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
32749 ns |
33837 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1189766 ns |
1266052 ns |
0.94 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
692875 ns |
614000 ns |
1.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
49550 ns |
50530 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
221312.5 ns |
125750 ns |
1.76 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
241041 ns |
152812 ns |
1.58 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
267792 ns |
127979.5 ns |
2.09 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
214187.5 ns |
134208.5 ns |
1.60 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
267710.5 ns |
273789 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
28987559 ns |
27569153 ns |
1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7979708 ns |
8173209 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
513334 ns |
534905 ns |
0.96 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
1958 ns |
3666 ns |
0.53 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
2041 ns |
3750 ns |
0.54 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
2042 ns |
3791 ns |
0.54 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
1917 ns |
4125 ns |
0.46 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
36286 ns |
37434 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1191308 ns |
1210446.5 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
445584 ns |
432542 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
209522 ns |
209622 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
16750 ns |
23667 ns |
0.71 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
17000 ns |
23709 ns |
0.72 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
17375 ns |
24479.5 ns |
0.71 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
16792 ns |
25542 ns |
0.66 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
305312 ns |
309240 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
23922335 ns |
20107132 ns |
1.19 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
5404354.5 ns |
4810166 ns |
1.12 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
694705.5 ns |
692227 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
1791 ns |
2250 ns |
0.80 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
1958 ns |
2542 ns |
0.77 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
2125 ns |
2625 ns |
0.81 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
1750 ns |
2375 ns |
0.74 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
21206.5 ns |
20498 ns |
1.03 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI |
1124010 ns |
1138479 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal |
314500 ns |
312375 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
28431 ns |
29111 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
2166 ns |
2667 ns |
0.81 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
2167 ns |
3145.5 ns |
0.69 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
2250 ns |
3125 ns |
0.72 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2667 ns |
0.80 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
223080 ns |
224596.5 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI |
9277099 ns |
9000358 ns |
1.03 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal |
1575354.5 ns |
1476583.5 ns |
1.07 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
141301 ns |
138381.5 ns |
1.02 |
This comment was automatically generated by workflow using github-action-benchmark.
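For readers checking the table by hand: the Ratio column matches Current divided by Previous, so a value below 1 means the current commit ran faster and a value above 1 means it ran slower. The snippet below is a minimal sketch of that arithmetic using three rows copied from the table. The helper names `ratio` and `classify` are illustrative and are not part of github-action-benchmark or LuxLib.

```python
# Minimal sketch: reproduce the Ratio column from the Current and Previous timings.
# `ratio` and `classify` are hypothetical helper names used only for illustration.

def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current / previous, rounded to two decimals like most table rows."""
    return round(current_ns / previous_ns, 2)


def classify(r: float) -> str:
    """Label a ratio for timings, where smaller is better."""
    if r < 1:
        return "faster"
    if r > 1:
        return "slower"
    return "unchanged"


# (current ns, previous ns) pairs copied from rows of the table above.
rows = {
    "bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)": (6750, 143459),
    "batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)": (473542, 218666.5),
    "bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)": (3250, 3250),
}

for name, (current, previous) in rows.items():
    r = ratio(current, previous)
    print(f"{name}: {r:.2f} ({classify(r)})")
```

Read this way, the table shows large CPU speedups for `bias_activation` with `relu`, and multi-threaded CPU slowdowns for the 4-D `batchnorm` cases with `gelu`.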
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
+ Coverage   80.59%   82.78%   +2.18%
==========================================
  Files          37       37
  Lines        1737     1795      +58
==========================================
+ Hits         1400     1486      +86
+ Misses        337      309      -28
☔ View full report in Codecov by Sentry.
avik-pal force-pushed the ap/act_fuse2 branch 3 times, most recently from b88a219 to 1cee769 (August 13, 2024 15:00)
Changes
bias_activation perf has been fixed.
batchnorm performance has been fixed.