This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 2 changed files with 3 additions and 2 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -1,7 +1,7 @@` |
| 1 | 1 | `name = "LuxLib"` |
| 2 | 2 | `uuid = "82251201-b29d-42c6-8e01-566dec8acb11"` |
| 3 | 3 | `authors = ["Avik Pal <[email protected]> and contributors"]` |
| 4 | | `-version = "1.3.5"` |
| | 4 | `+version = "1.3.6"` |
| 5 | 5 | |
| 6 | 6 | `[deps]` |
| 7 | 7 | `ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9"` |
6976693
@JuliaRegistrator register
Registration pull request created: JuliaRegistries/General/118229
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:
LuxLib Benchmarks
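Each entry in this report gives a benchmark name, two timings in nanoseconds, and a ratio. Judging from the numbers, the ratio appears to be the first timing divided by the second, rounded to two decimals; the report as captured here does not say which commit each timing belongs to. A minimal Python sketch of that arithmetic, where `ratio` is a hypothetical helper name and not part of LuxLib or the benchmark tooling:

```python
def ratio(first_ns: float, second_ns: float) -> float:
    """Return first_ns / second_ns rounded to two decimals.

    Assumption: this mirrors how the ratio in the report is derived;
    the report itself does not document the formula.
    """
    return round(first_ns / second_ns, 2)


# Timings copied from two entries of the report.
print(ratio(6333, 5000))  # layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
print(ratio(1333, 1792))  # bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
```

Under this reading, a ratio above 1 means the first timing is the slower of the two.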
layernorm(2, act=gelu, affine=false)(4 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 6333 | 5000 | 1.27 |
| forward/CPU/4 thread(s) | 5125 | 5125 | 1 |
| forward/CPU/8 thread(s) | 8041 | 7375 | 1.09 |
| forward/CPU/1 thread(s) | 5542 | 4833 | 1.15 |
| forward/GPU/CUDA | 106513 | 108327 | 0.98 |
| forward/GPU/Metal | 892833 | 704958 | 1.27 |
| forward/GPU/AMDGPU | 441815 | 452318 | 0.98 |
| zygote/CPU/2 thread(s) | 9833 | 10000 | 0.98 |
| zygote/CPU/4 thread(s) | 10125 | 9917 | 1.02 |
| zygote/CPU/8 thread(s) | 10209 | 10229.5 | 1.00 |
| zygote/CPU/1 thread(s) | 9895.5 | 9729.5 | 1.02 |
| zygote/GPU/CUDA | 513421 | 538089 | 0.95 |
| zygote/GPU/Metal | 2325125 | 2390625 | 0.97 |
| zygote/GPU/AMDGPU | 666767 | 709441 | 0.94 |

bias_activation(32, act=relu)(32 x 128)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 1333 | 1792 | 0.74 |
| forward/CPU/4 thread(s) | 1500 | 1792 | 0.84 |
| forward/CPU/8 thread(s) | 1666.5 | 2000.5 | 0.83 |
| forward/CPU/1 thread(s) | 1459 | 1584 | 0.92 |
| forward/GPU/CUDA | 20135 | 19729 | 1.02 |
| forward/GPU/Metal | 453896 | 439229 | 1.03 |
| forward/GPU/AMDGPU | 33340 | 33851 | 0.98 |
| zygote/CPU/2 thread(s) | 4375 | 4375 | 1 |
| zygote/CPU/4 thread(s) | 4375 | 3833.5 | 1.14 |
| zygote/CPU/8 thread(s) | 4417 | 4250 | 1.04 |
| zygote/CPU/1 thread(s) | 4271 | 3520.5 | 1.21 |
| zygote/GPU/CUDA | 134295 | 134838 | 1.00 |
| zygote/GPU/Metal | 2377250 | 2235354 | 1.06 |
| zygote/GPU/AMDGPU | 150416.5 | 143632.5 | 1.05 |

batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 58042 | 56375 | 1.03 |
| forward/CPU/4 thread(s) | 46833 | 46875 | 1.00 |
| forward/CPU/8 thread(s) | 46833 | 46750 | 1.00 |
| forward/CPU/1 thread(s) | 81500 | 78375 | 1.04 |
| forward/GPU/CUDA | 36986 | 36801 | 1.01 |
| forward/GPU/Metal | 1095167 | 1444229 | 0.76 |
| forward/GPU/AMDGPU | 81585.5 | 84285 | 0.97 |
| zygote/CPU/2 thread(s) | 2034750 | 2037375.5 | 1.00 |
| zygote/CPU/4 thread(s) | 2082521 | 2083500 | 1.00 |
| zygote/CPU/8 thread(s) | 2087250 | 2090334 | 1.00 |
| zygote/CPU/1 thread(s) | 1980458 | 1999916 | 0.99 |
| zygote/GPU/CUDA | 221788 | 215168.5 | 1.03 |
| zygote/GPU/Metal | 4497250 | 5415625 | 0.83 |
| zygote/GPU/AMDGPU | 1017651 | 1280705 | 0.79 |

layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 155458 | 148666.5 | 1.05 |
| forward/CPU/4 thread(s) | 145750 | 145833 | 1.00 |
| forward/CPU/8 thread(s) | 149875 | 152417 | 0.98 |
| forward/CPU/1 thread(s) | 147042 | 160792 | 0.91 |
| forward/GPU/CUDA | 166515 | 167254 | 1.00 |
| forward/GPU/Metal | 1577896 | 1500250 | 1.05 |
| forward/GPU/AMDGPU | 173242 | 172909 | 1.00 |
| zygote/CPU/2 thread(s) | 1120521 | 1133479.5 | 0.99 |
| zygote/CPU/4 thread(s) | 1117125 | 1112750 | 1.00 |
| zygote/CPU/8 thread(s) | 1113979.5 | 1115292 | 1.00 |
| zygote/CPU/1 thread(s) | 1110958 | 1109687.5 | 1.00 |
| zygote/GPU/CUDA | 654141.5 | 623047 | 1.05 |
| zygote/GPU/Metal | 8479083 | 10180459 | 0.83 |
| zygote/GPU/AMDGPU | 1028872 | 1022168 | 1.01 |

layernorm(2, act=relu, affine=true)(4 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 4375 | 4771 | 0.92 |
| forward/CPU/4 thread(s) | 4395.5 | 4708 | 0.93 |
| forward/CPU/8 thread(s) | 6125 | 6666 | 0.92 |
| forward/CPU/1 thread(s) | 3875 | 4167 | 0.93 |
| forward/GPU/CUDA | 85237.5 | 80121.5 | 1.06 |
| forward/GPU/Metal | 1327375 | 1222709 | 1.09 |
| forward/GPU/AMDGPU | 69631 | 56392.5 | 1.23 |
| zygote/CPU/2 thread(s) | 8584 | 8521 | 1.01 |
| zygote/CPU/4 thread(s) | 8417 | 8542 | 0.99 |
| zygote/CPU/8 thread(s) | 9000 | 9375 | 0.96 |
| zygote/CPU/1 thread(s) | 8584 | 8542 | 1.00 |
| zygote/GPU/CUDA | 558891 | 547974 | 1.02 |
| zygote/GPU/Metal | 7932562.5 | 7799104.5 | 1.02 |
| zygote/GPU/AMDGPU | 388659.5 | 384758 | 1.01 |

groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 17541.5 | 18062.5 | 0.97 |
| forward/CPU/4 thread(s) | 19417 | 16875 | 1.15 |
| forward/CPU/8 thread(s) | 19729.5 | 21625 | 0.91 |
| forward/CPU/1 thread(s) | 17104 | 17666.5 | 0.97 |
| forward/GPU/CUDA | 63486.5 | 62259 | 1.02 |
| forward/GPU/Metal | 1339292 | 1327729 | 1.01 |
| forward/GPU/AMDGPU | 72720.5 | 76443 | 0.95 |
| zygote/CPU/2 thread(s) | 221959 | 212542 | 1.04 |
| zygote/CPU/4 thread(s) | 212104 | 217708 | 0.97 |
| zygote/CPU/8 thread(s) | 217417 | 222604.5 | 0.98 |
| zygote/CPU/1 thread(s) | 219541 | 235416.5 | 0.93 |
| zygote/GPU/CUDA | 331752 | 326680 | 1.02 |
| zygote/GPU/Metal | 5750229 | 5672875 | 1.01 |
| zygote/GPU/AMDGPU | 468975 | 468011 | 1.00 |

bias_activation(2, act=relu)(2 x 128)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 625 | 625 | 1 |
| forward/CPU/4 thread(s) | 667 | 625 | 1.07 |
| forward/CPU/8 thread(s) | 750 | 959 | 0.78 |
| forward/CPU/1 thread(s) | 583 | 792 | 0.74 |
| forward/GPU/CUDA | 18902 | 18885 | 1.00 |
| forward/GPU/Metal | 430875 | 446167 | 0.97 |
| forward/GPU/AMDGPU | 30710 | 31881 | 0.96 |
| zygote/CPU/2 thread(s) | 1416 | 1417 | 1.00 |
| zygote/CPU/4 thread(s) | 1375 | 1375 | 1 |
| zygote/CPU/8 thread(s) | 1500 | 1667 | 0.90 |
| zygote/CPU/1 thread(s) | 1334 | 1375 | 0.97 |
| zygote/GPU/CUDA | 115876 | 117120.5 | 0.99 |
| zygote/GPU/Metal | 2152000 | 2151437.5 | 1.00 |
| zygote/GPU/AMDGPU | 136481 | 135835 | 1.00 |

batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 7375 | 7250 | 1.02 |
| forward/CPU/4 thread(s) | 5917 | 6000 | 0.99 |
| forward/CPU/8 thread(s) | 6000 | 6083 | 0.99 |
| forward/CPU/1 thread(s) | 9959 | 10166 | 0.98 |
| forward/GPU/CUDA | 23983 | 23630 | 1.01 |
| forward/GPU/Metal | 870625 | 838084 | 1.04 |
| forward/GPU/AMDGPU | 48931 | 48897 | 1.00 |
| zygote/CPU/2 thread(s) | 221500 | 220042 | 1.01 |
| zygote/CPU/4 thread(s) | 241041 | 234750 | 1.03 |
| zygote/CPU/8 thread(s) | 267583 | 270833.5 | 0.99 |
| zygote/CPU/1 thread(s) | 212771 | 253000.5 | 0.84 |
| zygote/GPU/CUDA | 190752 | 188891 | 1.01 |
| zygote/GPU/Metal | 9536917 | 8581771 | 1.11 |
| zygote/GPU/AMDGPU | 612916 | 612944.5 | 1.00 |

dense(32, bias=false, act=relu)(32 x 128)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 3917 | 3917 | 1 |
| forward/CPU/4 thread(s) | 3958 | 3917 | 1.01 |
| forward/CPU/8 thread(s) | 3916 | 3958 | 0.99 |
| forward/CPU/1 thread(s) | 3917 | 3958 | 0.99 |
| forward/GPU/CUDA | 23861 | 23120 | 1.03 |
| forward/GPU/Metal | 449042 | 433416 | 1.04 |
| forward/GPU/AMDGPU | 48310 | 47491 | 1.02 |
| zygote/CPU/2 thread(s) | 16792 | 16542 | 1.02 |
| zygote/CPU/4 thread(s) | 16917 | 17041 | 0.99 |
| zygote/CPU/8 thread(s) | 16791 | 17167 | 0.98 |
| zygote/CPU/1 thread(s) | 17084 | 16875 | 1.01 |
| zygote/GPU/CUDA | 187040 | 186342.5 | 1.00 |
| zygote/GPU/Metal | 2209583 | 2081000 | 1.06 |
| zygote/GPU/AMDGPU | 176662 | 174571.5 | 1.01 |

dense(512, bias=false, act=gelu)(512 x 128)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 922958 | 919250 | 1.00 |
| forward/CPU/4 thread(s) | 828208 | 828041 | 1.00 |
| forward/CPU/8 thread(s) | 826833 | 838917 | 0.99 |
| forward/CPU/1 thread(s) | 1251916 | 1258333 | 0.99 |
| forward/GPU/CUDA | 113841 | 113235.5 | 1.01 |
| forward/GPU/Metal | 539875 | 452875 | 1.19 |
| forward/GPU/AMDGPU | 244213 | 243040 | 1.00 |
| zygote/CPU/2 thread(s) | 2596083.5 | 2556167 | 1.02 |
| zygote/CPU/4 thread(s) | 2332333 | 2320333.5 | 1.01 |
| zygote/CPU/8 thread(s) | 2327583.5 | 2328916.5 | 1.00 |
| zygote/CPU/1 thread(s) | 3546333 | 3549104.5 | 1.00 |
| zygote/GPU/CUDA | 231757 | 229235 | 1.01 |
| zygote/GPU/Metal | 2200250 | 2156125 | 1.02 |
| zygote/GPU/AMDGPU | 745088 | 739658 | 1.01 |

layernorm(2, act=relu, affine=false)(32 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 6666 | 6084 | 1.10 |
| forward/CPU/4 thread(s) | 5312.5 | 5520.5 | 0.96 |
| forward/CPU/8 thread(s) | 7000 | 8354 | 0.84 |
| forward/CPU/1 thread(s) | 6625 | 5834 | 1.14 |
| forward/GPU/CUDA | 83981 | 83528.5 | 1.01 |
| forward/GPU/Metal | 1197104.5 | 1131521 | 1.06 |
| forward/GPU/AMDGPU | 59291 | 58842 | 1.01 |
| zygote/CPU/2 thread(s) | 11604.5 | 11729.5 | 0.99 |
| zygote/CPU/4 thread(s) | 11979 | 11583 | 1.03 |
| zygote/CPU/8 thread(s) | 10917 | 11479.5 | 0.95 |
| zygote/CPU/1 thread(s) | 11250 | 10999.5 | 1.02 |
| zygote/GPU/CUDA | 599108 | 596279 | 1.00 |
| zygote/GPU/Metal | 7575937.5 | 7505021 | 1.01 |
| zygote/GPU/AMDGPU | 410064 | 402564 | 1.02 |

dense(2, bias=true, act=relu)(2 x 128)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 500 | 500 | 1 |
| forward/CPU/4 thread(s) | 500 | 500 | 1 |
| forward/CPU/8 thread(s) | 500 | 542 | 0.92 |
| forward/CPU/1 thread(s) | 500 | 500 | 1 |
| forward/GPU/CUDA | 23782 | 23594 | 1.01 |
| forward/GPU/Metal | 438000 | 436875 | 1.00 |
| forward/GPU/AMDGPU | 48641 | 48301 | 1.01 |
| zygote/CPU/2 thread(s) | 2125 | 2084 | 1.02 |
| zygote/CPU/4 thread(s) | 2084 | 2083 | 1.00 |
| zygote/CPU/8 thread(s) | 2084 | 2208 | 0.94 |
| zygote/CPU/1 thread(s) | 2083 | 2125 | 0.98 |
| zygote/GPU/CUDA | 224264.5 | 224089.5 | 1.00 |
| zygote/GPU/Metal | 2511958 | 2406437.5 | 1.04 |
| zygote/GPU/AMDGPU | 175852 | 182056 | 0.97 |

groupnorm(2, act=relu, affine=true)(32 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 8875 | 8916 | 1.00 |
| forward/CPU/4 thread(s) | 9125 | 8292 | 1.10 |
| forward/CPU/8 thread(s) | 9229 | 11209 | 0.82 |
| forward/CPU/1 thread(s) | 8792 | 8375 | 1.05 |
| forward/GPU/CUDA | 107320.5 | 101414 | 1.06 |
| forward/GPU/Metal | 1245250 | 1214500 | 1.03 |
| forward/GPU/AMDGPU | 73535.5 | 73272.5 | 1.00 |
| zygote/CPU/2 thread(s) | 18084 | 18625 | 0.97 |
| zygote/CPU/4 thread(s) | 17500 | 17208.5 | 1.02 |
| zygote/CPU/8 thread(s) | 17979 | 18667 | 0.96 |
| zygote/CPU/1 thread(s) | 17875.5 | 16771 | 1.07 |
| zygote/GPU/CUDA | 580294 | 555190.5 | 1.05 |
| zygote/GPU/Metal | 5603625 | 5531208.5 | 1.01 |
| zygote/GPU/AMDGPU | 384069 | 379272 | 1.01 |

batchnorm(2, act=identity, affine=false)(32 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 459 | 458 | 1.00 |
| forward/CPU/4 thread(s) | 500 | 500 | 1 |
| forward/CPU/8 thread(s) | 500 | 583 | 0.86 |
| forward/CPU/1 thread(s) | 625 | 500 | 1.25 |
| forward/GPU/CUDA | 33286 | 34468 | 0.97 |
| forward/GPU/Metal | 662708.5 | 654854 | 1.01 |
| forward/GPU/AMDGPU | 46191 | 45552 | 1.01 |
| zygote/CPU/2 thread(s) | 9229 | 9854 | 0.94 |
| zygote/CPU/4 thread(s) | 9917 | 9250 | 1.07 |
| zygote/CPU/8 thread(s) | 9041.5 | 9458 | 0.96 |
| zygote/CPU/1 thread(s) | 8875 | 8562.5 | 1.04 |
| zygote/GPU/CUDA | 252697 | 257386.5 | 0.98 |
| zygote/GPU/Metal | 5732729 | 5553750 | 1.03 |
| zygote/GPU/AMDGPU | 368379 | 366942 | 1.00 |

dense(512, bias=false, act=identity)(512 x 128)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 397000 | 396542 | 1.00 |
| forward/CPU/4 thread(s) | 288166 | 288042 | 1.00 |
| forward/CPU/8 thread(s) | 288167 | 287541 | 1.00 |
| forward/CPU/1 thread(s) | 756250 | 756167 | 1.00 |
| forward/GPU/CUDA | 110629.5 | 112104 | 0.99 |
| forward/GPU/Metal | 394292 | 519187.5 | 0.76 |
| forward/GPU/AMDGPU | 74970 | 76352 | 0.98 |
| zygote/CPU/2 thread(s) | 1465084 | 1409875 | 1.04 |
| zygote/CPU/4 thread(s) | 1132458 | 1132584 | 1.00 |
| zygote/CPU/8 thread(s) | 1130750 | 1126791.5 | 1.00 |
| zygote/CPU/1 thread(s) | 2438896 | 2436813 | 1.00 |
| zygote/GPU/CUDA | 196505 | 199625 | 0.98 |
| zygote/GPU/Metal | 1727792 | 1712834 | 1.01 |
| zygote/GPU/AMDGPU | 325688.5 | 322335 | 1.01 |

layernorm(2, act=relu, affine=true)(32 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 7667 | 7083 | 1.08 |
| forward/CPU/4 thread(s) | 6917 | 6874.5 | 1.01 |
| forward/CPU/8 thread(s) | 7437.5 | 8458 | 0.88 |
| forward/CPU/1 thread(s) | 6479.5 | 6938 | 0.93 |
| forward/GPU/CUDA | 134612.5 | 134438.5 | 1.00 |
| forward/GPU/Metal | 1200042 | 1132749.5 | 1.06 |
| forward/GPU/AMDGPU | 59360.5 | 59441 | 1.00 |
| zygote/CPU/2 thread(s) | 15146 | 16563 | 0.91 |
| zygote/CPU/4 thread(s) | 14750 | 13917 | 1.06 |
| zygote/CPU/8 thread(s) | 15417 | 16167 | 0.95 |
| zygote/CPU/1 thread(s) | 13270.5 | 15187.5 | 0.87 |
| zygote/GPU/CUDA | 893159 | 880177 | 1.01 |
| zygote/GPU/Metal | 8030208 | 7959042 | 1.01 |
| zygote/GPU/AMDGPU | 422534.5 | 418702.5 | 1.01 |

layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 25666 | 24146 | 1.06 |
| forward/CPU/4 thread(s) | 27708 | 23791.5 | 1.16 |
| forward/CPU/8 thread(s) | 26375 | 28250 | 0.93 |
| forward/CPU/1 thread(s) | 31250 | 24896 | 1.26 |
| forward/GPU/CUDA | 184506 | 185908.5 | 0.99 |
| forward/GPU/Metal | 1655917 | 1653167 | 1.00 |
| forward/GPU/AMDGPU | 112126 | 114524 | 0.98 |
| zygote/CPU/2 thread(s) | 104541 | 152041 | 0.69 |
| zygote/CPU/4 thread(s) | 149479.5 | 105395.5 | 1.42 |
| zygote/CPU/8 thread(s) | 117292 | 113125 | 1.04 |
| zygote/CPU/1 thread(s) | 131500 | 104979 | 1.25 |
| zygote/GPU/CUDA | 1008242 | 1011252 | 1.00 |
| zygote/GPU/Metal | 8271541.5 | 8155875 | 1.01 |
| zygote/GPU/AMDGPU | 585661.5 | 577332 | 1.01 |

layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 79667 | 79000 | 1.01 |
| forward/CPU/4 thread(s) | 83312.5 | 76417 | 1.09 |
| forward/CPU/8 thread(s) | 75250 | 76833 | 0.98 |
| forward/CPU/1 thread(s) | 74625 | 80250 | 0.93 |
| forward/GPU/CUDA | 188947 | 190543 | 0.99 |
| forward/GPU/Metal | 1588521 | 1268166 | 1.25 |
| forward/GPU/AMDGPU | 126096 | 125494 | 1.00 |
| zygote/CPU/2 thread(s) | 295166 | 301375.5 | 0.98 |
| zygote/CPU/4 thread(s) | 297958 | 295750 | 1.01 |
| zygote/CPU/8 thread(s) | 302917 | 231208 | 1.31 |
| zygote/CPU/1 thread(s) | 208479.5 | 209499.5 | 1.00 |
| zygote/GPU/CUDA | 1011400 | 1046615 | 0.97 |
| zygote/GPU/Metal | 9046229.5 | 9187687.5 | 0.98 |
| zygote/GPU/AMDGPU | 695657.5 | 689189 | 1.01 |

layernorm(2, act=gelu, affine=true)(32 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 13333 | 13333 | 1 |
| forward/CPU/4 thread(s) | 13458 | 13334 | 1.01 |
| forward/CPU/8 thread(s) | 13291.5 | 15062.5 | 0.88 |
| forward/CPU/1 thread(s) | 13166 | 12750 | 1.03 |
| forward/GPU/CUDA | 136849.5 | 137754.5 | 0.99 |
| forward/GPU/Metal | 1163104 | 1170125 | 0.99 |
| forward/GPU/AMDGPU | 234843 | 233927 | 1.00 |
| zygote/CPU/2 thread(s) | 26270.5 | 28270.5 | 0.93 |
| zygote/CPU/4 thread(s) | 27250 | 26542 | 1.03 |
| zygote/CPU/8 thread(s) | 26187.5 | 27166.5 | 0.96 |
| zygote/CPU/1 thread(s) | 26750 | 26062 | 1.03 |
| zygote/GPU/CUDA | 906727.5 | 912323.5 | 0.99 |
| zygote/GPU/Metal | 8072458.5 | 7923459 | 1.02 |
| zygote/GPU/AMDGPU | 686187 | 689579 | 1.00 |

groupnorm(2, act=gelu, affine=true)(32 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 13375 | 15042 | 0.89 |
| forward/CPU/4 thread(s) | 15167 | 14625 | 1.04 |
| forward/CPU/8 thread(s) | 14917 | 17292 | 0.86 |
| forward/CPU/1 thread(s) | 14042 | 13834 | 1.02 |
| forward/GPU/CUDA | 119077.5 | 119657.5 | 1.00 |
| forward/GPU/Metal | 1244520.5 | 1225791.5 | 1.02 |
| forward/GPU/AMDGPU | 239672 | 239157 | 1.00 |
| zygote/CPU/2 thread(s) | 26583 | 26375 | 1.01 |
| zygote/CPU/4 thread(s) | 26874.5 | 26208 | 1.03 |
| zygote/CPU/8 thread(s) | 26583 | 26375 | 1.01 |
| zygote/CPU/1 thread(s) | 27458 | 26375 | 1.04 |
| zygote/GPU/CUDA | 671026 | 665016.5 | 1.01 |
| zygote/GPU/Metal | 6008312.5 | 5755000 | 1.04 |
| zygote/GPU/AMDGPU | 677847 | 674067.5 | 1.01 |

groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 184417 | 183750 | 1.00 |
| forward/CPU/4 thread(s) | 182500 | 181645.5 | 1.00 |
| forward/CPU/8 thread(s) | 184875 | 187833 | 0.98 |
| forward/CPU/1 thread(s) | 182167 | 183666 | 0.99 |
| forward/GPU/CUDA | 100828 | 101191 | 1.00 |
| forward/GPU/Metal | 1357167 | 1353021 | 1.00 |
| forward/GPU/AMDGPU | 234093 | 235596.5 | 0.99 |
| zygote/CPU/2 thread(s) | 584250 | 636291 | 0.92 |
| zygote/CPU/4 thread(s) | 584458 | 594625 | 0.98 |
| zygote/CPU/8 thread(s) | 586834 | 592062.5 | 0.99 |
| zygote/CPU/1 thread(s) | 601875 | 613458 | 0.98 |
| zygote/GPU/CUDA | 487358 | 491587 | 0.99 |
| zygote/GPU/Metal | 6120229 | 6127021 | 1.00 |
| zygote/GPU/AMDGPU | 715228 | 708249 | 1.01 |

layernorm(2, act=identity, affine=true)(32 x 32)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 7854.5 | 7375 | 1.07 |
| forward/CPU/4 thread(s) | 8000 | 8333 | 0.96 |
| forward/CPU/8 thread(s) | 7542 | 9417 | 0.80 |
| forward/CPU/1 thread(s) | 7250 | 7229.5 | 1.00 |
| forward/GPU/CUDA | 135774.5 | 137783 | 0.99 |
| forward/GPU/Metal | 1199333 | 1110021 | 1.08 |
| forward/GPU/AMDGPU | 58821 | 57461 | 1.02 |
| zygote/CPU/2 thread(s) | 13041 | 14812.5 | 0.88 |
| zygote/CPU/4 thread(s) | 14709 | 14791 | 0.99 |
| zygote/CPU/8 thread(s) | 13937.5 | 14875 | 0.94 |
| zygote/CPU/1 thread(s) | 14562.5 | 12896 | 1.13 |
| zygote/GPU/CUDA | 874051 | 881205 | 0.99 |
| zygote/GPU/Metal | 7775916 | 7653313 | 1.02 |
| zygote/GPU/AMDGPU | 397914.5 | 399470 | 1.00 |

batchedmm(512, Bsize=4)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 6152062.5 | 6156708 | 1.00 |
| forward/CPU/4 thread(s) | 6375958.5 | 6375958.5 | 1 |
| forward/CPU/8 thread(s) | 6375125 | 6373937.5 | 1.00 |
| forward/CPU/1 thread(s) | 11916542 | 11907750 | 1.00 |
| forward/GPU/CUDA | 336665 | 347134 | 0.97 |
| forward/GPU/Metal | 1592000 | 1596208 | 1.00 |
| forward/GPU/AMDGPU | 301084 | 300417.5 | 1.00 |
| zygote/CPU/2 thread(s) | 19075667 | 19072062.5 | 1.00 |
| zygote/CPU/4 thread(s) | 19963000 | 19937292 | 1.00 |
| zygote/CPU/8 thread(s) | 19972937.5 | 19969000 | 1.00 |
| zygote/CPU/1 thread(s) | 36525333.5 | 36484084 | 1.00 |
| zygote/GPU/CUDA | 1014155 | 1007983 | 1.01 |
| zygote/GPU/Metal | 7901000 | 7924354 | 1.00 |
| zygote/GPU/AMDGPU | 1170477.5 | 1163329 | 1.01 |

dense(2, bias=true, act=gelu)(2 x 128)

| Benchmark | First (ns) | Second (ns) | Ratio |
|---|---|---|---|
| forward/CPU/2 thread(s) | 1833 | 1750 | 1.05 |
| forward/CPU/4 thread(s) | 1791 | 1875 | 0.96 |
| forward/CPU/8 thread(s) | 1833 | 1833 | 1 |
| forward/CPU/1 thread(s) | 1792 | 1792 | 1 |
| forward/GPU/CUDA | 23343 | 23636 | 0.99 |
| forward/GPU/Metal | 470292 | 431667 | 1.09 |
| forward/GPU/AMDGPU | 209012 | 208896 | 1.00 |
| zygote/CPU/2 thread(s) | 4875 | 4792 | 1.02 |
| zygote/CPU/4 thread(s) | 4792 | 4875 | 0.98 |
| zygote/CPU/8 thread(s) | 4875 | 4959 | 0.98 |
| zygote/CPU/1 thread(s) | 4833 | 4833 | 1 |
| zygote/GPU/CUDA | 268124 | 270525.5 | 0.99 |
| zygote/GPU/Metal | 2721875 | 2513333 | 1.08 |
| zygote/GPU/AMDGPU | 627407 | 618686 | 1.01 |

| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7083.5 ns | 9416.5 ns | 0.75 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8229 ns | 7917 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8458 ns | 9625 ns | 0.88 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7666.5 ns | 7271 ns | 1.05 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 115788 ns | 116370.5 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 1227812.5 ns | 1185875 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 66840 ns | 68072 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 12083 ns | 11937.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11541.5 ns | 10958 ns | 1.05 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12125 ns | 12417 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11313 ns | 11083.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 602742 ns | 603718 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5680125 ns | 5647937.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 358744 ns | 355648 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 250 ns | 1.16 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22574 ns | 22877 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 434292 ns | 443875 ns | 0.98 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 50011 ns | 46351 ns | 1.08 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2917 ns | 2916 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 3000 ns | 3083 ns | 0.97 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2958 ns | 3250 ns | 0.91 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 3084 ns | 2958 ns | 1.04 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 193009.5 ns | 196283.5 ns | 0.98 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 2163750 ns | 2099292 ns | 1.03 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 169072 ns | 160444 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 14791.5 ns | 14208.5 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 14458.5 ns | 14375 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 15749.5 ns | 17521 ns | 0.90 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 16229 ns | 14729 ns | 1.10 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 117264.5 ns | 116923.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 1169583 ns | 1146125 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 239503 ns | 237206 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25938 ns | 25666 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25166.5 ns | 25500 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25667 ns | 25875 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25417 ns | 25791 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 554698 ns | 551650 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5215833 ns | 5245875 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 655417 ns | 650325 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4292 ns | 4208 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4333 ns | 4208 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4208 ns | 4208 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4250 ns | 4167 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24021 ns | 24277 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 448792 ns | 445125 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 49241 ns | 48561 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16334 ns | 15917 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16458 ns | 16208 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16417 ns | 16250 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16167 ns | 16125 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 315889 ns | 320460 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 2507333 ns | 2478875 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 210422 ns | 206705 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5791 ns | 5625 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5708 ns | 5917 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5833 ns | 5834 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5834 ns | 5833 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34426 ns | 35140 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 658604.5 ns | 657000 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 206672 ns | 205735 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 22542 ns | 20708 ns | 1.09 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 22791.5 ns | 21146 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21541.5 ns | 22208 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 22125 ns | 21750 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 281382.5 ns | 281377 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 6163916 ns | 5995542 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 690657 ns | 679901 ns | 1.02 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 60229 ns | 58583 ns | 1.03 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 67000 ns | 65083 ns | 1.03 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 66291 ns | 66334 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51000 ns | 51645.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66426 ns | 66570 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/Metal | 14940791.5 ns | 14881125 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 99716.5 ns | 95562 ns | 1.04 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 133354.5 ns | 181791.5 ns | 0.73 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 166083 ns | 125000 ns | 1.33 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 109833 ns | 149958.5 ns | 0.73 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 298209 ns | 310334 ns | 0.96 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 207628 ns | 209829 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/GPU/Metal | 46423875 ns | 46762875 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 557541 ns | 579958 ns | 0.96 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82458 ns | 82625 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81542 ns | 80750 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84459 ns | 86292 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83500 ns | 82500 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192154 ns | 192479 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2110021 ns | 1995437.5 ns | 1.06 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 193692 ns | 168164 ns | 1.15 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1637875 ns | 1923792 ns | 0.85 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1914562.5 ns | 1884271 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1911000 ns | 1888583 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1914333 ns | 1917291 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 500508 ns | 508617 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8777854.5 ns | 8813959 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1073351 ns | 923511 ns | 1.16 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 333 ns | 291 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21580 ns | 21906 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 497166.5 ns | 450667 ns | 1.10 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 43201 ns | 41861 ns | 1.03 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1834 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1791 ns | 1875 ns | 0.96 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1916 ns | 0.96 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1833 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 241066 ns | 246989 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 2295374.5 ns | 2172458.5 ns | 1.06 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 182807 ns | 186805 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10625 ns | 9979 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9687.5 ns | 8562.5 ns | 1.13 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11875 ns | 11458 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8063 ns | 8666.5 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 115805 ns | 114779 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 1135833.5 ns | 1098750 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 236673 ns | 238165 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10125 ns | 9771 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9542 ns | 10000 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10062.5 ns | 10291 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10042 ns | 9604.5 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 486530 ns | 492318 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5071146 ns | 5055604 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 637096.5 ns | 634834 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58000 ns | 56541 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46541 ns | 46708 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47000 ns | 46792 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 78709 ns | 77500 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38846 ns | 38130.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1372041 ns | 1203084 ns | 1.14 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 75456 ns | 79889 ns | 0.94 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1862978.5 ns | 1937792 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1924708 ns | 1980021 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1939583.5 ns | 1936541.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1603042 ns | 1886999.5 ns | 0.85 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 209319 ns | 211665 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10839458.5 ns | 11204125 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1019311.5 ns | 1008110 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 269563 ns | 267979 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 268083 ns | 266375 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 270937.5 ns | 271000 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 269646 ns | 268291.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 192827 ns | 193827.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1582625 ns | 1446458.5 ns | 1.09 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 284593 ns | 282897 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 658729.5 ns | 675542 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 675437.5 ns | 673792 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 589229.5 ns | 589042 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 691584 ns | 681292 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 977700 ns | 994673.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9348125 ns | 8996396 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 902340 ns | 898667.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2226917 ns | 2161437 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2207375 ns | 2211833 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2181500 ns | 2212042 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2213209 ns | 2215687.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 156097 ns | 154115 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1440375 ns | 1427083.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 409085 ns | 406627 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5503437.5 ns | 5581500 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5507667 ns | 5501104 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5517375 ns | 5517083.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5238729 ns | 5264333.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 926541 ns | 937351 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9910166.5 ns | 10010417 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1552722.5 ns | 1552019 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 991958 ns | 986917 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 898791 ns | 898250 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 900521 ns | 898500 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 1323375 ns | 1324292 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46383 ns | 46763 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 460146 ns | 458458.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 244933 ns | 243438 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2620500 ns | 2547916.5 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2325791 ns | 2324625 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2331292 ns | 2333583 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3556812 ns | 3548709 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 253489 ns | 256534 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2357041 ns | 2463833 ns | 0.96 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 774424 ns | 770755 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56375 ns | 56084 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46395.5 ns | 46250 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46750 ns | 46542 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82708 ns | 81750 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 27939 ns | 27782 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1375750 ns | 1193583 ns | 1.15 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 76386 ns | 72909 ns | 1.05 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1707541 ns | 2048500 ns | 0.83 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2085833 ns | 2090917 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2085083 ns | 2061417 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1715833 ns | 1996958.5 ns | 0.86 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 222633 ns | 223774 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11461896 ns | 11058874.5 ns | 1.04 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1034336.5 ns | 1035585 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 55854.5 ns | 56458 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46833 ns | 46709 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46791 ns | 47084 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 78416.5 ns | 78584 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48253 ns | 48280 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1130333 ns | 1315916.5 ns | 0.86 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 61021 ns | 71380 ns | 0.85 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1652750 ns | 1903125 ns | 0.87 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1976145.5 ns | 1963666.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1972583 ns | 1961854 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1873145.5 ns | 1850771 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 229304.5 ns | 231382 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9660125.5 ns | 9466667 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 919460 ns | 913772 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32770 ns | 34209 ns | 0.96 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 649167 ns | 630896 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 46020 ns | 48489 ns | 0.95 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6541 ns | 6625 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7291 ns | 6375 ns | 1.14 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6875 ns | 7208 ns | 0.95 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6500 ns | 6500 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 203642.5 ns | 205122.5 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5720167 ns | 5599333 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 320874 ns | 366869 ns | 0.87 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31916 ns | 32165 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 380667 ns | 385250 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 34140 ns | 40300 ns | 0.85 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2959 ns | 2875 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2792 ns | 3083 ns | 0.91 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2917 ns | 2959 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 3125 ns | 3000 ns | 1.04 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 180055.5 ns | 183941 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 1844396 ns | 1836854.5 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 148011 ns | 164169.5 ns | 0.90 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1425959 ns | 1427166.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1413333 ns | 1449750 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1411729 ns | 1417625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1408583 ns | 1441604 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 134588.5 ns | 134383 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2798124.5 ns | 2843875 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 352599 ns | 355189 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5010083.5 ns | 4996833 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5062521 ns | 5015708 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5038334 ns | 5020625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5011959 ns | 4981250 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 667945 ns | 673084.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10907500 ns | 10662292 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1472837 ns | 1463829 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49831333 ns | 49772312.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35550667 ns | 35522417 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35529334 ns | 35489333 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 97719291 ns | 96946583 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1621745 ns | 1601690 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/GPU/Metal | 10632250 ns | 10627562.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1054872 ns | 1042214.5 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154578979 ns | 154216458 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112329583.5 ns | 112301604.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112410541 ns | 112218667 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 298542500 ns | 294869708.5 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6574515 ns | 6475752.5 ns | 1.02 |
| batchedmm(512, Bsize=32)/zygote/GPU/Metal | 72685667 ns | 70117375 ns | 1.04 |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5575743 ns | 5557063.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 48000 ns | 48417 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47583 ns | 47916 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47959 ns | 48021 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47625 ns | 47541 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 19664 ns | 19924.5 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 506125 ns | 496041 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 27380 ns | 25680 ns | 1.07 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 51146 ns | 49792 ns | 1.03 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50834 ns | 50708.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50792 ns | 51209 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50667 ns | 51458 ns | 0.98 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 242242 ns | 245262 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 2303041.5 ns | 2146500 ns | 1.07 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 148061.5 ns | 146160 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7958 ns | 10209 ns | 0.78 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9875 ns | 8959 ns | 1.10 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10167 ns | 10750 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8791 ns | 9000 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 118650 ns | 118313 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 1220104.5 ns | 1163542 ns | 1.05 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 237682.5 ns | 237350.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10834 ns | 10708 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10834 ns | 10417 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10854.5 ns | 10833 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 12083 ns | 10208 ns | 1.18 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 577803 ns | 582997 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5781833 ns | 5755625 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 657058 ns | 653411 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9250 ns | 8417 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9333 ns | 8979 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10270.5 ns | 11208 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 8625 ns | 9875 ns | 0.87 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 115000 ns | 115767 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 1179208.5 ns | 1146625 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72391 ns | 72681 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 14875 ns | 14833 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 14604.5 ns | 14584 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 14708 ns | 14979.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13937.5 ns | 14125 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 553338.5 ns | 554958.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5232750 ns | 5137041 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 345108.5 ns | 345660.5 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 958 ns | 1.13 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 958 ns | 1083 ns | 0.88 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 958 ns | 958 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 33129 ns | 34204.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 644958 ns | 638979.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 207273 ns | 207831 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9667 ns | 8291 ns | 1.17 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 8541 ns | 1.15 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9125 ns | 9292 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9625 ns | 9500 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 222729 ns | 223363.5 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5938167 ns | 5901875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 660822 ns | 657971.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23250 ns | 23500 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23708 ns | 23542 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23625 ns | 23834 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23666 ns | 23125 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 19720 ns | 20050 ns | 0.98 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 463291 ns | 448583.5 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 187092 ns | 188301 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 54583.5 ns | 53770.5 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 53417 ns | 53042 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 53500 ns | 54042 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 53437 ns | 55020.5 ns | 0.97 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 257983 ns | 258832 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 2497375 ns | 2415625 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 593097 ns | 588042 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1400083 ns | 1448437.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1397354 ns | 1438125 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1396958 ns | 1405125 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1439145.5 ns | 1396021 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193949 ns | 194395.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2076583 ns | 2058625 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 346604 ns | 346302 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5008750 ns | 5024812.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4964729 ns | 5026125 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5035396 ns | 5011083 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4929271 ns | 5006958 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 504516 ns | 510089 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9047750 ns | 9178458 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1200808.5 ns | 1198365 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s)
788818625
ns779661000
ns1.01
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s)
551612334
ns541756209
ns1.02
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s)
540244250
ns545828709
ns0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s)
1530028375
ns1513614750
ns1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA
22553472
ns22673094
ns0.99
batchedmm(512, Bsize=512)/forward/GPU/Metal
107424833
ns107171459
ns1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU
14536744
ns14686436
ns0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s)
2524317083
ns2975273958
ns0.85
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s)
1813606000
ns2889890291
ns0.63
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s)
1790647500
ns1793050500
ns1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s)
4736014333
ns4711214375
ns1.01
batchedmm(512, Bsize=512)/zygote/GPU/CUDA
118463004
ns118916960
ns1.00
batchedmm(512, Bsize=512)/zygote/GPU/Metal
3156126000
ns2622707250
ns1.20
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU
87116333
ns87900974
ns0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
76750
ns76541
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
79667
ns79375
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
78750
ns79167
ns0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
75625
ns85583
ns0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
191467.5
ns191949
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
1500958
ns1500104
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
107821
ns105890.5
ns1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
197854
ns261583.5
ns0.76
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
282500
ns232562.5
ns1.21
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
194437.5
ns196625
ns0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
193354.5
ns192687.5
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
989075
ns996248
ns0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
8722083.5
ns8743333
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
632717
ns628158
ns1.01
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s)
199760333.5
ns198984604
ns1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s)
138689875
ns139204167
ns1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s)
139093875
ns139144125
ns1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s)
393577208
ns393236834
ns1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA
5844135
ns5825572
ns1.00
batchedmm(512, Bsize=128)/forward/GPU/Metal
33596666.5
ns33344937.5
ns1.01
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU
3554721
ns3611135.5
ns0.98
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s)
618168312.5
ns617564646
ns1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s)
438128750
ns440013042
ns1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s)
439741104.5
ns438881145.5
ns1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s)
1193079708
ns1193608916
ns1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA
26785977
ns26745549.5
ns1.00
batchedmm(512, Bsize=128)/zygote/GPU/Metal
111846000
ns110179542
ns1.02
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU
21973627
ns21869093
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7500
ns7083
ns1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6000
ns6208
ns0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6083
ns6042
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9750
ns9833
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
26816
ns26360.5
ns1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
870208
ns873478.5
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
47821
ns46220
ns1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
227291
ns213416.5
ns1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
248354
ns232437.5
ns1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
221417
ns222375
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
206500
ns219250
ns0.94
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
213108
ns215332
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
9197375
ns8943333
ns1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
528466
ns524234
ns1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
8125
ns8083
ns1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
8333
ns8291
ns1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
10771
ns10709
ns1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
7125
ns8500
ns0.84
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
112963
ns113094.5
ns1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal
1159166
ns1123895.5
ns1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
72821
ns70651
ns1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
9000
ns8917
ns1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
8541.5
ns8958
ns0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
8520.5
ns8584
ns0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
8041
ns8208
ns0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
485370
ns492563
ns0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal
5087333
ns5073167
ns1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
321064
ns317437.5
ns1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
459
ns459
ns1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
375
ns541
ns0.69
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
500
ns542
ns0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
458
ns500
ns0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
24957
ns25048
ns1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal
721416
ns713958
ns1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
48420
ns46561
ns1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
12146
ns10666.5
ns1.14
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
11771
ns11479
ns1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
11625
ns11583
ns1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
11833.5
ns10354
ns1.14
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
244608
ns244034
ns1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal
6507583
ns6283709
ns1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
390079.5
ns383588
ns1.02
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
351395.5
ns353416
ns0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
351312.5
ns353792
ns0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
352000
ns352021
ns1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
352667
ns350958
ns1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA
22442
ns22877.5
ns0.98
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal
331292
ns312208
ns1.06
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU
189122
ns188432
ns1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
827250
ns793000
ns1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
807500
ns807333.5
ns1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
779084
ns777437
ns1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
824062.5
ns830979
ns0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA
215869
ns218580
ns0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal
2827333
ns2766209
ns1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
609147
ns604914.5
ns1.01
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s)
5375
ns5521
ns0.97
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s)
5041.5
ns5479
ns0.92
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s)
6542
ns7396
ns0.88
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s)
4292
ns4166
ns1.03
batchedmm(16, Bsize=32)/forward/GPU/CUDA
17243
ns17982
ns0.96
batchedmm(16, Bsize=32)/forward/GPU/Metal
1971750
ns1438291.5
ns1.37
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU
71641
ns71380
ns1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s)
12896
ns12520.5
ns1.03
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s)
12770.5
ns11521
ns1.11
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s)
11687.5
ns11521
ns1.01
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s)
17250
ns18042
ns0.96
batchedmm(16, Bsize=32)/zygote/GPU/CUDA
201112.5
ns207562.5
ns0.97
batchedmm(16, Bsize=32)/zygote/GPU/Metal
5430958.5
ns5079708
ns1.07
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU
376764
ns368113
ns1.02
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s)
39958
ns38125
ns1.05
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s)
51479.5
ns51291.5
ns1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s)
53084
ns52584
ns1.01
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s)
13375
ns13500
ns0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA
20843
ns20289
ns1.03
batchedmm(16, Bsize=128)/forward/GPU/Metal
4986791.5
ns4978875
ns1.00
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU
80861
ns84681
ns0.95
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s)
37666
ns36896
ns1.02
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s)
32604.5
ns31458
ns1.04
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s)
31666
ns31958
ns0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s)
67625
ns66000
ns1.02
batchedmm(16, Bsize=128)/zygote/GPU/CUDA
178132
ns184469
ns0.97
batchedmm(16, Bsize=128)/zygote/GPU/Metal
13540167
ns13432687
ns1.01
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU
411414
ns412423
ns1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s)
3583
ns3583
ns1
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s)
3458
ns3666
ns0.94
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s)
3750
ns3958.5
ns0.95
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s)
3458
ns3500
ns0.99
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA
19148
ns19634
ns0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal
475042
ns458041
ns1.04
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU
26470
ns28900
ns0.92
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s)
4250
ns4208
ns1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s)
4250
ns4375
ns0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s)
4500
ns4625
ns0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s)
4500
ns4167
ns1.08
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA
192719
ns197467.5
ns0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal
2182750
ns2168666
ns1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU
138367
ns138551.5
ns1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
6167
ns5208
ns1.18
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4833
ns4792
ns1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5833.5
ns7250
ns0.80
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4125
ns3792
ns1.09
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
139477
ns142334.5
ns0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal
1191667
ns1171167
ns1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU
59621
ns58781
ns1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9312.5
ns9125
ns1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8541
ns8833
ns0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8833
ns9125
ns0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8250
ns8250
ns1
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
804737
ns822603
ns0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal
7653458.5
ns7665708
ns1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU
389654.5
ns387763.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
206083
ns204042
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
210292
ns212000
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
210625
ns210875
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
202208
ns200958
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
36872
ns36985.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
841958
ns853417
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
205472
ns205912
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
654354
ns653187.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
673208.5
ns665958
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
622021
ns622770.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
623250
ns585667
ns1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
254216
ns260510
ns0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
8205667
ns8195083
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
802774
ns799653
ns1.00
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s)
3322833.5
ns3369291
ns0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s)
2326375
ns2332125
ns1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s)
2332417
ns2329166
ns1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s)
6281542
ns6307167
ns1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA
204376.5
ns205325
ns1.00
batchedmm(128, Bsize=128)/forward/GPU/Metal
6114500
ns6066541
ns1.01
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU
214983
ns212943
ns1.01
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s)
11470521
ns11648041
ns0.98
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s)
8342437.5
ns8330687.5
ns1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s)
8323000
ns8348104
ns1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s)
21067292
ns21116042
ns1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA
785999
ns734131.5
ns1.07
batchedmm(128, Bsize=128)/zygote/GPU/Metal
28332646
ns26082375
ns1.09
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU
1073402
ns1069061
ns1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
7562.5
ns4521
ns1.67
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4708
ns5208
ns0.90
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
6416
ns7583
ns0.85
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4208.5
ns5500
ns0.77
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
130660
ns132826.5
ns0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal
1185854.5
ns1175375
ns1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU
55851
ns55421
ns1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
8750
ns9292
ns0.94
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9541
ns8334
ns1.14
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10834
ns9562.5
ns1.13
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
12333
ns8604.5
ns1.43
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
712252
ns716825.5
ns0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal
7308583.5
ns7184437.5
ns1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU
372784
ns369984
ns1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
124416
ns98313
ns1.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
93999.5
ns125521
ns0.75
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
98583
ns100541
ns0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
99812.5
ns103500
ns0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
148811.5
ns149399
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal
2832625
ns2228333.5
ns1.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
186182
ns182342
ns1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2027833
ns2046104.5
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2027833.5
ns2031250
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2031312
ns1985791.5
ns1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1974000
ns2021416.5
ns0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
662744
ns674153.5
ns0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal
10775125
ns10587167
ns1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1253594
ns1250004
ns1.00
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s)
33875
ns34188
ns0.99
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s)
36708
ns36000
ns1.02
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s)
35000
ns35021
ns1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s)
750
ns833
ns0.90
batchedmm(2, Bsize=4)/forward/GPU/CUDA
15499
ns15860
ns0.98
batchedmm(2, Bsize=4)/forward/GPU/Metal
543666.5
ns553417
ns0.98
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU
76180
ns75761
ns1.01
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s)
3708
ns3083.5
ns1.20
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s)
4333
ns3541
ns1.22
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s)
3458.5
ns3625
ns0.95
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s)
2583.5
ns3375
ns0.77
batchedmm(2, Bsize=4)/zygote/GPU/CUDA
135016
ns140010.5
ns0.96
batchedmm(2, Bsize=4)/zygote/GPU/Metal
1502041
ns1942729.5
ns0.77
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU
356474
ns353624
ns1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7250
ns7000
ns1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5875
ns6041
ns0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6000
ns5958
ns1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9709
ns9958
ns0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
36031.5
ns35885
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
847396
ns854042
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
49931
ns50330
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
221584
ns223104
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
223416
ns234125
ns0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
221500
ns221250
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
219292
ns215667
ns1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
238688
ns243422
ns0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
7960125
ns8021021
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
520506
ns512516
ns1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3750
ns3750
ns1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3750
ns3750
ns1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3750
ns3750
ns1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3709
ns3709
ns1
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA
21870
ns22271.5
ns0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal
453375
ns468292
ns0.97
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU
43661
ns43460
ns1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
14750
ns14167
ns1.04
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
14666
ns14541
ns1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
14542
ns14583
ns1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
14459
ns14500
ns1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA
296421
ns303531
ns0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal
2342458
ns2253708.5
ns1.04
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU
189867.5
ns200012.5
ns0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
130583
ns99083
ns1.32
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
100333
ns128333.5
ns0.78
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
102250.5
ns103812
ns0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
122167
ns103958.5
ns1.18
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
131841
ns150020
ns0.88
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal
2397729
ns2875583
ns0.83
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
183912
ns195772
ns0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1931437.5
ns1887875.5
ns1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1901292
ns1929042
ns0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1914291.5
ns1884833
ns1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1655875.5
ns1894729
ns0.87
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
656012
ns670688
ns0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal
10340375
ns10463500
ns0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1066462
ns1065452
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
18125
ns18959
ns0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
17208.5
ns17354.5
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
20479
ns22208
ns0.92
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
17749.5
ns17541.5
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
104389.5
ns104525.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1377917
ns1362312.5
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
79261
ns79351
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
230583.5
ns252250
ns0.91
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
233166
ns260833
ns0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
219208.5
ns219458
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
218334
ns257937
ns0.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
490023
ns495429
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6087208
ns6195583
ns0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
468975
ns462125
ns1.01
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s)
24459
ns24958.5
ns0.98
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s)
32521
ns32604.5
ns1.00
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s)
28917
ns27500
ns1.05
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s)
1291.5
ns1208
ns1.07
batchedmm(16, Bsize=4)/forward/GPU/CUDA
15797
ns16021
ns0.99
batchedmm(16, Bsize=4)/forward/GPU/Metal
560437.5
ns533959
ns1.05
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU
87701
ns80071
ns1.10
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s)
5833
ns5250
ns1.11
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s)
6959
ns5854.5
ns1.19
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s)
5708
ns5792
ns0.99
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s)
5145.5
ns6125
ns0.84
batchedmm(16, Bsize=4)/zygote/GPU/CUDA
199406
ns201439.5
ns0.99
batchedmm(16, Bsize=4)/zygote/GPU/Metal
2041979
ns2014541.5
ns1.01
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU
394889.5
ns376235
ns1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
222542
ns221583
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
221708
ns222541.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
222750
ns226291
ns0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
221437.5
ns221875
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
217746
ns219232.5
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1704083
ns1686583
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
274143
ns271454
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
509083.5
ns559604
ns0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
504916.5
ns548354
ns0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
500250
ns500083.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
510916.5
ns498250
ns1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1021374
ns1034159
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
8740667
ns8587229
ns1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
857769.5
ns850955.5
ns1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
19750
ns19625
ns1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
18333
ns19313
ns0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
22500
ns23208
ns0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
19104.5
ns20583
ns0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
111476.5
ns111518.5
ns1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
1485375
ns1475625
ns1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
77575.5
ns80186
ns0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
225146
ns215020.5
ns1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
215291.5
ns250333
ns0.86
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
215562
ns214500
ns1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
217354.5
ns221729.5
ns0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
701275.5
ns708936
ns0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
7154208
ns7292833
ns0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
539526
ns539977
ns1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
7416.5
ns6166
ns1.20
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5916.5
ns6479
ns0.91
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
8958.5
ns8042
ns1.11
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5791
ns6417
ns0.90
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
131574
ns133623
ns0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal
1218250
ns1170916
ns1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
65831
ns66921
ns0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12062.5
ns12250
ns0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
13125
ns11729.5
ns1.12
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
14500
ns13334
ns1.09
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
16562.5
ns11645.5
ns1.42
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
768388
ns771416.5
ns1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal
7371292
ns7239334
ns1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
382304
ns391255
ns0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
7187.5
ns4500
ns1.60
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4771
ns5041.5
ns0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
7083
ns7042
ns1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
6375
ns5500
ns1.16
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
133143
ns134989.5
ns0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal
1214458
ns1146875
ns1.06
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
57010
ns58260
ns0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7687.5
ns7750
ns0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7625
ns7750
ns0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7708
ns8125
ns0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
7958.5
ns7709
ns1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
736643
ns738275
ns1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal
7612479.5
ns7536771
ns1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
390394
ns386245
ns1.01
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14470208.5 ns | 14664541 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10098583 ns | 10093041 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10105833 ns | 10106791 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27692583 ns | 27704625 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 536068 ns | 529053 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/GPU/Metal | 22376167 ns | 22466021 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 395754 ns | 401266 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46412458.5 ns | 46793583 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33522500 ns | 33459958.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33458958 ns | 33523667 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85300667 ns | 85429125 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2879954 ns | 2854223 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/GPU/Metal | 83664854.5 ns | 89341312.5 ns | 0.94 |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3291966 ns | 3309294 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 187833.5 ns | 188000 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 185250 ns | 186250 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 187792 ns | 188667 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 185292 ns | 185938 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 101714 ns | 101713 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1511792 ns | 1484500 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 230572 ns | 235268 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 598667 ns | 641812.5 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 641604.5 ns | 636958 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 590145.5 ns | 589208 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 598458.5 ns | 591771 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 702997.5 ns | 704450.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7556645.5 ns | 7517417 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 789679 ns | 785986 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 750 ns | 0.67 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 750 ns | 750 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 750 ns | 667 ns | 1.12 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 31490 ns | 32067 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 666250 ns | 651375 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47360 ns | 47241 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12458 ns | 9979 ns | 1.25 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9958 ns | 11521 ns | 0.86 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12270.5 ns | 10188 ns | 1.20 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10292 ns | 9500 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 276432 ns | 276358.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 6005333.5 ns | 5875459 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 380359 ns | 374075 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26291 ns | 26291 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26292 ns | 26291 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26750 ns | 26500 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26292 ns | 26209 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23151.5 ns | 23479 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 439917 ns | 437083 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 210222 ns | 210433 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67084 ns | 67042 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67292 ns | 68833 ns | 0.98 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67083 ns | 68917 ns | 0.97 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66958 ns | 67583 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 272442 ns | 274089 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 2242375 ns | 2210459 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 609967 ns | 606899 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203875 ns | 204500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210083 ns | 210417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209458 ns | 211125 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199541 ns | 200125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28002 ns | 27585 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 878708.5 ns | 861208 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 205762.5 ns | 205157.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 647687 ns | 652542 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 630416 ns | 671541 ns | 0.94 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 624563 ns | 624208 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 629666 ns | 580625 ns | 1.08 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 234240 ns | 236486 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9313917 ns | 9239500 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 850729 ns | 837472 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 655291 ns | 650083 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 642270.5 ns | 650625 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 617187.5 ns | 550709 ns | 1.12 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 610333 ns | 652708 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 187686 ns | 186884 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1354667 ns | 1405750 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 249712 ns | 234974 ns | 1.06 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2271521 ns | 2244125 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2229354.5 ns | 2249625 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2251625 ns | 2253687.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2105417 ns | 2232292 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 903523 ns | 908141 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9610542 ns | 9610291 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1362010.5 ns | 1356860 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20437.5 ns | 19479 ns | 1.05 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18708 ns | 20020.5 ns | 0.93 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22875 ns | 22000 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19604.5 ns | 20500 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 107237 ns | 107405.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1498312.5 ns | 1497959 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81410 ns | 82031 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 232333 ns | 259687.5 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232917 ns | 234896 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222708 ns | 223354.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 231584 ns | 222104 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 698969 ns | 701938 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7523124.5 ns | 7694083.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 552146 ns | 552123 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 750 ns | 0.67 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 750 ns | 750 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 667 ns | 0.87 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 22758 ns | 22889 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 720354 ns | 713250.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 49520 ns | 47681 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 11209 ns | 10833 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10708 ns | 11458 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16250 ns | 10958 ns | 1.48 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10000 ns | 11333 ns | 0.88 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 258661.5 ns | 258094.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6648083 ns | 6601250 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 400195 ns | 398396 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10292 ns | 8021 ns | 1.28 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8166 ns | 7916.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9771 ns | 10479 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8687.5 ns | 7771 ns | 1.12 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 114739 ns | 114650.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 1152208.5 ns | 1128833 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 69971 ns | 67611 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8167 ns | 8625 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 11375 ns | 9459 ns | 1.20 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8000 ns | 9334 ns | 0.86 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7479 ns | 10083 ns | 0.74 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 475261 ns | 474110.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4907479.5 ns | 4853125 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 325933 ns | 322085 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2375 ns | 2104.5 ns | 1.13 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2208 ns | 2375 ns | 0.93 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2562.5 ns | 2667 ns | 0.96 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2333 ns | 2125 ns | 1.10 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 19068 ns | 19503 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 427542 ns | 435896 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 191412 ns | 189822 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 8208 ns | 7666.5 ns | 1.07 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 7292 ns | 7083 ns | 1.03 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 7083 ns | 7771 ns | 0.91 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6583 ns | 8417 ns | 0.78 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 208633.5 ns | 209638.5 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 2362021 ns | 2304438 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 583987 ns | 579508 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 747083 ns | 749167 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 746666.5 ns | 749833.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 747771 ns | 747292 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 746625 ns | 748521 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 22542.5 ns | 22733 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 340104 ns | 312604 ns | 1.09 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 38681 ns | 37375.5 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 773000 ns | 778000 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 791312.5 ns | 807229 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 775416.5 ns | 774167 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 792583 ns | 776625 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 207047 ns | 207826 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2669500 ns | 2597208 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 233853 ns | 220633 ns | 1.06 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7209 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5917 ns | 6000 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6042 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 10042 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32722 ns | 32931 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 855791 ns | 855708.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48751 ns | 50540 ns | 0.96 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 230146 ns | 262833 ns | 0.88 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 264521 ns | 263396 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228791 ns | 229333 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212833 ns | 212854 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 253139 ns | 255573 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8191500 ns | 8358834 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 522136 ns | 524047.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 13021 ns | 12083 ns | 1.08 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11375 ns | 11959 ns | 0.95 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 14187 ns | 13583 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12395.5 ns | 12771 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 131017.5 ns | 132456 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 1184667 ns | 1189125 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 233542 ns | 233113 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25271 ns | 25021 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24167 ns | 25500 ns | 0.95 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25041 ns | 25458 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24333 ns | 24792 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 811743.5 ns | 815326 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 7729583 ns | 7701292 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 685097 ns | 681611 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9187.5 ns | 9562.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9958 ns | 9833 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10208 ns | 12000 ns | 0.85 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9458.5 ns | 9541.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 118607 ns | 118599 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 1259500 ns | 1229416 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 72621 ns | 74341 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15791.5 ns | 14375 ns | 1.10 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14125 ns | 20917 ns | 0.68 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15334 ns | 17250 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14583.5 ns | 15562.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 623973.5 ns | 626256 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5665292 ns | 5717062 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 372954 ns | 368145 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9771 ns | 9270.5 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9125 ns | 9208 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10917 ns | 11042 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9375 ns | 9145.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 117527.5 ns | 117653 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 1177500 ns | 1158958 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72381 ns | 73341 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13687 ns | 14062.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 16084 ns | 15125 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13000 ns | 15125 ns | 0.86 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12937.5 ns | 15146 ns | 0.85 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 517238 ns | 518369.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5092125 ns | 5051833 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 342213.5 ns | 340775 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 28209 ns | 27708 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 34167 ns | 33875 ns | 1.01 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 32375 ns | 31792 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2020.5 ns | 2229.5 ns | 0.91 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16305 ns | 16522 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/GPU/Metal | 4760375 ns | 4854041.5 ns | 0.98 |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 77641 ns | 78412 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 6458 ns | 5583 ns | 1.16 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 6625 ns | 5917 ns | 1.12 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5666 ns | 6084 ns | 0.93 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6417 ns | 7770.5 ns | 0.83 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 134001.5 ns | 136257 ns | 0.98 |
| batchedmm(2, Bsize=128)/zygote/GPU/Metal | 13311250 ns | 13273333 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 386434 ns | 379326 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 24308 ns | 24751 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 701250 ns | 682541.5 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48831 ns | 48791 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9291.5 ns | 7520.5 ns | 1.24 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8500 ns | 8583 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8625 ns | 8625 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6416.5 ns | 7458.5 ns | 0.86 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 180410.5 ns | 181857 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 6473583 ns | 6285375 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 388845 ns | 389326 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5708 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5916 ns | 6208 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6166 ns | 6000 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5958 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 25125 ns | 25394 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 723375 ns | 714417 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 207722 ns | 207474 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 22833 ns | 26375 ns | 0.87 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 23000 ns | 23250 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 23187.5 ns | 21459 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21625 ns | 20250 ns | 1.07 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 264109 ns | 262619.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6721166.5 ns | 6644125 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 692722.5 ns | 695681 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 149812.5 ns | 145625 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 147708 ns | 178292 ns | 0.83 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 149834 ns | 150417 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 147375 ns | 153812.5 ns | 0.96 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 185601 ns | 188204 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1536750 ns | 1588584 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 193682 ns | 190633 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1331000 ns | 1345771 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1332124.5 ns | 1331542 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1336000 ns | 1322333.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1317917 ns | 1167354 ns | 1.13 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 846555 ns | 856737 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9345604.5 ns | 9165250 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1090882 ns | 997975 ns | 1.09 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25313 ns | 24250 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23709 ns | 24458.5 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 26250 ns | 27084 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24500 ns | 24417 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 225227.5 ns | 225455 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1714291 ns | 1705354 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 103451 ns | 115742 ns | 0.89 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 154250 ns | 127500 ns | 1.21 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 144603.5 ns | 174187 ns | 0.83 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 122187.5 ns | 119042 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 118542 ns | 130375 ns | 0.91 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 967258 ns | 984493 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8776208.5 ns | 8679292 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 545716 ns | 591319 ns | 0.92 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 291 ns | 1.29 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22145 ns | 22641 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 722291.5 ns | 689208 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 49071 ns | 47290 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7292 ns | 7083.5 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8000 ns | 8083 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12458 ns | 6958 ns | 1.79 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6750 ns | 6500 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 196429.5 ns | 197931.5 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6548875 ns | 6549187.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 391269.5 ns | 395326.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7125 ns | 6333.5 ns | 1.12 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5333 ns | 5708 ns | 0.93 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6771 ns | 7541 ns | 0.90 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5833 ns | 6000 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 135991 ns | 137058.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 1202458 ns | 1181916.5 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 235063 ns | 232733 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10875 ns | 10833.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10166 ns | 10583 ns | 0.96 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10959 ns | 10416 ns | 1.05 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 9792 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 836132.5 ns | 841858 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 8143458 ns | 8090729 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 676977 ns | 672580 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1583 ns | 1625 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1542 ns | 1584 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1584 ns | 1625 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1583 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22716 ns | 22927 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 438354.5 ns | 429250 ns | 1.02 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 209702 ns | 208003 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5792 ns | 5917 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6084 ns | 6375 ns | 0.95 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5833 ns | 6125 ns | 0.95 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5750 ns | 5750 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 215208 ns | 217549 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 2221500 ns | 2167125 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 586451.5 ns | 581914.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8292 ns | 8562 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8541 ns | 8458 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8562.5 ns | 10291.5 ns | 0.83 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7875 ns | 8229.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 116817 ns | 116906 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 1243833 ns | 1209583 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 71641 ns | 77271.5 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10084 ns | 9104.5 ns | 1.11 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9041 ns | 15417 ns | 0.59 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9708 ns | 8792 ns | 1.10 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8625 ns | 8084 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 553365 ns | 557267.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5644167 ns | 5634417 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 347974 ns | 344656 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 126146 ns | 125125 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 129625 ns | 130729 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 129167 ns | 130250 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 180792 ns | 181042 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 45896 ns | 46296.5 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/GPU/Metal | 360083 ns | 364354 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 98361 ns | 100232 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 331812.5 ns | 309333 ns | 1.07 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 345666 ns | 342125 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 328917 ns | 313833 ns | 1.05 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 614041 ns | 570709 ns | 1.08 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 182821 ns | 185266 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/GPU/Metal | 2280959 ns | 1373875 ns | 1.66 |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 505920.5 ns | 506148 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397250 ns | 396437.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288041 ns | 289000 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288166 ns | 288375 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756000 ns | 756250 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43372 ns | 43482.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 445416 ns | 434458 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 81381 ns | 79761 ns | 1.02 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1449813 ns | 1408916.5 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1136895.5 ns | 1136979 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1132916 ns | 1132062 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2442792 ns | 2443000.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 244597 ns | 248184 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1978292 ns | 1965375 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 348724 ns | 349476 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 652792 ns | 645500 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 653208.5 ns | 650562.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 548209 ns | 546541.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 584396 ns | 545645.5 ns | 1.07 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 183317 ns | 173484 ns | 1.06 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1360500 ns | 1350375 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 252133 ns | 242424 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2466333.5 ns | 2520666.5 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2483562.5 ns | 2473750 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2467833 ns | 2447792 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2282833.5 ns | 2452584 ns | 0.93 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 931909 ns | 937381.5 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10376583 ns | 10132041 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1299404 ns | 1450713 ns | 0.90 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 32959 ns | 30500 ns | 1.08 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 36125 ns | 36187.5 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34625 ns | 34146 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 958 ns | 958 ns | 1 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15098 ns | 15458 ns | 0.98 |
| batchedmm(2, Bsize=32)/forward/GPU/Metal | 1378749.5 ns | 1293854 ns | 1.07 |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 79251 ns | 71001 ns | 1.12 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 4084 ns | 3084 ns | 1.32 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 4750 ns | 3958 ns | 1.20 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3917 ns | 3333 ns | 1.18 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3125 ns | 3042 ns | 1.03 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 132398 ns | 135380 ns | 0.98 |
| batchedmm(2, Bsize=32)/zygote/GPU/Metal | 5241250 ns | 5260562.5 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 338354 ns | 340585.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1459250 ns | 1460666 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1500166 ns | 1503375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1498875 ns | 1503000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1438479.5 ns | 1441729 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 42417 ns | 41871 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1219791 ns | 1242250 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 240737.5 ns | 239254 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5130542 ns | 5151979 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5293020.5 ns | 5296833.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5286125 ns | 5285437.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4986583 ns | 4980042 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 230731.5 ns | 230225 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11576750 ns | 11359208.5 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1242163 ns | 1233400 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3709 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3750 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3750 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3750 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33753 ns | 33654 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 376312 ns | 352750 ns | 1.07 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 39530 ns | 39741 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15708.5 ns | 15041 ns | 1.04 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15417 ns | 15709 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15750 ns | 15500 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15292 ns | 15375 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 249731.5 ns | 251748 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 1855541 ns | 1635667 ns | 1.13 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 171782 ns | 165632 ns | 1.04 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404167 ns | 401812.5 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 295875 ns | 296666 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 296083 ns | 295167 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760916 ns | 760709 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113370 ns | 113125 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 587375.5 ns | 574187 ns | 1.02 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89731 ns | 87471 ns | 1.03 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1479833 ns | 1429500 ns | 1.04 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1153270.5 ns | 1159833 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1161500 ns | 1157541 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2457520.5 ns | 2466395.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 240863 ns | 235512 ns | 1.02 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 2046250 ns | 1507125 ns | 1.36 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 353339 ns | 353405 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 959 ns | 1.13 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1000 ns | 1083 ns | 0.92 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1125 ns | 1042 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 958 ns | 1.13 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 24309 ns | 24950 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 718583 ns | 692770.5 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 209613 ns | 208254 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10000 ns | 7917 ns | 1.26 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10375 ns | 9916 ns | 1.05 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9916 ns | 8583 ns | 1.16 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 8042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 206553.5 ns | 202658.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 6653271 ns | 6448187.5 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 690112.5 ns | 697032 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 829375 ns | 831021 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 616937.5 ns | 619667 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 616312 ns | 618250 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1443333.5 ns | 1541417 ns | 0.94 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 132766 ns | 131643 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/GPU/Metal | 1745250 ns | 1716917 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 166192 ns | 166023 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2679709 ns | 2699312.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1975667 ns | 1995500 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 2000708 ns | 1985791 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4917479.5 ns | 4946958 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 234270.5 ns | 234057 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/Metal | 6784500 ns | 6761458 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 861380 ns | 852834 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 291 ns | 1.29 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 30986 ns | 32746 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 649291.5 ns | 642249.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 47661 ns | 47461 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9250 ns | 6208 ns | 1.49 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7666 ns | 9334 ns | 0.82 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9145.5 ns | 6708 ns | 1.36 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6458 ns | 6229 ns | 1.04 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 216879.5 ns | 223155 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 6034187.5 ns | 6000375 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 362223.5 ns | 361916 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1727271 ns | 1731292 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1726375 ns | 1754791 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1726708 ns | 1728874.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1718541.5 ns | 1745562.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 184752 ns | 190073 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1378104.5 ns | 1502437.5 ns | 0.92 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 355824 ns | 353886 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4378291.5 ns | 4404625 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4427500 ns | 4422041 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4404270.5 ns | 4362625 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4352625.5 ns | 4346521 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 828954 ns | 855907 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9565917 ns | 9512792 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1254843 ns | 1246280 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 7167 ns | 6875 ns | 1.04 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7000 ns | 17395.5 ns | 0.40 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7167 ns | 7250 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6896 ns | 6834 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 22145 ns | 22751 ns | 0.97 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 276666.5 ns | 272959 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 36790 ns | 37041 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 66916 ns | 33000 ns | 2.03 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 46000 ns | 68979.5 ns | 0.67 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 39708.5 ns | 33333 ns | 1.19 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 33041 ns | 45500 ns | 0.73 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 204844 ns | 212527.5 ns | 0.96 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 2631208.5 ns | 2608042 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 218562.5 ns | 221728.5 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 21375 ns | 23417 ns | 0.91 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 26250 ns | 25542 ns | 1.03 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 25542 ns | 23312.5 ns | 1.10 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5292 ns | 5625 ns | 0.94 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 17697 ns | 18456 ns | 0.96 |
| batchedmm(2, Bsize=512)/forward/GPU/Metal | 15018041 ns | 14791020.5 ns | 1.02 |
| batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 84241 ns | 89826.5 ns | 0.94 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 13041 ns | 11917 ns | 1.09 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 11875 ns | 11125 ns | 1.07 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 11000 ns | 10625 ns | 1.04 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 18250 ns | 17958 ns | 1.02 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 215801 ns | 223372.5 ns | 0.97 |
| batchedmm(2, Bsize=512)/zygote/GPU/Metal | 46206666 ns | 45999500 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 371364 ns | 382947 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 405666.5 ns | 403917 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 297167 ns | 297500 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 297083 ns | 297375 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762584 ns | 762334 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46537 ns | 47041 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 552000 ns | 533542 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 88131 ns | 89431 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1486542 ns | 1426250 ns | 1.04 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1163917 ns | 1164625 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1163479.5 ns | 1163125 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2469750 ns | 2468250 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 271051 ns | 281846 ns | 0.96 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2273937.5 ns | 2244750 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 377594 ns | 378111.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1484541 ns | 1487625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1526541 ns | 1529979.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1526583 ns | 1529729.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1461708 ns | 1464667 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 53340 ns | 54740 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1143959 ns | 1143667 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 236227.5 ns | 235424 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5132125 ns | 5146979 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5279312.5 ns | 5286395.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5291750.5 ns | 5251625 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4941792 ns | 4982541.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 250059.5 ns | 258236.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10397667 ns | 10236958 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1214863 ns | 1218755 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28250 ns | 28375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28458.5 ns | 28125 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28208 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28292 ns | 28333 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 23939 ns | 24960 ns | 0.96 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 446500 ns | 430583 ns | 1.04 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 211713 ns | 212483 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66708 ns | 66375 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67542 ns | 66542 ns | 1.02 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66500 ns | 67000 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66417 ns | 66584 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 332916 ns | 344216.5 ns | 0.97 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 2719875 ns | 2732875 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 656417.5 ns | 652061 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 86042 ns | 84500 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 131334 ns | 93000 ns | 1.41 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85417 ns | 85541 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80250 ns | 81042 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192392 ns | 190669 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2062187.5 ns | 2029208 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203152 ns | 183273 ns | 1.11 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2025521 ns | 2023313 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1850750 ns | 2010958 ns | 0.92 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2019000 ns | 1979291.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1835792 ns | 1995645.5 ns | 0.92 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 505719.5 ns | 520209.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9377312.5 ns | 9143521 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1086637 ns | 1082408 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
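
For readers checking the numbers: the last value in each row appears to be the first timing divided by the second, rounded to two decimals, so values above 1 indicate a slowdown. The sketch below reproduces that arithmetic for three rows copied from the results above. It assumes the first timing is the current commit and the second is the previous one; the `THRESHOLD` cutoff and the helper name `ratio` are hypothetical, chosen only for illustration, and are not taken from the workflow's configuration.

```python
# Minimal sketch: recompute the ratio column and flag large slowdowns.
# Assumption: ratio = current / previous (first timing over second timing).

def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current/previous rounded to two decimals."""
    return round(current_ns / previous_ns, 2)

# (benchmark name, current ns, previous ns), copied from the results above.
rows = [
    ("batchedmm(128, Bsize=4)/zygote/GPU/Metal", 2280959, 1373875),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)", 66916, 33000),
    ("bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)", 7000, 17395.5),
]

THRESHOLD = 1.5  # hypothetical cutoff, not the workflow's actual setting

# Keep only the rows whose ratio exceeds the cutoff.
flagged = [
    (name, ratio(cur, prev))
    for name, cur, prev in rows
    if ratio(cur, prev) > THRESHOLD
]

for name, r in flagged:
    print(f"{name}: {r:.2f}x slower than previous")
```

Many of the largest swings in the results are on benchmarks that run in a few microseconds or less, where run-to-run noise is comparable to the measurement itself, so a single ratio is weak evidence of a real regression.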