This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 5 changed files with 15 additions and 15 deletions.
c185f04
@JuliaRegistrator register
Registration pull request created: JuliaRegistries/General/113551
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:
LuxLib Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
6083
ns4937.5
ns1.23
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5417
ns5666
ns0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
8021
ns8042
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
6146
ns5687.5
ns1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
120417
ns120909
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal
812042
ns791750
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
424375
ns413945
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10250
ns10000
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9917
ns10250
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10125
ns9875
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
11792
ns9584
ns1.23
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
556460
ns558079
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal
2542833
ns2765041
ns0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
686027
ns664078
ns1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1500
ns1375
ns1.09
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
2792
ns1500
ns1.86
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1708.5
ns2083
ns0.82
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1583
ns1667
ns0.95
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
22218
ns21790
ns1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal
205792
ns216396
ns0.95
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU
29920
ns31411
ns0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
3542
ns4229.5
ns0.84
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4209
ns3666
ns1.15
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4271
ns4166
ns1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4229
ns4208
ns1.00
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
148035
ns149451
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal
1621188
ns1690125
ns0.96
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU
151742
ns153327
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
58542
ns58500
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46375
ns39500
ns1.17
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46584
ns47042
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83708
ns83333
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37608
ns37308.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1081917
ns1066021
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
84866
ns81381
ns1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2027833
ns2032541.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2085458
ns2086500
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2090292
ns2080042
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1999000
ns1986125
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
233327.5
ns235686.5
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
7717583
ns7909459
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1460226
ns1203034
ns1.21
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
145375
ns148292
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
147458
ns166416.5
ns0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
150584
ns150375
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
170437.5
ns153437
ns1.11
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166412
ns165231.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1615604.5
ns1574250
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
202872
ns180947
ns1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1119083.5
ns1115708.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1109000
ns1115583
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1118458
ns1111895.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1116145.5
ns1116604
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
707978
ns717544
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
5932000
ns5783062
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1046946
ns1033041
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5104.5
ns4958
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4250
ns4334
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5895.5
ns5667
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5624.5
ns4771
ns1.18
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
93783.5
ns95604
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal
721583.5
ns722292
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
70761
ns60161
ns1.18
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8875
ns8875
ns1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8792
ns8542
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9083
ns8583
ns1.06
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8750
ns8541
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
603451.5
ns618298
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal
6400917
ns6128688
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
388649.5
ns393664
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
20083
ns17312.5
ns1.16
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
18812.5
ns18834
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
20958
ns20375
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
18000
ns18291.5
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
68784
ns67939.5
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1334334
ns1353958
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
83861
ns76051
ns1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
224229.5
ns223334
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
219416
ns211917
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
219062.5
ns219896
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
212958
ns212084
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
360915.5
ns360648.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5929666
ns5859625
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
478315
ns479156
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
625
ns583.5
ns1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
709
ns666
ns1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
1041
ns834
ns1.25
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
625
ns708
ns0.88
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
21396
ns20628
ns1.04
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal
303750
ns300000
ns1.01
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU
32981
ns32990
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1458
ns1395.5
ns1.04
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1459
ns1541
ns0.95
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1542
ns1417
ns1.09
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1542
ns1334
ns1.16
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
127634
ns126083.5
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal
1626542
ns1618083
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU
138112
ns126566.5
ns1.09
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7333
ns7375
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6125
ns5375
ns1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6125
ns6166
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10333
ns9917
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
24384
ns23573
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
700271
ns626500
ns1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
46841
ns47160
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
221166
ns234375
ns0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
238834
ns242458
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
230666
ns270958
ns0.85
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
251250
ns251083
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
193817
ns185731
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
8912375
ns9574500
ns0.93
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
653712
ns623787
ns1.05
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4125
ns4083
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4125
ns4084
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4125
ns4125
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4083
ns4083
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
24189
ns23126
ns1.05
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal
223791
ns229375
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU
49151
ns48711
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16584
ns17041
ns0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16917
ns16500
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17042
ns17250
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16750
ns16833
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
199158
ns195934
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal
963270.5
ns972750
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU
176322
ns179972
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
512792
ns509541.5
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
404292
ns332334
ns1.22
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
404896
ns404875
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
864583
ns865104.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113852
ns113032
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal
448709
ns448145.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU
250173
ns248713
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2271145.5
ns2319667
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
2031292
ns1752729.5
ns1.16
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
2033750
ns2031958
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3280292
ns3283979.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
247459
ns244203
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal
2065875
ns2016625
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
765823
ns763594
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
7145.5
ns6042
ns1.18
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6958.5
ns6458
ns1.08
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
8541
ns7708
ns1.11
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6479.5
ns6333
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
93682.5
ns93025
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal
806084
ns901583
ns0.89
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
68781
ns62671
ns1.10
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11708.5
ns11583
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11875
ns10500
ns1.13
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11000
ns11458
ns0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12020.5
ns10979
ns1.09
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
642017
ns646277
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal
5707875
ns5976917
ns0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
421135
ns418444
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns541
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
500
ns541
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
24054
ns23258
ns1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal
228333
ns327292
ns0.70
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU
54330
ns52080
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2125
ns2083
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2084
ns2125
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns2125
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2083
ns2084
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
237805
ns221917
ns1.07
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal
1998833
ns2054166
ns0.97
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU
190172
ns182777
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
9333.5
ns9062.5
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
9104
ns9792
ns0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
10521
ns10167
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8959
ns8896
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
113550
ns106196.5
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
875353.5
ns876291.5
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
78760
ns75871
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16729.5
ns16854.5
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
18250
ns17624.5
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18104
ns18958
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18458
ns17187.5
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
643636
ns603520
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
5156541
ns5108208
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
396545
ns396274
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
459
ns500
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
625
ns625
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
500
ns583
ns0.86
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
35808
ns35127
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal
323000
ns475271
ns0.68
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
46571
ns49441
ns0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
10375
ns9792
ns1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9791.5
ns10125
ns0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
10375
ns10229.5
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
10750
ns9334
ns1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
262020
ns257318.5
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal
5294125
ns5182437.5
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
382009.5
ns380274
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
399000
ns397083
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
288125
ns215250
ns1.34
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
288292
ns287916
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
755625
ns756041
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
113561
ns111427
ns1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal
367729.5
ns363792
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU
77481
ns78871
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1393333
ns1454375
ns0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
1136083.5
ns859500
ns1.32
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
1131458.5
ns1129916
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2438041
ns2440417
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
212129
ns209113
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal
1596167
ns1658937.5
ns0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU
329854
ns328243
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7708
ns6916.5
ns1.11
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7458.5
ns7084
ns1.05
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
9000
ns8188
ns1.10
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7812
ns7042
ns1.11
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
159498.5
ns152190.5
ns1.05
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
481750
ns764458
ns0.63
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
60340
ns60511
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14667
ns16083.5
ns0.91
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15437.5
ns15625
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15479.5
ns14167
ns1.09
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14979
ns14333
ns1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
1030852
ns1030700
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
6424458
ns6599291
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
435905
ns440235
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
26958
ns25292
ns1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
25209
ns29625
ns0.85
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
27208
ns26500
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
24584
ns25291
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
228128
ns228373.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1045041.5
ns1026771
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
120221
ns118522
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103791
ns146334
ns0.71
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
150833
ns118854
ns1.27
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
148187
ns148958
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
116292
ns117125
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1163495
ns1207252
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6459417
ns6191583
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
607082
ns601256
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
76416
ns73833
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
81020.5
ns78104.5
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
85083
ns78021
ns1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
79625
ns77750
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
234622
ns234865
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
628124.5
ns534625
ns1.17
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
127432
ns125536.5
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
283166.5
ns305354
ns0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
316541
ns321166
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
302917
ns295667
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
315041.5
ns304625
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1204655
ns1245639
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6660083
ns6703875
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
700322.5
ns703602.5
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
16708.5
ns16812.5
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
17333
ns16334
ns1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
17854.5
ns17438
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
16479
ns17083
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
167006
ns166179
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
446708
ns615083
ns0.73
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
239982
ns240073
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26125
ns27771
ns0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26917
ns30354.5
ns0.89
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27208
ns26791
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
25333
ns26770.5
ns0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
1047898
ns1050438
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
6661333.5
ns6159041
ns1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
718328
ns717457
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11084
ns11125
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11625
ns11624.5
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13000
ns11958
ns1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11666.5
ns11229.5
ns1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
141188
ns139887
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
897333.5
ns817625
ns1.10
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
243182.5
ns244463
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
22145.5
ns20917
ns1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
21875
ns21833.5
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
22667
ns22563
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21792
ns21416.5
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
756695
ns755838.5
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
5374500
ns5465833
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
695018
ns695787.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
63937.5
ns69479
ns0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
63500
ns66041
ns0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
66042
ns66229
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
63666.5
ns67791.5
ns0.94
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
124307.5
ns119885
ns1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1367917
ns1370041.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
241283
ns239323
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
437854
ns484042
ns0.90
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
464833
ns465541
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
474208
ns439354.5
ns1.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
437729.5
ns437646
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
560487
ns558258
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6247083
ns6275458.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
733228
ns737788
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7104.5
ns7166.5
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7083
ns7792
ns0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8334
ns8375
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7604
ns7292
ns1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
163142
ns161904.5
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal
463833.5
ns456041
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
68371
ns61410
ns1.11
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14542
ns14708
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15396
ns17500
ns0.88
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15458.5
ns14646
ns1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14750
ns15250
ns0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
1022438
ns1023653
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal
6461041
ns6089229.5
ns1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
412334
ns412184
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
6159375
ns6148208
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
6372249.5
ns3227583
ns1.97
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
6374125
ns6378333
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
11910167
ns11914959
ns1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA
302029
ns301812
ns1.00
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU
302953
ns296489
ns1.02
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
19119687
ns19106770.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
19945437.5
ns11136250
ns1.79
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
20008771
ns19962416
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
36510208.5
ns36542271
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1019652
ns1158703
ns0.88
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU
1173152.5
ns1169188
ns1.00
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1000 ns | 958 ns | 1.04 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 958 ns | 958 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 959 ns | 959 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 959 ns | 959 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23843 ns | 23501 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 335916 ns | 329667 ns | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 215882 ns | 216992 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3625 ns | 3709 ns | 0.98 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3666 ns | 3666 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 ns | 3750 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3667 ns | 3667 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 300289 ns | 297158.5 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2148500 ns | 2191521 ns | 0.98 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 644731.5 ns | 650431.5 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8334 ns | 8500 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8104 ns | 9250.5 ns | 0.88 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9750 ns | 9396 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8500 ns | 8125 ns | 1.05 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 137456 ns | 136116.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 796375 ns | 819208 ns | 0.97 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 68311 ns | 67611 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11666 ns | 11250 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 12083 ns | 12667 ns | 0.95 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12583 ns | 11729 ns | 1.07 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 12750 ns | 11042 ns | 1.15 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 721292 ns | 721603 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5345750 ns | 5441770.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 373344 ns | 373594 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23235 ns | 22886 ns | 1.02 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 226791 ns | 331291 ns | 0.68 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 51721 ns | 51921 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2917 ns | 3000 ns | 0.97 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 3083 ns | 2958 ns | 1.04 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3250 ns | 3042 ns | 1.07 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2834 ns | 2875 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 216259.5 ns | 213807.5 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1692958 ns | 1713479.5 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 161612 ns | 168911.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11625 ns | 11833 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11229.5 ns | 11896 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13250 ns | 13000 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12166 ns | 12291 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 139967.5 ns | 137978.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 892584 ns | 900666.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 243863 ns | 239817.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 21083 ns | 23042 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20396 ns | 22104 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 26062.5 ns | 20458 ns | 1.27 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21604 ns | 23917 ns | 0.90 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 652418 ns | 653027.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4821708.5 ns | 4833270.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 672612 ns | 673012 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4417 ns | 4375 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4416 ns | 4375 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4333 ns | 4375 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24831 ns | 24516 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 223938 ns | 231666.5 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 52890.5 ns | 54410 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16333.5 ns | 16750 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16750 ns | 16167 ns | 1.04 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16583 ns | 16833 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16625 ns | 16459 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 356581 ns | 353559.5 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1752937.5 ns | 1092020.5 ns | 1.61 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 210052 ns | 216712 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 1958 ns | 1959 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 1917 ns | 1959 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2166 ns | 2083 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2084 ns | 2125 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 36754.5 ns | 35968 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 299041 ns | 444042 ns | 0.67 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 208032 ns | 208282 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 16958.5 ns | 17958.5 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 19042 ns | 16958 ns | 1.12 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 17458 ns | 17333.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 18062.5 ns | 17646 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 307642 ns | 305401 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5677458.5 ns | 5381292 ns | 1.06 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 709468 ns | 703887 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59125 ns | 59250 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 66208 ns | 60625 ns | 1.09 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 66083.5 ns | 64167 ns | 1.03 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51334 ns | 51291 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66592 ns | 66533 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 113701 ns | 101811 ns | 1.12 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 210458 ns | 196208 ns | 1.07 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 143000 ns | 139333 ns | 1.03 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 119583 ns | 155270.5 ns | 0.77 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 307688 ns | 285354 ns | 1.08 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 234156 ns | 231110.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 598956 ns | 587041 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 123833.5 ns | 82771 ns | 1.50 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 123125 ns | 87959 ns | 1.40 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86500 ns | 85959 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82958 ns | 81812.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190129 ns | 192437.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1825667 ns | 2001125 ns | 0.91 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 188412 ns | 172856.5 ns | 1.09 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1927375 ns | 1915166.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1909416.5 ns | 1905625 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1906875 ns | 1906791.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1931021 ns | 1867521 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 578778.5 ns | 575411.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9303959 ns | 9319437.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1081141.5 ns | 1079271.5 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22349 ns | 21855 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 372291 ns | 370125 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 45590 ns | 45340 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1833 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1833 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 272164 ns | 268250 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1469500 ns | 1115417 ns | 1.32 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 187152 ns | 183362 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9250 ns | 8250 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8708 ns | 11209 ns | 0.78 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11166 ns | 9708 ns | 1.15 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 10459 ns | 9084 ns | 1.15 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 134628.5 ns | 135375 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 897749.5 ns | 905833 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 241763 ns | 242893 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10125 ns | 12042 ns | 0.84 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8458.5 ns | 11125 ns | 0.76 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 14375 ns | 8833 ns | 1.63 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9542 ns | 12083 ns | 0.79 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 584537 ns | 583740 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4632562 ns | 4734458 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 645752 ns | 652127 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58375 ns | 58458 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46625 ns | 39584 ns | 1.18 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46708 ns | 47104.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82000 ns | 83084 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40806 ns | 39769 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1140854.5 ns | 1151625 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 78371 ns | 78761 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1934584 ns | 1929833 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1981708 ns | 1940687 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1989334 ns | 1942312.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1899750 ns | 1910500 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 239556 ns | 236370 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11301583 ns | 11016708.5 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1030691 ns | 1026191 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 422125 ns | 417125 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 417583 ns | 417396 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 419750 ns | 419312.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 416292 ns | 416334 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 241184 ns | 238661.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 546083 ns | 553834 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 289943 ns | 288843 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 752875.5 ns | 709000 ns | 1.06 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 755666 ns | 734313 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 675729 ns | 671250 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 760021 ns | 669791.5 ns | 1.13 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1151706 ns | 1151563 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6939708 ns | 6696083 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 927380 ns | 931160 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3457437.5 ns | 3399479.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3437021 ns | 3363750 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3434709 ns | 3425625 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3439146 ns | 3391083.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 201324 ns | 177139 ns | 1.14 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1424084 ns | 1423625 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 412665 ns | 416864 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6238000 ns | 6186791 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6200250 ns | 6198687.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6194458 ns | 6090875 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6143770.5 ns | 6187875 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1091727.5 ns | 1083853 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8063541.5 ns | 8058500 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1569386 ns | 1565741.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 473666 ns | 471667 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 340792 ns | 253791 ns | 1.34 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 342166 ns | 342583 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 905125 ns | 902708 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46953 ns | 46521 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 496959 ns | 448250 ns | 1.11 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 251203 ns | 251212 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2275334 ns | 2350667 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2043625 ns | 1761583.5 ns | 1.16 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2032437 ns | 2037792 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3282416.5 ns | 3284625 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 283225 ns | 258155.5 ns | 1.10 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2237145.5 ns | 2294875 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 791808 ns | 791358 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57833 ns | 58292 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 45958 ns | 39584 ns | 1.16 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46250 ns | 46542 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82792 ns | 82833 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28918 ns | 27855 ns | 1.04 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1145250 ns | 1156292 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 78811 ns | 77241 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2000229 ns | 2035459 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2089833 ns | 2077875 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2077250 ns | 2072875 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1980437.5 ns | 1932083 ns | 1.03 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 244212 ns | 241361.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11407979 ns | 11703125 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1055251 ns | 1056652 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58000 ns | 58333 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46250 ns | 39458 ns | 1.17 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46666 ns | 46834 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83041 ns | 83250 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 50656 ns | 49658 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1123000 ns | 1110916 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 73121 ns | 75300.5 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1903916 ns | 1894292 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1902541 ns | 1940666 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1978250 ns | 1969937.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1902959 ns | 1886875 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 251664 ns | 247040 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9794437.5 ns | 9839292 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 936124.5 ns | 1051031 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 250 ns | 1.33 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 35119.5 ns | 34603 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 308104.5 ns | 433687.5 ns | 0.71 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 50550 ns | 49160 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7937.5 ns | 6958 ns | 1.14 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7625 ns | 7250 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7521 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8167 ns | 7208.5 ns | 1.13 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 218323.5 ns | 210766 ns | 1.04 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4836354 ns | 5193791.5 ns | 0.93 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 381674 ns | 378014 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 250 ns | 1.16 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 33417 ns | 32342 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 259375 ns | 261521 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 43851 ns | 39160 ns | 1.12 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2792 ns | 2666 ns | 1.05 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2875 ns | 2667 ns | 1.08 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2916 ns | 2834 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2667 ns | 2625 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 205231.5 ns | 202783.5 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 1294875 ns | 969250 ns | 1.34 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 166746 ns | 154716.5 ns | 1.08 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 437042 ns | 457625 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 422021 ns | 453792 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 424229 ns | 426146 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 425834 ns | 456125 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 142985.5 ns | 142160 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2238375 ns | 2271875 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 375684 ns | 326853 ns | 1.15 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3809770.5 ns | 3802938 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3802375 ns | 3809708 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3804250 ns | 3801896 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3793125 ns | 3792625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 782254 ns | 781504 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11146187.5 ns | 11052792 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1312364 ns | 1495896 ns | 0.88 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49907416.5 ns | 49881521 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35559584 ns | 26009250 ns | 1.37 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35529250 ns | 35546334 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96899084 ns | 96980062.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1625871 ns | 1600432 ns | 1.02 |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1003290 ns | 1012971 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154966354 ns | 154537104 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112363000 ns | 88927125 ns | 1.26 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112555750 ns | 112528667 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 296527604.5 ns | 298524146 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6450345 ns | 6474447 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5530212.5 ns | 5518798 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 19374.5 ns | 19062.5 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 18750 ns | 15542 ns | 1.21 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 17353.5 ns | 17042 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15188 ns | 16021 ns | 0.95 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 20779 ns | 20743 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 224333 ns | 252583 ns | 0.89 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 26660 ns | 26040 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 10917 ns | 10917 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 8834 ns | 7416 ns | 1.19 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 9291 ns | 9208 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17291 ns | 17375 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 299343 ns | 296392 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 1655375 ns | 1636083.5 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 155331 ns | 155431 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8312.5 ns | 8729 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8459 ns | 9000 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10895.5 ns | 9229.5 ns | 1.18 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 9312.5 ns | 8562.5 ns | 1.09 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 142637 ns | 139671.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 798083 ns | 799833 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 241143 ns | 242752 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10333.5 ns | 9312.5 ns | 1.11 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9042 ns | 9416 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9583 ns | 10167 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8937.5 ns | 9250 ns | 0.97 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 705801.5 ns | 704800.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5435917 ns | 5428520.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 657647 ns | 674252.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9020.5 ns | 9333 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10229 ns | 9709 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11250 ns | 10625 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9792 ns | 9250 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 137059 ns | 136210.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 882166.5 ns | 947792 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 78120 ns | 69541 ns | 1.12 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13020.5 ns | 13062.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12583.5 ns | 13542 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13583 ns | 13916.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13458 ns | 13000 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 651470.5 ns | 647891 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4779312.5 ns | 4788583 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 356033 ns | 349204 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 459 ns | 500 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 458 ns | 459 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 625 ns | 584 ns | 1.07 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 459 ns | 500 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 35430 ns | 34950 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 385417 ns | 441000 ns | 0.87 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 210072 ns | 208662 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8166 ns | 7916 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8000 ns | 8000 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8937.5 ns | 8729.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8208 ns | 8417 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 238141 ns | 235567 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5550500 ns | 5655333.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 670717 ns | 664097 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 16417 ns | 16375 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 16709 ns | 14604.5 ns | 1.14 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 15209 ns | 14708 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10312.5 ns | 10459 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 21707 ns | 21454 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 217458 ns | 214750 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 194532 ns | 188482 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 31854.5 ns | 31708 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 32167 ns | 31875 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 32250 ns | 32146 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 32125 ns | 31917 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 316460 ns | 314264 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 1889916 ns | 1721916 ns | 1.10 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 608847 ns | 610347 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 450417 ns | 441229.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 482813 ns | 445062.5 ns | 1.08 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 444604 ns | 447666 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 440875 ns | 446000 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193879 ns | 194324 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2124500 ns | 2129687.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 376794 ns | 356014 ns | 1.06 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3673458 ns | 3806062.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3802062.5 ns | 3830125 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3822709 ns | 3819020.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3821333 ns | 3829625.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 588897 ns | 580459 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9577042 ns | 10082833.5 ns | 0.95 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1393435 ns | 1390109 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 783185125 ns | 833503354 ns | 0.94 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 542907542 ns | 415838000 ns | 1.31 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 543132625 ns | 544434542 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1514951833.5 ns | 1561715250 ns | 0.97 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22763713 ns | 22756243 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14159478.5 ns | 14023836 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2527739209 ns | 2997704083 ns | 0.84 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1799023667 ns | 1512242750 ns | 1.19 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1787795417 ns | 2248995791 ns | 0.79 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4787274417 ns | 5261167167 ns | 0.91 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 333649192 ns | 364718000 ns | 0.91 |
| batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 88087394 ns | 87342499 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 76666.5 ns | 77833 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 79083 ns | 76542 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 79375 ns | 78708 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 78124.5 ns | 76354.5 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 238895.5 ns | 235898 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 542209 ns | 551041.5 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 111271 ns | 109786.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 277000 ns | 282312.5 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 278895.5 ns | 251104 ns | 1.11 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 194979 ns | 197208 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 259250 ns | 192416 ns | 1.35 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1134646.5 ns | 1133383 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6160709 ns | 6595833 ns | 0.93 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 645127 ns | 643627 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199977437.5 ns | 199406375 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 139216750 ns | 104150500 ns | 1.34 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 139454459 ns | 139302333 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 389873250 ns | 388728500 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5849131.5 ns | 5827807.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3425810.5 ns | 3416565 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 621409333 ns | 621451500.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 440537375 ns | 353591958 ns | 1.25 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 440145604 ns | 438706083.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1186223625 ns | 1195242542 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26711378 ns | 26241215 ns | 1.02 |
| batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 21741902 ns | 21717195 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7291 ns | 7250 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6084 ns | 5292 ns | 1.15 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6291 ns | 6000 ns | 1.05 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10292 ns | 10042 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28202.5 ns | 27646 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 601583 ns | 620417 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48405.5 ns | 50410 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220749.5 ns | 213208 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222374.5 ns | 221104.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222542 ns | 221854 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217625 ns | 216000 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 245623 ns | 239232 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8971334 ns | 9004750 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 543906 ns | 536025 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8145.5 ns | 8333.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 10083 ns | 10250 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10833.5 ns | 9937.5 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 10000.5 ns | 9416 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 136003.5 ns | 133822.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 906833 ns | 904312 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 72945.5 ns | 72841 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7667 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7209 ns | 7917 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8292 ns | 8167 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7500 ns | 7833 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 587405 ns | 581095 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4757959 ns | 4731020.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 326203 ns | 326163 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 459 ns | 1.09 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 458 ns | 500 ns | 0.92 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 584 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 375 ns | 500 ns | 0.75 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 26999 ns | 26581 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 493458.5 ns | 473959 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 49231 ns | 49351 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9458 ns | 10166 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10250 ns | 10334 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10521.5 ns | 10416 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 9584 ns | 1.06 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 275766.5 ns | 272007 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 6076395.5 ns | 5995833.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 401444 ns | 394569 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 107104.5 ns | 107229.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 99896 ns | 85749.5 ns | 1.16 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 101145.5 ns | 99417 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 146459 ns | 146291 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 24813 ns | 24482 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 277416.5 ns | 274937.5 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 192192 ns | 192342 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 479500 ns | 478334 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 494084 ns | 500041 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 478958 ns | 478375 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 528667 ns | 478708 ns | 1.10 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 258431 ns | 255734 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2276458 ns | 2286625 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 624467 ns | 624721 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5750.5 ns | 4937.5 ns | 1.16 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 6917 ns | 7000 ns | 0.99 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 6833.5 ns | 7792 ns | 0.88 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 4458 ns | 4333 ns | 1.03 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18139 ns | 16407 ns | 1.11 |
| batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 73231 ns | 78321 ns | 0.94 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 11854 ns | 11542 ns | 1.03 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 10500.5 ns | 9666.5 ns | 1.09 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 11104.5 ns | 10792 ns | 1.03 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 17083 ns | 16958 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 235890 ns | 233195.5 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 372074 ns | 378594 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 38750 ns | 39417 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 51292 ns | 50250 ns | 1.02 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 52729.5 ns | 51417 ns | 1.03 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 15834 ns | 13833 ns | 1.14 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 20456 ns | 19791.5 ns | 1.03 |
| batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 87011 ns | 85261 ns | 1.02 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 36875 ns | 51020.5 ns | 0.72 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 34729 ns | 28646.5 ns | 1.21 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 32167 ns | 31146.5 ns | 1.03 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 57000 ns | 64625 ns | 0.88 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 212876 ns | 208902 ns | 1.02 |
| batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 418835 ns | 415884.5 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1791 ns | 1875 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1708 ns | 1667 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2187.5 ns | 2250 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1875 ns | 1750 ns | 1.07 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 20570.5 ns | 20332 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 329917 ns | 324854.5 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 31020 ns | 28921 ns | 1.07 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2209 ns | 2083 ns | 1.06 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2250 ns | 2208 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2500 ns | 2291 ns | 1.09 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2208 ns | 2250 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 226270.5 ns | 222981 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 1683458.5 ns | 1764708 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 142136.5 ns | 139241 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5042 ns | 4667 ns | 1.08 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4500 ns | 4750 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6208.5 ns | 5750 ns | 1.08 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4666.5 ns | 4333 ns | 1.08 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 163224.5 ns | 161766.5 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 800792 ns | 453291.5 ns | 1.77 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 75611 ns | 62650 ns | 1.21 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8291 ns | 8667 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8209 ns | 7958 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8583 ns | 8250 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 8166 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 960930 ns | 958412 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5752708 ns | 5932250.5 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 398144 ns | 385774 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56791 ns | 57250 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57459 ns | 56916 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57667 ns | 58250 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58208 ns | 58667 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 38436 ns | 37674 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 411813 ns | 380459 ns | 1.08 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 218852 ns | 208842 ns | 1.05 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 448812.5 ns | 449562.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 499084 ns | 466895.5 ns | 1.07 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 465709 ns | 465833 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 481396 ns | 434708 ns | 1.11 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 282356.5 ns | 276347 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7964500 ns | 8199750 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 842729 ns | 814928 ns | 1.03 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3322916 ns | 3302500 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 2338771 ns | 1770792 ns | 1.32 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2339375 ns | 2337291.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6304166.5 ns | 6303499.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 204545 ns | 204292.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 202912 ns | 203467.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11552375 ns | 11464458 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 8313541.5 ns | 6552083 ns | 1.27 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8336875 ns | 8324666.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21101437.5 ns | 21058833.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 734673 ns | 741274.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1078791.5 ns | 1081561 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6166 ns | 4583 ns | 1.35 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4916.5 ns | 5667 ns | 0.87 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6541 ns | 6104 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4875 ns | 4854.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 158133 ns | 156234.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 887167 ns | 827584 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 57035.5 ns | 58490 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7166 ns | 7750 ns | 0.92 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7209 ns | 7042 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7292 ns | 7416 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7083 ns | 6959 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 816855 ns | 812002.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 6166979.5 ns | 5657917 ns | 1.09 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 384744.5 ns | 382614 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 123458 ns | 95791 ns | 1.29 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 131229 ns | 98000 ns | 1.34 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 100000 ns | 125333 ns | 0.80 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 94625 ns | 98604 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 160516.5 ns | 158376.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2207458 ns | 2249500 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 187112 ns | 189172 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1964000 ns | 2001625 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2023146 ns | 1968041.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2028667 ns | 2021312.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2018916.5 ns | 2030708.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 789517 ns | 779642 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11417250 ns | 11090459 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1260093 ns | 1124561.5 ns | 1.12 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 33813 ns | 34541.5 ns | 0.98 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 36729 ns | 35875 ns | 1.02 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 34708.5 ns | 33958 ns | 1.02 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 667 ns | 625 ns | 1.07 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15818 ns | 15484 ns | 1.02 |
| batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 82161 ns | 80681 ns | 1.02 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2583 ns | 2667 ns | 0.97 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2709 ns | 2791 ns | 0.97 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2959 ns | 3000 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2125 ns | 2250 ns | 0.94 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 152979.5 ns | 148962 ns | 1.03 |
| batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 352884 ns | 353673 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7291 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5292 ns | 1.14 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6125 ns | 6084 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 10125 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37656 ns | 36617 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 431042 ns | 574854 ns | 0.75 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49591 ns | 49650 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214000 ns | 213333.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232937.5 ns | 220792 ns | 1.06 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221834 ns | 221208.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 232000 ns | 214813 ns | 1.08 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 258714 ns | 253557.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7857271 ns | 7960167 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 526085 ns | 522265 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22767 ns | 22029 ns | 1.03 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 244500 ns | 250333 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 47941 ns | 45980 ns | 1.04 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14667 ns | 14958 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15000 ns | 14625 ns | 1.03 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14959 ns | 14958 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14959 ns | 14875 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 344878 ns | 339194 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1074437.5 ns | 1025166 ns | 1.05 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 201792 ns | 196272 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 120021 ns | 103041 ns | 1.16 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 98958.5 ns | 125083 ns | 0.79 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 104666.5 ns | 132667 ns | 0.79 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 144250 ns | 100249.5 ns | 1.44 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 160419 ns | 160077 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2228291 ns | 2853125 ns | 0.78 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 170682 ns | 205222 ns | 0.83 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1891375 ns | 1923874.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1833541.5 ns | 1935583 ns | 0.95 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1894375 ns | 1923375 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1924667 ns | 1927062.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 772105.5 ns | 765025 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10866208 ns | 10829375 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1240333 ns | 1233282 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20250 ns | 18791 ns | 1.08 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18937.5 ns | 18729.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20542 ns | 20145.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20208 ns | 19646 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 127944 ns | 123582.5 ns | 1.04 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1385750 ns | 1393500 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 82111 ns | 76250 ns | 1.08 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 216708 ns | 215917 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 255583 ns | 216688 ns | 1.18 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 218146 ns | 219083 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217458 ns | 216250 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 580859 ns | 569648 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6240292 ns | 6226521 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 484605 ns | 496345 ns | 0.98 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 25687 ns | 25312.5 ns | 1.01 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 31687.5 ns | 28312.5 ns | 1.12 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 29145.5 ns | 29041 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1541 ns | 1458 ns | 1.06 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 17059 ns | 16184 ns | 1.05 |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 83471 ns | 88291 ns | 0.95 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4896 ns | 4875 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4687.5 ns | 4895.5 ns | 0.96 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5208 ns | 5437.5 ns | 0.96 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4708 ns | 4875 ns | 0.97 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 231729 ns | 227416 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 400815 ns | 387604 ns | 1.03 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 304916 ns | 305625 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 307083.5 ns | 305812.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 310250 ns | 309146 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 307458 ns | 307792 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 260954.5 ns | 259343.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1003667 ns | 655771 ns | 1.53 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 282392 ns | 277977.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 530417 ns | 532041 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 536417 ns | 530083 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 533416.5 ns | 538458 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 540917 ns | 533250 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1194615.5 ns | 1187558.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6650583.5 ns | 6496375 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 886938 ns | 870989 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19292 ns | 19792 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20437.5 ns | 21104 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21583 ns | 22312.5 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19250 ns | 20542 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 134679 ns | 131573 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1513292 ns | 1498125 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76825.5 ns | 75971 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215083 ns | 214917 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212625 ns | 213083 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215021 ns | 213958 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 249312.5 ns | 212500 ns | 1.17 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 889532 ns | 880154.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7210062.5 ns | 7325541 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 554056 ns | 546485 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6583 ns | 6334 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6937.5 ns | 7458 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9208 ns | 8083 ns | 1.14 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6792 ns | 6583 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 160487 ns | 157693 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 869792 ns | 839500 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 69890 ns | 69580 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10000 ns | 11041 ns | 0.91 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9854.5 ns | 9917 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10292 ns | 10729 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10375 ns | 10209 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 896806 ns | 890019.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5937375 ns | 5554084 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 398234 ns | 391634 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4125 ns | 4667 ns | 0.88 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5542 ns | 5124.5 ns | 1.08 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6750 ns | 5833 ns | 1.16 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4750 ns | 6458 ns | 0.74 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 162556 ns | 161059 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 844750 ns | 822917 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 62561 ns | 62251 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7500 ns | 1 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 7333 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7667 ns | 7708 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7166 ns | 7166 ns | 1 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 844691 ns | 835582 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5794250.5 ns | 5986084 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 401898.5 ns | 405494.5 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14528708 ns | 14490500 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10144083 ns | 7719208 ns | 1.31 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10119791 ns | 10131041 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27783209 ns | 27827208 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 561716 ns | 529747 ns | 1.06 |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 405538.5 ns | 389754 ns | 1.04 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46624812 ns | 46259291.5 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33411666.5 ns | 26496000 ns | 1.26 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33562500 ns | 33451708 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85401583 ns | 85583541 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2800168 ns | 2650995 ns | 1.06 |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3289235 ns | 3276734 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 66500 ns | 68250 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 68375 ns | 68104.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 68875 ns | 69312 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 67250 ns | 66125 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 138855.5 ns | 134037 ns | 1.04 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1526666.5 ns | 1521625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 238492 ns | 232902 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 444500.5 ns | 449854 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 442146 ns | 440625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 441583 ns | 442209 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 493750 ns | 440396 ns | 1.12 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 807637.5 ns | 796931.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7704542 ns | 7473000 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 803267.5 ns | 813753.5 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 625 ns | 0.87 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 542 ns | 1.15 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 666 ns | 625 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 33435 ns | 31856 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 422917 ns | 476979 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 52200 ns | 51801 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9458.5 ns | 10937.5 ns | 0.86 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10771 ns | 9542 ns | 1.13 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10416.5 ns | 10312.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9666.5 ns | 10895.5 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 303460.5 ns | 298325 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5666958.5 ns | 5492937.5 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 397994 ns | 381794 ns | 1.04 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9875 ns | 9833 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9916 ns | 9834 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9792 ns | 9834 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9792 ns | 9792 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24011 ns | 23467 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 225541 ns | 227250 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 218962 ns | 218063 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 46000 ns | 46250 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 46167 ns | 45750 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46416 ns | 46625 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 46334 ns | 46625 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 315869 ns | 308147 ns | 1.03 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 1098270.5 ns | 981645.5 ns | 1.12 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 628475.5 ns | 625806 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56250 ns | 56500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57208 ns | 56333 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57167 ns | 57292 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57833 ns | 57958 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 29662 ns | 28681 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 616041 ns | 679666.5 ns | 0.91 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 218842 ns | 206472 ns | 1.06 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 450333 ns | 454625 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 473958 ns | 465208 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 468792 ns | 467459 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 442709 ns | 435834 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 260564.5 ns | 255741 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9323750 ns | 9276416.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 849638 ns | 857403.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 607437.5 ns | 647417 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 677167 ns | 646792 ns | 1.05 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 619062.5 ns | 649354.5 ns | 0.95 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 645083.5 ns | 663709 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 227369 ns | 225589 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1393791.5 ns | 1395125 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 251853 ns | 235913 ns | 1.07 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2229542 ns | 2227104.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2242667 ns | 2251250 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2238417 ns | 2225292 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2233500 ns | 2242250 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1055691 ns | 1068301.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7106083 ns | 7711771 ns | 0.92 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1380353 ns | 1379184 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20396 ns | 22916 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20625 ns | 20146 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21333.5 ns | 21833 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23708 ns | 20709 ns | 1.14 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 128483 ns | 127032 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1530250 ns | 1515770.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 82281 ns | 84371 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219229.5 ns | 253853.5 ns | 0.86 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 223875 ns | 220458 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221083 ns | 221000 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219917 ns | 219020.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 851484 ns | 840768 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7710292 ns | 7691791.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 562290 ns | 560576 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 584 ns | 0.86 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23568 ns | 22755 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 453729.5 ns | 466021 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 50170 ns | 50411 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10937.5 ns | 11229.5 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10479 ns | 10542 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 11166 ns | 10771 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10208 ns | 10312 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 278881.5 ns | 277858 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6153209 ns | 6009125 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 418644 ns | 412724 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10416.5 ns | 9209 ns | 1.13 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9500 ns | 10125 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11000 ns | 9750 ns | 1.13 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8958 ns | 8917 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 137213 ns | 135766 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 886500 ns | 904792 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 74561 ns | 67721 ns | 1.10 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7458 ns | 7750 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7750 ns | 7666 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8208 ns | 8312.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7541 ns | 7500 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 553485 ns | 551973.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4191417 ns | 4446687.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 340023 ns | 336393 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1791.5 ns | 1437.5 ns | 1.25 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1500 ns | 1.08 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2083 ns | 2000.5 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1354.5 ns | 1.17 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21340 ns | 21147 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 310625 ns | 311875 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 192401.5 ns | 190276.5 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3292 ns | 3333 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3375 ns | 3333 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3583 ns | 3458 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3375 ns | 3333.5 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 244685 ns | 241366 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1830688 ns | 1889917 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 598576 ns | 597216 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148667 ns | 148042 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 128833 ns | 106084 ns | 1.21 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 129604 ns | 128375.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225042 ns | 225104 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 24647 ns | 24502.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 278416 ns | 306333 ns | 0.91 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 37400 ns | 36970 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 143709 ns | 174999.5 ns | 0.82 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 124625 ns | 87125 ns | 1.43 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 110395.5 ns | 110792 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 287812.5 ns | 250729 ns | 1.15 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 242298 ns | 240885.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2059479 ns | 2110083.5 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 238587 ns | 226383 ns | 1.05 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7125 ns | 7250 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5292 ns | 1.13 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6084 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10062.5 ns | 10083 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33200 ns | 32889 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 358750 ns | 369062.5 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 52880 ns | 51151 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220291 ns | 223583 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 231500 ns | 228584 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229125 ns | 228917 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 245229.5 ns | 213604 ns | 1.15 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 272719 ns | 270279 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8345291.5 ns | 8277437.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 536095 ns | 534116 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15417 ns | 14833 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15167 ns | 15125 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 17042 ns | 16500 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15375 ns | 15917 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 158597.5 ns | 157359.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 852042 ns | 824458 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 242502 ns | 240222 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23937.5 ns | 23687 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24666 ns | 23500 ns | 1.05 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23874.5 ns | 23854 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23291.5 ns | 23292 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 931616 ns | 926538.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5615896 ns | 5882625 ns | 0.95 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 698756 ns | 690662 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9958 ns | 9812.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 10166 ns | 9542 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12333 ns | 10583 ns | 1.17 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9083 ns | 10167 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 141537 ns | 140467 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 805292 ns | 821479 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 77251 ns | 71471 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14083 ns | 13917 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14250 ns | 13166 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14208.5 ns | 14458 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13375 ns | 13583 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 768706 ns | 766881 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5278042 ns | 5288584 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 378343 ns | 372183.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10104.5 ns | 9459 ns | 1.07 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10000 ns | 9542 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11354 ns | 10958 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9791.5 ns | 10333 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 139922.5 ns | 138627.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 897959 ns | 927624.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 77161 ns | 72865.5 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12333 ns | 12583 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12709 ns | 12646.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13000 ns | 13083.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12833.5 ns | 11937.5 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 626138 ns | 624799.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4505687.5 ns | 4551375 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 350573 ns | 348243 ns | 1.01 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 27729 ns | 31083.5 ns | 0.89 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 35375 ns | 32937.5 ns | 1.07 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 32291 ns | 31583 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2041 ns | 2042 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16815 ns | 16203 ns | 1.04 |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 83101 ns | 73550 ns | 1.13 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5291.5 ns | 5229.5 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5146 ns | 5063 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5209 ns | 5562.5 ns | 0.94 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6229.5 ns | 6416 ns | 0.97 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 151130 ns | 148737.5 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 372413 ns | 374559 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 292 ns | 0.86 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 291 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 26290 ns | 26129 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 357500 ns | 467478.5 ns | 0.76 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48805.5 ns | 48501 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7354 ns | 7209 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 7042 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8041 ns | 7708 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6979 ns | 7458 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 200306 ns | 198167 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 6097521 ns | 6016959 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 397569 ns | 396144 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 1958 ns | 2042 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2041 ns | 1917 ns | 1.06 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2084 ns | 2084 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2000 ns | 1959 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 27273 ns | 26961 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 493416.5 ns | 473229.5 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 209702 ns | 211192 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17541 ns | 17312.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 18166.5 ns | 16979.5 ns | 1.07 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17667 ns | 17958 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17666.5 ns | 17291.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 285604 ns | 284214 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6161167 ns | 5834000 ns | 1.06 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 724677 ns | 717577 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 174417 ns | 188417 ns | 0.93 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 167583.5 ns | 169438 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 151417 ns | 149396 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 145583 ns | 175916 ns | 0.83 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 225867 ns | 221937 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1429395.5 ns | 1550833 ns | 0.92 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 227572 ns | 199412 ns | 1.14 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1321729 ns | 1315271 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1323417 ns | 1324083 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1328313 ns | 1325000 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1325750 ns | 1331833.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1001329 ns | 998483 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6753917 ns | 6733584 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1011639.5 ns | 1130086 ns | 0.90 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24896.5 ns | 27020.5 ns | 0.92 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25250 ns | 24792 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28000 ns | 26416 ns | 1.06 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25542 ns | 24687.5 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 271026.5 ns | 268327.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 986750 ns | 621687.5 ns | 1.59 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 119521 ns | 117991 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 117833 ns | 131333 ns | 0.90 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 120083 ns | 116958 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 118375 ns | 125645.5 ns | 0.94 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 176875 ns | 127375 ns | 1.39 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1213900 ns | 1214493 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6376312.5 ns | 6553167 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 614965 ns | 601326 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 334 ns | 1.12 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23468 ns | 22301 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 446666 ns | 447500 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 49170 ns | 51730.5 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7562.5 ns | 7541.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7584 ns | 7167 ns | 1.06 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8000 ns | 7792 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6875 ns | 7416 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 206004 ns | 204142.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5961166 ns | 5695875 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 407824 ns | 401495 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5812 ns | 5896 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5937.5 ns | 5708 ns | 1.04 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7333 ns | 6667 ns | 1.10 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6854.5 ns | 6208 ns | 1.10 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 167575 ns | 167740 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 672646 ns | 488083 ns | 1.38 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 240143 ns | 238573 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9875 ns | 10083.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9834 ns | 9709 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 9958.5 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9854 ns | 9708 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 978859.5 ns | 976109 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5692125.5 ns | 6285500 ns | 0.91 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 683826 ns | 679397 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 708 ns | 625 ns | 1.13 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 667 ns | 666 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 666 ns | 667 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 625 ns | 666 ns | 0.94 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23025 ns | 22844 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 214354 ns | 335708 ns | 0.64 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 216952 ns | 216152.5 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4667 ns | 4583 ns | 1.02 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4667 ns | 4542 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4958 ns | 4792 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4542 ns | 4584 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 242032 ns | 237762 ns | 1.02 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1648667 ns | 1793708 ns | 0.92 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 606251 ns | 600666.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8750 ns | 9542 ns | 0.92 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8500 ns | 8375 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9917 ns | 9542 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8375 ns | 8375 ns | 1 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 139395 ns | 138258.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 800687.5 ns | 834417 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 77821 ns | 69561 ns | 1.12 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8541 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8625 ns | 8166 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8979.5 ns | 9083.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8479 ns | 8209 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 674531.5 ns | 673050 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 4665667 ns | 5316625 ns | 0.88 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 358963 ns | 354703 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 126000 ns | 125917 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 130375 ns | 96125 ns | 1.36 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 129416 ns | 130167 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 183687.5 ns | 183437 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46315 ns | 45933 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 97061 ns | 98581 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 332208 ns | 339916 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 323917 ns | 166583 ns | 1.94 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 315709 ns | 348854.5 ns | 0.90 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 569000 ns | 574020.5 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 209770 ns | 207728 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 517105 ns | 495960 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397958 ns | 397708 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288166 ns | 215083 ns | 1.34 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288250 ns | 288291 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756041.5 ns | 756250 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44247 ns | 43863 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 421167 ns | 508833 ns | 0.83 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 84151 ns | 84981 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1380646 ns | 1459874.5 ns | 0.95 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1132937.5 ns | 862000 ns | 1.31 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1131583.5 ns | 1134791.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2441875 ns | 2443958 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 276054.5 ns | 264585.5 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1744958 ns | 1843542 ns | 0.95 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 354794 ns | 355253 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 655000 ns | 614666 ns | 1.07 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 645458 ns | 586000 ns | 1.10 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 606125 ns | 645874.5 ns | 0.94 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 651333 ns | 657000 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 211637.5 ns | 222791 ns | 0.95 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1332417 ns | 1392125 ns | 0.96 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 234477 ns | 247582 ns | 0.95 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2442417 ns | 2443375 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2443729 ns | 2464833.5 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2460479.5 ns | 2434958 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2466125 ns | 2451958 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1084419 ns | 1084693 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9616354 ns | 9656375 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1491474 ns | 1475249.5 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 32917 ns | 33979 ns | 0.97 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 35833 ns | 35146 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 35333 ns | 34541.5 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 958 ns | 958 ns | 1 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16181 ns | 15785 ns | 1.03 |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 81781 ns | 72911 ns | 1.12 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3083 ns | 3166 ns | 0.97 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3166 ns | 3208 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3417 ns | 3459 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3042 ns | 3166.5 ns | 0.96 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 149907.5 ns | 147758 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 345503 ns | 345553 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 406833.5 ns | 406875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 408833 ns | 401958 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 409208.5 ns | 409250 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 420333 ns | 421375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 44137 ns | 43841 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1179333.5 ns | 1170812 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 242582 ns | 242582.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3874541 ns | 3882208 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3981625 ns | 3924041.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3995271 ns | 3998375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3778020.5 ns | 3776500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 254416 ns | 250561 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 12000083 ns | 11700333.5 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1240627 ns | 1246592 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3958 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3875 ns | 3917 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 35129 ns | 34574 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 181625 ns | 264250 ns | 0.69 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 42720 ns | 40720 ns | 1.05 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15500 ns | 15750 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15708 ns | 15542 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 16084 ns | 15917 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15834 ns | 15792 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 276415 ns | 273311 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 889271 ns | 885792 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 176511 ns | 167912 ns | 1.05 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404209 ns | 404125 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 295395.5 ns | 220833 ns | 1.34 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 295625 ns | 295250 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760584 ns | 760375 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113822.5 ns | 113355 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 409229 ns | 483500 ns | 0.85 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 92275.5 ns | 90391 ns | 1.02 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1418500 ns | 1480125 ns | 0.96 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1143416 ns | 886750 ns | 1.29 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1157042 ns | 1160937.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2464062 ns | 2466312.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 252054 ns | 264186 ns | 0.95 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 1932667 ns | 1873812.5 ns | 1.03 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 360264 ns | 357734 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 458 ns | 584 ns | 0.78 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 542 ns | 458 ns | 1.18 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 583 ns | 583 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 541 ns | 0.92 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 26614 ns | 26163 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 362145.5 ns | 465459 ns | 0.78 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 209492 ns | 210412 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8375 ns | 8604.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8542 ns | 8167 ns | 1.05 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9125 ns | 9125 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8208 ns | 8750 ns | 0.94 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 219325 ns | 212800 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 6248208.5 ns | 5710375 ns | 1.09 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 707647 ns | 711177 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 835021 ns | 833479.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 618583 ns | 471667 ns | 1.31 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 620791 ns | 618333 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1547209 ns | 1549979.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 131693 ns | 129908.5 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 167721 ns | 169932 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2699249.5 ns | 2690812.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 2010542 ns | 1528250 ns | 1.32 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 2008750 ns | 2007542 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4923458 ns | 4933833.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 254591.5 ns | 255516 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 880209 ns | 874763.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32661 ns | 31620 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 283208 ns | 434791 ns | 0.65 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 49561 ns | 49800 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7417 ns | 7646 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7417 ns | 7084 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7958 ns | 7875 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7083 ns | 7333 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 230552 ns | 227155.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5450104 ns | 4969834 ns | 1.10 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 374993 ns | 366063.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2388875 ns | 2419959 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2390042 ns | 2370750 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2387625 ns | 2383667 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2385041 ns | 2405250 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 222782 ns | 221771 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1608854 ns | 1606125 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 336514 ns | 359644 ns | 0.94 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4653250 ns | 4630917 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4641333 ns | 4535583 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4667333 ns | 4657333 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4656333 ns | 4651709 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 986732 ns | 989560.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6571104 ns | 6807396 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1423514 ns | 1409064 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6875 ns | 15188 ns | 0.45 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7396 ns | 6875 ns | 1.08 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7583 ns | 7459 ns | 1.02 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6958.5 ns | 9416.5 ns | 0.74 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 24376 ns | 24119 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 275584 ns | 280270.5 ns | 0.98 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 34810 ns | 34491 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 33521 ns | 67062.5 ns | 0.50 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 33500 ns | 45729.5 ns | 0.73 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 33583 ns | 47833 ns | 0.70 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 32667 ns | 48416 ns | 0.67 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 243530.5 ns | 241118 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 2038145.5 ns | 2256625 ns | 0.90 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 242918 ns | 244442 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 21625 ns | 22000 ns | 0.98 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 26250 ns | 24167 ns | 1.09 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 25209 ns | 24291.5 ns | 1.04 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5167 ns | 5333.5 ns | 0.97 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18282 ns | 17742 ns | 1.03 |
| batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 86261 ns | 91171 ns | 0.95 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 11875 ns | 12250 ns | 0.97 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 10417 ns | 9229 ns | 1.13 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 10833 ns | 10708.5 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 17792 ns | 17979.5 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 249417 ns | 247367 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 378534 ns | 394269 ns | 0.96 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 406250 ns | 405958 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 297375 ns | 223625 ns | 1.33 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 296750 ns | 296750 ns | 1 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762958 ns | 762959 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47260 ns | 46786 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 509104 ns | 437125 ns | 1.16 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89561 ns | 92421 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1445500 ns | 1485583 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1166562.5 ns | 892146 ns | 1.31 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1168167 ns | 1165042 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2472542 ns | 2472709 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 314496 ns | 308920 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2114437.5 ns | 2073458 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 384754 ns | 380064 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 434833.5 ns | 435750 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 436583 ns | 430312.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 436750 ns | 438875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 447625 ns | 448792 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 55692 ns | 54925.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1118708.5 ns | 1149375 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 238023 ns | 238282 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3881625 ns | 3884333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4013979 ns | 3995458.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4029083 ns | 4027188 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3805271 ns | 3806541.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 274092 ns | 270795 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10308938 ns | 10301625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1240392 ns | 1244507 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8792 ns | 8750 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 7666 ns | 6875 ns | 1.12 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7709 ns | 7708 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12417 ns | 12458 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24383 ns | 24004 ns | 1.02 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 228000 ns | 231583.5 ns | 0.98 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 220382 ns | 219512 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 44708 ns | 45125 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45250 ns | 44791 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45208 ns | 45166 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45209 ns | 45542 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 367981 ns | 364741 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 1846667 ns | 1791396 ns | 1.03 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 666456 ns | 666126 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 83167 ns | 85666 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 83416 ns | 82854.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 83917 ns | 90541 ns | 0.93 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 94583 ns | 123042 ns | 0.77 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190250 ns | 190268 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2072000 ns | 2136500 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 172412 ns | 206862 ns | 0.83 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1982729 ns | 1990916 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2023063 ns | 1994062.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2022000 ns | 2022062.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2016958 ns | 2019666 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 583620.5 ns | 579448 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9865250 ns | 9777083 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1098935.5 ns | 1101570 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.