This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: allow zero-sized arrays in bias_activation
Showing 3 changed files with 25 additions and 6 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
 | | @@ -1,7 +1,7 @@ |
1 | 1 | name = "LuxLib" |
2 | 2 | uuid = "82251201-b29d-42c6-8e01-566dec8acb11" |
3 | 3 | authors = ["Avik Pal <[email protected]> and contributors"] |
4 | | -version = "0.3.46" |
 | 4 | +version = "0.3.47" |
5 | 5 | |
6 | 6 | [deps] |
7 | 7 | ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9" |
17ac9a2
@JuliaRegistrator register
Registration pull request created: JuliaRegistries/General/113527
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:
LuxLib Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4937.5
ns5854
ns0.84
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5666
ns5333.5
ns1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
8042
ns7937.5
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
5687.5
ns6187.5
ns0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
120909
ns118032
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI
2455225
ns2423568
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal
791750
ns782792
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
413945
ns415924
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10000
ns9708
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10250
ns9729.5
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
9875
ns10541
ns0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
9584
ns9709
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
558079
ns543351
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI
17669056
ns18092128
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal
2765041
ns2659875
ns1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
664078
ns681837
ns0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1375
ns1500
ns0.92
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1500
ns2917
ns0.51
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
2083
ns2208
ns0.94
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1667
ns2958
ns0.56
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
21790
ns21829
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI
1329237
ns1310773
ns1.01
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal
216396
ns241583
ns0.90
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU
31411
ns31380
ns1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4229.5
ns4313
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
3666
ns4541
ns0.81
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4166
ns3875
ns1.08
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4208
ns3833
ns1.10
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
149451
ns145099
ns1.03
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI
9238119
ns9229160
ns1.00
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal
1690125
ns1544146
ns1.09
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU
153327
ns151422
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
58500
ns58000
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
39500
ns46458
ns0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
47042
ns46042
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83333
ns82625
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37308.5
ns36359
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
568690.5
ns644849
ns0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1066021
ns1068687
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
81381
ns80010.5
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2032541.5
ns2025458
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2086500
ns2080416.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2080042
ns2072458.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1986125
ns1991834
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
235686.5
ns229025
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
7616963
ns8450207
ns0.90
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
7909459
ns7471708
ns1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1203034
ns1429005
ns0.84
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
148292
ns175729
ns0.84
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
166416.5
ns148708
ns1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
150375
ns165458.5
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
153437
ns147208
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
165231.5
ns166770
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
7240146
ns7919821.5
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1574250
ns1556812.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
180947
ns218943
ns0.83
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1115708.5
ns1113104
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1115583
ns1103291.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1111895.5
ns1131500
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1116604
ns1112791.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
717544
ns699462.5
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
33548293
ns35512444
ns0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
5783062
ns6463312
ns0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1033041
ns932249.5
ns1.11
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4958
ns4042
ns1.23
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4334
ns4250
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5667
ns5625
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4771
ns5709
ns0.84
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
95604
ns91912
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI
5339458
ns5255099
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal
722292
ns437917
ns1.65
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
60161
ns71131
ns0.85
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8875
ns8583
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8542
ns8542
ns1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8583
ns9417
ns0.91
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8541
ns8708
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
618298
ns595424
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI
37406952
ns35680059.5
ns1.05
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal
6128688
ns5662792
ns1.08
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
393664
ns389839
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
17312.5
ns17854.5
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
18834
ns17916
ns1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
20375
ns20520.5
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
18291.5
ns18083.5
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
67939.5
ns65742
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
2980613.5
ns2917446
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1353958
ns1373187
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
76051
ns74701
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
223334
ns212334
ns1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
211917
ns217375
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
219896
ns218646
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
212084
ns211791
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
360648.5
ns352483
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
14475566
ns13596474
ns1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5859625
ns5745167
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
479156
ns480345
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
583.5
ns708
ns0.82
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
666
ns666
ns1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
834
ns1000
ns0.83
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
708
ns625
ns1.13
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
20628
ns20944
ns0.98
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI
1259163
ns1204727
ns1.05
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal
300000
ns280583.5
ns1.07
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU
32990
ns32630
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1395.5
ns1375
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1541
ns1416
ns1.09
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1417
ns1542
ns0.92
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1334
ns1375
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
126083.5
ns125999
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI
8839991
ns8883264
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal
1618083
ns1494917
ns1.08
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU
126566.5
ns137492
ns0.92
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7375
ns7333
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5375
ns6125
ns0.88
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6166
ns6125
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9917
ns9959
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23573
ns23485
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1332882
ns1300745.5
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
626500
ns576646
ns1.09
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
47160
ns49320
ns0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
234375
ns220500
ns1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
242458
ns268854.5
ns0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
270958
ns269333
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
251083
ns254708
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
185731
ns183065
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
32262788
ns29898186.5
ns1.08
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
9574500
ns9134229.5
ns1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
623787
ns618976
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4083
ns4125
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4084
ns4084
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4125
ns4125
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4083
ns4125
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23126
ns22910
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI
2060093
ns1968950
ns1.05
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal
229375
ns219458
ns1.05
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU
48711
ns50520
ns0.96
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
17041
ns16833
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16500
ns16834
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17250
ns16958
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16833
ns16875
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
195934
ns197795
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI
9776499
ns10645383
ns0.92
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal
972750
ns930708
ns1.05
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU
179972
ns177772
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
509541.5
ns509792
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
332334
ns404834
ns0.82
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
404875
ns404666.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
865104.5
ns864750
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113032
ns113219.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI
393314
ns402480
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal
448145.5
ns453521
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU
248713
ns249427.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2319667
ns2317166
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1752729.5
ns2034833
ns0.86
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
2031958
ns2026750
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3283979.5
ns3276667
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
244203
ns244909.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI
10141578
ns10934161
ns0.93
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal
2016625
ns1922875
ns1.05
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
763594
ns761538
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
6042
ns6729.5
ns0.90
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6458
ns6875
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7708
ns7458
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6333
ns6270.5
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
93025
ns94787
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI
5860829
ns5253207.5
ns1.12
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal
901583
ns752958
ns1.20
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
62671
ns60151
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11583
ns10521
ns1.10
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
10500
ns11958
ns0.88
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11458
ns11458
ns1
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
10979
ns12062.5
ns0.91
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
646277
ns664213
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI
39475366
ns39026819.5
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal
5976917
ns5529917
ns1.08
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
418444
ns415644
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
541
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
541
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23258
ns23657
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI
2210297.5
ns2187739
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal
327292
ns319854.5
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU
52080
ns51140
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2083
ns2125
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2125
ns2125
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2125
ns2209
ns0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2084
ns2125
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
221917
ns243631
ns0.91
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI
11090915.5
ns11206966.5
ns0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal
2054166
ns1981375
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU
182777
ns178346.5
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
9062.5
ns8500
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
9792
ns9354.5
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
10167
ns9875
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8896
ns8896
ns1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
106196.5
ns116168
ns0.91
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
3337522.5
ns3094072
ns1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
876291.5
ns733250
ns1.20
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
75871
ns78331
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16854.5
ns17562.5
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17624.5
ns17916
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18958
ns18270.5
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
17187.5
ns17937.5
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
603520
ns635290
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
17553358.5
ns16856244
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
5108208
ns4597917
ns1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
396274
ns390934
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
500
ns583
ns0.86
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
625
ns708
ns0.88
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
583
ns500
ns1.17
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
35127
ns35377
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI
1236195
ns1189275
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal
475271
ns273875
ns1.74
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
49441
ns45830
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9792
ns10375
ns0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
10125
ns9750
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
10229.5
ns10666.5
ns0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9334
ns10125
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
257318.5
ns269773.5
ns0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI
19179819.5
ns18833230.5
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal
5182437.5
ns4684416.5
ns1.11
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
380274
ns378479
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
397083
ns397125
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
215250
ns288208
ns0.75
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
287916
ns287833
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
756041
ns756000
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111427
ns112391.5
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI
329641
ns327166
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal
363792
ns387167
ns0.94
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU
78871
ns78291
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1454375
ns1454292
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
859500
ns1135437.5
ns0.76
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
1129916
ns1134916.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2440417
ns2439500
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
209113
ns208924
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI
10754018
ns10539379
ns1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal
1658937.5
ns1551334
ns1.07
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU
328243
ns324908
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6916.5
ns7312.5
ns0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7084
ns6833
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8188
ns8166.5
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7042
ns7250
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
152190.5
ns157256
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
5924676
ns5707911.5
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
764458
ns708479.5
ns1.08
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
60511
ns70880
ns0.85
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16083.5
ns14479
ns1.11
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15625
ns15042
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14167
ns15208.5
ns0.93
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14333
ns14563
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
1030700
ns1058911
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
41391929
ns41408429
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
6599291
ns5845541
ns1.13
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
440235
ns427694
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
25292
ns24854.5
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
29625
ns28542
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
26500
ns29541
ns0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
25291
ns25375
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
228373.5
ns227961
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7754403
ns7745684.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1026771
ns1037604
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
118522
ns117166
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
146334
ns147667
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
118854
ns114041.5
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
148958
ns149708.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
117125
ns152584
ns0.77
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1207252
ns1197557
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
43560506
ns42890857
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6191583
ns5764959
ns1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
601256
ns597616
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
73833
ns76917
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
78104.5
ns76667
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
78021
ns80458
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
77750
ns77500
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
234865
ns232480
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7715200.5
ns7739750
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
534625
ns524583
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
125536.5
ns125411.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
305354
ns301958
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
321166
ns321188
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
295667
ns307271.5
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
304625
ns297896
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1245639
ns1235149.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
42472482.5
ns40402340
ns1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6703875
ns6347749.5
ns1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
703602.5
ns702732
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
16812.5
ns16416
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
16334
ns17166
ns0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
17438
ns17708
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
17083
ns16417
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
166179
ns165184
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
5720112.5
ns5708995
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
615083
ns664229.5
ns0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
240073
ns239872
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
27771
ns26125
ns1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
30354.5
ns27958
ns1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26791
ns26792
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26770.5
ns27500
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
1050438
ns1039155
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
42382106
ns40056686
ns1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
6159041
ns5713084
ns1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
717457
ns706062.5
ns1.02
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11125 ns | 11624.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11624.5 ns | 12250 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11958 ns | 12771 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11229.5 ns | 10959 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 139887 ns | 138899 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 4251683 ns | 3796355 ns | 1.12 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 817625 ns | 788438 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 244463 ns | 243833 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 20917 ns | 22166.5 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21833.5 ns | 22250 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22563 ns | 22375 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21416.5 ns | 21958.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 755838.5 ns | 748333 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21707541 ns | 21361866 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5465833 ns | 5064604 ns | 1.08 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 695787.5 ns | 689278 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 69479 ns | 63125 ns | 1.10 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 66041 ns | 63625 ns | 1.04 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 66229 ns | 67479 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 67791.5 ns | 63917 ns | 1.06 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 119885 ns | 117712.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3482594.5 ns | 3707369 ns | 0.94 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1370041.5 ns | 1370062.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 239323 ns | 238362 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 484042 ns | 483270.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 465541 ns | 451625 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 439354.5 ns | 448416 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 437646 ns | 449458 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 558258 ns | 554127 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20869371 ns | 20983063 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6275458.5 ns | 6194854 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 737788 ns | 715517 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7166.5 ns | 7271 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7792 ns | 6958 ns | 1.12 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8375 ns | 9042 ns | 0.93 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7292 ns | 7166.5 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 161904.5 ns | 160659.5 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5642311 ns | 5587356 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 456041 ns | 442520.5 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 61410 ns | 59231 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14708 ns | 13521 ns | 1.09 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17500 ns | 13500 ns | 1.30 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14646 ns | 15291.5 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15250 ns | 15833 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1023653 ns | 1017254.5 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 39308326 ns | 39424822 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 6089229.5 ns | 5535917 ns | 1.10 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 412184 ns | 406944 ns | 1.01 |
| batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6148208 ns | 6145625 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 3227583 ns | 6372750 ns | 0.51 |
| batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6378333 ns | 6371166 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11914959 ns | 11907459 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301812 ns | 351332.5 ns | 0.86 |
| batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 296489 ns | 295593 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19106770.5 ns | 19085062.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 11136250 ns | 19924479 ns | 0.56 |
| batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19962416 ns | 20021500 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36542271 ns | 36494374.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1158703 ns | 1097383 ns | 1.06 |
| batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1169188 ns | 1163817 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 958 ns | 958 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 958 ns | 1000 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 959 ns | 1000 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 959 ns | 958 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23501 ns | 23661.5 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2086104 ns | 2047052 ns | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 329667 ns | 226209 ns | 1.46 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 216992 ns | 214002 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3709 ns | 3667 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3666 ns | 3708 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 ns | 3750 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3667 ns | 3625 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 297158.5 ns | 298918 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10890085 ns | 11777281.5 ns | 0.92 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2191521 ns | 2108708 ns | 1.04 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 650431.5 ns | 643757 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8500 ns | 8958 ns | 0.95 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9250.5 ns | 8583 ns | 1.08 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9396 ns | 9271 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8125 ns | 8250 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 136116.5 ns | 134550.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3625618.5 ns | 3553980.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 819208 ns | 721791 ns | 1.13 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 67611 ns | 67561 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11250 ns | 12042 ns | 0.93 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 12667 ns | 12500 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11729 ns | 12312.5 ns | 0.95 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11042 ns | 11917 ns | 0.93 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 721603 ns | 711267.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22100212 ns | 22128749 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5441770.5 ns | 4449541 ns | 1.22 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 373594 ns | 363664 ns | 1.03 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22886 ns | 22725.5 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2122211.5 ns | 2110212 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 331291 ns | 216958 ns | 1.53 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 51921 ns | 52851 ns | 0.98 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3000 ns | 2875 ns | 1.04 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2958 ns | 2917 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3042 ns | 3125 ns | 0.97 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2875 ns | 2917 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 213807.5 ns | 212530.5 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9555854.5 ns | 10303702 ns | 0.93 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1713479.5 ns | 1562292 ns | 1.10 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 168911.5 ns | 171862 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11833 ns | 12084 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11896 ns | 11458 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13000 ns | 12750 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12291 ns | 11583 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 137978.5 ns | 136669.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3346916 ns | 3560577 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 900666.5 ns | 853666 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 239817.5 ns | 243422.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23042 ns | 20646 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 22104 ns | 20792 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 20458 ns | 21812.5 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23917 ns | 20854.5 ns | 1.15 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 653027.5 ns | 647113 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21573566 ns | 19242302 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4833270.5 ns | 4418416 ns | 1.09 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 673012 ns | 655347 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4375 ns | 4417 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24516 ns | 24882 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2141686 ns | 2124569.5 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 231666.5 ns | 219041 ns | 1.06 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 54410 ns | 52591 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16750 ns | 16625 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16167 ns | 16750 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16833 ns | 16916 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16459 ns | 16625 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 353559.5 ns | 352424 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 13538541 ns | 12518270.5 ns | 1.08 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1092020.5 ns | 1126416 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 216712 ns | 214922.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 1959 ns | 2000 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 1959 ns | 2042 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2083 ns | 2167 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2125 ns | 1917 ns | 1.11 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 35968 ns | 36064 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1179215 ns | 1237482 ns | 0.95 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 444042 ns | 276625 ns | 1.61 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 208282 ns | 207022 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 17958.5 ns | 19104.5 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 16958 ns | 19250 ns | 0.88 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 17333.5 ns | 18437.5 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 17646 ns | 17687 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 305401 ns | 302079 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 22626384 ns | 20285887 ns | 1.12 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5381292 ns | 5279999.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 703887 ns | 703572 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59250 ns | 59084 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 60625 ns | 65312.5 ns | 0.93 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 64167 ns | 65833 ns | 0.97 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51291 ns | 54042 ns | 0.95 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66533 ns | 66307 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 101811 ns | 102391 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 196208 ns | 182750 ns | 1.07 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 139333 ns | 137479 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 155270.5 ns | 130291 ns | 1.19 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 285354 ns | 309103.5 ns | 0.92 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 231110.5 ns | 230456.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 587041 ns | 582421 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82771 ns | 83250 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 87959 ns | 85417 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85959 ns | 84021 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81812.5 ns | 85333 ns | 0.96 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192437.5 ns | 193872 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5643042.5 ns | 5235813 ns | 1.08 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2001125 ns | 2079791 ns | 0.96 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 172856.5 ns | 209407.5 ns | 0.83 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1915166.5 ns | 1884937.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1905625 ns | 1903375 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1906791.5 ns | 1896375 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1867521 ns | 1907458 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 575411.5 ns | 571543 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 26938211 ns | 26257898 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9319437.5 ns | 8902417 ns | 1.05 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1079271.5 ns | 1079882 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21855 ns | 21813 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2160069.5 ns | 2084650 ns | 1.04 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 370125 ns | 320667 ns | 1.15 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 45340 ns | 45201 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1833 ns | 1875 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1834 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1791 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 268250 ns | 267247 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 10449469 ns | 10318095 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1115417 ns | 1507562.5 ns | 0.74 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 183362 ns | 186822 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8250 ns | 9291 ns | 0.89 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 11209 ns | 10416 ns | 1.08 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9708 ns | 10708.5 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9084 ns | 8083 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 135375 ns | 133754 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3475930 ns | 3421507 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 905833 ns | 847208 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 242893 ns | 239032 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 12042 ns | 9084 ns | 1.33 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 11125 ns | 9084 ns | 1.22 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8833 ns | 9125 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 12083 ns | 9167 ns | 1.32 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 583740 ns | 568427 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21716042.5 ns | 19345703 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4734458 ns | 4300021 ns | 1.10 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 652127 ns | 631937 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58458 ns | 58042 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39584 ns | 46584 ns | 0.85 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47104.5 ns | 46458 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83084 ns | 83125 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 39769 ns | 39683 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1485296 ns | 1371601 ns | 1.08 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1151625 ns | 1146041.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 78761 ns | 77101 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1929833 ns | 1921833 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1940687 ns | 1952917 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1942312.5 ns | 1960375 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1910500 ns | 1841250 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 236370 ns | 234062 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 34305466 ns | 32924888 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11016708.5 ns | 10966291.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1026191 ns | 1020771 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 417125 ns | 415291 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 417396 ns | 417687.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 419312.5 ns | 422375 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 416334 ns | 418000 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 238661.5 ns | 235407.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7635909 ns | 7542978.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 553834 ns | 531500 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 288843 ns | 288923 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 709000 ns | 774208 ns | 0.92 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 734313 ns | 682542 ns | 1.08 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 671250 ns | 767312.5 ns | 0.87 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 669791.5 ns | 737938 ns | 0.91 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1151563 ns | 1139485 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44436982 ns | 44764339 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6696083 ns | 6953229 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 931160 ns | 928979 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3399479.5 ns | 3441042 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3363750 ns | 3456937.5 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3425625 ns | 3428771 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3391083.5 ns | 3464583 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 177139 ns | 175145 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8299345 ns | 8202705.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1423625 ns | 1354666 ns | 1.05 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 416864 ns | 437135 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6186791 ns | 6187166 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6198687.5 ns | 6181167 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6090875 ns | 6197500 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6187875 ns | 6196250 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1083853 ns | 1070107 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51969867.5 ns | 52446997.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8058500 ns | 7250062.5 ns | 1.11 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1565741.5 ns | 1563197 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 471667 ns | 472000 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 253791 ns | 340625 ns | 0.75 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 342583 ns | 341708 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 902708 ns | 902667 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46521 ns | 46753 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 889304 ns | 384887.5 ns | 2.31 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 448250 ns | 404791 ns | 1.11 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 251212 ns | 251683 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2350667 ns | 2320833 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1761583.5 ns | 2040917 ns | 0.86 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2037792 ns | 2031584 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3284625 ns | 3282833 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 258155.5 ns | 255715 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 13627654 ns | 8191343 ns | 1.66 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2294875 ns | 2170729.5 ns | 1.06 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 791358 ns | 784548 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58292 ns | 57500 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39584 ns | 46250 ns | 0.86 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46542 ns | 45834 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82833 ns | 82875 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 27855 ns | 27970.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1365660 ns | 1021421 ns | 1.34 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1156292 ns | 1149062.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77241 ns | 74351 ns | 1.04 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2035459 ns | 2033375 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2077875 ns | 2085958 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2072875 ns | 2091250 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1932083 ns | 1948958.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 241361.5 ns | 239148 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 37426143 ns | 37589753 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11703125 ns | 11610833 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1056652 ns | 1049280.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58333 ns | 57875 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39458 ns | 46667 ns | 0.85 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46834 ns | 46375 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83250 ns | 82708 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 49658 ns | 49341 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 784251.5 ns | 807803 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1110916 ns | 1096916.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 75300.5 ns | 73081 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1894292 ns | 1916250 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1940666 ns | 1967625 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1969937.5 ns | 1972395.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1886875 ns | 1891979.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 247040 ns | 245040 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 17460173 ns | 18113352 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9839292 ns | 9728333 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1051031 ns | 928945 ns | 1.13 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 292 ns | 0.86 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 34603 ns | 34818 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1179890 ns | 1198925.5 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 433687.5 ns | 405563 ns | 1.07 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 49160 ns | 47730 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6958 ns | 7312 ns | 0.95 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 7416 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7521 ns | 7812.5 ns | 0.96 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7208.5 ns | 8333 ns | 0.87 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 210766 ns | 207888 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20576421.5 ns | 20491372 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5193791.5 ns | 4662542 ns | 1.11 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 378014 ns | 380574 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32342 ns | 32837 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1286880 ns | 1211839 ns | 1.06 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 261521 ns | 251125 ns | 1.04 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 39160 ns | 39501 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2666 ns | 2667 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2667 ns | 2875 ns | 0.93 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2834 ns | 2959 ns | 0.96 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2625 ns | 2917 ns | 0.90 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 202783.5 ns | 202228 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 7191833 ns | 7788464.5 ns | 0.92 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 969250 ns | 942542 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 154716.5 ns | 154372 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 457625 ns | 429542 ns | 1.07 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 453792 ns | 473375 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 426146 ns | 426771 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 456125 ns | 443791.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 142160 ns | 140775 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5858214 ns | 6055220 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2271875 ns | 2980417 ns | 0.76 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 326853 ns | 374113.5 ns | 0.87 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3802938 ns | 3786687.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3809708 ns | 3800125 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3801896 ns | 3802250 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3792625 ns | 3800125.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 781504 ns | 773073.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32606612 ns | 32060419 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11052792 ns | 11414458 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1495896 ns | 1481216 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49881521 ns | 49836458.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 26009250 ns | 35531854 ns | 0.73 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35546334 ns | 35532958 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96980062.5 ns | 96940104.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1600432 ns | 1598677.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1012971 ns | 1003070 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154537104 ns | 154620438 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 88927125 ns | 112348896.5 ns | 0.79 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112528667 ns | 112268042 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 298524146 ns | 297276875 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6474447 ns | 6507145 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5518798 ns | 5527328 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 19062.5 ns | 19333.5 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 15542 ns | 18417 ns | 0.84 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 17042 ns | 17041.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 16021 ns | 15895.5 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 20743 ns | 20523 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1114798 ns | 1185183 ns | 0.94 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 252583 ns | 216083 ns | 1.17 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 26040 ns | 25770 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 10917 ns | 10666.5 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 7416 ns | 9125 ns | 0.81 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 9208 ns | 9333 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17375 ns | 17333 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 296392 ns | 294450.5 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 9784850 ns | 10063127 ns | 0.97 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 1636083.5 ns | 1473458 ns | 1.11 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 155431 ns | 152692 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8729 ns | 9125 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9000 ns | 9208 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9229.5 ns | 9917 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8562.5 ns | 7709 ns | 1.11 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 139671.5 ns | 138926.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3414992.5 ns | 3463205 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 799833 ns | 776250 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 242752 ns | 242133 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9312.5 ns | 9250 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9416 ns | 9792 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10167 ns | 10083 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9250 ns | 9333 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 704800.5 ns | 694521 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23114743.5 ns | 23124170 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5428520.5 ns | 4958562 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 674252.5 ns | 652436.5 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9333 ns | 9354 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9709 ns | 9521 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10625 ns | 10833 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9250 ns | 9333.5 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 136210.5 ns | 134783 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3600263.5 ns | 3350897 ns | 1.07 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 947792 ns | 835416 ns | 1.13 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 69541 ns | 74881 ns | 0.93 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13062.5 ns | 13042 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13542 ns | 12917 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13916.5 ns | 14021 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13000 ns | 12708 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 647891 ns | 641313 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19445945.5 ns | 19842694 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4788583 ns | 4446583 ns | 1.08 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 349204 ns | 353113.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 459 ns | 542 ns | 0.85 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 584 ns | 583 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 458 ns | 1.09 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 34950 ns | 35065 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1182154 ns | 1207795 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 441000 ns | 272000 ns | 1.62 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 208662 ns | 208482 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7916 ns | 8084 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8000 ns | 8583 ns | 0.93 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8729.5 ns | 9021 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8417 ns | 7999.5 ns | 1.05 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 235567 ns | 232978 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 23108838 ns | 21485714 ns | 1.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5655333.5 ns | 4575750 ns | 1.24 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 664097 ns | 677037 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 16375 ns | 14000 ns | 1.17 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 14604.5 ns | 16833 ns | 0.87 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 14708 ns | 14708 ns | 1 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10459 ns | 11084 ns | 0.94 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 21454 ns | 22205 ns | 0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1118960.5 ns | 1162979 ns | 0.96 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 214750 ns | 202500 ns | 1.06 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 188482 ns | 192482 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 31708 ns | 32208 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 31875 ns | 32458 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 32146 ns | 32500 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 31917 ns | 32292 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 314264 ns | 312235 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11477943.5 ns | 11511065 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 1721916 ns | 1679791 ns | 1.03 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 610347 ns | 604116 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 441229.5 ns | 441708 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 445062.5 ns | 448937.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 447666 ns | 446250 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 446000 ns | 480084 ns | 0.93 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194324 ns | 196019 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6172811 ns | 5816450 ns | 1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2129687.5 ns | 2098250 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 356014 ns | 375818.5 ns | 0.95 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3806062.5 ns | 3828583 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3830125 ns | 3823145.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3819020.5 ns | 3821375 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3829625.5 ns | 3827208 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 580459 ns | 576589.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 27992292 ns | 28049485 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10082833.5 ns | 9440833.5 ns | 1.07 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1390109 ns | 1386199.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 833503354 ns | 832497916.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 415838000 ns | 542428167 ns | 0.77 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 544434542 ns | 543721583 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1561715250 ns | 1563747062.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22756243 ns | 22544597 ns | 1.01 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14023836 ns | 14054025 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2997704083 ns | 3015248500 ns | 0.99 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1512242750 ns | 1790623833 ns | 0.84 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 2248995791 ns | 2952821875 ns | 0.76 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 5261167167 ns | 5283328917 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 364718000 ns | 308874000 ns | 1.18 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 87342499 ns | 87902788 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 77833 ns | 77188 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76542 ns | 76417 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 78708 ns | 77021 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76354.5 ns | 77750 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 235898 ns | 234575.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7929573 ns | 7672129 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 551041.5 ns | 527333 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 109786.5 ns | 110611 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 282312.5 ns | 260625 ns | 1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 251104 ns | 274229 ns | 0.92 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 197208 ns | 272750.5 ns | 0.72 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 192416 ns | 235500 ns | 0.82 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1133383 ns | 1125486 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45395412 ns | 47620210 ns | 0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6595833 ns | 6079250 ns | 1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 643627 ns | 649537 ns | 0.99 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199406375 ns | 199482541.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 104150500 ns | 139345542 ns | 0.75 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 139302333 ns | 139029792 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 388728500 ns | 392535083 ns | 0.99 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5827807.5 ns | 5822474 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3416565 ns | 3382050 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 621451500.5 ns | 619752416.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 353591958 ns | 442441875 ns | 0.80 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 438706083.5 ns | 440256854 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1195242542 ns | 1198345083 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26241215 ns | 26648833.5 ns | 0.98 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 21717195 ns | 21780965 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7209 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5292 ns | 6083 ns | 0.87 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6250 ns | 0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 9875 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27646 ns | 27541 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1194589 ns | 1291872 ns | 0.92 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 620417 ns | 587750 ns | 1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50410 ns | 47840 ns | 1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213208 ns | 214771 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221104.5 ns | 221145.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221854 ns | 222354.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 216000 ns | 207000 ns | 1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 239232 ns | 238084 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 30973712 ns | 32570629.5 ns | 0.95 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9004750 ns | 9544417 ns | 0.94 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 536025 ns | 538400.5 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8333.5 ns | 9854 ns | 0.85 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 10250 ns | 8750 ns | 1.17 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9937.5 ns | 10479.5 ns | 0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9416 ns | 8000 ns | 1.18 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 133822.5 ns | 133164.5 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3586391 ns | 3474874 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 904312 ns | 869583 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 72841 ns | 72621 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7667 ns | 7416.5 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7917 ns | 7875 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8167 ns | 8416 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7833 ns | 7542 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 581095 ns | 569420 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18441555 ns | 20340438.5 ns | 0.91 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4731020.5 ns | 4250145.5 ns | 1.11 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 326163 ns | 321593.5 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 459 ns | 458 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 500 ns | 1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 417 ns | 1.20 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 26581 ns | 26321 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 1237995.5 ns | 1240804 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 473959 ns | 300875 ns | 1.58 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 49351 ns | 48251 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10166 ns | 10167 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10334 ns | 9791 ns | 1.06 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10416 ns | 10875 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9584 ns | 9667 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 272007 ns | 269831 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 23722074 ns | 23376313 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5995833.5 ns | 5253375 ns | 1.14 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 394569 ns | 398219 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 107229.5 ns | 107312.5 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 85749.5 ns | 99312 ns | 0.86 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 99417 ns | 100395.5 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 146291 ns | 146708 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 24482 ns | 24943 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI | 1201696.5 ns | 1222208 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 274937.5 ns | 258062 ns | 1.07 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 192342 ns | 190671.5 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 478334 ns | 477917 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 500041 ns | 478000 ns | 1.05 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 478375 ns | 500667 ns | 0.96 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 478708 ns | 496666 ns | 0.96 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 255734 ns | 253423.5 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11849860 ns | 11761727 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2286625 ns | 2149375 ns | 1.06 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 624721 ns | 618976 ns | 1.01 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 4937.5 ns | 5458 ns | 0.90 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 7000 ns | 6749.5 ns | 1.04 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 7792 ns | 6250 ns | 1.25 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 4333 ns | 6417 ns | 0.68 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 16407 ns | 16082.5 ns | 1.02 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 78321 ns | 79671 ns | 0.98 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 11542 ns | 11646 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 9666.5 ns | 11166.5 ns | 0.87 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 10792 ns | 11104 ns | 0.97 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 16958 ns | 16416 ns | 1.03 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 233195.5 ns | 231483.5 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 378594 ns | 373474 ns | 1.01 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 39417 ns | 39458 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 50250 ns | 47083 ns | 1.07 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 51417 ns | 52417 ns | 0.98 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13833 ns | 13791 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 19791.5 ns | 19645 ns | 1.01 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 85261 ns | 88401 ns | 0.96 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 51020.5 ns | 36062.5 ns | 1.41 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 28646.5 ns | 30937.5 ns | 0.93 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 31146.5 ns | 31854.5 ns | 0.98 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 64625 ns | 57291 ns | 1.13 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 208902 ns | 206595 ns | 1.01 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 415884.5 ns | 420164 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1875 ns | 1625 ns | 1.15 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1667 ns | 1979.5 ns | 0.84 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2250 ns | 2333 ns | 0.96 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1750 ns | 1708 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 20332 ns | 20532 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI | 1179138 ns | 1200494 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 324854.5 ns | 296854 ns | 1.09 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 28921 ns | 33920 ns | 0.85 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2083 ns | 2084 ns | 1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2208 ns | 2208 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2291 ns | 2500 ns | 0.92 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2250 ns | 2167 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 222981 ns | 222308 ns | 1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI | 9345973.5 ns | 9522565 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 1764708 ns | 1461292 ns | 1.21 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 139241 ns | 138611 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4667 ns | 3958.5 ns | 1.18 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4750 ns | 4771 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5750 ns | 6083 ns | 0.95 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4333 ns | 4166.5 ns | 1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 161766.5 ns | 160299 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 5876374 ns | 5753962 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 453291.5 ns | 436958 ns | 1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 62650 ns | 73001 ns | 0.86 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8667 ns | 8208 ns | 1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7958 ns | 8500 ns | 0.94 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8250 ns | 8958 ns | 0.92 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8166 ns | 8459 ns | 0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 958412 ns | 947464 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 39258406 ns | 38499411 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5932250.5 ns | 5531166 ns | 1.07 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 385774 ns | 391074 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 57250 ns | 56917 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56916 ns | 57917 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 58250 ns | 58000 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58667 ns | 58500 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37674 ns | 37518 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1210249 ns | 1241903 ns | 0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 380459 ns | 555417 ns | 0.68 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 208842 ns | 218173 ns | 0.96 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 449562.5 ns | 447979 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 466895.5 ns | 464396 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 465833 ns | 473021 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 434708 ns | 434125 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 276347 ns | 273702 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26892491 ns | 28360190 ns | 0.95 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8199750 ns | 8213312.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 814928 ns | 849179 ns | 0.96 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3302500 ns | 3314750 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1770792 ns | 2333729 ns | 0.76 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2337291.5 ns | 2339125 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6303499.5 ns | 6302291.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 204292.5 ns | 204384 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 203467.5 ns | 209662 ns | 0.97 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11464458 ns | 11431208 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 6552083 ns | 8359770.5 ns | 0.78 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8324666.5 ns | 8320667 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21058833.5 ns | 21055375 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 741274.5 ns | 741163 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1081561 ns | 1077386 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4583 ns | 6666 ns | 0.69 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5667 ns | 5042 ns | 1.12 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6104 ns | 5792 ns | 1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4854.5 ns | 4750 ns | 1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 156234.5 ns | 153420 ns | 1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5915211.5 ns | 5572475 ns | 1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 827584 ns | 774395.5 ns | 1.07 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 58490 ns | 58011 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7750 ns | 7083.5 ns | 1.09 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7042 ns | 7229.5 ns | 0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7416 ns | 7625 ns | 0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6959 ns | 7250 ns | 0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 812002.5 ns | 801539 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 34732254.5 ns | 36387767 ns | 0.95 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5657917 ns | 5231166.5 ns | 1.08 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 382614 ns | 377908.5 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 95791 ns | 123729 ns | 0.77 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 98000 ns | 101000 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 125333 ns | 104000 ns | 1.21 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 98604 ns | 126792 ns | 0.78 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 158376.5 ns | 156884 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6145153.5 ns | 6223875 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2249500 ns | 2963062.5 ns | 0.76 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 189172 ns | 208512 ns | 0.91 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2001625 ns | 2029666.5 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1968041.5 ns | 2022354.5 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2021312.5 ns | 1991125 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2030708.5 ns | 1994708 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 779642 ns | 768699 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32168960 ns | 31458817 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11090459 ns | 10927541.5 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1124561.5 ns | 1258693 ns | 0.89 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 34541.5 ns | 32625 ns | 1.06 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 35875 ns | 36833.5 ns | 0.97 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 33958 ns | 35792 ns | 0.95 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 625 ns | 500 ns | 1.25 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15484 ns | 14999 ns | 1.03 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 80681 ns | 72471 ns | 1.11 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2667 ns | 2583.5 ns | 1.03 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2791 ns | 2917 ns | 0.96 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3000 ns | 3041 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2250 ns | 2125 ns | 1.06 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 148962 ns | 147569 ns | 1.01 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 353673 ns | 353134 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7291 ns | 7208 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5292 ns | 6000 ns | 0.88 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6084 ns | 6041 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10209 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36617 ns | 36526 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1195969 ns | 1258443 ns | 0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 574854 ns | 344041.5 ns | 1.67 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49650 ns | 49210 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213333.5 ns | 212937 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220792 ns | 220562.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221208.5 ns | 221542 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214813 ns | 206666 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 253557.5 ns | 251971 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26578838 ns | 28011462 ns | 0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7960167 ns | 7840500 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 522265 ns | 585241 ns | 0.89 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22029 ns | 21871 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2084023 ns | 2206686 ns | 0.94 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 250333 ns | 242458 ns | 1.03 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 45980 ns | 45741 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14958 ns | 14958 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14625 ns | 15000 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14958 ns | 15041 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14875 ns | 14959 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 339194 ns | 338565 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 11224100 ns | 11553768.5 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1025166 ns | 976583 ns | 1.05 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 196272 ns | 200867 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 103041 ns | 102500 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 125083 ns | 100312.5 ns | 1.25 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 132667 ns | 108417 ns | 1.22 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 100249.5 ns | 121104.5 ns | 0.83 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 160077 ns | 151313 ns | 1.06 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5649119.5 ns | 6198682 ns | 0.91 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2853125 ns | 2966354 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 205222 ns | 209207 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1923874.5 ns | 1900250 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1935583 ns | 1919459 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1923375 ns | 1905125 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1927062.5 ns | 1916250 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 765025 ns | 757300 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31727518 ns | 32108317 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10829375 ns | 11078458 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1233282 ns | 1229222 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18791 ns | 18250 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18729.5 ns | 18687.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20145.5 ns | 20958 ns | 0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19646 ns | 18417 ns | 1.07 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 123582.5 ns | 123983 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3867602 ns | 3759508.5 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1393500 ns | 1408000 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76250 ns | 76771 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215917 ns | 216375 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216688 ns | 216583 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219083 ns | 225958.5 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 216250 ns | 225688 ns | 0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 569648 ns | 570521 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20650113 ns | 20158203.5 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6226521 ns | 6272542 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 496345 ns | 495115 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 25312.5 ns | 23938 ns | 1.06 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 28312.5 ns | 30688 ns | 0.92 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 29041 ns | 29062.5 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1458 ns | 1250 ns | 1.17 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16184 ns | 16428 ns | 0.99 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 88291 ns | 83121 ns | 1.06 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4875 ns | 4646 ns | 1.05 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4895.5 ns | 4896 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5437.5 ns | 5000 ns | 1.09 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4875 ns | 4958 ns | 0.98 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 227416 ns | 227807 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 387604 ns | 388114 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 305625 ns | 305583 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 305812.5 ns | 306229.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 309146 ns | 306417 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 307792 ns | 305458 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 259343.5 ns | 261264.5 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7522169 ns | 7704179 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 655771 ns | 1023292 ns | 0.64 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 277977.5 ns | 277833 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 532041 ns | 529834 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 530083 ns | 542250 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 538458 ns | 540625 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 533250 ns | 564458 ns | 0.94 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1187558.5 ns | 1187321 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45262893 ns | 41669413 ns | 1.09 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6496375 ns | 5875584 ns | 1.11 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 870989 ns | 878529 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19792 ns | 19458 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 21104 ns | 20250 ns | 1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22312.5 ns | 22125 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20542 ns | 20000 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 131573 ns | 130701 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3585949 ns | 3847882 ns | 0.93 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1498125 ns | 1505125 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 75971 ns | 82371 ns | 0.92 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214917 ns | 211958 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213083 ns | 213833 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213958 ns | 221208 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212500 ns | 225958.5 ns | 0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 880154.5 ns | 881389.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 24833949 ns | 26325325.5 ns | 0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7325541 ns | 7153208 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 546485 ns | 544575 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6334 ns | 6917 ns | 0.92 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7458 ns | 6542 ns | 1.14 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8083 ns | 8000 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6583 ns | 6458 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 157693 ns | 156818.5 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5843649.5 ns | 5948934 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 839500 ns | 752520.5 ns | 1.12 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 69580 ns | 69091 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11041 ns | 10125 ns | 1.09 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9917 ns | 9916 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10729 ns | 10542 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10209 ns | 10917 ns | 0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 890019.5 ns | 888274 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39536698 ns | 40943719 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5554084 ns | 5136625 ns | 1.08 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 391634 ns | 391869 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4667 ns | 4500 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5124.5 ns | 5375 ns | 0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5833 ns | 6542 ns | 0.89 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6458 ns | 5167 ns | 1.25 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 161059 ns | 160006.5 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5609112 ns | 5573705.5 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 822917 ns | 760125 ns | 1.08 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 62251 ns | 70591 ns | 0.88 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7542 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 7208 ns | 1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7708 ns | 7792 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7166 ns | 7292 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 835582 ns | 834345 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 41010539 ns | 40626772 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5986084 ns | 5526750 ns | 1.08 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 405494.5 ns | 398254 ns | 1.02 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14490500 ns | 14520542 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 7719208 ns | 10131583 ns | 0.76 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10131041 ns | 10128208 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27827208 ns | 27740417 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 529747 ns | 528864 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 389754 ns | 389824 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46259291.5 ns | 46258062.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 26496000 ns | 33606750 ns | 0.79 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33451708 ns | 33452875 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85583541 ns | 85111667 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2650995 ns | 2665492 ns | 0.99 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3276734 ns | 3283266 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 68250 ns | 66291 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 68104.5 ns | 68291.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 69312 ns | 70667 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 66125 ns | 68459 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 134037 ns | 136351 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3550202 ns | 3622418.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1521625 ns | 1516937 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 232902 ns | 229622 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 449854 ns | 441083 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 440625 ns | 442083 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 442209 ns | 451271 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 440396 ns | 448709 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 796931.5 ns | 797999 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26432216.5 ns | 27539269 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7473000 ns | 7538041.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 813753.5 ns | 798133.5 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 500 ns | 1.25 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 583 ns | 0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 31856 ns | 32316 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1165311 ns | 1191148 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 476979 ns | 282250 ns | 1.69 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 51801 ns | 51341 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10937.5 ns | 10333 ns | 1.06 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9542 ns | 10145.5 ns | 0.94 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10312.5 ns | 10750 ns | 0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10895.5 ns | 9708 ns | 1.12 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 298325 ns | 298535 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21243700 ns | 22166302 ns | 0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5492937.5 ns | 5112604 ns | 1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 381794 ns | 389099 ns | 0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9833 ns | 9833 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9834 ns | 9875 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9834 ns | 9833 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9792 ns | 9792 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23467 ns | 23326 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2094657 ns | 2165543 ns | 0.97 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 227250 ns | 221000 ns | 1.03 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 218063 ns | 217222 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 46250 ns | 46042 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45750 ns | 46083 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46625 ns | 46333 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 46625 ns | 45958 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 308147 ns | 311769 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11866227 ns | 9346963.5 ns | 1.27 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 981645.5 ns | 1406750 ns | 0.70 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 625806 ns | 625172 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56500 ns | 56291 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56333 ns | 57084 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57292 ns | 57166 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57958 ns | 57833 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28681 ns | 29441 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1215113 ns | 1309673.5 ns | 0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 679666.5 ns | 525875 ns | 1.29 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 206472 ns | 205592 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 454625 ns | 453959 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 465208 ns | 497667 ns | 0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 467459 ns | 473604.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 435834 ns | 480063 ns | 0.91 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 255741 ns | 257462 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31365693.5 ns | 35120868.5 ns | 0.89 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9276416.5 ns | 9345875 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 857403.5 ns | 846349 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 647417 ns | 642041.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 646792 ns | 579916.5 ns | 1.12 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 649354.5 ns | 645666 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 663709 ns | 646458 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 225589 ns | 228679 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8440846.5 ns | 8172629 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1395125 ns | 1355750.5 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 235913 ns | 253213 ns | 0.93 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2227104.5 ns | 2224417 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2251250 ns | 2220292 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2225292 ns | 2237020.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2242250 ns | 2228625 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1068301.5 ns | 1068960.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 49297045.5 ns | 47850344 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7711771 ns | 7115750 ns | 1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1379184 ns | 1248002 ns | 1.11 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22916 ns | 20375 ns | 1.12 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20146 ns | 20167 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21833 ns | 22708 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20709 ns | 22291.5 ns | 0.93 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 127032 ns | 126170.5 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3587279 ns | 3616907 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1515770.5 ns | 1510500 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 84371 ns | 83861 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 253853.5 ns | 220209 ns | 1.15 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220458 ns | 219167 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221000 ns | 228104 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219020.5 ns | 262208.5 ns | 0.84 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 840768 ns | 842194.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27055091.5 ns | 27546993 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7691791.5 ns | 7841187.5 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 560576 ns | 570991 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 584 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 583 ns | 0.86 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 708 ns | 0.88 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 541 ns | 0.92 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 22755 ns | 23369 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1200857 ns | 1263372.5 ns | 0.95 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 466021 ns | 433416.5 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 50411 ns | 52440 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 11229.5 ns | 10750 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10542 ns | 10895.5 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10771 ns | 10875 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10312 ns | 10395.5 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 277858 ns | 277798 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 26122331 ns | 25466793 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6009125 ns | 5692709 ns | 1.06 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 412724 ns | 417444.5 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9209 ns | 8209 ns | 1.12 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 10125 ns | 8750 ns | 1.16 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9750 ns | 9958 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8917 ns | 8834 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 135766 ns | 135599.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3551579 ns | 3368218 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 904792 ns | 871041 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 67721 ns | 73140 ns | 0.93 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7750 ns | 7334 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7666 ns | 8000 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8312.5 ns | 8208 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7500 ns | 7667 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 551973.5 ns | 551087 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17337125 ns | 17674213 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4446687.5 ns | 4122708 ns | 1.08 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 336393 ns | 329793.5 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1437.5 ns | 1458 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1500 ns | 1500 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2000.5 ns | 1958 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1354.5 ns | 1604 ns | 0.84 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21147 ns | 21901 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1189038 ns | 1190713 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 311875 ns | 295917 ns | 1.05 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 190276.5 ns | 192142 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3333 ns | 3250 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3333 ns | 3542 ns | 0.94 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3458 ns | 3667 ns | 0.94 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3333.5 ns | 3208 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 241366 ns | 238995.5 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9747135 ns | 10823565 ns | 0.90 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1889917 ns | 1600541.5 ns | 1.18 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 597216 ns | 595486 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148042 ns | 149479 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 106084 ns | 128917 ns | 0.82 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 128375.5 ns | 130000 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225104 ns | 233187.5 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 24502.5 ns | 24863 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1170307 ns | 1175801 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 306333 ns | 266083 ns | 1.15 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 36970 ns | 37325.5 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 174999.5 ns | 160187 ns | 1.09 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 87125 ns | 123542 ns | 0.71 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 110792 ns | 126354.5 ns | 0.88 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 250729 ns | 269771 ns | 0.93 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 240885.5 ns | 242114 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10729195 ns | 10717525 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2110083.5 ns | 2011459 ns | 1.05 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 226383 ns | 238832.5 ns | 0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7209 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5292 ns | 6042 ns | 0.88 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6084 ns | 5917 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 10083 ns | 1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32889 ns | 32666 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1323363 ns | 1158312.5 ns | 1.14 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 369062.5 ns | 569812.5 ns | 0.65 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 51151 ns | 50740 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223583 ns | 220250 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228584 ns | 227542 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228917 ns | 236271 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213604 ns | 253792 ns | 0.84 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 270279 ns | 271669.5 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26319175 ns | 29231373 ns | 0.90 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8277437.5 ns | 8143771 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 534116 ns | 535646 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 14833 ns | 14958 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15125 ns | 14666 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16500 ns | 16250 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15917 ns | 15917 ns | 1 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 157359.5 ns | 156410 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5880290 ns | 5735978 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 824458 ns | 768020.5 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 240222 ns | 238883 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23687 ns | 22083 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23500 ns | 22958 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23854 ns | 23375 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23292 ns | 23000 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 926538.5 ns | 926946 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39379136.5 ns | 38775799 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5882625 ns | 5549917 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 690662 ns | 693337 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9812.5 ns | 9438 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9542 ns | 9875 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10583 ns | 11000 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10167 ns | 9333 ns | 1.09 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 140467 ns | 139616.5 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3419289 ns | 3454137.5 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 821479 ns | 723042 ns | 1.14 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 71471 ns | 76450 ns | 0.93 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13917 ns | 13584 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13166 ns | 13792 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14458 ns | 14125 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13583 ns | 13500 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 766881 ns | 764786.5 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 22531916 ns | 20882662 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5288584 ns | 4491625 ns | 1.18 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 372183.5 ns | 374309 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9459 ns | 10458.5 ns | 0.90 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9542 ns | 9375 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10958 ns | 11125 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10333 ns | 10021 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 138627.5 ns | 138381 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3561997 ns | 3357197.5 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 927624.5 ns | 840979.5 ns | 1.10 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72865.5 ns | 75811 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12583 ns | 12083 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12646.5 ns | 12833 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13083.5 ns | 13292 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11937.5 ns | 12604 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 624799.5 ns | 621892.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20341049.5 ns | 18616134 ns | 1.09 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4551375 ns | 4378854 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 348243 ns | 351323.5 ns | 0.99 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 31083.5 ns | 27375 ns | 1.14 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 32937.5 ns | 32542 ns | 1.01 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 31583 ns | 32042 ns | 0.99 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2042 ns | 1750 ns | 1.17 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16203 ns | 16401 ns | 0.99 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 73550 ns | 74361 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5229.5 ns | 5250 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5063 ns | 5375 ns | 0.94 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5562.5 ns | 5458 ns | 1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6416 ns | 6354.5 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 148737.5 ns | 149643.5 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 374559 ns | 372579 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 26129 ns | 26282 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1175092.5 ns | 1235843 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 467478.5 ns | 429000 ns | 1.09 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48501 ns | 48191 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7209 ns | 6875 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7042 ns | 7333 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7708 ns | 8250 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7458 ns | 7250 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 198167 ns | 199861.5 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22361121 ns | 23031425.5 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 6016959 ns | 5383603.5 ns | 1.12 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 396144 ns | 399594 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2042 ns | 2000 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 1917 ns | 2042 ns | 0.94 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2084 ns | 2084 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 1959 ns | 1917 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 26961 ns | 27232 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1272300.5 ns | 1267736 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 473229.5 ns | 288917 ns | 1.64 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 211192 ns | 210802 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17312.5 ns | 17312.5 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16979.5 ns | 16750 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17958 ns | 18271 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17291.5 ns | 17792 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 284214 ns | 284556 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25656179.5 ns | 28374240 ns | 0.90 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5834000 ns | 5355833 ns | 1.09 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 717577 ns | 716677 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 188417 ns | 152500 ns | 1.24 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 169438 ns | 151917 ns | 1.12 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 149396 ns | 152875 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 175916 ns | 197875 ns | 0.89 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 221937 ns | 223454.5 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7838028 ns | 9157076.5 ns | 0.86 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1550833 ns | 1404125 ns | 1.10 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 199412 ns | 219493 ns | 0.91 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1315271 ns | 1335250 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1324083 ns | 1317937.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1325000 ns | 1295104.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1331833.5 ns | 1330416.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 998483 ns | 1005186 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 47666418 ns | 47695710 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6733584 ns | 6492770.5 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1130086 ns | 1013325.5 ns | 1.12 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 27020.5 ns | 24312.5 ns | 1.11 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24792 ns | 25125 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 26416 ns | 27125 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24687.5 ns | 25312.5 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 268327.5 ns | 268978 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8104229 ns | 7881859 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 621687.5 ns | 605042 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 117991 ns | 121672 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 131333 ns | 117958.5 ns | 1.11 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 116958 ns | 117458 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 125645.5 ns | 176312.5 ns | 0.71 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 127375 ns | 125666.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1214493 ns | 1219058 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43264858 ns | 45063058 ns | 0.96 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6553167 ns | 6043584 ns | 1.08 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 601326 ns | 618336 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 250 ns | 334 ns | 0.75 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 334 ns | 375 ns | 0.89 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22301 ns | 23021 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1209248 ns | 1205681 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 447500 ns | 416084 ns | 1.08 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 51730.5 ns | 51680 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7541.5 ns | 7250 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7167 ns | 7500 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7792 ns | 8250 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7416 ns | 7041 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 204142.5 ns | 205357.5 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25835569 ns | 24678245 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5695875 ns | 5614708.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 401495 ns | 401394 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5896 ns | 5250 ns | 1.12 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5708 ns | 5917 ns | 0.96 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6667 ns | 6709 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6208 ns | 5250 ns | 1.18 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 167740 ns | 167889 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5615184 ns | 5730740.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 488083 ns | 442416 ns | 1.10 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 238573 ns | 237332 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10083.5 ns | 9792 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9709 ns | 10291 ns | 0.94 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9958.5 ns | 10417 ns | 0.96 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9708 ns | 9979.5 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 976109 ns | 974688.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 41486841 ns | 40869083 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 6285500 ns | 5716125 ns | 1.10 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 679397 ns | 681047 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 667 ns | 0.94 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 666 ns | 667 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 667 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 666 ns | 666 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22844 ns | 22997 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1942240 ns | 2184475 ns | 0.89 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 335708 ns | 222542 ns | 1.51 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 216152.5 ns | 214122 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4583 ns | 4583 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4542 ns | 4667 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4792 ns | 4875 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4584 ns | 4584 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 237762 ns | 241286.5 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10676201 ns | 10599719 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1793708 ns | 1575875 ns | 1.14 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 600666.5 ns | 597171.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 9542 ns | 8042 ns | 1.19 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8375 ns | 8375 ns | 1 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9542 ns | 9687.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8375 ns | 7583 ns | 1.10 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 138258.5 ns | 137743.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3579498 ns | 3699354 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 834417 ns | 778937 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 69561 ns | 77070 ns | 0.90 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8541 ns | 8583.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8166 ns | 8666 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9083.5 ns | 8958 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8209 ns | 8520.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 673050 ns | 671848 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22514145.5 ns | 21090224 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5316625 ns | 4565417 ns | 1.16 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 354703 ns | 354384 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 125917 ns | 127708 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 96125 ns | 129375 ns | 0.74 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 130167 ns | 130250 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 183437 ns | 180687 ns | 1.02 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 45933 ns | 46493 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 98581 ns | 98931 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 339916 ns | 344083 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 166583 ns | 324375 ns | 0.51 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 348854.5 ns | 344583 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 574020.5 ns | 606833 ns | 0.95 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 207728 ns | 208834 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 495960 ns | 512935 ns | 0.97 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397708 ns | 398042 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 215083 ns | 288042 ns | 0.75 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288291 ns | 288646 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756250 ns | 756209 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43863 ns | 43829 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1414380.5 ns | 1374897 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 508833 ns | 409979 ns | 1.24 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 84981 ns | 83555.5 ns | 1.02 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1459874.5 ns | 1451083 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 862000 ns | 1136063 ns | 0.76 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1134791.5 ns | 1133125 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2443958 ns | 2442166.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 264585.5 ns | 265950 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 11951229.5 ns | 10929832 ns | 1.09 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1843542 ns | 1780583 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 355253 ns | 351723.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 614666 ns | 638500 ns | 0.96 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 586000 ns | 643687 ns | 0.91 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 645874.5 ns | 647458 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 657000 ns | 646208 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 222791 ns | 223801.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8463661 ns | 8161714 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1392125 ns | 1357208 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 247582 ns | 249462 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2443375 ns | 2446416 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2464833.5 ns | 2441562 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2434958 ns | 2458583 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2451958 ns | 2438604 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1084693 ns | 1085043 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 52848514 ns | 52292780.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9656375 ns | 7220166 ns | 1.34 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1475249.5 ns | 1491070.5 ns | 0.99 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 33979 ns | 32062 ns | 1.06 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 35146 ns | 36083 ns | 0.97 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34541.5 ns | 34292 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 958 ns | 750 ns | 1.28 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15785 ns | 15707 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 72911 ns | 87960 ns | 0.83 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3166 ns | 3042 ns | 1.04 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3208 ns | 3417 ns | 0.94 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3459 ns | 3458 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3166.5 ns | 3020.5 ns | 1.05 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 147758 ns | 148084.5 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 345553 ns | 359633 ns | 0.96 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 406875 ns | 406833.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 401958 ns | 408000 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 409250 ns | 408292 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 421375 ns | 420458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 43841 ns | 44216 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1372730.5 ns | 1415070 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1170812 ns | 1153458.5 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 242582.5 ns | 240777 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3882208 ns | 3864458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3924041.5 ns | 3999792 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3998375 ns | 3979541.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3776500 ns | 3775854.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 250561 ns | 252337 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35883398 ns | 38063076.5 ns | 0.94 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11700333.5 ns | 11500395.5 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1246592 ns | 1237002 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3958 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3916 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 34574 ns | 34777 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1219036 ns | 1220867 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 264250 ns | 174292 ns | 1.52 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 40720 ns | 40930 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15750 ns | 15750 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15542 ns | 16000 ns | 0.97 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15917 ns | 16000 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15792 ns | 15708 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 273311 ns | 275461 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 8765176.5 ns | 9012714 ns | 0.97 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 885792 ns | 864791 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 167912 ns | 169182 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404125 ns | 404416 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 220833 ns | 295625 ns | 0.75 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 295250 ns | 295500 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760375 ns | 760500 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113355 ns | 113590.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI | 999579 ns | 1022527 ns | 0.98 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 483500 ns | 398708 ns | 1.21 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 90391 ns | 88881 ns | 1.02 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1480125 ns | 1487187 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 886750 ns | 1158000 ns | 0.77 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1160937.5 ns | 1155479.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2466312.5 ns | 2466208 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 264186 ns | 270328 ns | 0.98 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10729803 ns | 10184609.5 ns | 1.05 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 1873812.5 ns | 1820375.5 ns | 1.03 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 357734 ns | 358073 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 584 ns | 500 ns | 1.17 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 458 ns | 583 ns | 0.79 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 583 ns | 542 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 541 ns | 500 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 26163 ns | 26734 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1282772 ns | 1351582.5 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 465459 ns | 291500 ns | 1.60 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 210412 ns | 212062 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8604.5 ns | 8229.5 ns | 1.05 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8167 ns | 8583 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9125 ns | 9541.5 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8750 ns | 8250 ns | 1.06 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 212800 ns | 214078 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25355208.5 ns | 25901884 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5710375 ns | 5489167 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 711177 ns | 700618 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 833479.5 ns | 833854 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 471667 ns | 622042 ns | 0.76 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 618333 ns | 622979 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1549979.5 ns | 1540520.5 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129908.5 ns | 129761.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 169932 ns | 232503 ns | 0.73 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2690812.5 ns | 2695146 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1528250 ns | 1998709 ns | 0.76 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 2007542 ns | 2002084 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4933833.5 ns | 4935604.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 255516 ns | 253692 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 874763.5 ns | 775058 ns | 1.13 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 250 ns | 375 ns | 0.67 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31620 ns | 32085 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1200788 ns | 1210183 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 434791 ns | 398167 ns | 1.09 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 49800 ns | 48991 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7646 ns | 7125 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7084 ns | 7542 ns | 0.94 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7875 ns | 7959 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7333 ns | 7208 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 227155.5 ns | 228384.5 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 22078020 ns | 22179958 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4969834 ns | 5066187 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 366063.5 ns | 371134 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2419959 ns | 2380250 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2370750 ns | 2375084 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2383667 ns | 2384771 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2405250 ns | 2400396 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 221771 ns | 222871 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7972987.5 ns | 7920377 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1606125 ns | 1468250 ns | 1.09 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 359644 ns | 333598.5 ns | 1.08 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4630917 ns | 4645416 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4535583 ns | 4654896 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4657333 ns | 4551000 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4651709 ns | 4663375.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 989560.5 ns | 991786.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 47401923 ns | 49524120.5 ns | 0.96 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6807396 ns | 6445334 ns | 1.06 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1409064 ns | 1420379 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 15188 ns | 7625 ns | 1.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6875 ns | 6917 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7459 ns | 7417 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 9416.5 ns | 6916.5 ns | 1.36 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 24119 ns | 24247 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI | 1200176 ns | 1191973 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 280270.5 ns | 267833.5 ns | 1.05 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 34491 ns | 39055.5 ns | 0.88 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 67062.5 ns | 48958.5 ns | 1.37 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 45729.5 ns | 52041.5 ns | 0.88 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 47833 ns | 50209 ns | 0.95 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 48416 ns | 70292 ns | 0.69 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 241118 ns | 242503 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10961456 ns | 10913747 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 2256625 ns | 2020458 ns | 1.12 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 244442 ns | 238762 ns | 1.02 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 22000 ns | 21687.5 ns | 1.01 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 24167 ns | 26292 ns | 0.92 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 24291.5 ns | 24292 ns | 1.00 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5333.5 ns | 7104.5 ns | 0.75 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 17742 ns | 18134 ns | 0.98 |
| batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 91171 ns | 86280 ns | 1.06 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 12250 ns | 12375 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 9229 ns | 10333 ns | 0.89 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 10708.5 ns | 10959 ns | 0.98 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 17979.5 ns | 18020.5 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 247367 ns | 247945.5 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 394269 ns | 390974 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 405958 ns | 406000 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 223625 ns | 296750 ns | 0.75 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 296750 ns | 296833 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762959 ns | 762416 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46786 ns | 47301 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI | 1380803 ns | 1385156 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 437125 ns | 480417 ns | 0.91 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 92421 ns | 90581 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1485583 ns | 1489959 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 892146 ns | 1168167 ns | 0.76 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1165042 ns | 1168541.5 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2472709 ns | 2469208.5 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 308920 ns | 301442 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI | 13822167 ns | 12084462 ns | 1.14 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2073458 ns | 2052625 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 380064 ns | 375484 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 435750 ns | 433833 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 430312.5 ns | 437292 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 438875 ns | 437959 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 448792 ns | 447542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 54925.5 ns | 55480 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1013411 ns | 1017218 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1149375 ns | 1108958.5 ns | 1.04 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 238282 ns | 238782 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3884333 ns | 3902500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3995458.5 ns | 4024833 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4027188 ns | 4023500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3806541.5 ns | 3799395.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 270795 ns | 271271 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 30409622 ns | 38442703 ns | 0.79 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10301625 ns | 10071250 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1244507 ns | 1237173 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8750 ns | 8750 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 6875 ns | 7667 ns | 0.90 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7708 ns | 7708 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12458 ns | 12417 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24004 ns | 24191 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2037132.5 ns | 2251764 ns | 0.90 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 231583.5 ns | 220958 ns | 1.05 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 219512 ns | 217462 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45125 ns | 45042 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 44791 ns | 45125 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45166 ns | 45459 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45542 ns | 45375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 364741 ns | 366950 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 13229894.5 ns | 13502470 ns | 0.98 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 1791396 ns | 1612375 ns | 1.11 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 666126 ns | 663217 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
85666
ns82104
ns1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
82854.5
ns82167
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
90541
ns83916
ns1.08
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
123042
ns122312.5
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
190268
ns189921
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
6129927
ns6075927
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
2136500
ns2073438
ns1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
206862
ns203742
ns1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1990916
ns2018750.5
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1994062.5
ns2019062.5
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2022062.5
ns1986604
ns1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2019666
ns1993875
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
579448
ns579704
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
27636360
ns30614260
ns0.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
9777083
ns9357458.5
ns1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1101570
ns1104701
ns1.00
This comment was automatically generated by workflow using github-action-benchmark.