test: try re-enabling enzyme testing on 0.13.16 (#1042)
* test: try re-enabling enzyme testing on 0.13.14
* fix: cache invalidation tests
* fix: more test fixes and standardize grad tests
* fix: avoid LV or Octavian with Enzyme
* fix: enzyme support for pooling
* fix: more enzyme support
* ci: temporarily disable other tests (drop me)
* test: cleanup conv tests
* ci: temporarily disable other tests (drop me)
* test: dense tests
* test: try fixing more tests
* test: workaround Enzyme warning
* test: enzyme only on linux
* fix: more BN test fixes
* test: newest release fixes more issues
* fix: print error in CI
* fix: more test fixes
* chore: apply suggestions from code review

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* test: mark remaining tests as broken
* fix: bypass enzyme bmm failure
* chore: apply suggestions from code review
Showing 47 changed files with 495 additions and 837 deletions.
@@ -1,7 +1,7 @@
 name = "Lux"
 uuid = "b2108857-7c20-44ae-9111-449ecde12c47"
 authors = ["Avik Pal <[email protected]> and contributors"]
-version = "1.3.3"
+version = "1.3.4"

 [deps]
 ADTypes = "47edcb42-4c32-4615-8424-f2b9edc5f35b"
@@ -69,15 +69,15 @@ LuxZygoteExt = "Zygote"
 ADTypes = "1.10"
 Adapt = "4.1"
 ArgCheck = "2.3"
-ArrayInterface = "7.10"
+ArrayInterface = "7.17.1"
 CUDA = "5.3.2"
 ChainRulesCore = "1.24"
 Compat = "4.16"
 ComponentArrays = "0.15.18"
 ConcreteStructs = "0.2.3"
 DispatchDoctor = "0.4.12"
-Enzyme = "0.13.13"
-EnzymeCore = "0.8.5"
+Enzyme = "0.13.16"
+EnzymeCore = "0.8.6"
 FastClosures = "0.3.2"
 Flux = "0.14.25"
 ForwardDiff = "0.10.36"
@@ -1,7 +1,7 @@
 name = "LuxLib"
 uuid = "82251201-b29d-42c6-8e01-566dec8acb11"
 authors = ["Avik Pal <[email protected]> and contributors"]
-version = "1.3.8"
+version = "1.3.9"

 [deps]
 ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9"
@@ -65,8 +65,8 @@ ChainRulesCore = "1.24"
 Compat = "4.16"
 CpuId = "0.3"
 DispatchDoctor = "0.4.12"
-Enzyme = "0.13.13"
-EnzymeCore = "0.8.5"
+Enzyme = "0.13.16"
+EnzymeCore = "0.8.6"
 FastClosures = "0.3.2"
 ForwardDiff = "0.10.36"
 Hwloc = "3.2"
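The version bounds changed in these hunks (for example Enzyme moving from "0.13.13" to "0.13.16") look like compat entries of a Julia Project.toml, where a bare version is read as a caret specifier: the leftmost nonzero component may not change. The sketch below only illustrates that rule for bare entries like the ones shown here. It is written in Python for illustration, the helper names `caret_range` and `allows` are hypothetical, and it does not handle tilde, inequality, hyphen-range, or comma-separated specifiers.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def caret_range(spec: str) -> Tuple[Version, Version]:
    """Return the half-open interval [lower, upper) for a bare compat entry.

    Assumption: the entry is a plain "X", "X.Y" or "X.Y.Z" string, which
    Julia's Pkg interprets with caret semantics.
    """
    parts = [int(p) for p in spec.strip().split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected 1 to 3 numeric components, got {spec!r}")
    padded = parts + [0] * (3 - len(parts))
    lower = (padded[0], padded[1], padded[2])
    # Bump the leftmost nonzero written component; if every written
    # component is zero, bump the last one that was written.
    bump = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
    upper_list = padded[:bump] + [padded[bump] + 1] + [0] * (2 - bump)
    upper = (upper_list[0], upper_list[1], upper_list[2])
    return lower, upper


def allows(spec: str, version: str) -> bool:
    """Check whether a full X.Y.Z version satisfies a bare compat entry."""
    major, minor, patch = (int(p) for p in version.split("."))
    lower, upper = caret_range(spec)
    return lower <= (major, minor, patch) < upper


if __name__ == "__main__":
    # The old bound admitted both releases; the new bound drops 0.13.13.
    print(caret_range("0.13.16"))          # lower and upper bound tuples
    print(allows("0.13.13", "0.13.16"))    # old entry, new release
    print(allows("0.13.16", "0.13.13"))    # new entry, old release
```

Read this way, raising the Enzyme entry to "0.13.16" keeps the upper bound at 0.14.0 and only raises the minimum release that the resolver may pick.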
132619c
@JuliaRegistrator register subdir=lib/LuxTestUtils
@JuliaRegistrator register subdir=lib/LuxLib
Registration pull request created: JuliaRegistries/General/119929
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:
Registration pull request created: JuliaRegistries/General/119930
Lux Benchmarks
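Each entry in the benchmark comparison that follows gives a benchmark name, two timings in nanoseconds, and a ratio. The ratio matches the first timing divided by the second, so values above 1 mean the first timing is slower. The sketch below recomputes the ratio for three entries copied from the listing. It assumes the first timing is the current commit and the second the previous baseline (the page does not label them), and the `Row` type, the `regressions` helper and the 1.3 threshold are hypothetical, chosen only for illustration.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    """One benchmark entry; timings are in nanoseconds."""
    name: str
    current_ns: float   # assumed: timing for this commit
    previous_ns: float  # assumed: timing for the baseline

    @property
    def ratio(self) -> float:
        # Above 1 means the current run took longer than the baseline.
        return self.current_ns / self.previous_ns


ROWS = [
    Row("bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)",
        1084, 1250),
    Row("Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)",
        232353000, 164341792),
    Row("layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)",
        296916, 206542),
]


def regressions(rows: List[Row], threshold: float = 1.3) -> List[str]:
    """Names of entries whose ratio exceeds a (hypothetical) threshold."""
    return [r.name for r in rows if r.ratio > threshold]


if __name__ == "__main__":
    for row in ROWS:
        print(f"{row.ratio:.2f}  {row.name}")
    print(regressions(ROWS))
```

Ratios this close to 1 on sub-microsecond CPU timings are usually within run-to-run noise, so a threshold like the one above is best applied to the larger workloads.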
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
3875
ns3875
ns1
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4208
ns4375
ns0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
5250
ns5083
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4333
ns4208
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
61892.5
ns60144
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10542
ns10625
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10209
ns10666
ns0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10459
ns11375
ns0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10417
ns10334
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
433097
ns421452
ns1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1084
ns1250
ns0.87
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1291
ns1292
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1292
ns1250
ns1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1209
ns1167
ns1.04
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18531
ns18149
ns1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4167
ns4167
ns1
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
3917
ns4042
ns0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4250
ns4292
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4083
ns3625
ns1.13
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
111975
ns109548
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57583
ns56166
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46292
ns46709
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
38042
ns46334
ns0.82
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83125
ns82291
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37370
ns37127
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2031625
ns2031334
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2085958
ns2096166.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2088333.5
ns2086458
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2005041
ns1997167
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
198108
ns197158.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
143750
ns143042
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
146063
ns145583.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
145209
ns146709
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
144583.5
ns149500
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166112.5
ns166231
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1118042
ns1138708.5
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1114250
ns1128583
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1153000
ns1062083.5
ns1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1068770.5
ns1115041.5
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
533468
ns530934
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3584
ns3125
ns1.15
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3750
ns3458
ns1.08
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4417
ns4292
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3958
ns3375
ns1.17
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
72081
ns70464
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9000
ns9208
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8542
ns8917
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9041
ns9125
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8916
ns9166
ns0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
503190.5
ns483194.5
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
15000
ns15333
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
15250
ns15458
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
16708
ns17333
ns0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
15542
ns17062.5
ns0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
55903
ns53962
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
214187.5
ns214583.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
213604.5
ns212667
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
215395.5
ns214625
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
212917
ns225250
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
278881
ns273370
ns1.02
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns458
ns1.09
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
542
ns666
ns0.81
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
750
ns750
ns1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
583
ns500
ns1.17
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17733
ns17502.5
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1625
ns1542
ns1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1500
ns1667
ns0.90
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1625
ns1834
ns0.89
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1583
ns1375
ns1.15
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
105125.5
ns101667.5
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7250
ns7125
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5833
ns5917
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5250
ns5792
ns0.91
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10084
ns9917
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
24106
ns23886
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
220750
ns221417
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
228084
ns228125
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
230459
ns228666
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
213708.5
ns220500
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
169707.5
ns169891
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3875
ns3958
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3917
ns3958
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3917
ns3875
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3875
ns3875
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23637
ns23537
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16708
ns16750
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16834
ns17042
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16875
ns16875
ns1
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16625
ns16750
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
161602
ns159725
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
578416.5
ns570333
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
569958
ns574000
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
579292
ns579125
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
578291
ns571125
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113009
ns113492
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1417979.5
ns1428041
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1419167
ns1422333
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1424875
ns1423708
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1426416
ns1423458
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
210883
ns208571.5
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1067000
ns1051187.5
ns1.02
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
958417
ns971896
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1336917
ns1346062.5
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1304396
ns1306416
ns1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA
271759
ns272301
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5795104.5
ns5990916
ns0.97
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4601125
ns4519875
ns1.02
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4929084
ns4948416.5
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5750083
ns5523125
ns1.04
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1068932
ns1070952
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
500
ns542
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns583
ns0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23274
ns23553
ns0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2125
ns2084
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2166
ns2167
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns2167
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2208
ns2125
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
171283
ns168963.5
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
4333
ns3875
ns1.12
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4125
ns4167
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5083
ns5250
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
4292
ns3666
ns1.17
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
66130
ns65091
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11625
ns11416
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11458
ns11292
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12458
ns12333.5
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11709
ns11209
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
452684.5
ns446962.5
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6375
ns6458.5
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6959
ns6792
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8229.5
ns7833.5
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6916
ns6250
ns1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
52019
ns52555
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16875
ns16584
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17000
ns17791
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18166
ns17375
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
17542
ns17125
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
301500.5
ns308634
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
542
ns625
ns0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
542
ns666
ns0.81
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
666
ns583
ns1.14
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
667
ns625
ns1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
32512
ns32320
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8500
ns8541
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8750
ns9167
ns0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9500
ns9500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8959
ns9479.5
ns0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
157915
ns159616
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64542
ns64750
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64625
ns64625
ns1
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64750
ns64292
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64875
ns64542
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111658.5
ns111041.5
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
279708
ns292000
ns0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
283750
ns292084
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
293250
ns275666
ns1.06
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
284521
ns275708
ns1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
185586.5
ns183441
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3282500
ns3191791
ns1.03
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
3076875
ns3043437.5
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
2795834
ns3020437.5
ns0.93
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
4063541.5
ns4089708
ns0.99
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
567714
ns601857
ns0.94
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7638583
ns7582625
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7366000
ns7473208.5
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7289042
ns7437833
ns0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8172916
ns8187292
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1335450
ns1317154
ns1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
17555833
ns18957000
ns0.93
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
17413291.5
ns19047250
ns0.91
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
17640417
ns19104542
ns0.92
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
14085667
ns15686625
ns0.90
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23644667
ns23902625
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
33391375
ns34420458
ns0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
40912708
ns37002333
ns1.11
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
35048479
ns34848770.5
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1855237.5
ns1857006
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
189754584
ns191696375.5
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
232353000
ns164341792
ns1.41
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
201284750
ns152698167
ns1.32
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
435226125
ns439655916
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13860033
ns13895377
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
290571042
ns292126520.5
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
334832916
ns340023312
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
303703583
ns298857875
ns1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
393811604
ns335240875
ns1.17
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
21541
ns22250
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22375
ns23083
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
23354
ns23959
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
24500
ns23417
ns1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
95582
ns96101
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103250
ns103542
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
115312.5
ns103541
ns1.11
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
104625
ns104791
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
102667
ns113250
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
503695.5
ns512131
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5750
ns5834
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5791
ns6375
ns0.91
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7666
ns7000
ns1.10
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6250
ns6125
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
68642
ns68297.5
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14875
ns15208
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14625
ns15750
ns0.93
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16250
ns16583
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14833
ns15062.5
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
478112.5
ns474148.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3019792
ns3053958
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2069896
ns2089500
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2279000
ns2270042
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4750917
ns4804875
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
583001
ns582756
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23604770.5
ns23872458.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18003875
ns18056937.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
18293125
ns17766021
ns1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35919729.5
ns35515208
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3106744
ns3103295.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33297687
ns33801000
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27474958
ns27630916.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
29070229.5
ns27435750
ns1.06
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41830959
ns41597458
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
73396
ns74917
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
75125
ns72541
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
74875
ns76416
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
72959
ns74375
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
103514
ns103583
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
274208
ns221146
ns1.24
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
205959
ns219166
ns0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
255333
ns208875
ns1.22
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
296916
ns206542
ns1.44
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
554316
ns560403
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11167
ns12166
ns0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11875
ns12208.5
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13458
ns13167
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12458
ns12042
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
72256.5
ns71403
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26583.5
ns26979.5
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26833
ns27167
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
28084
ns27958.5
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26708
ns26459
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
483481.5
ns472464
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11520.5
ns12437.5
ns0.93
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
13041
ns12979
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13750
ns14167
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12875
ns12125
ns1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
52959.5
ns53400
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25500
ns25625
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
25542
ns26292
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26375
ns26416
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26542
ns26167
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
310926
ns306626.5
ns1.01
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 179125 ns | 180729 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182625 ns | 182709 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 183958 ns | 183875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 182416 ns | 180833 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 58111 ns | 56252.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 582958 ns | 593541.5 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 583209 ns | 593916 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 610042 ns | 584021 ns | 1.04 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 582000 ns | 582917 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 286370 ns | 289288.5 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5729.5 ns | 6500 ns | 0.88 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6334 ns | 6125 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7500 ns | 7792 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6083 ns | 6145.5 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 71136.5 ns | 70132.5 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14167 ns | 14271 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14500 ns | 14916 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15667 ns | 15500 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14667 ns | 14000 ns | 1.05 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 468005 ns | 460852.5 ns | 1.02 |
| batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1186749.5 ns | 1175354 ns | 1.01 |
| batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1247334 ns | 1353000 ns | 0.92 |
| batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1282666.5 ns | 1269979 ns | 1.01 |
| batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 841729 ns | 1317500 ns | 0.64 |
| batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301667 ns | 302455 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4101771 ns | 4288500 ns | 0.96 |
| batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4417458 ns | 4366958 ns | 1.01 |
| batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4790916 ns | 4543917 ns | 1.05 |
| batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 3731833.5 ns | 4469000 ns | 0.84 |
| batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1043818 ns | 1030148 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1792 ns | 1875 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1875 ns | 1833 ns | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1834 ns | 1875 ns | 0.98 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23460 ns | 23497 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4875 ns | 4834 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4834 ns | 5041 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4917 ns | 4875 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4958 ns | 4875 ns | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 189873 ns | 185923.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5792 ns | 5500 ns | 1.05 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6167 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7187.5 ns | 6459 ns | 1.11 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6208 ns | 5583 ns | 1.11 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 55970.5 ns | 55454.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10625 ns | 10667 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11083 ns | 11750 ns | 0.94 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11584 ns | 11458 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11500 ns | 10667 ns | 1.08 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 332298.5 ns | 337381 ns | 0.98 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 375 ns | 333 ns | 1.13 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22660 ns | 22737 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2708 ns | 2708 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2750 ns | 3000 ns | 0.92 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3000 ns | 3000 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2709 ns | 2750 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 159360 ns | 157057 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11292 ns | 11625 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11792 ns | 12250 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13250 ns | 12708 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12229.5 ns | 11417 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 57130.5 ns | 56422 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24708 ns | 24250 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24167 ns | 25208 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25854 ns | 25000 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24916.5 ns | 25437.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 300198 ns | 294376.5 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4208 ns | 4167 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4125 ns | 4208 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4250 ns | 4167 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4208 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24574 ns | 24716 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16166 ns | 16042 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16000 ns | 16417 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16042 ns | 16250 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16375 ns | 16167 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 201392 ns | 193381 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5750 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5750 ns | 6083 ns | 0.95 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5750 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5916 ns | 5833 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33153 ns | 33569 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20333 ns | 20479.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20792 ns | 21000 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 20917 ns | 21208 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21375 ns | 21104.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 175780 ns | 174365.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 417417 ns | 375416.5 ns | 1.11 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 378854.5 ns | 374666.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 487270.5 ns | 488312.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 103917 ns | 524187.5 ns | 0.20 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66399.5 ns | 66372.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 877583 ns | 931978.5 ns | 0.94 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 949562.5 ns | 880291.5 ns | 1.08 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1206625 ns | 1223791.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 469167 ns | 1351833.5 ns | 0.35 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 191112 ns | 192149.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 85417 ns | 81312.5 ns | 1.05 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81083 ns | 80750 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84625 ns | 80792 ns | 1.05 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85417 ns | 80937 ns | 1.06 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193239.5 ns | 192807 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1913750 ns | 1932917 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1913542 ns | 1916542 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1943083.5 ns | 1926479 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1906896 ns | 1921042 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 406558 ns | 394461 ns | 1.03 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22047.5 ns | 22118 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1750 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1834 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1834 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1875 ns | 1792 ns | 1.05 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 171306.5 ns | 166019.5 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6209 ns | 6250 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6625 ns | 7208 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8542 ns | 8166 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7125 ns | 6312.5 ns | 1.13 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 60422 ns | 57360.5 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9000 ns | 8917 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8958 ns | 9167 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9584 ns | 9208 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9416 ns | 9250 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 313100.5 ns | 301535 ns | 1.04 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 119013624.5 ns | 156508063 ns | 0.76 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174073709 ns | 173937500 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 154836458 ns | 148141208 ns | 1.05 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106465208 ns | 106478500 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5473107.5 ns | 5474150 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 615549000 ns | 673237875 ns | 0.91 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 555627500 ns | 556883000 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 469486625 ns | 453960458.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 758488604 ns | 759297583 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34956527 ns | 38204722 ns | 0.91 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 650955333 ns | 701496583 ns | 0.93 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 665997520.5 ns | 667076166 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 596311875 ns | 586800771 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 746344250 ns | 744632000 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59041 ns | 56833 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47750 ns | 48042 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39041 ns | 47125 ns | 0.83 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84708.5 ns | 84541 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36941 ns | 37576 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1922166 ns | 1935541 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1978041 ns | 1985208 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1990167 ns | 1979834 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1920167 ns | 1893771 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 173728 ns | 174934 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 282041.5 ns | 267875 ns | 1.05 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 266458 ns | 288042 ns | 0.93 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 273853.5 ns | 270229.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 270333 ns | 267250 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 135453.5 ns | 128767 ns | 1.05 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 674666 ns | 665041 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 684354 ns | 668958 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 676145.5 ns | 589167 ns | 1.15 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 596375 ns | 596209 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 752272.5 ns | 703647.5 ns | 1.07 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2253417 ns | 2205417 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2217895.5 ns | 2188541 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2190479 ns | 2100166.5 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2202416.5 ns | 2225499.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133169 ns | 133307.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5479500 ns | 5538625 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5506916 ns | 5527958 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5588312.5 ns | 5503250 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5564021 ns | 5491271 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 794371.5 ns | 759584.5 ns | 1.05 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 646958 ns | 638667 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 656500 ns | 640458 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 640416 ns | 648875 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 657291 ns | 636167 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47817 ns | 47137 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1822375 ns | 1796937.5 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1719708 ns | 1724292 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1665541 ns | 1720542 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2108083 ns | 2104520.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 227850 ns | 218174.5 ns | 1.04 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58458 ns | 57000 ns | 1.03 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 45083 ns | 46833 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38041 ns | 47083 ns | 0.81 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84958 ns | 84542 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28842 ns | 28335 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2030375 ns | 2047750 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2084312.5 ns | 2077083 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1787459 ns | 2092083 ns | 0.85 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2014583.5 ns | 1939979 ns | 1.04 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 192397.5 ns | 191381.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13382625 ns | 13410020.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12433458.5 ns | 12472750 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12571375 ns | 12570979 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15143562.5 ns | 15234500 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 514602 ns | 512740.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47546916 ns | 47584458 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41875708 ns | 41911083 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41161020.5 ns | 41152979.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58396167 ns | 58152541 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3251545 ns | 3249099 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 75047125 ns | 74313208.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 67897459 ns | 91931958.5 ns | 0.74 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90940166.5 ns | 91156000 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 99460667 ns | 76595709 ns | 1.30 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58750 ns | 57334 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46875 ns | 47417 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38333 ns | 47250 ns | 0.81 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80334 ns | 84375 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 46475 ns | 48075 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1921416 ns | 1930959 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1976416 ns | 1977562.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1721708.5 ns | 1977250 ns | 0.87 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1905000 ns | 1816292 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190253.5 ns | 196217.5 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 334 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 417 ns | 0.80 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 334 ns | 1.12 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 417 ns | 333 ns | 1.25 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 31709.5 ns | 32756 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6125 ns | 6125 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6208 ns | 6583 ns | 0.94 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6583 ns | 6542 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6854.5 ns | 6208 ns | 1.10 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 176344 ns | 178147.5 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 291 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31144 ns | 31948 ns | 0.97 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2625 ns | 2625 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2625 ns | 2875 ns | 0.91 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2833 ns | 2834 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2750 ns | 2625 ns | 1.05 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 164923.5 ns | 164100 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 285479083.5 ns | 323244146 ns | 0.88 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 340672292 ns | 340740458 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 320528833.5 ns | 314512041.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 267627833 ns | 271130916 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7061953.5 ns | 7115553 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1000752000 ns | 1053603541.5 ns | 0.95 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 941508917 ns | 941056333 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 849741542 ns | 854610104 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1162624583 ns | 1162236250 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33972568.5 ns | 33945165 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1314224145.5 ns | 1364084083.5 ns | 0.96 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1312834041.5 ns | 1705661833 ns | 0.77 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1621294583 ns | 1621953875 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1681368042 ns | 1313183229.5 ns | 1.28 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1461562.5 ns | 1410000 ns | 1.04 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1416958 ns | 1408291.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1414750 ns | 1453645.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1412375 ns | 1407209 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127713.5 ns | 127861 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5020125 ns | 5051959 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5027042 ns | 5013583.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4740833 ns | 5028416.5 ns | 0.94 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5044042 ns | 5027271 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 510137 ns | 604299 ns | 0.84 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 171071812.5 ns | 161226250 ns | 1.06 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 126739625 ns | 131446875 ns | 0.96 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 146147041 ns | 127042083 ns | 1.15 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 168329334 ns | 155626750.5 ns | 1.08 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4881506 ns | 4974919.5 ns | 0.98 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 622612209 ns | 850481958 ns | 0.73 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 538980667 ns | 644255791 ns | 0.84 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 504257334 ns | 496077667 ns | 1.02 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 656863250 ns | 685984875 ns | 0.96 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16684647 ns | 15948822 ns | 1.05 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8964583 ns | 9064833.5 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8900333 ns | 8770396 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7993333 ns | 7878104.5 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9790312.5 ns | 10163000 ns | 0.96 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1594468.5 ns | 1608837.5 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36115750.5 ns | 37348729 ns | 0.97 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 36971083.5 ns | 36970124.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 34444208 ns | 33623167 ns | 1.02 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37794834 ns | 38875729.5 ns | 0.97 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6465190.5 ns | 6455570 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47292 ns | 47375 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47542 ns | 47750 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47584 ns | 47583 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47500 ns | 47625 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18793 ns | 18855 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50291.5 ns | 50250 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50417 ns | 50750 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50833 ns | 50416 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50750 ns | 50292 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 231220 ns | 202264 ns | 1.14 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6291 ns | 6375 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7084 ns | 7187.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7792 ns | 8417 ns | 0.93 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7542 ns | 6708 ns | 1.12 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 106604.5 ns | 108599.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10209 ns | 9604.5 ns | 1.06 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 10209 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10270.5 ns | 10292 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10459 ns | 10583 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 619990 ns | 610519 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5792 ns | 5958 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6416 ns | 6375 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7958 ns | 7583 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6042 ns | 5542 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 121725 ns | 131186.5 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13375 ns | 12875 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13000 ns | 13208 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13584 ns | 13583 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13375 ns | 12875 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 528027 ns | 530393 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 1000 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 959 ns | 1167 ns | 0.82 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1125 ns | 1042 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 31705 ns | 32479.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7792 ns | 7833.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7667 ns | 8042 ns | 0.95 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8209 ns | 8083 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8666 ns | 7916 ns | 1.09 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 204125.5 ns | 216406.5 ns | 0.94 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23000 ns | 23042 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23084 ns | 23542 ns | 0.98 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23584 ns | 23333 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23500 ns | 23375 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18461 ns | 19066 ns | 0.97 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52458 ns | 52291.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52291 ns | 52500 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52791 ns | 53166.5 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52458 ns | 52125 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 286087.5 ns | 309714.5 ns | 0.92 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1397209 ns | 1413917 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1395917 ns | 1401104 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1400209 ns | 1457583.5 ns | 0.96 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1398500 ns | 1402271 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195540.5 ns | 196285 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5008458.5 ns | 5045083 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5018750 ns | 4724458 ns | 1.06 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4722750 ns | 5023021 ns | 0.94 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4703042 ns | 4706104.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 626852.5 ns | 644560.5 ns | 0.97 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3063416 ns | 3086125.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2063875 ns | 2087104.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2311417 ns | 2281125 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4823500 ns | 4848375 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 580360 ns | 580262 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24332959 ns | 24765000.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18875458 ns | 18889791.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18989334 ns | 19005084 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36748479.5 ns | 36681292 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3188758 ns | 3253871.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34048562.5 ns | 34537875 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28257854 ns | 28314500 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28468541.5 ns | 27967000 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41851021 ns | 41702500 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144123292 ns | 144041208 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 147912291 ns | 143168583 ns | 1.03 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 128219729 ns | 124247521 ns | 1.03 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 175666645.5 ns | 173506729 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22797470 ns | 22768605 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1274551333 ns | 957619479 ns | 1.33 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1209986250 ns | 1175957479.5 ns | 1.03 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 717258459 ns | 739734292 ns | 0.97 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 669341542 ns | 672317125 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118134658 ns | 118020449 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 75042 ns | 73979 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73833 ns | 75750 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75813 ns | 75416 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74125 ns | 72854.5 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 248024.5 ns | 300521.5 ns | 0.83 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 202750 ns | 287875 ns | 0.70 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 283250 ns | 285333 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 194000 ns | 204208 ns | 0.95 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 189583 ns | 287375 ns | 0.66 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1272660.5 ns | 1342742 ns | 0.95 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35542000 ns | 36185500 ns | 0.98 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36428479 ns | 35466000.5 ns | 1.03 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32734792 ns | 32336688 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40941958 ns | 40972250 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5852888 ns | 5837876 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 147574354 ns | 151179834 ns | 0.98 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 154842271 ns | 151456979 ns | 1.02 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 142249771 ns | 136606104 ns | 1.04 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 285430916 ns | 287372208 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34907859 ns | 34877857 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 119543458.5 ns | 155986916 ns | 0.77 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173916625 ns | 174507459 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 155928584 ns | 148111416.5 ns | 1.05 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 103545938 ns | 102908562.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5470774 ns | 5463707 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 471171395.5 ns | 520380250 ns | 0.91 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 467366000 ns | 465489750 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 456719729 ns | 439138000 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 738831458 ns | 742252417 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32277660 ns | 35175845 ns | 0.92 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 709159062 ns | 698201250 ns | 1.02 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 654555208.5 ns | 654820792 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 585803354.5 ns | 571273229.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 726547959 ns | 850215250 ns | 0.85 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1242646 ns | 1101520.5 ns | 1.13 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 968625.5 ns | 970208.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 674709 ns | 920500 ns | 0.73 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 1941770.5 ns | 1945375.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 569058 ns | 580245.5 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2969916 ns | 2907896 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2603708 ns | 2595708 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 1985166.5 ns | 2606333 ns | 0.76 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3729625 ns | 3655000 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1762089 ns | 1734207 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5801458 ns | 6744875 ns | 0.86 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5780958 ns | 6498208 ns | 0.89 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5645834 ns | 6503854.5 ns | 0.87 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2921042 ns | 4423604.5 ns | 0.66 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7208 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5958 ns | 6083 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5333 ns | 5958.5 ns | 0.90 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 9959 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25119 ns | 25201 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215750 ns | 212291 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 258458 ns | 220750 ns | 1.17 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221291.5 ns | 220125 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 207146 ns | 206792 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 264756 ns | 262467.5 ns | 1.01 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 308377104 ns | 316552750 ns | 0.97 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 231656291 ns | 221682708 ns | 1.04 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 224042396 ns | 187257688 ns | 1.20 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 307881333 ns | 311596375 ns | 0.99 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7678620 ns | 7676203 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1097604312.5 ns | 1093022833.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 920148521 ns | 911616145.5 ns | 1.01 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 858485833.5 ns | 815656375 ns | 1.05 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1150798750 ns | 1161401125 ns | 0.99 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26497955 ns | 26547253 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4958.5 ns | 5292 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5583 ns | 5667 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6916.5 ns | 6625 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5541 ns | 5125 ns | 1.08 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 171524 ns | 167889.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 7083 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6750 ns | 7375 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7458 ns | 7459 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7875 ns | 7437.5 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 670577.5 ns | 650263 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 541 ns | 542 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 541 ns | 709 ns | 0.76 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 542 ns | 1.15 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23778 ns | 23809 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 8708 ns | 9041.5 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 8541.5 ns | 9791 ns | 0.87 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9458 ns | 9208.5 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9541.5 ns | 9042 ns | 1.06 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 233071 ns | 233459 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 353250 ns | 351417 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 353208 ns | 352250 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352667 ns | 353063 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 352125 ns | 353333 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21348 ns | 21613 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 822333 ns | 791250 ns | 1.04 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 774854 ns | 808979 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 777042 ns | 773625 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 825999.5 ns | 824084 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 286748 ns | 305844 ns | 0.94 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 336833 ns | 314958 ns | 1.07 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 335917 ns | 333625 ns | 1.01 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 445708 ns | 448667 ns | 0.99 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 10917 ns | 331833 ns | 0.03289907875346937 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17559 ns | 17811 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 713499.5 ns | 682125 ns | 1.05 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 730834 ns | 746791.5 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1027167 ns | 1029167 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26500 ns | 700937.5 ns | 0.03780650913954525 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 260521.5 ns | 273907.5 ns | 0.95 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 371375 ns | 328083 ns | 1.13 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 346250 ns | 348979 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 445812.5 ns | 424375 ns | 1.05 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 30479 ns | 370666 ns | 0.0822276658770969 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22136 ns | 22237 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 734062.5 ns | 743604 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 773750.5 ns | 750229 ns | 1.03 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1061729 ns | 1076375 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 98521 ns | 822541 ns | 0.12 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 220018.5 ns | 220485.5 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3375 ns | 3334 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3542 ns | 3792 ns | 0.93 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3687.5 ns | 3625 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3583 ns | 3583 ns | 1 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17780 ns | 18068 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4125 ns | 4166 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4167 ns | 4542 ns | 0.92 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4375 ns | 4250 ns | 1.03 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4500 ns | 4334 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 258504 ns | 278097 ns | 0.93 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3750 ns | 3292 ns | 1.14 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3500 ns | 3645.5 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4917 ns | 4708 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4083 ns | 4042 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 200777 ns | 212235.5 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8417 ns | 8042 ns | 1.05 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8000 ns | 8417 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8625 ns | 8792 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8604.5 ns | 8167 ns | 1.05 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1183716 ns | 1255478 ns | 0.94 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 205708 ns | 204000 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210125 ns | 211375 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210375 ns | 211042 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200375 ns | 200541 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34375 ns | 34367 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 650916 ns | 605708.5 ns | 1.07 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 666959 ns | 625021 ns | 1.07 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 624167 ns | 620792 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 632458 ns | 582583 ns | 1.09 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 343648 ns | 361289.5 ns | 0.95 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 1000479 ns | 973333 ns | 1.03 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1007958 ns | 950209 ns | 1.06 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 974396 ns | 955541 ns | 1.02 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 894770.5 ns | 1286000.5 ns | 0.70 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207021.5 ns | 207830 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4512146 ns | 4594084 ns | 0.98 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4708729.5 ns | 4500750.5 ns | 1.05 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4609875 ns | 4304583 ns | 1.07 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 5171208.5 ns | 6304625 ns | 0.82 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 947853.5 ns | 925479 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3333 ns | 3333 ns | 1 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3083 ns | 3583 ns | 0.86 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4333 ns | 4250 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3917 ns | 3541 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 218377.5 ns | 240989.5 ns | 0.91 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7375 ns | 6875 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6833 ns | 7542 ns | 0.91 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7458 ns | 7375 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7459 ns | 7042 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1012916 ns | 1039649.5 ns | 0.97 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1641584 ns | 1636792 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1193979 ns | 1175749.5 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1342687.5 ns | 1347167 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2486625.5 ns | 2463271 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214048 ns | 213096 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12366291.5 ns | 12388416 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9556958 ns | 9551437.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9332500 ns | 9305937.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18065166.5 ns | 18088000 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1946882 ns | 1951605 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17346750 ns | 17398084 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14347000 ns | 14348854.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14486917 ns | 14347271 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21148167 ns | 21112104 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 134750 ns | 94729.5 ns | 1.42 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 88584 ns | 90667 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 92042 ns | 92375 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 89042 ns | 114395.5 ns | 0.78 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126624 ns | 125574 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2031958 ns | 2039792 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2023083.5 ns | 1808208.5 ns | 1.12 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1756000 ns | 2033666.5 ns | 0.86 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2029583 ns | 2022500 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1029084 ns | 1052869 ns | 0.98 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 1750 ns | 326041.5 ns | 0.005367414884301538 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 2833 ns | 344833 ns | 0.008215571015535056 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 2458 ns | 396416 ns | 0.0062005569906360995 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 2166.5 ns | 314708 ns | 0.006884159284161826 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16055 ns | 15677 ns | 1.02 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2583 ns | 701042 ns | 0.00368451533574308 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2500 ns | 733209 ns | 0.0034096690029718677 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2750 ns | 1020500 ns | 0.0026947574718275357 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2750 ns | 656250 ns | 0.004190476190476191 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 191618 ns | 196145.5 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7416 ns | 7084 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5541 ns | 1.07 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5125 ns | 6084 ns | 0.84 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10166 ns | 10000 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33917 ns | 34060 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 226396.5 ns | 221166.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222521 ns | 220916.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221584 ns | 220167 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 207458 ns | 217124.5 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 311723.5 ns | 344547 ns | 0.90 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3750 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3709 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3708 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3667 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22860 ns | 22568 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14458 ns | 14167 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14291 ns | 14375 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14250 ns | 14458 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14667 ns | 14416 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 472859.5 ns | 487124.5 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 137417 ns | 97500 ns | 1.41 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 96458.5 ns | 93417 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 95833 ns | 96687.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 93125 ns | 91875 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125940 ns | 124929 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1921458.5 ns | 1940875 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1918166.5 ns | 1919916.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1817687.5 ns | 1931229.5 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1914458 ns | 1917271.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 951464 ns | 955641 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 869042 ns | 854084 ns | 1.02 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 815167 ns | 826333 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1175833 ns | 1211000 ns | 0.97 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 967562.5 ns | 955354.5 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 276671 ns | 272141 ns | 1.02 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2830583 ns | 2801124.5 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2508062.5 ns | 2515333 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3332875 ns | 3309625 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3328000 ns | 3416625 ns | 0.97 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1576106.5 ns | 1612126.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16000 ns | 17062.5 ns | 0.94 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15625 ns | 16708.5 ns | 0.94 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 16458 ns | 18937 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16417 ns | 15167 ns | 1.08 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 143900.5 ns | 142123.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 255875.5 ns | 223437.5 ns | 1.15 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 254271 ns | 215958 ns | 1.18 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216250 ns | 216125 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 258021 ns | 255708.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 637843.5 ns | 644779 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 220792 ns | 222292 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 220667 ns | 221750 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 221208 ns | 222542 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 222208.5 ns | 220917 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 270997 ns | 271274.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 504458 ns | 509083 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 507416.5 ns | 501292 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 499833.5 ns | 496750 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 498875.5 ns | 550583 ns | 0.91 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1304306.5 ns | 1401190 ns | 0.93 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 3459 ns | 304437.5 ns | 0.011361938000410594 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 3854.5 ns | 331687.5 ns | 0.01162087808554739 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 5375 ns | 376292 ns | 0.014284119779320315 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 4042 ns | 321812.5 ns | 0.012560108758982327 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16660 ns | 16554 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7166 ns | 708875 ns | 0.010108975489331687 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 6458 ns | 736875 ns | 0.00876403731976251 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7209 ns | 1020209 ns | 0.0070661991807561 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7541.5 ns | 668458 ns | 0.011281935439474132 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 194930.5 ns | 196065 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17666 ns | 17854 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17125 ns | 18520.5 ns | 0.92 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19729 ns | 19667 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18000 ns | 16209 ns | 1.11 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 146357.5 ns | 146750.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 244562 ns | 247604 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 237417 ns | 212500 ns | 1.12 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214500 ns | 212917 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 225208 ns | 211750.5 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 894981 ns | 1011803 ns | 0.88 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4416 ns | 4125 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 3917 ns | 4125 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5334 ns | 5187.5 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4833 ns | 4084 ns | 1.18 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 187684 ns | 201325 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10500 ns | 10667 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9708 ns | 10875 ns | 0.89 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11167 ns | 10500 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11250 ns | 10375 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1024651 ns | 1050725 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3209 ns | 3375 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3250 ns | 3625 ns | 0.90 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4687.5 ns | 4167 ns | 1.12 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3791 ns | 3291 ns | 1.15 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 218725.5 ns | 242454 ns | 0.90 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7833 ns | 7542 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7291 ns | 7666 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7750 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7917 ns | 7333 ns | 1.08 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1043721.5 ns | 1067571 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23437104.5 ns | 24057353.5 ns | 0.97 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 35045979.5 ns | 34753459 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 41490500 ns | 37792125 ns | 1.10 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34913479 ns | 34828583.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2126334.5 ns | 1854184 ns | 1.15 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184798459 ns | 187222542 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159330000 ns | 160010375 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 151477459 ns | 146721854.5 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 411547250 ns | 412776417 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16524151 ns | 16508303 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 427197208 ns | 437495583 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 252723645.5 ns | 253838438 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 305721250 ns | 232343979.5 ns | 1.32 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 481095166 ns | 483540875 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 182854.5 ns | 183854 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182791.5 ns | 183625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185292 ns | 185334 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 185750 ns | 184167 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 173677.5 ns | 220968 ns | 0.79 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 629833 ns | 594000 ns | 1.06 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 631375 ns | 632437.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 590542 ns | 586084 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630770.5 ns | 628500 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1010062 ns | 1061303.5 ns | 0.95 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3848041.5 ns | 3892042 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 4009000 ns | 3642708 ns | 1.10 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3525583 ns | 3572042 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4614917 ns | 5353250 ns | 0.86 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 536882 ns | 549368 ns | 0.98 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17371917 ns | 17901624.5 ns | 0.97 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17740624.5 ns | 17281292 ns | 1.03 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16856312.5 ns | 16574875 ns | 1.02 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 20403334 ns | 22050250 ns | 0.93 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2613028 ns | 2630980 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 541 ns | 0.92 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 625 ns | 0.80 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 584 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 667 ns | 584 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 31917 ns | 31762 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9334 ns | 9145.5 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8708 ns | 9208 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9875 ns | 9417 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9417 ns | 9208 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 260614 ns | 262912.5 ns | 0.99 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s)
503086958
ns505346750
ns1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s)
424620083.5
ns429818666.5
ns0.99
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s)
462339520.5
ns433256333.5
ns1.07
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s)
673052062
ns677373875
ns0.99
vgg16(32, 32, 3, 128)/forward/GPU/CUDA
12478664.5
ns12487373
ns1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s)
1872018104.5
ns2066713500
ns0.91
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s)
1625413500
ns1635890000
ns0.99
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s)
1546440125
ns1494391792
ns1.03
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s)
2200566458.5
ns2208031208.5
ns1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA
49139909
ns49163495.5
ns1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s)
1647791.5
ns1632500.5
ns1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s)
1202542
ns1173583
ns1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s)
1365999.5
ns1383958
ns0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s)
2393042
ns2483292
ns0.96
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA
215162
ns214736
ns1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s)
12703083.5
ns12776042
ns0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s)
9880000
ns9939062.5
ns0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s)
9761146
ns9686917
ns1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s)
18559417
ns18349375
ns1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA
2005712
ns2056758
ns0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s)
17693854
ns17758729.5
ns1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s)
14669187.5
ns14689958
ns1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s)
14767500
ns14551125
ns1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s)
21469542
ns21399666
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
26250
ns26250
ns1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
26208
ns26292
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
26292
ns26333
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
26292
ns26250
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA
23799
ns24146
ns0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
66666
ns66791
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
66750
ns67292
ns0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
67209
ns68417
ns0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
67500
ns66709
ns1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA
380551.5
ns391053.5
ns0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
203917
ns204333
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
209750
ns210125
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
210000
ns209458
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
199958
ns198792
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
25800
ns26289
ns0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
648229.5
ns642083
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
661271
ns624354.5
ns1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
622750
ns621729.5
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
586375
ns627000.5
ns0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
308724.5
ns357106
ns0.86
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
600291
ns645625
ns0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
594125
ns636292
ns0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
544666
ns602667
ns0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
652208
ns672375
ns0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
131751
ns132245.5
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2235000
ns2294979
ns0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2235625
ns2157208
ns1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2300854
ns2246208
ns1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2253125
ns2249458
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1127758
ns1236985
ns0.91
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
17541
ns17937.5
ns0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
16958
ns18416.5
ns0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
19917
ns20083
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
17958
ns18895.5
ns0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
145385
ns145580
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
261583
ns259583
ns1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
260812.5
ns261791
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
220937.5
ns219084
ns1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
230896
ns257520.5
ns0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
982925
ns1034996
ns0.95
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
542
ns542
ns1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
542
ns667
ns0.81
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
625
ns625
ns1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
667
ns583
ns1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
23015
ns23604
ns0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
9479.5
ns9750
ns0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
9042
ns10292
ns0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
10292
ns10250
ns1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
9625
ns9333
ns1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
257388
ns260113.5
ns0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5458
ns5083.5
ns1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5417
ns5792
ns0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
6625
ns6833
ns0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
6083
ns5375
ns1.13
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
233603.5
ns229273.5
ns1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7083
ns6709
ns1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7041
ns7667
ns0.92
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7833
ns7583
ns1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
7375
ns6937.5
ns1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
800650
ns777061.5
ns1.03
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
2000
ns1917
ns1.04
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
2125
ns2500
ns0.85
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
2458
ns2208
ns1.11
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
2459
ns2250
ns1.09
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA
17988
ns18340
ns0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
6500
ns6542
ns0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
6291
ns6667
ns0.94
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
6708
ns6666
ns1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
6542
ns6584
ns0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA
330671
ns320616.5
ns1.03
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s)
749709
ns750542
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s)
747104
ns746792
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s)
749208
ns746916
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s)
751791.5
ns750584
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA
21045
ns21795
ns0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s)
791000
ns805145.5
ns0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s)
791062.5
ns791604
ns1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s)
775875
ns772584
ns1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s)
775250
ns810645.5
ns0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA
294695
ns302046.5
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7208
ns6959
ns1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5958
ns5917
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5291
ns6000
ns0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10208
ns10167
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
32534
ns32896
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
233291
ns228770.5
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
267375
ns227709
ns1.17
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
227812.5
ns228084
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
213583
ns225625.5
ns0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
361573
ns359979
ns1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
10020.5
ns10250
ns0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
10042
ns10208
ns0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
11625
ns11042
ns1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
10208
ns9958
ns1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
248981.5
ns245976
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
26791
ns24896
ns1.08
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24292
ns24000
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
24750
ns25416.5
ns0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
25000
ns24625
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
1132389
ns1114734
ns1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
107227250
ns106794687
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
117058791.5
ns118367979
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
124034229
ns120992291
ns1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
117545541.5
ns118045833
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
2659866
ns2655666
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
393155000
ns397097667
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
366597250
ns368138875
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
357674666
ns357737125
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
490403667
ns483722209
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
15157994
ns15195689
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
758865499.5
ns769405854
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
580033084
ns762934333
ns0.76
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
748265062.5
ns748099729.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
948608916.5
ns772112770.5
ns1.23
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6916.5
ns6417
ns1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7000
ns7375
ns0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8042
ns8187
ns0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7625
ns8708.5
ns0.88
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
242461.5
ns243458.5
ns1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14084
ns13625
ns1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
13500
ns14834
ns0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14208
ns14834
ns0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14333
ns14000
ns1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
1085062
ns1081512.5
ns1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5541
ns5500
ns1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6563
ns6083.5
ns1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7666
ns7500
ns1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6291
ns5625
ns1.12
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
235371.5
ns236881
ns0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12542
ns12583
ns1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
12104.5
ns12750
ns0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
13042
ns13000
ns1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12750
ns12542
ns1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
793450.5
ns792100
ns1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s)
5125
ns328937.5
ns0.015580467414022421
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s)
5750
ns345250
ns0.0166545981173063
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s)
6333
ns398625
ns0.01588711194731891
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s)
5625
ns315687.5
ns0.01781825381112651
batchedmm(2, Bsize=128)/forward/GPU/CUDA
16571
ns17026
ns0.97
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s)
15792
ns701750
ns0.022503740648379053
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s)
15417
ns734417
ns0.020992161129167762
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s)
15625
ns1025666
ns0.015234004052001334
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s)
15750
ns663750
ns0.023728813559322035
batchedmm(2, Bsize=128)/zygote/GPU/CUDA
200110.5
ns202330
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
292
ns292
ns1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
292
ns417
ns0.70
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
416
ns416
ns1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
417
ns292
ns1.43
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
23594.5
ns23795
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
5959
ns6250
ns0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6083
ns6750
ns0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6666
ns6500
ns1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6834
ns6104.5
ns1.12
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
242427.5
ns242897.5
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5833
ns5875
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5834
ns6042
ns0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6000
ns5917
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6041
ns5875
ns1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
24342.5
ns24778
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
20875
ns21834
ns0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
21042
ns21542
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
21666
ns21750
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21875
ns21417
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
262727.5
ns265364.5
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
185833
ns184375
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
144916.5
ns185000
ns0.78
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
146875
ns149541
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
144416.5
ns190750
ns0.76
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
167734
ns168165
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1323750
ns1361667
ns0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1312209
ns1306875.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1332875
ns1318541.5
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1333770.5
ns1332084
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1339118
ns1372553
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
24041.5
ns24458
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22312.5
ns22729
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
24833
ns25000
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
24667
ns22374.5
ns1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
351890.5
ns355948
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
170708
ns176958
ns0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
177875
ns131167
ns1.36
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
118625
ns126166.5
ns0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
120020.5
ns177542
ns0.68
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1461877
ns1491511
ns0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
292
ns292
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
292
ns417
ns0.70
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
416
ns333
ns1.25
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
22590
ns23138
ns0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6250
ns6125
ns1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6250
ns6917
ns0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6750
ns6667
ns1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6583
ns6250
ns1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
255552.5
ns259300
ns0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4291
ns4458
ns0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4417
ns4875
ns0.91
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5708
ns5708.5
ns1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5292
ns4833
ns1.09
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
256272
ns258768.5
ns0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10042
ns9709
ns1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9833
ns10083
ns0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10417
ns10417
ns1
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10333
ns10041.5
ns1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
1354208
ns1358754
ns1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1583
ns1625
ns0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1625
ns1666
ns0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1666
ns1667
ns1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1625
ns1583
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA
22798
ns23306
ns0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
5833
ns5625
ns1.04
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
5709
ns6125
ns0.93
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
6000
ns6041
ns0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
5916
ns5625
ns1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA
274328
ns275587
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s)
6866624.5
ns6813916.5
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s)
6433708
ns6428416
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s)
6554499.5
ns6554167
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s)
7548875
ns7571104.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA
213149
ns213811
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s)
24100417
ns24163500
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s)
21294521
ns21359167
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s)
21070125
ns21066083
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s)
29826667
ns29670209
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA
2116806
ns2101483
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s)
37336834
ns37462416
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s)
34197292
ns45862833.5
ns0.75
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s)
45794042
ns45876667
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s)
49624208
ns38235959
ns1.30
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5750
ns5459
ns1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5625
ns6250
ns0.90
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6791
ns6958
ns0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
6667
ns5292
ns1.26
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
236202.5
ns238588.5
ns0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8084
ns7959
ns1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
7875
ns8334
ns0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8667
ns8250
ns1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9167
ns8250
ns1.11
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
1060405
ns1068264.5
ns0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s)
1553542
ns1529292
ns1.02
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s)
1263041.5
ns1266666.5
ns1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s)
1622041
ns1623709
ns1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s)
2175916
ns2163750
ns1.01
lenet(28, 28, 1, 128)/forward/GPU/CUDA
272178
ns279544
ns0.97
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s)
7902375
ns7968292
ns0.99
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s)
6258292
ns6533250
ns0.96
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s)
7165958
ns7125792
ns1.01
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s)
10478104.5
ns10479375
ns1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA
1852121.5
ns1874497
ns0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s)
361584
ns320667
ns1.13
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s)
370750
ns346291
ns1.07
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s)
456417
ns428584
ns1.06
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s)
24999.5
ns345375
ns0.0723836409699602
batchedmm(128, Bsize=4)/forward/GPU/CUDA
46439.5
ns46619.5
ns1.00
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s)
738895.5
ns745958.5
ns0.99
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s)
809958
ns791666.5
ns1.02
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s)
1082542
ns1073208.5
ns1.01
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s)
76708
ns776479
ns0.09878953584063445
batchedmm(128, Bsize=4)/zygote/GPU/CUDA
301861.5
ns311670
ns0.97
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s)
397459
ns396708.5
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s)
288084
ns287917
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s)
212208
ns288250
ns0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s)
755209
ns753417
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA
43701
ns44556
ns0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
665625
ns645167
ns1.03
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
530417
ns527667
ns1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
473750
ns532000
ns0.89
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
974458
ns974292
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA
189749
ns190424
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
649583
ns668958
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
641833
ns629749.5
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
545458.5
ns544375
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
653167
ns643396
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
131877
ns132592.5
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2454834
ns2485646
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2460271
ns2448562.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2500666
ns2450292
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2518479
ns2461146
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1202049
ns1408688
ns0.85
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)
3000
ns324000.5
ns0.009259244970300971
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s)
3500
ns344459
ns0.010160860944263323
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s)
3500
ns396583
ns0.008825390901778443
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s)
2708
ns314083.5
ns0.008621911052315705
batchedmm(2, Bsize=32)/forward/GPU/CUDA
15904
ns16193
ns0.98
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s)
5375
ns700875
ns0.007668985197075085
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s)
5292
ns734292
ns0.007206942197381968
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s)
5666
ns1020625
ns0.005551500306184936
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s)
5750
ns656584
ns0.008757447638078297
batchedmm(2, Bsize=32)/zygote/GPU/CUDA
196388
ns201017
ns0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1465625
ns1461042
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1502708
ns1503750
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1496875
ns1504625
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1444792
ns1442917
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
40558
ns40991
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5125396
ns5155750
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5286583
ns5279833.5
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5312375
ns5308333.5
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4974792
ns4987604
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
195790.5
ns200839
ns0.97
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3708
ns3750
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3708
ns3709
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3709
ns3667
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3708
ns3708
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA
32748
ns33187
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
15083
ns14958
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15083
ns15395.5
ns0.98
| Benchmark suite | Current | Previous | Ratio |
|---|---|---|---|
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15167 ns | 15375 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15375 ns | 15083 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 375651.5 ns | 379072.5 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71125 ns | 71541 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71167 ns | 71542 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71208 ns | 71270.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71083 ns | 71083 ns | 1 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 112958 ns | 112914 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 323791 ns | 325333 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 320458 ns | 320729.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 326875 ns | 318792 ns | 1.03 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 323000 ns | 317333 ns | 1.02 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 193747 ns | 193733 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 1000 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 958 ns | 1125 ns | 0.85 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1084 ns | 1000 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23358 ns | 23845 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7875 ns | 7750 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7834 ns | 8583 ns | 0.91 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8458 ns | 8500 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8833 ns | 7750 ns | 1.14 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 259209 ns | 262768.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 505375 ns | 456417 ns | 1.11 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 484292 ns | 472584 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 564542 ns | 554479 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 215062.5 ns | 550167 ns | 0.39 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 128754 ns | 128330 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1371334 ns | 1408750 ns | 0.97 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1393812.5 ns | 1380958 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1732333 ns | 1632666.5 ns | 1.06 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 870083.5 ns | 1597604 ns | 0.54 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 276302 ns | 274089 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 334 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 417 ns | 0.70 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 333 ns | 1.13 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31400 ns | 31588 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6167 ns | 6083 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6000 ns | 6750 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6500 ns | 6458 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6958 ns | 6125 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 263074.5 ns | 263587.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1767042 ns | 1767792 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1725208 ns | 1726375 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1727292 ns | 1725708 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1726271 ns | 1773250 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168554 ns | 168887 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4357521 ns | 4406958 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4359541 ns | 4358916 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4379875 ns | 4369792 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4377583 ns | 4367125 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1157059 ns | 1241756.5 ns | 0.93 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6666 ns | 6750 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6666 ns | 7000 ns | 0.95 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 6916 ns | 6792 ns | 1.02 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 7041.5 ns | 6750 ns | 1.04 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20567 ns | 19512 ns | 1.05 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 32834 ns | 51584 ns | 0.64 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 51229.5 ns | 48771 ns | 1.05 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 33541.5 ns | 33250 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 51062.5 ns | 52958 ns | 0.96 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 209739.5 ns | 210086 ns | 1.00 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 17250 ns | 328750 ns | 0.05247148288973384 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 17812.5 ns | 344958 ns | 0.05163672099212078 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 18292 ns | 408250 ns | 0.044805878750765464 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 17708 ns | 323500 ns | 0.05473879443585781 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 17907 ns | 18058 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 53208 ns | 719583.5 ns | 0.0739427738407009 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 52959 ns | 735666.5 ns | 0.07198778250742693 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 53541 ns | 1034250 ns | 0.051767947788252354 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 53291 ns | 684646 ns | 0.07783730570250903 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 344400 ns | 345041 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75333 ns | 75459 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 74959 ns | 75292 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75292 ns | 75167 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75000 ns | 75333 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47022 ns | 46969 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 325292 ns | 332833 ns | 0.98 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 324417 ns | 325833 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 343042 ns | 324583 ns | 1.06 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 327084 ns | 323834 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 210359 ns | 207979 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1488333 ns | 1487708 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1527917 ns | 1530375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1521042 ns | 1530750 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1466167 ns | 1466417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51138 ns | 51505.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5120375 ns | 5146312.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5285750 ns | 5151604.5 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5309459 ns | 5003270.5 ns | 1.06 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4973917 ns | 4984709 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 202631 ns | 205494.5 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28167 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28125 ns | 28334 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28208 ns | 28333 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28209 ns | 28167 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24478 ns | 24407 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66208 ns | 66500 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66167 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66250 ns | 67458 ns | 0.98 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66959 ns | 66417 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 533201 ns | 525547 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1463833 ns | 1383749.5 ns | 1.06 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1144583 ns | 1059771 ns | 1.08 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 832188 ns | 1061458 ns | 0.78 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2217792 ns | 2248687.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 576305 ns | 581876.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3077958.5 ns | 3035479 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2733167 ns | 2745250 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2620334 ns | 2740958 ns | 0.96 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3782000 ns | 3811500 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2001343 ns | 2064611 ns | 0.97 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 7887749.5 ns | 8921042 ns | 0.88 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 7887771 ns | 8776625 ns | 0.90 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 7989000 ns | 8768729.5 ns | 0.91 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 4832458 ns | 6359583 ns | 0.76 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 134958 ns | 82083.5 ns | 1.64 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 78917 ns | 81562.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82625 ns | 83125 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81250 ns | 80583 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193237.5 ns | 192403.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2017354.5 ns | 2040625 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2006750 ns | 1935354.5 ns | 1.04 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2041167 ns | 2023083 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2018875 ns | 2003562.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 797402 ns | 805958 ns | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
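The final figure in each benchmark entry above appears to be the first timing divided by the second (for example, 17250 / 328750 ≈ 0.052), so values below 1 would mean the newer run was faster. A minimal Python sketch of that reading follows. The `Row` type, the `ratio` and `classify` helpers, and the 10% tolerance are illustrative assumptions, not part of github-action-benchmark or Lux.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One benchmark entry; both timings are in nanoseconds."""
    name: str
    current_ns: float
    previous_ns: float


def ratio(row: Row) -> float:
    # Assumed definition: current timing divided by previous timing.
    return row.current_ns / row.previous_ns


def classify(row: Row, tolerance: float = 0.10) -> str:
    # The tolerance is a hypothetical noise band, chosen only for illustration.
    r = ratio(row)
    if r > 1 + tolerance:
        return "slower"
    if r < 1 - tolerance:
        return "faster"
    return "unchanged"


# Three entries copied from the results above.
rows = [
    Row("batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)", 17250, 328750),
    Row("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)",
        134958, 82083.5),
    Row("dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)", 75000, 75333),
]

labels = [classify(row) for row in rows]
```

Under these assumptions the three sample entries classify as faster, slower, and unchanged respectively. Sub-microsecond entries such as the small batchnorm cases are likely dominated by timer noise, so a wider tolerance may be appropriate there.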
132619c
Registration pull request created: JuliaRegistries/General/119938
Tip: Release Notes
Did you know you can add release notes too? Just add Markdown-formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via: