docs: add Reactant and TPU to autodiff.md (#1101)
* Add Reactant to autodiff.md
* Update autodiff.md
* Update autodiff.md
* Apply suggestions from code review

---------

Co-authored-by: Avik Pal <[email protected]>
Showing 1 changed file with 29 additions and 18 deletions.
d755929
Lux Benchmarks
| Benchmark suite | Current | Previous | Ratio |
|-|-|-|-|
| `layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)` | 4083 ns | 4125 ns | 0.99 |
| `layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)` | 4458 ns | 4083.5 ns | 1.09 |
| `layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)` | 4583 ns | 5167 ns | 0.89 |
| `layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)` | 4458 ns | 4250 ns | 1.05 |
| `layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA` | 61537 ns | 60836 ns | 1.01 |
| `layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)` | 9958 ns | 10458 ns | 0.95 |
| `layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)` | 11083 ns | 10208.5 ns | 1.09 |
| `layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)` | 10125 ns | 10333 ns | 0.98 |
| `layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)` | 10292 ns | 10292 ns | 1 |
| `layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA` | 428120 ns | 426426 ns | 1.00 |
| `bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)` | 1208 ns | 1000 ns | 1.21 |
| `bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)` | 1333 ns | 1291 ns | 1.03 |
| `bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)` | 1333 ns | 1437.5 ns | 0.93 |
| `bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)` | 1042 ns | 1208 ns | 0.86 |
| `bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA` | 17813 ns | 17928 ns | 0.99 |
| `bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)` | 3959 ns | 4125 ns | 0.96 |
| `bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)` | 4042 ns | 4084 ns | 0.99 |
| `bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)` | 4375 ns | 4167 ns | 1.05 |
| `bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)` | 4000 ns | 3958 ns | 1.01 |
| `bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA` | 110308 ns | 109688.5 ns | 1.01 |
| `batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)` | 57500 ns | 57625 ns | 1.00 |
| `batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)` | 38333 ns | 38333 ns | 1 |
| `batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)` | 46625 ns | 46792 ns | 1.00 |
| `batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)` | 82166 ns | 81167 ns | 1.01 |
| `batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA` | 36705 ns | 37191 ns | 0.99 |
| `batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)` | 2027541 ns | 2025916.5 ns | 1.00 |
| `batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)` | 2090041.5 ns | 2084833.5 ns | 1.00 |
| `batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)` | 2097083 ns | 2091333 ns | 1.00 |
| `batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)` | 1999875 ns | 1993604 ns | 1.00 |
| `batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA` | 195283 ns | 194623 ns | 1.00 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)` | 143625 ns | 144416 ns | 0.99 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)` | 143417 ns | 147520.5 ns | 0.97 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)` | 145584 ns | 144062.5 ns | 1.01 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)` | 147187.5 ns | 144041 ns | 1.02 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA` | 166525 ns | 165620 ns | 1.01 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)` | 1109542 ns | 1116375.5 ns | 0.99 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)` | 1126812.5 ns | 1135458 ns | 0.99 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)` | 1122083 ns | 1116021 ns | 1.01 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)` | 1020645.5 ns | 1117250 ns | 0.91 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA` | 533338 ns | 525200 ns | 1.02 |
| `layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)` | 3375 ns | 3583 ns | 0.94 |
| `layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)` | 3416 ns | 3416 ns | 1 |
| `layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)` | 4541 ns | 4417 ns | 1.03 |
| `layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)` | 3604.5 ns | 3750 ns | 0.96 |
| `layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA` | 68868.5 ns | 67680 ns | 1.02 |
| `layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)` | 9292 ns | 9083 ns | 1.02 |
| `layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)` | 9542 ns | 9042 ns | 1.06 |
| `layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)` | 9792 ns | 9291 ns | 1.05 |
| `layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)` | 8833 ns | 8750 ns | 1.01 |
| `layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA` | 494765.5 ns | 488913 ns | 1.01 |
| `groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)` | 15583 ns | 16583.5 ns | 0.94 |
| `groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)` | 16458 ns | 15000 ns | 1.10 |
| `groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)` | 16500 ns | 16937.5 ns | 0.97 |
| `groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)` | 15083 ns | 14521 ns | 1.04 |
| `groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA` | 54721 ns | 55104 ns | 0.99 |
| `groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)` | 212833 ns | 215166.5 ns | 0.99 |
| `groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)` | 215167 ns | 213375 ns | 1.01 |
| `groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)` | 214416 ns | 212833 ns | 1.01 |
| `groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)` | 212417 ns | 213208 ns | 1.00 |
| `groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA` | 274119.5 ns | 272083 ns | 1.01 |
| `bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)` | 583 ns | 542 ns | 1.08 |
| `bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)` | 792 ns | 625 ns | 1.27 |
| `bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)` | 750 ns | 687.5 ns | 1.09 |
| `bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)` | 583 ns | 542 ns | 1.08 |
| `bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA` | 17270 ns | 17338 ns | 1.00 |
| `bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)` | 1667 ns | 1583 ns | 1.05 |
| `bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)` | 1667 ns | 1666 ns | 1.00 |
| `bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)` | 1458 ns | 1708 ns | 0.85 |
| `bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)` | 1708 ns | 1625 ns | 1.05 |
| `bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA` | 103124 ns | 102756.5 ns | 1.00 |
| `batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)` | 7291 ns | 7083 ns | 1.03 |
| `batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)` | 5292 ns | 5292 ns | 1 |
| `batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)` | 5958 ns | 5875 ns | 1.01 |
| `batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)` | 9917 ns | 10083 ns | 0.98 |
| `batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA` | 23563 ns | 23408 ns | 1.01 |
| `batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)` | 220708 ns | 221750 ns | 1.00 |
| `batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)` | 236874.5 ns | 231917 ns | 1.02 |
| `batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)` | 228875 ns | 228875 ns | 1 |
| `batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)` | 220166 ns | 214167 ns | 1.03 |
| `batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA` | 169828.5 ns | 169815.5 ns | 1.00 |
| `dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)` | 3875 ns | 3875 ns | 1 |
| `dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)` | 3917 ns | 3917 ns | 1 |
| `dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)` | 3916 ns | 3917 ns | 1.00 |
| `dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)` | 3875 ns | 3917 ns | 0.99 |
| `dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA` | 23299 ns | 23411 ns | 1.00 |
| `dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)` | 16708 ns | 16583.5 ns | 1.01 |
| `dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)` | 16833 ns | 16459 ns | 1.02 |
| `dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)` | 16834 ns | 16709 ns | 1.01 |
| `dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)` | 16667 ns | 16791 ns | 0.99 |
| `dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA` | 162920 ns | 162393 ns | 1.00 |
| `dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)` | 574791 ns | 569208 ns | 1.01 |
| `dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)` | 578334 ns | 569667 ns | 1.02 |
| `dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)` | 574000 ns | 570125 ns | 1.01 |
| `dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)` | 574333 ns | 578750 ns | 0.99 |
| `dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA` | 113504 ns | 113197 ns | 1.00 |
| `dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)` | 1420083 ns | 1418708 ns | 1.00 |
| `dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)` | 1415750 ns | 1421583 ns | 1.00 |
| `dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)` | 1420208 ns | 1420834 ns | 1.00 |
| `dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)` | 1425187.5 ns | 1432291 ns | 1.00 |
| `dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA` | 212199 ns | 211123.5 ns | 1.01 |
| `lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)` | 1067895.5 ns | 1076625 ns | 0.99 |
| `lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)` | 940416 ns | 938625 ns | 1.00 |
| `lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)` | 1346520.5 ns | 1353166 ns | 1.00 |
| `lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)` | 1295333 ns | 1298500 ns | 1.00 |
| `lenet(28, 28, 1, 64)/forward/GPU/CUDA` | 276087 ns | 277930.5 ns | 0.99 |
| `lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)` | 6005792 ns | 5845333 ns | 1.03 |
| `lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)` | 4619125 ns | 4593146 ns | 1.01 |
| `lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)` | 4921458.5 ns | 4960354 ns | 0.99 |
| `lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)` | 5705500 ns | 5524145.5 ns | 1.03 |
| `lenet(28, 28, 1, 64)/zygote/GPU/CUDA` | 1093586 ns | 1090079 ns | 1.00 |
| `dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)` | 542 ns | 500 ns | 1.08 |
| `dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)` | 542 ns | 542 ns | 1 |
| `dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)` | 542 ns | 542 ns | 1 |
| `dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)` | 542 ns | 500 ns | 1.08 |
| `dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA` | 23336 ns | 23601.5 ns | 0.99 |
| `dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)` | 2125 ns | 2083 ns | 1.02 |
| `dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)` | 2167 ns | 2125 ns | 1.02 |
| `dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)` | 2166 ns | 2209 ns | 0.98 |
| `dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)` | 2209 ns | 2083 ns | 1.06 |
| `dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA` | 170662.5 ns | 169946.5 ns | 1.00 |
| `layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)` | 4083 ns | 3666 ns | 1.11 |
| `layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)` | 4250 ns | 4417 ns | 0.96 |
| `layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)` | 5250 ns | 4709 ns | 1.11 |
| `layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)` | 4250 ns | 4500 ns | 0.94 |
| `layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA` | 66890.5 ns | 65407 ns | 1.02 |
| `layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)` | 11166 ns | 10834 ns | 1.03 |
| `layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)` | 11750 ns | 11292 ns | 1.04 |
| `layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)` | 11792 ns | 11667 ns | 1.01 |
| `layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)` | 11145.5 ns | 10958 ns | 1.02 |
| `layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA` | 455730.5 ns | 453534 ns | 1.00 |
| `groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)` | 6708 ns | 6167 ns | 1.09 |
| `groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)` | 6917 ns | 7479.5 ns | 0.92 |
| `groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)` | 8000 ns | 8500 ns | 0.94 |
| `groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)` | 6833 ns | 6375 ns | 1.07 |
| `groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA` | 53251 ns | 52550.5 ns | 1.01 |
| `groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)` | 17646 ns | 16583 ns | 1.06 |
| `groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)` | 17687.5 ns | 17500 ns | 1.01 |
| `groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)` | 17583 ns | 19833 ns | 0.89 |
| `groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)` | 18520.5 ns | 16625 ns | 1.11 |
| `groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA` | 303857.5 ns | 303262 ns | 1.00 |
| `batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)` | 583 ns | 542 ns | 1.08 |
| `batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)` | 625 ns | 667 ns | 0.94 |
| `batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)` | 667 ns | 625 ns | 1.07 |
| `batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)` | 625 ns | 542 ns | 1.15 |
| `batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA` | 32349 ns | 31843 ns | 1.02 |
| `batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)` | 8500 ns | 8542 ns | 1.00 |
| `batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)` | 9458 ns | 8875 ns | 1.07 |
| `batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)` | 9291 ns | 9250 ns | 1.00 |
| `batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)` | 9375 ns | 8208 ns | 1.14 |
| `batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA` | 158134 ns | 159642 ns | 0.99 |
| `dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)` | 64375 ns | 64792 ns | 0.99 |
| `dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)` | 64500 ns | 64542 ns | 1.00 |
| `dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)` | 64542 ns | 64542 ns | 1 |
| `dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)` | 64417 ns | 64375 ns | 1.00 |
| `dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA` | 111051 ns | 111120 ns | 1.00 |
| `dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)` | 280917 ns | 280042 ns | 1.00 |
| `dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)` | 285417 ns | 291791 ns | 0.98 |
| `dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)` | 280750 ns | 279250 ns | 1.01 |
| `dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)` | 279291.5 ns | 277208 ns | 1.01 |
| `dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA` | 185526.5 ns | 184735.5 ns | 1.00 |
| `mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)` | 3281750 ns | 3278875 ns | 1.00 |
| `mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)` | 2797500 ns | 2813375 ns | 0.99 |
| `mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)` | 3018917 ns | 3029687.5 ns | 1.00 |
| `mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)` | 4088625 ns | 3938209 ns | 1.04 |
| `mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA` | 571296 ns | 578907.5 ns | 0.99 |
| `mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)` | 7642500 ns | 7620083 ns | 1.00 |
| `mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)` | 7291354 ns | 7352417 ns | 0.99 |
| `mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)` | 7449292 ns | 7457271 ns | 1.00 |
| `mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)` | 8096333 ns | 8189500 ns | 0.99 |
| `mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA` | 1326986 ns | 1328385 ns | 1.00 |
| `mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)` | 17512333 ns | 17561125 ns | 1.00 |
| `mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)` | 17557479.5 ns | 17648625 ns | 0.99 |
| `mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)` | 17568792 ns | 17534459 ns | 1.00 |
| `mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)` | 14165000 ns | 14095167 ns | 1.00 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)` | 23618750 ns | 23588417 ns | 1.00 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)` | 43411666 ns | 44459541 ns | 0.98 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)` | 37050562 ns | 37064416.5 ns | 1.00 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)` | 34914229.5 ns | 34977333.5 ns | 1.00 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA` | 1853387 ns | 1845684 ns | 1.00 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)` | 187623875 ns | 189659041 ns | 0.99 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)` | 247457083 ns | 250146875 ns | 0.99 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)` | 194208333 ns | 193409375 ns | 1.00 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)` | 434785500 ns | 434181959 ns | 1.00 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA` | 13912861.5 ns | 18049039.5 ns | 0.77 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)` | 289468416 ns | 290672125 ns | 1.00 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)` | 350360437.5 ns | 356317062.5 ns | 0.98 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)` | 297011958 ns | 296289666.5 ns | 1.00 |
| `Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)` | 409128187.5 ns | 392800437.5 ns | 1.04 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)` | 24042 ns | 22875 ns | 1.05 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)` | 23958 ns | 22938 ns | 1.04 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)` | 23916 ns | 24562.5 ns | 0.97 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)` | 22020.5 ns | 24416 ns | 0.90 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA` | 96407 ns | 96194.5 ns | 1.00 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)` | 103208.5 ns | 103875 ns | 0.99 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)` | 104791 ns | 103416 ns | 1.01 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)` | 104667 ns | 104292 ns | 1.00 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)` | 103417 ns | 103125.5 ns | 1.00 |
| `layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA` | 511501 ns | 506291.5 ns | 1.01 |
| `layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)` | 6145.5 ns | 5917 ns | 1.04 |
| `layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)` | 5625 ns | 6000 ns | 0.94 |
| `layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)` | 6979.5 ns | 6584 ns | 1.06 |
| `layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)` | 5709 ns | 6209 ns | 0.92 |
| `layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA` | 69596.5 ns | 68552.5 ns | 1.02 |
| `layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)` | 14520.5 ns | 15166.5 ns | 0.96 |
| `layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)` | 15542 ns | 15500 ns | 1.00 |
| `layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)` | 16125 ns | 15542 ns | 1.04 |
| `layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)` | 14625 ns | 14958 ns | 0.98 |
| `layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA` | 479202.5 ns | 480464 ns | 1.00 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)` | 3041750 ns | 2996875 ns | 1.01 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)` | 2066041.5 ns | 2072750 ns | 1.00 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)` | 2266312 ns | 2257667 ns | 1.00 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)` | 4490041.5 ns | 4838583 ns | 0.93 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA` | 590463 ns | 584192 ns | 1.01 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)` | 23486917 ns | 23549437 ns | 1.00 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)` | 18259854 ns | 18342167 ns | 1.00 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)` | 17822021 ns | 17896791 ns | 1.00 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)` | 35704478.5 ns | 35570625 ns | 1.00 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA` | 2768088 ns | 2764116 ns | 1.00 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)` | 33321020.5 ns | 33587937.5 ns | 0.99 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)` | 28000312.5 ns | 28029333 ns | 1.00 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)` | 28560333.5 ns | 28377209 ns | 1.01 |
| `Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)` | 41618958 ns | 41334187.5 ns | 1.01 |
| `layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)` | 72209 ns | 75479 ns | 0.96 |
| `layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)` | 81645.5 ns | 73958.5 ns | 1.10 |
| `layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)` | 74917 ns | 74125 ns | 1.01 |
| `layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)` | 72396 ns | 72166 ns | 1.00 |
| `layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA` | 105122.5 ns | 104339 ns | 1.01 |
| `layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)` | 278083 ns | 203458.5 ns | 1.37 |
| `layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)` | 314375 ns | 280916.5 ns | 1.12 |
| `layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)` | 208562.5 ns | 209583 ns | 1.00 |
| `layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)` | 241750.5 ns | 216291.5 ns | 1.12 |
| `layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA` | 565906 ns | 562778.5 ns | 1.01 |
| `layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)` | 11417 ns | 11708 ns | 0.98 |
| `layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)` | 11833.5 ns | 12833 ns | 0.92 |
| `layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)` | 12250 ns | 13042 ns | 0.94 |
| `layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)` | 12166 ns | 11917 ns | 1.02 |
| `layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA` | 73969 ns | 72705 ns | 1.02 |
| `layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)` | 26125 ns | 26645.5 ns | 0.98 |
| `layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)` | 27542 ns | 26458 ns | 1.04 |
| `layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)` | 26708 ns | 27458 ns | 0.97 |
| `layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)` | 26708 ns | 26792 ns | 1.00 |
| `layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA` | 488459.5 ns | 488247 ns | 1.00 |
| `groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)` | 12208 ns | 12000 ns | 1.02 |
| `groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)` | 13084 ns | 13750 ns | 0.95 |
| `groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)` | 13833 ns | 14000 ns | 0.99 |
| `groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)` | 12687.5 ns | 12500 ns | 1.01 |
| `groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA` | 55593 ns | 55166 ns | 1.01 |
| `groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)` | 25333 ns | 25583 ns | 0.99 |
| `groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)` | 26458 ns | 26416 ns | 1.00 |
| `groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)` | 26250 ns | 26375 ns | 1.00 |
| `groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)` | 26458 ns | 28167 ns | 0.94 |
| `groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA` | 314229 ns | 313572.5 ns | 1.00 |
| `groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)` | 181625 ns | 181541.5 ns | 1.00 |
| `groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)` | 180250 ns | 181104 ns | 1.00 |
| `groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)` | 183667 ns | 181895.5 ns | 1.01 |
| `groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)` | 179417 ns | 181916 ns | 0.99 |
| `groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA` | 58869 ns | 59339.5 ns | 0.99 |
| `groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)` | 587667 ns | 612417 ns | 0.96 |
| `groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)` | 585625 ns | 590459 ns | 0.99 |
| `groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)` | 583584 ns | 583541 ns | 1.00 |
| `groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)` | 584708 ns | 582416 ns | 1.00 |
| `groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA` | 294563.5 ns | 294347 ns | 1.00 |
| `layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)` | 5395.5 ns | 5854.5 ns | 0.92 |
| `layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)` | 6042 ns | 7000 ns | 0.86 |
| `layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)` | 8416.5 ns | 7167 ns | 1.17 |
| `layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)` | 8791 ns | 6042 ns | 1.45 |
| `layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA` | 73281.5 ns | 72861 ns | 1.01 |
| `layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)` | 13834 ns | 14208.5 ns | 0.97 |
| `layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)` | 15125 ns | 14333 ns | 1.06 |
| `layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)` | 14333 ns | 15084 ns | 0.95 |
| `layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)` | 14250 ns | 14208 ns | 1.00 |
| `layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA` | 478456 ns | 476457 ns | 1.00 |
| `batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)` | 1191541 ns | 1198334 ns | 0.99 |
| `batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)` | 1236750 ns | 1236458 ns | 1.00 |
| `batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)` | 1285583.5 ns | 1270167 ns | 1.01 |
| `batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)` | 1003417 ns | 1009834 ns | 0.99 |
| `batchedmm(512, Bsize=4)/forward/GPU/CUDA` | 302585 ns | 301349 ns | 1.00 |
| `batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)` | 4114354 ns | 4121104 ns | 1.00 |
| `batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)` | 4527875 ns | 4571459 ns | 0.99 |
| `batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)` | 4560333.5 ns | 4583146 ns | 1.00 |
| `batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)` | 3695000 ns | 3708333 ns | 1.00 |
| `batchedmm(512, Bsize=4)/zygote/GPU/CUDA` | 1056192.5 ns | 1054428 ns | 1.00 |
| `dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)` | 1833 ns | 1833 ns | 1 |
| `dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)` | 1875 ns | 1875 ns | 1 |
| `dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)` | 1833 ns | 1875 ns | 0.98 |
| `dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)` | 1875 ns | 1875 ns | 1 |
| `dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA` | 23824 ns | 24401 ns | 0.98 |
| `dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)` | 4875 ns | 4792 ns | 1.02 |
| `dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)` | 4917 ns | 4875 ns | 1.01 |
| `dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)` | 4917 ns | 5042 ns | 0.98 |
| `dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)` | 4959 ns | 4875 ns | 1.02 |
| `dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA` | 193428 ns | 192852.5 ns | 1.00 |
| `groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)` | 6084 ns | 5916.5 ns | 1.03 |
| `groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)` | 6166 ns | 6625 ns | 0.93 |
| `groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)` | 6917 ns | 7625 ns | 0.91 |
| `groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)` | 6209 ns | 5916 ns | 1.05 |
| `groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA` | 57953 ns | 57663 ns | 1.01 |
| `groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)` | 10333 ns | 10562.5 ns | 0.98 |
| `groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)` | 11709 ns | 11417 ns | 1.03 |
| `groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)` | 11333 ns | 12083 ns | 0.94 |
| `groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)` | 11583 ns | 10459 ns | 1.11 |
| `groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA` | 343622 ns | 339260 ns | 1.01 |
| `dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)` | 334 ns | 333 ns | 1.00 |
| `dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)` | 333 ns | 375 ns | 0.89 |
| `dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)` | 292 ns | 375 ns | 0.78 |
| `dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)` | 333 ns | 292 ns | 1.14 |
| `dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA` | 23294 ns | 23460 ns | 0.99 |
| `dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)` | 2792 ns | 2791 ns | 1.00 |
| `dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)` | 3042 ns | 2792 ns | 1.09 |
| `dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)` | 3000 ns | 2709 ns | 1.11 |
| `dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)` | 2750 ns | 2791 ns | 0.99 |
| `dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA` | 163978 ns | 162941.5 ns | 1.01 |
| `groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)` | 11375 ns | 11542 ns | 0.99 |
| `groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)` | 11666 ns | 12209 ns | 0.96 |
| `groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)` | 12500 ns | 13875 ns | 0.90 |
| `groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)` | 11542 ns | 11583 ns | 1.00 |
| `groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA` | 59566.5 ns | 59011.5 ns | 1.01 |
| `groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)` | 24459 ns | 24375 ns | 1.00 |
| `groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)` | 25042 ns | 24583 ns | 1.02 |
| `groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)` | 25083.5 ns | 25208 ns | 1.00 |
| `groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)` | 25083 ns | 24792 ns | 1.01 |
| `groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA` | 305262.5 ns | 303188 ns | 1.01 |
| `dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)` | 4208 ns | 4167 ns | 1.01 |
| `dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)` | 4167 ns | 4208 ns | 0.99 |
| `dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)` | 4209 ns | 4208 ns | 1.00 |
| `dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)` | 4208 ns | 4208 ns | 1 |
| `dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA` | 25152 ns | 25111 ns | 1.00 |
| `dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)` | 16083 ns | 16042 ns | 1.00 |
| `dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)` | 16042 ns | 15917 ns | 1.01 |
| `dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)` | 16375 ns | 16291 ns | 1.01 |
| `dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)` | 16167 ns | 16291 ns | 0.99 |
| `dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA` | 203575 ns | 202144.5 ns | 1.01 |
| `batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)` | 5750 ns | 5875 ns | 0.98 |
| `batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)` | 5833 ns | 5833 ns | 1 |
| `batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)` | 5875 ns | 5916 ns | 0.99 |
| `batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)` | 5875 ns | 5750 ns | 1.02 |
| `batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA` | 34167 ns | 34056 ns | 1.00 |
| `batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)` | 20291.5 ns | 20520.5 ns | 0.99 |
| `batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)` | 21041 ns | 21000 ns | 1.00 |
| `batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)` | 21500 ns | 21167 ns | 1.02 |
| `batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)` | 21167 ns | 21333 ns | 0.99 |
| `batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA` | 180386.5 ns | 179609.5 ns | 1.00 |
| `batchedmm(16, Bsize=512)/forward/CPU/2 thread(s)` | 420667 ns | 425458.5 ns | 0.99 |
| `batchedmm(16, Bsize=512)/forward/CPU/4 thread(s)` | 363520.5 ns | 364854.5 ns | 1.00 |
| `batchedmm(16, Bsize=512)/forward/CPU/8 thread(s)` | 482000 ns | 482520.5 ns | 1.00 |
| `batchedmm(16, Bsize=512)/forward/CPU/1 thread(s)` | 125291.5 ns | 103125 ns | 1.21 |
| `batchedmm(16, Bsize=512)/forward/GPU/CUDA` | 67480 ns | 67737 ns | 1.00 |
| `batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s)` | 897041 ns | 906625 ns | 0.99 |
| `batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s)` | 967000.5 ns | 982042 ns | 0.98 |
| `batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s)` | 1167958 ns | 1181333 ns | 0.99 |
| `batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s)` | 396500 ns | 377458 ns | 1.05 |
| `batchedmm(16, Bsize=512)/zygote/GPU/CUDA` | 197078.5 ns | 194135 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80125 ns | 81333 ns | 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81020.5 ns | 82041 ns | 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82625 ns | 84291 ns | 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83458 ns | 81813 ns | 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194831 ns | 194522 ns | 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1694000 ns | 1927625 ns | 0.88
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1917291.5 ns | 1941000 ns | 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1931459 ns | 1930917 ns | 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1896062.5 ns | 1842062 ns | 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 416256.5 ns | 390656 ns | 1.07
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22312 ns | 22388 ns | 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1833 ns | 1792 ns | 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1834 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1875 ns | 1833 ns | 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 176862.5 ns | 171479 ns | 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6208 ns | 6542 ns | 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6875 ns | 7083.5 ns | 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7750 ns | 8020.5 ns | 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7000 ns | 6500 ns | 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 62506 ns | 60274 ns | 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 8917 ns | 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9250 ns | 9417 ns | 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9417 ns | 9916 ns | 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9333 ns | 9208 ns | 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 325531 ns | 311149 ns | 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 121103854.5 ns | 120884833.5 ns | 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181392229 ns | 181722750 ns | 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147959958.5 ns | 148231625 ns | 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 103681750 ns | 108144417 ns | 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5500074 ns | 5478841 ns | 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 613086875 ns | 615355583.5 ns | 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 578493750 ns | 581447666.5 ns | 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 454857041.5 ns | 451634708.5 ns | 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 752941812.5 ns | 757933250.5 ns | 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35077599 ns | 34994190 ns | 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 649102417 ns | 649420209 ns | 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 685608520.5 ns | 687787021 ns | 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 589011249.5 ns | 584232000.5 ns | 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 739858625 ns | 744942000 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59500 ns | 59500 ns | 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38708 ns | 39125 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 48000 ns | 48020.5 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82708 ns | 83458 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38528 ns | 38331 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1741292 ns | 1946625 ns | 0.89
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1966416 ns | 1985458 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1984416 ns | 1983521 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1859270.5 ns | 1887334 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 177396 ns | 176268 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 271125 ns | 265750 ns | 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 274250 ns | 268104.5 ns | 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 268416 ns | 269291.5 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 267791.5 ns | 265125 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 137600.5 ns | 125359 ns | 1.10
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 587833 ns | 690208 ns | 0.85
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 666917 ns | 658417 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 587208 ns | 603125 ns | 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 665917 ns | 594458 ns | 1.12
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 757074 ns | 701612 ns | 1.08
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2224291.5 ns | 2169417 ns | 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2235083 ns | 2237833 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2099770.5 ns | 2188625 ns | 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2218208 ns | 2203000 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 135238 ns | 133751 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5494167 ns | 5513083.5 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5547875 ns | 5572520.5 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5497792 ns | 5508208 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5395666.5 ns | 5485271 ns | 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 797087 ns | 720574 ns | 1.11
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 643250 ns | 638458 ns | 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 646958 ns | 640250 ns | 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 642375 ns | 640416 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 640208 ns | 642666.5 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47636 ns | 46893.5 ns | 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1820958 ns | 1824209 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1668166 ns | 1666417 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1721291 ns | 1728208 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2100708 ns | 2102708 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 227359.5 ns | 220656.5 ns | 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58583 ns | 58500 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38208.5 ns | 38584 ns | 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47292 ns | 46208 ns | 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82750 ns | 83042 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 29299.5 ns | 28530.5 ns | 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2023770.5 ns | 2056084 ns | 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2018000 ns | 2102729.5 ns | 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2096292 ns | 2102270.5 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1983895.5 ns | 1992792 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 191243 ns | 189031.5 ns | 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13392479 ns | 13396167 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12447084 ns | 12488625 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12573562.5 ns | 12567208 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15225667 ns | 14924083 ns | 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 515936 ns | 512412.5 ns | 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47214583.5 ns | 47267416.5 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 42007792 ns | 42078000 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40831167 ns | 40824125 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58287250 ns | 58451854 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2893597 ns | 2895350 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 73879562 ns | 74360062.5 ns | 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91062583 ns | 91413375 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90595250 ns | 90659959 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 98708500 ns | 76716041 ns | 1.29
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59041 ns | 59208 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38458 ns | 38833 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47500 ns | 47125 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83041 ns | 78625 ns | 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 46889 ns | 48139.5 ns | 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1914042 ns | 1938145.5 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1980250 ns | 1984167 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1983041.5 ns | 1977812.5 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1895208.5 ns | 1877083 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 191685.5 ns | 195830.5 ns | 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 416 ns | 333 ns | 1.25
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 31909.5 ns | 32688 ns | 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 5958 ns | 6083 ns | 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6666 ns | 6334 ns | 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6500 ns | 6666 ns | 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6667 ns | 6000 ns | 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 174339.5 ns | 173538 ns | 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 333 ns | 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31434 ns | 32105 ns | 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2625 ns | 2584 ns | 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2959 ns | 2792 ns | 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2792 ns | 2791 ns | 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2833 ns | 2584 ns | 1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 161698.5 ns | 160748.5 ns | 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 284655874.5 ns | 287049250 ns | 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 346665396 ns | 347795687.5 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314185249.5 ns | 314367979.5 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 271410834 ns | 271524458 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7071052.5 ns | 7120410.5 ns | 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 986652459 ns | 1003307875 ns | 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 960769500 ns | 964885125 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 837320313 ns | 835293000 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1160509417 ns | 1152976875 ns | 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34004605 ns | 34058870 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1311324917 ns | 1312833396 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1697266750 ns | 1706336084 ns | 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1638971166 ns | 1599191959 ns | 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1734387958.5 ns | 1309056604.5 ns | 1.32
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1414375 ns | 1408791 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1459333 ns | 1452791.5 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1417583 ns | 1449625 ns | 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1464750 ns | 1407209 ns | 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127631 ns | 128282.5 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4707666.5 ns | 5034917 ns | 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5056666.5 ns | 5065916.5 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5045625 ns | 5035937.5 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5028167 ns | 5012729 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 589690 ns | 483777.5 ns | 1.22
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 174231250 ns | 171224875 ns | 1.02
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 167491167 ns | 167755167 ns | 1.00
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 128702541 ns | 128923708 ns | 1.00
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 154878708 ns | 154904187 ns | 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4890073 ns | 4889428.5 ns | 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 622332667 ns | 621337542 ns | 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 581984000 ns | 581831583 ns | 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 496978166 ns | 460212833 ns | 1.08
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 643892875 ns | 643084792 ns | 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16065970 ns | 16318390 ns | 0.98
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8934042 ns | 8919875 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 9020375 ns | 9050687.5 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7917083 ns | 7921583 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9692542 ns | 9747084 ns | 0.99
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1603050 ns | 1600463.5 ns | 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36495271 ns | 36566209 ns | 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 38137292 ns | 38511167 ns | 0.99
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33438520.5 ns | 33595375 ns | 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37760500 ns | 37796583 ns | 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6473707 ns | 6471792 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47375 ns | 47291 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47417 ns | 47479.5 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47500 ns | 47729.5 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47542 ns | 47334 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18555 ns | 18559 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50250 ns | 50416 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50375 ns | 50417 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50667 ns | 50417 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50375 ns | 50375 ns | 1
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 207795 ns | 167009.5 ns | 1.24
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6375 ns | 6459 ns | 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7041 ns | 7770.5 ns | 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7958 ns | 8041 ns | 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7208.5 ns | 7000 ns | 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 101178.5 ns | 76373.5 ns | 1.32
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10125 ns | 10000 ns | 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10625 ns | 10458 ns | 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10625 ns | 10250 ns | 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10417 ns | 10084 ns | 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 593102.5 ns | 456260 ns | 1.30
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5708 ns | 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6208.5 ns | 6708 ns | 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6750 ns | 7458 ns | 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6084 ns | 5917 ns | 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 121281 ns | 91945.5 ns | 1.32
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12708 ns | 12917 ns | 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13541 ns | 13625 ns | 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13250 ns | 13416 ns | 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13208 ns | 13292 ns | 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 511694 ns | 417439.5 ns | 1.23
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 958 ns | 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1042 ns | 1042 ns | 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1042 ns | 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32282 ns | 32442 ns | 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7750 ns | 7542 ns | 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8042 ns | 7875 ns | 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8000 ns | 8291 ns | 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8041 ns | 7834 ns | 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 210142.5 ns | 192614 ns | 1.09
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23166 ns | 23250 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23209 ns | 23250 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23250 ns | 23416 ns | 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23104.5 ns | 23292 ns | 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18312 ns | 18706.5 ns | 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52416 ns | 52417 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52542 ns | 52625 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52709 ns | 52959 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52625 ns | 52875 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 291833.5 ns | 226057.5 ns | 1.29
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1400833 ns | 1403937.5 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1445959 ns | 1409291.5 ns | 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1396833 ns | 1405208 ns | 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1398917 ns | 1402896 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 197117.5 ns | 196688.5 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5008208 ns | 5027625 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5030250 ns | 5036500.5 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5026354 ns | 5008875 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4996437.5 ns | 5003083.5 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 600264 ns | 565308 ns | 1.06
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3038708 ns | 3058166 ns | 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2105979 ns | 2060229 ns | 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2274062.5 ns | 2301833 ns | 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4858083 ns | 4897625 ns | 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 586328 ns | 586278 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24399625 ns | 24473708.5 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19072583.5 ns | 19098958 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18904750 ns | 18981042 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36638687.5 ns | 37019125 ns | 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2819518 ns | 2831934 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33955417 ns | 34098417 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28785062.5 ns | 28724166.5 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28141333 ns | 28239458 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41707708.5 ns | 41378063 ns | 1.01
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 142540583 ns | 146235958 ns | 0.97
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 146733875 ns | 147965500 ns | 0.99
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 125527687.5 ns | 127304667 ns | 0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 174248667 ns | 172673353.5 ns | 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22566115 ns | 22564119 ns | 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 968276062.5 ns | 1235304437.5 ns | 0.78
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 860326354.5 ns | 869077229.5 ns | 0.99
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 858659167 ns | 769904041 ns | 1.12
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 683117959 ns | 666199333 ns | 1.03
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118099274 ns | 118146881 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 72375 ns | 73812 ns | 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 74000 ns | 73875 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76250 ns | 75687.5 ns | 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 73208 ns | 76416 ns | 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 235570 ns | 208579 ns | 1.13
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 203292 ns | 295500 ns | 0.69
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 282896 ns | 193958 ns | 1.46
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 203583 ns | 287395.5 ns | 0.71
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 207583 ns | 282729 ns | 0.73
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1260670 ns | 1165959 ns | 1.08
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35143208 ns | 35776083 ns | 0.98
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36705709 ns | 36529041 ns | 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32591958.5 ns | 32581292 ns | 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40607646 ns | 40338396 ns | 1.01
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5841170.5 ns | 5849817 ns | 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148155791.5 ns | 148302541 ns | 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 158417083.5 ns | 158881084 ns | 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 137765333 ns | 138956354.5 ns | 0.99
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 283770667 ns | 284123584 ns | 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34905958 ns | 34596502 ns | 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120795375 ns | 120211625 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181579562.5 ns | 182136458 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148004834 ns | 148062084 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 108061458.5 ns | 105814875 ns | 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5466179.5 ns | 5475710.5 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 468909791.5 ns | 469150645.5 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 485490958.5 ns | 486184250 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 438520417 ns | 437949792 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 742778708 ns | 739059333 ns | 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32266057 ns | 32333012 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 707166333 ns | 712730687.5 ns | 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 671742104.5 ns | 678064125 ns | 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 577648896 ns | 570651646 ns | 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 734518917 ns | 732192500 ns | 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1349520.5 ns | 1338854 ns | 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 780417 ns | 764333 ns | 1.02
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 909417 ns | 971166 ns | 0.94
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2087500 ns | 2047291 ns | 1.02
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 566986 ns | 582645.5 ns | 0.97
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2979167 ns | 2995792 ns | 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2496208 ns | 2516000 ns | 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2619166 ns | 2623541.5 ns | 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3728333 ns | 3683208 ns | 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1738136 ns | 1752698 ns | 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5799875 ns | 5821709 ns | 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5883292 ns | 5892750 ns | 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5800167 ns | 5806979 ns | 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2892541.5 ns | 2887229 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7500 ns | 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5292 ns | 5333 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6208 ns | 6042 ns | 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 10041 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25118 ns | 25775 ns | 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212333 ns | 225958.5 ns | 0.94
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221583 ns | 220750 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220562.5 ns | 220625 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215896 ns | 206167 ns | 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 262400.5 ns | 259112 ns | 1.01
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 307233708 ns | 308668791.5 ns | 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 279732584 ns | 282575646 ns | 0.99
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 198830375 ns | 199775042 ns | 1.00
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 309726917 ns | 309205458 ns | 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7656813 ns | 7688394 ns | 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1090685500 ns | 1093080750 ns | 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 1068219000 ns | 1075916375 ns | 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 818375167 ns | 810723875 ns | 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1160424021 ns | 1146255478.5 ns | 1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26548125.5 ns | 26478179 ns | 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5812.5 ns | 5042 ns | 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5708 ns | 6250 ns | 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6959 ns | 6584 ns | 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5458 ns | 5458 ns | 1
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 154820 ns | 170923.5 ns | 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7125 ns | 7333 ns | 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7708 ns | 7416 ns | 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7417 ns | 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7542 ns | 7041 ns | 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 618164 ns | 648059.5 ns | 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 583 ns | 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 541 ns | 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23615 ns | 24468 ns | 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9250 ns | 9333 ns | 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9458 ns | 9000 ns | 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9625 ns | 9729.5 ns | 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9750 ns | 8792 ns | 1.11
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 207782.5 ns | 223281 ns | 0.93
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 356333 ns | 351708 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352417 ns | 352583 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 356083 ns | 352708 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 357500.5 ns | 351416.5 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21053.5 ns | 21843 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 780146 ns | 811563 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 776312.5 ns | 793583.5 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 809375 ns | 812375 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 826750 ns | 804291 ns | 1.03 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 303323.5 ns | 279114.5 ns | 1.09 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 338396 ns | 338875 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 325208 ns | 321459 ns | 1.01 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 453375 ns | 450271 ns | 1.01 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 10542 ns | 10750 ns | 0.98 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17732 ns | 18538 ns | 0.96 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 718917 ns | 712021 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 732645.5 ns | 730333 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1009833 ns | 1002270.5 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26583 ns | 26708 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 257155 ns | 261073.5 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 374000 ns | 381875 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 331500 ns | 326167 ns | 1.02 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 441875 ns | 443625 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 30917 ns | 30417 ns | 1.02 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22404 ns | 23393 ns | 0.96 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 739437.5 ns | 731937.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 779666.5 ns | 784187.5 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1041375.5 ns | 1027875 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 104312.5 ns | 89584 ns | 1.16 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 235395 ns | 220484 ns | 1.07 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3625 ns | 3375 ns | 1.07 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3625 ns | 3708 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3625 ns | 3833 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3459 ns | 3458 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17702 ns | 17892 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4250 ns | 4292 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4250 ns | 4250 ns | 1 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4334 ns | 4333 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4375 ns | 4417 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 245299 ns | 288266.5 ns | 0.85 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3479.5 ns | 4083 ns | 0.85 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3792 ns | 4062.5 ns | 0.93 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4334 ns | 4334 ns | 1 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3709 ns | 3833 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 185222 ns | 243078.5 ns | 0.76 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8125 ns | 8417 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8687.5 ns | 8208 ns | 1.06 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8666 ns | 8583 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 8500 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1127148 ns | 1294141 ns | 0.87 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 206541 ns | 203583 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 212000 ns | 209750 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 211000 ns | 209750 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 202291 ns | 199542 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34888 ns | 35748 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 648750 ns | 610959 ns | 1.06 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 634312.5 ns | 629979 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 632771 ns | 632042 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 596417 ns | 624312.5 ns | 0.96 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 322649.5 ns | 366873 ns | 0.88 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 998333 ns | 1020270.5 ns | 0.98 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1039375 ns | 1019375 ns | 1.02 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 952083 ns | 956541 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 904292 ns | 862917 ns | 1.05 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 208498.5 ns | 208035 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4540000 ns | 4555583 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4817791.5 ns | 4847250 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4468750 ns | 4461541 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 5130375 ns | 5174375 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 959939 ns | 927061 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3875 ns | 4042 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3334 ns | 3500 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4125 ns | 4250 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3750 ns | 3375 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 197248.5 ns | 241039.5 ns | 0.82 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7645.5 ns | 7500 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 7062.5 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7292 ns | 7333 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7458 ns | 6916 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1027567 ns | 1063926.5 ns | 0.97 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1650375 ns | 1524958 ns | 1.08 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1182479.5 ns | 1178854.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1370292 ns | 1368709 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2441916.5 ns | 2362167 ns | 1.03 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215671.5 ns | 218600.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12370500 ns | 12347875 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9601667 ns | 9603708 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9328687.5 ns | 9285208.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18097145.5 ns | 17994500 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1953457 ns | 1959865.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17380125 ns | 17343125 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14471146 ns | 14424146 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14397875 ns | 14365583 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21055583 ns | 21176708 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 91125 ns | 90520.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90875 ns | 90208 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 94958 ns | 94500 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 88000 ns | 133292 ns | 0.66 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126032 ns | 126385 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2023583.5 ns | 2059229.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2028542 ns | 2014083.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2033312 ns | 2030292 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2043416.5 ns | 2020416.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1084734 ns | 1061374.5 ns | 1.02 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 3458.5 ns | 2375 ns | 1.46 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 1625 ns | 1834 ns | 0.89 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3500 ns | 3542 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 1750 ns | 2167 ns | 0.81 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15936 ns | 16672 ns | 0.96 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2584 ns | 2541 ns | 1.02 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2791 ns | 2917 ns | 0.96 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2917 ns | 2750 ns | 1.06 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2833 ns | 2792 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 195099.5 ns | 197485.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7333 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5292 ns | 5416 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6083 ns | 5958 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 9916 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33830 ns | 34400.5 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224916 ns | 213812.5 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 234875 ns | 221000 ns | 1.06 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 231083 ns | 231917 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218917 ns | 208604 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 348229.5 ns | 352524 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3750 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3708 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21982 ns | 22677 ns | 0.97 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14459 ns | 14416 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14208 ns | 14125 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14417 ns | 14500 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14584 ns | 14417 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 489892.5 ns | 511650.5 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 94917 ns | 93854 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 93416.5 ns | 97145.5 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 99875 ns | 98417 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 92625 ns | 140083 ns | 0.66 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125549 ns | 125784 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1921625 ns | 1964729 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1933333.5 ns | 1938562.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1928500 ns | 1927041.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1950604.5 ns | 1920667 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 964756 ns | 1039090 ns | 0.93 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 873521 ns | 877500 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 804167 ns | 800812.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1218520.5 ns | 1223937 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 954959 ns | 969958 ns | 0.98 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 285492.5 ns | 285567 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2830854 ns | 2803854 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2531000 ns | 2511750 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3356083 ns | 3356541.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3412042 ns | 3428708 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1671062 ns | 1675606 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16271 ns | 15958 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16500 ns | 16562.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18666.5 ns | 17041.5 ns | 1.10 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18916 ns | 17375 ns | 1.09 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144500.5 ns | 145484 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 260708 ns | 223104 ns | 1.17 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 254749.5 ns | 222896 ns | 1.14 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 227979 ns | 226708 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 226584 ns | 253167 ns | 0.89 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 650846.5 ns | 664599 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222167 ns | 221146 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222041.5 ns | 221500 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222166 ns | 221666.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 220000 ns | 221042 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 277439 ns | 276464 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 561333.5 ns | 551791 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 549000 ns | 505375 ns | 1.09 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 558813 ns | 509750 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 557729.5 ns | 508666.5 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1450310.5 ns | 1493627 ns | 0.97 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 4000 ns | 4000 ns | 1 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4166 ns | 4104.5 ns | 1.01 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 5750 ns | 4667 ns | 1.23 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 4042 ns | 4042 ns | 1 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 17089 ns | 17326 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7000 ns | 7042 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7208 ns | 7417 ns | 0.97 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7166 ns | 7250 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7542 ns | 7458 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 196929 ns | 198652.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18083 ns | 17875 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18959 ns | 18333 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19250 ns | 19750 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18124.5 ns | 17146 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 165663 ns | 230076 ns | 0.72 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222875 ns | 219250 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213896 ns | 216020.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 225792 ns | 212500 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222042 ns | 212479.5 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1029397 ns | 1050719 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4500 ns | 4500 ns | 1 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 3958 ns | 4583 ns | 0.86 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5125 ns | 4667 ns | 1.10 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4333 ns | 4583 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 204180 ns | 252077 ns | 0.81 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10917 ns | 10833 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10583 ns | 10500 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10500 ns | 10250 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10750 ns | 10250 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1058573 ns | 1102570 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3291.5 ns | 3312.5 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3542 ns | 3708 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4417 ns | 3959 ns | 1.12 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3458 ns | 3125 ns | 1.11 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 245634 ns | 243703 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7458 ns | 7229.5 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7583 ns | 7333 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7417 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7541 ns | 7209 ns | 1.05 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1074772.5 ns | 1111590.5 ns | 0.97 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23471041.5 ns | 23487541.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 43849166 ns | 43971125 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37957792 ns | 37463166.5 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34964125 ns | 34877416 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1792082 ns | 1842834.5 ns | 0.97 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184426958 ns | 184200958 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 173017604 ns | 173422437.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 147161645.5 ns | 146460271 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 411405916 ns | 410950833 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16521696 ns | 16526176 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 426004833.5 ns | 425975000 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 259123250 ns | 259298209 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 296958750 ns | 296349208.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 480245750 ns | 479307000 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183042 ns | 183167 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 185188 ns | 183917 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 186041.5 ns | 185291.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 184333.5 ns | 183708.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 226412 ns | 232992 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 597750 ns | 588709 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 598229 ns | 595709 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 632895.5 ns | 596042 ns | 1.06 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 586958 ns | 597500 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1097502 ns | 1113560 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3838542 ns | 4043292 ns | 0.95 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 4115979 ns | 4012396 ns | 1.03 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3571292 ns | 3557000 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4600166.5 ns | 4569124.5 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 534974 ns | 531536 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17343875 ns | 17494562.5 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 18514250 ns | 18560917 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16537292 ns | 16622646 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 20367667 ns | 20213416.5 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2795688 ns | 2619803.5 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 584 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 666 ns | 542 ns | 1.23 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32682 ns | 32024.5 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9042 ns | 9334 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9709 ns | 9291 ns | 1.04 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9666.5 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9666 ns | 9000 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 266437.5 ns | 264542.5 ns | 1.01 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 499772583 ns | 496971791 ns | 1.01 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 504959958 ns | 509285541 ns | 0.99 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 422832542 ns | 421912146 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 673427063 ns | 672227417 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 11842270.5 ns | 12489793.5 ns | 0.95 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1875482271 ns | 1883911021 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1653498000 ns | 1668824291 ns | 0.99 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1486024395.5 ns | 1489797958.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2210913770.5 ns | 2201017208.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49084588.5 ns | 49197806.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1649062.5 ns | 1600645.5 ns | 1.03 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1182584 ns | 1172708 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1392250 ns | 1388125 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2377145.5 ns | 2344958.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 218920 ns | 218458 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12688458.5 ns | 12685750 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 10001583.5 ns | 9976000 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9698792 ns | 9656709 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18502292 ns | 18427396 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2042988 ns | 2044469 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17689291 ns | 17712834 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14793041.5 ns | 14779375 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14622084 ns | 14604916 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21477583.5 ns | 21383042 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26292 ns | 26250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26250 ns | 26208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26291 ns | 26208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24105 ns | 24118 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67000 ns | 67000 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67042 ns | 66833 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67875 ns | 67500 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67208 ns | 66834 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 396461.5 ns | 410737.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204959 ns | 203917 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209958 ns | 208625 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209875 ns | 209084 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199833 ns | 199500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26682 ns | 27195 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 646208 ns | 625958.5 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 670000 ns | 629916 ns | 1.06 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 644166 ns | 632125 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630416 ns | 600062.5 ns | 1.05 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 354787 ns | 358637.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 598417 ns | 658417 ns | 0.91 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 657292 ns | 641625 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 664187.5 ns | 647542 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 659708 ns | 666291.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132717 ns | 132681.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2235958 ns | 2274708 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2279125 ns | 2300125 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2249833 ns | 2238125 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2316042 ns | 2241291 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1193695.5 ns | 1242340 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18500 ns | 18020.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19250 ns | 18292 ns | 1.05 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19292 ns | 20250 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17500 ns | 17500 ns | 1 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 146082 ns | 146876.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 259917 ns | 231458 ns | 1.12 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 259625 ns | 227333.5 ns | 1.14 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230208.5 ns | 227500 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 256708 ns | 229792 ns | 1.12 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1005431.5 ns | 1067171 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 667 ns | 0.88 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 542 ns | 1.15 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23900 ns | 23878 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9750 ns | 9833 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10333 ns | 9875 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10000 ns | 10000 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10000 ns | 9541 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 259163 ns | 263281 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5833 ns | 5687.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5916 ns | 6208 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6459 ns | 7125 ns | 0.91 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5833 ns | 5417 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 228223.5 ns | 235834 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7250 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7666 ns | 8042 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7666 ns | 7541.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7333 ns | 6979.5 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 770644 ns | 811982.5 ns | 0.95 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2333 ns | 2125 ns | 1.10 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2187.5 ns | 2312.5 ns | 0.95 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2292 ns | 2500 ns | 0.92 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2250 ns | 2125 ns | 1.06 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17986 ns | 18261 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6500 ns | 6375 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6666 ns | 6520.5 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6666 ns | 6708 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6625 ns | 6375 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 321059 ns | 336632.5 ns | 0.95 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s)
749208.5
ns749209
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s)
748958
ns748895.5
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s)
750125
ns749542
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s)
748834
ns754083
ns0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA
21410
ns21329
ns1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s)
798125
ns818750
ns0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s)
791208
ns788167
ns1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s)
837729.5
ns791584
ns1.06
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s)
775270.5
ns790584
ns0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA
301663.5
ns299791
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7417
ns7500
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5291
ns5334
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6042
ns5916
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10292
ns10208
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
33301
ns33718
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
232896
ns256167
ns0.91
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
268833.5
ns235520.5
ns1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
267354.5
ns240500
ns1.11
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
215500
ns250875
ns0.86
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
361937
ns365654
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
10000
ns10312.5
ns0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
9833.5
ns10416
ns0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
11042
ns10812.5
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
10333
ns10166.5
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
250034.5
ns245731
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24334
ns25083
ns0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
25250
ns24667
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
24542
ns24125
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24334
ns24500
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
1111417.5
ns1139764
ns0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
106812374.5
ns106439229
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
126726167
ns127176500
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
121727417
ns120453645.5
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
118228479
ns117602312.5
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
2616848
ns2646453
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
391804291
ns394264417
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
379056792
ns380211666
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
355535666
ns421708312.5
ns0.84
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
486452916
ns479818917
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
15186296
ns15158878
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
756685666.5
ns756832624.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
774854291
ns775894292
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
746786813
ns748243271.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
947077458
ns761933208.5
ns1.24
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
8416
ns7145.5
ns1.18
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7125
ns7834
ns0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8125
ns9541
ns0.85
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
9604
ns7417
ns1.29
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
240976
ns241749
ns1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14250
ns14291.5
ns1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14291
ns14166
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14167
ns14167
ns1
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14166
ns13708
ns1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
1095523
ns1098247
ns1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5917
ns6042
ns0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6125
ns6750
ns0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
6687.5
ns7083
ns0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6292
ns5834
ns1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
239291
ns240471.5
ns1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12583
ns12667
ns0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
13125
ns13333
ns0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
13291.5
ns13354.5
ns1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12417
ns12334
ns1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
797358.5
ns800476.5
ns1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s)
5459
ns5333
ns1.02
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s)
5833
ns5875
ns0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s)
7000
ns6000
ns1.17
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s)
5542
ns5500
ns1.01
batchedmm(2, Bsize=128)/forward/GPU/CUDA
16938
ns17559
ns0.96
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s)
15500
ns15459
ns1.00
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s)
15458
ns15437.5
ns1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s)
15666
ns15667
ns1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s)
15875
ns15750
ns1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA
200590
ns202574
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
417
ns417
ns1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
375
ns333
ns1.13
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
416
ns416
ns1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
416
ns292
ns1.42
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
23824
ns24102
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6583
ns6333
ns1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6500
ns6209
ns1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6666.5
ns6750
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6583
ns6333
ns1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
239979.5
ns242831.5
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5875
ns5916
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6000
ns5917
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
5917
ns5958
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5958
ns5792
ns1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
24627
ns25033
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
20916.5
ns21375
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
21209
ns21125
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
21833
ns21375
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21292
ns21020.5
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
265615.5
ns267836
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
192687.5
ns144833
ns1.33
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
146521
ns145250
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
149374.5
ns150083.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
142250
ns188375
ns0.76
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
168462.5
ns168310
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1318667
ns1351833
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1326875
ns1369333
ns0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1328208
ns1322041
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1311167
ns1327250
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1370856
ns1368007
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
22125
ns23042
ns0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22083
ns24041
ns0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
24209
ns24917
ns0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
24417
ns21833
ns1.12
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
357178
ns356401.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
130958
ns126958
ns1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
180395.5
ns120333
ns1.50
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
130875
ns180250
ns0.73
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
178917
ns180749.5
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1498842
ns1484885
ns1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
334
ns375
ns0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
375
ns333
ns1.13
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
375
ns292
ns1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
23528
ns23370
ns1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6417
ns6479.5
ns0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6791
ns6416
ns1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6834
ns7042
ns0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6792
ns6333
ns1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
258073.5
ns260419.5
ns0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4500
ns4333
ns1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5250
ns5041.5
ns1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5125
ns5459
ns0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4667
ns4583.5
ns1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
256140
ns255220
ns1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10000
ns10166.5
ns0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10416
ns10167
ns1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10292
ns10250
ns1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10208
ns10208
ns1
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
1357774
ns1368092
ns0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1625
ns1625
ns1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1625
ns1666
ns0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1625
ns1625
ns1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1625
ns1583
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA
23069
ns23227
ns0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
5750
ns5750
ns1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
6084
ns5750
ns1.06
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
5917
ns5750
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
5667
ns5583
ns1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA
275859
ns278026
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s)
6814167
ns6781854.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s)
6368854.5
ns6363854.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s)
6497917
ns6534166
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s)
7560667
ns7654958.5
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA
215030
ns216771
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s)
24038396
ns24093667
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s)
21318250
ns21335604
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s)
21055625
ns21037958
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s)
29800458
ns29730292
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA
2117334
ns2100300
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s)
37406895.5
ns37311042
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s)
45481041
ns45649479
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s)
45606750
ns45692458
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s)
49407375
ns38098959
ns1.30
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
6375
ns5520.5
ns1.15
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6208
ns6708.5
ns0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
7292
ns7250
ns1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5916
ns6125
ns0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
237163.5
ns240533.5
ns0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8375
ns8083
ns1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8666
ns9083
ns0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8416
ns8417
ns1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8375
ns8250
ns1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
1062411
ns1077102
ns0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s)
1544167
ns1489187.5
ns1.04
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s)
1249833.5
ns1236771
ns1.01
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s)
1625709
ns1617916
ns1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s)
2004375
ns2170020.5
ns0.92
lenet(28, 28, 1, 128)/forward/GPU/CUDA
275720
ns282849
ns0.97
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s)
7903083
ns7909229.5
ns1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s)
6659625
ns6634750
ns1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s)
7184500
ns7161708
ns1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s)
10128083
ns10483708.5
ns0.97
lenet(28, 28, 1, 128)/zygote/GPU/CUDA
1884846.5
ns1903700.5
ns0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s)
369396
ns367625
ns1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s)
353625.5
ns349896
ns1.01
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s)
456542
ns453917
ns1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s)
24041.5
ns24459
ns0.98
batchedmm(128, Bsize=4)/forward/GPU/CUDA
46544
ns43502
ns1.07
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s)
743500
ns727167
ns1.02
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s)
796417
ns803167
ns0.99
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s)
1071583
ns1057604
ns1.01
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s)
125958
ns121792
ns1.03
batchedmm(128, Bsize=4)/zygote/GPU/CUDA
312111.5
ns307546.5
ns1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s)
397375
ns397583
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s)
212250
ns213333
ns0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s)
288125
ns288209
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s)
753500
ns751125
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA
44394
ns44141
ns1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
673292
ns675500
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
472125
ns475667
ns0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
531791
ns531375
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
974625
ns972666.5
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA
191967.5
ns191213
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
657167
ns658208.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
669958.5
ns643834
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
661104
ns655125
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
662708
ns681792
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
132971.5
ns132164.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2458250
ns2526833
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2498250
ns2530541
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2467687
ns2451667
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2501875
ns2454146
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1568577
ns1206173
ns1.30
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)
4333
ns2604
ns1.66
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s)
2583
ns2459
ns1.05
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s)
4417
ns4375
ns1.01
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s)
2750
ns2583
ns1.06
batchedmm(2, Bsize=32)/forward/GPU/CUDA
16411
ns16766
ns0.98
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s)
5375
ns5333
ns1.01
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s)
5458
ns5542
ns0.98
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s)
5625
ns5583
ns1.01
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s)
5625
ns5542
ns1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA
199892.5
ns199467
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1463541
ns1459833
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1497208
ns1490334
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1503375
ns1497791
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1442834
ns1439750
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
41596
ns41167
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5109479
ns5155562
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5289042
ns5314187.5
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5301333.5
ns5282833
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4680604
ns4979791
ns0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
198982.5
ns198405.5
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3709
ns3708
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3667
ns3708
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3750
ns3708
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3709
ns3750
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA
33311
ns33352
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
15125
ns15208
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15084
ns15000
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
15250
ns15209
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
15250
ns15291
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA
376159
ns379437.5
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
71208
ns71625
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
71209
ns71416
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
71250
ns71208
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
71500
ns71083
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
112893
ns113188.5
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
317750
ns321770.5
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
323708
ns330770.5
ns0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
334166
ns319333
ns1.05
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
320500
ns326458
ns0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
195635
ns194877
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
1041
ns1000
ns1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
1083
ns1083
ns1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
1125
ns1083
ns1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
1083
ns959
ns1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
23896
ns23702
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8000
ns7917
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8625
ns8125
ns1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8333
ns8125
ns1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8167
ns7916
ns1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
263562
ns263485
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
509521
ns497624.5
ns1.02
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
479125
ns471604
ns1.02
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
564625
ns563708
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
232458.5
ns218208
ns1.07
batchedmm(128, Bsize=32)/forward/GPU/CUDA
129625
ns129739
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
1393208
ns1355292
ns1.03
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
1479000
ns1470187.5
ns1.01
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
1765792
ns1719583.5
ns1.03
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
868125
ns867375
ns1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
276144
ns275487
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
375
ns417
ns0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
375
ns333
ns1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
31637
ns31436
ns1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6334
ns6208
ns1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6625
ns6333
ns1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6625
ns6458
ns1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6667
ns6333
ns1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
263537
ns262275
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1722958.5
ns1727063
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1735250
ns1729458
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1733292
ns1725417
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1763312
ns1768875
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
169598.5
ns168537
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4353521
ns4367874.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4379875
ns4385375
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4349063
ns4367104
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4390959
ns4357459
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1422688.5
ns1262273
ns1.13
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
6938
ns6708
ns1.03
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
6875
ns6541
ns1.05
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
7166
ns7000
ns1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
6583
ns6875
ns0.96
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
20547
ns20525
ns1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
50541
ns33063
ns1.53
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
50312.5
ns33083
ns1.52
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
51250
ns48041.5
ns1.07
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
58249.5
ns53792
ns1.08
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
308428
ns291536.5
ns1.06
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
17750
ns17333.5
ns1.02
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
17875
ns17792
ns1.00
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
19125
ns18209
ns1.05
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
17500
ns17666
ns0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA
18339
ns18396
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
53375
ns53209
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
53166
ns53417
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
53250
ns53292
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
53458
ns53375
ns1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
344770
ns338706.5
ns1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s)
75459
ns75500
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)
75375
ns75417
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s)
75395.5
ns75292
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)
75458
ns75292
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA
47276
ns46489
ns1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
336417
ns329084
ns1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
341125
ns336667
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
339250
ns328958
ns1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
336541
ns323917
ns1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA
213552
ns209091.5
ns1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1489000
ns1486166
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1522292
ns1517709
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1529458
ns1525792
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1468458
ns1464375
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
52575
ns52406
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5115542
ns5153729.5
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5292541
ns5303250
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5289458.5
ns5257500
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4978625
ns4990145.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
206120
ns203681
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
28125
ns28250
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
28250
ns28208
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
28250
ns28250
ns1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
28208
ns28375
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA
24358
ns24536
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
66334
ns66708
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
66167
ns66125
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
66209
ns66250
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
66750
ns66416
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA
526089
ns535849
ns0.98
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s)
1498042
ns1468041
ns1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s)
911000
ns912854
ns1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s)
1149625
ns1130187.5
ns1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s)
2098500
ns2251604
ns0.93
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA
582137
ns583084
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s)
3080771
ns3113959
ns0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s)
2593125
ns2660771
ns0.97
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s)
2751125
ns2734000
ns1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s)
3818125
ns3802646
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA
2100592
ns2002672
ns1.05
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s)
7913063
ns7929500
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s)
8011208
ns8011167
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s)
7901167
ns7911791.5
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s)
4863125
ns4826833
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
81500
ns81437.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
82000
ns83395.5
ns0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
84125
ns84437.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83083
ns136500
ns0.61
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
194175
ns193251.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2020500
ns2033479
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2036292
ns2014584
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2018708
ns2016000
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2021916
ns2013958
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
810603
ns792396
ns1.02
This comment was automatically generated by workflow using github-action-benchmark.