Commit
docs: change update step to thousands in PINN 2D PDE (#1153)
For 50,000 training steps, an update every 1,000 steps gives enough detail.
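As a rough illustration of the arithmetic behind this change, the sketch below counts how many progress updates a 50,000-step loop emits when it reports every 1,000 steps. It is not the tutorial's code: the function name `train`, its parameters, and the placeholder loss are all assumptions made up for this example.

```python
# Hypothetical training loop; every name here is illustrative only and
# the "loss" is a placeholder rather than a real PINN objective.
TOTAL_STEPS = 50_000
LOG_EVERY = 1_000


def train(total_steps=TOTAL_STEPS, log_every=LOG_EVERY):
    """Run a dummy loop and return the (step, loss) pairs that would be reported."""
    updates = []
    for step in range(1, total_steps + 1):
        loss = 1.0 / step  # stand-in for one optimisation step
        if step % log_every == 0:
            updates.append((step, loss))
    return updates


if __name__ == "__main__":
    updates = train()
    print(f"{len(updates)} updates, first at step {updates[0][0]}, "
          f"last at step {updates[-1][0]}")
```

With these values the loop reports 50 times, once per thousand steps, which keeps the training log short while still showing the trend.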
46a012d
Lux Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
3791
ns4042
ns0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4500
ns4125
ns1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
4875
ns4833.5
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
3666
ns3958
ns0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
59711.5
ns60780
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10167
ns10500
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10458
ns10333
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10750
ns10625
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10625
ns10833
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
419469
ns423470
ns0.99
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1062.5
ns1084
ns0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1167
ns1125
ns1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1500
ns1416
ns1.06
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1125
ns1208
ns0.93
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18540
ns18313
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4083
ns4042
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4042
ns4083
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4208
ns4208
ns1
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
3958
ns3625
ns1.09
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
109802.5
ns110716
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57542
ns57375
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46416
ns46292
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
47125
ns46500
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
80875
ns82709
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37744
ns37768
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2035395.5
ns2006604.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2078396
ns2082209
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2078708
ns2011667
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1998584
ns2018937.5
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
195463
ns196514.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
144250
ns141709
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
144166.5
ns144000
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
145125
ns145187
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
153104.5
ns144208
ns1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
165592.5
ns165424.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1120291.5
ns1001541.5
ns1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1113167
ns1118791.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
832708.5
ns1097124.5
ns0.76
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1117084
ns1141417
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
520015.5
ns532439
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3375
ns3667
ns0.92
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3542
ns3542
ns1
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4166
ns3917
ns1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3125
ns3541.5
ns0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
66073.5
ns71776.5
ns0.92
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9042
ns9042
ns1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8750
ns9584
ns0.91
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10208
ns8500
ns1.20
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8833
ns9042
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
469701
ns486557
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
17041
ns15125
ns1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
15834
ns17792
ns0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
16604.5
ns16916.5
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
16791
ns15250
ns1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
54530
ns56432
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
213750
ns214500
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
214875
ns214625
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
215667
ns215333.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
226125
ns216041
ns1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
269469
ns280343
ns0.96
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns667
ns0.81
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
708
ns584
ns1.21
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
709
ns708
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
541
ns667
ns0.81
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17336
ns17273.5
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1375
ns1583
ns0.87
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1375
ns1667
ns0.82
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1500
ns1667
ns0.90
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1458
ns1541
ns0.95
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
100554
ns103457
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7000
ns7000
ns1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5750
ns5937.5
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6042
ns5709
ns1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9750
ns9833
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23286
ns24396
ns0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
222021
ns222750
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
228542
ns229041
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
229292
ns230041
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
213937.5
ns213500
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
166141.5
ns171992
ns0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3875
ns3917
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3917
ns3916
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3959
ns4000
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3917
ns3958
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23204
ns23948
ns0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16917
ns16750
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16792
ns16583
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17250
ns17041
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16750
ns16916
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
164061.5
ns165565.5
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
568792
ns572458
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
578645.5
ns576208
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
578083
ns581250
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
575625
ns575042
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113438.5
ns113609
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1422625
ns1419604
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1420000
ns1420333
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1422375
ns1421834
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1426708
ns1421062.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
213572
ns216706.5
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1077687.5
ns1089896
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
960917
ns966312
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1353229.5
ns1351792
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1315312
ns1307959
ns1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA
274529.5
ns276909
ns0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5961958
ns5979271
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4633250
ns4608000
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4975188
ns4925667
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5557125
ns5767000
ns0.96
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1081948
ns1097403.5
ns0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
583
ns541
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
542
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23910
ns23800
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2208
ns2125
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2250
ns2084
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns2167
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2084
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
176064.5
ns174099
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
4125
ns4209
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4375
ns4042
ns1.08
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5167
ns5020.5
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
4250
ns3667
ns1.16
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
65504
ns66593
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11875
ns10958
ns1.08
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11000
ns11167
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11917
ns12083
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11500
ns11167
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
448080.5
ns455844
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7000
ns6583
ns1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6958
ns6417
ns1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8250
ns7562.5
ns1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6125
ns6333
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
52534
ns53149
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
18708.5
ns17375
ns1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
18625
ns17250
ns1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18375
ns18250
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16708
ns16458
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
296471
ns301789.5
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
625
ns542
ns1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
708
ns542
ns1.31
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
667
ns625
ns1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
584
ns584
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
33481
ns33109.5
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8834
ns8542
ns1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8875
ns8500
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9334
ns9375
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8354.5
ns8416.5
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
158505
ns161412.5
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64459
ns64666
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64750
ns64583
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64916
ns64459
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64625
ns64208
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
112347
ns112066
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
279250
ns275959
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
282167
ns279333
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
284125
ns280167
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
278708
ns284791
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
187244.5
ns190816.5
ns0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3278417
ns3359666.5
ns0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
3081000
ns3020708
ns1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3021792
ns3019708
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
4040979.5
ns4044937.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
573775.5
ns582824
ns0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7620208
ns7633375
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7449187.5
ns7444749.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7493708.5
ns7451687.5
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8208791
ns8276916.5
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1340015.5
ns1416070
ns0.95
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
18366417
ns17541687.5
ns1.05
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
17522312.5
ns17532229.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
17580834
ns17547042
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
14093354.5
ns14143625
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23631333
ns23437021
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
33504604
ns33669000
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37034667
ns36847792
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34967583.5
ns35241729
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1860248
ns1852807
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
189693000
ns188072458
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
165014875
ns164284791
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
152416688
ns152400917
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
434850958
ns434137916
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13871408
ns13886569
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
289105312.5
ns288796896
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
250867083
ns251588375
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
296775875
ns296639417
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
473537562.5
ns474281875
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
22083
ns22000
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22459
ns22625
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
25375
ns24250
ns1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
24083
ns21812.5
ns1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
95417
ns98991
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103083
ns104791
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
103250
ns103292
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
104542
ns104708
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103041
ns103625
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
502007.5
ns514494
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5917
ns5917
ns1
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5958
ns5834
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6708
ns6459
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5791.5
ns6167
ns0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
68401.5
ns69465
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14792
ns14417
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15000
ns15250
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16542
ns15459
ns1.07
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14875
ns14666
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
475091.5
ns483934.5
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3002625
ns2986042
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2079375
ns2014792
ns1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2272333
ns2274354.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4882708
ns4589125
ns1.06
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
586443
ns584502
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23536000
ns23505916.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18038562.5
ns18035749.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
16972167
ns16922042
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
34545146
ns34856104.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
2768189
ns2763874
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33221458
ns33341541.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27561792
ns27602208
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27327000
ns27326333
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
42034750
ns41263417
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
71417
ns72791.5
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
71854.5
ns73208
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
75708
ns83958
ns0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
74708
ns83208
ns0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
101188
ns103702
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
205250.5
ns286979.5
ns0.72
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
206750
ns206625.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
208958
ns322750
ns0.65
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
217416
ns322333
ns0.67
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
541638
ns559306
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11875
ns11458.5
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11416
ns11666.5
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12958
ns12333
ns1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11708
ns11958
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
70557.5
ns73645.5
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25667
ns26208.5
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26541.5
ns27000
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27729.5
ns27416
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26667
ns26645.5
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
468068.5
ns483328.5
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12812.5
ns11917
ns1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12209
ns14750
ns0.83
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
14208
ns13708
ns1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12291.5
ns12708
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
52262
ns54699.5
ns0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25625
ns25375
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
25916.5
ns25500
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26250
ns26333
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26604
ns27875
ns0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
297345.5
ns308185.5
ns0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
178792
ns182041.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
180750
ns181583
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
181917
ns183167
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
179166
ns182167
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
56939
ns58753
ns0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
593333
ns592604
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
582708
ns583041
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
583667
ns594209
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
584542
ns586791
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
282717
ns294181
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6167
ns6083
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5875
ns5958.5
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6875
ns6833
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5708.5
ns6250
ns0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
69908.5
ns72095.5
ns0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
13791
ns14375
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
13917
ns13083
ns1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15667
ns14791
ns1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14458
ns14292
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
454508
ns473402.5
ns0.96
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1225312.5
ns1210604.5
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1241959
ns1239854
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1289958.5
ns1297479
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1011625
ns1024875
ns0.99
batchedmm(512, Bsize=4)/forward/GPU/CUDA
300319.5
ns300941
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4103042
ns4097875.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4403333
ns4434062.5
ns0.99
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4523854.5
ns4563541
ns0.99
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
3709771
ns3722313
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1034770
ns1037751.5
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1875
ns1791
ns1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1875
ns1792
ns1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1916
ns1834
ns1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1875
ns1833
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23619
ns23494
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4958
ns4834
ns1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
5000
ns4834
ns1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4958
ns4917
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
ns4875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
186116
ns188396
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5833
ns5625
ns1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5917
ns5459
ns1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6667
ns6500
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5209
ns5562.5
ns0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
54405.5
ns54865
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
11125
ns10583
ns1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11500
ns10500
ns1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11458
ns11125
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10500
ns10666
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
320192
ns324083
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
375
ns292
ns1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
375
ns333
ns1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
333
ns375
ns0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
375
ns292
ns1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22488.5
ns22774
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2792
ns2708
ns1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2833
ns2750
ns1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3083
ns2959
ns1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2750
ns2708
ns1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
157059.5
ns158123.5
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
11459
ns11375
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
11625
ns11083
ns1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
12875
ns12125
ns1.06
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
10958
ns11542
ns0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
55353
ns56425.5
ns0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
25020.5
ns24583
ns1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
25292
ns24667
ns1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25125
ns24833.5
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24875
ns25250
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
284593.5
ns289503
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4250
ns4167
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4250
ns4208
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4250
ns4250
ns1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4208
ns4250
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
24743
ns24426.5
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16333
ns16417
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16375
ns16167
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16520.5
ns16334
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16208
ns16125
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
192574
ns194624
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5833
ns5709
ns1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5833
ns5708
ns1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
6042
ns5834
ns1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5833
ns5875
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
33721.5
ns33182
ns1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
21000
ns20792
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
21000
ns20645.5
ns1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
21417
ns20792
ns1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
20709
ns20417
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
172002
ns174846
ns0.98
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s)
422124.5
ns423688
ns1.00
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s)
387791
ns381917
ns1.02
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s)
477333
ns480521
ns0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s)
103125
ns104125
ns0.99
batchedmm(16, Bsize=512)/forward/GPU/CUDA
66716
ns66873.5
ns1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s)
921333
ns934375
ns0.99
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s)
974250
ns984083
ns0.99
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s)
1186458
ns1186625
ns1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s)
457479.5
ns471042
ns0.97
batchedmm(16, Bsize=512)/zygote/GPU/CUDA
189036
ns189890.5
ns1.00
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80542 ns | 81458.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80709 ns | 80125 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84896 ns | 81104.5 ns | 1.05 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 79833 ns | 136333 ns | 0.59 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193358.5 ns | 192847 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1919250 ns | 1918292 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1876583 ns | 1908625 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1946041 ns | 1922750 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1921396 ns | 1953687.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 391971 ns | 394765 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21948.5 ns | 21680 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1917 ns | 1792 ns | 1.07 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1917 ns | 1833 ns | 1.05 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1834 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1833 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 166123 ns | 167307.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6417 ns | 6625 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6666 ns | 6333 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7771 ns | 7375 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6145.5 ns | 6667 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 56772 ns | 59094.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9604.5 ns | 8958 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9459 ns | 8959 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9417 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9041 ns | 9416 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 294981.5 ns | 303401 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120459792 ns | 120415166.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173682208 ns | 173861833 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147804000 ns | 147873916 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 105720875 ns | 104464750 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5472285 ns | 5466659 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 610206729.5 ns | 607892187.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 555562500 ns | 555380583 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 452099291.5 ns | 449180562.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 626409896 ns | 624687437 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34955764 ns | 34960099 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 657253583 ns | 655676042 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 665008062.5 ns | 664719854.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 581676208.5 ns | 586317000.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 857648458 ns | 854444125 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57875 ns | 57541 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47791 ns | 47500 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47500 ns | 46625 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83395.5 ns | 85500 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37072 ns | 37532 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1915500 ns | 1919792 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1932792 ns | 1980000 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1995084 ns | 1978083.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1890500 ns | 1915584 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 171922.5 ns | 173336.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 267854.5 ns | 266563 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 267708 ns | 285125 ns | 0.94 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 269750 ns | 286313 ns | 0.94 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 268166 ns | 267916 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 123763 ns | 130327.5 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 594417 ns | 588541 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 681291 ns | 688375 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 604895.5 ns | 691667 ns | 0.87 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 689917 ns | 713875 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 674236.5 ns | 704236.5 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2176375 ns | 2209792 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2222812.5 ns | 2211250 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2205042 ns | 2214666 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2093562.5 ns | 2251125 ns | 0.93 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133331 ns | 133526 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5514416 ns | 5473459 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5508500 ns | 5495771 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5535958 ns | 5506084 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5491750 ns | 5555625 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 730299 ns | 758118 ns | 0.96 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 638167 ns | 641209 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 647708 ns | 638417 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 659416 ns | 648750 ns | 1.02 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 643750 ns | 647250 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46729.5 ns | 46678 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1822167 ns | 1823542 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1723042 ns | 1728500 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1727833 ns | 1721125 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2106333 ns | 2101541 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 219682 ns | 220988 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58458 ns | 58375 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46917 ns | 47291 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47292 ns | 46667 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84125 ns | 84417 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28215 ns | 28560 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2030041 ns | 2021604 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2004250 ns | 2078542 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2122125 ns | 2089792 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1985979.5 ns | 2018458 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 186715 ns | 188289 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13357770.5 ns | 13165083 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12440000 ns | 12437062.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12492250 ns | 12496625 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15108458 ns | 15241708 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 510701.5 ns | 511138.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47178791.5 ns | 47044896 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41760334 ns | 41734229 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40950875 ns | 41006041 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58205437.5 ns | 58474250 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2894239.5 ns | 2887641 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 97014458.5 ns | 74158583 ns | 1.31 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91152834 ns | 68293166 ns | 1.33 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90701604.5 ns | 90787478.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 98541521.5 ns | 76120020.5 ns | 1.29 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58959 ns | 58708 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47375 ns | 47417 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47750 ns | 47333 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 79958 ns | 81500 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47779.5 ns | 48467.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1918645.5 ns | 1906541 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1971000 ns | 1966979 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1997667 ns | 1972250 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1889750 ns | 1919083.5 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 192960 ns | 194955.5 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 416 ns | 292 ns | 1.42 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 33172 ns | 31682 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6292 ns | 5979.5 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6542 ns | 5959 ns | 1.10 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6834 ns | 6417 ns | 1.06 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6125 ns | 6250 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 171303 ns | 173280.5 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 250 ns | 1.33 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 250 ns | 1.33 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32323 ns | 31661 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2833 ns | 2583 ns | 1.10 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2917 ns | 2625 ns | 1.11 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2917 ns | 2834 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2708 ns | 2584 ns | 1.05 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 162112.5 ns | 162166.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 289426812.5 ns | 285912791.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 339624334 ns | 341793875 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 315284104.5 ns | 314064437.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 274668667 ns | 269291750 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7120353.5 ns | 7104649.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1014634416 ns | 1013628833 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 953687125 ns | 955735416 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 857733312.5 ns | 855387437.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1265357333 ns | 1263250834 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33985258 ns | 33975753 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1675373667 ns | 1379120562.5 ns | 1.21 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1668941291 ns | 1314342812 ns | 1.27 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1606744000 ns | 1634956500 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1787636084 ns | 1372311479 ns | 1.30 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1409499.5 ns | 1410229 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1413833 ns | 1415750 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1419895.5 ns | 1412896 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1458541.5 ns | 1460375 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127493 ns | 127578 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5016749.5 ns | 5011584 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4651917 ns | 5015500 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5058791 ns | 5020521 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5012792 ns | 5052375 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 551564 ns | 577903.5 ns | 0.95 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 171852250 ns | 171180458 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 129831062.5 ns | 128541250 ns | 1.01 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 115995771 ns | 109850250 ns | 1.06 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 168839667 ns | 169107792 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4879222 ns | 4873683 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 629070333 ns | 624949333 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 493488792 ns | 491287250 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 456364583 ns | 454790833 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 675660292 ns | 648542167 ns | 1.04 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16223916 ns | 16059874 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8950646 ns | 8910395.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8924625 ns | 8995792 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7865125 ns | 7901000 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9701750 ns | 9817770.5 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1588053 ns | 1593491 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36024125 ns | 35975583 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37000208.5 ns | 37440812.5 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33425875 ns | 33423291.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37661542 ns | 38560271 ns | 0.98 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6463767 ns | 6452757.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47562.5 ns | 47625 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47416 ns | 47583 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47666 ns | 47625 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47375 ns | 47375 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 17907 ns | 18605 ns | 0.96 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50542 ns | 50250 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50375 ns | 50417 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50584 ns | 50625 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50583 ns | 50459 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 184398 ns | 218596.5 ns | 0.84 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6958.5 ns | 6416 ns | 1.08 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6500 ns | 6625 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8042 ns | 7209 ns | 1.12 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6542 ns | 7000 ns | 0.93 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 89066 ns | 120537.5 ns | 0.74 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 9667 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10437.5 ns | 9583 ns | 1.09 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10500 ns | 10625 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10375 ns | 10209 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 510214.5 ns | 676959 ns | 0.75 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5666 ns | 5584 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5958 ns | 6167 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7417 ns | 7146 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5458 ns | 5562.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 109271 ns | 144983 ns | 0.75 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13125 ns | 12875 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13250 ns | 13084 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13375 ns | 13875 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13208 ns | 12959 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 457940.5 ns | 555671 ns | 0.82 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 959 ns | 1.13 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 959 ns | 1.13 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1084 ns | 1083 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1084 ns | 1083 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32174 ns | 32054 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8000 ns | 7500 ns | 1.07 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8292 ns | 7875 ns | 1.05 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8500 ns | 8167 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8125 ns | 7958.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 199053.5 ns | 215727.5 ns | 0.92 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23354.5 ns | 23166.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23250 ns | 23292 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23542 ns | 23458 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23125 ns | 23334 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18347 ns | 18589.5 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52667 ns | 52625 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52584 ns | 52500 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52750 ns | 52958 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52417 ns | 52333 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 291115 ns | 299146 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1398084 ns | 1401500 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1402791 ns | 1396145.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1401792 ns | 1398562.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1402875 ns | 1435792 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195544.5 ns | 195172 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5010813 ns | 5009646 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5016584 ns | 4800875 ns | 1.04 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5062708 ns | 5005896 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5013500 ns | 5025041.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 617335 ns | 612010.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3040417 ns | 3032250 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2105083 ns | 2072292 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2280208 ns | 2300667 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4865521 ns | 4921042 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 579665 ns | 580134 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24414604.5 ns | 24343228.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18876208.5 ns | 18906020.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 17652979 ns | 17758521.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35825688 ns | 35734042 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2847809 ns | 2830179 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34006188 ns | 33956916.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28283750 ns | 28347958 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27926083.5 ns | 28079666 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41742416.5 ns | 42065000 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144750166 ns | 144437916 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 146949375 ns | 147635291 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 126208208.5 ns | 125109916 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173205292 ns | 173674875 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22782449 ns | 22545545 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1847080125 ns | 908256562.5 ns | 2.03 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 809911709 ns | 1584608041.5 ns | 0.51 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 755677291 ns | 749118208 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 667449084 ns | 669868292 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118406338 ns | 118395391 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 76791 ns | 81333 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76042 ns | 75042 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76417 ns | 77166 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72541 ns | 73625 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 250232.5 ns | 243285.5 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 277229 ns | 287145.5 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 193583 ns | 285833 ns | 0.68 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 205417 ns | 283104.5 ns | 0.73 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 303083.5 ns | 279041 ns | 1.09 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1279646 ns | 1239705 ns | 1.03 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35472875 ns | 35487666 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36379896 ns | 36325875 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32315333.5 ns | 32416604 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40618416.5 ns | 40654875 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5840653.5 ns | 5840513 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 146765250 ns | 146753459 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 153200125 ns | 153140083.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 137307792 ns | 135055542 ns | 1.02 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 285301125 ns | 286267791 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34880703 ns | 34875869 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120518062.5 ns | 120929708.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174031666 ns | 174008000 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148283312.5 ns | 147856792 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106552271 ns | 102357166.5 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5465282.5 ns | 5458379 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 469918416 ns | 472290792 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 466837917 ns | 468203875 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 437920916.5 ns | 437903521 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 739774042 ns | 743156542 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32269604.5 ns | 32279044 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 711087896 ns | 709215666.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 640897313 ns | 641585354.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 630411896 ns | 623424125.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 849787625 ns | 853935458 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1302125 ns | 1289084 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 905958 ns | 912625 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 938334 ns | 959625 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 1987437 ns | 2066167 ns | 0.96 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 573939.5 ns | 576350.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2951687.5 ns | 2954792 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2611020.5 ns | 2624645.5 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2639896 ns | 2616708 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3702396 ns | 3750458 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1765767 ns | 1708662 ns | 1.03 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5801417 ns | 5780625 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5727666.5 ns | 5802646 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5818916 ns | 5793708 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2913834 ns | 2916792 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7417 ns | 7292 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6166 ns | 6125 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6209 ns | 6167 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 9917 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25586 ns | 24959.5 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212792 ns | 212666.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220834 ns | 219979.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221166 ns | 220458 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215459 ns | 244353.5 ns | 0.88 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 272866 ns | 249958 ns | 1.09 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 300445333 ns | 296320791 ns | 1.01 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 214002042 ns | 216911667 ns | 0.99 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 196386541 ns | 196230687 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 307720792 ns | 303909375 ns | 1.01 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7675041.5 ns | 7672082.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1232629833 ns | 1231911312.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 899311645.5 ns | 900530270.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 825300584 ns | 828047958 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1150330250 ns | 1151206292 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26367421.5 ns | 26738113 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5458 ns | 4833 ns | 1.13 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5416 ns | 5500 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6750.5 ns | 6167 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5084 ns | 5000 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 184497.5 ns | 149363.5 ns | 1.24 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7667 ns | 7041 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 7333 ns | 1 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7500 ns | 7541 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7250 ns | 6917 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 655045 ns | 600699 ns | 1.09 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 500 ns | 1.17 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 500 ns | 1.25 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 584 ns | 1.07 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 500 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24222 ns | 23466 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 8667 ns | 1.10 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 8417 ns | 1.17 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9667 ns | 9667 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9041 ns | 9125 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 221511.5 ns | 211340 ns | 1.05 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 352562.5 ns | 368458 ns | 0.96
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351833 ns | 351459 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 353416.5 ns | 352500 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 366166 ns | 352146 ns | 1.04
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21264 ns | 21302 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 826208 ns | 826271 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 775333.5 ns | 824958.5 ns | 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 808520.5 ns | 792000 ns | 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 828833 ns | 830250.5 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 278649 ns | 269586 ns | 1.03
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 340917 ns | 340937.5 ns | 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 342729.5 ns | 343062.5 ns | 1.00
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 453708 ns | 454770.5 ns | 1.00
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 10687.5 ns | 14084 ns | 0.76
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18338 ns | 17990 ns | 1.02
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 709875 ns | 710583 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 728042 ns | 728458 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1005792 ns | 1004208 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26667 ns | 27417 ns | 0.97
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 257132 ns | 239886 ns | 1.07
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 380187.5 ns | 383166.5 ns | 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 355542 ns | 350542 ns | 1.01
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 442146 ns | 443208 ns | 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 30959 ns | 31250 ns | 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22801.5 ns | 22514 ns | 1.01
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 726667 ns | 718250 ns | 1.01
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 778791.5 ns | 782083 ns | 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1034042 ns | 1028417 ns | 1.01
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 105042 ns | 105334 ns | 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 214595.5 ns | 217107 ns | 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3583 ns | 3333 ns | 1.08
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3542 ns | 3708 ns | 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3625 ns | 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3542 ns | 3417 ns | 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17801 ns | 17516 ns | 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4583 ns | 4104.5 ns | 1.12
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4333 ns | 4208 ns | 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4375 ns | 4291 ns | 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4167 ns | 4166 ns | 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 276455 ns | 232485 ns | 1.19
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3833 ns | 3333 ns | 1.15
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3542 ns | 3667 ns | 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4292 ns | 4084 ns | 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3500 ns | 4250 ns | 0.82
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 219668 ns | 176024.5 ns | 1.25
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8334 ns | 8291 ns | 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8334 ns | 8250 ns | 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8708 ns | 8250 ns | 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8625 ns | 8542 ns | 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1228564 ns | 1051146 ns | 1.17
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203709 ns | 204709 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209833 ns | 210709 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 213750 ns | 210583 ns | 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200750 ns | 199833.5 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34897 ns | 34425 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 611979.5 ns | 647229 ns | 0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 623084 ns | 649666.5 ns | 0.96
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 633542 ns | 626208 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630833 ns | 640479.5 ns | 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 337730.5 ns | 293508 ns | 1.15
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 991250 ns | 993750 ns | 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1017458.5 ns | 1020395.5 ns | 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 954833 ns | 958396 ns | 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 864916.5 ns | 887291 ns | 0.97
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 208131 ns | 206487.5 ns | 1.01
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4517208 ns | 4504792 ns | 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4768041 ns | 4702583.5 ns | 1.01
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4459667 ns | 4449000 ns | 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 4281312 ns | 4321500 ns | 0.99
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 937605 ns | 979904 ns | 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3625 ns | 3167 ns | 1.14
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3291 ns | 3541 ns | 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4250 ns | 4166 ns | 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3166 ns | 3333.5 ns | 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 221703 ns | 174711 ns | 1.27
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7042 ns | 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7458 ns | 7042 ns | 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7687.5 ns | 7375 ns | 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7084 ns | 7083 ns | 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1025587 ns | 911927 ns | 1.12
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1644333 ns | 1650250 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1183209 ns | 1195333 ns | 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1370292 ns | 1375625 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2475167 ns | 2471000 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 213710.5 ns | 213276 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12346958.5 ns | 12340062 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9593646 ns | 9568500 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9292209 ns | 9298896 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 17963583.5 ns | 18088041 ns | 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1947963.5 ns | 1943838 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17361375 ns | 17384833.5 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14393542 ns | 14357854 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14339750 ns | 14387313 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21095083 ns | 21175104 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 88167 ns | 100083 ns | 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 88875 ns | 87750 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 91875 ns | 93416.5 ns | 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 134020.5 ns | 89625 ns | 1.50
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126192 ns | 125990 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2027813 ns | 2026687.5 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2027000.5 ns | 2031083.5 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2054000 ns | 2031250 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2028125 ns | 2050458.5 ns | 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1026969 ns | 951363 ns | 1.08
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 2792 ns | 2979 ns | 0.94
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 2583 ns | 2875 ns | 0.90
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3458 ns | 3520.5 ns | 0.98
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 1917 ns | 2521 ns | 0.76
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16376 ns | 16207 ns | 1.01
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2709 ns | 2666.5 ns | 1.02
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2792 ns | 2500 ns | 1.12
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2792 ns | 2875 ns | 0.97
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2833.5 ns | 2959 ns | 0.96
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 186134.5 ns | 179422.5 ns | 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7250 ns | 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 5958 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6167 ns | 6000 ns | 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10083 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34252.5 ns | 33838 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 242958 ns | 225292 ns | 1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220917 ns | 219750 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220417 ns | 220542 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 240375 ns | 244708 ns | 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 328052.5 ns | 293649.5 ns | 1.12
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3709 ns | 3708 ns | 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3791 ns | 3709 ns | 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3750 ns | 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3709 ns | 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22539 ns | 22219 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14584 ns | 14417 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14542 ns | 14375 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14584 ns | 14625 ns | 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14417 ns | 14583 ns | 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 484358 ns | 436265 ns | 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 92125 ns | 140000 ns | 0.66
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 92458 ns | 92458 ns | 1
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 98562.5 ns | 96792 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 118229 ns | 96792 ns | 1.22
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125261.5 ns | 125211.5 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1913333 ns | 1921583.5 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1909771 ns | 1923937.5 ns | 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1956333 ns | 1928188 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1924333 ns | 1942771 ns | 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 935173 ns | 855373 ns | 1.09
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 879000 ns | 874041 ns | 1.01
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 818395.5 ns | 820458 ns | 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1219520.5 ns | 1223417 ns | 1.00
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 966459 ns | 972500 ns | 0.99
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 267198 ns | 272168 ns | 0.98
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2822917 ns | 2804167 ns | 1.01
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2496917 ns | 2520875 ns | 0.99
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3359000 ns | 3337667 ns | 1.01
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3411333 ns | 3424895.5 ns | 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1570113.5 ns | 1501496.5 ns | 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17000 ns | 16791.5 ns | 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15458.5 ns | 14854.5 ns | 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19041 ns | 18375 ns | 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16875 ns | 15229 ns | 1.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 133146.5 ns | 131230 ns | 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 258834 ns | 227959 ns | 1.14
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215125 ns | 250729 ns | 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215792 ns | 216125 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 227875 ns | 262791 ns | 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 602653.5 ns | 582129.5 ns | 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 219062.5 ns | 222062.5 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221375 ns | 219125 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222875 ns | 222041.5 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 220791 ns | 221584 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 247312 ns | 244344.5 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 497625 ns | 508270.5 ns | 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 535916 ns | 521083 ns | 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 499208 ns | 498833 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 511125 ns | 565541.5 ns | 0.90
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1333241 ns | 1195773 ns | 1.11
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 3833.5 ns | 4479.5 ns | 0.86
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4250 ns | 3583.5 ns | 1.19
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 5166.5 ns | 4750 ns | 1.09
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 3792 ns | 4625 ns | 0.82
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16912 ns | 16818 ns | 1.01
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7542 ns | 7208 ns | 1.05
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7167 ns | 7250 ns | 0.99
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7542 ns | 7333 ns | 1.03
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7667 ns | 7458.5 ns | 1.03
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 186762.5 ns | 180977.5 ns | 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18667 ns | 18583 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16708 ns | 17583.5 ns | 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20584 ns | 19958.5 ns | 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18084 ns | 17333 ns | 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 136037 ns | 132074.5 ns | 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224209 ns | 212166 ns | 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212687 ns | 212146 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213167 ns | 212917 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222979.5 ns | 218959 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 896805 ns | 814362 ns | 1.10
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4250 ns | 4042 ns | 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4333.5 ns | 4208 ns | 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5125 ns | 5000 ns | 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 3875 ns | 4000 ns | 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 222577.5 ns | 175168.5 ns | 1.27
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10542 ns | 10250 ns | 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10791 ns | 9687.5 ns | 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10959 ns | 11083 ns | 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10333 ns | 10125 ns | 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1034707.5 ns | 961404 ns | 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3375 ns | 3041.5 ns | 1.11
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3333 ns | 3291 ns | 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4042 ns | 4375 ns | 0.92
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 2958 ns | 3416.5 ns | 0.87
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 225445.5 ns | 193655 ns | 1.16
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7208.5 ns | 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7750 ns | 7209 ns | 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7542 ns | 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7208 ns | 7458 ns | 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1042046 ns | 972220 ns | 1.07
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23498333.5 ns | 23356708 ns | 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34789375 ns | 34480833.5 ns | 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37689958 ns | 37583875 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34909542 ns | 35001895.5 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1849921 ns | 1828165 ns | 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184647292 ns | 184126958 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 163834583 ns | 166867125 ns | 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146363541.5 ns | 146311896 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 274565083 ns | 275288375 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16510014 ns | 16524063 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 278243563 ns | 276685520.5 ns | 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 245760791.5 ns | 252606729 ns | 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 231789354 ns | 231173396 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 324000854.5 ns | 324261749.5 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 182625 ns | 184542 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 184458 ns | 182833 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 186250 ns | 185583 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 181875 ns | 184895.5 ns | 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 206355.5 ns | 166499.5 ns | 1.24
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 628291.5 ns | 634000 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 608229.5 ns | 585209 ns | 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 598250 ns | 592708.5 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 637791 ns | 630958 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 999947 ns | 926373.5 ns | 1.08
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3874375 ns | 3858042 ns | 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3917042 ns | 3914708 ns | 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3534687.5 ns | 3549917 ns | 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4554291 ns | 4595104.5 ns | 0.99
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 531266.5 ns | 532803 ns | 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17461354.5 ns | 17337937.5 ns | 1.01
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17833459 ns | 17877583 ns | 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16559937.5 ns | 16422125 ns | 1.01
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 19938750 ns | 20130416.5 ns | 0.99
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2619194 ns | 2619405 ns | 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 542 ns | 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 542 ns | 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 666 ns | 625 ns | 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 542 ns | 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 33463 ns | 32935 ns | 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9292 ns | 8958 ns | 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9458 ns | 8875 ns | 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 9458 ns | 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9187.5 ns | 9209 ns | 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 252733 ns | 248903 ns | 1.02
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 651812167 ns | 649671041.5 ns | 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 390086667 ns | 390100166.5 ns | 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 327502625 ns | 355146542 ns | 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 747314333 ns | 750210500 ns | 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12474949 ns | 12471745.5 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1879705041.5 ns | 1883695042 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1650371917 ns | 1646365041 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1514378771 ns | 1513696187.5 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2204966313 ns | 2208789146 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49428315 ns | 49495223 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1651458 ns | 1642208 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1196083 ns | 1192812.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1387103.5 ns | 1386104 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2353958 ns | 2519667 ns | 0.93
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 217144 ns | 215937.5 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12704667 ns | 12672750 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9935187.5 ns | 9911875 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9671333.5 ns | 9658417 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18432334 ns | 18448708.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2021545.5 ns | 1992558.5 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17670625 ns | 17681874.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14743791.5 ns | 14694333 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14593292 ns | 14589750 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21437146 ns | 21582250 ns | 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26250 ns | 26291 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26292 ns | 26667 ns | 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26333 ns | 26291 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26292 ns | 26250 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24013 ns | 23957 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67166 ns | 66959 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67208 ns | 67750 ns | 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67917 ns | 67250 ns | 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66958 ns | 67459 ns | 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 380547.5 ns | 371563.5 ns | 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 202875 ns | 203875 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210375 ns | 209500 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209916 ns | 209125 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 198750 ns | 200459 ns | 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25898 ns | 26219 ns | 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 645354 ns | 647500 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 637500.5 ns | 669416.5 ns | 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 634542 ns | 685542 ns | 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 634250 ns | 632166.5 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 326606.5 ns | 324278 ns | 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 672209 ns | 675000 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 637917 ns | 541042 ns | 1.18
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 665042 ns | 637375 ns | 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 664917 ns | 666542 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131949 ns | 132249.5 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2224563 ns | 2232250 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2248771 ns | 2239333.5 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2241125 ns | 2241084 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2237000 ns | 2299271.5 ns | 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1095016 ns | 1091764 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17417 ns | 17833 ns | 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17333 ns | 17917 ns | 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19500 ns | 20584 ns | 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16875 ns | 18709 ns | 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 133320 ns | 133803 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 260770.5 ns | 260333 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219458.5 ns | 255395.5 ns | 0.86
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229000 ns | 253687.5 ns | 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 263334 ns | 230479 ns | 1.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 947049 ns | 901721 ns | 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 541 ns | 1.16
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 666 ns | 667 ns | 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 666 ns | 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 542 ns | 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23873 ns | 23720 ns | 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10000 ns | 8333.5 ns | 1.20
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9750 ns | 9666 ns | 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 10208 ns | 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9750 ns | 9750 ns | 1
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 245331.5 ns | 244421 ns | 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5375 ns | 5125 ns | 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5625 ns | 5750 ns | 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6604.5 ns | 6584 ns | 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5000 ns | 5125 ns | 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 209896.5 ns | 195651 ns | 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7875 ns | 7083 ns | 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7292 ns | 7375 ns | 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7687.5 ns | 7750 ns | 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7334 ns | 7875 ns | 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 739872 ns | 711373.5 ns | 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2041 ns | 2041 ns | 1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2250 ns | 2250 ns | 1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2458 ns | 2250 ns | 1.09
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2084 ns | 2208 ns | 0.94
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18207 ns | 18128 ns | 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6542 ns | 6542 ns | 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6458 ns | 6542 ns | 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6708 ns | 6625 ns | 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6541 ns | 6417 ns | 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 306864 ns | 296966 ns | 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 747125 ns | 751937.5 ns | 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 749958.5 ns | 746542 ns | 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 747167 ns | 750125 ns | 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 771333.5 ns | 751833.5 ns | 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21305 ns | 21365 ns | 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 791000 ns | 811458 ns | 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 780041.5 ns | 810958 ns | 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 775416 ns | 790958 ns | 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 794812.5 ns | 813167 ns | 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 271390 ns | 271261 ns | 1.00
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 6959 ns | 7334 ns | 0.95 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5958 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6125 ns | 5917 ns | 1.04 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10167 ns | 10250 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33759 ns | 33874 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 259750 ns | 258396 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 238854 ns | 269104 ns | 0.89 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 231104 ns | 253416 ns | 0.91 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 250208 ns | 245208 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 336384 ns | 333723 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10125 ns | 10250 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10312.5 ns | 10334 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10875 ns | 10625 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10167 ns | 10250 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 223921.5 ns | 213790.5 ns | 1.05 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24167 ns | 24583 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24583 ns | 24500 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25333 ns | 24792 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24584 ns | 24916 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1062400 ns | 1032950.5 ns | 1.03 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106104729.5 ns | 107140583 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 117502187.5 ns | 117792062 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120758625 ns | 120863042 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117423500 ns | 117603375 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2624434 ns | 2946778 ns | 0.89 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 392280708 ns | 393794791.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 358697709 ns | 359678396 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 357440917 ns | 357838334 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 540821208.5 ns | 545418083.5 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15254730 ns | 15489580 ns | 0.98 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 781416292 ns | 607837250 ns | 1.29 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 760831458 ns | 579716416 ns | 1.31 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 750885583.5 ns | 747642396 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 784554021 ns | 607166334 ns | 1.29 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7583 ns | 7292 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6875 ns | 6958 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8208 ns | 7625 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7917 ns | 6834 ns | 1.16 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 214784 ns | 206235.5 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14542 ns | 13709 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13667 ns | 14167 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14125 ns | 14500 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14375 ns | 14292 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1015761 ns | 968613 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5625 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6250 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7500 ns | 6875 ns | 1.09 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5500 ns | 5750 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 211436.5 ns | 204166 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12875 ns | 12625 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12417 ns | 12583 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12687.5 ns | 13000 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13042 ns | 12292 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 728295 ns | 694587 ns | 1.05 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 5250 ns | 5917 ns | 0.89 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 5709 ns | 5458 ns | 1.05 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 6542 ns | 5875 ns | 1.11 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 5375 ns | 5958 ns | 0.90 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17219 ns | 16951 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 15750 ns | 15583 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 15375 ns | 15375 ns | 1 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 15584 ns | 15625 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 15916 ns | 15708 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 188803.5 ns | 185517 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 417 ns | 333 ns | 1.25 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 333 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23653 ns | 22862.5 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6583 ns | 6209 ns | 1.06 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6625 ns | 6208 ns | 1.07 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6542 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6375 ns | 6375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 227179 ns | 223995 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5958 ns | 5834 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6041 ns | 5917 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5959 ns | 6000 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5833 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24470 ns | 23989 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21520.5 ns | 20833 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21209 ns | 20583 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21667 ns | 21625 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21334 ns | 21375 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 249183.5 ns | 246983.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144062.5 ns | 169125 ns | 0.85 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 143042 ns | 144292 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 146334 ns | 148291.5 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 188146 ns | 189062.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167467 ns | 166865 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1317583 ns | 1326271 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1321709 ns | 1323042 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1365791.5 ns | 1320500 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1318666 ns | 1341500 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1237894 ns | 1189366 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24708 ns | 23000 ns | 1.07 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24375 ns | 23479 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 24375 ns | 24875 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22374.5 ns | 24750 ns | 0.90 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 318636 ns | 254630.5 ns | 1.25 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 134750 ns | 130167 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 181250 ns | 128375 ns | 1.41 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 130000 ns | 123229 ns | 1.05 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 130958 ns | 131062.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1345187.5 ns | 1279498 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23482 ns | 23209 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6042 ns | 1.10 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6500 ns | 6416 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6708 ns | 6792 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6792 ns | 6458 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 243071 ns | 238830 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4625 ns | 4333 ns | 1.07 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4541.5 ns | 4542 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5333 ns | 4708 ns | 1.13 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4583 ns | 4791 ns | 0.96 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 231105.5 ns | 217579.5 ns | 1.06 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9875 ns | 9666 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9916.5 ns | 10042 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10417 ns | 10125 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10375 ns | 10208 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1276883 ns | 1231902.5 ns | 1.04 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1584 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1667 ns | 1584 ns | 1.05 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23221 ns | 22989 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5750 ns | 5667 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5750 ns | 5625 ns | 1.02 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6083 ns | 6041 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5709 ns | 5750 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 262260 ns | 258706.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6814041 ns | 6877625 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6367459 ns | 6431167 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6578812.5 ns | 6497166 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7695958 ns | 7600437.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214554 ns | 213793 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24052709 ns | 24074875 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21310875 ns | 21241875 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21123834 ns | 21023583.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29855166.5 ns | 29822125.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2121783 ns | 2088714.5 ns | 1.02 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 48838979.5 ns | 37413209 ns | 1.31 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45549667 ns | 34256250 ns | 1.33 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45706771 ns | 45704562.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 49408500 ns | 38148271 ns | 1.30 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5416 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5709 ns | 6104.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6708 ns | 6667 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5541 ns | 6167 ns | 0.90 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 212106.5 ns | 206549 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8875 ns | 7917 ns | 1.12 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8167 ns | 8229.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8542 ns | 8584 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8208 ns | 8542 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1001631 ns | 962776 ns | 1.04 |
| lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1556417 ns | 1560583 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1270792 ns | 1259145.5 ns | 1.01 |
| lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1624187.5 ns | 1626291.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2180520.5 ns | 2161625 ns | 1.01 |
| lenet(28, 28, 1, 128)/forward/GPU/CUDA | 274298 ns | 280818.5 ns | 0.98 |
| lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7888792 ns | 7902229 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6591250 ns | 6567125 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7197854 ns | 7147750 ns | 1.01 |
| lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10478229.5 ns | 10485771 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1773709 ns | 1771472.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 366500 ns | 373687.5 ns | 0.98 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 371020.5 ns | 370583 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 457708 ns | 462021 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 33208.5 ns | 23584 ns | 1.41 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 47286 ns | 45539 ns | 1.04 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 723916.5 ns | 728750 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 801750 ns | 804208.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1064875 ns | 1065312.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 115334 ns | 96666.5 ns | 1.19 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 287209.5 ns | 226465 ns | 1.27 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397291 ns | 397333 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287834 ns | 288042 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288166 ns | 288417 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 750833 ns | 751375 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44324 ns | 44356 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 661875 ns | 672167 ns | 0.98 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 532416 ns | 531292 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 535458 ns | 528292 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973250 ns | 975666 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 191330.5 ns | 193617.5 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 670958 ns | 669291 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 644229 ns | 642666 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 680667 ns | 644708.5 ns | 1.06 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 648125 ns | 687208 ns | 0.94 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132061.5 ns | 132960 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2459333 ns | 2454209 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2456084 ns | 2456687 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2464542 ns | 2455291 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2456083 ns | 2470521 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1216753 ns | 1122477 ns | 1.08 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 3708 ns | 3541 ns | 1.05 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 3334 ns | 3208 ns | 1.04 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 4334 ns | 4458 ns | 0.97 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 2667 ns | 2958 ns | 0.90 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16517 ns | 16816 ns | 0.98 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5500 ns | 5292 ns | 1.04 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5458 ns | 5333 ns | 1.02 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5625 ns | 5625 ns | 1 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5542 ns | 5750 ns | 0.96 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 186819.5 ns | 187435 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458167 ns | 1458000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1500500 ns | 1498250 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1499333 ns | 1497083 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1437750 ns | 1439583 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 39930 ns | 40900 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5130750 ns | 5127041 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5285584 ns | 5298083.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5315979 ns | 5287583 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4998959 ns | 5015875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195663 ns | 198989 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3708 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3709 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3750 ns | 3709 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33499 ns | 34297 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15375 ns | 15125 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15417 ns | 15083.5 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15500 ns | 15375 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15167 ns | 15166 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 351211 ns | 348507 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 70667 ns | 71250 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71208 ns | 71333 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71959 ns | 70959 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71333 ns | 71209 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113147 ns | 113569.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 318500 ns | 317792 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 318000 ns | 319125 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 323666 ns | 319500 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 317125 ns | 319875 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 195331 ns | 197937.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1084 ns | 959 ns | 1.13 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1125 ns | 1000 ns | 1.13 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1084 ns | 1084 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1000 ns | 1000 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23576 ns | 23702 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8458 ns | 7500 ns | 1.13 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8334 ns | 7750 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8292 ns | 8334 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 7958 ns | 1.05 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 249171.5 ns | 249887 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 506709 ns | 504875 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 492375 ns | 484208 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 562708 ns | 564708 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 222187.5 ns | 236458 ns | 0.94 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129166 ns | 130159 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1387250 ns | 1379479.5 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1449208 ns | 1446458.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1788375 ns | 1730646 ns | 1.03 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 865812.5 ns | 884667 ns | 0.98 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 273491 ns | 273315.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 416 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 334 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32843 ns | 32089 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6667 ns | 6083 ns | 1.10 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 6000 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6500 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6458 ns | 6083 ns | 1.06 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 250973.5 ns | 250296.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1722042 ns | 1723562.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1723208.5 ns | 1725958.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1721083 ns | 1731208 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1723750 ns | 1767667 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168847 ns | 168954.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4362042 ns | 4352187.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4261187.5 ns | 4302209 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4415583.5 ns | 4360250 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4366958.5 ns | 4366750 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1143038 ns | 1065222 ns | 1.07 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6750 ns | 6916 ns | 0.98 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6959 ns | 6750 ns | 1.03 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 6959 ns | 6875 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6708.5 ns | 6958 ns | 0.96 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20756 ns | 20747 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 51417 ns | 67792 ns | 0.76 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 32917 ns | 48292 ns | 0.68 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 33333 ns | 32958 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 51208.5 ns | 51583 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 197240.5 ns | 198224 ns | 1.00 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 17542 ns | 18375 ns | 0.95 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 17875 ns | 17625 ns | 1.01 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 18916 ns | 18542 ns | 1.02 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 17750 ns | 18291 ns | 0.97 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18861 ns | 18190 ns | 1.04 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 53458 ns | 53292 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 53334 ns | 53541 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 53250 ns | 53500 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 53500 ns | 53500 ns | 1 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 319618.5 ns | 306993 ns | 1.04 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75292 ns | 75375 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75375 ns | 75208 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75792 ns | 75000 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75208 ns | 75458 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47162 ns | 46432 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324375 ns | 323792 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 327625 ns | 324916 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 329583 ns | 325000 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 324208 ns | 327375 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 211676.5 ns | 209114 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1484375 ns | 1485167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1527958 ns | 1524792 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1527583 ns | 1525000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1462209 ns | 1466042 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51967 ns | 51777 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5124708 ns | 5115209 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5280333 ns | 5290000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5332500 ns | 5261979.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4985875 ns | 5012167 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 202369.5 ns | 202581 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28250 ns | 28208 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28291 ns | 28208 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28333 ns | 28208 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28291 ns | 28208 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24821 ns | 24112 ns | 1.03 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66459 ns | 66333 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66458 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66833 ns | 66667 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66416 ns | 67041 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 482606 ns | 467729 ns | 1.03 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1501229 ns | 1491583.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1127563 ns | 1128834 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1119291.5 ns | 1128084 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2246375 ns | 2260833.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 570915 ns | 577757.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3082875 ns | 3056208 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2738375 ns | 2732395.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2760354 ns | 2734709 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3780667 ns | 3843875 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 1961915 ns | 1892225.5 ns | 1.04 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 7895333 ns | 7896000 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 7893459 ns | 7928041.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 7944812.5 ns | 7897562.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 4834521 ns | 4840958 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80959 ns | 81709 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80333 ns | 81062.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82166 ns | 85084 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 134375.5 ns | 90541 ns | 1.48 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193995.5 ns | 194858.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2014625 ns | 2012792 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2006229 ns | 2022916.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2047021 ns | 2012625 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2022958 ns | 2042500 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 740969 ns | 690147 ns | 1.07 |
This comment was automatically generated by workflow using github-action-benchmark.
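
For readers checking the numbers by hand: in every row above, the last value matches the first timing divided by the second, rounded to two decimals, so a value above 1 means the first timing is the larger one. The Python sketch below reproduces that arithmetic on three rows copied from the results. The helper names and the `1.25` cutoff are hypothetical, chosen only for illustration; they are not settings taken from this workflow.

```python
# Minimal sketch: recompute the ratio column from the two timings in a row.
# ASSUMPTION: ratio = first timing / second timing (consistent with the rows above).

def ratio(first_ns: float, second_ns: float) -> float:
    """Return the ratio of the two timings, both given in nanoseconds."""
    return first_ns / second_ns

# Three rows copied from the results above: (name, first timing, second timing).
ROWS = [
    ("Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)",
     781416292, 607837250),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)",
     32917, 48292),
    ("groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA",
     209896.5, 195651),
]

# Hypothetical cutoff for illustration only; not this workflow's configuration.
THRESHOLD = 1.25

def flag_large_ratios(rows, threshold=THRESHOLD):
    """Return (name, rounded ratio) for rows whose ratio exceeds the cutoff."""
    flagged = []
    for name, first_ns, second_ns in rows:
        r = ratio(first_ns, second_ns)
        if r > threshold:
            flagged.append((name, round(r, 2)))
    return flagged

if __name__ == "__main__":
    for name, first_ns, second_ns in ROWS:
        print(f"{name}: {ratio(first_ns, second_ns):.2f}")
    print("flagged:", flag_large_ratios(ROWS))
```

With these three rows, only the Conv enzyme entry exceeds the illustrative cutoff. Sub-microsecond rows such as the small batchnorm cases can show large ratios from a difference of a few tens of nanoseconds, so a single ratio is weak evidence on its own.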