CompatHelper: bump compat for LossFunctions in [weakdeps] to 1, (keep existing compat) #1108
Merged: avik-pal merged 1 commit into main from compathelper/new_version/2024-11-28-00-19-06-369-03156874151 on Dec 3, 2024
avik-pal force-pushed the compathelper/new_version/2024-11-28-00-19-06-369-03156874151 branch from cdc59d0 to 4ce11e6 (November 28, 2024 00:19)
Benchmark Results (ASV)
Benchmark Plots: A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Lux Benchmarks
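The Ratio column in the table that follows appears to be the Current timing divided by the Previous timing, rounded to two decimals (for example 4375 ns / 4291 ns ≈ 1.02), so values above 1 mean the current commit was slower. A minimal sketch of that arithmetic is given here; the helper names and the 1.3 flagging threshold are assumptions made for illustration, not settings taken from this PR's workflow.

```python
# Hypothetical helpers: reproduce the Ratio column from two timing cells.

def parse_ns(cell: str) -> float:
    """Convert a timing cell such as '4375 ns' into nanoseconds."""
    value, unit = cell.split()
    if unit != "ns":
        raise ValueError(f"unexpected unit: {unit!r}")
    return float(value)


def ratio(current: str, previous: str) -> float:
    """Current / Previous, rounded to two decimals as in the table."""
    return round(parse_ns(current) / parse_ns(previous), 2)


# Assumed threshold for flagging a slowdown; not taken from the PR.
THRESHOLD = 1.3

# Three rows copied from the table (name, current, previous).
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)",
     "4375 ns", "4291 ns"),
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA",
     "84646 ns", "60770 ns"),
    ("layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)",
     "207416 ns", "325166 ns"),
]

for name, current, previous in rows:
    r = ratio(current, previous)
    status = "slower than threshold" if r > THRESHOLD else "within threshold"
    print(f"{r:.2f}  {status}  {name}")
```

Many of the rows are timings of a few hundred nanoseconds to a few microseconds, where a single ratio from one run is noisy, so a flagged row is a prompt to re-run rather than proof of a regression.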
Benchmark suite | Current: 4ce11e6 | Previous: 78ad9c9 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4375 ns | 4291 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4083 ns | 3958 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5500 ns | 5125 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3917 ns | 4250 ns | 0.92 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 84646 ns | 60770 ns | 1.39 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10833 ns | 10250 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10375 ns | 10125 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 11250 ns | 10333 ns | 1.09 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10666 ns | 10334 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 424092 ns | 423675 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1208 ns | 1125 ns | 1.07 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1375 ns | 1166 ns | 1.18 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1375 ns | 1229.5 ns | 1.12 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1209 ns | 1250 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18187 ns | 17992 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4167 ns | 4250 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4042 ns | 4000 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4333 ns | 4167 ns | 1.04 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4334 ns | 3958 ns | 1.09 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 109616 ns | 109284 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56292 ns | 57417 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47583 ns | 38208 ns | 1.25 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47042 ns | 46375 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81500 ns | 80167 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37683 ns | 36667.5 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2047458 ns | 2021709 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2111917 ns | 2097000 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1774291.5 ns | 2077875 ns | 0.85 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1998333 ns | 2001000 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 197048 ns | 195812 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144500 ns | 145166.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 143750 ns | 142666 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 146417 ns | 146500 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 143625 ns | 144167 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166583.5 ns | 165803 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1136187.5 ns | 1104750 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1137958 ns | 1156062 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 809667 ns | 1104750 ns | 0.73 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1023687.5 ns | 1129458 ns | 0.91 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 528203.5 ns | 527714 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3667 ns | 4000 ns | 0.92 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3583 ns | 3625 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4250 ns | 4375 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3479.5 ns | 3459 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 71329.5 ns | 70555.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9417 ns | 9084 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9167 ns | 8709 ns | 1.05 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9292 ns | 9667 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9791 ns | 9167 ns | 1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 487141 ns | 481518.5 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16041 ns | 15416 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17916 ns | 16958 ns | 1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17375 ns | 16791.5 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15104.5 ns | 14792 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 54934 ns | 54315.5 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 216229 ns | 213958 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216416 ns | 214042 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215958 ns | 214208 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215209 ns | 214334 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 273941.5 ns | 273628 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 541 ns | 583 ns | 0.93 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 791 ns | 667 ns | 1.19 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 583 ns | 583.5 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17588 ns | 17264 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1667 ns | 1500 ns | 1.11 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1584 ns | 1625 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1625 ns | 1792 ns | 0.91 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1708 ns | 1708 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 102547 ns | 102318 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7084 ns | 7000 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6083 ns | 5084 ns | 1.20 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5750 ns | 5958 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9959 ns | 9916 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24052 ns | 23961 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 228958 ns | 221542 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228208 ns | 229708.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230250 ns | 229667 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213375 ns | 226542 ns | 0.94 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 170592 ns | 170388 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3958 ns | 3875 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3875 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23161 ns | 23385 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16584 ns | 16625 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16625 ns | 16500 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16917 ns | 17000 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16792 ns | 16833 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 160992 ns | 161544 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 580167 ns | 581791 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 573500 ns | 578709 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 578666 ns | 569958 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 572459 ns | 572333.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113258.5 ns | 113621 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1427958 ns | 1428958 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1425333 ns | 1421292 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1412438 ns | 1415833 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1419458.5 ns | 1420000 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 210587 ns | 210533 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1051916.5 ns | 1081750 ns | 0.97 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 966458.5 ns | 938708 ns | 1.03 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1348291.5 ns | 1353291.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1309083.5 ns | 1296666 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 273561 ns | 269675 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5852875 ns | 5971292 ns | 0.98 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4611959 ns | 4530771.5 ns | 1.02 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4913417 ns | 4949917 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5741834 ns | 5624041 ns | 1.02 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1089040 ns | 1072622 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 542 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 583 ns | 542 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23642 ns | 23468 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2167 ns | 2125 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2084 ns | 2084 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2250 ns | 2208 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2208 ns | 2125 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 171746 ns | 169303 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4375 ns | 4167 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4209 ns | 4208 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5292 ns | 4708 ns | 1.12 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4167 ns | 4125 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 66427 ns | 66233.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11584 ns | 11125 ns | 1.04 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11042 ns | 11250 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12167 ns | 12000 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11375 ns | 10792 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 451498.5 ns | 452338 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6708 ns | 6292 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6708 ns | 6417 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7833.5 ns | 7604.5 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6333 ns | 5833 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 52022 ns | 52542 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17708 ns | 18583 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16958 ns | 17500 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17666 ns | 18833 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17833 ns | 16833 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 305035.5 ns | 301964.5 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 542 ns | 1.08 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 583 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 645.5 ns | 625 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 667 ns | 583 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 32534 ns | 32911 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9125 ns | 8625 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8792 ns | 8542 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 9125 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9417 ns | 8917 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 160811.5 ns | 160010 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64708 ns | 64500 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 65000 ns | 64666 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64666 ns | 64500 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64459 ns | 64500 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111471.5 ns | 112101 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 296083.5 ns | 279458 ns | 1.06 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 280604.5 ns | 288583 ns | 0.97 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 281000 ns | 273583 ns | 1.03 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 278354.5 ns | 286083 ns | 0.97 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 186018 ns | 185547.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3152542 ns | 3376750.5 ns | 0.93 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3083145.5 ns | 2898291.5 ns | 1.06 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3062000 ns | 3024854 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 3957896 ns | 3941104 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 589493 ns | 581323 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7580875 ns | 7603583 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7478187.5 ns | 7358750 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7308437.5 ns | 7466208 ns | 0.98 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8200459 ns | 8146792 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1332529.5 ns | 1318419 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 17664271 ns | 17484792 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 17602854 ns | 17670999.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 17558146 ns | 17533250 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 14131125.5 ns | 9220187.5 ns | 1.53 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 24347834 ns | 23603916 ns | 1.03 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33909500 ns | 43639208 ns | 0.78 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37072979 ns | 37125083 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34935333 ns | 34980187.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1858311 ns | 1854234 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 192610083 ns | 188207417 ns | 1.02 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 233420667 ns | 251666438 ns | 0.93 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 194817417 ns | 194864208 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 435064333 ns | 434287708 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13953686 ns | 13931919 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 291677000 ns | 287943833 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 337320333 ns | 355406479.5 ns | 0.95 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 296199250 ns | 297803834 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 397202208.5 ns | 400767145.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22709 ns | 22458 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22875 ns | 22208 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23542 ns | 25041 ns | 0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22334 ns | 22270.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 95278.5 ns | 96107.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 114395.5 ns | 113166.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 104083 ns | 104292 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 105042 ns | 105083 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 157500 ns | 103812.5 ns | 1.52 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 495763 ns | 502678.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5917 ns | 6833 ns | 0.87 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6333 ns | 6479.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6958 ns | 7041.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5813 ns | 5958 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 67876 ns | 68593 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15541 ns | 15000 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15209 ns | 15479 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16334 ns | 16333 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14917 ns | 14708.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 464239.5 ns | 475032.5 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3016917 ns | 3031167 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2067708 ns | 2061583 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2279041 ns | 2253209 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4761625 ns | 4505270.5 ns | 1.06 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 585961 ns | 586394 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23894333 ns | 23625708.5 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18053729.5 ns | 18333062.5 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18010584 ns | 17998916.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35772354 ns | 35608125.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2775381 ns | 2764773.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33710250.5 ns | 33284000 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27797916.5 ns | 28078500 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28492020.5 ns | 28952938 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41549708 ns | 41446187.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 73354 ns | 72167 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 74583 ns | 81083 ns | 0.92 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76417 ns | 86562.5 ns | 0.88 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72500 ns | 75479 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 102363 ns | 104806 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 304187.5 ns | 223458.5 ns | 1.36 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 207416 ns | 325166 ns | 0.64 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213917 ns | 320958 ns | 0.67 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219021 ns | 210500 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 548660.5 ns | 552193 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11500 ns | 11917 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11833 ns | 12583 ns | 0.94 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12541 ns | 12708 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11750 ns | 12083 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 71670 ns | 71752 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26937.5 ns | 26667 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27125 ns | 26583 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27792 ns | 28000 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26875 ns | 26500 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 471558 ns | 476956.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12146 ns | 11667 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 13083 ns | 12333 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14375 ns | 12917 ns | 1.11 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12459 ns | 11834 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 53115.5 ns | 53475 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25458 ns | 25792 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26292 ns | 25500 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26167 ns | 26500 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26166 ns | 26000 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 304265 ns | 305905.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 180687.5 ns | 181458 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 180500 ns | 180541 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 181834 ns | 184604.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 179291 ns | 179667 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 57410.5 ns | 57257.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 591437 ns | 592917 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 584500 ns | 587687.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 585167 ns | 595750 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 582833.5 ns | 582791.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 288843.5 ns | 291107 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6000 ns | 8958 ns | 0.67 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6208 ns | 6583 ns | 0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7000 ns | 8042 ns | 0.87 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6084 ns | 6375 ns | 0.95 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 71614.5 ns | 71199.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14125 ns | 13916 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14833 ns | 14875 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15334 ns | 15459 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14334 ns | 13958.5 ns | 1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 464523 ns | 465947 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1192625 ns | 1219708 ns | 0.98 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1256562.5 ns | 1231750 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1293416.5 ns | 1269667 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1002250 ns | 1009666 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 300031.5 ns | 300921 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4300875 ns | 4103750 ns | 1.05 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4442145.5 ns | 4571833 ns | 0.97 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4756625 ns | 4574959 ns | 1.04 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 3694042 ns | 3707208 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1045545 ns | 1038858 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1833 ns | 1834 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1875 ns | 1833 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1917 ns | 1875 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1834 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23222 ns | 23656 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4875 ns | 4875 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4916 ns | 4792 ns | 1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4916 ns | 4917 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4958 ns | 4875 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 187868 ns | 190147.5 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5375 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5834 ns | 5708.5 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6958 ns | 6917 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5437.5 ns | 1.08 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 55818 ns | 56411.5 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10625 ns | 10750 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10917 ns | 11000 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11542 ns | 11834 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11209 ns | 10729.5 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 337998 ns | 336162 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 334 ns | 333 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 334 ns | 0.87 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22484 ns | 22819 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2875 ns | 2750 ns | 1.05 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2750 ns | 2750 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3083 ns | 3042 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2833 ns | 2792 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 156444.5 ns | 159135.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11458 ns | 11458 ns | 1 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12084 ns | 11333 ns | 1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13125 ns | 12750 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11541 ns | 11208 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 58675.5 ns | 58102 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24583.5 ns | 24750 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24333 ns | 24334 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25291 ns | 25084 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24500 ns | 24750 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 300446.5 ns | 298883.5 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4209 ns | 4209 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4250 ns | 4209 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4250 ns | 4291 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4167 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24328 ns | 24823 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 15917 ns | 16084 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16417 ns | 15959 ns | 1.03 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16375 ns | 16500 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16083 ns | 16167 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 198420.5 ns | 197271 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5791 ns | 5833 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5750 ns | 5791 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5833 ns | 5916 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5917 ns | 5833 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33687 ns | 34115 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20312.5 ns | 20500 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21083 ns | 20417 ns | 1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21417 ns | 21250 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20833 ns | 20708 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 177081 ns | 178582.5 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 397958 ns | 423708.5 ns | 0.94 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 385541 ns | 366416.5 ns | 1.05 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 483333 ns | 484917 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 103521 ns | 103541 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66957 ns | 67022 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 889417 ns | 943375 ns | 0.94 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 967146.5 ns | 950687 ns | 1.02 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1180875 ns | 1197916.5 ns | 0.99 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 396250 ns | 330416.5 ns | 1.20 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 191155.5 ns | 193979 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 83458.5 ns | 80541.5 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 78645.5 ns | 81125 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 81125 ns | 81541.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 126396 ns | 80479.5 ns | 1.57 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193610 ns | 194031 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1919146 ns | 1919833 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1945833 ns | 1936958 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1930334 ns | 1930229 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1919271 ns | 1923250 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 409565 ns | 400084 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21892 ns | 21834 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1833 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1834 ns | 1750 ns | 1.05 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1875 ns | 1792 ns | 1.05 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 172421 ns | 168563 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6375 ns | 6416 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7208.5 ns | 6166 ns | 1.17 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8458 ns | 7667 ns | 1.10 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6541 ns | 6709 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 63461.5 ns | 61087.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9167 ns | 8959 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9291 ns | 8875 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9417 ns | 9250 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9250 ns | 9312.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 322316.5 ns | 309875.5 ns | 1.04 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 157169937.5 ns | 118672458 ns | 1.32 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173870500 ns | 182326458 ns | 0.95 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148265875 ns | 148081791.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 107136250 ns | 102035042 ns | 1.05 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5468921 ns | 5467326.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 672488500 ns | 610447729.5 ns | 1.10 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 557043167 ns | 582022188 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 453902479.5 ns | 452913708.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 757591437.5 ns | 751418979 ns | 1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34942280 ns | 34971564 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 699415291 ns | 646694167 ns | 1.08 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 667015104 ns | 688250333 ns | 0.97 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 595614666.5 ns | 583281666.5 ns | 1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 745317417 ns | 744581417 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58083 ns | 59000 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47916 ns | 37792 ns | 1.27 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 48000 ns | 47750 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83583 ns | 83417 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38683 ns | 38231 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1920625 ns | 1925854 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2012084 ns | 1987562.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1774916.5 ns | 1779021 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1888916.5 ns | 1864125 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 175791.5 ns | 175192.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 265875 ns | 292250 ns | 0.91 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 282625 ns | 268916 ns | 1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 269542 ns | 269500 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 268542 ns | 266000 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 141371.5 ns | 128884 ns | 1.10 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 676750 ns | 686771 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 694500 ns | 702187.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 586625 ns | 591083 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 622041 ns | 688958 ns | 0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 749046.5 ns | 706872 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2201458 ns | 2268958 ns | 0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2243750 ns | 2245875 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2170917 ns | 2101125 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2180209 ns | 2176375 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133870 ns | 133295.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5562500 ns | 5521229.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5498666 ns | 5587167 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5502042 ns | 5520666.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5503833 ns | 5493834 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 805122 ns | 748599 ns | 1.08 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 648292 ns | 642084 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 640229.5 ns | 648917 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 647417 ns | 636667 ns | 1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 643750 ns | 635875 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47089 ns | 46696 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1806083 ns | 1822625 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1726395.5 ns | 1670333 ns | 1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1722792 ns | 1719875 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2105042 ns | 2097416.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 223170 ns | 221082 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57208 ns | 57833 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46208 ns | 38500 ns | 1.20 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47750 ns | 46250 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84458.5 ns | 82750 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 29180 ns | 28653 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2026208 ns | 2020167 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2111312.5 ns | 2105417 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2101875 ns | 2093958 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1998042 ns | 1999958.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 191645.5 ns | 190261 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13388479.5 ns | 13356563 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12455333 ns | 12441584 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12550771 ns | 12535208 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15223479.5 ns | 15154375 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 506259 ns | 512188.5 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47530416.5 ns | 47248458 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41916083.5 ns | 42098688 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40927646 ns | 40986395.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58589167 ns | 58394208 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2878579 ns | 2891115 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74922500 ns | 74033603.5 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 68238479 ns | 68368417 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90707250 ns | 90690875 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 99093937.5 ns | 76143146 ns | 1.30 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57209 ns | 58250 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47875 ns | 38583 ns | 1.24 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47417 ns | 47625 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83083 ns | 79125 ns | 1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48924 ns | 47024 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1911333 ns | 1918250 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1951896 ns | 1983396 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1981792 ns | 1965584 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1874209 ns | 1830750 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195426 ns | 192100.5 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 334 ns | 1.12 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32387 ns | 32257 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6041 ns | 6083 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 5958 ns | 6000 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6541 ns | 6416 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6375 ns | 6104.5 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 181845 ns | 172267 ns | 1.06 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 250 ns | 1.16 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31764 ns | 31372 ns | 1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2667 ns | 2625 ns | 1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2625 ns | 2625 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2958 ns | 2875 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2834 ns | 2666 ns | 1.06 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 168556 ns | 158332 ns | 1.06 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 322882374.5 ns | 283213208 ns | 1.14 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 340460604.5 ns | 347751604 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314078562.5 ns | 314361479.5 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 270365250 ns | 273430250 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7106428 ns | 7090888 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1049520563 ns | 992205416 ns | 1.06 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 946755584 ns | 964468250 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 834284728.5 ns | 838327667 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1157006250 ns | 1152689375 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33928437.5 ns | 34106482 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1356253312.5 ns | 1303968312.5 ns | 1.04 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1355624041.5 ns | 1327504666.5 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1640731334 ns | 1629886334 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1672816583 ns | 1314925417 ns | 1.27 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1415000 ns | 1455709 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1408833 ns | 1463125 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1409750 ns | 1415166.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1415084 ns | 1410000 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128106 ns | 127607 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5053916 ns | 5015979 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5039000 ns | 5060792 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5038854.5 ns | 5051500 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5007604.5 ns | 5009458 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 619586 ns | 574399.5 ns | 1.08 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 163462125 ns | 170351312 ns | 0.96 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 128281083.5 ns | 167663375 ns | 0.77 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 126443875 ns | 130848583.5 ns | 0.97 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 153139396 ns | 167905166.5 ns | 0.91 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4857101 ns | 4881672 ns | 0.99 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 630120250 ns | 618588292 ns | 1.02 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 634882166 ns | 577882000 ns | 1.10 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 463859250 ns | 497505667 ns | 0.93 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 648056000 ns | 647917125 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16956311 ns | 16266169 ns | 1.04 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 9025812.5 ns | 8910542 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8963999.5 ns | 9026291.5 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7890895.5 ns | 7927084 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9809417 ns | 9711125 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1606346 ns | 1592738 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36653292 ns | 35730646 ns | 1.03 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37365041.5 ns | 38522375 ns | 0.97 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33358104.5 ns | 33553041 ns | 0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37943458 ns | 37755625 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6472769 ns | 6512589 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47417 ns | 47333 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47625 ns | 47333 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47667 ns | 47334 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47270.5 ns | 47875 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18634 ns | 18035 ns | 1.03 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50125 ns | 52792 ns | 0.95 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50250 ns | 50292 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50625 ns | 50458 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50292 ns | 50667 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 226610 ns | 197012 ns | 1.15 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6708 ns | 6375 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7083 ns | 6250 ns | 1.13 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7916 ns | 7417 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6667 ns | 6750 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 112917 ns | 112280 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10041 ns | 9584 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10250 ns | 9458 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10500 ns | 10125 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9792 ns | 10209 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 670143.5 ns | 615930.5 ns | 1.09 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6167 ns | 5416 ns | 1.14 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6041 ns | 5791 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8041 ns | 7146 ns | 1.13 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5833 ns | 5959 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 137866 ns | 123840 ns | 1.11 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13041 ns | 12583 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12833 ns | 12750 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13770.5 ns | 13208 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13542 ns | 12708 ns | 1.07 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 578170 ns | 529723.5 ns | 1.09 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 958 ns | 1083 ns | 0.88 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1042 ns | 1000 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1042 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32251 ns | 32491 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8020.5 ns | 8000 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7875 ns | 7750 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 8209 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8084 ns | 7959 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 227701.5 ns | 209838 ns | 1.09 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 22958 ns | 23417 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23417 ns | 23041 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23417 ns | 23584 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23500 ns | 23417 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18322 ns | 18029 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52541 ns | 54667 ns | 0.96 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52500 ns | 52417 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52833 ns | 52667 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52667 ns | 52458 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 345143 ns | 299710 ns | 1.15 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1453104 ns | 1444833 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1409541.5 ns | 1449584 ns | 0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1400083.5 ns | 1399209 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1398458.5 ns | 1396958.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196055 ns | 195765 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5045166 ns | 5000042 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5041833 ns | 5049833 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5022042 ns | 5044562 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4999104 ns | 5015291.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 682114 ns | 612366.5 ns | 1.11 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3076646 ns | 3043104 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 1900792 ns | 2098583 ns | 0.91 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2298520.5 ns | 2313209 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4520479.5 ns | 4606709 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 581732 ns | 580804.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24602021 ns | 24374458 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18878250 ns | 19110937.5 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18930875 ns | 18926833 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36607541.5 ns | 36250750 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2856759 ns | 2861963.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34453291 ns | 33972875 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28407666 ns | 28642167 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28093604 ns | 28092229 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41723604 ns | 41633541.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 143978208 ns | 141888875 ns | 1.01 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 148557146 ns | 146034209 ns | 1.02 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 127214563 ns | 126705062.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 172790520.5 ns | 173781771 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22548832 ns | 22552094 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 936784479.5 ns | 1227732750 ns | 0.76 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1213946375 ns | 839227916.5 ns | 1.45 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 786668334 ns | 739276458 ns | 1.06 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 665353916 ns | 683957250 ns | 0.97 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118204063.5 ns | 117875105 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74166.5 ns | 73084 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75333.5 ns | 74479 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76063 ns | 75750 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74541 ns | 74958 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 291801 ns | 240665.5 ns | 1.21 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 281167 ns | 280208.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 292417 ns | 288959 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 205812.5 ns | 193791 ns | 1.06 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 273958 ns | 192583 ns | 1.42 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1421509 ns | 1331151 ns | 1.07 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 36347500.5 ns | 35557542 ns | 1.02 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36490833 ns | 36592625 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32565312.5 ns | 32410750 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40326750 ns | 40376458 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5842760.5 ns | 5838475 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 151616750 ns | 148073500 ns | 1.02 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 153933042 ns | 158619999.5 ns | 0.97 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 138687042 ns | 139542333.5 ns | 0.99 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 283410625 ns | 282659625 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34881239 ns | 34873454 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 156647937.5 ns | 120976041.5 ns | 1.29 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174062667 ns | 182674416.5 ns | 0.95 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148038834 ns | 147566209 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106701312.5 ns | 105641958.5 ns | 1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5454041 ns | 5456587 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 515030562.5 ns | 471084687.5 ns | 1.09 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 466854541 ns | 489605103.5 ns | 0.95 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 437508208 ns | 432706750 ns | 1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 740373250 ns | 737367000 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32305250 ns | 32284178 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 687245708 ns | 707739104.5 ns | 0.97 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 656087000 ns | 677702687.5 ns | 0.97 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 573799604 ns | 572041062.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 736233875 ns | 735458208 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1212875 ns | 1303791.5 ns | 0.93 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 988709 ns | 778750 ns | 1.27 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 926750 ns | 904854 ns | 1.02 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 1961041 ns | 1945625 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 581417 ns | 581135.5 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2926791.5 ns | 2961271 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2598375 ns | 2515584 ns | 1.03 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2622791 ns | 2624334 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3707604 ns | 3695417 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1791390 ns | 1838423 ns | 0.97 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5875250 ns | 5788229.5 ns | 1.02 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5772208 ns | 5903625 ns | 0.98 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5791209 ns | 5805354.5 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2892417 ns | 2899667 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7375 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6209 ns | 5250 ns | 1.18 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6125 ns | 6167 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 9916 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25314 ns | 25653 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214083 ns | 212479.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220833.5 ns | 226833 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 223708 ns | 220417 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 207250 ns | 206167 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 294210 ns | 275653 ns | 1.07 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 310120958.5 ns | 307447667 ns | 1.01 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 230513875 ns | 279760625 ns | 0.82 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 199087146 ns | 198268687.5 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 310619959 ns | 308090500 ns | 1.01 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7672328 ns | 7673335 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1090523750.5 ns | 1074946146 ns | 1.01 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 909677416.5 ns | 1069981500 ns | 0.85 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 800460750 ns | 801953875 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1151616708.5 ns | 1147606167 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26662753 ns | 26674789 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5458 ns | 4958 ns | 1.10 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5417 ns | 5208 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6792 ns | 5958 ns | 1.14 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5417 ns | 5042 ns | 1.07 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 197404.5 ns | 169081.5 ns | 1.17 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7083 ns | 6833 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7520.5 ns | 6917 ns | 1.09 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8083 ns | 7625 ns | 1.06 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7500 ns | 7125 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 708066 ns | 666084 ns | 1.06 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 625 ns | 0.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 583 ns | 1.07 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 667 ns | 542 ns | 1.23 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23541 ns | 24582 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 8729.5 ns | 9125 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9375 ns | 8459 ns | 1.11 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9084 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9167 ns | 9041 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 230110 ns | 231180 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 355396 ns | 352416.5 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352604.5 ns | 351792 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352958 ns | 354500 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 360166.5 ns | 352125 ns | 1.02 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 20966 ns | 21300.5 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 779875 ns | 814416 ns | 0.96 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 828542 ns | 809021 ns | 1.02 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 775937 ns | 782042 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 824729 ns | 827334 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 295465.5 ns | 305499.5 ns | 0.97 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 321125 ns | 336479.5 ns | 0.95 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 340292 ns | 321125 ns | 1.06 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 454667 ns | 450500 ns | 1.01 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 10208 ns | 10542 ns | 0.97 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17809 ns | 18195 ns | 0.98 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 713583.5 ns | 721208 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 732458 ns | 733229 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1006146 ns | 1007271 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26625 ns | 26666 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 294532 ns | 274145 ns | 1.07 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 363458 ns | 383062 ns | 0.95 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 346333 ns | 329312 ns | 1.05 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 443583 ns | 442417 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 31958.5 ns | 30792 ns | 1.04 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22792 ns | 22813 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 726458 ns | 737625 ns | 0.98 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 783458 ns | 785604 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1030000 ns | 1032042 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 91000.5 ns | 105375 ns | 0.86 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 254713.5 ns | 222871.5 ns | 1.14 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3625 ns | 3417 ns | 1.06 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3666 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3583 ns | 3583 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17993 ns | 17737 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4250 ns | 4417 ns | 0.96 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4209 ns | 4209 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4375 ns | 4333 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4209 ns | 4292 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 293869 ns | 278790 ns | 1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3708 ns | 3791 ns | 0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4125 ns | 3604.5 ns | 1.14 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4250 ns | 4145.5 ns | 1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3833 ns | 3666.5 ns | 1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 239644.5 ns | 207112 ns | 1.16 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8459 ns | 8125 ns | 1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8334 ns | 8000 ns | 1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8500 ns | 8542 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8333 ns | 8458 ns | 0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1235108 ns | 1220818 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 205166 ns | 203687.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209375 ns | 210041 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210625 ns | 210625 ns | 1 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199917 ns | 200708 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34664.5 ns | 34937 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 643542 ns | 645270.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 628083 ns | 631770.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621958 ns | 622458 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 631354 ns | 630750 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 352108 ns | 343085 ns | 1.03 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 994063 ns | 1001750 ns | 0.99 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1017812.5 ns | 1034729 ns | 0.98 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 956000 ns | 956333 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 874000 ns | 879958 ns | 0.99 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 209089 ns | 207672.5 ns | 1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4616542 ns | 4524208 ns | 1.02 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4758625 ns | 4821708 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4473229 ns | 4482250 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 5131208 ns | 5132979 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 907863.5 ns | 922465 ns | 0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3375 ns | 3666 ns | 0.92 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3583 ns | 3292 ns | 1.09 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4084 ns | 3417 ns | 1.20 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3542 ns | 3583 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 229996.5 ns | 232276 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7333 ns | 7292 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7375 ns | 6792 ns | 1.09 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7459 ns | 7500 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7500 ns | 6875 ns | 1.09 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1033168.5 ns | 1014308 ns | 1.02 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1601062.5 ns | 1651708 ns | 0.97 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1164187 ns | 1164875 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1348916 ns | 1344708 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2438708 ns | 2500875 ns | 0.98 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 213393 ns | 214937 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12359854 ns | 12379084 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9593167 ns | 9615125.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9306187.5 ns | 9247041 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 17918375 ns | 18054792 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1944835.5 ns | 1946109 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17368208 ns | 17413000 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14457396 ns | 14415146.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14346583 ns | 14339250 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21056958.5 ns | 21151646 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 89250 ns | 134917 ns | 0.66 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90875 ns | 88958 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 90333 ns | 91334 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 89292 ns | 87666 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126558 ns | 126488 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2042271 ns | 2026792 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2047792 ns | 2043625 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2046125 ns | 1766792 ns | 1.16 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2027187.5 ns | 2026459 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1094922 ns | 1034650 ns | 1.06 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 2125 ns | 2770.5 ns | 0.77 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 2833 ns | 1334 ns | 2.12 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3625 ns | 3208 ns | 1.13 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 2875 ns | 3791 ns | 0.76 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16329 ns | 16389 ns | 1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2750 ns | 2584 ns | 1.06 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2791 ns | 2459 ns | 1.14 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2750 ns | 2709 ns | 1.02 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2667 ns | 2791 ns | 0.96 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 198937.5 ns | 192723.5 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 6833 ns | 7250 ns | 0.94 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5208 ns | 1.15 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 5959 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10208 ns | 9959 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34628 ns | 34193 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 218667 ns | 225250 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 224375 ns | 227063 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221250 ns | 220708 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213208 ns | 213333 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 357231 ns | 312634.5 ns | 1.14 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3750 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3708 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3708 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22304 ns | 22321 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14042 ns | 14417 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14500 ns | 14250 ns | 1.02 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14500 ns | 14416 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14375 ns | 14375 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 522517 ns | 475484 ns | 1.10 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 93395.5 ns | 134292 ns | 0.70 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 94625 ns | 93667 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 94520.5 ns | 94354.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 93708 ns | 91958 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126008 ns | 125921 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1925375 ns | 1924541.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1959833 ns | 1939333 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1934167 ns | 1709625 ns | 1.13 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1923104 ns | 1925042 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1031315 ns | 949226.5 ns | 1.09 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 860041 ns | 874708 ns | 0.98 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 818125 ns | 796250 ns | 1.03 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1223500.5 ns | 1220958 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 960000 ns | 963208 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 278534 ns | 277966 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2734187.5 ns | 2838542 ns | 0.96 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2543791 ns | 2538917 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3348292 ns | 3341125 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3412625 ns | 3415500 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1727357 ns | 1590492.5 ns | 1.09 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16146 ns | 17646 ns | 0.91 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16854 ns | 16500 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17542 ns | 18042 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17937.5 ns | 17333 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 145155.5 ns | 142389.5 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 258562.5 ns | 226250 ns | 1.14 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 224416.5 ns | 239208.5 ns | 0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216167 ns | 215666.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 226792 ns | 227708 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 722833 ns | 648593.5 ns | 1.11 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 223375 ns | 222666 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222416 ns | 220083 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 220333.5 ns | 222792 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 222292 ns | 221875 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 276639.5 ns | 275688.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 498500 ns | 564542 ns | 0.88 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 568917 ns | 507292 ns | 1.12 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 497270.5 ns | 506333 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 553000 ns | 559542 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1551978 ns | 1323540.5 ns | 1.17 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 4916 ns | 4229.5 ns | 1.16 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4417 ns | 3958 ns | 1.12 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 5291.5 ns | 3916 ns | 1.35 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 3687.5 ns | 4333 ns | 0.85 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 17012 ns | 16749 ns | 1.02 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7291.5 ns | 7187 ns | 1.01 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7375 ns | 6917 ns | 1.07 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7375 ns | 7292 ns | 1.01 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7250 ns | 7416 ns | 0.98 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 199205.5 ns | 193558 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17750 ns | 19333.5 ns | 0.92 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18583 ns | 17167 ns | 1.08 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18792 ns | 19291 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19708 ns | 16959 ns | 1.16 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 149215 ns | 145420.5 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212229.5 ns | 223917 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 214791.5 ns | 216437.5 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214000 ns | 215375 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213416.5 ns | 213812.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1106181.5 ns | 914033 ns | 1.21 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4604 ns | 4958 ns | 0.93 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4541 ns | 4250 ns | 1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5208 ns | 4417 ns | 1.18 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4625 ns | 3917 ns | 1.18 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 247437 ns | 206416 ns | 1.20 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10417 ns | 10250 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10416 ns | 10000 ns | 1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10667 ns | 10958 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10708.5 ns | 10000 ns | 1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1104526 ns | 1027488.5 ns | 1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3792 ns | 3833 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3583 ns | 3459 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4375 ns | 3416 ns | 1.28 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3708.5 ns | 3250 ns | 1.14 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 250113 ns | 236791.5 ns | 1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7709 ns | 7417 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7417 ns | 7250 ns | 1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8042 ns | 7625 ns | 1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7792 ns | 7375 ns | 1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1118106 ns | 1067899 ns | 1.05 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 24395229.5 ns | 23463750.5 ns | 1.04 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34970417 ns | 43484791.5 ns | 0.80 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37903500 ns | 37835875 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34905292 ns | 34880875 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1846073.5 ns | 1833754 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 186148375 ns | 184463792 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159573208 ns | 172964124.5 ns | 0.92 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146504750 ns | 146554521 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 410969125 ns | 410369375 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16461519 ns | 16525549 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 433409166 ns | 424815979 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 253111667 ns | 259769792 ns | 0.97 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 296633708 ns | 297288958 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 482176417 ns | 478383791 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183750 ns | 183959 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 184166 ns | 183375 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184209 ns | 186187.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 185792 ns | 183187.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 231726 ns | 205888.5 ns | 1.13 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 587333 ns | 602916.5 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 634750 ns | 596416.5 ns | 1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 587375 ns | 592375 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 636583 ns | 596542 ns | 1.07 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1168518 ns | 1054788 ns | 1.11 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
3877125 ns |
3829562.5 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
4098875 ns |
3998791.5 ns |
1.03 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
3590458 ns |
3564812.5 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
4552854.5 ns |
4550791.5 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
546491 ns |
532059.5 ns |
1.03 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
17951854 ns |
17302667 ns |
1.04 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
17977208 ns |
18565313 ns |
0.97 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
16584500 ns |
16600312.5 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
20094125 ns |
20208979.5 ns |
0.99 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2622140 ns |
2631431 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
666 ns |
583 ns |
1.14 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
583 ns |
542 ns |
1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
625 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
625 ns |
542 ns |
1.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
32440 ns |
33095 ns |
0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9625 ns |
9083 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
8917 ns |
9042 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9520.5 ns |
9458.5 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
9708 ns |
9125 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
266562.5 ns |
266296 ns |
1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) |
498981625 ns |
498097750 ns |
1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) |
425212729 ns |
506743916 ns |
0.84 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) |
423096083 ns |
424015542 ns |
1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) |
678508583 ns |
594637416 ns |
1.14 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA |
12483399 ns |
12483759 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) |
1884453479 ns |
1878936437.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) |
1622853833 ns |
1662067875 ns |
0.98 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) |
1491015604 ns |
1496755770.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) |
2214891062.5 ns |
2214230167 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA |
49102861.5 ns |
49527395 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1639542 ns |
1663166 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1200791.5 ns |
1177833 ns |
1.02 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1379229 ns |
1370041 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2469520.5 ns |
2349521 ns |
1.05 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
215836 ns |
217522 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12801083.5 ns |
12726750 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9934354.5 ns |
10036417 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9654646 ns |
9643083 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18402104 ns |
18397833 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2050458 ns |
2037123 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17715041 ns |
17723584 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14747229 ns |
14827916 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14595562.5 ns |
14555416.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21491333 ns |
21415041 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
26250 ns |
26250 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
26291 ns |
26250 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
26208 ns |
26291 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
26250 ns |
26209 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
23986 ns |
23706 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
66667 ns |
67354.5 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
67375 ns |
66792 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
67166 ns |
68375 ns |
0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
67084 ns |
66875 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
407500 ns |
393355.5 ns |
1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
204125 ns |
203458 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
209541 ns |
209417 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
209459 ns |
210084 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
199500 ns |
199125 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
26363 ns |
26245.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
648854 ns |
647916 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
621375 ns |
672375.5 ns |
0.92 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
621937.5 ns |
621792 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
627417 ns |
593542 ns |
1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
354910 ns |
351878.5 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
650750 ns |
679750 ns |
0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
657458.5 ns |
657291 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
605042 ns |
595709 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
639312.5 ns |
632771 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
132108 ns |
131601.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2303125 ns |
2238750 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2235792 ns |
2300791 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2243583 ns |
2241896 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2242687.5 ns |
2244958 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1270621 ns |
1242570.5 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
18375 ns |
18625 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
17959 ns |
17979 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
19000 ns |
18375 ns |
1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
20375 ns |
17104 ns |
1.19 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
146138.5 ns |
144244 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
220083 ns |
256458 ns |
0.86 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
230958 ns |
245646 ns |
0.94 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
219834 ns |
221750 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
220542 ns |
230416 ns |
0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1103808.5 ns |
1056298 ns |
1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
583 ns |
584 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
542 ns |
625 ns |
0.87 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
667 ns |
667 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
708 ns |
583 ns |
1.21 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
24081 ns |
23741 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9667 ns |
9208 ns |
1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9416.5 ns |
9708 ns |
0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9916 ns |
9458 ns |
1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9917 ns |
9333 ns |
1.06 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
263006.5 ns |
257592.5 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5917 ns |
5125 ns |
1.15 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5292 ns |
5500 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6334 ns |
6395.5 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5125 ns |
5458 ns |
0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
233824.5 ns |
231821.5 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7500 ns |
6833 ns |
1.10 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7292 ns |
6792 ns |
1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7584 ns |
7458 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7208 ns |
6917 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
802883.5 ns |
801589.5 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
2083 ns |
2167 ns |
0.96 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
2416 ns |
2000 ns |
1.21 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
2417 ns |
2208 ns |
1.09 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
2583 ns |
2375 ns |
1.09 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
17959 ns |
17797 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
6417 ns |
6375 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
6708 ns |
6542 ns |
1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
6625 ns |
6667 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
6542 ns |
6375 ns |
1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
332926 ns |
330267.5 ns |
1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
749458 ns |
748708 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
750500 ns |
756208 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
749209 ns |
752750 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
746999.5 ns |
753542 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
21640.5 ns |
20724 ns |
1.04 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
788396 ns |
792417 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
814125 ns |
796875 ns |
1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
773541 ns |
786834 ns |
0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
787458 ns |
808000 ns |
0.97 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
299599.5 ns |
297689.5 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7209 ns |
7250 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5958 ns |
5250 ns |
1.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6000 ns |
6042 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10250 ns |
10125 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
33378 ns |
33074 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
223312.5 ns |
228604.5 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
234583 ns |
251041 ns |
0.93 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
228583 ns |
227708 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
225500 ns |
226000 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
364014 ns |
362298.5 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
10437.5 ns |
10209 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
10437.5 ns |
10209 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
10917 ns |
10458 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
10167 ns |
9750 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
247324 ns |
252317 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
25125 ns |
25334 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
24417 ns |
24312.5 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25083 ns |
25959 ns |
0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
24625 ns |
24395.5 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
1131993 ns |
1133104 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
107199604.5 ns |
106928354 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
118500021 ns |
126898666 ns |
0.93 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
120637125 ns |
121692334 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
118730312.5 ns |
117598792 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
2636142 ns |
2629460 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
391647750 ns |
390743083 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
367183625 ns |
379904750 ns |
0.97 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
357219875 ns |
361277959 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
495276708 ns |
481946125 ns |
1.03 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
15197474 ns |
15184946 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
763707354.5 ns |
754771020.5 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
584908084 ns |
597861750 ns |
0.98 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
747695166.5 ns |
748681771 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
947855583.5 ns |
760209125 ns |
1.25 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
7146 ns |
6500 ns |
1.10 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7250 ns |
6667 ns |
1.09 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
8458.5 ns |
8333 ns |
1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6750 ns |
6667 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
240105 ns |
239111 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14583 ns |
14125 ns |
1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
13500 ns |
14125 ns |
0.96 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14334 ns |
14437.5 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14042 ns |
13667 ns |
1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
1094237 ns |
1073718 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6708 ns |
5542 ns |
1.21 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5833 ns |
5542 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
7750 ns |
6395.5 ns |
1.21 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5750 ns |
5792 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
235579.5 ns |
235877.5 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
12708 ns |
12208 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
12125 ns |
12542 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12959 ns |
12750 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12583 ns |
12166 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
796257.5 ns |
781667 ns |
1.02 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) |
6166 ns |
5709 ns |
1.08 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) |
6500 ns |
5437.5 ns |
1.20 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) |
6041 ns |
5750 ns |
1.05 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) |
5459 ns |
5833 ns |
0.94 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA |
16992 ns |
16760 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) |
15667 ns |
15417 ns |
1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) |
15458 ns |
15333 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) |
15625 ns |
15500 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) |
15500 ns |
15625 ns |
0.99 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA |
201507.5 ns |
199275.5 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
333 ns |
375 ns |
0.89 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
416 ns |
0.90 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
417 ns |
333 ns |
1.25 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
23859 ns |
23515 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6000 ns |
6333 ns |
0.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6416 ns |
6167 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6792 ns |
6417 ns |
1.06 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6625 ns |
6333 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
241270 ns |
240257 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
5917 ns |
5833 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
5792 ns |
5875 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
5917 ns |
6083 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6000 ns |
5875 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
24880 ns |
24789 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
21208 ns |
20958 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
20875 ns |
20958.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
21791 ns |
21334 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
21875 ns |
21000 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
265379 ns |
263523 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
185750 ns |
188417 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
143791 ns |
162166 ns |
0.89 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
146437.5 ns |
146708.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
144041 ns |
149625 ns |
0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
167923.5 ns |
167166 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1355666 ns |
1323812.5 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1327584 ns |
1371958 ns |
0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1313333 ns |
1317937.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1324478.5 ns |
1325562.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1353391.5 ns |
1350174 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
22875.5 ns |
25292 ns |
0.90 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
22917 ns |
22500 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
23083 ns |
23146.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
21875 ns |
22979.5 ns |
0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
357023 ns |
352259 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
132875 ns |
173645.5 ns |
0.77 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
130063 ns |
180041 ns |
0.72 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
118583 ns |
119500 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
119042 ns |
126334 ns |
0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1541572 ns |
1470411 ns |
1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
375 ns |
0.78 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
333 ns |
334 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
416 ns |
0.90 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
23464 ns |
23380 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6459 ns |
6125 ns |
1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6375 ns |
6229.5 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6792 ns |
6708 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6666 ns |
6167 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
260312 ns |
256300 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
4770.5 ns |
5084 ns |
0.94 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4917 ns |
5083 ns |
0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4833.5 ns |
5083 ns |
0.95 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4792 ns |
4292 ns |
1.12 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
257356 ns |
256465.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10166 ns |
10209 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10291 ns |
9750 ns |
1.06 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10333 ns |
10750 ns |
0.96 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10250 ns |
10208 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
1347539 ns |
1354750 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1625 ns |
1625 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1584 ns |
1583 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1666 ns |
1708 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1584 ns |
1625 ns |
0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
23075 ns |
22916 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
5750 ns |
5750 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
5709 ns |
5667 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
6000 ns |
6167 ns |
0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
5791 ns |
5750 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
274492.5 ns |
272343 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
6862062.5 ns |
6820375 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
6386167 ns |
6368417 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
6513250 ns |
6567000 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
7668584 ns |
7648166 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
214723.5 ns |
214879 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
24141125 ns |
24083333.5 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
21293084 ns |
21351687.5 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
21005542 ns |
21140875 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
29845875 ns |
29752125.5 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2102920 ns |
2100360 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
37596000 ns |
37299645.5 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
34345812 ns |
34217771 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
45828958 ns |
45700125 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
49304416.5 ns |
38021000 ns |
1.30 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6083 ns |
5750 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
5917 ns |
5583.5 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6834 ns |
6395.5 ns |
1.07 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5791 ns |
5292 ns |
1.09 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
238463 ns |
235350 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8583 ns |
8167 ns |
1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8292 ns |
8416.5 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8375 ns |
8542 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8583 ns |
8500 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1060997 ns |
1060836 ns |
1.00 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) |
1537167 ns |
1566292 ns |
0.98 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) |
1266521 ns |
1237250 ns |
1.02 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) |
1605458 ns |
1619208 ns |
0.99 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) |
2157021 ns |
2132958 ns |
1.01 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA |
280814 ns |
278998 ns |
1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) |
7985729 ns |
7937625 ns |
1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) |
6616125 ns |
6656917 ns |
0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) |
7124500 ns |
7130604.5 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) |
10471541.5 ns |
10453333.5 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA |
1891446.5 ns |
1878437 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
352000 ns |
370292 ns |
0.95 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
376521 ns |
353124.5 ns |
1.07 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
461521 ns |
459083 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
21917 ns |
23666 ns |
0.93 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
42092.5 ns |
42541.5 ns |
0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
744729 ns |
753083 ns |
0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
810958 ns |
809125 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
1057791.5 ns |
1063125 ns |
0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
120687 ns |
116979.5 ns |
1.03 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
311338 ns |
239130.5 ns |
1.30 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397250 ns | 397291 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288084 ns | 212417 ns | 1.36 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288000 ns | 288125 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 751083 ns | 752000 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44119 ns | 44180 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 644042 ns | 667583 ns | 0.96 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 532708 ns | 474167 ns | 1.12 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 531750 ns | 531812.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973333 ns | 973083 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 189637 ns | 194058 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 666875 ns | 678250 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 691834 ns | 667145.5 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 641479 ns | 621709 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 652333 ns | 646959 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132553 ns | 133035 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2517292 ns | 2484229 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2458124.5 ns | 2543916.5 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2461459 ns | 2480312.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2442833 ns | 2471875 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1360464 ns | 1215811 ns | 1.12 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 2958 ns | 2791 ns | 1.06 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 3562.5 ns | 2084 ns | 1.71 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 4375 ns | 4333 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 3667 ns | 3354 ns | 1.09 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16607 ns | 16281.5 ns | 1.02 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5458 ns | 5375 ns | 1.02 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5542 ns | 5209 ns | 1.06 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5667 ns | 5500 ns | 1.03 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5583 ns | 5584 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 200518 ns | 201076.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1463583 ns | 1457583 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1499083 ns | 1497084 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1501292 ns | 1498833 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1441667 ns | 1436500 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 41223 ns | 41204 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5150666 ns | 5117834 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5316250 ns | 5304542 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5311104 ns | 5300500 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4994792 ns | 4807333 ns | 1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 198341 ns | 199725 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3709 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3709 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3708 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33867 ns | 32858 ns | 1.03 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14875 ns | 15250 ns | 0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15208 ns | 15000 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15417 ns | 15292 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15333 ns | 15083 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 381115 ns | 377713 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71125 ns | 70792 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71333 ns | 71417 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71250 ns | 71125 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71083 ns | 70000 ns | 1.02 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113443 ns | 113374.5 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 325667 ns | 318333 ns | 1.02 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 326291 ns | 334916 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 318416 ns | 318083 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 318750 ns | 318209 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 194359 ns | 193117.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 1000 ns | 1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1000 ns | 1000 ns | 1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1084 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 959 ns | 1.13 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23990 ns | 23866.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8083 ns | 7833 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7958 ns | 7875 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8458 ns | 8125 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 7875 ns | 1.05 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 263533.5 ns | 261797 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 495458 ns | 512646 ns | 0.97 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 489167 ns | 479541 ns | 1.02 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 569500 ns | 566104 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 219062.5 ns | 216667 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129442 ns | 130101 ns | 0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1408375 ns | 1405541 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1471125 ns | 1481750 ns | 0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1765312.5 ns | 1758666 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 869750 ns | 872625 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 275589 ns | 274250.5 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 416 ns | 375 ns | 1.11 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 417 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 416 ns | 292 ns | 1.42 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32155 ns | 31596 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6459 ns | 6375 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 5854.5 ns | 1.10 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6541 ns | 6500 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6625 ns | 6042 ns | 1.10 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 266163 ns | 263141.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1735708 ns | 1731916.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1724083 ns | 1768000 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1722334 ns | 1725583 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1725709 ns | 1724459 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169232 ns | 168363 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4400187 ns | 4401542 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4356000 ns | 4406313 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4359499.5 ns | 4361083 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4375958 ns | 4360083 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1188845 ns | 1173884.5 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6958 ns | 6583 ns | 1.06 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6792 ns | 6791 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 6958 ns | 7062.5 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6667 ns | 6791 ns | 0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20922 ns | 20597 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 49083 ns | 32792 ns | 1.50 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 51583 ns | 62083 ns | 0.83 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 33125 ns | 33292 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 78417 ns | 51084 ns | 1.54 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 211261 ns | 293465.5 ns | 0.72 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 18167 ns | 18000 ns | 1.01 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 19125 ns | 17458 ns | 1.10 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 18375 ns | 17916 ns | 1.03 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 17833 ns | 18042 ns | 0.99 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18753 ns | 18220 ns | 1.03 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 53458 ns | 53250 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 53291 ns | 53292 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 53417 ns | 53583 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 53416 ns | 53416.5 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 347606 ns | 340467.5 ns | 1.02 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75291 ns | 75333 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75292 ns | 75417 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75292 ns | 75292 ns | 1 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75292 ns | 74833 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47188 ns | 46370 ns | 1.02 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 330792 ns | 324292 ns | 1.02 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 328083 ns | 342291.5 ns | 0.96 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 333417 ns | 336708 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 325750 ns | 324667 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 209908 ns | 208689 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1487792 ns | 1483500 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1527458 ns | 1520542 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1526916 ns | 1528333 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1465291 ns | 1461958 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52424 ns | 51330 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5138083 ns | 5116916.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5295937 ns | 5306417 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5306541.5 ns | 4956417 ns | 1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4983083 ns | 4985125.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 205950.5 ns | 204511 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28250 ns | 28250 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28291 ns | 28250 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28333 ns | 28292 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28250 ns | 28167 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24487 ns | 24159 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66209 ns | 66584 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67000 ns | 66208 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66292 ns | 67583 ns | 0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66542 ns | 66208 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 531569.5 ns | 518001 ns | 1.03 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1379541.5 ns | 1500667 ns | 0.92 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1151395.5 ns | 935916 ns | 1.23 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1056958 ns | 1063395.5 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2243270.5 ns | 2253583 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 584256.5 ns | 585024 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 2989520.5 ns | 3089125 ns | 0.97 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2762500 ns | 2661333 ns | 1.04 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2742833.5 ns | 2581104 ns | 1.06 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3827083 ns | 3818625 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2047804.5 ns | 1992242 ns | 1.03 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8011083 ns | 7906625 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 7905458 ns | 8031000 ns | 0.98 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 7906916 ns | 7927541.5 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 4819500 ns | 4820333 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 132354 ns | 134041 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81834 ns | 81459 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 81708 ns | 82833 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80875 ns | 81833 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193254.5 ns | 194356 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2046104.5 ns | 2010167 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2039000 ns | 2043167 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2030666 ns | 2009750 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2020083 ns | 2026792 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 802997 ns | 794414 ns | 1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
avik-pal force-pushed the compathelper/new_version/2024-11-28-00-19-06-369-03156874151 branch from 4ce11e6 to 759267a on December 3, 2024 04:58
… existing compat)
avik-pal force-pushed the compathelper/new_version/2024-11-28-00-19-06-369-03156874151 branch from 759267a to 1a6ae85 on December 3, 2024 05:03
avik-pal deleted the compathelper/new_version/2024-11-28-00-19-06-369-03156874151 branch on December 3, 2024 08:49
This pull request changes the compat entry for the `LossFunctions` package from `0.11.1` to `0.11.1, 1`.
This keeps the compat entries for earlier versions.
Note: I have not tested your package with this new compat entry.
It is your responsibility to make sure that your package tests pass before you merge this pull request.
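For illustration, the compat change described above would correspond to roughly the following `Project.toml` fragment. This is a minimal sketch, not the actual diff: only the `LossFunctions` value is taken from the PR description, and it assumes the bound sits in the `[compat]` section, which is where Julia's package manager reads compat bounds for weak dependencies as well.

```toml
# Project.toml (fragment; all other entries omitted)
[compat]
# Before this PR: LossFunctions = "0.11.1"
# After this PR, the existing bound is kept and the 1.x series is added:
LossFunctions = "0.11.1, 1"
```

Under Julia's default caret semantics, `"0.11.1, 1"` is the union of the ranges `[0.11.1, 0.12.0)` and `[1.0.0, 2.0.0)`, so a resolver may pick any `0.11.x` release from `0.11.1` onward or any `1.x` release.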