docs(reactant): simplify the enzyme call (#987)
* docs(reactant): simplify the enzyme call

* docs: fix enzyme call
avik-pal authored Oct 23, 2024
1 parent 5633fbe commit 817ce1a
Showing 1 changed file with 2 additions and 4 deletions.
6 changes: 2 additions & 4 deletions docs/src/manual/compiling_lux_models.md
@@ -102,10 +102,8 @@ Now we will compile the gradient function using `Reactant.@compile`.

```@example compile_lux_model
 function enzyme_gradient(model, ps, st, x, y)
-    dps = Enzyme.make_zero(ps)
-    Enzyme.autodiff(Enzyme.Reverse, Const(loss_function), Active, Const(model),
-        Duplicated(ps, dps), Const(st), Const(x), Const(y))
-    return dps
+    return Enzyme.gradient(Enzyme.Reverse, Const(loss_function), Const(model),
+        ps, Const(st), Const(x), Const(y))[2]
 end
 enzyme_gradient_compiled = @compile enzyme_gradient(model, ps_ra, st_ra, x_ra, y_ra)
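The hunk above replaces an explicit shadow allocation plus `Enzyme.autodiff` with a single `Enzyme.gradient` call. The following is a minimal sketch of why the two forms should agree, using a toy loss on plain arrays with no Lux model and no Reactant compilation. `toy_loss`, `w`, `x`, and `y` are hypothetical stand-ins that do not appear in the commit. The sketch assumes an Enzyme.jl release whose reverse-mode `gradient` accepts several arguments and returns one entry per argument, which is the behavior the commit relies on.

```julia
using Enzyme

# Hypothetical stand-in for `loss_function`: a scalar loss of the parameters `w`.
toy_loss(w, x, y) = sum(abs2, w .* x .- y)

w = [1.0, 2.0, 3.0]
x = [0.5, 1.0, 1.5]
y = [1.0, 1.0, 1.0]

# Old form: allocate a zeroed shadow and let `autodiff` accumulate into it.
dw_old = Enzyme.make_zero(w)
Enzyme.autodiff(Enzyme.Reverse, toy_loss, Active, Duplicated(w, dw_old), Const(x), Const(y))

# New form: `gradient` allocates the shadow itself and returns a tuple with one
# entry per argument. `w` is the first argument here, so we take index 1.
dw_new = Enzyme.gradient(Enzyme.Reverse, toy_loss, w, Const(x), Const(y))[1]

@assert dw_old ≈ dw_new
```

In the commit, `ps` is the second argument after `Const(model)`, which is why the result is indexed with `[2]` there.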

1 comment on commit 817ce1a

@github-actions

Lux Benchmarks
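The Ratio column in the table below appears to be the Current timing divided by the Previous timing, so values below 1 mean the current commit ran faster. A small sketch that checks this reading against the first two rows; the variable names are illustrative only.

```julia
# Timings (ns) copied from the first two rows of the table:
# Dense(512 => 512, identity)(512 x 128)/forward/CPU, 2 and 4 threads.
current_ns  = [70542.0, 73167.0]
previous_ns = [398708.5, 72333.5]

# Ratio = Current / Previous, rounded to two decimals as in the table.
ratios = round.(current_ns ./ previous_ns; digits = 2)
println(ratios)  # [0.18, 1.01]
```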

Benchmark suite Current: 817ce1a Previous: 5633fbe Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 70542 ns 398708.5 ns 0.18
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 73167 ns 72333.5 ns 1.01
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 74208 ns 74041.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 71541 ns 71167 ns 1.01
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43995 ns 44998 ns 0.98
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 270729 ns 1310521 ns 0.21
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 325667 ns 272334 ns 1.20
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 270604 ns 260208 ns 1.04
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 313917 ns 286645.5 ns 1.10
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 194622 ns 192473 ns 1.01
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 403875 ns 1286000 ns 0.31
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 406167 ns 408333 ns 0.99
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 403500 ns 425666 ns 0.95
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 320625 ns 336750 ns 0.95
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1536229.5 ns 1782417 ns 0.86
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1202875 ns 1203812.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1389000.5 ns 1389854 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2432625 ns 2353000.5 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212925 ns 213060 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12265687.5 ns 12139583.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9563916.5 ns 9558042 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9302437.5 ns 9325479.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18003541 ns 18029333 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1895448 ns 1903843 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17325000 ns 17304896 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14341125 ns 14365229 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14345958 ns 14311583.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21071084 ns 21182146 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 122021416.5 ns 250844916 ns 0.49
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174130042 ns 174043167 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 149172395.5 ns 147706708.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 107349875.5 ns 104215334 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5475745 ns 5509992 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 584274792 ns 1208153416 ns 0.48
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 533708583 ns 535649167 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 443275083.5 ns 438878250 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 630283625 ns 631915667 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38140342 ns 38034017.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 704421479 ns 1069304583 ns 0.66
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 675174833 ns 667503333 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 596452895.5 ns 616964458.5 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 745204646 ns 744392396 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 869333 ns 1118938 ns 0.78
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 817958 ns 826541.5 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1226229 ns 1218813 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 963333 ns 944417 ns 1.02
lenet(28, 28, 1, 32)/forward/GPU/CUDA 268571 ns 275412.5 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2688583 ns 3230584 ns 0.83
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2413000 ns 2410042 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3293125 ns 3297917 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3266709 ns 3279687.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1066059 ns 1070834 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6707687 ns 6940479 ns 0.97
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6421541 ns 6345229.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6561458.5 ns 6498146 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7617542 ns 7619292 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 210412 ns 210830 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24351959 ns 24348020.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21777750.5 ns 21802625 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21667208 ns 21625979 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29689167 ns 29737250 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1975433.5 ns 1970602 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48587208 ns 37161750 ns 1.31
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45703354.5 ns 45515166 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45666250 ns 45712979.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49344271 ns 49484646 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13366687 ns 13755125 ns 0.97
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12397708.5 ns 12438333 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12505375 ns 12501666.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15199833 ns 15168208 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 513574.5 ns 513180 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47354416 ns 47815084 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41793209 ns 41719750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41201583.5 ns 41210729.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58251333 ns 58345583.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3066964 ns 3217941 ns 0.95
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 97142208.5 ns 95291875 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91465292 ns 91202166 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 91248709 ns 91466833.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 98939750 ns 98989375 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 287147312 ns 416233916 ns 0.69
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339525208 ns 339464334 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 316913333 ns 313767938 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 268449250 ns 271058375 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7090723.5 ns 7061082.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 974481667 ns 1545657458 ns 0.63
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 870526958 ns 898188583 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 828389833.5 ns 826159521 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1109051750 ns 1107530292 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33710535 ns 33786267 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1767149875 ns 1790658917 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1687610417 ns 1722150625 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1595678667 ns 1650904250 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1666392333 ns 1671180334 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1551125 ns 2097417 ns 0.74
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1261353.5 ns 1264709 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1653208 ns 1649958 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2149875 ns 2137875 ns 1.01
lenet(28, 28, 1, 128)/forward/GPU/CUDA 263573.5 ns 267573.5 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7881604 ns 9675541.5 ns 0.81
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6520750 ns 6563291 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7254500 ns 7221208.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10440833.5 ns 10480041.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1084797 ns 1100631 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 191678542 ns 377462270.5 ns 0.51
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 141721416 ns 141558375 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 139923854 ns 127416125 ns 1.10
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 177014209 ns 176760875 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4838270 ns 4873783.5 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 624362333 ns 1122686458 ns 0.56
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 618016750 ns 507815583 ns 1.22
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 592371875 ns 593816333 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 805157583 ns 502835542 ns 1.60
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16284095 ns 16273151 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1085854.5 ns 1054520.5 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 975729 ns 957083 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1357208 ns 1358375 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1349083 ns 1298728.5 ns 1.04
lenet(28, 28, 1, 64)/forward/GPU/CUDA 265378.5 ns 273330.5 ns 0.97
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 4474791.5 ns 4965250 ns 0.90
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 3739187.5 ns 3766375 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4560417 ns 4597458 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5712666.5 ns 5551125 ns 1.03
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1122289 ns 1169825 ns 0.96
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23588062.5 ns 70642000 ns 0.33
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35175438 ns 33503041.5 ns 1.05
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37473250 ns 37118042 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35351792 ns 35169250 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1831536.5 ns 1857688 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184620749.5 ns 354366208 ns 0.52
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159156729 ns 158337812.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 184643833.5 ns 184451937.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 383154875 ns 383352917 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16517121 ns 16518743 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 293382125 ns 390000249.5 ns 0.75
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 245326959 ns 243640875 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 291862916.5 ns 294140479 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 434596291 ns 434507583 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 763937917 ns 1277952958 ns 0.60
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 482397583 ns 485919333.5 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 442223645.5 ns 433213020.5 ns 1.02
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 863476083 ns 864952167 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12467223 ns 12478846 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1869815583 ns 3528826062 ns 0.53
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1631627208 ns 1558207834 ns 1.05
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1585055312.5 ns 1473262062.5 ns 1.08
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2117325187.5 ns 2071178020.5 ns 1.02
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49578956 ns 49689763 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3059354 ns 3411625 ns 0.90
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2093791.5 ns 2088417 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2289083 ns 2285125 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4739583.5 ns 4873000 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 579124 ns 585949 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25412708 ns 25945083 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19875521 ns 19714270.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18774500 ns 18949479 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36713334 ns 36845500 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3000437 ns 3206157 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34769000 ns 54184583.5 ns 0.64
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 29459354 ns 29555770.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 29088291.5 ns 29848896 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42657458 ns 43608062.5 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1650417 ns 1785500 ns 0.92
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1199791.5 ns 1173062.5 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1400458 ns 1396875 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2479437.5 ns 2459000 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 217327 ns 217473 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12731874.5 ns 12550750 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9939375 ns 9958167 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9706875 ns 9729166 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18298958 ns 18404208 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1954694 ns 1959796 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17720541 ns 17667854 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14711375 ns 14631583 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14650125 ns 14658583 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21313917 ns 21427250 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23731791.5 ns 70575125 ns 0.34
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34171063 ns 33678083.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37737000 ns 37450417 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35038125 ns 35578375 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1844121 ns 1839026 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 304074583 ns 471605125 ns 0.64
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 228800917 ns 227950667 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 191127750 ns 191225709 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 390163042 ns 390127875 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13910975 ns 13948109 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 298073062.5 ns 413570792 ns 0.72
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 251492333 ns 249784875 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 297366500.5 ns 300285375 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 439290375 ns 439800375 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 2412875 ns 4206625 ns 0.57
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 2369417 ns 2269125 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 2319187 ns 2408833 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2416791.5 ns 2413208 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 591605 ns 588604 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 6527458 ns 11056542 ns 0.59
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 6513833 ns 6181729 ns 1.05
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 6566958 ns 6557458.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 6523625 ns 6527958 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1406564 ns 1411902 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17550208 ns 17068000 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17517771 ns 17515792 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17546062 ns 17552604 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14106333 ns 14105729 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 67500 ns 820083 ns 0.08
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 69166.5 ns 73166.5 ns 0.95
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 70917 ns 70625 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 69083 ns 68708 ns 1.01
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 48376 ns 48967 ns 0.99
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 326250 ns 1511625 ns 0.22
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 320145.5 ns 316459 ns 1.01
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 294645.5 ns 325979 ns 0.90
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 329020.5 ns 325312 ns 1.01
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 216051 ns 216972.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 443604.5 ns 1538396 ns 0.29
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 445166 ns 403375 ns 1.10
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 433041 ns 444000.5 ns 0.98
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 335625 ns 374167 ns 0.90
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3039374.5 ns 3392562.5 ns 0.90
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2078959 ns 2049750 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2285042 ns 2295500 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4855854 ns 4870250 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585428 ns 577280 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23586375 ns 24079791 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18045479 ns 17988417 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18421541.5 ns 18387542 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36084354 ns 36117729.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2895805 ns 3098822 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34493896 ns 53510938 ns 0.64
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27617667 ns 27579604 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 29463292 ns 29114500 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41630042 ns 41915417 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121571333 ns 250358250 ns 0.49
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174174208 ns 173849209 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 149053875 ns 147986979 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 103999666 ns 104290083 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5461247 ns 5468752 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 471938937.5 ns 1095527645.5 ns 0.43
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 534939750 ns 535207166 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 436707291.5 ns 432356541.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 722476709 ns 724055375 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35165698 ns 35153495 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 642315333 ns 1027502854.5 ns 0.63
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 658599833 ns 659696479 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 583130687.5 ns 602230000 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 735672250 ns 733800000 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 402041.5 ns 2044334 ns 0.20
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 443417 ns 367604.5 ns 1.21
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 335916.5 ns 319083.5 ns 1.05
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 315708 ns 401958 ns 0.79
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 574217 ns 582402.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2022125 ns 6397416 ns 0.32
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2006458 ns 2005000 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 1845834 ns 1821770.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 1993334 ns 2025333 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1315953 ns 1327577 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5776604 ns 9976292 ns 0.58
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5777125 ns 5768729 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5805750 ns 5775813 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2873000 ns 2876833 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 103792 ns 547584 ns 0.19
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 104167 ns 103209 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 105666 ns 105375 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 103625 ns 104020.5 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 28031 ns 27680 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 209334 ns 526083 ns 0.40
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 219479 ns 212584 ns 1.03
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 209959 ns 209792 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 209250 ns 209333 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 218695 ns 219101 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 706687.5 ns 1037667 ns 0.68
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 715687.5 ns 716416 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 707166.5 ns 707708 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 691750 ns 686166 ns 1.01
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 13667 ns 461458.5 ns 0.03
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 13375 ns 13500 ns 0.99
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 14750 ns 14083.5 ns 1.05
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 12709 ns 13562.5 ns 0.94
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 27661 ns 27872 ns 0.99
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 25791 ns 339792 ns 0.08
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 25667 ns 25875 ns 0.99
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 26083 ns 26334 ns 0.99
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 25917 ns 25958 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 207395 ns 208652 ns 0.99
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 45375 ns 352917 ns 0.13
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 45958 ns 51667 ns 0.89
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 46166 ns 46667 ns 0.99
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 30459 ns 28208 ns 1.08
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 320027084 ns 596075833.5 ns 0.54
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 290608500 ns 261796375 ns 1.11
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 291095958 ns 274968937.5 ns 1.06
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 319727583 ns 319947500 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7667238 ns 7667816.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1234222187.5 ns 2039634875 ns 0.61
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 994781770.5 ns 996403792 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 921801250 ns 881761792 ns 1.05
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1564104917 ns 1561629292 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 27046210 ns 27305261 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 413917 ns 773791.5 ns 0.53
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 414208 ns 417208 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 417333 ns 418834 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 415125 ns 420375 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47319.5 ns 48116 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1092167 ns 2076250 ns 0.53
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 1065333 ns 1095687.5 ns 0.97
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 1072584 ns 1073708 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 1079166 ns 1082062 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 227192 ns 229623.5 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 3114917 ns 4054312.5 ns 0.77
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 3108625 ns 2996124.5 ns 1.04
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 3108959 ns 3071666.5 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 3008708.5 ns 3030459 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 533375 ns 1439791 ns 0.37
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 495958 ns 435083.5 ns 1.14
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 453271 ns 527417 ns 0.86
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 445520.5 ns 514834 ns 0.87
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 585889.5 ns 583408 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 2134229.5 ns 6197020.5 ns 0.34
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2114834 ns 2129000 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2150208 ns 2140812 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 2125500 ns 2135000 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1371879.5 ns 1357853 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7941062.5 ns 11881396 ns 0.67
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7913562 ns 7914188 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7966875 ns 7944458 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4886708.5 ns 4861000 ns 1.01
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 6646 ns 4542 ns 1.46
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 7417 ns 7750 ns 0.96
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 7875 ns 7791.5 ns 1.01
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 7312 ns 6709 ns 1.09
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 25255 ns 25133 ns 1.00
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7625 ns 9250 ns 0.82
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7667 ns 7542 ns 1.02
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7625 ns 7667 ns 0.99
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7708 ns 7250 ns 1.06
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 191623.5 ns 192989.5 ns 0.99
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 9042 ns 9375 ns 0.96
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 9042 ns 9209 ns 0.98
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 9084 ns 9167 ns 0.99
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 5917 ns 5958 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 19541 ns 15625 ns 1.25
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 20250 ns 20625 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 21292 ns 21166 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 20250 ns 20000 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 25135 ns 25087 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 33917 ns 30875 ns 1.10
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 33167 ns 33937.5 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 33791 ns 33459 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 33583 ns 33916 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 202824.5 ns 203029.5 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 94646 ns 93000 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 94334 ns 95042 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 94958 ns 95250 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 92208 ns 92458 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 13000 ns 380084 ns 0.03
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 14083 ns 12875 ns 1.09
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 15709 ns 15125 ns 1.04
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 13708 ns 13166 ns 1.04
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 26307.5 ns 26429 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 23562.5 ns 290416.5 ns 0.08
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 24459 ns 23875 ns 1.02
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 23500 ns 23042 ns 1.02
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 24209 ns 24000 ns 1.01
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 172990 ns 171766 ns 1.01
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 56958 ns 310458 ns 0.18
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 58750 ns 57208.5 ns 1.03
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 57291.5 ns 57292 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 37708 ns 34625 ns 1.09
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 5958 ns 3292 ns 1.81
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 6917 ns 6875 ns 1.01
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 7708 ns 7875 ns 0.98
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 7125 ns 6875 ns 1.04
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 23646.5 ns 23455 ns 1.01
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5291.5 ns 6792 ns 0.78
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5125 ns 5416 ns 0.95
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5417 ns 5541 ns 0.98
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5208 ns 4958 ns 1.05
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 177367 ns 175743.5 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 9292 ns 8541 ns 1.09
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 9083 ns 9084 ns 1.00
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 9208 ns 9334 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 6125 ns 6083 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 107224959 ns 153171271 ns 0.70
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 116148083.5 ns 117466187.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120301520.5 ns 119681583 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117718792 ns 117629167 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2639626.5 ns 2629890 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 399020250 ns 560880125 ns 0.71
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 369329750 ns 370900291.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 396359583 ns 399068916 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 634997083 ns 632760125 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15142582 ns 15150542 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 806322750 ns 762982250 ns 1.06
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 758735875 ns 758524167 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 811941167 ns 810307458 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 910428709 ns 908363459 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
