docs: add CUDA.CURAND.default_rng() to docs (#1105)
avik-pal authored Nov 25, 2024
1 parent 06eb507 commit 6f9f8d6
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions docs/src/api/Building_Blocks/WeightInitializers.md
@@ -13,6 +13,7 @@ learning models.
| --------------------------------- | ----------------------- | ------------------------------------------------ |
| `Random.jl` | `Array` | |
| `StableRNGs.jl` | `Array` | |
| `CUDA.CURAND.default_rng()` | `CuArray` | |
| `CUDA.default_rng()` | `CuArray` | |
| `GPUArrays.default_rng(CuArray)` | `CuArray` | |
| `AMDGPU.rocrand_rng()` | `ROCArray` | |

1 comment on commit 6f9f8d6

@github-actions
Contributor

Lux Benchmarks
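
The Ratio column in the table below is consistent with Current divided by Previous (for example, 3937.5 ns / 4375 ns ≈ 0.90), so values above 1 mean the current commit measured slower. As a minimal sketch, assuming each row follows the layout `<name> <current> ns <previous> ns <ratio>`, the rows could be parsed and the larger slowdowns listed as follows. The helper names and the 1.2 threshold are hypothetical choices for illustration, not part of the benchmark tooling.

```python
import re

# Assumed row layout: "<name> <current> ns <previous> ns <ratio>".
# The name may itself contain spaces, so it is matched lazily and the
# three numeric fields are anchored to the end of the line.
ROW = re.compile(
    r"^(?P<name>.+?) (?P<current>[\d.]+) ns (?P<previous>[\d.]+) ns (?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Split one benchmark row into its name, both timings, and a recomputed ratio."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized benchmark row: {line!r}")
    current = float(match["current"])
    previous = float(match["previous"])
    return {
        "name": match["name"],
        "current_ns": current,
        "previous_ns": previous,
        # Recomputed from the timings rather than trusted from the text.
        "ratio": current / previous,
    }


def slower_than(lines, threshold=1.2):
    """Return parsed rows whose current/previous ratio exceeds the threshold.

    The default threshold of 1.2 is a hypothetical value for this sketch.
    """
    rows = [parse_row(line) for line in lines]
    return [row for row in rows if row["ratio"] > threshold]


# Three rows copied verbatim from the table.
SAMPLE_ROWS = [
    "layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3937.5 ns 4375 ns 0.90",
    "layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10459 ns 8583 ns 1.22",
    "layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 318417 ns 218625 ns 1.46",
]

flagged = slower_than(SAMPLE_ROWS)
for row in flagged:
    print(f"{row['ratio']:.2f}  {row['name']}")
```

Because this commit only changes a documentation file, ratios that deviate from 1 here presumably reflect run-to-run measurement noise rather than a real change in performance, so any threshold of this kind should be read with that in mind.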

Benchmark suite Current: 6f9f8d6 Previous: 06eb507 Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3937.5 ns 4375 ns 0.90
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4333 ns 4333 ns 1
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4917 ns 4875 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4042 ns 4291 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 61383 ns 62852 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10583 ns 10250 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10250 ns 10542 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10125 ns 10625 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10083 ns 10542 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 431239 ns 442826.5 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1042 ns 1000 ns 1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1334 ns 1208 ns 1.10
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1333 ns 1333 ns 1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1125 ns 1333 ns 0.84
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18191 ns 18476 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4333 ns 3895.5 ns 1.11
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4250 ns 4167 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4291 ns 4042 ns 1.06
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3834 ns 3959 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 110865.5 ns 113416 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57709 ns 57542 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46667 ns 46292 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46208 ns 46541 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80291 ns 83375 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37897 ns 37589.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2036958 ns 2024896 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2083333.5 ns 1835271 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1856125 ns 2098250 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1994375 ns 2020667 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 198201 ns 198299 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 157792 ns 143916 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145875 ns 144479.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145583.5 ns 145937.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 143729 ns 143417 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166222 ns 166429 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1114145.5 ns 1117416 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1128875 ns 995229 ns 1.13
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1024292 ns 1124542 ns 0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1115833.5 ns 1145250.5 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 534915.5 ns 537731.5 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3584 ns 3792 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4208 ns 3542 ns 1.19
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4000 ns 4333.5 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3583 ns 3667 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 67978 ns 68177 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9750 ns 9334 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10459 ns 8583 ns 1.22
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8625 ns 9417 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9125 ns 8854.5 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 495677 ns 498056 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15000 ns 16125 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18500 ns 15813 ns 1.17
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16000 ns 18791 ns 0.85
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14583 ns 15167 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55105 ns 55225 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213834 ns 215625 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215958 ns 213104.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213333 ns 214167 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214375 ns 213188 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 276152.5 ns 275510 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 500 ns 1.08
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 792 ns 750 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 792 ns 875 ns 0.91
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 541 ns 667 ns 0.81
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17241 ns 17577 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1541 ns 1500 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1708 ns 1667 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1791 ns 1542 ns 1.16
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1542 ns 1500 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 102070.5 ns 103563 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7334 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5958 ns 5666 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5916 ns 5958 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10042 ns 10333 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23944 ns 23689 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221312.5 ns 221979 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229500 ns 229334 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228667 ns 229417 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218021 ns 214125 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 170367 ns 169909 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3958 ns 3875 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3916 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3917 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23420 ns 23381 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 17000 ns 16667 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 17084 ns 16750 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16916 ns 17000 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16708 ns 16708 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 162884.5 ns 162845 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 573416.5 ns 571542 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 580333 ns 574583 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 568042 ns 575500 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 569542 ns 575333 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113416 ns 113453 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1418250 ns 1419645.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1429042 ns 1428270.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1420375 ns 1425833 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1437458 ns 1425291 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 212927.5 ns 212962.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1086895.5 ns 1086583 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 962854 ns 963625.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1344792 ns 1340708 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1286083 ns 1274625 ns 1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA 281106 ns 275533.5 ns 1.02
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5908292 ns 6003313 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4600625 ns 4543291 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4927041.5 ns 4950500 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5714562.5 ns 5760542 ns 0.99
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1101975 ns 1094293 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 541 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23476 ns 23428 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2209 ns 2084 ns 1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2209 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2166 ns 2083 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 169515.5 ns 170454.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4333 ns 4542 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4500 ns 4208 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4791.5 ns 4750 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3791 ns 4166 ns 0.91
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65149 ns 66283 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11584 ns 10895.5 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11333 ns 11375 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11667 ns 12334 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11333 ns 11084 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 446339 ns 458465.5 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6916 ns 6667 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6917 ns 6541.5 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7479.5 ns 8792 ns 0.85
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6042 ns 6000 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 51979.5 ns 52738.5 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 19125 ns 17667 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17458 ns 17958 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17792 ns 17833 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17333 ns 16875 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 300938.5 ns 308559 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 666 ns 583 ns 1.14
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 542 ns 666 ns 0.81
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 584 ns 542 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32053.5 ns 32624 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9250 ns 8750 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9042 ns 8833 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9375 ns 9625 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8917 ns 8667 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 158152 ns 159033.5 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64333 ns 64667 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64625 ns 64583 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64458 ns 64458 ns 1
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64375 ns 64709 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111585 ns 111312 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 287209 ns 283500 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 277834 ns 271791 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 280583 ns 274167 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 281125 ns 287250 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 183928 ns 185765 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3298562.5 ns 3282083.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3083000 ns 3018667 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3028771 ns 3018083 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4061625 ns 3955041 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 577723.5 ns 584692 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7606291 ns 7658895.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7495375.5 ns 7457750 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7404541 ns 7453875 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8192541.5 ns 8280228.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1371476 ns 1363348 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17505792 ns 17573167 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17567291 ns 17536583 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17475667 ns 17554104.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14122958.5 ns 14252687.5 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23660020.5 ns 23479854 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34147791.5 ns 33441750 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37059937.5 ns 37263874.5 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34985187.5 ns 35385958 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1854503 ns 1857021 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 187449375 ns 189054666 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 233703458.5 ns 232216291.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 195671083 ns 192889813 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 433586291 ns 446068417 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13830860.5 ns 13856600 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 288446709 ns 287697167 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 337867791 ns 333361166 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 296978708 ns 296288146 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 400413062.5 ns 358087917 ns 1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22084 ns 21542 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24979.5 ns 21917 ns 1.14
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23875 ns 23833 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21666 ns 23000 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 98077 ns 99298.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 102958 ns 103416 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104792 ns 102959 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 103812 ns 105041 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 110292 ns 103708 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 512479 ns 518798 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6125 ns 6000 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6500 ns 5750 ns 1.13
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7062.5 ns 6708 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6083 ns 6334 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 69253 ns 69881 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15166 ns 14833.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16145.5 ns 14917 ns 1.08
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16208 ns 16166 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15083 ns 15333.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 482969 ns 489724.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3041271 ns 2971270.5 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2067458.5 ns 2063958 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2297479.5 ns 2272542 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4457375 ns 4666750 ns 0.96
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 592674 ns 591884 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23527562.5 ns 23554624.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18050396 ns 18028687.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17902834 ns 17852125 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35496125 ns 36060875 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2768935.5 ns 2769683 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33385459 ns 33424000 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27540666 ns 27634645.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28658250 ns 28456458 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41547354.5 ns 41983541 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74875 ns 71959 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74396 ns 75187.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75500 ns 76708 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74333 ns 74750 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 102653 ns 105798 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 291291.5 ns 296145.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 318417 ns 218625 ns 1.46
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 208187.5 ns 213708 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 290437.5 ns 318812.5 ns 0.91
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 545207.5 ns 566311.5 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11958 ns 11792 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12145.5 ns 12250 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14209 ns 12750 ns 1.11
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11500 ns 12500 ns 0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 70994 ns 73134 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27042 ns 26042 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26917 ns 27042 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27625 ns 27875 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26958 ns 27417 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 469447.5 ns 488772.5 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12708 ns 11792 ns 1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12708 ns 12625 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14125 ns 14250 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11958 ns 12750 ns 0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 52810 ns 54676 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25958 ns 25458 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26209 ns 25334 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 25958 ns 27000 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26875 ns 26167 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 301312 ns 315843.5 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179875 ns 179292 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 181083 ns 182084 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 182500 ns 182291.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179666 ns 183250 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56497 ns 58956 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 582959 ns 581896 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 588917 ns 588312.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 585083 ns 583583 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 590500 ns 587083.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 286103 ns 294154 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6417 ns 5625 ns 1.14
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7125 ns 6334 ns 1.12
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8083 ns 6979.5 ns 1.16
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5750 ns 6520.5 ns 0.88
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70488.5 ns 72600.5 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14875 ns 13916 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14458 ns 14417 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15417 ns 15667 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14291 ns 14895.5 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 457568 ns 479621.5 ns 0.95
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1207438 ns 1206937.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1241417 ns 1245542 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1284208 ns 1292542 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 997354.5 ns 1006417 ns 0.99
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301394.5 ns 299757 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4107041.5 ns 4169416.5 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4414458 ns 4414833 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4959854.5 ns 4586979 ns 1.08
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3696125 ns 3897020.5 ns 0.95
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1040815 ns 1046888 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23635 ns 23582 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4959 ns 4833 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5041 ns 4916 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4958 ns 4917 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4958 ns 4875 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 185922 ns 189156.5 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6250 ns 5792 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6625 ns 5917 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6334 ns 7375 ns 0.86
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5666 ns 5917 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 55102 ns 56730.5 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11084 ns 10500 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11834 ns 10833 ns 1.09
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10791 ns 11542 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10875 ns 10667 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 330730 ns 346270 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 334 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 375 ns 0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22810 ns 22862 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3000 ns 2709 ns 1.11
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3000 ns 2750 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2959 ns 3042 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 2750 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 156803.5 ns 160360.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11750 ns 11000 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11958 ns 11833 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13292 ns 13334 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11292 ns 11583 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 57953 ns 61359.5 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25291.5 ns 24541 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24917 ns 24459 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25125 ns 25542 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24542 ns 24542 ns 1
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 293802.5 ns 310594.5 ns 0.95
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4167 ns 4208 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4209 ns 4167 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4208 ns 4209 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24619 ns 24670 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16375 ns 16250 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16292 ns 16083 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16333 ns 16291 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16250 ns 15959 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 195053 ns 204493 ns 0.95
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 5792 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5834 ns 5791 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5834 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5833 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33320 ns 33437 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 21209 ns 20500 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21125 ns 21167 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21625 ns 22250 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20667 ns 20895.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 173685 ns 177667.5 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 426708 ns 419666.5 ns 1.02
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 384958 ns 386209 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 482062.5 ns 478875 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 102708.5 ns 109479 ns 0.94
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66966 ns 67033 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 909146 ns 909583.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 972729 ns 973708.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1175729 ns 1177021 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 439917 ns 462625 ns 0.95
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 190337.5 ns 190401 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80625 ns 81208 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81500 ns 81791 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 81708 ns 82417 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80187.5 ns 81291.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193436 ns 193814.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1902458 ns 1935667 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1931125 ns 1916333.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1927562.5 ns 1698562.5 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1906917 ns 1930833.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 397725 ns 412302 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22050 ns 21830 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 169534 ns 176875.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6875 ns 6459 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7146 ns 6333 ns 1.13
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7667 ns 8041 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6500 ns 6875 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 62608.5 ns 68185.5 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9542 ns 9375 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9333 ns 9125 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9083 ns 9417 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9458 ns 9542 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 313766.5 ns 336822.5 ns 0.93
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 118190208 ns 120048500 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174175750 ns 173952417 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147818500 ns 148044250 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 107522750 ns 104491521 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5476530 ns 5473411 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 612187917 ns 614607916.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 556303083 ns 555101125 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 452274750 ns 456940500.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 757288396 ns 767695250.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38234410 ns 34955825 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 649126292 ns 651124125 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 667267229 ns 669116625 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 589618437.5 ns 578815791.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 741758417 ns 742698833 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57583 ns 60083 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47375 ns 48000 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47833 ns 46333 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82250 ns 84917 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37784 ns 38192 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1917978.5 ns 1922875 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1995291.5 ns 1974250 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1985646 ns 1990062.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1843354 ns 1906625 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 172983 ns 174155.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 266084 ns 267042 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 268750 ns 265667 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 268209 ns 269750 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267562 ns 267791.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 132212 ns 147410.5 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 650416.5 ns 592896 ns 1.10
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 674667 ns 684208.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 589437.5 ns 589520.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 688771 ns 698666 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 730804 ns 813499.5 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2181417 ns 2209791.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2196416.5 ns 2209896 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2101104 ns 2103125 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2231125 ns 2236250 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133510.5 ns 133936 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5502917 ns 5510542 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5510333 ns 5555437.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5498521 ns 5498229 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5441417 ns 5548208 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 776428 ns 892951 ns 0.87
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 640333 ns 637334 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 646083 ns 646875 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 646875 ns 646625 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 635334 ns 654875 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47144 ns 46730 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1818833 ns 1826417 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1727958 ns 1722937.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1724083 ns 1719334 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2099750 ns 2093625 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 220116 ns 221633 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58500 ns 59125 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47083 ns 46417 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46458 ns 47000 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81500 ns 84917 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28922 ns 28496 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2025729 ns 2036604.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2106791.5 ns 1836917 ns 1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2095000 ns 2095792 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1998375 ns 2019479 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 189080 ns 190943.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13351000 ns 13379729 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12437395.5 ns 12436958 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12498666.5 ns 12541125 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 14894375 ns 15244750 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 519065 ns 515571.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47200625 ns 47364291.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41881708 ns 41915083 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40754334 ns 40768458 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58105083 ns 59377021 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2883161 ns 2882313 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 96219125 ns 74658083.5 ns 1.29
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91954062.5 ns 90949750 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90758584 ns 90367208 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 98984500 ns 99601541 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58708 ns 59459 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47000 ns 47417 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47291 ns 47333 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82000 ns 84375 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47821 ns 48174 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1902250 ns 1935167 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1986000 ns 1966979 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1978125 ns 1974250 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1883042 ns 1911312.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 194258.5 ns 197391.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 417 ns 333 ns 1.25
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 333 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32804.5 ns 32694 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6792 ns 6042 ns 1.12
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6375 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6583 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6166 ns 6083 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 179592 ns 187739 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32076 ns 32507 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 2625 ns 1.11
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2792 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2875 ns 2917 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2625 ns 2667 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 166744 ns 175489 ns 0.95
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 284237292 ns 286852687.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339653916.5 ns 339813500 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 313913791.5 ns 313305624.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 272402875 ns 269143125 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7047786.5 ns 7114947 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 993594875 ns 994653583 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 945283292 ns 936368458 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 835507124.5 ns 837895375.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1160037292 ns 1177393750 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34045459 ns 34035408 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1668906166 ns 1316953312.5 ns 1.27
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1694695167 ns 1689250042 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1627000917 ns 1683427084 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1703328625 ns 1672545042 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1418646 ns 1454500 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1413958 ns 1407875 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1414875 ns 1409834 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1411416 ns 1414167 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128242 ns 128152 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5029312.5 ns 5044833 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5037875 ns 4714312.5 ns 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5028146 ns 5026437.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5024417 ns 5051042 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 552451.5 ns 686490.5 ns 0.80
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 170453833 ns 172590354 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 127944542 ns 124318041 ns 1.03
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 129428958 ns 123190833 ns 1.05
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 164372666.5 ns 165357625 ns 0.99
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4859943 ns 4891073 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 620949625 ns 615854000 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 515114583 ns 630625333 ns 0.82
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 463124083 ns 562103875 ns 0.82
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 648066667 ns 653647292 ns 0.99
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16797902 ns 16015121 ns 1.05
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8927250 ns 9006416.5 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8950584 ns 8896937.5 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7917333 ns 7913208 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9753125 ns 9977125 ns 0.98
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1591258 ns 1591890.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 35919479 ns 35918500 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37210542 ns 36875416 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33517916.5 ns 33279603.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 37573417 ns 39552041.5 ns 0.95
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6470424 ns 6456562 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47417 ns 47709 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47583 ns 47375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47708 ns 47500 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47417 ns 47542 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18601 ns 19056.5 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50542 ns 50500 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50458 ns 50250 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50542 ns 50750 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 52916.5 ns 50375 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 206886.5 ns 263004.5 ns 0.79
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7000 ns 6500 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7291 ns 6875 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7625 ns 8250 ns 0.92
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7125 ns 7000 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 89400.5 ns 145281 ns 0.62
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10625 ns 9750 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10333.5 ns 10250 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10417 ns 10500 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10479.5 ns 10042 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 543240.5 ns 744130.5 ns 0.73
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6000 ns 5875 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6166 ns 6125 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7083 ns 7375 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5666.5 ns 5750 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 121379.5 ns 151878 ns 0.80
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13333 ns 13042 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13000 ns 13250 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13333 ns 13250 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12687.5 ns 13542 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 510219 ns 607325 ns 0.84
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1125 ns 1042 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1084 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1084 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32431.5 ns 32597 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8375 ns 8000 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8542 ns 7833 ns 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8292 ns 8292 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8042 ns 8375 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 204899 ns 237194 ns 0.86
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23208 ns 23167 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23417 ns 23084 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23666 ns 23458 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23334 ns 23292 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18285 ns 18583 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52625 ns 52500 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52709 ns 54208.5 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 53083 ns 53083 ns 1
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52562.5 ns 52667 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 286000 ns 385105.5 ns 0.74
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1398458 ns 1402916 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1450667 ns 1454958 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1398999.5 ns 1401792 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1395750.5 ns 1456375 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196905 ns 197023 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5011896 ns 5029479.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5032187.5 ns 5009000 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5012250 ns 5018334 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5002687.5 ns 5048291.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 598226 ns 723983 ns 0.83
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3070875 ns 3027250 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2072042 ns 2080437.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2289104.5 ns 2291437.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4773854 ns 4926416 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 584355 ns 586737 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24311583 ns 24385583.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18870583.5 ns 18885375 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19070166.5 ns 18817833 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36514562.5 ns 37153959 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2861612.5 ns 2840867 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34008958 ns 34104021 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28397792 ns 28291895.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27946625 ns 28002792 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41793708.5 ns 42422958 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144075292 ns 142464667 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 147842750 ns 147805542 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126624187.5 ns 127021875 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 172290146 ns 173652229 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22560426 ns 22558116 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1298569062.5 ns 1200697875 ns 1.08
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 886633209 ns 1864615833.5 ns 0.48
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1199135125 ns 1647966021 ns 0.73
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 689233333 ns 686829458 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 117701235 ns 117772826 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 73000 ns 75375 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73292 ns 84209 ns 0.87
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 85645.5 ns 76458 ns 1.12
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72583 ns 80958 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 223969 ns 312734.5 ns 0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 276062.5 ns 204687.5 ns 1.35
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 287625 ns 278104 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 282625 ns 192375 ns 1.47
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 190583 ns 283312.5 ns 0.67
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1155754 ns 1488616 ns 0.78
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35424583 ns 35722250 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36355854 ns 36312354 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32516083.5 ns 32588937.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40329917 ns 40883292 ns 0.99
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5847057 ns 5836813 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 144746000 ns 149459958 ns 0.97
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 153804708.5 ns 153182708.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 140298187 ns 140187104 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 283107125 ns 226961625.5 ns 1.25
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34865240 ns 34882818.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121095354 ns 121271541.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174763417 ns 174726458 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148056208 ns 147669333 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 105211667 ns 105646958 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5466322 ns 5477234.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 468110062.5 ns 471261458.5 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 466487917 ns 465682583 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 437682625 ns 434340042 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 737562458 ns 758899104.5 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35152775 ns 32272056.5 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 706128833.5 ns 709031375 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 656179312 ns 654357417 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 571296688 ns 581732375 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 731578125 ns 734152875 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1324833 ns 1246834 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 963417 ns 970729 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 979125 ns 905979 ns 1.08
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2064125 ns 2088750 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 573443.5 ns 584722.5 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2963875 ns 3017521 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2641084 ns 2605541 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2621249.5 ns 2618042 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3522250 ns 3762104 ns 0.94
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1659147 ns 1908342 ns 0.87
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5792625 ns 5812937.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5824583.5 ns 5782937.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5815083.5 ns 5769333 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2879416 ns 2967958 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7292 ns 7500 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6333 ns 6250 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6250 ns 6125 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 10375 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25248 ns 25756 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212708 ns 212417 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220666 ns 221208.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221208 ns 220334 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206375 ns 206375 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 250623 ns 307623 ns 0.81
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 307616584 ns 310236833.5 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 221441583 ns 228243416 ns 0.97
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 198752396 ns 199023750 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 309471333 ns 307111500 ns 1.01
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7903869 ns 7677099 ns 1.03
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1075422250 ns 1077070792 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 906727646 ns 909540270.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 801892167 ns 811121083 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1153514499.5 ns 1177347271 ns 0.98
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26746953 ns 26401108 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5791.5 ns 5625 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5917 ns 5833.5 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6375 ns 6667 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4875 ns 5208 ns 0.94
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 155781 ns 199489.5 ns 0.78
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 7208 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7334 ns 7334 ns 1
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7562.5 ns 7416 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7083 ns 7291 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 649264 ns 722768 ns 0.90
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 542 ns 667 ns 0.81
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 666 ns 667 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23898 ns 24721 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9333.5 ns 9000 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9542 ns 9209 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9833 ns 9709 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8792 ns 9416 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 220286 ns 238784 ns 0.92
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351479.5 ns 353291.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 352042 ns 351750 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 353812.5 ns 352354.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 354687.5 ns 362833 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21024 ns 21565 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 811959 ns 814792 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 778625 ns 826041.5 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 774625 ns 777875 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 821708 ns 829958 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 304830.5 ns 302093.5 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 339000 ns 337583.5 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 343083 ns 340250 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 451041.5 ns 444208 ns 1.02
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 10583 ns 10812.5 ns 0.98
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18316 ns 18424 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 714000 ns 719166.5 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 742750.5 ns 721917 ns 1.03
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1003583 ns 1006458 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26375 ns 28250 ns 0.93
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 291054.5 ns 299767.5 ns 0.97
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 384958.5 ns 379708.5 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 348083 ns 349958 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 444917 ns 436354 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 30125 ns 30833 ns 0.98
batchedmm(16, Bsize=128)/forward/GPU/CUDA 23128 ns 23185.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 738542 ns 737500 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 791896 ns 772041.5 ns 1.03
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1018521 ns 1022146 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 105270.5 ns 101459 ns 1.04
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 225989 ns 233048 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3708 ns 3458 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3708 ns 3625 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3792 ns 3625 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3542 ns 3625 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17710.5 ns 18179 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4500 ns 4334 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4250 ns 4625 ns 0.92
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4500 ns 4625 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4417 ns 4458 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 279830 ns 297309.5 ns 0.94
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3666 ns 3833 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4125 ns 3708 ns 1.11
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4500 ns 4208.5 ns 1.07
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3708 ns 4104 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 199332 ns 236489 ns 0.84
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8875 ns 8458 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8500 ns 8166 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8458 ns 8792 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8708 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1228522.5 ns 1272633 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203645.5 ns 208042 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210208 ns 215895.5 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210792 ns 211084 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 201125 ns 199667 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34981 ns 35583 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 611520.5 ns 645812.5 ns 0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 624479.5 ns 623291 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 624000.5 ns 622916 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630624.5 ns 638333 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 361212 ns 366544.5 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 995583 ns 1020979 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1022646 ns 1006020.5 ns 1.02
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 952562 ns 957729 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 869209 ns 904000 ns 0.96
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207395.5 ns 208984 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4529458 ns 4550166.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4744750 ns 4713709 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4448625 ns 4462125 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 5089542 ns 5571625 ns 0.91
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 933469 ns 936095 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3666 ns 3708 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3375 ns 3708.5 ns 0.91
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4167 ns 4292 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3209 ns 3709 ns 0.87
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 242210.5 ns 245340.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7792 ns 7167 ns 1.09
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7167 ns 7375 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7417 ns 7708 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7208 ns 7209 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1046390.5 ns 1060150.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1637500.5 ns 1616083 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1186917 ns 1153750 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1336062.5 ns 1337250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2468375 ns 2432374.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213930 ns 217163 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12339333 ns 12337062.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9615979.5 ns 9522833 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9254104 ns 9266729 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 17996208 ns 18081312 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1954541 ns 1948614 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17361479 ns 17355771 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14427833 ns 14388208.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14271583.5 ns 14348354 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21144917 ns 21196875 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 88834 ns 88312.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 91500 ns 89271 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 90834 ns 91125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 87625 ns 91625 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125982 ns 126391 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2019500 ns 2036875 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2042708 ns 2015416.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2028042 ns 1865791 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2025999.5 ns 2043208 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1063927.5 ns 1072650 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 3541.5 ns 2813 ns 1.26
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 2333 ns 2791 ns 0.84
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3584 ns 3375 ns 1.06
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 1500 ns 1833 ns 0.82
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15780.5 ns 16578 ns 0.95
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 3000 ns 2542 ns 1.18
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2958 ns 2625 ns 1.13
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2709 ns 2875 ns 0.94
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2875 ns 3000 ns 0.96
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 195545.5 ns 199941.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7167 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 5709 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 5833 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10250 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33801 ns 34656 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221667 ns 212709 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228625 ns 220625 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221000 ns 220541 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206709 ns 220354 ns 0.94
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 347206.5 ns 356302 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3750 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22295 ns 22913 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14500 ns 14584 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14459 ns 14458 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14458 ns 14500 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14417 ns 14166.5 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 485845.5 ns 486419 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 92250 ns 92375 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 94209 ns 93333.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 95250 ns 94916.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 91750 ns 95417 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125421 ns 125841 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1929000 ns 1915000 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1929917 ns 1914209 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1922333 ns 1928125.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1923646 ns 1940729 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 960449 ns 1045000 ns 0.92
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 875291 ns 871959 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 817687.5 ns 821041.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1220791.5 ns 1216500 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 956708 ns 943271 ns 1.01
lenet(28, 28, 1, 32)/forward/GPU/CUDA 270219.5 ns 280426 ns 0.96
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2786125 ns 2729167 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2476771 ns 2498104 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3326500 ns 3340041 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3277354 ns 3427250 ns 0.96
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1614761 ns 1723859 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16229 ns 17229 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18000 ns 17875 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17084 ns 18792 ns 0.91
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14895.5 ns 15375 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 142802.5 ns 190406.5 ns 0.75
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222458 ns 228541 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216625 ns 220833.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216270.5 ns 216521 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225667 ns 228437.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 642849 ns 725866.5 ns 0.89
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 220666.5 ns 221145.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 223437.5 ns 222000 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 221291.5 ns 221291 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 220208 ns 221083.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 270694.5 ns 321034 ns 0.84
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 498104.5 ns 495084 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 505958 ns 496625 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 498020.5 ns 496729 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 500229 ns 507625 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1376499 ns 1510358 ns 0.91
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 4000 ns 3854 ns 1.04
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 3667 ns 3875 ns 0.95
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 5875 ns 5042 ns 1.17
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 3666 ns 4083 ns 0.90
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16958 ns 17250 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7333 ns 7167 ns 1.02
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7333 ns 6959 ns 1.05
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7125 ns 7250 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7542 ns 7416 ns 1.02
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 195319 ns 201503.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17333 ns 20083 ns 0.86
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20291.5 ns 16875 ns 1.20
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19354.5 ns 19500 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16708.5 ns 18167 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 146982.5 ns 232442 ns 0.63
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214625 ns 224916 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212500 ns 212708 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212792 ns 212416 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221417 ns 248812.5 ns 0.89
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1020818 ns 1078558.5 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4125 ns 4333 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4500 ns 4250 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5291 ns 5209 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3542 ns 4417 ns 0.80
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 241162.5 ns 255475 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11000 ns 10708 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10459 ns 10166 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10916 ns 10833 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10125 ns 11375 ns 0.89
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1056501 ns 1114054 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3333 ns 3500 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3875 ns 3583.5 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4250 ns 4458 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 2938 ns 3917 ns 0.75
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 237567.5 ns 247293 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7875 ns 7542 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7709 ns 7250 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7250 ns 8083 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7458.5 ns 7791 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1070019 ns 1128935 ns 0.95
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23347771 ns 23544917 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35406500 ns 34700375 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37669583 ns 37800604.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34858666 ns 35322562.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1830001 ns 1834217 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 183823166 ns 183535208 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159867750 ns 159261916 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146428479.5 ns 146891041.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 410553708 ns 419037250 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16506890.5 ns 16405198.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 424862333.5 ns 428624000 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 253527416.5 ns 254269584 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 295623854.5 ns 296570146 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 480544667 ns 493357917 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182875 ns 184479.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 185563 ns 184208 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184500 ns 185416 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 182250 ns 184062.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 218471 ns 233424 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 633375 ns 585417 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 596208 ns 589833 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 587250 ns 586896 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 590520.5 ns 639166 ns 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1067870 ns 1146333 ns 0.93
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3926937.5 ns 3917708 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3941459 ns 3921208 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3667000 ns 3581645.5 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4544333.5 ns 4674709 ns 0.97
batchedmm(128, Bsize=512)/forward/GPU/CUDA 531767 ns 538155 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17389166 ns 17548417 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17947521 ns 17792083 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16390812 ns 16472417 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 19902458.5 ns 21347458 ns 0.93
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2636393 ns 2621425 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32468 ns 33117 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9479.5 ns 9291 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9500 ns 9354.5 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9584 ns 9792 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8875 ns 9416 ns 0.94
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 263813 ns 269036 ns 0.98
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 498564458 ns 503680250 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 426956020.5 ns 425402999.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 423367333 ns 418147958 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 596263958 ns 678706395.5 ns 0.88
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12482792 ns 12481919 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1875323562.5 ns 1881496729.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1628477375 ns 1619255500 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1492393083.5 ns 1494277750 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2205444916.5 ns 2234203604 ns 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49302271 ns 49122118.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1639125 ns 1536334 ns 1.07
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1202125 ns 1156271 ns 1.04
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1357187.5 ns 1380541 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2457312 ns 2362625 ns 1.04
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213583.5 ns 217676 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12714125 ns 12766416 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9952750 ns 9918708 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9614459 ns 9674833 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18361979 ns 18454708 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2064490 ns 2051013 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17715625 ns 17738333 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14737021 ns 14710417 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14521854 ns 14604375 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21413792 ns 21451125 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26292 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26250 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26666 ns 26250 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26250 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24074 ns 23581.5 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67542 ns 67333 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67625 ns 68000 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 68125 ns 67333 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67042 ns 67042 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 400556.5 ns 404754.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203812.5 ns 205583 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210750 ns 209125 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209833 ns 209000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199375 ns 199041 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27041 ns 26431.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 627333 ns 611478.5 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 626584 ns 633292 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 622042 ns 670416 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 580541 ns 611479 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 355125.5 ns 353085.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 640833 ns 612333 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 653000 ns 643520.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 599854 ns 644958 ns 0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 599062.5 ns 652334 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132599.5 ns 132321 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2247625 ns 2263750 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2173250 ns 2226645.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2242375 ns 2243875 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2238250 ns 2302583 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1242951 ns 1253025 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17459 ns 19667 ns 0.89
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19917 ns 16917 ns 1.18
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18958 ns 21500.5 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17500 ns 18208 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 146290 ns 145311.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 227125 ns 233042 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229687.5 ns 218770.5 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219959 ns 262625 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218709 ns 230000 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1041939 ns 1059070.5 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 667 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 666 ns 667 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 24055 ns 23551 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10042 ns 9667 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9875 ns 9583 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10292 ns 10583 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9333 ns 9959 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 260885.5 ns 258697 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6208 ns 5833 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6042 ns 5542 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6416 ns 6500 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5062.5 ns 4958 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 223280.5 ns 231871 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 6875 ns 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7416 ns 7125 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7459 ns 7792 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6875 ns 6833 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 794061 ns 803650 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2083 ns 2166 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2333 ns 1917 ns 1.22
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2542 ns 2417 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2166 ns 2333 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 17949 ns 17852 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6584 ns 6417 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6584 ns 6458 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6792 ns 6916 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6520.5 ns 6459 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 335163 ns 332798.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 746958.5 ns 749396 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746875 ns 746625 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 749792 ns 749250 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 751729 ns 751417 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21434 ns 21271 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 819625 ns 793042 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 791708 ns 792500 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 773145.5 ns 775750 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 790854 ns 797542 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 298785 ns 296567 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7500 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 5542 ns 1.08
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 6000 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10166 ns 10458 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33922 ns 32600 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220854.5 ns 221375 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 236854.5 ns 240270.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228083.5 ns 257854 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212875 ns 222250 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 365652.5 ns 360398 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10209 ns 10104.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10417 ns 10334 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10708 ns 10916 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9541.5 ns 10229.5 ns 0.93
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 251155 ns 251149.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24333 ns 24666 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24584 ns 24312.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24583 ns 25750 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24979 ns 25562.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1135827 ns 1138926 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106024583.5 ns 106325125 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117903521 ns 117472625 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120396396 ns 120287229 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117544479 ns 117860729 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2631384.5 ns 2629206 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 385240458 ns 394161333.5 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 368294084 ns 365470000 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 356727875 ns 355300666 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 482802291 ns 484349417 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15255065.5 ns 15196205 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 936146916.5 ns 755446562.5 ns 1.24
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 762770042 ns 762235792 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 746849979.5 ns 742589166.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 945639875 ns 957309125 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7541 ns 7041 ns 1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7250 ns 6875 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7583.5 ns 8584 ns 0.88
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6479 ns 6625 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 243530.5 ns 243246.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14458 ns 14167 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13959 ns 14041.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14750 ns 14667 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13833 ns 14020.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1088103 ns 1088970 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6417 ns 5917 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6292 ns 6270.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7166.5 ns 7292 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5750 ns 5583 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 238108.5 ns 237671 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13084 ns 12458 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12916 ns 12375 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12583 ns 12916 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12166 ns 12292 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 798707 ns 799420 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5584 ns 5417 ns 1.03
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 5770.5 ns 5709 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 6500 ns 6166 ns 1.05
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 7166.5 ns 5667 ns 1.26
batchedmm(2, Bsize=128)/forward/GPU/CUDA 17513 ns 17212 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15667 ns 15458 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15625 ns 15459 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15541 ns 15708 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15583 ns 15625 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 202130 ns 202604 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 416 ns 333 ns 1.25
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 417 ns 416 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23880 ns 23718 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6584 ns 6167 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6334 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6583 ns 6875 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6187.5 ns 6291 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 241952.5 ns 241777 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5917 ns 5917 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5917 ns 6042 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5959 ns 5959 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5834 ns 5834 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 25052 ns 24949 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21667 ns 20958 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21750 ns 21208 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21875 ns 21625 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21167 ns 21250 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 267258 ns 267126.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144125 ns 143875 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144791 ns 143770.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 146500 ns 149333 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 143125 ns 144270.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168261.5 ns 169394.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1324313 ns 1364229 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1331958 ns 1311708 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1325708 ns 1324520.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1319666.5 ns 1349667 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1358754 ns 1363355 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24416.5 ns 23458 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24834 ns 22250 ns 1.12
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23375 ns 25167 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21292 ns 22584 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 357948 ns 357496 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 132395.5 ns 186729 ns 0.71
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 127354.5 ns 175562.5 ns 0.73
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 118459 ns 180666.5 ns 0.66
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 117395.5 ns 165042 ns 0.71
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1501059 ns 1496433 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23530 ns 23418 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6875 ns 6167 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6666 ns 6333.5 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7125 ns 7042 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6417 ns 6459 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 259579 ns 259427.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4625 ns 4708 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4708 ns 4625 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5042 ns 5166 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4084 ns 4896 ns 0.83
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 258332.5 ns 256096.5 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10375 ns 9917 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10209 ns 10042 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10209 ns 10667 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10166.5 ns 10750 ns 0.95
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1363356 ns 1366539 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1583 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1667 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23506 ns 23036 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5958 ns 5625 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6042 ns 5833 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6125 ns 6041 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5667 ns 5666 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 277914 ns 276314 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6791458.5 ns 6734334 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6360916.5 ns 6391625 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6541917 ns 6537375 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7577625 ns 7542292 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214916.5 ns 216147 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24027042 ns 24173292 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21266917 ns 21308875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21002500 ns 21052792 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29759417 ns 29893541 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2132435.5 ns 2120264 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48562041 ns 37482583 ns 1.30
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45901125 ns 45446437.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45588125 ns 45525834 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49263125 ns 49665500 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6000 ns 5916 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6167 ns 5729.5 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6625 ns 7166 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5334 ns 5750 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 236967.5 ns 236953 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9250 ns 8583 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8458 ns 8042 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8958 ns 8666 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 8500 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1058397 ns 1066445 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1553000 ns 1511791 ns 1.03
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1271333 ns 1266750 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1611667 ns 1624771 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2139521 ns 2083583.5 ns 1.03
lenet(28, 28, 1, 128)/forward/GPU/CUDA 272139 ns 272636.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7938708.5 ns 7911542 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6600938 ns 6587125 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7126750 ns 7180959 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10443521 ns 10527750 ns 0.99
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1846977 ns 1860081 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 374500.5 ns 364792 ns 1.03
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 372770.5 ns 367208 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 456750 ns 449270.5 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 23000 ns 23917 ns 0.96
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46393 ns 46266 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 736917 ns 743187 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 808083 ns 805084 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1057958 ns 1059125 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 78020.5 ns 89959 ns 0.87
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 308525 ns 310715.5 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397417 ns 397333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287917 ns 288042 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288000 ns 288209 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 753542 ns 750708 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43767 ns 43949 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 673583 ns 673458 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 536166 ns 531458 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 531917 ns 529250 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973208 ns 974917 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 188160 ns 189986 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 633500 ns 595125 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 647250 ns 645125 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 599709 ns 661291.5 ns 0.91
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 615666 ns 604083.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131655.5 ns 132185 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2457916 ns 2499541.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2396750 ns 2451209 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2458187.5 ns 2456625 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2452625 ns 2529417 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1345493 ns 1282545 ns 1.05
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 3083 ns 3333 ns 0.92
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 2833 ns 3708 ns 0.76
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4500 ns 4125 ns 1.09
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 2583 ns 2708 ns 0.95
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16191 ns 16211 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5750 ns 5292 ns 1.09
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5584 ns 5250 ns 1.06
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5459 ns 5667 ns 0.96
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5625 ns 5625 ns 1
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 198160.5 ns 197863.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1458292 ns 1466875 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1499833 ns 1505417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1499083 ns 1503125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1437417 ns 1440875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40922 ns 41133 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5128625 ns 5168291.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5308187.5 ns 5273458 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5301146 ns 5291104 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4993250 ns 5023291 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195601.5 ns 197140 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3708 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33852 ns 32935 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15500 ns 15042 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15334 ns 15209 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15334 ns 15334 ns 1
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15166 ns 15000 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 381247 ns 373770 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71292 ns 71500 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71416 ns 70750 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71167 ns 71125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 70916 ns 71083 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113823.5 ns 112823 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 317500 ns 320042 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 321000 ns 315667 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 319083 ns 318667 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 318500 ns 324000 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 197369 ns 194736 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1125 ns 1000 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 959 ns 1000 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 24373 ns 23415 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8333.5 ns 8208 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8229.5 ns 8167 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8250 ns 8375 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7833 ns 8042 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 265338.5 ns 263486.5 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 511459 ns 505999.5 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 488042 ns 497291.5 ns 0.98
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 567084 ns 560209 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 220750 ns 217875 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129208 ns 129532 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1389625 ns 1384937.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1480250 ns 1454020.5 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1756312.5 ns 1746937.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 865000 ns 899021 ns 0.96
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 277406 ns 276899 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 416 ns 292 ns 1.42
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32170 ns 31419.5 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6875 ns 6167 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6333 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6458 ns 6667 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6020.5 ns 6458.5 ns 0.93
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 266374 ns 263608 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1718417 ns 1728312 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1721417 ns 1729000 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1726125 ns 1733417 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1719500 ns 1738250 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169010.5 ns 170018 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4367625 ns 4369375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4399270.5 ns 3963375 ns 1.11
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4374042 ns 4358208 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4359438 ns 4400041 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1258694 ns 1280531 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6500 ns 6750 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6625 ns 6792 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7208.5 ns 6875 ns 1.05
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6542 ns 6875 ns 0.95
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20518 ns 20701.5 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 32542 ns 51792 ns 0.63
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 52479.5 ns 51208 ns 1.02
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 52000 ns 32833 ns 1.58
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 32625 ns 71875 ns 0.45
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 210236 ns 222859 ns 0.94
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17542 ns 17625 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 17917 ns 17625 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 18708 ns 18291 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17375 ns 17937.5 ns 0.97
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18845.5 ns 18343 ns 1.03
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53750 ns 53250 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53208 ns 53166 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53250 ns 53166 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53500 ns 53792 ns 0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 344404.5 ns 340623.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75292 ns 75709 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75500 ns 74125 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 74833 ns 75291 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 74959 ns 75334 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47057 ns 47398 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 323708 ns 325250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 338541 ns 325333 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 326000 ns 324750 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 325417 ns 340333 ns 0.96
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 211393 ns 211070 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1486000 ns 1491500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1527542 ns 1531125 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1526208 ns 1529875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1463000 ns 1465875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52398 ns 51611 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5120500 ns 5144459 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5242958 ns 5274708 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5297166.5 ns 5268229 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4985916.5 ns 5019729.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 204362 ns 205600 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28167 ns 28291 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28167 ns 28333 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 25076 ns 24406 ns 1.03
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66417 ns 66541 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66417 ns 66292 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66792 ns 66375 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66458 ns 66417 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 540264.5 ns 516526.5 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1467604.5 ns 1468729.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1148208 ns 1131000 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1073125 ns 1119791.5 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2179542 ns 2241937.5 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 575331.5 ns 581317 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3075042 ns 3109709 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2748167 ns 2104833 ns 1.31
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2727604 ns 2739417 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3816646 ns 3875250.5 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2066149 ns 2085553.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7917125 ns 7940229.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7956750 ns 7908458.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7912958 ns 7909729.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4824417 ns 4901667 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 81334 ns 81709 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82000 ns 81979.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 81895.5 ns 83833 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80250 ns 80541.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193566.5 ns 193422.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2017375 ns 2029687.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2065916.5 ns 2007750 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2015625 ns 2012750 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2021542 ns 2040271 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 803967 ns 811844 ns 0.99

This comment was automatically generated by workflow using github-action-benchmark.
