Commit

docs: change update step to thousands in PINN 2D PDE (#1153)
For 50,000 training steps, an update every 1,000 steps is enough detail.
abhro authored Jan 1, 2025
1 parent 367680b commit 46a012d
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion examples/PINN2DPDE/main.jl
@@ -180,7 +180,7 @@ function train_model(xyt, target_data, xyt_bc, target_bc; seed::Int=0,

isnan(loss) && throw(ArgumentError("NaN Loss Detected"))

- if iter % 500 == 1 || iter == maxiters
+ if iter % 1000 == 1 || iter == maxiters
@printf "Iteration: [%5d / %5d] \t Loss: %.9f (%.9f) \t Physics Loss: %.9f \
(%.9f) \t Data Loss: %.9f (%.9f) \t BC \
Loss: %.9f (%.9f)\n" iter maxiters loss mean_loss stats.physics_loss mean_physics_loss stats.data_loss mean_data_loss stats.bc_loss mean_bc_loss
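To make the effect of the changed condition concrete, here is a minimal sketch that counts how many progress lines each variant would print. It assumes `maxiters = 50_000`, taken from the commit message rather than from `main.jl`, and the helper `should_log` is a hypothetical name introduced only for illustration.

```julia
# Hypothetical helper mirroring the condition in the diff; it is not part of main.jl.
should_log(iter, maxiters; every=1000) = iter % every == 1 || iter == maxiters

maxiters = 50_000  # assumed from the commit message ("50,000 training steps")

# Count the iterations on which a progress line would be printed.
old_updates = count(iter -> should_log(iter, maxiters; every=500), 1:maxiters)
new_updates = count(iter -> should_log(iter, maxiters; every=1000), 1:maxiters)

println("every 500:  ", old_updates, " progress lines")
println("every 1000: ", new_updates, " progress lines")
```

Under that assumption the old condition prints 101 lines and the new one 51. The final iteration is always reported because of the `iter == maxiters` clause.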

1 comment on commit 46a012d

@github-actions

Lux Benchmarks
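As a reading aid for the table that follows: the Ratio column appears to be the current timing divided by the previous timing, so values below 1 mean the current commit measured faster. A small sketch checking this against the first row (the function name `ratio` is an assumption introduced here, not something the benchmark tooling exposes):

```julia
using Printf

# Ratio as it appears in the table: current time divided by previous time.
ratio(current_ns, previous_ns) = current_ns / previous_ns

# Values copied from the first row: layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
r = ratio(3791, 4042)
@printf "ratio = %.2f\n" r
```

Rounded to two decimals this gives 0.94, matching the value reported for that row.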

Benchmark suite Current: 46a012d Previous: 367680b Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3791 ns 4042 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4500 ns 4125 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4875 ns 4833.5 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3666 ns 3958 ns 0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 59711.5 ns 60780 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10167 ns 10500 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10458 ns 10333 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10750 ns 10625 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10625 ns 10833 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 419469 ns 423470 ns 0.99
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1062.5 ns 1084 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1167 ns 1125 ns 1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1500 ns 1416 ns 1.06
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1125 ns 1208 ns 0.93
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18540 ns 18313 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4083 ns 4042 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4042 ns 4083 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4208 ns 4208 ns 1
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3958 ns 3625 ns 1.09
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 109802.5 ns 110716 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57542 ns 57375 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46416 ns 46292 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47125 ns 46500 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80875 ns 82709 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37744 ns 37768 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2035395.5 ns 2006604.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2078396 ns 2082209 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2078708 ns 2011667 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1998584 ns 2018937.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195463 ns 196514.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144250 ns 141709 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144166.5 ns 144000 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145125 ns 145187 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 153104.5 ns 144208 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165592.5 ns 165424.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1120291.5 ns 1001541.5 ns 1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1113167 ns 1118791.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 832708.5 ns 1097124.5 ns 0.76
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1117084 ns 1141417 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 520015.5 ns 532439 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3375 ns 3667 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3542 ns 3542 ns 1
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4166 ns 3917 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3125 ns 3541.5 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 66073.5 ns 71776.5 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9042 ns 9042 ns 1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8750 ns 9584 ns 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10208 ns 8500 ns 1.20
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8833 ns 9042 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 469701 ns 486557 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17041 ns 15125 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15834 ns 17792 ns 0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16604.5 ns 16916.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16791 ns 15250 ns 1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 54530 ns 56432 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213750 ns 214500 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 214875 ns 214625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215667 ns 215333.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226125 ns 216041 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 269469 ns 280343 ns 0.96
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 667 ns 0.81
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 708 ns 584 ns 1.21
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 709 ns 708 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 541 ns 667 ns 0.81
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17336 ns 17273.5 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1375 ns 1583 ns 0.87
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1375 ns 1667 ns 0.82
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1500 ns 1667 ns 0.90
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1458 ns 1541 ns 0.95
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 100554 ns 103457 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7000 ns 7000 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5750 ns 5937.5 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 5709 ns 1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9750 ns 9833 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23286 ns 24396 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222021 ns 222750 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228542 ns 229041 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229292 ns 230041 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213937.5 ns 213500 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 166141.5 ns 171992 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3917 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3916 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3959 ns 4000 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23204 ns 23948 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16917 ns 16750 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16792 ns 16583 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17250 ns 17041 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16750 ns 16916 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 164061.5 ns 165565.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 568792 ns 572458 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 578645.5 ns 576208 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 578083 ns 581250 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 575625 ns 575042 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113438.5 ns 113609 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1422625 ns 1419604 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1420000 ns 1420333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1422375 ns 1421834 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1426708 ns 1421062.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 213572 ns 216706.5 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1077687.5 ns 1089896 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 960917 ns 966312 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1353229.5 ns 1351792 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1315312 ns 1307959 ns 1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA 274529.5 ns 276909 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5961958 ns 5979271 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4633250 ns 4608000 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4975188 ns 4925667 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5557125 ns 5767000 ns 0.96
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1081948 ns 1097403.5 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 583 ns 541 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23910 ns 23800 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2208 ns 2125 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2250 ns 2084 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2167 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 176064.5 ns 174099 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4125 ns 4209 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4375 ns 4042 ns 1.08
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5167 ns 5020.5 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4250 ns 3667 ns 1.16
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65504 ns 66593 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11875 ns 10958 ns 1.08
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11000 ns 11167 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11917 ns 12083 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11500 ns 11167 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 448080.5 ns 455844 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7000 ns 6583 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6958 ns 6417 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8250 ns 7562.5 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6125 ns 6333 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 52534 ns 53149 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18708.5 ns 17375 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18625 ns 17250 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18375 ns 18250 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16708 ns 16458 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 296471 ns 301789.5 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 708 ns 542 ns 1.31
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 667 ns 625 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 584 ns 584 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33481 ns 33109.5 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8834 ns 8542 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8875 ns 8500 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9334 ns 9375 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8354.5 ns 8416.5 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 158505 ns 161412.5 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64459 ns 64666 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64750 ns 64583 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64916 ns 64459 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64625 ns 64208 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112347 ns 112066 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 279250 ns 275959 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 282167 ns 279333 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 284125 ns 280167 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 278708 ns 284791 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 187244.5 ns 190816.5 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3278417 ns 3359666.5 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3081000 ns 3020708 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3021792 ns 3019708 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4040979.5 ns 4044937.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 573775.5 ns 582824 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7620208 ns 7633375 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7449187.5 ns 7444749.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7493708.5 ns 7451687.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8208791 ns 8276916.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1340015.5 ns 1416070 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18366417 ns 17541687.5 ns 1.05
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17522312.5 ns 17532229.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17580834 ns 17547042 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14093354.5 ns 14143625 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23631333 ns 23437021 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33504604 ns 33669000 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37034667 ns 36847792 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34967583.5 ns 35241729 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1860248 ns 1852807 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 189693000 ns 188072458 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 165014875 ns 164284791 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152416688 ns 152400917 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 434850958 ns 434137916 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13871408 ns 13886569 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 289105312.5 ns 288796896 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 250867083 ns 251588375 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 296775875 ns 296639417 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 473537562.5 ns 474281875 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22083 ns 22000 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22459 ns 22625 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25375 ns 24250 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24083 ns 21812.5 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 95417 ns 98991 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103083 ns 104791 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 103250 ns 103292 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104542 ns 104708 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103041 ns 103625 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 502007.5 ns 514494 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5917 ns 5917 ns 1
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5958 ns 5834 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6708 ns 6459 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5791.5 ns 6167 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68401.5 ns 69465 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14792 ns 14417 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15000 ns 15250 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16542 ns 15459 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14875 ns 14666 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 475091.5 ns 483934.5 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3002625 ns 2986042 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2079375 ns 2014792 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2272333 ns 2274354.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4882708 ns 4589125 ns 1.06
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 586443 ns 584502 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23536000 ns 23505916.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18038562.5 ns 18035749.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16972167 ns 16922042 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 34545146 ns 34856104.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2768189 ns 2763874 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33221458 ns 33341541.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27561792 ns 27602208 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27327000 ns 27326333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42034750 ns 41263417 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 71417 ns 72791.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 71854.5 ns 73208 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75708 ns 83958 ns 0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74708 ns 83208 ns 0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 101188 ns 103702 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 205250.5 ns 286979.5 ns 0.72
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 206750 ns 206625.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 208958 ns 322750 ns 0.65
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217416 ns 322333 ns 0.67
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 541638 ns 559306 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11875 ns 11458.5 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11416 ns 11666.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12958 ns 12333 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11708 ns 11958 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 70557.5 ns 73645.5 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25667 ns 26208.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26541.5 ns 27000 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27729.5 ns 27416 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26667 ns 26645.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 468068.5 ns 483328.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12812.5 ns 11917 ns 1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12209 ns 14750 ns 0.83
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14208 ns 13708 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12291.5 ns 12708 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 52262 ns 54699.5 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25625 ns 25375 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25916.5 ns 25500 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26250 ns 26333 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26604 ns 27875 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 297345.5 ns 308185.5 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 178792 ns 182041.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 180750 ns 181583 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 181917 ns 183167 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179166 ns 182167 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56939 ns 58753 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 593333 ns 592604 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 582708 ns 583041 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 583667 ns 594209 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 584542 ns 586791 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 282717 ns 294181 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6167 ns 6083 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5958.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6875 ns 6833 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5708.5 ns 6250 ns 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 69908.5 ns 72095.5 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13791 ns 14375 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13917 ns 13083 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15667 ns 14791 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14458 ns 14292 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 454508 ns 473402.5 ns 0.96
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1225312.5 ns 1210604.5 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1241959 ns 1239854 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1289958.5 ns 1297479 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1011625 ns 1024875 ns 0.99
batchedmm(512, Bsize=4)/forward/GPU/CUDA 300319.5 ns 300941 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4103042 ns 4097875.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4403333 ns 4434062.5 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4523854.5 ns 4563541 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3709771 ns 3722313 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1034770 ns 1037751.5 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1875 ns 1791 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1792 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1916 ns 1834 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23619 ns 23494 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4958 ns 4834 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5000 ns 4834 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4958 ns 4917 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 186116 ns 188396 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5833 ns 5625 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5917 ns 5459 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6667 ns 6500 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5209 ns 5562.5 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 54405.5 ns 54865 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11125 ns 10583 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11500 ns 10500 ns 1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11458 ns 11125 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10500 ns 10666 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 320192 ns 324083 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 375 ns 333 ns 1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22488.5 ns 22774 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2792 ns 2708 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2833 ns 2750 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3083 ns 2959 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 2708 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 157059.5 ns 158123.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11459 ns 11375 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11625 ns 11083 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12875 ns 12125 ns 1.06
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10958 ns 11542 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 55353 ns 56425.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25020.5 ns 24583 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25292 ns 24667 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25125 ns 24833.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24875 ns 25250 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 284593.5 ns 289503 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4250 ns 4167 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4250 ns 4208 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4250 ns 4250 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24743 ns 24426.5 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16333 ns 16417 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16375 ns 16167 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16520.5 ns 16334 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16208 ns 16125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 192574 ns 194624 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 5709 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5833 ns 5708 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6042 ns 5834 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33721.5 ns 33182 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 21000 ns 20792 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21000 ns 20645.5 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21417 ns 20792 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20709 ns 20417 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 172002 ns 174846 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 422124.5 ns 423688 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 387791 ns 381917 ns 1.02
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 477333 ns 480521 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 103125 ns 104125 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66716 ns 66873.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 921333 ns 934375 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 974250 ns 984083 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1186458 ns 1186625 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 457479.5 ns 471042 ns 0.97
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 189036 ns 189890.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80542 ns 81458.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80709 ns 80125 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 84896 ns 81104.5 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79833 ns 136333 ns 0.59
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193358.5 ns 192847 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1919250 ns 1918292 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1876583 ns 1908625 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1946041 ns 1922750 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1921396 ns 1953687.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 391971 ns 394765 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21948.5 ns 21680 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1917 ns 1792 ns 1.07
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1917 ns 1833 ns 1.05
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 166123 ns 167307.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6417 ns 6625 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6666 ns 6333 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7771 ns 7375 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6145.5 ns 6667 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 56772 ns 59094.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9604.5 ns 8958 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9459 ns 8959 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9500 ns 9417 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9041 ns 9416 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 294981.5 ns 303401 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120459792 ns 120415166.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173682208 ns 173861833 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147804000 ns 147873916 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 105720875 ns 104464750 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5472285 ns 5466659 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 610206729.5 ns 607892187.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 555562500 ns 555380583 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 452099291.5 ns 449180562.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 626409896 ns 624687437 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34955764 ns 34960099 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 657253583 ns 655676042 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 665008062.5 ns 664719854.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 581676208.5 ns 586317000.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 857648458 ns 854444125 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57875 ns 57541 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47791 ns 47500 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47500 ns 46625 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83395.5 ns 85500 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37072 ns 37532 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1915500 ns 1919792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1932792 ns 1980000 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1995084 ns 1978083.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1890500 ns 1915584 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 171922.5 ns 173336.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 267854.5 ns 266563 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 267708 ns 285125 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 269750 ns 286313 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 268166 ns 267916 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 123763 ns 130327.5 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 594417 ns 588541 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 681291 ns 688375 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 604895.5 ns 691667 ns 0.87
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 689917 ns 713875 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 674236.5 ns 704236.5 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2176375 ns 2209792 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2222812.5 ns 2211250 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2205042 ns 2214666 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2093562.5 ns 2251125 ns 0.93
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133331 ns 133526 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5514416 ns 5473459 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5508500 ns 5495771 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5535958 ns 5506084 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5491750 ns 5555625 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 730299 ns 758118 ns 0.96
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 638167 ns 641209 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 647708 ns 638417 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 659416 ns 648750 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 643750 ns 647250 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46729.5 ns 46678 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1822167 ns 1823542 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1723042 ns 1728500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1727833 ns 1721125 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2106333 ns 2101541 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 219682 ns 220988 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58458 ns 58375 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46917 ns 47291 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47292 ns 46667 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84125 ns 84417 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28215 ns 28560 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2030041 ns 2021604 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2004250 ns 2078542 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2122125 ns 2089792 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1985979.5 ns 2018458 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 186715 ns 188289 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13357770.5 ns 13165083 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12440000 ns 12437062.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12492250 ns 12496625 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15108458 ns 15241708 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 510701.5 ns 511138.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47178791.5 ns 47044896 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41760334 ns 41734229 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40950875 ns 41006041 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58205437.5 ns 58474250 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2894239.5 ns 2887641 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 97014458.5 ns 74158583 ns 1.31
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91152834 ns 68293166 ns 1.33
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90701604.5 ns 90787478.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 98541521.5 ns 76120020.5 ns 1.29
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58959 ns 58708 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47375 ns 47417 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47750 ns 47333 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79958 ns 81500 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47779.5 ns 48467.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1918645.5 ns 1906541 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1971000 ns 1966979 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1997667 ns 1972250 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1889750 ns 1919083.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 192960 ns 194955.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 416 ns 292 ns 1.42
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 333 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 33172 ns 31682 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6292 ns 5979.5 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6542 ns 5959 ns 1.10
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6834 ns 6417 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6125 ns 6250 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 171303 ns 173280.5 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 250 ns 1.33
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 333 ns 250 ns 1.33
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32323 ns 31661 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2833 ns 2583 ns 1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2917 ns 2625 ns 1.11
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2917 ns 2834 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2708 ns 2584 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 162112.5 ns 162166.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 289426812.5 ns 285912791.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339624334 ns 341793875 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 315284104.5 ns 314064437.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 274668667 ns 269291750 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7120353.5 ns 7104649.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1014634416 ns 1013628833 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 953687125 ns 955735416 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 857733312.5 ns 855387437.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1265357333 ns 1263250834 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33985258 ns 33975753 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1675373667 ns 1379120562.5 ns 1.21
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1668941291 ns 1314342812 ns 1.27
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1606744000 ns 1634956500 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1787636084 ns 1372311479 ns 1.30
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1409499.5 ns 1410229 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1413833 ns 1415750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1419895.5 ns 1412896 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1458541.5 ns 1460375 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127493 ns 127578 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5016749.5 ns 5011584 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4651917 ns 5015500 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5058791 ns 5020521 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5012792 ns 5052375 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 551564 ns 577903.5 ns 0.95
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 171852250 ns 171180458 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 129831062.5 ns 128541250 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 115995771 ns 109850250 ns 1.06
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 168839667 ns 169107792 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4879222 ns 4873683 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 629070333 ns 624949333 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 493488792 ns 491287250 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 456364583 ns 454790833 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 675660292 ns 648542167 ns 1.04
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16223916 ns 16059874 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8950646 ns 8910395.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8924625 ns 8995792 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7865125 ns 7901000 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9701750 ns 9817770.5 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1588053 ns 1593491 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36024125 ns 35975583 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37000208.5 ns 37440812.5 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33425875 ns 33423291.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 37661542 ns 38560271 ns 0.98
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6463767 ns 6452757.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47562.5 ns 47625 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47416 ns 47583 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47666 ns 47625 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47375 ns 47375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 17907 ns 18605 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50542 ns 50250 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50375 ns 50417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50584 ns 50625 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50583 ns 50459 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 184398 ns 218596.5 ns 0.84
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6958.5 ns 6416 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6500 ns 6625 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8042 ns 7209 ns 1.12
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6542 ns 7000 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 89066 ns 120537.5 ns 0.74
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10042 ns 9667 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10437.5 ns 9583 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10500 ns 10625 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10375 ns 10209 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 510214.5 ns 676959 ns 0.75
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5666 ns 5584 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5958 ns 6167 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7417 ns 7146 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5458 ns 5562.5 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 109271 ns 144983 ns 0.75
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13125 ns 12875 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13250 ns 13084 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13375 ns 13875 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13208 ns 12959 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 457940.5 ns 555671 ns 0.82
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 959 ns 1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 959 ns 1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32174 ns 32054 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8000 ns 7500 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8292 ns 7875 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8500 ns 8167 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8125 ns 7958.5 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 199053.5 ns 215727.5 ns 0.92
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23354.5 ns 23166.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23250 ns 23292 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23542 ns 23458 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23125 ns 23334 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18347 ns 18589.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52667 ns 52625 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52584 ns 52500 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52750 ns 52958 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52417 ns 52333 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 291115 ns 299146 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1398084 ns 1401500 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1402791 ns 1396145.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1401792 ns 1398562.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1402875 ns 1435792 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195544.5 ns 195172 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5010813 ns 5009646 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5016584 ns 4800875 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5062708 ns 5005896 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5013500 ns 5025041.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 617335 ns 612010.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3040417 ns 3032250 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2105083 ns 2072292 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2280208 ns 2300667 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4865521 ns 4921042 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 579665 ns 580134 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24414604.5 ns 24343228.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18876208.5 ns 18906020.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17652979 ns 17758521.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35825688 ns 35734042 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2847809 ns 2830179 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34006188 ns 33956916.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28283750 ns 28347958 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27926083.5 ns 28079666 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41742416.5 ns 42065000 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144750166 ns 144437916 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 146949375 ns 147635291 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126208208.5 ns 125109916 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173205292 ns 173674875 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22782449 ns 22545545 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1847080125 ns 908256562.5 ns 2.03
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 809911709 ns 1584608041.5 ns 0.51
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 755677291 ns 749118208 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 667449084 ns 669868292 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118406338 ns 118395391 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76791 ns 81333 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76042 ns 75042 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76417 ns 77166 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72541 ns 73625 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 250232.5 ns 243285.5 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 277229 ns 287145.5 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 193583 ns 285833 ns 0.68
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 205417 ns 283104.5 ns 0.73
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 303083.5 ns 279041 ns 1.09
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1279646 ns 1239705 ns 1.03
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35472875 ns 35487666 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36379896 ns 36325875 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32315333.5 ns 32416604 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40618416.5 ns 40654875 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5840653.5 ns 5840513 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 146765250 ns 146753459 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 153200125 ns 153140083.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 137307792 ns 135055542 ns 1.02
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 285301125 ns 286267791 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34880703 ns 34875869 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120518062.5 ns 120929708.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174031666 ns 174008000 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148283312.5 ns 147856792 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106552271 ns 102357166.5 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5465282.5 ns 5458379 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 469918416 ns 472290792 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 466837917 ns 468203875 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 437920916.5 ns 437903521 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 739774042 ns 743156542 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32269604.5 ns 32279044 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 711087896 ns 709215666.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 640897313 ns 641585354.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 630411896 ns 623424125.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 849787625 ns 853935458 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1302125 ns 1289084 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 905958 ns 912625 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 938334 ns 959625 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1987437 ns 2066167 ns 0.96
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 573939.5 ns 576350.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2951687.5 ns 2954792 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2611020.5 ns 2624645.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2639896 ns 2616708 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3702396 ns 3750458 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1765767 ns 1708662 ns 1.03
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5801417 ns 5780625 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5727666.5 ns 5802646 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5818916 ns 5793708 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2913834 ns 2916792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7417 ns 7292 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6166 ns 6125 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6209 ns 6167 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 9917 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25586 ns 24959.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212792 ns 212666.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220834 ns 219979.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221166 ns 220458 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215459 ns 244353.5 ns 0.88
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 272866 ns 249958 ns 1.09
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 300445333 ns 296320791 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 214002042 ns 216911667 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 196386541 ns 196230687 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 307720792 ns 303909375 ns 1.01
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7675041.5 ns 7672082.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1232629833 ns 1231911312.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 899311645.5 ns 900530270.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 825300584 ns 828047958 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1150330250 ns 1151206292 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26367421.5 ns 26738113 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5458 ns 4833 ns 1.13
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5416 ns 5500 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6750.5 ns 6167 ns 1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5084 ns 5000 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 184497.5 ns 149363.5 ns 1.24
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7667 ns 7041 ns 1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 7333 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7541 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 6917 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 655045 ns 600699 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24222 ns 23466 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9542 ns 8667 ns 1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9833 ns 8417 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9667 ns 9667 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9041 ns 9125 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 221511.5 ns 211340 ns 1.05
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 352562.5 ns 368458 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351833 ns 351459 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 353416.5 ns 352500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 366166 ns 352146 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21264 ns 21302 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 826208 ns 826271 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 775333.5 ns 824958.5 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 808520.5 ns 792000 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 828833 ns 830250.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 278649 ns 269586 ns 1.03
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 340917 ns 340937.5 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 342729.5 ns 343062.5 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 453708 ns 454770.5 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 10687.5 ns 14084 ns 0.76
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18338 ns 17990 ns 1.02
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 709875 ns 710583 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 728042 ns 728458 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1005792 ns 1004208 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26667 ns 27417 ns 0.97
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 257132 ns 239886 ns 1.07
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 380187.5 ns 383166.5 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 355542 ns 350542 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 442146 ns 443208 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 30959 ns 31250 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22801.5 ns 22514 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 726667 ns 718250 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 778791.5 ns 782083 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1034042 ns 1028417 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 105042 ns 105334 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 214595.5 ns 217107 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3583 ns 3333 ns 1.08
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3542 ns 3708 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3708 ns 3625 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3542 ns 3417 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17801 ns 17516 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4583 ns 4104.5 ns 1.12
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4333 ns 4208 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4375 ns 4291 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4167 ns 4166 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 276455 ns 232485 ns 1.19
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3833 ns 3333 ns 1.15
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3542 ns 3667 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4292 ns 4084 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3500 ns 4250 ns 0.82
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 219668 ns 176024.5 ns 1.25
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8334 ns 8291 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8334 ns 8250 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8708 ns 8250 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8542 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1228564 ns 1051146 ns 1.17
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203709 ns 204709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209833 ns 210709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 213750 ns 210583 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200750 ns 199833.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34897 ns 34425 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 611979.5 ns 647229 ns 0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 623084 ns 649666.5 ns 0.96
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 633542 ns 626208 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630833 ns 640479.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 337730.5 ns 293508 ns 1.15
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 991250 ns 993750 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1017458.5 ns 1020395.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 954833 ns 958396 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 864916.5 ns 887291 ns 0.97
batchedmm(128, Bsize=128)/forward/GPU/CUDA 208131 ns 206487.5 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4517208 ns 4504792 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4768041 ns 4702583.5 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4459667 ns 4449000 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 4281312 ns 4321500 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 937605 ns 979904 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3625 ns 3167 ns 1.14
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3291 ns 3541 ns 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4250 ns 4166 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3166 ns 3333.5 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 221703 ns 174711 ns 1.27
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7042 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7458 ns 7042 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7687.5 ns 7375 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7084 ns 7083 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1025587 ns 911927 ns 1.12
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1644333 ns 1650250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1183209 ns 1195333 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1370292 ns 1375625 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2475167 ns 2471000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213710.5 ns 213276 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12346958.5 ns 12340062 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9593646 ns 9568500 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9292209 ns 9298896 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 17963583.5 ns 18088041 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1947963.5 ns 1943838 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17361375 ns 17384833.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14393542 ns 14357854 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14339750 ns 14387313 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21095083 ns 21175104 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 88167 ns 100083 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 88875 ns 87750 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 91875 ns 93416.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 134020.5 ns 89625 ns 1.50
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126192 ns 125990 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027813 ns 2026687.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2027000.5 ns 2031083.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2054000 ns 2031250 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2028125 ns 2050458.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1026969 ns 951363 ns 1.08
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 2792 ns 2979 ns 0.94
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 2583 ns 2875 ns 0.90
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3458 ns 3520.5 ns 0.98
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 1917 ns 2521 ns 0.76
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16376 ns 16207 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2709 ns 2666.5 ns 1.02
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2792 ns 2500 ns 1.12
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2792 ns 2875 ns 0.97
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2833.5 ns 2959 ns 0.96
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 186134.5 ns 179422.5 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7250 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6041 ns 5958 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6167 ns 6000 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10083 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34252.5 ns 33838 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 242958 ns 225292 ns 1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220917 ns 219750 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220417 ns 220542 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 240375 ns 244708 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 328052.5 ns 293649.5 ns 1.12
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3791 ns 3709 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22539 ns 22219 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14584 ns 14417 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14542 ns 14375 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14584 ns 14625 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14417 ns 14583 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 484358 ns 436265 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 92125 ns 140000 ns 0.66
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 92458 ns 92458 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 98562.5 ns 96792 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 118229 ns 96792 ns 1.22
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125261.5 ns 125211.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1913333 ns 1921583.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1909771 ns 1923937.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1956333 ns 1928188 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1924333 ns 1942771 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 935173 ns 855373 ns 1.09
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 879000 ns 874041 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 818395.5 ns 820458 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1219520.5 ns 1223417 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 966459 ns 972500 ns 0.99
lenet(28, 28, 1, 32)/forward/GPU/CUDA 267198 ns 272168 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2822917 ns 2804167 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2496917 ns 2520875 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3359000 ns 3337667 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3411333 ns 3424895.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1570113.5 ns 1501496.5 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17000 ns 16791.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15458.5 ns 14854.5 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19041 ns 18375 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16875 ns 15229 ns 1.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 133146.5 ns 131230 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 258834 ns 227959 ns 1.14
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215125 ns 250729 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215792 ns 216125 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 227875 ns 262791 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 602653.5 ns 582129.5 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 219062.5 ns 222062.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 221375 ns 219125 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222875 ns 222041.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 220791 ns 221584 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 247312 ns 244344.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 497625 ns 508270.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 535916 ns 521083 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 499208 ns 498833 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 511125 ns 565541.5 ns 0.90
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1333241 ns 1195773 ns 1.11
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 3833.5 ns 4479.5 ns 0.86
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4250 ns 3583.5 ns 1.19
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 5166.5 ns 4750 ns 1.09
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 3792 ns 4625 ns 0.82
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16912 ns 16818 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7542 ns 7208 ns 1.05
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7167 ns 7250 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7542 ns 7333 ns 1.03
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7667 ns 7458.5 ns 1.03
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 186762.5 ns 180977.5 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18667 ns 18583 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16708 ns 17583.5 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20584 ns 19958.5 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18084 ns 17333 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 136037 ns 132074.5 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 224209 ns 212166 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212687 ns 212146 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213167 ns 212917 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222979.5 ns 218959 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 896805 ns 814362 ns 1.10
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4250 ns 4042 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4333.5 ns 4208 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5125 ns 5000 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3875 ns 4000 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 222577.5 ns 175168.5 ns 1.27
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10542 ns 10250 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10791 ns 9687.5 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10959 ns 11083 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10333 ns 10125 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1034707.5 ns 961404 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3375 ns 3041.5 ns 1.11
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3333 ns 3291 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4042 ns 4375 ns 0.92
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 2958 ns 3416.5 ns 0.87
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 225445.5 ns 193655 ns 1.16
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7208.5 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7750 ns 7209 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7542 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7208 ns 7458 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1042046 ns 972220 ns 1.07
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23498333.5 ns 23356708 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34789375 ns 34480833.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37689958 ns 37583875 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34909542 ns 35001895.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1849921 ns 1828165 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184647292 ns 184126958 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 163834583 ns 166867125 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146363541.5 ns 146311896 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 274565083 ns 275288375 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16510014 ns 16524063 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 278243563 ns 276685520.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 245760791.5 ns 252606729 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 231789354 ns 231173396 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 324000854.5 ns 324261749.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182625 ns 184542 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 184458 ns 182833 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 186250 ns 185583 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 181875 ns 184895.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 206355.5 ns 166499.5 ns 1.24
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 628291.5 ns 634000 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 608229.5 ns 585209 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 598250 ns 592708.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 637791 ns 630958 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 999947 ns 926373.5 ns 1.08
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3874375 ns 3858042 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3917042 ns 3914708 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3534687.5 ns 3549917 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4554291 ns 4595104.5 ns 0.99
batchedmm(128, Bsize=512)/forward/GPU/CUDA 531266.5 ns 532803 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17461354.5 ns 17337937.5 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17833459 ns 17877583 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16559937.5 ns 16422125 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 19938750 ns 20130416.5 ns 0.99
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2619194 ns 2619405 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 666 ns 625 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 33463 ns 32935 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9292 ns 8958 ns 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9458 ns 8875 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9375 ns 9458 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9187.5 ns 9209 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 252733 ns 248903 ns 1.02
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 651812167 ns 649671041.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 390086667 ns 390100166.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 327502625 ns 355146542 ns 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 747314333 ns 750210500 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12474949 ns 12471745.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1879705041.5 ns 1883695042 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1650371917 ns 1646365041 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1514378771 ns 1513696187.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2204966313 ns 2208789146 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49428315 ns 49495223 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1651458 ns 1642208 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1196083 ns 1192812.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1387103.5 ns 1386104 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2353958 ns 2519667 ns 0.93
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 217144 ns 215937.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12704667 ns 12672750 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9935187.5 ns 9911875 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9671333.5 ns 9658417 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18432334 ns 18448708.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2021545.5 ns 1992558.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17670625 ns 17681874.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14743791.5 ns 14694333 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14593292 ns 14589750 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21437146 ns 21582250 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26250 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26292 ns 26667 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26333 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26292 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24013 ns 23957 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67166 ns 66959 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67208 ns 67750 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67917 ns 67250 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66958 ns 67459 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 380547.5 ns 371563.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 202875 ns 203875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210375 ns 209500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209916 ns 209125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 198750 ns 200459 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25898 ns 26219 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 645354 ns 647500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 637500.5 ns 669416.5 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 634542 ns 685542 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 634250 ns 632166.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 326606.5 ns 324278 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 672209 ns 675000 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 637917 ns 541042 ns 1.18
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 665042 ns 637375 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 664917 ns 666542 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131949 ns 132249.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2224563 ns 2232250 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2248771 ns 2239333.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2241125 ns 2241084 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2237000 ns 2299271.5 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1095016 ns 1091764 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17417 ns 17833 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17333 ns 17917 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19500 ns 20584 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16875 ns 18709 ns 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 133320 ns 133803 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 260770.5 ns 260333 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219458.5 ns 255395.5 ns 0.86
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229000 ns 253687.5 ns 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 263334 ns 230479 ns 1.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 947049 ns 901721 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 541 ns 1.16
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 666 ns 667 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 666 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 584 ns 542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23873 ns 23720 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10000 ns 8333.5 ns 1.20
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9750 ns 9666 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10125 ns 10208 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9750 ns 9750 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 245331.5 ns 244421 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5375 ns 5125 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5625 ns 5750 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6604.5 ns 6584 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5000 ns 5125 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 209896.5 ns 195651 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7875 ns 7083 ns 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7292 ns 7375 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7687.5 ns 7750 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7334 ns 7875 ns 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 739872 ns 711373.5 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2041 ns 2041 ns 1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2250 ns 2250 ns 1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2458 ns 2250 ns 1.09
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2084 ns 2208 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18207 ns 18128 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6542 ns 6542 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6458 ns 6542 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6708 ns 6625 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6541 ns 6417 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 306864 ns 296966 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 747125 ns 751937.5 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 749958.5 ns 746542 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 747167 ns 750125 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 771333.5 ns 751833.5 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21305 ns 21365 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 791000 ns 811458 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 780041.5 ns 810958 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 775416 ns 790958 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 794812.5 ns 813167 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 271390 ns 271261 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 6959 ns 7334 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 5958 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 5917 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10167 ns 10250 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33759 ns 33874 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 259750 ns 258396 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 238854 ns 269104 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 231104 ns 253416 ns 0.91
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 250208 ns 245208 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 336384 ns 333723 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10125 ns 10250 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10312.5 ns 10334 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10875 ns 10625 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10167 ns 10250 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 223921.5 ns 213790.5 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24167 ns 24583 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24583 ns 24500 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25333 ns 24792 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24584 ns 24916 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1062400 ns 1032950.5 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106104729.5 ns 107140583 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117502187.5 ns 117792062 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120758625 ns 120863042 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117423500 ns 117603375 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2624434 ns 2946778 ns 0.89
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 392280708 ns 393794791.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 358697709 ns 359678396 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 357440917 ns 357838334 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 540821208.5 ns 545418083.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15254730 ns 15489580 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 781416292 ns 607837250 ns 1.29
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 760831458 ns 579716416 ns 1.31
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 750885583.5 ns 747642396 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 784554021 ns 607166334 ns 1.29
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7583 ns 7292 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6875 ns 6958 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8208 ns 7625 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7917 ns 6834 ns 1.16
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 214784 ns 206235.5 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14542 ns 13709 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13667 ns 14167 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14125 ns 14500 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14375 ns 14292 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1015761 ns 968613 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5625 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6125 ns 6250 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7500 ns 6875 ns 1.09
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5500 ns 5750 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 211436.5 ns 204166 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12875 ns 12625 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12417 ns 12583 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12687.5 ns 13000 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13042 ns 12292 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 728295 ns 694587 ns 1.05
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5250 ns 5917 ns 0.89
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 5709 ns 5458 ns 1.05
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 6542 ns 5875 ns 1.11
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5375 ns 5958 ns 0.90
batchedmm(2, Bsize=128)/forward/GPU/CUDA 17219 ns 16951 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15750 ns 15583 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15375 ns 15375 ns 1
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15584 ns 15625 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15916 ns 15708 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 188803.5 ns 185517 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 417 ns 333 ns 1.25
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 334 ns 333 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23653 ns 22862.5 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6583 ns 6209 ns 1.06
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6208 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6542 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6375 ns 6375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 227179 ns 223995 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5958 ns 5834 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6041 ns 5917 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5959 ns 6000 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5833 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24470 ns 23989 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21520.5 ns 20833 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21209 ns 20583 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21667 ns 21625 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21334 ns 21375 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 249183.5 ns 246983.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144062.5 ns 169125 ns 0.85
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 143042 ns 144292 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 146334 ns 148291.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 188146 ns 189062.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167467 ns 166865 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1317583 ns 1326271 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1321709 ns 1323042 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1365791.5 ns 1320500 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1318666 ns 1341500 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1237894 ns 1189366 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24708 ns 23000 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24375 ns 23479 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24375 ns 24875 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22374.5 ns 24750 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 318636 ns 254630.5 ns 1.25
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 134750 ns 130167 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 181250 ns 128375 ns 1.41
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 130000 ns 123229 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 130958 ns 131062.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1345187.5 ns 1279498 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23482 ns 23209 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6042 ns 1.10
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6500 ns 6416 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6792 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6792 ns 6458 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 243071 ns 238830 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4625 ns 4333 ns 1.07
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4541.5 ns 4542 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5333 ns 4708 ns 1.13
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4583 ns 4791 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 231105.5 ns 217579.5 ns 1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9875 ns 9666 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9916.5 ns 10042 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10417 ns 10125 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10375 ns 10208 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1276883 ns 1231902.5 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1584 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1667 ns 1584 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23221 ns 22989 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5750 ns 5667 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5750 ns 5625 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6083 ns 6041 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5709 ns 5750 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 262260 ns 258706.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6814041 ns 6877625 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6367459 ns 6431167 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6578812.5 ns 6497166 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7695958 ns 7600437.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214554 ns 213793 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24052709 ns 24074875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21310875 ns 21241875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21123834 ns 21023583.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29855166.5 ns 29822125.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2121783 ns 2088714.5 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48838979.5 ns 37413209 ns 1.31
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45549667 ns 34256250 ns 1.33
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45706771 ns 45704562.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49408500 ns 38148271 ns 1.30
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5875 ns 5416 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5709 ns 6104.5 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6708 ns 6667 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5541 ns 6167 ns 0.90
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 212106.5 ns 206549 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8875 ns 7917 ns 1.12
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8167 ns 8229.5 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 8584 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8208 ns 8542 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1001631 ns 962776 ns 1.04
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1556417 ns 1560583 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1270792 ns 1259145.5 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1624187.5 ns 1626291.5 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2180520.5 ns 2161625 ns 1.01
lenet(28, 28, 1, 128)/forward/GPU/CUDA 274298 ns 280818.5 ns 0.98
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7888792 ns 7902229 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6591250 ns 6567125 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7197854 ns 7147750 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10478229.5 ns 10485771 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1773709 ns 1771472.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 366500 ns 373687.5 ns 0.98
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 371020.5 ns 370583 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 457708 ns 462021 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 33208.5 ns 23584 ns 1.41
batchedmm(128, Bsize=4)/forward/GPU/CUDA 47286 ns 45539 ns 1.04
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 723916.5 ns 728750 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 801750 ns 804208.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1064875 ns 1065312.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 115334 ns 96666.5 ns 1.19
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 287209.5 ns 226465 ns 1.27
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397291 ns 397333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287834 ns 288042 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288166 ns 288417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 750833 ns 751375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44324 ns 44356 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 661875 ns 672167 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 532416 ns 531292 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 535458 ns 528292 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973250 ns 975666 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 191330.5 ns 193617.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 670958 ns 669291 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 644229 ns 642666 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 680667 ns 644708.5 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 648125 ns 687208 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132061.5 ns 132960 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2459333 ns 2454209 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2456084 ns 2456687 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2464542 ns 2455291 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2456083 ns 2470521 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1216753 ns 1122477 ns 1.08
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 3708 ns 3541 ns 1.05
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 3334 ns 3208 ns 1.04
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4334 ns 4458 ns 0.97
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 2667 ns 2958 ns 0.90
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16517 ns 16816 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5500 ns 5292 ns 1.04
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5458 ns 5333 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5625 ns 5625 ns 1
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5542 ns 5750 ns 0.96
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 186819.5 ns 187435 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1458167 ns 1458000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1500500 ns 1498250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1499333 ns 1497083 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1437750 ns 1439583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39930 ns 40900 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5130750 ns 5127041 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5285584 ns 5298083.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5315979 ns 5287583 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4998959 ns 5015875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195663 ns 198989 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33499 ns 34297 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15375 ns 15125 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15417 ns 15083.5 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15500 ns 15375 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15167 ns 15166 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 351211 ns 348507 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 70667 ns 71250 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71208 ns 71333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71959 ns 70959 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71333 ns 71209 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113147 ns 113569.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 318500 ns 317792 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 318000 ns 319125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 323666 ns 319500 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 317125 ns 319875 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 195331 ns 197937.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1084 ns 959 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1125 ns 1000 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1084 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 1000 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23576 ns 23702 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 7500 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8334 ns 7750 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8292 ns 8334 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 7958 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 249171.5 ns 249887 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 506709 ns 504875 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 492375 ns 484208 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 562708 ns 564708 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 222187.5 ns 236458 ns 0.94
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129166 ns 130159 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1387250 ns 1379479.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1449208 ns 1446458.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1788375 ns 1730646 ns 1.03
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 865812.5 ns 884667 ns 0.98
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 273491 ns 273315.5 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 416 ns 416 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 334 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32843 ns 32089 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6667 ns 6083 ns 1.10
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6000 ns 1.08
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6500 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6458 ns 6083 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 250973.5 ns 250296.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1722042 ns 1723562.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1723208.5 ns 1725958.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1721083 ns 1731208 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1723750 ns 1767667 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168847 ns 168954.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4362042 ns 4352187.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4261187.5 ns 4302209 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4415583.5 ns 4360250 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4366958.5 ns 4366750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1143038 ns 1065222 ns 1.07
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6750 ns 6916 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6959 ns 6750 ns 1.03
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6959 ns 6875 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6708.5 ns 6958 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20756 ns 20747 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 51417 ns 67792 ns 0.76
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 32917 ns 48292 ns 0.68
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33333 ns 32958 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 51208.5 ns 51583 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 197240.5 ns 198224 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17542 ns 18375 ns 0.95
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 17875 ns 17625 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 18916 ns 18542 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17750 ns 18291 ns 0.97
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18861 ns 18190 ns 1.04
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53458 ns 53292 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53334 ns 53541 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53250 ns 53500 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53500 ns 53500 ns 1
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 319618.5 ns 306993 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75292 ns 75375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75375 ns 75208 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75792 ns 75000 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75208 ns 75458 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47162 ns 46432 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324375 ns 323792 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 327625 ns 324916 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 329583 ns 325000 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 324208 ns 327375 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 211676.5 ns 209114 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1484375 ns 1485167 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1527958 ns 1524792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1527583 ns 1525000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1462209 ns 1466042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51967 ns 51777 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5124708 ns 5115209 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5280333 ns 5290000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5332500 ns 5261979.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4985875 ns 5012167 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 202369.5 ns 202581 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28291 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28333 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28291 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24821 ns 24112 ns 1.03
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66459 ns 66333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66458 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66833 ns 66667 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66416 ns 67041 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 482606 ns 467729 ns 1.03
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1501229 ns 1491583.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1127563 ns 1128834 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1119291.5 ns 1128084 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2246375 ns 2260833.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 570915 ns 577757.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3082875 ns 3056208 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2738375 ns 2732395.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2760354 ns 2734709 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3780667 ns 3843875 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1961915 ns 1892225.5 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7895333 ns 7896000 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7893459 ns 7928041.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7944812.5 ns 7897562.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4834521 ns 4840958 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80959 ns 81709 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80333 ns 81062.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82166 ns 85084 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 134375.5 ns 90541 ns 1.48
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193995.5 ns 194858.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2014625 ns 2012792 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2006229 ns 2022916.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2047021 ns 2012625 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2022958 ns 2042500 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 740969 ns 690147 ns 1.07

This comment was automatically generated by workflow using github-action-benchmark.
