docs: add Reactant and TPU to autodiff.md (#1101)
* Add Reactant to autodiff.md

* Update autodiff.md

* Update autodiff.md

* Apply suggestions from code review

---------

Co-authored-by: Avik Pal <[email protected]>
wsmoses and avik-pal authored Nov 23, 2024
1 parent fb901ea commit d755929
Showing 1 changed file with 29 additions and 18 deletions.
47 changes: 29 additions & 18 deletions docs/src/manual/autodiff.md
@@ -6,42 +6,53 @@ Lux. Additionally, we provide some convenience functions for working with AD.

## Overview

-| AD Package | Mode | CPU | GPU | Nested 2nd Order AD | Support Class |
-| :----------------------------------------------------------------- | :------ | :----- | :----- | :------------------ | :------------ |
-| [`ChainRules.jl`](https://github.com/JuliaDiff/ChainRules.jl)[^cr] | Reverse | ✔️ | ✔️ | ✔️ | Tier I |
-| [`Enzyme.jl`](https://github.com/EnzymeAD/Enzyme.jl) | Reverse | ✔️ |[^q] |[^q] | Tier I[^e] |
-| [`Zygote.jl`](https://github.com/FluxML/Zygote.jl) | Reverse | ✔️ | ✔️ | ✔️ | Tier I |
-| [`ForwardDiff.jl`](https://github.com/JuliaDiff/ForwardDiff.jl) | Forward | ✔️ | ✔️ | ✔️ | Tier I |
-| [`ReverseDiff.jl`](https://github.com/JuliaDiff/ReverseDiff.jl) | Reverse | ✔️ ||| Tier II |
-| [`Tracker.jl`](https://github.com/FluxML/Tracker.jl) | Reverse | ✔️ | ✔️ || Tier II |
-| [`Mooncake.jl`](https://github.com/compintell/Mooncake.jl) | Reverse |[^q] ||| Tier III |
-| [`Diffractor.jl`](https://github.com/JuliaDiff/Diffractor.jl) | Forward |[^q] |[^q] |[^q] | Tier III |
+| AD Package | Mode | CPU | GPU | TPU | Nested 2nd Order AD | Support Class |
+| :----------------------------------------------------------------- | :------ | :----- | :----- | :----- | :------------------ | :------------ |
+| [`Reactant.jl`](https://github.com/EnzymeAD/Reactant.jl)[^re] + [`Enzyme.jl`](https://github.com/EnzymeAD/Enzyme.jl) | Reverse | ✔️ | ✔️ | ✔️ | ✔️ | Tier I |
+| [`ChainRules.jl`](https://github.com/JuliaDiff/ChainRules.jl)[^cr] | Reverse | ✔️ | ✔️ || ✔️ | Tier I |
+| [`Enzyme.jl`](https://github.com/EnzymeAD/Enzyme.jl) | Reverse | ✔️ |[^q] ||[^q] | Tier I[^e] |
+| [`Zygote.jl`](https://github.com/FluxML/Zygote.jl) | Reverse | ✔️ | ✔️ || ✔️ | Tier I |
+| [`ForwardDiff.jl`](https://github.com/JuliaDiff/ForwardDiff.jl) | Forward | ✔️ | ✔️ || ✔️ | Tier I |
+| [`ReverseDiff.jl`](https://github.com/JuliaDiff/ReverseDiff.jl) | Reverse | ✔️ |||| Tier II |
+| [`Tracker.jl`](https://github.com/FluxML/Tracker.jl) | Reverse | ✔️ | ✔️ ||| Tier II |
+| [`Mooncake.jl`](https://github.com/compintell/Mooncake.jl) | Reverse |[^q] |||| Tier III |
+| [`Diffractor.jl`](https://github.com/JuliaDiff/Diffractor.jl) | Forward |[^q] |[^q] ||[^q] | Tier III |

[^e]: Currently Enzyme outperforms other AD packages in terms of CPU performance. However,
-there are some edge cases where it might not work with Lux. We are working on
-improving the compatibility. Please report any issues you encounter.
+there are some edge cases where it might not work with Lux when not using Reactant. We are working on
+improving the compatibility. Please report any issues you encounter and try Reactant if something fails.

[^q]: This feature is supported downstream, but we don't extensively test it to ensure
that it works with Lux.

[^cr]: Note that `ChainRules.jl` is not really an AD package, but we have first-class
support for packages that use `rrules`.

+[^re]: Note that `Reactant.jl` is not really an AD package, but a tool for compiling functions, including the use of EnzymeMLIR for AD via `Enzyme.jl`.
+We have first-class support for the usage of `Reactant.jl` for inference and training when using `Enzyme.jl` for differentiation.

## [Recommendations](@id autodiff-recommendations)

* For CPU use cases:

-1. Use `Zygote.jl` for the best performance. This is the most reliable and fastest
+1. Use `Reactant.jl` + `Enzyme.jl` for the best performance as well as mutation support.
+When available, this is the most reliable and fastest option.
+2. Use `Zygote.jl` for the best performance without `Reactant.jl`. This is the most reliable and fastest
option for CPU for the time being. (We are working on faster Enzyme support for CPU.)
-2. Use `Enzyme.jl`, if there are mutations in the code and/or `Zygote.jl` fails.
-3. If `Enzyme.jl` fails for some reason, (open an issue and) try
+3. Use `Enzyme.jl`, if there are mutations in the code and/or `Zygote.jl` fails.
+4. If `Enzyme.jl` fails for some reason, (open an issue and) try
`ReverseDiff.jl` ([possibly with compiled mode](https://juliadiff.org/ReverseDiff.jl/dev/api/#ReverseDiff.compile)).

* For GPU use cases:

-1. Use `Zygote.jl` for the best performance. This is the most reliable and fastest
-option for GPU for the time being. We are working on supporting `Enzyme.jl` for
-GPU as well.
+1. Use `Reactant.jl` + `Enzyme.jl` for the best performance. This is the most reliable and fastest option, but presently
+only supports NVIDIA GPUs. AMD GPUs are currently not supported.
+2. Use `Zygote.jl` for the best performance on non-NVIDIA GPUs. This is the most reliable and fastest
+non-`Reactant.jl` option for GPU for the time being. We are working on supporting `Enzyme.jl` without
+`Reactant.jl` for GPU as well.

+* For TPU use cases:
+1. Use `Reactant.jl`. This is the only supported (and fastest) option.

## Support Class


1 comment on commit d755929

@github-actions

Lux Benchmarks

Benchmark suite Current: d755929 Previous: fb901ea Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4083 ns 4125 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4458 ns 4083.5 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4583 ns 5167 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4458 ns 4250 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 61537 ns 60836 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9958 ns 10458 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 11083 ns 10208.5 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10125 ns 10333 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10292 ns 10292 ns 1
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 428120 ns 426426 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1208 ns 1000 ns 1.21
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1333 ns 1291 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1333 ns 1437.5 ns 0.93
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1042 ns 1208 ns 0.86
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 17813 ns 17928 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3959 ns 4125 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4042 ns 4084 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4375 ns 4167 ns 1.05
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4000 ns 3958 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 110308 ns 109688.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57500 ns 57625 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38333 ns 38333 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46625 ns 46792 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82166 ns 81167 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36705 ns 37191 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027541 ns 2025916.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2090041.5 ns 2084833.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2097083 ns 2091333 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1999875 ns 1993604 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195283 ns 194623 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 143625 ns 144416 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 143417 ns 147520.5 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145584 ns 144062.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147187.5 ns 144041 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166525 ns 165620 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1109542 ns 1116375.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1126812.5 ns 1135458 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1122083 ns 1116021 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1020645.5 ns 1117250 ns 0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 533338 ns 525200 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3375 ns 3583 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3416 ns 3416 ns 1
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4541 ns 4417 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3604.5 ns 3750 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 68868.5 ns 67680 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9292 ns 9083 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9542 ns 9042 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9792 ns 9291 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8833 ns 8750 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 494765.5 ns 488913 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15583 ns 16583.5 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16458 ns 15000 ns 1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16500 ns 16937.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15083 ns 14521 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 54721 ns 55104 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212833 ns 215166.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215167 ns 213375 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214416 ns 212833 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212417 ns 213208 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 274119.5 ns 272083 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 583 ns 542 ns 1.08
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 792 ns 625 ns 1.27
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 687.5 ns 1.09
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17270 ns 17338 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1667 ns 1583 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1667 ns 1666 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1458 ns 1708 ns 0.85
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1708 ns 1625 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 103124 ns 102756.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7291 ns 7083 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 5292 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 5875 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 10083 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23563 ns 23408 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220708 ns 221750 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 236874.5 ns 231917 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228875 ns 228875 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220166 ns 214167 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 169828.5 ns 169815.5 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3916 ns 3917 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3917 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23299 ns 23411 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16708 ns 16583.5 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16833 ns 16459 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16834 ns 16709 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16667 ns 16791 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 162920 ns 162393 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 574791 ns 569208 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 578334 ns 569667 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 574000 ns 570125 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 574333 ns 578750 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113504 ns 113197 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1420083 ns 1418708 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1415750 ns 1421583 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1420208 ns 1420834 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1425187.5 ns 1432291 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 212199 ns 211123.5 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1067895.5 ns 1076625 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 940416 ns 938625 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1346520.5 ns 1353166 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1295333 ns 1298500 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 276087 ns 277930.5 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6005792 ns 5845333 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4619125 ns 4593146 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4921458.5 ns 4960354 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5705500 ns 5524145.5 ns 1.03
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1093586 ns 1090079 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23336 ns 23601.5 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2083 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2166 ns 2209 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2209 ns 2083 ns 1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 170662.5 ns 169946.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4083 ns 3666 ns 1.11
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4250 ns 4417 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5250 ns 4709 ns 1.11
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4250 ns 4500 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 66890.5 ns 65407 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11166 ns 10834 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11750 ns 11292 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11792 ns 11667 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11145.5 ns 10958 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 455730.5 ns 453534 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6708 ns 6167 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6917 ns 7479.5 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8000 ns 8500 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6833 ns 6375 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 53251 ns 52550.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17646 ns 16583 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17687.5 ns 17500 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17583 ns 19833 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18520.5 ns 16625 ns 1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 303857.5 ns 303262 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 667 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 667 ns 625 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32349 ns 31843 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8500 ns 8542 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9458 ns 8875 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9291 ns 9250 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9375 ns 8208 ns 1.14
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 158134 ns 159642 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64375 ns 64792 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64500 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64542 ns 64542 ns 1
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64417 ns 64375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111051 ns 111120 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 280917 ns 280042 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 285417 ns 291791 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 280750 ns 279250 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 279291.5 ns 277208 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 185526.5 ns 184735.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3281750 ns 3278875 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 2797500 ns 2813375 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3018917 ns 3029687.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4088625 ns 3938209 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 571296 ns 578907.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7642500 ns 7620083 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7291354 ns 7352417 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7449292 ns 7457271 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8096333 ns 8189500 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1326986 ns 1328385 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17512333 ns 17561125 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17557479.5 ns 17648625 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17568792 ns 17534459 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14165000 ns 14095167 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23618750 ns 23588417 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43411666 ns 44459541 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37050562 ns 37064416.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34914229.5 ns 34977333.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1853387 ns 1845684 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 187623875 ns 189659041 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 247457083 ns 250146875 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 194208333 ns 193409375 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 434785500 ns 434181959 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13912861.5 ns 18049039.5 ns 0.77
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 289468416 ns 290672125 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 350360437.5 ns 356317062.5 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 297011958 ns 296289666.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 409128187.5 ns 392800437.5 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24042 ns 22875 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23958 ns 22938 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23916 ns 24562.5 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22020.5 ns 24416 ns 0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 96407 ns 96194.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103208.5 ns 103875 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104791 ns 103416 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104667 ns 104292 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103417 ns 103125.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 511501 ns 506291.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6145.5 ns 5917 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5625 ns 6000 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6979.5 ns 6584 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5709 ns 6209 ns 0.92
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 69596.5 ns 68552.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14520.5 ns 15166.5 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15542 ns 15500 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16125 ns 15542 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14625 ns 14958 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 479202.5 ns 480464 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3041750 ns 2996875 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2066041.5 ns 2072750 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2266312 ns 2257667 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4490041.5 ns 4838583 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 590463 ns 584192 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23486917 ns 23549437 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18259854 ns 18342167 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17822021 ns 17896791 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35704478.5 ns 35570625 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2768088 ns 2764116 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33321020.5 ns 33587937.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28000312.5 ns 28029333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28560333.5 ns 28377209 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41618958 ns 41334187.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72209 ns 75479 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 81645.5 ns 73958.5 ns 1.10
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 74917 ns 74125 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72396 ns 72166 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 105122.5 ns 104339 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 278083 ns 203458.5 ns 1.37
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 314375 ns 280916.5 ns 1.12
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 208562.5 ns 209583 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 241750.5 ns 216291.5 ns 1.12
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 565906 ns 562778.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11417 ns 11708 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11833.5 ns 12833 ns 0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12250 ns 13042 ns 0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12166 ns 11917 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 73969 ns 72705 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26125 ns 26645.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27542 ns 26458 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26708 ns 27458 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26708 ns 26792 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 488459.5 ns 488247 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12208 ns 12000 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 13084 ns 13750 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13833 ns 14000 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12687.5 ns 12500 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 55593 ns 55166 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25333 ns 25583 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26458 ns 26416 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26250 ns 26375 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26458 ns 28167 ns 0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 314229 ns 313572.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 181625 ns 181541.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 180250 ns 181104 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183667 ns 181895.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179417 ns 181916 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 58869 ns 59339.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 587667 ns 612417 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 585625 ns 590459 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 583584 ns 583541 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 584708 ns 582416 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 294563.5 ns 294347 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5395.5 ns 5854.5 ns 0.92
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6042 ns 7000 ns 0.86
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8416.5 ns 7167 ns 1.17
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8791 ns 6042 ns 1.45
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 73281.5 ns 72861 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13834 ns 14208.5 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15125 ns 14333 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14333 ns 15084 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14250 ns 14208 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 478456 ns 476457 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1191541 ns 1198334 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1236750 ns 1236458 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1285583.5 ns 1270167 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1003417 ns 1009834 ns 0.99
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302585 ns 301349 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4114354 ns 4121104 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4527875 ns 4571459 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4560333.5 ns 4583146 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3695000 ns 3708333 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1056192.5 ns 1054428 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23824 ns 24401 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4875 ns 4792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4917 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4917 ns 5042 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4959 ns 4875 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 193428 ns 192852.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6084 ns 5916.5 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6166 ns 6625 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6917 ns 7625 ns 0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6209 ns 5916 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 57953 ns 57663 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10333 ns 10562.5 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11709 ns 11417 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11333 ns 12083 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11583 ns 10459 ns 1.11
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 343622 ns 339260 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 334 ns 333 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 375 ns 0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23294 ns 23460 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2792 ns 2791 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3042 ns 2792 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3000 ns 2709 ns 1.11
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 2791 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 163978 ns 162941.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11375 ns 11542 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11666 ns 12209 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12500 ns 13875 ns 0.90
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11542 ns 11583 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 59566.5 ns 59011.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24459 ns 24375 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25042 ns 24583 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25083.5 ns 25208 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25083 ns 24792 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 305262.5 ns 303188 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4208 ns 4167 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4167 ns 4208 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4209 ns 4208 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 25152 ns 25111 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16083 ns 16042 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16042 ns 15917 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16375 ns 16291 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16167 ns 16291 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 203575 ns 202144.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5875 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5833 ns 5833 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5875 ns 5916 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5750 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34167 ns 34056 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20291.5 ns 20520.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21041 ns 21000 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21500 ns 21167 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21167 ns 21333 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 180386.5 ns 179609.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 420667 ns 425458.5 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 363520.5 ns 364854.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 482000 ns 482520.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 125291.5 ns 103125 ns 1.21
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67480 ns 67737 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 897041 ns 906625 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 967000.5 ns 982042 ns 0.98
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1167958 ns 1181333 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 396500 ns 377458 ns 1.05
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 197078.5 ns 194135 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80125 ns 81333 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81020.5 ns 82041 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82625 ns 84291 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83458 ns 81813 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194831 ns 194522 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1694000 ns 1927625 ns 0.88
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1917291.5 ns 1941000 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1931459 ns 1930917 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1896062.5 ns 1842062 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 416256.5 ns 390656 ns 1.07
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22312 ns 22388 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 176862.5 ns 171479 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6208 ns 6542 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6875 ns 7083.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7750 ns 8020.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7000 ns 6500 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 62506 ns 60274 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8833 ns 8917 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9250 ns 9417 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9417 ns 9916 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9333 ns 9208 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 325531 ns 311149 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121103854.5 ns 120884833.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 181392229 ns 181722750 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147959958.5 ns 148231625 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 103681750 ns 108144417 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5500074 ns 5478841 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 613086875 ns 615355583.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 578493750 ns 581447666.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 454857041.5 ns 451634708.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 752941812.5 ns 757933250.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35077599 ns 34994190 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 649102417 ns 649420209 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 685608520.5 ns 687787021 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 589011249.5 ns 584232000.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 739858625 ns 744942000 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59500 ns 59500 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38708 ns 39125 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 48000 ns 48020.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82708 ns 83458 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38528 ns 38331 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1741292 ns 1946625 ns 0.89
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1966416 ns 1985458 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1984416 ns 1983521 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1859270.5 ns 1887334 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 177396 ns 176268 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 271125 ns 265750 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 274250 ns 268104.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 268416 ns 269291.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267791.5 ns 265125 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 137600.5 ns 125359 ns 1.10
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 587833 ns 690208 ns 0.85
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 666917 ns 658417 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 587208 ns 603125 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 665917 ns 594458 ns 1.12
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 757074 ns 701612 ns 1.08
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2224291.5 ns 2169417 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2235083 ns 2237833 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2099770.5 ns 2188625 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2218208 ns 2203000 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 135238 ns 133751 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5494167 ns 5513083.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5547875 ns 5572520.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5497792 ns 5508208 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5395666.5 ns 5485271 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 797087 ns 720574 ns 1.11
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 643250 ns 638458 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 646958 ns 640250 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 642375 ns 640416 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 640208 ns 642666.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47636 ns 46893.5 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1820958 ns 1824209 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1668166 ns 1666417 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1721291 ns 1728208 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2100708 ns 2102708 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 227359.5 ns 220656.5 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58583 ns 58500 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38208.5 ns 38584 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47292 ns 46208 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82750 ns 83042 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 29299.5 ns 28530.5 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2023770.5 ns 2056084 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2018000 ns 2102729.5 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2096292 ns 2102270.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1983895.5 ns 1992792 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 191243 ns 189031.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13392479 ns 13396167 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12447084 ns 12488625 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12573562.5 ns 12567208 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15225667 ns 14924083 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 515936 ns 512412.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47214583.5 ns 47267416.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 42007792 ns 42078000 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40831167 ns 40824125 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58287250 ns 58451854 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2893597 ns 2895350 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 73879562 ns 74360062.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91062583 ns 91413375 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90595250 ns 90659959 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 98708500 ns 76716041 ns 1.29
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59041 ns 59208 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38458 ns 38833 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47500 ns 47125 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83041 ns 78625 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 46889 ns 48139.5 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1914042 ns 1938145.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1980250 ns 1984167 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1983041.5 ns 1977812.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1895208.5 ns 1877083 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 191685.5 ns 195830.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 416 ns 333 ns 1.25
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 31909.5 ns 32688 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 5958 ns 6083 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6666 ns 6334 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6500 ns 6666 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6667 ns 6000 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 174339.5 ns 173538 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31434 ns 32105 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2625 ns 2584 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2959 ns 2792 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2792 ns 2791 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2833 ns 2584 ns 1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 161698.5 ns 160748.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 284655874.5 ns 287049250 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 346665396 ns 347795687.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314185249.5 ns 314367979.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 271410834 ns 271524458 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7071052.5 ns 7120410.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 986652459 ns 1003307875 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 960769500 ns 964885125 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 837320313 ns 835293000 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1160509417 ns 1152976875 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34004605 ns 34058870 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1311324917 ns 1312833396 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1697266750 ns 1706336084 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1638971166 ns 1599191959 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1734387958.5 ns 1309056604.5 ns 1.32
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1414375 ns 1408791 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1459333 ns 1452791.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1417583 ns 1449625 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1464750 ns 1407209 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127631 ns 128282.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4707666.5 ns 5034917 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5056666.5 ns 5065916.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5045625 ns 5035937.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5028167 ns 5012729 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 589690 ns 483777.5 ns 1.22
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 174231250 ns 171224875 ns 1.02
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 167491167 ns 167755167 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 128702541 ns 128923708 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 154878708 ns 154904187 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4890073 ns 4889428.5 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 622332667 ns 621337542 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 581984000 ns 581831583 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 496978166 ns 460212833 ns 1.08
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 643892875 ns 643084792 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16065970 ns 16318390 ns 0.98
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8934042 ns 8919875 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 9020375 ns 9050687.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7917083 ns 7921583 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9692542 ns 9747084 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1603050 ns 1600463.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36495271 ns 36566209 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 38137292 ns 38511167 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33438520.5 ns 33595375 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 37760500 ns 37796583 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6473707 ns 6471792 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47375 ns 47291 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47417 ns 47479.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47500 ns 47729.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47542 ns 47334 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18555 ns 18559 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50250 ns 50416 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50375 ns 50417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50667 ns 50417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50375 ns 50375 ns 1
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 207795 ns 167009.5 ns 1.24
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6375 ns 6459 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7041 ns 7770.5 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7958 ns 8041 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7208.5 ns 7000 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 101178.5 ns 76373.5 ns 1.32
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10125 ns 10000 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10625 ns 10458 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10625 ns 10250 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10417 ns 10084 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 593102.5 ns 456260 ns 1.30
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5708 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6208.5 ns 6708 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6750 ns 7458 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6084 ns 5917 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 121281 ns 91945.5 ns 1.32
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12708 ns 12917 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13541 ns 13625 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13250 ns 13416 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13208 ns 13292 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 511694 ns 417439.5 ns 1.23
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 958 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32282 ns 32442 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7750 ns 7542 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8042 ns 7875 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8000 ns 8291 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8041 ns 7834 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 210142.5 ns 192614 ns 1.09
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23166 ns 23250 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23209 ns 23250 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23250 ns 23416 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23104.5 ns 23292 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18312 ns 18706.5 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52416 ns 52417 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52542 ns 52625 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52709 ns 52959 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52625 ns 52875 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 291833.5 ns 226057.5 ns 1.29
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1400833 ns 1403937.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1445959 ns 1409291.5 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1396833 ns 1405208 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1398917 ns 1402896 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 197117.5 ns 196688.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5008208 ns 5027625 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5030250 ns 5036500.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5026354 ns 5008875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4996437.5 ns 5003083.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 600264 ns 565308 ns 1.06
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3038708 ns 3058166 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2105979 ns 2060229 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2274062.5 ns 2301833 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4858083 ns 4897625 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 586328 ns 586278 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24399625 ns 24473708.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19072583.5 ns 19098958 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18904750 ns 18981042 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36638687.5 ns 37019125 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2819518 ns 2831934 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33955417 ns 34098417 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28785062.5 ns 28724166.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28141333 ns 28239458 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41707708.5 ns 41378063 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 142540583 ns 146235958 ns 0.97
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 146733875 ns 147965500 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 125527687.5 ns 127304667 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 174248667 ns 172673353.5 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22566115 ns 22564119 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 968276062.5 ns 1235304437.5 ns 0.78
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 860326354.5 ns 869077229.5 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 858659167 ns 769904041 ns 1.12
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 683117959 ns 666199333 ns 1.03
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118099274 ns 118146881 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72375 ns 73812 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74000 ns 73875 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76250 ns 75687.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73208 ns 76416 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 235570 ns 208579 ns 1.13
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 203292 ns 295500 ns 0.69
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 282896 ns 193958 ns 1.46
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 203583 ns 287395.5 ns 0.71
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 207583 ns 282729 ns 0.73
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1260670 ns 1165959 ns 1.08
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35143208 ns 35776083 ns 0.98
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36705709 ns 36529041 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32591958.5 ns 32581292 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40607646 ns 40338396 ns 1.01
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5841170.5 ns 5849817 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148155791.5 ns 148302541 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 158417083.5 ns 158881084 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 137765333 ns 138956354.5 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 283770667 ns 284123584 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34905958 ns 34596502 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120795375 ns 120211625 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 181579562.5 ns 182136458 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148004834 ns 148062084 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 108061458.5 ns 105814875 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5466179.5 ns 5475710.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 468909791.5 ns 469150645.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 485490958.5 ns 486184250 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 438520417 ns 437949792 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 742778708 ns 739059333 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32266057 ns 32333012 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 707166333 ns 712730687.5 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 671742104.5 ns 678064125 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 577648896 ns 570651646 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 734518917 ns 732192500 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1349520.5 ns 1338854 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 780417 ns 764333 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 909417 ns 971166 ns 0.94
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2087500 ns 2047291 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 566986 ns 582645.5 ns 0.97
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2979167 ns 2995792 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2496208 ns 2516000 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2619166 ns 2623541.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3728333 ns 3683208 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1738136 ns 1752698 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5799875 ns 5821709 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5883292 ns 5892750 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5800167 ns 5806979 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2892541.5 ns 2887229 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7500 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 5333 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6208 ns 6042 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10042 ns 10041 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25118 ns 25775 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212333 ns 225958.5 ns 0.94
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221583 ns 220750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220562.5 ns 220625 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215896 ns 206167 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 262400.5 ns 259112 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 307233708 ns 308668791.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 279732584 ns 282575646 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 198830375 ns 199775042 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 309726917 ns 309205458 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7656813 ns 7688394 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1090685500 ns 1093080750 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1068219000 ns 1075916375 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 818375167 ns 810723875 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1160424021 ns 1146255478.5 ns 1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26548125.5 ns 26478179 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5812.5 ns 5042 ns 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5708 ns 6250 ns 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6959 ns 6584 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5458 ns 5458 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 154820 ns 170923.5 ns 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7125 ns 7333 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7708 ns 7416 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7417 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7542 ns 7041 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 618164 ns 648059.5 ns 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 584 ns 541 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23615 ns 24468 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9250 ns 9333 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9458 ns 9000 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9625 ns 9729.5 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9750 ns 8792 ns 1.11
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 207782.5 ns 223281 ns 0.93
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 356333 ns 351708 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 352417 ns 352583 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 356083 ns 352708 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 357500.5 ns 351416.5 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21053.5 ns 21843 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 780146 ns 811563 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 776312.5 ns 793583.5 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 809375 ns 812375 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 826750 ns 804291 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 303323.5 ns 279114.5 ns 1.09
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 338396 ns 338875 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 325208 ns 321459 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 453375 ns 450271 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 10542 ns 10750 ns 0.98
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17732 ns 18538 ns 0.96
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 718917 ns 712021 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 732645.5 ns 730333 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1009833 ns 1002270.5 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26583 ns 26708 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 257155 ns 261073.5 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 374000 ns 381875 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 331500 ns 326167 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 441875 ns 443625 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 30917 ns 30417 ns 1.02
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22404 ns 23393 ns 0.96
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 739437.5 ns 731937.5 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 779666.5 ns 784187.5 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1041375.5 ns 1027875 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 104312.5 ns 89584 ns 1.16
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 235395 ns 220484 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3625 ns 3375 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3625 ns 3708 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3625 ns 3833 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3459 ns 3458 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17702 ns 17892 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4250 ns 4292 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4250 ns 4250 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4334 ns 4333 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4375 ns 4417 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 245299 ns 288266.5 ns 0.85
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3479.5 ns 4083 ns 0.85
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3792 ns 4062.5 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4334 ns 4334 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3709 ns 3833 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 185222 ns 243078.5 ns 0.76
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 8417 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8687.5 ns 8208 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8666 ns 8583 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8500 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1127148 ns 1294141 ns 0.87
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 206541 ns 203583 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 212000 ns 209750 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211000 ns 209750 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 202291 ns 199542 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34888 ns 35748 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 648750 ns 610959 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 634312.5 ns 629979 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 632771 ns 632042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 596417 ns 624312.5 ns 0.96
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 322649.5 ns 366873 ns 0.88
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 998333 ns 1020270.5 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1039375 ns 1019375 ns 1.02
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 952083 ns 956541 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 904292 ns 862917 ns 1.05
batchedmm(128, Bsize=128)/forward/GPU/CUDA 208498.5 ns 208035 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4540000 ns 4555583 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4817791.5 ns 4847250 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4468750 ns 4461541 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 5130375 ns 5174375 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 959939 ns 927061 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3875 ns 4042 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3334 ns 3500 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4125 ns 4250 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3750 ns 3375 ns 1.11
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 197248.5 ns 241039.5 ns 0.82
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7645.5 ns 7500 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 7062.5 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7292 ns 7333 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7458 ns 6916 ns 1.08
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1027567 ns 1063926.5 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1650375 ns 1524958 ns 1.08
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1182479.5 ns 1178854.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1370292 ns 1368709 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2441916.5 ns 2362167 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215671.5 ns 218600.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12370500 ns 12347875 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9601667 ns 9603708 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9328687.5 ns 9285208.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18097145.5 ns 17994500 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1953457 ns 1959865.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17380125 ns 17343125 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14471146 ns 14424146 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14397875 ns 14365583 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21055583 ns 21176708 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 91125 ns 90520.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 90875 ns 90208 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 94958 ns 94500 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 88000 ns 133292 ns 0.66
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126032 ns 126385 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2023583.5 ns 2059229.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2028542 ns 2014083.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2033312 ns 2030292 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2043416.5 ns 2020416.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1084734 ns 1061374.5 ns 1.02
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 3458.5 ns 2375 ns 1.46
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 1625 ns 1834 ns 0.89
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3500 ns 3542 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 1750 ns 2167 ns 0.81
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15936 ns 16672 ns 0.96
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2584 ns 2541 ns 1.02
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2791 ns 2917 ns 0.96
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2917 ns 2750 ns 1.06
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2833 ns 2792 ns 1.01
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 195099.5 ns 197485.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7333 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 5416 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6083 ns 5958 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 9916 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33830 ns 34400.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 224916 ns 213812.5 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 234875 ns 221000 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 231083 ns 231917 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218917 ns 208604 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 348229.5 ns 352524 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21982 ns 22677 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14459 ns 14416 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14208 ns 14125 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14417 ns 14500 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14584 ns 14417 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 489892.5 ns 511650.5 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 94917 ns 93854 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 93416.5 ns 97145.5 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 99875 ns 98417 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 92625 ns 140083 ns 0.66
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125549 ns 125784 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1921625 ns 1964729 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1933333.5 ns 1938562.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1928500 ns 1927041.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1950604.5 ns 1920667 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 964756 ns 1039090 ns 0.93
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 873521 ns 877500 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 804167 ns 800812.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1218520.5 ns 1223937 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 954959 ns 969958 ns 0.98
lenet(28, 28, 1, 32)/forward/GPU/CUDA 285492.5 ns 285567 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2830854 ns 2803854 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2531000 ns 2511750 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3356083 ns 3356541.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3412042 ns 3428708 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1671062 ns 1675606 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16271 ns 15958 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16500 ns 16562.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18666.5 ns 17041.5 ns 1.10
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18916 ns 17375 ns 1.09
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144500.5 ns 145484 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 260708 ns 223104 ns 1.17
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 254749.5 ns 222896 ns 1.14
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 227979 ns 226708 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226584 ns 253167 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 650846.5 ns 664599 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222167 ns 221146 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 222041.5 ns 221500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222166 ns 221666.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 220000 ns 221042 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 277439 ns 276464 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 561333.5 ns 551791 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 549000 ns 505375 ns 1.09
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 558813 ns 509750 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 557729.5 ns 508666.5 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1450310.5 ns 1493627 ns 0.97
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 4000 ns 4000 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4166 ns 4104.5 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 5750 ns 4667 ns 1.23
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 4042 ns 4042 ns 1.00
batchedmm(16, Bsize=4)/forward/GPU/CUDA 17089 ns 17326 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7000 ns 7042 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7208 ns 7417 ns 0.97
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7166 ns 7250 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7542 ns 7458 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 196929 ns 198652.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18083 ns 17875 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18959 ns 18333 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19250 ns 19750 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18124.5 ns 17146 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 165663 ns 230076 ns 0.72
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222875 ns 219250 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213896 ns 216020.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 225792 ns 212500 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222042 ns 212479.5 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1029397 ns 1050719 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4500 ns 4500 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 3958 ns 4583 ns 0.86
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5125 ns 4667 ns 1.10
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4333 ns 4583 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 204180 ns 252077 ns 0.81
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10917 ns 10833 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10583 ns 10500 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10500 ns 10250 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10750 ns 10250 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1058573 ns 1102570 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3291.5 ns 3312.5 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3542 ns 3708 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4417 ns 3959 ns 1.12
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3458 ns 3125 ns 1.11
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 245634 ns 243703 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7458 ns 7229.5 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7583 ns 7333 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7417 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7541 ns 7209 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1074772.5 ns 1111590.5 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23471041.5 ns 23487541.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43849166 ns 43971125 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37957792 ns 37463166.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34964125 ns 34877416 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1792082 ns 1842834.5 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184426958 ns 184200958 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 173017604 ns 173422437.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 147161645.5 ns 146460271 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 411405916 ns 410950833 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16521696 ns 16526176 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 426004833.5 ns 425975000 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 259123250 ns 259298209 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 296958750 ns 296349208.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 480245750 ns 479307000 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 183042 ns 183167 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 185188 ns 183917 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 186041.5 ns 185291.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 184333.5 ns 183708.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 226412 ns 232992 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 597750 ns 588709 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 598229 ns 595709 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 632895.5 ns 596042 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 586958 ns 597500 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1097502 ns 1113560 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3838542 ns 4043292 ns 0.95
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 4115979 ns 4012396 ns 1.03
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3571292 ns 3557000 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4600166.5 ns 4569124.5 ns 1.01
batchedmm(128, Bsize=512)/forward/GPU/CUDA 534974 ns 531536 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17343875 ns 17494562.5 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 18514250 ns 18560917 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16537292 ns 16622646 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 20367667 ns 20213416.5 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2795688 ns 2619803.5 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 666 ns 542 ns 1.23
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32682 ns 32024.5 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9042 ns 9334 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9709 ns 9291 ns 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9500 ns 9666.5 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9666 ns 9000 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 266437.5 ns 264542.5 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 499772583 ns 496971791 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 504959958 ns 509285541 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 422832542 ns 421912146 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 673427063 ns 672227417 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 11842270.5 ns 12489793.5 ns 0.95
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1875482271 ns 1883911021 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1653498000 ns 1668824291 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1486024395.5 ns 1489797958.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2210913770.5 ns 2201017208.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49084588.5 ns 49197806.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1649062.5 ns 1600645.5 ns 1.03
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1182584 ns 1172708 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1392250 ns 1388125 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2377145.5 ns 2344958.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 218920 ns 218458 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12688458.5 ns 12685750 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 10001583.5 ns 9976000 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9698792 ns 9656709 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18502292 ns 18427396 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2042988 ns 2044469 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17689291 ns 17712834 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14793041.5 ns 14779375 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14622084 ns 14604916 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21477583.5 ns 21383042 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26292 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26250 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26250 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26291 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24105 ns 24118 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67000 ns 67000 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67042 ns 66833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67875 ns 67500 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67208 ns 66834 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 396461.5 ns 410737.5 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204959 ns 203917 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209958 ns 208625 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209875 ns 209084 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199833 ns 199500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26682 ns 27195 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 646208 ns 625958.5 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 670000 ns 629916 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 644166 ns 632125 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630416 ns 600062.5 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 354787 ns 358637.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 598417 ns 658417 ns 0.91
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 657292 ns 641625 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 664187.5 ns 647542 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 659708 ns 666291.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132717 ns 132681.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2235958 ns 2274708 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2279125 ns 2300125 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2249833 ns 2238125 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2316042 ns 2241291 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1193695.5 ns 1242340 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18500 ns 18020.5 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19250 ns 18292 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19292 ns 20250 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17500 ns 17500 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 146082 ns 146876.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 259917 ns 231458 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 259625 ns 227333.5 ns 1.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 230208.5 ns 227500 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 256708 ns 229792 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1005431.5 ns 1067171 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 584 ns 667 ns 0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 667 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23900 ns 23878 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9750 ns 9833 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10333 ns 9875 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10000 ns 10000 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10000 ns 9541 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 259163 ns 263281 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5833 ns 5687.5 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5916 ns 6208 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6459 ns 7125 ns 0.91
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5833 ns 5417 ns 1.08
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 228223.5 ns 235834 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7250 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7666 ns 8042 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7666 ns 7541.5 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7333 ns 6979.5 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 770644 ns 811982.5 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2333 ns 2125 ns 1.10
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2187.5 ns 2312.5 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2292 ns 2500 ns 0.92
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2250 ns 2125 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 17986 ns 18261 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6500 ns 6375 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6666 ns 6520.5 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6666 ns 6708 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6625 ns 6375 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 321059 ns 336632.5 ns 0.95
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 749208.5 ns 749209 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 748958 ns 748895.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 750125 ns 749542 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 748834 ns 754083 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21410 ns 21329 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 798125 ns 818750 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 791208 ns 788167 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 837729.5 ns 791584 ns 1.06
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 775270.5 ns 790584 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 301663.5 ns 299791 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7417 ns 7500 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5291 ns 5334 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 5916 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10292 ns 10208 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33301 ns 33718 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 232896 ns 256167 ns 0.91
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 268833.5 ns 235520.5 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 267354.5 ns 240500 ns 1.11
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215500 ns 250875 ns 0.86
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 361937 ns 365654 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10000 ns 10312.5 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9833.5 ns 10416 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11042 ns 10812.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10333 ns 10166.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 250034.5 ns 245731 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24334 ns 25083 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25250 ns 24667 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24542 ns 24125 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24334 ns 24500 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1111417.5 ns 1139764 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106812374.5 ns 106439229 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 126726167 ns 127176500 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 121727417 ns 120453645.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 118228479 ns 117602312.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2616848 ns 2646453 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 391804291 ns 394264417 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 379056792 ns 380211666 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 355535666 ns 421708312.5 ns 0.84
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 486452916 ns 479818917 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15186296 ns 15158878 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 756685666.5 ns 756832624.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 774854291 ns 775894292 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 746786813 ns 748243271.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 947077458 ns 761933208.5 ns 1.24
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8416 ns 7145.5 ns 1.18
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7125 ns 7834 ns 0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8125 ns 9541 ns 0.85
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9604 ns 7417 ns 1.29
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 240976 ns 241749 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14250 ns 14291.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14291 ns 14166 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14167 ns 14167 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14166 ns 13708 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1095523 ns 1098247 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5917 ns 6042 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6125 ns 6750 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6687.5 ns 7083 ns 0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6292 ns 5834 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 239291 ns 240471.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12583 ns 12667 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13125 ns 13333 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13291.5 ns 13354.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12417 ns 12334 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 797358.5 ns 800476.5 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5459 ns 5333 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 5833 ns 5875 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 7000 ns 6000 ns 1.17
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5542 ns 5500 ns 1.01
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16938 ns 17559 ns 0.96
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15500 ns 15459 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15458 ns 15437.5 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15666 ns 15667 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15875 ns 15750 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 200590 ns 202574 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 417 ns 417 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 416 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 416 ns 292 ns 1.42
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23824 ns 24102 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6583 ns 6333 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6500 ns 6209 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6666.5 ns 6750 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6583 ns 6333 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 239979.5 ns 242831.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5916 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6000 ns 5917 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5917 ns 5958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5958 ns 5792 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24627 ns 25033 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 20916.5 ns 21375 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21209 ns 21125 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21833 ns 21375 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21292 ns 21020.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 265615.5 ns 267836 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 192687.5 ns 144833 ns 1.33
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 146521 ns 145250 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 149374.5 ns 150083.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 142250 ns 188375 ns 0.76
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168462.5 ns 168310 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1318667 ns 1351833 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1326875 ns 1369333 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1328208 ns 1322041 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1311167 ns 1327250 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1370856 ns 1368007 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22125 ns 23042 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22083 ns 24041 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24209 ns 24917 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24417 ns 21833 ns 1.12
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 357178 ns 356401.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 130958 ns 126958 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 180395.5 ns 120333 ns 1.50
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 130875 ns 180250 ns 0.73
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 178917 ns 180749.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1498842 ns 1484885 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23528 ns 23370 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6479.5 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6791 ns 6416 ns 1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6834 ns 7042 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6792 ns 6333 ns 1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 258073.5 ns 260419.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4500 ns 4333 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5250 ns 5041.5 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5125 ns 5459 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4667 ns 4583.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 256140 ns 255220 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10000 ns 10166.5 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10416 ns 10167 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10292 ns 10250 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10208 ns 10208 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1357774 ns 1368092 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1666 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1583 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23069 ns 23227 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5750 ns 5750 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6084 ns 5750 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5917 ns 5750 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5667 ns 5583 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 275859 ns 278026 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6814167 ns 6781854.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6368854.5 ns 6363854.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6497917 ns 6534166 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7560667 ns 7654958.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215030 ns 216771 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24038396 ns 24093667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21318250 ns 21335604 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21055625 ns 21037958 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29800458 ns 29730292 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2117334 ns 2100300 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37406895.5 ns 37311042 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45481041 ns 45649479 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45606750 ns 45692458 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49407375 ns 38098959 ns 1.30
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6375 ns 5520.5 ns 1.15
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6208 ns 6708.5 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7292 ns 7250 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5916 ns 6125 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 237163.5 ns 240533.5 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8375 ns 8083 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8666 ns 9083 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8416 ns 8417 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8250 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1062411 ns 1077102 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1544167 ns 1489187.5 ns 1.04
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1249833.5 ns 1236771 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1625709 ns 1617916 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2004375 ns 2170020.5 ns 0.92
lenet(28, 28, 1, 128)/forward/GPU/CUDA 275720 ns 282849 ns 0.97
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7903083 ns 7909229.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6659625 ns 6634750 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7184500 ns 7161708 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10128083 ns 10483708.5 ns 0.97
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1884846.5 ns 1903700.5 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 369396 ns 367625 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 353625.5 ns 349896 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 456542 ns 453917 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 24041.5 ns 24459 ns 0.98
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46544 ns 43502 ns 1.07
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 743500 ns 727167 ns 1.02
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 796417 ns 803167 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1071583 ns 1057604 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 125958 ns 121792 ns 1.03
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 312111.5 ns 307546.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397375 ns 397583 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 212250 ns 213333 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288125 ns 288209 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 753500 ns 751125 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44394 ns 44141 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 673292 ns 675500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 472125 ns 475667 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 531791 ns 531375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 974625 ns 972666.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 191967.5 ns 191213 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 657167 ns 658208.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 669958.5 ns 643834 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 661104 ns 655125 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 662708 ns 681792 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132971.5 ns 132164.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2458250 ns 2526833 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2498250 ns 2530541 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2467687 ns 2451667 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2501875 ns 2454146 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1568577 ns 1206173 ns 1.30
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 4333 ns 2604 ns 1.66
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 2583 ns 2459 ns 1.05
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4417 ns 4375 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 2750 ns 2583 ns 1.06
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16411 ns 16766 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5375 ns 5333 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5458 ns 5542 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5625 ns 5583 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5625 ns 5542 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 199892.5 ns 199467 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1463541 ns 1459833 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1497208 ns 1490334 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1503375 ns 1497791 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1442834 ns 1439750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 41596 ns 41167 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5109479 ns 5155562 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5289042 ns 5314187.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5301333.5 ns 5282833 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4680604 ns 4979791 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 198982.5 ns 198405.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3709 ns 3750 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33311 ns 33352 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15125 ns 15208 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15084 ns 15000 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15250 ns 15209 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15250 ns 15291 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 376159 ns 379437.5 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71208 ns 71625 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71209 ns 71416 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71250 ns 71208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71500 ns 71083 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112893 ns 113188.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 317750 ns 321770.5 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 323708 ns 330770.5 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 334166 ns 319333 ns 1.05
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 320500 ns 326458 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 195635 ns 194877 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1041 ns 1000 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1125 ns 1083 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 959 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23896 ns 23702 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8000 ns 7917 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8625 ns 8125 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8333 ns 8125 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8167 ns 7916 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 263562 ns 263485 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 509521 ns 497624.5 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 479125 ns 471604 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 564625 ns 563708 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 232458.5 ns 218208 ns 1.07
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129625 ns 129739 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1393208 ns 1355292 ns 1.03
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1479000 ns 1470187.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1765792 ns 1719583.5 ns 1.03
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 868125 ns 867375 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 276144 ns 275487 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31637 ns 31436 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6334 ns 6208 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6333 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6458 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6667 ns 6333 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 263537 ns 262275 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1722958.5 ns 1727063 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1735250 ns 1729458 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1733292 ns 1725417 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1763312 ns 1768875 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169598.5 ns 168537 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4353521 ns 4367874.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4379875 ns 4385375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4349063 ns 4367104 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4390959 ns 4357459 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1422688.5 ns 1262273 ns 1.13
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6938 ns 6708 ns 1.03
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6875 ns 6541 ns 1.05
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7166 ns 7000 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6583 ns 6875 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20547 ns 20525 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 50541 ns 33063 ns 1.53
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 50312.5 ns 33083 ns 1.52
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 51250 ns 48041.5 ns 1.07
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 58249.5 ns 53792 ns 1.08
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 308428 ns 291536.5 ns 1.06
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17750 ns 17333.5 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 17875 ns 17792 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 19125 ns 18209 ns 1.05
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17500 ns 17666 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18339 ns 18396 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53375 ns 53209 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53166 ns 53417 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53250 ns 53292 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53458 ns 53375 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 344770 ns 338706.5 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75459 ns 75500 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75375 ns 75417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75395.5 ns 75292 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75458 ns 75292 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47276 ns 46489 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 336417 ns 329084 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 341125 ns 336667 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 339250 ns 328958 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 336541 ns 323917 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 213552 ns 209091.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1489000 ns 1486166 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1522292 ns 1517709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1529458 ns 1525792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1468458 ns 1464375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52575 ns 52406 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5115542 ns 5153729.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5292541 ns 5303250 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5289458.5 ns 5257500 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4978625 ns 4990145.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 206120 ns 203681 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28125 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28250 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28208 ns 28375 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24358 ns 24536 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66334 ns 66708 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66167 ns 66125 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66209 ns 66250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66750 ns 66416 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 526089 ns 535849 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1498042 ns 1468041 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 911000 ns 912854 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1149625 ns 1130187.5 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2098500 ns 2251604 ns 0.93
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 582137 ns 583084 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3080771 ns 3113959 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2593125 ns 2660771 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2751125 ns 2734000 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3818125 ns 3802646 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2100592 ns 2002672 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7913063 ns 7929500 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8011208 ns 8011167 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7901167 ns 7911791.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4863125 ns 4826833 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 81500 ns 81437.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82000 ns 83395.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 84125 ns 84437.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83083 ns 136500 ns 0.61
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194175 ns 193251.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020500 ns 2033479 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2036292 ns 2014584 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2018708 ns 2016000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2021916 ns 2013958 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 810603 ns 792396 ns 1.02
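
For readers who want to sanity-check the rows above, the last column appears to be the first timing divided by the second, rounded to two decimals (for example, 50541 / 33063 ≈ 1.53). The sketch below parses one row and verifies that relationship. The helper names `parse_row` and `ratio_matches`, the regular expression, and the tolerance are all assumptions made for illustration; they are not part of github-action-benchmark or of this repository.

```python
import re

# Hypothetical row pattern: "<name> <t1> ns <t2> ns <ratio>".
# The name itself contains spaces, so it is matched lazily.
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<t1>[\d.]+) ns\s+(?P<t2>[\d.]+) ns\s+(?P<ratio>[\d.]+)$"
)

def parse_row(line):
    """Split one benchmark row into (name, first_ns, second_ns, ratio)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"not a benchmark row: {line!r}")
    return m["name"], float(m["t1"]), float(m["t2"]), float(m["ratio"])

def ratio_matches(line, tol=0.006):
    """Return True if the reported ratio equals first/second within rounding.

    tol is an assumed allowance for two-decimal rounding in the report.
    """
    _, t1, t2, ratio = parse_row(line)
    return abs(t1 / t2 - ratio) <= tol

row = "bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 50541 ns 33063 ns 1.53"
print(parse_row(row))
print(ratio_matches(row))
```

A ratio well above 1 (such as the 1.53 and 1.52 entries for `bias_activation` under Zygote) means the first timing is noticeably larger than the second. Timings this small are sensitive to noise, so a single run is weak evidence on its own.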

This comment was automatically generated by workflow using github-action-benchmark.
