This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

perf: rework CPU groupnorm implementation #134

Merged
merged 1 commit into from
Aug 18, 2024

avik-pal
Member

No description provided.

avik-pal force-pushed the ap/gn_perf_fix branch 2 times, most recently from 7db5c7d to dad57d8 (August 17, 2024 21:43)
Contributor

github-actions bot left a comment

LuxLib Benchmarks
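
Reading note for the table that follows: each row lists the benchmark name, the timing for the current commit, the timing for the previous commit, and a ratio. The ratio appears to be Current divided by Previous (for example, 6042 ns / 5437.5 ns ≈ 1.11), so values above 1 indicate a slowdown and values below 1 a speedup. The sketch below is a minimal, unofficial way to check this. The names `parse_row` and `is_slowdown` and the 1.5 cutoff are illustrative assumptions, not part of LuxLib or the benchmark tooling.

```python
import re

# Hypothetical pattern for one flattened table row:
# "<benchmark name> <current> ns <previous> ns <ratio>"
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<current>[\d.]+) ns\s+"
    r"(?P<previous>[\d.]+) ns\s+(?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Split a row into (name, current_ns, previous_ns, recomputed_ratio)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    current = float(m["current"])
    previous = float(m["previous"])
    # Assumption: the Ratio column is current / previous.
    return m["name"], current, previous, current / previous


def is_slowdown(ratio, threshold=1.5):
    """Flag a row as a notable slowdown; the threshold is an arbitrary choice."""
    return ratio > threshold


row = (
    "groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)"
    "/zygote/CPU/2 thread(s) 223000 ns 116792 ns 1.91"
)
name, current, previous, ratio = parse_row(row)
print(name, round(ratio, 2), is_slowdown(ratio))
```

Applied to the rows of the table, a check like this makes the groupnorm Zygote CPU entries stand out in both directions: some configurations roughly halve their time while others nearly double it.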

Benchmark suite Current: 835e49c Previous: c1fafb0 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6042 ns 5437.5 ns 1.11
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6208 ns 6395.5 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7625 ns 7833 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5958 ns 6500 ns 0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 128632 ns 119634 ns 1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2508335 ns 2512796 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 885333 ns 721042 ns 1.23
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 423465 ns 431465 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10041 ns 9917 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9834 ns 9834 ns 1
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9833 ns 9958 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9833.5 ns 9958.5 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 540778 ns 541539 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 17733741 ns 18802089 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2391125 ns 2451375 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 701457 ns 692737 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1833 ns 3000 ns 0.61
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1541.5 ns 1750 ns 0.88
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1667 ns 1708.5 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1708 ns 1958 ns 0.87
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21581 ns 21726 ns 0.99
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1321599 ns 1355600 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 217750 ns 217375 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 31260 ns 36941 ns 0.85
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3354.5 ns 4291.5 ns 0.78
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3833 ns 3666 ns 1.05
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4125 ns 4333 ns 0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3458 ns 4167 ns 0.83
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 145164.5 ns 144722 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 9880239.5 ns 8880529 ns 1.11
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1628813 ns 1648666.5 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 151966.5 ns 151411.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57958 ns 57834 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46333.5 ns 46729.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46000 ns 46291 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84395.5 ns 84333 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37356 ns 37352 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 589953 ns 561155 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1085187.5 ns 1107020.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 79351 ns 85191 ns 0.93
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2024854.5 ns 2030042 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2083145.5 ns 2097625 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2096895.5 ns 2087958 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2011750 ns 2016750 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 231392.5 ns 232273.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 8445771 ns 8341437 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 5641416 ns 7266041 ns 0.78
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1521466 ns 1232312 ns 1.23
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 146709 ns 178084 ns 0.82
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 173083 ns 147125 ns 1.18
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 148833 ns 149875 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 150000 ns 147833.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165553 ns 165763.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8451979 ns 8166656 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1654417 ns 1617708.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 174042 ns 173962 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1112833 ns 1106708 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1108542 ns 1119084 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1139375 ns 1111167 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1132833 ns 1119063 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 691332 ns 689246 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 35667293.5 ns 35815575 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 5940750 ns 6026687.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1039131 ns 1035291 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4959 ns 4999.5 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4375 ns 4000 ns 1.09
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5833.5 ns 5042 ns 1.16
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4417 ns 4875 ns 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 93077 ns 91758 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5569355 ns 5740227 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 668313 ns 730666.5 ns 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 59811 ns 62971 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8584 ns 8750 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8354.5 ns 8708 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9000 ns 9125 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8875 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 598095 ns 599371 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 34936414 ns 35003908 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5813042 ns 6751208 ns 0.86
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 388894 ns 390174 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18500.5 ns 19209 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18459 ns 19062.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21458.5 ns 21666.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18333 ns 17584 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 67113.5 ns 66535 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3224484 ns 3239099 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1352625 ns 1353167 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 72711 ns 72651 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223000 ns 116792 ns 1.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220917 ns 116375 ns 1.90
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212500 ns 117666 ns 1.81
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218416 ns 119084 ns 1.83
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 353362 ns 351956 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 15738065 ns 14239465 ns 1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5754750 ns 5649000 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 477395 ns 474255 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 833 ns 0.75
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 584 ns 750 ns 0.78
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 875 ns 792 ns 1.10
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 750 ns 666 ns 1.13
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20409 ns 20726 ns 0.98
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1264144.5 ns 1170656.5 ns 1.08
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 298416 ns 292250 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 31760 ns 32761 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1437.5 ns 1334 ns 1.08
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1416 ns 1458 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1625 ns 1666 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1500 ns 1375 ns 1.09
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 124808 ns 124916 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8831021 ns 8928265 ns 0.99
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1625833 ns 1698771 ns 0.96
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 128451.5 ns 125781 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7292 ns 7334 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 6125 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6166 ns 6083 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10125 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24078 ns 23941 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1330146 ns 1217417.5 ns 1.09
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 438791 ns 680500 ns 0.64
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47170 ns 47021 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 235000.5 ns 232875 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 265209 ns 230291 ns 1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 230083 ns 241958 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 224833 ns 223792 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 184070 ns 187237 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 29584191 ns 33020508 ns 0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8711646 ns 8706959 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 612956 ns 613946 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4125 ns 4166 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4167 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns 4125 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4125 ns 4125 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23455 ns 22984 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 2048672 ns 2096362 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 218812.5 ns 221520.5 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 49110 ns 50180 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16791 ns 16834 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16833 ns 17125 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17667 ns 16958 ns 1.04
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16916 ns 16750 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 198509 ns 195722.5 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 10575947 ns 10651426 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 909375 ns 946584 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 179231.5 ns 180062 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 510083 ns 510750 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 405334 ns 404750 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 405292 ns 404250 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 865625 ns 864833 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113245 ns 113177 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 396869 ns 401031 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 454459 ns 453604.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 248983 ns 249693 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2307416 ns 2323208 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2039645.5 ns 2033333.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2032167 ns 2026228.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3202437 ns 3193583.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 242211.5 ns 240360.5 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 12568672 ns 9442101 ns 1.33
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1984042 ns 1968896 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 762058 ns 758862.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7021 ns 6270.5 ns 1.12
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6292 ns 6666.5 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6750 ns 7687.5 ns 0.88
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6250 ns 6334 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 93474.5 ns 93492.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5518295 ns 5508786.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 803166.5 ns 864625 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 60690 ns 60751 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11875 ns 10083 ns 1.18
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12125 ns 12167 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12083.5 ns 12750 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10916 ns 11584 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 649108 ns 662173.5 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38994231.5 ns 39475440.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5594063 ns 5972771 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 415844.5 ns 403374 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23495 ns 23307 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2211478 ns 2179447 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 227708.5 ns 328812.5 ns 0.69
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 54380 ns 51631 ns 1.05
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2083 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2208 ns 2208 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2208 ns 2083 ns 1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 242428.5 ns 233425 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 11060678 ns 11214097 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 2013125 ns 2043458.5 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 190522 ns 180732 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8375.5 ns 11541 ns 0.73
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9666 ns 11458 ns 0.84
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11604.5 ns 12333 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8750 ns 11000 ns 0.80
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 112442 ns 109091 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3290927 ns 3254189.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 706834 ns 740875 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 75201 ns 75381 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18521 ns 32812 ns 0.56
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 19083 ns 35583 ns 0.54
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18709 ns 33167 ns 0.56
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17854 ns 33583.5 ns 0.53
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 642801 ns 622088 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 17172659 ns 16800283 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 4542667 ns 4505292 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 388784 ns 385519 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 708 ns 0.82
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 35677 ns 35978 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1253475 ns 1178150 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 289812.5 ns 459291.5 ns 0.63
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 45940 ns 46000 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9542 ns 9542 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10833.5 ns 12041.5 ns 0.90
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11062.5 ns 11334 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10209 ns 9833 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 262682 ns 265870.5 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18628091 ns 18598955 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4787791.5 ns 5290625 ns 0.90
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 373353 ns 374714 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397042 ns 397625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288084 ns 287584 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288042 ns 287791 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756667 ns 756375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111881 ns 111994 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 330206.5 ns 329207.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 489542 ns 468416 ns 1.05
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 77540 ns 76521 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1442583 ns 1448895.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1134209 ns 1136209 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1133750 ns 1131396 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2360208 ns 2356083 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 208137.5 ns 207006.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 11262758 ns 9481058 ns 1.19
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1637000.5 ns 1625208.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 323273 ns 321553 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7021 ns 7312 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7334 ns 7541 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8000 ns 8062.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7333 ns 7333.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 148568.5 ns 153081 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5928806 ns 5776534 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 451125 ns 470938 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 60270 ns 61150 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15833.5 ns 16854.5 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13500.5 ns 15708.5 ns 0.86
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15104 ns 15542 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13541 ns 12479.5 ns 1.09
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 993776 ns 1021250 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 42935151 ns 41825159 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5915083 ns 6411500 ns 0.92
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 436475 ns 426474 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25000 ns 28083.5 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 26292 ns 26625 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27770.5 ns 29208 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 30125 ns 25208 ns 1.20
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 227462 ns 222882.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8039495.5 ns 7852868 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 587166 ns 621583.5 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 117001 ns 117831 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 154292 ns 142916.5 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 146604 ns 114958 ns 1.28
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 153667 ns 149541.5 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 104208 ns 151500 ns 0.69
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1193735 ns 1187333.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42952167 ns 44743732 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5910000 ns 5970583 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 599656 ns 597955 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 77499.5 ns 78166.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75083 ns 76333 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 85042 ns 77042 ns 1.10
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74459 ns 76396 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 234612 ns 233021 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7732168 ns 7690691.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 541187 ns 535250 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 126292 ns 124351.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 291750 ns 283125 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 301271 ns 315833.5 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 291042 ns 293375 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 294709 ns 299584 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1211278 ns 1223336.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42220288 ns 41588499 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6618416.5 ns 6484750 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 698427 ns 698582 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16583.5 ns 16646 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 16875 ns 17500 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 18542 ns 17667 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 16604 ns 16583 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 165607 ns 164724.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5776171 ns 5915268.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 724479 ns 443666.5 ns 1.63
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 239272 ns 237842 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 28625 ns 26583 ns 1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27541 ns 26625 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 29042 ns 27750 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27334 ns 29875 ns 0.91
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 1045694.5 ns 1040078 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41576377 ns 42774820 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5989729.5 ns 6352728.5 ns 0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 704457 ns 699307 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 10833 ns 12687.5 ns 0.85
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11437.5 ns 13166.5 ns 0.87
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13104.5 ns 13479.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11000 ns 13687.5 ns 0.80
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 140187 ns 139794.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3519988 ns 3546983 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 762167 ns 908479 ns 0.84
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 242183 ns 239582 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21958.5 ns 37750 ns 0.58
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 22646 ns 37500 ns 0.60
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 22375 ns 36959 ns 0.61
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 23083 ns 37125 ns 0.62
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 756953 ns 753503 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 22864881 ns 22091236.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5241812.5 ns 5265708.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 689067.5 ns 684491.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 64271 ns 63479 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 63500 ns 64042 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 66042 ns 65666.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 67250.5 ns 63021 ns 1.07
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 122269.5 ns 119325.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3529561 ns 3429402.5 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1376896 ns 1372375 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 239082 ns 237212.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 487542 ns 394042 ns 1.24
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 450354 ns 390833 ns 1.15
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 444375 ns 356104 ns 1.25
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 438791.5 ns 339562.5 ns 1.29
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 554989 ns 553205 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 21251546 ns 20941175 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6388250 ns 5998187.5 ns 1.07
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 721452.5 ns 721532.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7833 ns 7334 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7000 ns 7667 ns 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8417 ns 7541 ns 1.12
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7417 ns 7042 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 163171 ns 160853.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5828490 ns 5800036 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 475875 ns 431041 ns 1.10
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 59641 ns 59401 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15333 ns 15583.5 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15437.5 ns 15229.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14875 ns 15375 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13416 ns 14375 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1017848.5 ns 1015035 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 39985725 ns 38865855 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5602875 ns 5628792 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 407364.5 ns 407024 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6152250 ns 6151583 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6373750 ns 6376417 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6381167 ns 6376250 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11917750 ns 11912312.5 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301774 ns 302602.5 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 302353.5 ns 298133 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19110187 ns 19016604 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 20002875 ns 19946812.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19995333 ns 19945334 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36566937.5 ns 36494333 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1017432 ns 1019494 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1158812 ns 1164692 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 958 ns 917 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1000 ns 1000 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 959 ns 917 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1000 ns 917 ns 1.09
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23513 ns 23519 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2124512 ns 2109375.5 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 232500 ns 324041.5 ns 0.72
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 217342 ns 214402 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3667 ns 3709 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3750 ns 3750 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3666 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3667 ns 3667 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 297403 ns 298735 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11619942.5 ns 11135634 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2171395.5 ns 2209459 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 641337 ns 645226 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8166 ns 8896 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8354 ns 8917 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9604.5 ns 10041.5 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8541 ns 8500 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 136618 ns 136574 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3587474 ns 3620427 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 795083.5 ns 861959 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 67631 ns 67761 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 12250 ns 14187.5 ns 0.86
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 13104.5 ns 14604 ns 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12708 ns 13666 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 12417 ns 13500 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 718472 ns 714394 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22799808 ns 22432282 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5108167 ns 5400166 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 364604 ns 362444 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 334 ns 0.87
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23005 ns 22401 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2119128 ns 2057885 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 224458 ns 322875 ns 0.70
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 51830 ns 51141 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2875 ns 2709 ns 1.06
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2959 ns 2958 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3542 ns 2792 ns 1.27
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2959 ns 2792 ns 1.06
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 213984.5 ns 211494.5 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9979577.5 ns 9841082 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1641188 ns 1694625 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 161492 ns 161162 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11916.5 ns 14229 ns 0.84
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11833.5 ns 14125 ns 0.84
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12604.5 ns 15042 ns 0.84
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11666 ns 13166.5 ns 0.89
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 138131.5 ns 138230.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3454696 ns 3441989 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 884249.5 ns 904000 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 238223 ns 239462 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 22625 ns 30792 ns 0.73
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21334 ns 31666.5 ns 0.67
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21604.5 ns 31083 ns 0.70
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 22542 ns 31083.5 ns 0.73
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 649586.5 ns 650508.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20411511 ns 22682431.5 ns 0.90
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4565459 ns 4794458 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 654817 ns 652296.5 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4375 ns 4417 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4416 ns 4417 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4375 ns 4334 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4416 ns 4375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24134 ns 24503 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2113270 ns 2175697 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 223770.5 ns 225542 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 52760 ns 52080 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16583.5 ns 16709 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16583 ns 16750 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16500 ns 16542 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16542 ns 16500 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 354357.5 ns 351126 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12701524 ns 12669321.5 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1070167 ns 1080541 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 213352.5 ns 211892 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 1959 ns 2083 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2042 ns 2042 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2167 ns 1959 ns 1.11
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2042 ns 1958 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 36744 ns 36935 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1245549.5 ns 1233787 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 284812.5 ns 461250 ns 0.62
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 207412 ns 206092 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 17000 ns 17917 ns 0.95
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 22999.5 ns 18000 ns 1.28
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 19792 ns 17500 ns 1.13
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 18708 ns 16709 ns 1.12
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 303916 ns 304287 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21298274 ns 20764724.5 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5237542 ns 5344187.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 700467 ns 699867 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59083 ns 59916 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 65646 ns 65292 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 65437.5 ns 64708 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 66708 ns 51250 ns 1.30
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66566 ns 66521 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 95451 ns 96421 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 202000 ns 164541 ns 1.23
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 165375 ns 113937.5 ns 1.45
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 170000 ns 141875 ns 1.20
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 312729 ns 312750 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 231402.5 ns 229949 ns 1.01
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 583327 ns 584276 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 82916.5 ns 111000 ns 0.75
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 122000 ns 108542 ns 1.12
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 84875 ns 110729.5 ns 0.77
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81479.5 ns 145625 ns 0.56
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193530.5 ns 193372 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5543361 ns 5531442 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2086125 ns 2074479 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 171071 ns 169392 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1886228.5 ns 1119437.5 ns 1.68
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1899084 ns 1129458 ns 1.68
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1928250 ns 1170417 ns 1.65
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1913125 ns 1244625.5 ns 1.54
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 571942 ns 570757 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27732022 ns 26295440 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9246958 ns 9147146 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1081981 ns 1078351 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21729 ns 21759 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 1940449 ns 2164251 ns 0.90
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 337667 ns 367833 ns 0.92
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 45551 ns 45110 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1875 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 268734 ns 266995 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9968011 ns 9594583 ns 1.04
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1106271 ns 1574833 ns 0.70
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 187482 ns 189592 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10875 ns 9625 ns 1.13
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9895.5 ns 10292 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11500 ns 10312.5 ns 1.12
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 10479 ns 8750 ns 1.20
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 135713 ns 136132.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3680004 ns 3223474 ns 1.14
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 895479 ns 878166 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 240337.5 ns 239692 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10417 ns 10104.5 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10145.5 ns 10625 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9708 ns 10145.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10834 ns 10042 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 573950.5 ns 569140 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19676673.5 ns 20087444 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4463792 ns 4786083 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 651017 ns 650531.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57916 ns 57708 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46625 ns 46917 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46125 ns 46667 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84042 ns 84292 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40444 ns 39840 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1435250 ns 1317862 ns 1.09
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1154417 ns 1152833 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75161 ns 78095.5 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1874959 ns 1897125 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1976312.5 ns 1927479 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1964208 ns 1973791.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1881292 ns 1896417 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 234207 ns 236470 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31957727 ns 32730804 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11022416.5 ns 10909500 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1030711 ns 1022350.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 419042 ns 417791.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 419375 ns 416833.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 420208 ns 418771 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 420542 ns 418062 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 236036 ns 235229 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8282465 ns 7864869 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 542041 ns 534854.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 288253 ns 286883 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 783792 ns 768334 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 778938 ns 735375 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 747250 ns 762250.5 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 782750 ns 674042 ns 1.16
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1135371 ns 1132260.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45251960.5 ns 47588718 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6497188 ns 6440750 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 921490 ns 918239 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3450666.5 ns 3464500 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3366250 ns 3441791 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3434042 ns 3444833 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3445521 ns 3457959 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 179244 ns 175447 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8695448.5 ns 8231556 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1415312.5 ns 1413084 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 410215 ns 431194 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6122020.5 ns 6209167 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6209458 ns 6208292 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6206021 ns 6192917 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6177041.5 ns 6202041.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1069713 ns 1069795 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51379719 ns 50322190 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7520979 ns 7331875 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1567331 ns 1565490.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 471687.5 ns 471791 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 341750 ns 342062.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 342334 ns 342250 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 904375 ns 901625 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46351 ns 46247 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 859225 ns 843184 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 543833 ns 516354 ns 1.05
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 251242.5 ns 249962 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2329250 ns 2340875 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2036125 ns 2041666.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2037791 ns 2038042 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3207625 ns 3197458 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 258561 ns 253522 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 14028982 ns 12874338 ns 1.09
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2234208 ns 2210458 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 790883 ns 788063 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57833 ns 57292 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 45833 ns 46500 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45875 ns 46417 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83792 ns 84125 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28439.5 ns 28420 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1363713 ns 1377166 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1172375 ns 1162021 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 76791 ns 76810 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2024708 ns 2044292 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1952958 ns 2076458.5 ns 0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2099229 ns 2089958 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1979459 ns 1998583.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 239189 ns 240601 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36882348 ns 35733583 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11484625 ns 11180416 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1049721 ns 1043950 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57833 ns 58125 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46208 ns 46583 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46292 ns 46542 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84250 ns 84000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 50347 ns 49930 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 808260.5 ns 777391 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1121187.5 ns 1106417 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 74401 ns 72511 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1924917 ns 1889916 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1970250 ns 1973000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1977791.5 ns 1972479 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1857458 ns 1872625 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 245621.5 ns 246746 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 17757586 ns 18203740 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9925583.5 ns 9664041 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 923799.5 ns 928509 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 459 ns 292 ns 1.57
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 34804 ns 35428 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1218893 ns 1198883.5 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 397375 ns 429645.5 ns 0.92
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 48010 ns 48131 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 6917 ns 1.08
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7416 ns 7791.5 ns 0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8520.5 ns 7208.5 ns 1.18
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7708 ns 7209 ns 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 215607.5 ns 211219 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 21091113 ns 21407176 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5195958 ns 5360666 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 372224 ns 376479 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 291 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32770 ns 32448 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1215240.5 ns 1275807 ns 0.95
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 256917 ns 259000 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 42270 ns 39671 ns 1.07
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 3000 ns 2625 ns 1.14
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2875 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3167 ns 2667 ns 1.19
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 3209 ns 2625 ns 1.22
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 201962.5 ns 200262.5 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 8332092.5 ns 7622969.5 ns 1.09
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 1059833 ns 979750 ns 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 160791.5 ns 155666.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 455375 ns 449792 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 459396 ns 478708 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 457834 ns 452229 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 424000 ns 474583.5 ns 0.89
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 141718 ns 140930 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6004284 ns 6179425.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2282458 ns 2484187.5 ns 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 354068.5 ns 361833 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3807666 ns 3176354 ns 1.20
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3805312 ns 3261062.5 ns 1.17
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3799042 ns 3262333.5 ns 1.16
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3816667 ns 3221209 ns 1.18
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 775227 ns 771390 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 34318701 ns 34655170.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11702250 ns 11378833 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1497905 ns 1317413 ns 1.14
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49826958 ns 49865792 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35532875 ns 35514646 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35542209 ns 35524708 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97217916.5 ns 97122291.5 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1598964 ns 1600652 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1015720 ns 1009800 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154448729 ns 154457333 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112325083.5 ns 112547249.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112474459 ns 112457583 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 295795583.5 ns 295326229 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6481568 ns 6539911 ns 0.99
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5536597 ns 5525076 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 19167 ns 19208 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 18666.5 ns 19021 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 17000 ns 16583 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15459 ns 14958 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 20686 ns 20510 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1139455 ns 1178878.5 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 223292 ns 225334 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 27480 ns 25670 ns 1.07
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 10771 ns 10959 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 9125 ns 9292 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9375 ns 9250 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17291 ns 17229 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 295440.5 ns 293160 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 10592984.5 ns 10236863.5 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1611333 ns 1622291 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 155402 ns 153751 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 9708 ns 9167 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8771 ns 9417 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10167 ns 10833 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8541.5 ns 8833 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 141417 ns 138920 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3531192 ns 3603346 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 816542 ns 839250 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 240452 ns 238792 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9042 ns 11875 ns 0.76
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9792 ns 12292 ns 0.80
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10041 ns 11854.5 ns 0.85
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9750 ns 11917 ns 0.82
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 699140.5 ns 698758 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 23822389 ns 24631890 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 4852938 ns 5382334 ns 0.90
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 674987 ns 668936 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10542 ns 12458 ns 0.85
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9375 ns 13042 ns 0.72
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10687.5 ns 13458 ns 0.79
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9375 ns 12125 ns 0.77
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 138047 ns 137317.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3390144 ns 3499319.5 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 909771 ns 896354.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 71791 ns 69695.5 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13042 ns 24541.5 ns 0.53
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13709 ns 25229 ns 0.54
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13979 ns 24500 ns 0.57
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13875.5 ns 24042 ns 0.58
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 648061 ns 646856 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20050250 ns 23295762.5 ns 0.86
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4504354 ns 4815416 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 356714 ns 351994 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 459 ns 584 ns 0.79
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 583 ns 583 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 584 ns 459 ns 1.27
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 35556 ns 36062 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1229644.5 ns 1169666 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 286417 ns 452041 ns 0.63
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 208252 ns 206923 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8000 ns 8792 ns 0.91
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8458 ns 8875 ns 0.95
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8916 ns 8250 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8417 ns 8375 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 235548 ns 235191.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21701037 ns 22413504.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5360334 ns 5654041 ns 0.95
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 672887 ns 673696.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 16979 ns 17583 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 16416.5 ns 16083 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 14520.5 ns 14333 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10458 ns 11333 ns 0.92
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 21463 ns 21821 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1174474 ns 1137736 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 211958 ns 211417 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 189952 ns 188972 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 31958 ns 32104.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32167 ns 31729.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32208 ns 32167 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32208 ns 32125 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 313430 ns 310996.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11266559 ns 11938483 ns 0.94
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1726583 ns 1824500 ns 0.95
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 609127 ns 604126 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 477020.5 ns 486833.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 445875.5 ns 508541 ns 0.88
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 482750 ns 486625 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 475395.5 ns 454125 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195221 ns 194584 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6242919.5 ns 5926149 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2184791 ns 2088042 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 352983 ns 352254 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3824937 ns 3061708 ns 1.25
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3830041.5 ns 3216520.5 ns 1.19
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3824333 ns 3216333 ns 1.19
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3826021 ns 3168521 ns 1.21
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 581313.5 ns 577264 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 29006462 ns 29149705 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10014250 ns 9970042 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1393034.5 ns 1384533.5 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 787158958 ns 784523458 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 542549542 ns 544902000 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 543228083 ns 543012875 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1564693291.5 ns 1509420062.5 ns 1.04
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22779955 ns 22536899 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14035957 ns 14041441 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3021582792 ns 2510925417 ns 1.20
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 3018211250 ns 1803839541 ns 1.67
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 2290281375 ns 1795547625 ns 1.28
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4758106584 ns 4752835000 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 367725794 ns 307750578 ns 1.19
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88501294 ns 88099937 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76083 ns 77917 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 79333.5 ns 77875 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79333 ns 79042 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 83042 ns 75791 ns 1.10
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 235362 ns 233766.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8035484 ns 8078290 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 539395.5 ns 530146 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 110001 ns 109531 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 243833 ns 192667 ns 1.27
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 277208 ns 291792 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 208708 ns 273687.5 ns 0.76
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 248583 ns 256646 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1123869 ns 1119007 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 62288144 ns 42599415 ns 1.46
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6396917 ns 6376104 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 644607 ns 642377 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199874958.5 ns 199939334 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 138581125 ns 139299000 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139276834 ns 139327708 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 389347000 ns 388944583 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5835227 ns 5821793 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3447300.5 ns 3425144 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 619507958.5 ns 617991979 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 440048792 ns 440381666 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 441312104 ns 440103937.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1181437458 ns 1183443041 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26630511 ns 26658646.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21849307 ns 21776749 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7166 ns 7375 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 6333 ns 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 6042 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10084 ns 9959 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27869 ns 27817 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1499999.5 ns 1182446 ns 1.27
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 607271 ns 639250 ns 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47951 ns 47461 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220187.5 ns 226021 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228250 ns 221917 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223562.5 ns 221000 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 207875 ns 207292 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 238148 ns 239313.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 47009123 ns 32659665 ns 1.44
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8821687.5 ns 8880750 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 541335 ns 531505 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10334 ns 10000 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9270.5 ns 10834 ns 0.86
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10042 ns 10500 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8791 ns 9041 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 134453 ns 134348 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 4752172 ns 3385012 ns 1.40
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 902562.5 ns 900583 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 73011 ns 75651 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7541 ns 9542 ns 0.79
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7917 ns 9625 ns 0.82
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 9000 ns 0.90
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8292 ns 9083.5 ns 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 573820 ns 572551.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 23140180.5 ns 19627467.5 ns 1.18
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4653750 ns 4779292 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 337993.5 ns 324513.5 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 458 ns 500 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 26568 ns 26716 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1525380 ns 1199625 ns 1.27
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 453000 ns 463625 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 48711 ns 50855.5 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9729 ns 10625 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10209 ns 10687.5 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10937.5 ns 10375 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9584 ns 10479.5 ns 0.91
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 270217 ns 270531 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 35388163 ns 23743488 ns 1.49
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5811500 ns 5988708 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 393654 ns 394399 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 106604 ns 106500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 99562.5 ns 99666 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 100458 ns 99312.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146417 ns 146625 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 24179 ns 24426 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1634225.5 ns 1181882 ns 1.38
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 274833 ns 266166 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 190671.5 ns 190032 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 504916.5 ns 502958 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 503833 ns 482375 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 478583 ns 504042 ns 0.95
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 501833.5 ns 514333.5 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 252696 ns 251300 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 14907856 ns 11770034 ns 1.27
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2190209 ns 2216125.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 624282 ns 620221 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5125 ns 6354 ns 0.81
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 7041.5 ns 6666 ns 1.06
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7292 ns 6167 ns 1.18
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 4042 ns 4292 ns 0.94
batchedmm(16, Bsize=32)/forward/GPU/CUDA 16750 ns 17553 ns 0.95
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 73431 ns 73251 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 11396 ns 11562.5 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10916.5 ns 11125 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11000 ns 10792 ns 1.02
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16875 ns 16417 ns 1.03
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 232583 ns 231268 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 373234 ns 372348.5 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39166.5 ns 39396 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 46166.5 ns 50709 ns 0.91
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52375 ns 51500 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13542 ns 13583 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/CUDA 20275 ns 20068 ns 1.01
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 81410.5 ns 79491 ns 1.02
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 37145.5 ns 36312.5 ns 1.02
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 30937.5 ns 31833 ns 0.97
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 32000 ns 30709 ns 1.04
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 63000 ns 57166.5 ns 1.10
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 209011 ns 207413 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 402554 ns 420690 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1667 ns 1875 ns 0.89
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1916 ns 1875 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2125 ns 2000 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1916 ns 1708 ns 1.12
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 20146 ns 20504.5 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1486576.5 ns 1119812.5 ns 1.33
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 319520.5 ns 322917 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 29000 ns 28660 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2084 ns 2250 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2167 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2375 ns 2167 ns 1.10
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2375 ns 2084 ns 1.14
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 222775.5 ns 220155.5 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 12262903.5 ns 9008904 ns 1.36
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1557792 ns 1683062 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 139472 ns 138171 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6417 ns 4937.5 ns 1.30
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4750 ns 5417 ns 0.88
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5584 ns 5875 ns 0.95
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4500 ns 4104.5 ns 1.10
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 161768 ns 160460 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5903240.5 ns 5609681 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 448729 ns 442896 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 64810 ns 61631 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8334 ns 8250 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8458 ns 8500 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8625 ns 8375 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8459 ns 8208 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 951965.5 ns 947221 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 38542963.5 ns 38843143 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5697834 ns 6350958 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 391494 ns 388709 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56708 ns 57375 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57459 ns 58250 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57792 ns 58000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58209 ns 58250 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 38079 ns 38316.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1233354 ns 1212852.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 391875 ns 617562.5 ns 0.63
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 206592 ns 206812 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 449437 ns 477416 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 476791 ns 507791.5 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 466875 ns 472666.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 461875 ns 444792 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 274183.5 ns 276598 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 28394052 ns 26668340 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8043958 ns 7961291 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 802558 ns 797578 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3312667 ns 3332792 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2336249.5 ns 2334417 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2342875 ns 2333542 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6347375 ns 6313375 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/CUDA 204537 ns 205221 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 212302 ns 202667 ns 1.05
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11415541.5 ns 11461208 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8328417 ns 8317917 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8327584 ns 8320916 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21174250 ns 21185000 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 735759 ns 735643 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1061331 ns 1062746 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4958 ns 5000 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4875 ns 5396 ns 0.90
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6666 ns 6291.5 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4750 ns 4645.5 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 154138 ns 153484 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5561590 ns 5666549 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 804458 ns 812625.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 56461 ns 56320 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7250 ns 7250 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7292 ns 7270.5 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7334 ns 7500 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7541 ns 7208.5 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 807821.5 ns 801828 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 35003795 ns 35498019 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5490583.5 ns 5867167 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 378974 ns 378564 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 91396.5 ns 131459 ns 0.70
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 102583 ns 142167 ns 0.72
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 127792 ns 142146 ns 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 97645.5 ns 164292 ns 0.59
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 157957 ns 149293 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5977195 ns 5868037 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2236250 ns 2917812.5 ns 0.77
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 186092 ns 186522 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2024312 ns 1341416.5 ns 1.51
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1994291 ns 1357249.5 ns 1.47
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2024250 ns 1356499.5 ns 1.49
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2017625 ns 1449166.5 ns 1.39
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 770062 ns 771014 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32930857 ns 32793983.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11041979.5 ns 11204625 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1258193 ns 1252157 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 34583.5 ns 34437 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 36312.5 ns 35083 ns 1.04
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 35209 ns 35333 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 542 ns 542 ns 1.00
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15885 ns 15590 ns 1.02
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 72280 ns 71891 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2604.5 ns 2500 ns 1.04
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3000 ns 2959 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3000 ns 2792 ns 1.07
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2333 ns 2145.5 ns 1.09
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 149559 ns 148484 ns 1.01
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 352028.5 ns 350923.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7250 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6041 ns 6083 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6084 ns 5958 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10250 ns 10000 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37142 ns 37340 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1251206 ns 1174226.5 ns 1.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 616812.5 ns 360000 ns 1.71
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48800 ns 48850 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215687.5 ns 214187.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228792 ns 231708.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222646 ns 228771 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 207791.5 ns 206750 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 251808 ns 256068 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26763071 ns 26779708 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7803770.5 ns 7835125.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 518295 ns 515070 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3959 ns 3917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3959 ns 3916 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21823 ns 22329 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2050078 ns 2087967 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 247063 ns 247041 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 47650 ns 47341 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14917 ns 15000 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14958 ns 15000 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14958 ns 14875 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14750 ns 14791 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 338791 ns 341001 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11385289.5 ns 11479170 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 1006250 ns 1027583 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 204762 ns 206562 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 102208.5 ns 109209 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 104917 ns 138417 ns 0.76
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 129750 ns 111250 ns 1.17
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 104021 ns 147334 ns 0.71
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 159126.5 ns 160787 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5714397.5 ns 5651486 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2300354 ns 2194562.5 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 185907 ns 184382 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1922334 ns 1235958 ns 1.56
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1913292 ns 1245041.5 ns 1.54
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1926521 ns 1239895.5 ns 1.55
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1902937.5 ns 1331583 ns 1.43
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 764317.5 ns 762378 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 29980716 ns 31522616.5 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10895437.5 ns 10791938 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1074156 ns 1076411 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19042 ns 21604 ns 0.88
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19167 ns 21479.5 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20250 ns 23312.5 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18354.5 ns 20521 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 125836 ns 125219.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3384577.5 ns 3275413 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1379375 ns 1406750 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 81551 ns 81081 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 218750 ns 131166.5 ns 1.67
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226250 ns 141083.5 ns 1.60
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217229 ns 160916 ns 1.35
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 229334 ns 123166.5 ns 1.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 571424 ns 566379.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20023693.5 ns 19392377 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6043770.5 ns 6064791 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 480905 ns 477160 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 24937 ns 23541 ns 1.06
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 31958 ns 31042 ns 1.03
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 28250 ns 29374.5 ns 0.96
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1375 ns 1520.5 ns 0.90
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16933 ns 16877.5 ns 1.00
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 83400 ns 83491 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 5062.5 ns 4396 ns 1.15
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5229 ns 5458 ns 0.96
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 4958 ns 5104.5 ns 0.97
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4292 ns 4958 ns 0.87
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 229400 ns 228917.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 383374 ns 372328 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 305459 ns 306604.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 305916.5 ns 306333 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 307625 ns 307917 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 306333 ns 304979 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 257756.5 ns 258687 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7522370.5 ns 7844322 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 555271 ns 573771 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 279323 ns 277933 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 581709 ns 542250 ns 1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 540500 ns 543479 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 595667 ns 585709 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 591291.5 ns 538604 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1191373.5 ns 1187647 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45474161 ns 43445666 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6084625 ns 6112625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 873299 ns 870608 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19166.5 ns 21125 ns 0.91
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20750 ns 21749.5 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21459 ns 21812.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21916 ns 18020.5 ns 1.22
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 132926.5 ns 132203 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3790040 ns 3739210.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1493395.5 ns 1504000 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 77091 ns 78030.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219437.5 ns 167333 ns 1.31
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219583.5 ns 134500 ns 1.63
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220812.5 ns 159042 ns 1.39
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214062.5 ns 124209 ns 1.72
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 872731 ns 880193 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25269894 ns 25393029 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7298812.5 ns 7251645.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 548566 ns 543615 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6354.5 ns 6520.5 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7125 ns 6896 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7584 ns 7500 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6791 ns 6084 ns 1.12
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 156211.5 ns 156999 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5636934.5 ns 5793375.5 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 787166 ns 870687 ns 0.90
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 69301 ns 68881 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10104.5 ns 9583.5 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10417 ns 10687.5 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10437.5 ns 10958 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9979.5 ns 9604 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 883875 ns 885668 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 37160740 ns 37904917 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5369959 ns 5829833 ns 0.92
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 397264 ns 393169 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5875 ns 5041 ns 1.17
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6333 ns 5520.5 ns 1.15
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6708.5 ns 6479.5 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5125 ns 6729 ns 0.76
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 159436 ns 159186.5 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5501137.5 ns 5729090.5 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 793167 ns 876750 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 60120 ns 60940 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7479.5 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7584 ns 7875 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7792 ns 7625 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7604.5 ns 7292 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 829204 ns 832069 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 37986219 ns 38659149 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5619458.5 ns 6332125 ns 0.89
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 401834 ns 393804 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14497458 ns 14495625 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10132666 ns 10140833 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10124562.5 ns 10106521 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27906167 ns 27875104.5 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 530686 ns 529047 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 398124 ns 386994 ns 1.03
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46170625 ns 46373458 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33506291.5 ns 33487354.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33427292 ns 33494417 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85895667 ns 85793667 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2656682 ns 2653722 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3278943 ns 3282222 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 67166.5 ns 67187.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 66292 ns 66875 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 67958.5 ns 67708 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 67104 ns 64833 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 133780.5 ns 135830.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3478485 ns 3619962.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1540000 ns 1520104 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 228942 ns 228172.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 440145.5 ns 364542 ns 1.21
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 483083 ns 375833 ns 1.29
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 453417 ns 409937.5 ns 1.11
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 466041.5 ns 354250 ns 1.32
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 792134 ns 800863 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25961291 ns 26204351 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7763375 ns 7686729.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 805429 ns 805248 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 625 ns 0.87
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 583 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 542 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32596 ns 33339 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1178603 ns 1169338.5 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 288000 ns 468292 ns 0.62
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 49240 ns 49541 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9333.5 ns 9417 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10292 ns 10042 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11042 ns 10334 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9750 ns 9791.5 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 297797 ns 301951 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 24618078 ns 22325214.5 ns 1.10
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5354667 ns 5516417 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 388694 ns 390349 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9833 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9875 ns 9875 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9792 ns 9792 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9792 ns 9750 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23426 ns 23482 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2048844 ns 2052000 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 224416 ns 223708 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 216872 ns 216553 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 46208 ns 46125 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 46416 ns 46750 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46209 ns 46209 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 46083 ns 45709 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 308866 ns 309105 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11831397 ns 11209966.5 ns 1.06
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 968625 ns 958416.5 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 622956.5 ns 624786 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56166 ns 56500 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57167 ns 57333 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57083 ns 57041 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57917 ns 57875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 29168 ns 29824 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1167578 ns 1164396.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 411062.5 ns 641250 ns 0.64
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 206532 ns 204527 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 452416 ns 454458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 509583.5 ns 478167 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 476416 ns 478521 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 444959 ns 446521 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 254124 ns 259076 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31789996 ns 33336521.5 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9426125 ns 9171667 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 849939 ns 842558 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 640625 ns 637625 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 619541.5 ns 644625 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 630458 ns 639750 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 662667 ns 661334 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 224984.5 ns 225397 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7955006 ns 8285194.5 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1362896 ns 1367250 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 251843 ns 241067 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2230750 ns 2232333 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2232854.5 ns 2248792 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2228750 ns 2225834 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2273000 ns 2254541.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1043963.5 ns 1060393 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 47501673.5 ns 47792349 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8542916 ns 9665042 ns 0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1385645 ns 1382969 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20937.5 ns 23229 ns 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20333 ns 22666.5 ns 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21604.5 ns 24875 ns 0.87
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19770.5 ns 22250 ns 0.89
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 127036.5 ns 127723 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3566947 ns 3567777 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1521478.5 ns 1513396 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 76451 ns 75991 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 242292 ns 169709 ns 1.43
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 258208.5 ns 136646.5 ns 1.89
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 233313 ns 153833.5 ns 1.52
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220271 ns 139917 ns 1.57
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 826844.5 ns 837422 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26518846 ns 25105631 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7525209 ns 7698542 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 565015.5 ns 564226 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 583 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23432 ns 23760 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1184152 ns 1214816 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 488395.5 ns 475542 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 53071 ns 50011 ns 1.06
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10771 ns 11125 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10854.5 ns 11000 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 11187.5 ns 11292 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10541.5 ns 10125 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 275712 ns 279802.5 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24826970.5 ns 25720935 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6139916.5 ns 6173209 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 406834 ns 412239 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8083 ns 9313 ns 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9625 ns 10187.5 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10625 ns 11042 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8791.5 ns 9396 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 135624.5 ns 137005 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3326426 ns 3441963 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 887166.5 ns 884667 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 68240.5 ns 68471 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7292 ns 9542 ns 0.76
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8125 ns 9500 ns 0.86
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8000 ns 9083.5 ns 0.88
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8000 ns 8834 ns 0.91
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 551032.5 ns 555464 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17379523 ns 17569095 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4314688 ns 4514000 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 336788 ns 331008 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1333 ns 1479.5 ns 0.90
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1708.5 ns 1792 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1958 ns 1792 ns 1.09
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1646 ns 1500 ns 1.10
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 20952 ns 21494 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1135875.5 ns 1185348 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 306459 ns 314000 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 192772 ns 191562 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3209 ns 3334 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3437.5 ns 3333 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3541 ns 3250 ns 1.09
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3416.5 ns 3208 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 239311.5 ns 241554 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10237973 ns 10425050 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1678250 ns 1821521 ns 0.92
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 596836 ns 592756 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 149416 ns 149166 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 129833 ns 130208.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 129499.5 ns 128645.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 225958.5 ns 225834 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 24017 ns 24389 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1143847 ns 1201989 ns 0.95
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 301229 ns 302583 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 36430 ns 36510 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 143250 ns 143708 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 123813 ns 111020.5 ns 1.12
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 127875 ns 134437.5 ns 0.95
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 255500 ns 262958 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 238432 ns 239921 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10471458 ns 10686272 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2041416 ns 2070938 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 221102 ns 224742 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7250 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 6083 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6041 ns 6083 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10167 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33217 ns 33902 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1230067 ns 1196982 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 352792 ns 714417 ns 0.49
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50240 ns 50290 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220625 ns 263208 ns 0.84
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229604 ns 231250 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 239229.5 ns 241146 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214270.5 ns 213791 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 268350 ns 274134 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 28209273 ns 27340584 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8153604 ns 7954750 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 529575 ns 527115 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 15437.5 ns 14979 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15208 ns 15020.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 15875 ns 16209 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 15166.5 ns 14812.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 156499 ns 156749.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5474853 ns 5690134 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 805708 ns 864916 ns 0.93
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 239243 ns 238262 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23812.5 ns 22625 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 22875 ns 24458.5 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24041 ns 24354.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23312 ns 23250 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 922540.5 ns 928671 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 39041185.5 ns 41299107 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5759833 ns 5736208.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 693637 ns 690377 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9562.5 ns 10917 ns 0.88
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9625 ns 11854.5 ns 0.81
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10709 ns 12167 ns 0.88
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 10333 ns 11666.5 ns 0.89
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 140292 ns 141298.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3529937.5 ns 3563617 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 797604 ns 843146 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 71605.5 ns 70831 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13459 ns 29354 ns 0.46
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14353.5 ns 30000 ns 0.48
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14375 ns 29083 ns 0.49
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14334 ns 28771 ns 0.50
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 760819 ns 768813 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 20915172 ns 20417695 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5000417 ns 5157104.5 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 373884 ns 372409 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8812.5 ns 13333 ns 0.66
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9625 ns 12791.5 ns 0.75
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10042 ns 13208 ns 0.76
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10583.5 ns 12146 ns 0.87
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 138903.5 ns 139709 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3466863.5 ns 3557647.5 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 883500 ns 909917 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 70570 ns 73631 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12563 ns 23750 ns 0.53
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13083.5 ns 23833 ns 0.55
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13124.5 ns 24208 ns 0.54
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12729.5 ns 23458 ns 0.54
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 622007 ns 626538 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19686555 ns 19300609 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4122375 ns 4583187 ns 0.90
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 346024 ns 344169 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 31584 ns 31375 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 35208 ns 33396 ns 1.05
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 31209 ns 31208 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1917 ns 2021 ns 0.95
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16552 ns 16930 ns 0.98
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 74491 ns 74830 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5208 ns 5125 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5500 ns 5354.5 ns 1.03
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5333.5 ns 5166 ns 1.03
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6541.5 ns 6375 ns 1.03
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 149130 ns 150124.5 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 374329 ns 370864 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 250 ns 375 ns 0.67
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 25816 ns 26814 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1207741.5 ns 1225504 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 293417 ns 454292 ns 0.65
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48471 ns 48391 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7083.5 ns 7500 ns 0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7750 ns 7833 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 7250 ns 1.09
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7667 ns 7208.5 ns 1.06
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 197623 ns 201565.5 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 22789021.5 ns 23235084.5 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5867521 ns 5935021.5 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 392354 ns 390884 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 1959 ns 2083 ns 0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2042 ns 2042 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2041 ns 1958 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 2041 ns 2000 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 27291 ns 28025 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1215403 ns 1226218.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 313396 ns 471291 ns 0.66
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 209502 ns 207862 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17333.5 ns 17500 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17749.5 ns 17854 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17916 ns 17958 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18000 ns 17083 ns 1.05
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 282695.5 ns 285936.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 25744566.5 ns 25228181.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5995917 ns 6216792 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 717467 ns 714727 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 192292 ns 148333 ns 1.30
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 150833 ns 176416.5 ns 0.85
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 175500 ns 152625 ns 1.15
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 148042 ns 170125 ns 0.87
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 218870 ns 222295.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8000798 ns 7715073.5 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1455667 ns 1456521 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 210472 ns 210297 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1322708.5 ns 1320395.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1316291.5 ns 1319125 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1325604 ns 1317292 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1338875 ns 1294854 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 979638 ns 992319 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 45361365.5 ns 44464195.5 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6376521 ns 6790917 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1017855.5 ns 1121586.5 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23854 ns 26396 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 27083.5 ns 25667 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28459 ns 26916.5 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25563 ns 25375 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 264500.5 ns 268095.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7809214 ns 8204793 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 895084 ns 733937.5 ns 1.22
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 118461 ns 119821 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 119708 ns 173375 ns 0.69
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 177541 ns 155792 ns 1.14
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 143167 ns 131333 ns 1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 168708 ns 178833 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1198851 ns 1210208.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45600571 ns 43840934 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6093687.5 ns 6272209 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 605316.5 ns 603516 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 250 ns 375 ns 0.67
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 334 ns 292 ns 1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23314 ns 23508 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1228429.5 ns 1238248 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 394000 ns 445167 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 51920 ns 48931 ns 1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7209 ns 7875 ns 0.92
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7667 ns 7709 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8167 ns 7500 ns 1.09
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7875 ns 7542 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 203859 ns 207810 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25563216.5 ns 24656260.5 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5757500 ns 6158292 ns 0.93
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 399854 ns 393874 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6646 ns 5875 ns 1.13
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6125 ns 6916 ns 0.89
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6937.5 ns 7125 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5625 ns 5499.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 166454.5 ns 167316.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5809274 ns 5777197 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 708770.5 ns 447625 ns 1.58
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 238783 ns 237783 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10167 ns 10166.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9959 ns 10166 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10250 ns 10041 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9833 ns 9666 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 967413.5 ns 970034.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 41785539 ns 42782054 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5925645.5 ns 6418791.5 ns 0.92
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 679277 ns 681416.5 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 666 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 666 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22759 ns 22878 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2139858 ns 2049585 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 226937.5 ns 324604 ns 0.70
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 216062 ns 215392 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4625 ns 4583 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4875 ns 4833 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4791 ns 4584 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4666 ns 4583 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 236640.5 ns 238765.5 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9553141 ns 9762790 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1652125 ns 1655625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 600776 ns 596696 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 9167 ns 8937.5 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8500 ns 8875 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9520.5 ns 10958 ns 0.87
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8521 ns 8667 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 137766 ns 139144 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3877362 ns 3594360 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 786312.5 ns 862042 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 69650.5 ns 71351 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 11625 ns 0.70
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9062.5 ns 11833 ns 0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8708 ns 11250 ns 0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8875 ns 11062.5 ns 0.80
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 669732 ns 677180 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 21485053 ns 20723580 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 4919771 ns 5232500 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 353474 ns 352763 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126458 ns 127209 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129000 ns 129333.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 129520.5 ns 128479.5 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 181166.5 ns 183250 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46508 ns 46780 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 93291 ns 93411 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 302834 ns 339917 ns 0.89
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 343749.5 ns 329041 ns 1.04
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 334792 ns 341708 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 592625 ns 609125 ns 0.97
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 206983 ns 208062.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 491556 ns 486820 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398125 ns 397604.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288125 ns 288333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288125 ns 288250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 757209 ns 756875 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43574.5 ns 43764 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1416178 ns 1410294 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 500459 ns 493709 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 83591 ns 83675.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1457375 ns 1466291.5 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1126667 ns 1136791 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1132875 ns 1135375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2362562.5 ns 2361708 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 263463.5 ns 265609 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 13311016 ns 11700873 ns 1.14
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1816292 ns 1792250 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 354288 ns 353883.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 547708 ns 599750 ns 0.91
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 649084 ns 646541 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 652396 ns 642354.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 654334 ns 651750.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 201982 ns 222052.5 ns 0.91
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8255620.5 ns 8001504 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1373250 ns 1368083.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 253218 ns 250553 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2444583 ns 2433792 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2438958 ns 2453333 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2456021 ns 2450084 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2465104 ns 2446146 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1059434 ns 1076593 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49903929 ns 49144651 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9206000 ns 9471917 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1486386 ns 1481145 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 33750 ns 33792 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 35833 ns 35646 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 35250 ns 33916.5 ns 1.04
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 708.5 ns 979.5 ns 0.72
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16085 ns 16163 ns 1.00
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 73280 ns 73271 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3000 ns 3041 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3208 ns 3416.5 ns 0.94
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3500 ns 3208 ns 1.09
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3208 ns 3042 ns 1.05
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 147479 ns 149087.5 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 345454 ns 347298 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 406479.5 ns 406917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 408083 ns 409000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 409084 ns 408542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 421791 ns 421709 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 43788.5 ns 44262 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1374390 ns 1423580 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1159229.5 ns 1169583 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 242877.5 ns 240142.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3872667 ns 3869979.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3987437 ns 3995917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3999542 ns 3987625 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3767021 ns 3783042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 248367 ns 254172 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 35969162.5 ns 36698272 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11556812.5 ns 11782500.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1239688 ns 1239597.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3916 ns 3917 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3958 ns 3916 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34416 ns 34650 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1254522 ns 1238882 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 183083.5 ns 261875 ns 0.70
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 43435.5 ns 42510 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15792 ns 15833 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 16000 ns 16041 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15958 ns 15750 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15541 ns 15459 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 272573.5 ns 274536 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 9029094 ns 9047424.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 888333 ns 874416.5 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 173932 ns 179351 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404292 ns 403937.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295917 ns 295833 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 295562.5 ns 295583 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 761375 ns 760750 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113129.5 ns 113238 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1005696.5 ns 1008963 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 442209 ns 409875 ns 1.08
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 91631 ns 90691 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1471792 ns 1492104.5 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1150417 ns 1161916 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1155729.5 ns 1160479 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2383583 ns 2384834 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 247677 ns 255747 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 9605874 ns 12021719 ns 0.80
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1880667 ns 1920937.5 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 355334 ns 355564 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 583 ns 583 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 26332 ns 26941 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1258812 ns 1191246 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 457667 ns 444000 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 211573 ns 211532 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8062.5 ns 8688 ns 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8666.5 ns 9083 ns 0.95
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9083 ns 8375 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8791.5 ns 8500 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 216840 ns 220821.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 24545217 ns 25176337 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5775000 ns 6029000 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 704998 ns 698797 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 834604 ns 832791.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 617104 ns 619250 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 621854.5 ns 616458 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1542333.5 ns 1546542 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 131980 ns 131009 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 163901 ns 163946.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2694104 ns 2694041.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 2004583.5 ns 2004458 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 2005604 ns 2011312.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4957938 ns 4947688 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 252564.5 ns 250909 ns 1.01
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 882654.5 ns 881169 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32204 ns 32264 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1208893 ns 1169718 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 431375 ns 445458 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 50650 ns 48990 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7041.5 ns 7604.5 ns 0.93
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7416 ns 7667 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7791 ns 7417 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7688 ns 7334 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 226380.5 ns 228776.5 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21316304 ns 21955889 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4812583 ns 5605854.5 ns 0.86
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 372278.5 ns 371843.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2369521 ns 2428708 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2394479 ns 2407042 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2397750 ns 2383833 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2391083.5 ns 2399125 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 197960 ns 216906.5 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8220871.5 ns 7952502 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1425417 ns 1453042 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 360798.5 ns 358813.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4620416 ns 4653958 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4653625 ns 4663000 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4665708.5 ns 4642292 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4681167 ns 4666271 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 967988.5 ns 972209 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 47578796 ns 46330123 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6898542 ns 6641750 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1260673 ns 1413054.5 ns 0.89
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6729 ns 7083.5 ns 0.95
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7104 ns 7375 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7209 ns 7145.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7333 ns 7583 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23200 ns 23810 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1027845 ns 1108764.5 ns 0.93
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 272042 ns 271104 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 39230 ns 38020 ns 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 53333 ns 69417 ns 0.77
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 33562 ns 32917 ns 1.02
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 48645.5 ns 49958 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 49125 ns 64167 ns 0.77
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 237637 ns 236370 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10782045 ns 10552299 ns 1.02
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2087041.5 ns 2069000 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 222347 ns 236452 ns 0.94
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 21792 ns 21875 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 25875 ns 25479 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 24334 ns 24583.5 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5271 ns 5292 ns 1.00
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18207 ns 18099 ns 1.01
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 85991 ns 85341 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12000 ns 12042 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10500 ns 10500.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 11000 ns 10375 ns 1.06
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18437.5 ns 18292 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 246951.5 ns 245984.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 378013 ns 375823.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 405625 ns 405750 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 297042 ns 296958 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296542 ns 296709 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 763333 ns 762834 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46048 ns 46525 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1384674.5 ns 1383371.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 407292 ns 429125 ns 0.95
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 89421 ns 89211 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1487979 ns 1492083 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1159229 ns 1167583 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1164583 ns 1165979.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2387791.5 ns 2388333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 298677.5 ns 295648 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 14095391.5 ns 12300838 ns 1.15
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2104333 ns 2096021 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 380349 ns 377223 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 433750 ns 435625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 436520.5 ns 438500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 436541.5 ns 438792 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 448583 ns 448750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 55246 ns 55236 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1030654 ns 1047143 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1162770.5 ns 1128750 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 237662 ns 236028 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3862458 ns 3892667 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4025771 ns 4033792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4032500 ns 4016916.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3818750 ns 3818500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 268759 ns 271864 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31040419 ns 31422713 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10498666 ns 10635458 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1247197.5 ns 1232417 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8750 ns 8709 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 7667 ns 7667 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 7667 ns 7625 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12459 ns 12375 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24118 ns 23822 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2224323 ns 2169618 ns 1.03
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 225292 ns 226458 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 218122 ns 217182 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45125 ns 45208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45167 ns 45792 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45666 ns 45167 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 44917 ns 45042 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 364860.5 ns 362753 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 13727353.5 ns 11448473 ns 1.20
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1837020.5 ns 1763770.5 ns 1.04
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 675127 ns 670177 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 86208 ns 160583 ns 0.54
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 87833 ns 135792 ns 0.65
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 124750 ns 138250 ns 0.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 91625 ns 122542 ns 0.75
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190737.5 ns 190208 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5938165.5 ns 5709657 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2108667 ns 2138667 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 185242 ns 206202 ns 0.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2002125 ns 1263083 ns 1.59
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2012583.5 ns 1291854 ns 1.56
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2000062 ns 1283542 ns 1.56
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2024270.5 ns 1348667 ns 1.50
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 574437 ns 573721 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28287780 ns 27951504.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9414709 ns 9876750 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1110142 ns 954730 ns 1.16

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal force-pushed the ap/gn_perf_fix branch 3 times, most recently from 1409228 to 496707b (August 17, 2024 23:45)
Review thread on src/impl/groupnorm.jl (outdated, resolved)
@avik-pal force-pushed the ap/gn_perf_fix branch 2 times, most recently from 2d8396a to e8493fb (August 18, 2024 00:25)
@avik-pal merged commit f49968a into main (Aug 18, 2024)
67 of 71 checks passed
@avik-pal deleted the ap/gn_perf_fix branch (August 18, 2024 02:00)