This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

chore: trigger build #177

Closed
wants to merge 1 commit into from

avik-pal
Member

Don't merge: triggering a build with the new Enzyme release.

@avik-pal avik-pal closed this Oct 17, 2024
@avik-pal avik-pal deleted the ap/trigger11 branch October 17, 2024 23:05
Contributor

@github-actions github-actions bot left a comment

LuxLib Benchmarks
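
Each row in the results that follow gives a benchmark name, the time for the current commit, the time for the previous commit, and a ratio. In the rows shown, the ratio matches current time divided by previous time, so a value above 1 means the current commit was slower in that run. The snippet below is a minimal sketch for checking this, not part of the benchmark tooling: the row layout `<name> <current> ns <previous> ns <ratio>` is inferred from the rows here, and `parse_row` and `is_slower` are hypothetical helper names.

```python
import re

# Assumed row layout, inferred from the table: "<name> <current> ns <previous> ns <ratio>"
ROW = re.compile(
    r"^(?P<name>.+?) (?P<cur>[\d.]+) ns (?P<prev>[\d.]+) ns (?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Split one benchmark row into (name, current_ns, previous_ns, reported_ratio)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return m["name"], float(m["cur"]), float(m["prev"]), float(m["ratio"])


def is_slower(current_ns, previous_ns, threshold=1.0):
    """True when current / previous exceeds the threshold (hypothetical helper)."""
    return current_ns / previous_ns > threshold


row = (
    "layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU "
    "418904 ns 601544 ns 0.70"
)
name, cur, prev, reported = parse_row(row)
computed = cur / prev  # recompute the ratio from the two timings
print(name, round(computed, 2), reported, is_slower(cur, prev))
```

Single runs of sub-microsecond CPU benchmarks are noisy, so a ratio slightly above or below 1 on its own is weak evidence of a real change.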

Benchmark suite Current: 9b7286d Previous: 604783f Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5042 ns 5375 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5125 ns 5250 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6583 ns 7708.5 ns 0.85
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5125 ns 5416 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 106934 ns 113361 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2903264 ns 2795172 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 418904 ns 601544 ns 0.70
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10042 ns 9729.5 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9875 ns 9938 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10333 ns 10167 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9750 ns 11063 ns 0.88
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 509431 ns 544547 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 18145309 ns 17852957 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 664847 ns 629346 ns 1.06
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1312.5 ns 1500 ns 0.88
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1604 ns 1458 ns 1.10
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1917 ns 1771 ns 1.08
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1312.5 ns 1583 ns 0.83
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 20523 ns 20770 ns 0.99
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1336257 ns 1342503 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 31420 ns 30997 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3521 ns 4104 ns 0.86
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3959 ns 4500 ns 0.88
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4333 ns 4500 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3833.5 ns 4333 ns 0.88
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 136761.5 ns 134970 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 9485308 ns 8677498 ns 1.09
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 147442 ns 138579 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57417 ns 57666.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47541 ns 46875 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47542 ns 47125 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80541 ns 81458 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37101 ns 36587 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 559661 ns 582336 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 79531 ns 69420 ns 1.15
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2034750 ns 2030375 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2084521.5 ns 2088625 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2070833 ns 2086625 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2000729 ns 1998562 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 219850 ns 217216 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 8230111 ns 8077777 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1184482 ns 930850 ns 1.27
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 148042 ns 175083 ns 0.85
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 169209 ns 147291 ns 1.15
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 175125 ns 150021 ns 1.17
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147208 ns 151750 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167624 ns 166825 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8071401.5 ns 7358467.5 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 187542 ns 262570 ns 0.71
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1108854.5 ns 1115103.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1109542 ns 1110771 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1126333 ns 1113771 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1112291 ns 1136250 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 648435 ns 639845.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 34776278 ns 33057102 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1012710 ns 864075 ns 1.17
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4917 ns 3792 ns 1.30
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4812.5 ns 4479 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5666.5 ns 6583 ns 0.86
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3875 ns 6375 ns 0.61
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 84692.5 ns 85209.5 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5745043.5 ns 5875726.5 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 61061 ns 59531 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8666 ns 8417 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8708 ns 8750 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9041 ns 9042 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8958 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 567531 ns 557500.5 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 36159388.5 ns 34838164 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 390569 ns 370833 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17125 ns 17958 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17542 ns 16458 ns 1.07
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19917 ns 21125 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17437 ns 17292 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 63455.5 ns 63776.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3109889.5 ns 2927491.5 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 75491 ns 82870 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 218459 ns 212625 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212021 ns 213042 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 224041.5 ns 212771 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214917 ns 212291 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 330824 ns 329859 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 14581752 ns 12611094 ns 1.16
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 471355 ns 405232 ns 1.16
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 583 ns 667 ns 0.87
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 666 ns 625 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 916 ns 875 ns 1.05
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 709 ns 0.82
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 19184 ns 19101 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1196032.5 ns 1145778 ns 1.04
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 30790 ns 26409 ns 1.17
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1458 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1375 ns 1334 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1583 ns 1583 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1375 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 118015 ns 117126.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8426159.5 ns 8850213 ns 0.95
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 123561 ns 115676 ns 1.07
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7416.5 ns 7375 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6041 ns 6041 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 6084 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9875 ns 9958 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24050 ns 23587 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1274308.5 ns 1261233 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47100 ns 52723 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 230291 ns 229167 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 232000 ns 230667 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 266667 ns 267875 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212979 ns 257458 ns 0.83
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 190269 ns 182744 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 30917862 ns 32590762.5 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 603736 ns 548449.5 ns 1.10
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3958 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23276 ns 22860 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 2010135 ns 1933593 ns 1.04
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 48340 ns 39504 ns 1.22
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16833 ns 17042 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16875 ns 16875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17209 ns 17083 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 17166 ns 16875 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 187223 ns 185787.5 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 10116574 ns 10029430 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 177062 ns 162052 ns 1.09
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 491333 ns 491583 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 385750 ns 385625 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 385833 ns 386458 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 847083.5 ns 844083 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113452 ns 113763 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 401173 ns 418213 ns 0.96
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 244572 ns 388657 ns 0.63
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2152291.5 ns 2155583 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1860958 ns 1863374.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1862458 ns 1865167 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3118353.5 ns 3377520.5 ns 0.92
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 230368 ns 229580 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 10676380 ns 9922983 ns 1.08
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 743078 ns 610962 ns 1.22
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6354 ns 6500 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6437.5 ns 5500 ns 1.17
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6958 ns 7667 ns 0.91
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5167 ns 1.14
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 84550.5 ns 84720.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5551475 ns 5300415 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 59440 ns 59932 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11459 ns 11229 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11666.5 ns 11395.5 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12000 ns 12334 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11438 ns 10667 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 610257 ns 602168 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 37918559 ns 38613143.5 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 410225 ns 383917 ns 1.07
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23314 ns 23328 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2245583 ns 2178076 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 49110 ns 41367 ns 1.19
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns 2166 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2167 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 229615 ns 228927.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 11471919 ns 11774524 ns 0.97
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 180602 ns 165900 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8083 ns 9584 ns 0.84
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9000 ns 8333 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10771 ns 9895.5 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8833 ns 8542 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 109160 ns 105241 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3234020 ns 3103348.5 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 72921 ns 71955 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18542 ns 17688 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17250 ns 16666.5 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18770.5 ns 18708 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 19146 ns 17562 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 601629.5 ns 595171 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 18001662.5 ns 16252508 ns 1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 380179 ns 358129 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 34903 ns 34578 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1280813.5 ns 1237584 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 47631 ns 41387 ns 1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8958 ns 9229 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8812 ns 8958.5 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9937.5 ns 9750 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9250 ns 8104 ns 1.14
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 260309 ns 257823 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 20391796 ns 18331589 ns 1.11
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 367584 ns 349944 ns 1.05
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396708 ns 397270.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288333 ns 288083 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288042 ns 288666.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 750541 ns 751792 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111691.5 ns 112022 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 332135 ns 349915 ns 0.95
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 74831 ns 74609 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1459375 ns 1454270.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1133375 ns 1130500 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1131542 ns 1131583 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2439208 ns 2437959 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 201195 ns 200057 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 10334858 ns 7687949 ns 1.34
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 324153 ns 302285 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6917 ns 7750 ns 0.89
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6875 ns 7083.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7542 ns 8312.5 ns 0.91
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6667 ns 6687.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 134405 ns 139766 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5807519.5 ns 5685169 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 59841 ns 60383 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14750 ns 13479.5 ns 1.09
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14500 ns 12750 ns 1.14
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15625 ns 15125 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15708 ns 14625.5 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 889477 ns 923489 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 43796054 ns 42519536.5 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 430555 ns 407432 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23625 ns 25625 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24834 ns 23666 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25000 ns 29417 ns 0.85
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23896 ns 24041 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 185493.5 ns 186240.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7687837.5 ns 7554376 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 115711.5 ns 120505 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 153209 ns 152187 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 145583.5 ns 145250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 153291 ns 146917 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 104333 ns 103958 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1016908 ns 1013659 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42426309 ns 44493070 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 578946 ns 535240 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74292 ns 74583 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74354.5 ns 79584 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 78334 ns 76791.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74375 ns 76083 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 191164.5 ns 190594.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7963971 ns 7364811 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 123771 ns 121316.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 319458 ns 273562.5 ns 1.17
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 311000 ns 304084 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 304750 ns 303333 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 274792 ns 307583 ns 0.89
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1022252 ns 1045024 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 41781714 ns 39473308 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 691217 ns 624192 ns 1.11
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 13104.5 ns 12417 ns 1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 13020.5 ns 12896 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14062.5 ns 14000 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 13375 ns 12500 ns 1.07
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 137424.5 ns 138416 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5742536 ns 5479910 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 235003 ns 226152 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26520.5 ns 27792 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25209 ns 26458 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26625 ns 28437.5 ns 0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27062.5 ns 33937.5 ns 0.80
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 932532.5 ns 924126.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41660148 ns 42086872 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 697632 ns 610976 ns 1.14
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11875 ns 11124.5 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 10584 ns 10333 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12791 ns 12479.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11125 ns 11125 ns 1
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 118662.5 ns 118543.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3679682 ns 3443799.5 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 237602 ns 233176 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 23209 ns 22291.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 23041 ns 22417 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 23208.5 ns 24167 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 22084 ns 28562.5 ns 0.77
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 677613.5 ns 668341 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 22594824 ns 21034051 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 677387 ns 569113 ns 1.19
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 63854.5 ns 68709 ns 0.93
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 64000 ns 62750 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 67208 ns 67520.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 62958 ns 64417 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 102574 ns 102389 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3618527 ns 3441143 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 236212 ns 230751 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 466729 ns 506375 ns 0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 477417 ns 510167 ns 0.94
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 503895.5 ns 475209 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 474459 ns 647896 ns 0.73
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 492357 ns 492781 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 21057710 ns 20664230 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 717612.5 ns 593680 ns 1.21
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7437.5 ns 7958 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7291 ns 6750 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8500 ns 8208 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7187 ns 7562.5 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 136181.5 ns 137965 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5536801 ns 5508177.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 59160 ns 62687 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13125 ns 16125 ns 0.81
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13542 ns 16250 ns 0.83
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15625 ns 16250 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15041 ns 14833 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 898966 ns 900927 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 39073495 ns 39349971 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 401324 ns 388286 ns 1.03
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6152896 ns 6150354 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6373625 ns 6368167 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6374875 ns 6373937.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11911750 ns 11915167 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 348505 ns 345749 ns 1.01
batchedmm(512, Bsize=4)/forward/GPU/oneAPI 52576961 ns 49052559 ns 1.07
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 305468.5 ns 388426 ns 0.79
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19051437.5 ns 19083437.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19952875 ns 19960479.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 20023416.5 ns 19966834 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36511083.5 ns 37142104 ns 0.98
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1017592 ns 1072087 ns 0.95
batchedmm(512, Bsize=4)/zygote/GPU/oneAPI 77843031 ns 78467188 ns 0.99
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1167942 ns 1035750.5 ns 1.13
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 959 ns 958 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1041 ns 1000 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1000 ns 1042 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 959 ns 958 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23689.5 ns 23415 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2126340 ns 2079171 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 210263 ns 200906 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3958 ns 3917 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4083 ns 4000 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4125 ns 4041 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3917 ns 5458 ns 0.72
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 284075 ns 270573.5 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10799500.5 ns 10484095 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 625312 ns 486775 ns 1.28
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7583 ns 8687 ns 0.87
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7750.5 ns 7459 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9500 ns 9334 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7562.5 ns 7834 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 117941 ns 116220 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3465578 ns 3435001.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 66261 ns 71133 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 13125 ns 12125 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11834 ns 11958 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 13208 ns 13000 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 12229.5 ns 11750 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 624724 ns 609643.5 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22359924 ns 21784602 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 355344 ns 341729 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 291 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22660.5 ns 22413 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2090808 ns 2035110 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 46931 ns 44053 ns 1.07
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3000 ns 3000 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3042 ns 2917 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3334 ns 3208 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3167 ns 2916 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 202209.5 ns 194923.5 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9381105 ns 9225861.5 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 156911.5 ns 154488.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11917 ns 11625 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11708 ns 10500 ns 1.12
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12583 ns 12875 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11542 ns 11875 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 117102 ns 115370 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3369882 ns 3433218 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 234642.5 ns 231793 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 22229 ns 22667 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21333 ns 22104.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24542 ns 23625 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21334 ns 26729 ns 0.80
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 575582 ns 555861 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19911746 ns 20482208 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 654527 ns 545740 ns 1.20
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4208 ns 4334 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4209 ns 4333 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4291 ns 4208 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24983 ns 23923 ns 1.04
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2175425.5 ns 2205811 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 48880 ns 44864 ns 1.09
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16250 ns 16500 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16250 ns 16333 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16250 ns 16166 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16083 ns 16292 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 326969 ns 319806 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12126578 ns 10190777 ns 1.19
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 210562 ns 186077 ns 1.13
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 2125 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2083 ns 2084 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2167 ns 2209 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2125 ns 2000 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 36547 ns 35327 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1188039 ns 1213779 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 208582 ns 199242 ns 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 18500 ns 17104 ns 1.08
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 16624.5 ns 20167 ns 0.82
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 17354.5 ns 19000 ns 0.91
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 19688 ns 23083.5 ns 0.85
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 291021.5 ns 284984 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19477895 ns 18211018 ns 1.07
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 687748 ns 583431 ns 1.18
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59437.5 ns 59458 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 64125 ns 65666 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 66375 ns 66125 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51125 ns 52833 ns 0.97
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66729 ns 66304 ns 1.01
batchedmm(16, Bsize=512)/forward/GPU/oneAPI 84867273.5 ns 87707222.5 ns 0.97
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 98541 ns 110241 ns 0.89
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 132875 ns 153041 ns 0.87
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 110729.5 ns 155229 ns 0.71
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 163458.5 ns 130209 ns 1.26
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 232750 ns 286334 ns 0.81
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 214764 ns 210129.5 ns 1.02
batchedmm(16, Bsize=512)/zygote/GPU/oneAPI 145741629.5 ns 149924497 ns 0.97
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 552656 ns 511145 ns 1.08
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 81917 ns 106521 ns 0.77
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 86646 ns 78958 ns 1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 89250 ns 84042 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81916 ns 115521 ns 0.71
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193747 ns 191513.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5265023 ns 5334020 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 169752 ns 267630 ns 0.63
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1909063 ns 1894896 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1910500 ns 1902375 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1920458 ns 1878334 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1930417 ns 1895250 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 515414.5 ns 507442 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 26285784 ns 28152566.5 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 918784.5 ns 825763 ns 1.11
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 291 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22081 ns 21516 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2134831.5 ns 2100524 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 40930 ns 35507 ns 1.15
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 252925 ns 245735 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 10692380.5 ns 9780504 ns 1.09
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 182692 ns 164548 ns 1.11
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9458 ns 10916 ns 0.87
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9188 ns 8291 ns 1.11
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10792 ns 11146 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8417 ns 9500 ns 0.89
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 117162 ns 114788 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3311573.5 ns 3351587 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 237482 ns 232004 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9458 ns 8916 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8875 ns 8854.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11833 ns 10917 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8584 ns 9583 ns 0.90
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 505476 ns 491693 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19448705 ns 19969043 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 634061.5 ns 536332 ns 1.18
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57667 ns 57958 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47167 ns 46625 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46458 ns 46750 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81542 ns 83166 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39572.5 ns 38476.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1348881 ns 1460287 ns 0.92
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77771 ns 71814 ns 1.08
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1918812 ns 1905145.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1974208.5 ns 1949542 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1982521 ns 1958500 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1878771 ns 1874958 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 218975.5 ns 212675 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31988008.5 ns 33332615 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1006995.5 ns 968925.5 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 267167 ns 267500 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 270042 ns 271479.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 270125 ns 271209 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267500 ns 268209 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 199236.5 ns 194219.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7576739 ns 7638787 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 283863 ns 271267 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 601875 ns 585333.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 592188 ns 600292 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 688500 ns 671042 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 585896 ns 845604.5 ns 0.69
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1011793 ns 991966 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43904369 ns 42952243 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 901014.5 ns 831153 ns 1.08
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2191333 ns 2211666 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2189750 ns 2203958 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2188208 ns 2229083 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2218500 ns 2173792 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 156704.5 ns 161646 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8448752 ns 8668502.5 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 408114.5 ns 470965 ns 0.87
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5478916 ns 5493104.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5506458.5 ns 5515875 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5476312.5 ns 5526542 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5505042 ns 6852458 ns 0.80
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 932002 ns 959137 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50568026 ns 49532486 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1719777 ns 1437405 ns 1.20
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 478417 ns 478292 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 346417 ns 345625 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 346833 ns 346750 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 909416 ns 908542 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 45620.5 ns 46909 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 849567 ns 871386 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 245423 ns 393175 ns 0.62
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2168937.5 ns 2137500 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1865604 ns 1869334 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1853875 ns 1859271 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3118125 ns 3380209 ns 0.92
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 258815 ns 264095.5 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 15172873 ns 13390420 ns 1.13
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 774888 ns 632907.5 ns 1.22
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57916.5 ns 57458 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46625 ns 46166 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46417 ns 46250 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83083 ns 78667 ns 1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28412 ns 28560 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1440678 ns 1394875.5 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77121 ns 73147 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2032479.5 ns 2029292 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2090834 ns 2078187.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2093437.5 ns 2063250 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1970145.5 ns 1963958 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 225176 ns 230846.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36084585 ns 36347331 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1040890.5 ns 980522 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58000 ns 58083.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47250 ns 46584 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47042 ns 46917 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82500 ns 79958 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47929 ns 48944 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 802274 ns 829446 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 68761 ns 71428.5 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1925083 ns 1871729 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1980416.5 ns 1973604 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1983458.5 ns 1944167 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1899708 ns 1876792 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 232135.5 ns 238010 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 17785153 ns 18705710.5 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 917619.5 ns 881607.5 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 34988 ns 34878 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1197187 ns 1190778.5 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 48631 ns 47028 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7166 ns 6270.5 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6229 ns 6187.5 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7375 ns 7375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6833 ns 6125 ns 1.12
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 204201 ns 211705.5 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20381663 ns 20119098 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 368334 ns 332741 ns 1.11
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 291 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31750 ns 32902 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1243003 ns 1224139 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 37010.5 ns 36327 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2750 ns 2667 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2708 ns 2667 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3458 ns 4292 ns 0.81
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 3292 ns 3167 ns 1.04
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 184588.5 ns 187662.5 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 8227541 ns 5673429 ns 1.45
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 149351 ns 136635 ns 1.09
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 495354 ns 467208 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 492666 ns 469417 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 465708 ns 466875 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 459375 ns 464979.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 135301 ns 137312 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5838247 ns 5812904.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 324193 ns 361475 ns 0.90
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4083792 ns 4027749.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4070187.5 ns 4071500 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4078000 ns 4067417 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4075916.5 ns 5516750 ns 0.74
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 676920 ns 690445 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32374547 ns 32063716 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1471415 ns 1091915 ns 1.35
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49815292 ns 49879250 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35545937.5 ns 35487583 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35523625 ns 35512833.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 96973125 ns 96974083 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1622419.5 ns 1622377 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/oneAPI 55438113 ns 55868634.5 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1055850.5 ns 1579230 ns 0.67
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154508083.5 ns 154423062.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112410479 ns 112364750 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112319041 ns 112377416 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 295571917 ns 299989812 ns 0.99
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6490301 ns 6468945 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/oneAPI 129703825.5 ns 126761495 ns 1.02
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5576083 ns 7230228 ns 0.77
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 18188 ns 19104.5 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 17500 ns 18375 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 17041 ns 17375.5 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15083.5 ns 15083 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 19789 ns 19621 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1104141 ns 1223248 ns 0.90
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 26210 ns 28854 ns 0.91
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 10875.5 ns 11062.5 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 8791 ns 8833 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9000 ns 9291 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17291.5 ns 17667 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 247103 ns 252067.5 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 10458746 ns 9844493 ns 1.06
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 146071.5 ns 138484 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 9375 ns 7937.5 ns 1.18
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8875 ns 8125 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9979.5 ns 10375 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8291.5 ns 8708 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 118486 ns 120230.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3549725 ns 3557828.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 236803 ns 235119 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10771 ns 9708 ns 1.11
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9312.5 ns 9084 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10125 ns 9792 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10208.5 ns 10667 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 587036 ns 599437 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22177824 ns 22720103 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 658231.5 ns 557070 ns 1.18
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9229 ns 9291.5 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9791 ns 8812.5 ns 1.11
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10416 ns 9917 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9208 ns 8958.5 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 114121 ns 118821 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3396921 ns 3465548.5 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 71690 ns 71593 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 15145.5 ns 13687.5 ns 1.11
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13853.5 ns 13604.5 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 16167 ns 14395.5 ns 1.12
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 14584 ns 14750 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 559358 ns 570663 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19846391.5 ns 20121784.5 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 345674 ns 323504 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 542 ns 542 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 667 ns 584 ns 1.14
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 541 ns 500 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 34284 ns 35088 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1180662 ns 1218149.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 209633 ns 203871 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8520.5 ns 7562.5 ns 1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 7667 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8167 ns 7875 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10333.5 ns 8520.5 ns 1.21
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 228173.5 ns 227876 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 24272825 ns 22566032 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 665251.5 ns 569945 ns 1.17
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 15833 ns 16458 ns 0.96
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 15792 ns 17041 ns 0.93
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 14250 ns 16209 ns 0.88
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10604 ns 10979 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 20654 ns 20941 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1114936.5 ns 1150830 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 189192 ns 182992 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 35459 ns 35666 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 35708.5 ns 35167 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 35917 ns 36000 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 35583 ns 57833 ns 0.62
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 262047 ns 265749 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11306916 ns 12188303 ns 0.93
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 587246 ns 534293 ns 1.10
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 453583 ns 447500 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 451792 ns 488042 ns 0.93
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 456541.5 ns 455709 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 443395.5 ns 496916 ns 0.89
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194459 ns 195513 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6038382.5 ns 5997948.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 348194 ns 328714 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4066000 ns 4024209 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4050916 ns 4055021 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4067083.5 ns 4053917 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4058792 ns 5501562.5 ns 0.74
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 511129.5 ns 521631.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27375780 ns 27256015 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1358564 ns 1059038 ns 1.28
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 828745583 ns 836727208 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 540680709 ns 553913292 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 540688459 ns 540736625 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1504191667 ns 1517196875 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22557615 ns 22767789 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/oneAPI 174343277 ns 174930068 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14523230.5 ns 10331681 ns 1.41
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3007511084 ns 3773348667 ns 0.80
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 2962783167 ns 1782084291 ns 1.66
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1774752667 ns 1780399750 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4717186625 ns 4786718666 ns 0.99
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 119117244 ns 118657187 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/oneAPI 944339444.5 ns 1332561794 ns 0.71
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 87286346 ns 67063298 ns 1.30
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76833 ns 76542 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 80875 ns 76584 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79667 ns 79583 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76416 ns 76708.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 193222 ns 195943.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7583823.5 ns 5455658.5 ns 1.39
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 108921 ns 123300.5 ns 0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 193250 ns 191292 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 280041.5 ns 252042 ns 1.11
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 281104 ns 199562.5 ns 1.41
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 198375 ns 225542 ns 0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 995796 ns 1004442 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43132049 ns 43458500 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 631361 ns 590764 ns 1.07
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199538562.5 ns 199694520.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 139125625 ns 138856500 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139078542 ns 139241166 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 394211084 ns 393790959 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5836169 ns 5842492 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/oneAPI 77966597 ns 78913006.5 ns 0.99
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3607862.5 ns 4746717.5 ns 0.76
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 618106583 ns 617676375.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 439744666 ns 439446917 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 438674541.5 ns 439765166.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1195937583 ns 1174222000 ns 1.02
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26625423 ns 26723523 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/oneAPI 274698573 ns 276392509 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21854587 ns 15854720 ns 1.38
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7334 ns 7292 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6208 ns 6125 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6041 ns 5959 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9833 ns 9834 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26860 ns 26896.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1237975 ns 1173091 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47461 ns 55173 ns 0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220166.5 ns 213041.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219875 ns 227729 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229604 ns 220416.5 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205791 ns 206125 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 215811 ns 219868 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20823014 ns 20153337 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 524725 ns 541982 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8459 ns 8521 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9667 ns 7458 ns 1.30
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10167 ns 11167 ns 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6875 ns 9250 ns 0.74
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 112059.5 ns 115361 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3481104 ns 3392154.5 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 70541 ns 74069 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7395.5 ns 7562.5 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8000 ns 7958 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10208 ns 8167 ns 1.25
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7583 ns 7395.5 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 489162 ns 495697 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 18742318 ns 20965461 ns 0.89
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 313343 ns 309298 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 417 ns 417 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 25812.5 ns 26124 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1173150 ns 1243719 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 46821 ns 45334 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 8375 ns 9584 ns 0.87
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9500 ns 9062.5 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 11500 ns 9792 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9208 ns 9542 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 247400 ns 247606 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 22760470 ns 24899790.5 ns 0.91
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 393455 ns 382304 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 112333 ns 112312.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 103729.5 ns 103229 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 104458 ns 104104.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 154666 ns 155083 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 23375 ns 23501 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 817681 ns 811475 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 189652 ns 192539 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 586396 ns 536562 ns 1.09
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 583374.5 ns 554250 ns 1.05
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 585896 ns 535291.5 ns 1.09
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 534334 ns 910854 ns 0.59
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 218992 ns 221242 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11923370.5 ns 11751092 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 607266 ns 560216.5 ns 1.08
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5333 ns 5416.5 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 6625 ns 6208.5 ns 1.07
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7625 ns 6021 ns 1.27
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 6166.5 ns 4000 ns 1.54
batchedmm(16, Bsize=32)/forward/GPU/CUDA 16902 ns 17520 ns 0.96
batchedmm(16, Bsize=32)/forward/GPU/oneAPI 72702442 ns 72849606 ns 1.00
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 78091 ns 73648 ns 1.06
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 12479 ns 11562.5 ns 1.08
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 11166.5 ns 11062 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11625 ns 11000 ns 1.06
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16625 ns 16666 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 203795.5 ns 207455.5 ns 0.98
batchedmm(16, Bsize=32)/zygote/GPU/oneAPI 98168072 ns 97442684 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 377594 ns 330387 ns 1.14
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39833 ns 39667 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51500 ns 51291 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52563 ns 52958.5 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13583 ns 13625 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/CUDA 19687 ns 20356 ns 0.97
batchedmm(16, Bsize=128)/forward/GPU/oneAPI 76051886 ns 76663129 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 87971 ns 98364 ns 0.89
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36458 ns 36375.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 32395.5 ns 31417 ns 1.03
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 32104 ns 31229.5 ns 1.03
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57167 ns 57000 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 181544 ns 184178 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/oneAPI 111712645 ns 111708023 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 408524.5 ns 355254 ns 1.15
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1854 ns 1750 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 2042 ns 0.92
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2250 ns 2208 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1667 ns 1875 ns 0.89
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 19104 ns 19575 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1153845 ns 1219758.5 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 27270 ns 29099.5 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2208 ns 2208 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2291 ns 2167 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2375 ns 2375 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2291 ns 2208 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 196068 ns 198996.5 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 9312847 ns 8766738.5 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 136721 ns 128571 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5375 ns 4583 ns 1.17
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4750 ns 4417 ns 1.08
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6250 ns 6729 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4500 ns 3958 ns 1.14
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 140814.5 ns 143699.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5579238.5 ns 5704411.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 59971 ns 61955.5 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8604.5 ns 8334 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8417 ns 8083.5 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9250 ns 8709 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8417 ns 8583 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 827700 ns 836045.5 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 40495109.5 ns 39725172 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 387884 ns 364891 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 55208 ns 54833 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 55916 ns 55833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 55708 ns 55583 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 56375 ns 56000 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36532 ns 36570 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1228814.5 ns 1345223 ns 0.91
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 206372 ns 202568 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 523646 ns 476729 ns 1.10
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 535916.5 ns 494500 ns 1.08
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 504166.5 ns 494208 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 460541 ns 641625 ns 0.72
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 256291 ns 259886 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27542751 ns 28017517.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 802518 ns 705894 ns 1.14
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3314917 ns 3310333 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2332896 ns 2334062.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2330833 ns 2333375 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6314521 ns 6300479 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 204455 ns 204581.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/oneAPI 76381388 ns 77398976 ns 0.99
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 216403 ns 373097 ns 0.58
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11460833 ns 11459729 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8303937.5 ns 8305729.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8310083.5 ns 8342854 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21101249.5 ns 21088292 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 734685 ns 744676 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/oneAPI 119951330 ns 121497637 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1068321.5 ns 1994797.5 ns 0.54
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5250 ns 4833 ns 1.09
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5146 ns 4646 ns 1.11
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6958.5 ns 7520.5 ns 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3917 ns 4917 ns 0.80
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 132062.5 ns 133339 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5633249 ns 5450569.5 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 55811 ns 61520 ns 0.91
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7479.5 ns 7083 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns 7291.5 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7333 ns 7500 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7187.5 ns 7416.5 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 726722 ns 725863 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 36533131 ns 33872141 ns 1.08
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 371459 ns 353680 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 122208 ns 100459 ns 1.22
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 98583 ns 123042 ns 0.80
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 102750 ns 102417 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 95458 ns 121458.5 ns 0.79
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 149588 ns 151940.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5726109 ns 5695179 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 185537 ns 233346 ns 0.80
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2029833 ns 2033271 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2026979 ns 2026417 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2027208 ns 1997458.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2030625 ns 2041833 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 673368 ns 678763 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31179443 ns 31810809 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1113991.5 ns 931831 ns 1.20
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 32417 ns 32666 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 35792 ns 36562.5 ns 0.98
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 34917 ns 36167 ns 0.97
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 500 ns 667 ns 0.75
batchedmm(2, Bsize=4)/forward/GPU/CUDA 14705 ns 15627 ns 0.94
batchedmm(2, Bsize=4)/forward/GPU/oneAPI 72130343.5 ns 72187220 ns 1.00
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 85381 ns 70121 ns 1.22
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2625 ns 2604.5 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2917 ns 2958 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3042 ns 2937.5 ns 1.04
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2209 ns 2167 ns 1.02
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 137484 ns 139744 ns 0.98
batchedmm(2, Bsize=4)/zygote/GPU/oneAPI 92342967 ns 92749943 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 354874 ns 289641 ns 1.23
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7208 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 6000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5916 ns 5916 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9875 ns 9917 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 35660 ns 35855 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1282087 ns 1252207 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49651 ns 53911 ns 0.92
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213312.5 ns 212958.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 252999.5 ns 222708 ns 1.14
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 231666.5 ns 219917 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205458 ns 206209 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 240672 ns 243430 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26115627.5 ns 27468024.5 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 512115 ns 513269 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3791 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21665 ns 21959 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2106682 ns 2194149 ns 0.96
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 43630 ns 35557 ns 1.23
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14500 ns 14500 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14542 ns 14500 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14584 ns 14500 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14458 ns 14459 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 301039 ns 302419 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11172042.5 ns 11036089 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 195452 ns 179841 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 117271 ns 128041 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 105312.5 ns 144417 ns 0.73
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 105666 ns 106917 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 99375 ns 151959 ns 0.65
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 150300.5 ns 140874 ns 1.07
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6047278 ns 5963081 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 171362 ns 236762 ns 0.72
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1912167 ns 1924583 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1918167 ns 1920500 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1920208.5 ns 1914229.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1931333 ns 1928875 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 660598 ns 673452 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 29570117 ns 29935915 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1219983 ns 899671 ns 1.36
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17875 ns 17333 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17645.5 ns 17354.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21583.5 ns 21208 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18500 ns 17375 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 104806 ns 108833.5 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3524106 ns 3415955 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 76655.5 ns 91100 ns 0.84
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229084 ns 216917 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 249729.5 ns 252646 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 225104.5 ns 222166 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 216374.5 ns 229125 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 497993 ns 508535.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19805584 ns 19323488.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 477940 ns 419764 ns 1.14
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 23708.5 ns 24271 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 30208 ns 30791.5 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 28625 ns 29437.5 ns 0.97
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1250 ns 1584 ns 0.79
batchedmm(16, Bsize=4)/forward/GPU/CUDA 15640 ns 16398 ns 0.95
batchedmm(16, Bsize=4)/forward/GPU/oneAPI 71882264.5 ns 72518390 ns 0.99
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 87081 ns 76093 ns 1.14
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4771 ns 4500 ns 1.06
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5042 ns 4916 ns 1.03
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5375 ns 5125 ns 1.05
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4791 ns 4625 ns 1.04
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 201063.5 ns 204364 ns 0.98
batchedmm(16, Bsize=4)/zygote/GPU/oneAPI 93151103.5 ns 94073985 ns 0.99
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 390274 ns 331675 ns 1.18
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 220875 ns 222666 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 223291 ns 220666.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 224542 ns 225667 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 221042 ns 220583 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 218652.5 ns 222506.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7853304 ns 7881934.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 274613 ns 267871 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 507333.5 ns 495084 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 513291.5 ns 511812.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 517583.5 ns 500854 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 495500 ns 675750 ns 0.73
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1033086 ns 1053634 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43068906 ns 42862742 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 851254 ns 780999 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20125 ns 20375 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22271 ns 20000 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23084 ns 23875 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19500 ns 18792 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 112207.5 ns 114286 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3446725.5 ns 3510843 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 75731 ns 89858 ns 0.84
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221542 ns 212375 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218750 ns 213041 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223104 ns 214458 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213312 ns 212541 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 713841 ns 727333.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25836932 ns 24570511 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 533536 ns 469036 ns 1.14
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6875 ns 6666 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6625 ns 6604.5 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7458 ns 8750.5 ns 0.85
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6084 ns 6208 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 133356 ns 137142 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5813011 ns 5605207 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 67851 ns 60974 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10583 ns 9791 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11166.5 ns 10084 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10625 ns 10750 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10500 ns 10750 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 791888 ns 794651.5 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 38204864 ns 37034174 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 384539 ns 370101.5 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5000 ns 4666 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5000 ns 4708 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6167 ns 7437.5 ns 0.83
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4500 ns 4917 ns 0.92
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 135679.5 ns 138544.5 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5370478.5 ns 5520602 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 59130 ns 59692 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7792 ns 7458 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7542 ns 7166 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7791 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7708 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 755166.5 ns 755761 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 39315379 ns 37179182 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 397459.5 ns 376523 ns 1.06
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14342083 ns 14498417 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10137958 ns 10124125 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10102833 ns 10094833 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27729333 ns 27748583.5 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 547150.5 ns 532665 ns 1.03
batchedmm(128, Bsize=512)/forward/GPU/oneAPI 94505718 ns 94795139 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 394839 ns 866850 ns 0.46
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46201374.5 ns 46333437 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33505729.5 ns 33447541.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33506833 ns 33510458 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85350458 ns 85445667 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2644948 ns 2636151 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/oneAPI 192552360 ns 192783631 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3310165 ns 5189385.5 ns 0.64
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66875 ns 66458 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 67062.5 ns 65687.5 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 70562.5 ns 70500 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 66292 ns 66500 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 117903.5 ns 118172.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3543442.5 ns 3662360 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 234013 ns 237313 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 480625 ns 467958 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 471875 ns 480333.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 479792 ns 474916.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 467417 ns 686583.5 ns 0.68
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 702486.5 ns 715446 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26280981 ns 26609747 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 792918 ns 655875 ns 1.21
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 33017 ns 32877 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1183373 ns 1227269 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 47820 ns 47579 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8708 ns 8750 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9771 ns 9208 ns 1.06
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9208 ns 9104.5 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8167 ns 9750 ns 0.84
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 282049 ns 280778.5 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20818689 ns 21881943 ns 0.95
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 378759 ns 355484 ns 1.07
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9458 ns 9500 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9500 ns 9500 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9459 ns 9500 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9459 ns 9500 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23093.5 ns 23273 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 1930074 ns 1862112.5 ns 1.04
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 210972 ns 200655 ns 1.05
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 50417 ns 50209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 50209 ns 50250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 51166 ns 50500 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 50292 ns 72375 ns 0.69
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 276253 ns 278469.5 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 13609571 ns 13204061 ns 1.03
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 610356 ns 491037 ns 1.24
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 55167 ns 54917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 55875 ns 55667 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 55459 ns 55584 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 55500 ns 56000 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27847 ns 28169 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1197101 ns 1174691 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 206462 ns 203240 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 479083 ns 518854 ns 0.92
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 504375 ns 500625 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 537333 ns 497750 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 461667 ns 643417 ns 0.72
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 236634 ns 238777 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 33673119 ns 31628121.5 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 841708 ns 758938 ns 1.11
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 643604.5 ns 655042 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 591417 ns 613083 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 647458 ns 652541 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 653625 ns 678416.5 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 206246.5 ns 192069 ns 1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8254396 ns 8140636 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 266572 ns 269704 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2241792 ns 2167104.5 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2237167 ns 2233125 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2233375 ns 2241292 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2235000 ns 2230208.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 932062 ns 929752.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48393957 ns 55073105 ns 0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1360079 ns 1217770.5 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19125 ns 19500 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24979 ns 19208.5 ns 1.30
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23437.5 ns 23542 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18583 ns 20000 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 111398.5 ns 111306 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3483614 ns 3589059.5 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79426 ns 91551 ns 0.87
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225521 ns 220459 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 233166.5 ns 226458 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229166 ns 223104.5 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219708 ns 219708 ns 1
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 719263 ns 714110 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25222149 ns 26626181 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 554326 ns 487481 ns 1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 24044 ns 23491 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1223409 ns 1232519 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 47911 ns 43771 ns 1.09
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9750 ns 9417 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9667 ns 9291.5 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10417 ns 9708 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10042 ns 9646 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 267747 ns 261581 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 25715090 ns 23734390 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 404304.5 ns 381618 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7417 ns 8917 ns 0.83
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9000.5 ns 7583 ns 1.19
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10458 ns 11854.5 ns 0.88
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7959 ns 9042 ns 0.88
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 116633.5 ns 115935.5 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3300094 ns 3441325 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 69880 ns 70456.5 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7666 ns 8125 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 7542 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8375 ns 8000 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7750 ns 7292 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 492259.5 ns 484010 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17197317 ns 17813154.5 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 321574 ns 302215 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1416 ns 1417 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1583 ns 1667 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2083 ns 1959 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1542 ns 1500 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 20671 ns 20030 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1166362.5 ns 1146657 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 191412 ns 184144 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3459 ns 3708 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3667 ns 3625 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3833 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3542 ns 4917 ns 0.72
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 216790 ns 213101.5 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10825139.5 ns 10511562.5 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 581136 ns 524324.5 ns 1.11
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 147979 ns 148729 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 128271 ns 128917 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 130625 ns 129917 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 225209 ns 235541 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 23166.5 ns 22778 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1193615 ns 1179919.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 36441 ns 46868 ns 0.78
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 143209 ns 143645.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 125792 ns 130875 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 130479.5 ns 138417 ns 0.94
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 251104.5 ns 290021 ns 0.87
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 214906.5 ns 211960 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10713541 ns 10741797 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 231632 ns 223578 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7500 ns 7167 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5959 ns 5958 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5875 ns 5958.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10000 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34265 ns 33236 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1203594 ns 1203805 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49981 ns 57207 ns 0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225604 ns 221249.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 267167 ns 238542 ns 1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 236917 ns 264500 ns 0.90
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212812.5 ns 213250 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 263010 ns 259447 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 28448372 ns 27707385 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 521925 ns 530542 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12416 ns 13209 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11958 ns 12166 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13291.5 ns 13584 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12416 ns 12667 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 135191 ns 135078 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5177938 ns 5685986 ns 0.91
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 235942 ns 227730.5 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24458.5 ns 23917 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24750 ns 24083.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25020.5 ns 24750 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23792 ns 30146 ns 0.79
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 844672 ns 833527 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 39036318 ns 39963084.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 684117 ns 615374.5 ns 1.11
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8708 ns 9271 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9687.5 ns 9541 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10438 ns 10375 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8541 ns 9250 ns 0.92
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 119636.5 ns 119628 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3564984 ns 3356719.5 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 70291 ns 74940 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13833 ns 14041 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14083 ns 13958 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14375 ns 14750 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14250 ns 13459 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 646842.5 ns 638262 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 22010471 ns 22466836 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 363989 ns 344824 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8875 ns 9666.5 ns 0.92
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9042 ns 9208 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10792 ns 10959 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9000 ns 9083.5 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 119111 ns 118521 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3329855.5 ns 3571671.5 ns 0.93
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 72351 ns 79399 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12541.5 ns 13416 ns 0.93
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12958 ns 12416 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13625 ns 13479.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13270.5 ns 12708 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 534452 ns 530027 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19320932 ns 19360325 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 339643 ns 317163 ns 1.07
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 29104 ns 30896 ns 0.94
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 34250 ns 33813 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 31333 ns 32249.5 ns 0.97
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1667 ns 1875 ns 0.89
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16403 ns 16425 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/oneAPI 77474262 ns 76985679 ns 1.01
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 78711 ns 76663 ns 1.03
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5313 ns 5417 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5187.5 ns 5000 ns 1.04
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5542 ns 5479.5 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6334 ns 6270.5 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 139944.5 ns 138278 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/oneAPI 109880855.5 ns 109824422.5 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 385874 ns 340566 ns 1.13
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 291 ns 333 ns 0.87
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 291 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 26331 ns 25574 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1212809 ns 1142450 ns 1.06
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 50001 ns 45666 ns 1.09
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6458 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6520.5 ns 6375 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6833.5 ns 6791.5 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6250 ns 6458.5 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 189017.5 ns 185923.5 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 24086071.5 ns 22900684.5 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 393884 ns 365402.5 ns 1.08
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 2084 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2084 ns 2084 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2166 ns 2083 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 2000 ns 2000 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 26947 ns 26453 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1232598.5 ns 1207656 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 209912 ns 203645.5 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17271 ns 18041 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17958 ns 17166.5 ns 1.05
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18437.5 ns 17750 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17416.5 ns 23458.5 ns 0.74
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 273533.5 ns 268326 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 25687047 ns 24994377.5 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 706927 ns 600702.5 ns 1.18
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 174834 ns 147875 ns 1.18
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 177583 ns 155437.5 ns 1.14
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 154000 ns 155125 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147938 ns 151708 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193431 ns 190890.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7836168 ns 7974634 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 191502 ns 271146.5 ns 0.71
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1322708.5 ns 1321937.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1324354.5 ns 1330625 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1323396 ns 1308375 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1323166 ns 1285166 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 871146 ns 867140 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 46559834.5 ns 45331705.5 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 992400 ns 1006962 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24521 ns 25500 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25208 ns 23542 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27666 ns 28708.5 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24333 ns 24416.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 230336.5 ns 226899 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7412823 ns 7680667 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 115211 ns 128029 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 119395.5 ns 125062.5 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 178146.5 ns 165729.5 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 128749.5 ns 125854.5 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 117812.5 ns 180062 ns 0.65
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 999409 ns 998018.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45419603 ns 44411227 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 590316 ns 568743 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 250 ns 250 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23826 ns 23453 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1180276 ns 1190116 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 49221 ns 44533 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6458 ns 6895.5 ns 0.94
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6667 ns 6458 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7000 ns 6958 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6625 ns 6520.5 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 205667.5 ns 201834 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 24604250 ns 23542895 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 395509.5 ns 372536 ns 1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6292 ns 5645.5 ns 1.11
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6083 ns 5375 ns 1.13
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6875 ns 7979 ns 0.86
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5834 ns 5166 ns 1.13
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 139751.5 ns 139838.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5712361.5 ns 5619575.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 236162 ns 229750 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10354.5 ns 9958 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10167 ns 10042 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10209 ns 10417 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10125 ns 10854.5 ns 0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 879601 ns 866511 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 40717350 ns 43130156 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 675267 ns 603858 ns 1.12
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 709 ns 708 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 709 ns 708 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 708 ns 750 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 708 ns 667 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23026.5 ns 22827 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2088235 ns 2079377 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 210772 ns 202368 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4917 ns 4834 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4875 ns 4833 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5209 ns 5125 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4833 ns 6291 ns 0.77
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 226029 ns 222098 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10810104.5 ns 9952955 ns 1.09
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 586506 ns 471721 ns 1.24
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8146 ns 8750 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8625 ns 7834 ns 1.10
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9270.5 ns 9375 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7291.5 ns 7646 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 118631.5 ns 117939.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3512039 ns 3568146 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 70461 ns 74409 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8125.5 ns 8792 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8583 ns 8583 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9250 ns 8875 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8542 ns 8083 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 573498 ns 568724.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 20841257.5 ns 20842961 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 343184 ns 335106 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126249.5 ns 126042 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 128834 ns 129208 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 131000 ns 129542 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 181395.5 ns 180792 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46573 ns 46423 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/oneAPI 70147077 ns 72616088 ns 0.97
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 95481 ns 101850 ns 0.94
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 338333 ns 315875 ns 1.07
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 349000 ns 334166.5 ns 1.04
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 337062.5 ns 323291.5 ns 1.04
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 567833.5 ns 609395.5 ns 0.93
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 189615.5 ns 187684 ns 1.01
batchedmm(128, Bsize=4)/zygote/GPU/oneAPI 92008312 ns 93899553 ns 0.98
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 440594.5 ns 405833.5 ns 1.09
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397084 ns 397500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288708 ns 287979.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288709 ns 288375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756250 ns 756000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43906 ns 43964 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1378630 ns 1424885 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 81921 ns 79439 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1460000 ns 1461000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1133854 ns 1133834 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1133374.5 ns 1129645.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2441854.5 ns 2449292 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 258953 ns 254140 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 11022577.5 ns 11042616 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 350313 ns 254646 ns 1.38
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 652042 ns 626500 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 655770.5 ns 657208.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 649874.5 ns 649750.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 689167 ns 642417 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 182902 ns 185720.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8018912 ns 8332264.5 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 263233 ns 264649 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2443292 ns 2452625 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2446500 ns 2465208.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2444562.5 ns 2459375 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2453167 ns 2376375 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 954882.5 ns 949649 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49210806 ns 53455476.5 ns 0.92
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1318509 ns 1323598 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 33250.5 ns 32458 ns 1.02
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 35646 ns 36521 ns 0.98
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34312 ns 34833 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 792 ns 959 ns 0.83
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15879 ns 15902 ns 1.00
batchedmm(2, Bsize=32)/forward/GPU/oneAPI 67611192 ns 73782106 ns 0.92
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 85861 ns 74499.5 ns 1.15
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3000 ns 3125 ns 0.96
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3125 ns 3250 ns 0.96
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3542 ns 3375 ns 1.05
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3084 ns 3062.5 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 139105.5 ns 137187.5 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/oneAPI 91963406 ns 98822060.5 ns 0.93
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 349913.5 ns 314258 ns 1.11
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 441375 ns 436500 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 439792 ns 438625 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 440208 ns 438791 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 450041 ns 445917 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 43441 ns 42826 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1439826.5 ns 1503651 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 239472 ns 374379.5 ns 0.64
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4142666 ns 4140000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4279125 ns 4271375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4266750 ns 4270687.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4032604 ns 5468750 ns 0.74
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 238139 ns 236201.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36792447.5 ns 36248116 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1233123 ns 1135862 ns 1.09
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3791 ns 3750 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3792 ns 3791 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3709 ns 3709 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34461 ns 34158 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1301757 ns 1274307 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 40721 ns 41117 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15292 ns 15375 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15375 ns 15334 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15625 ns 15500 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15333 ns 15250 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 259750 ns 255579 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 9866512.5 ns 8309435 ns 1.19
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 168172 ns 158606 ns 1.06
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404291 ns 404792 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 296209 ns 295917 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 294584 ns 295958 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760542 ns 759750 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113769 ns 113245 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1008763 ns 1043498 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 89651 ns 91962 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1484624.5 ns 1482854 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1157750 ns 1158625 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1145000 ns 1150334 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2462750 ns 2466708 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 238527 ns 236768.5 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 12433376 ns 9725420.5 ns 1.28
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 352393.5 ns 298578 ns 1.18
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 500 ns 584 ns 0.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 584 ns 584 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 541 ns 542 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 26037 ns 25569 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1255059.5 ns 1198679 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 210442 ns 202679 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7541 ns 8083 ns 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7708 ns 7792 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8084 ns 8375 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7583 ns 8437.5 ns 0.90
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 211417.5 ns 207068.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 24400994.5 ns 25228707 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 693867 ns 593474 ns 1.17
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 829271.5 ns 829375 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 616604 ns 617667 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 618458 ns 618667 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1541041.5 ns 1544417 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 131087.5 ns 130866 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/oneAPI 67626584 ns 74874331.5 ns 0.90
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 167721 ns 211214 ns 0.79
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2681083 ns 2686104.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1986250 ns 1994542 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1998541.5 ns 1998375 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4921791 ns 4960479 ns 0.99
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 235041 ns 234509 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI 96224859 ns 102181218 ns 0.94
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 859218 ns 831293.5 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 250 ns 250 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 33342 ns 32562 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1243213 ns 1276503 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 49520 ns 48691 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6312.5 ns 6333 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6375 ns 6375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6917 ns 6667 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6250 ns 6104.5 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 226628 ns 227701 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20936439.5 ns 21756022 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 365763 ns 346728 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1723875 ns 1760625 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1762479 ns 1749875 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1732458.5 ns 1744292 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1731875 ns 1755166 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190679 ns 189332 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8143920 ns 7765672 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 366039 ns 413433 ns 0.89
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4353396 ns 4360416 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4360791.5 ns 4366917 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4367521 ns 4349104 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4360208 ns 5705104 ns 0.76
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 855902 ns 849205 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 46724354.5 ns 48802559 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1250493 ns 1205562.5 ns 1.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 13708 ns 9604 ns 1.43
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7166 ns 6916 ns 1.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7834 ns 8208 ns 0.95
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7229 ns 6854 ns 1.05
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 22753 ns 22924.5 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1051464 ns 1184238.5 ns 0.89
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 37990 ns 46437 ns 0.82
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 45958.5 ns 50604.5 ns 0.91
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 50187.5 ns 52166 ns 0.96
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 65249.5 ns 45458.5 ns 1.44
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 68708 ns 33312.5 ns 2.06
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 213762.5 ns 211538 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10164364 ns 10576796.5 ns 0.96
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 234103 ns 226508 ns 1.03
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 20791.5 ns 21646 ns 0.96
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 25208 ns 26083.5 ns 0.97
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 24229 ns 24958.5 ns 0.97
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5208 ns 5291.5 ns 0.98
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18045 ns 18121 ns 1.00
batchedmm(2, Bsize=512)/forward/GPU/oneAPI 82328327.5 ns 88732630 ns 0.93
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 90131 ns 73668 ns 1.22
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12312.5 ns 12125 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10500 ns 10667 ns 0.98
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10875 ns 10833 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18125 ns 18042 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 222607.5 ns 221707 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI 143966308.5 ns 148404121 ns 0.97
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 383434 ns 322703 ns 1.19
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 405834 ns 405917 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 297333 ns 296791.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296750 ns 297167 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762541 ns 756709 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46967 ns 46696 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1397676 ns 1393570.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 90631 ns 90770 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1478916 ns 1487375 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1164458 ns 1163500 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1157937.5 ns 1157209 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2468458 ns 2472417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 284577.5 ns 283340.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 14188453 ns 11947586 ns 1.19
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 375254 ns 269032 ns 1.39
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 438667 ns 436458 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 439125 ns 443270.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 440208 ns 440750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 447917 ns 449000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 54562 ns 53940 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 999880.5 ns 1027722 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 236502.5 ns 323133 ns 0.73
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4148000 ns 4138541 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4270667 ns 4268354.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4263812.5 ns 4258750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4038187.5 ns 5475229.5 ns 0.74
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 258729.5 ns 255597 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31613962 ns 31502698.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1211463 ns 1132896.5 ns 1.07
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9292 ns 9333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 8000 ns 8000 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 8000 ns 8000 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 13250 ns 13250 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24006 ns 23885 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2083029 ns 1973050 ns 1.06
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 213222 ns 202528 ns 1.05
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 49667 ns 49625 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 49583 ns 49667 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 50041 ns 49583 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 49416 ns 71667 ns 0.69
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 344384 ns 336641 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 12740327 ns 13058534 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 656211.5 ns 508895.5 ns 1.29
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 83063 ns 108270.5 ns 0.77
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82958 ns 86167 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 87395.5 ns 86500 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 107312.5 ns 146083 ns 0.73
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192844 ns 192063 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6092023.5 ns 5750624 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 200502 ns 267851 ns 0.75
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2016750 ns 2018917 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2018396 ns 2016937.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2015124.5 ns 2011375 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2019500 ns 2024000.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 519972 ns 511598 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27686183 ns 30563079 ns 0.91
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1092442 ns 860237 ns 1.27

This comment was automatically generated by workflow using github-action-benchmark.
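
The Ratio column in the table above is consistent with the current time divided by the previous time (for example, 68708 ns / 33312.5 ns ≈ 2.06 for the `bias_activation(512, act=relu)` zygote row at 1 thread), so values above 1 indicate a slowdown relative to the previous commit. A minimal Python sketch of that reading follows; the function names and the 1.3 cutoff are hypothetical choices for illustration and are not taken from the workflow's configuration.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time over previous time; a value above 1 means the current commit is slower."""
    return current_ns / previous_ns


def flag_regressions(rows, threshold=1.3):
    """Return (name, ratio) pairs whose ratio exceeds `threshold`.

    `threshold` is a hypothetical cutoff chosen for illustration only.
    """
    flagged = []
    for name, current_ns, previous_ns in rows:
        r = ratio(current_ns, previous_ns)
        if r > threshold:
            flagged.append((name, round(r, 2)))
    return flagged


# Three rows copied from the table above: (benchmark, current ns, previous ns).
rows = [
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)", 68708, 33312.5),
    ("dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU", 375254, 269032),
    ("batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU", 239472, 374379.5),
]

for name, r in flag_regressions(rows):
    print(f"{r:.2f}  {name}")
```

With these three rows, only the first two exceed the illustrative cutoff; the third has a ratio below 1, meaning the current run was faster.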
