This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

feat: use fallback GPU implementations with warnings #165

Merged
merged 11 commits into from
Sep 21, 2024


avik-pal
Member

@avik-pal avik-pal commented Sep 21, 2024

New Additions

  • Fallback BatchedMM
  • Conv -- Forward Pass
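
As an illustration of the "fallback with a warning" idea in the title, the pattern can be sketched in plain Python. This is not the code from this PR: LuxLib is a Julia package and its actual dispatch is not shown in this conversation. The registry `FAST_KERNELS`, both function names, and the backend string below are hypothetical.

```python
import warnings

# Hypothetical registry: backend name -> optimized batched-matmul kernel.
# It is left empty here so that every call takes the fallback path.
FAST_KERNELS = {}


def batched_matmul_fallback(x, y):
    """Generic loop-based batched matmul on nested lists.

    x has shape (B, M, K) and y has shape (B, K, N); the result is (B, M, N).
    """
    if len(x) != len(y):
        raise ValueError("batch sizes differ")
    out = []
    for xb, yb in zip(x, y):
        cols = list(zip(*yb))  # columns of this batch's right-hand matrix
        out.append(
            [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in xb]
        )
    return out


def batched_matmul(x, y, backend="cpu"):
    """Use an optimized kernel if one is registered, else warn and fall back."""
    kernel = FAST_KERNELS.get(backend)
    if kernel is None:
        warnings.warn(
            f"no optimized batched matmul for backend {backend!r}; "
            "using the slow generic fallback",
            RuntimeWarning,
            stacklevel=2,
        )
        kernel = batched_matmul_fallback
    return kernel(x, y)
```

Calling `batched_matmul(x, y, backend="oneapi")` with nothing registered for that backend emits a `RuntimeWarning` and still returns the correct product, so an unsupported backend stays slow but usable instead of raising an error.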

@avik-pal avik-pal changed the title from "ci: update buildkite settings" to "feat: use fallback GPU implementations with warnings" Sep 21, 2024
Contributor

@github-actions github-actions bot left a comment

LuxLib Benchmarks

Benchmark suite Current: 0fa961d Previous: a6c4a16 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7000 ns 5666 ns 1.24
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5875 ns 5667 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8125 ns 7062.5 ns 1.15
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5833 ns 5541.5 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 114370 ns 117778 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2728634 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 403974 ns 404275 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9854.5 ns 9937.5 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10062 ns 10041 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10083.5 ns 10291 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10021 ns 9875 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 542155 ns 544239 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 18579997 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 688097 ns 11501326 ns 0.06
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1895.5 ns 1416.5 ns 1.34
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 2458 ns 1479 ns 1.66
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1833 ns 1625 ns 1.13
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3312.5 ns 1542 ns 2.15
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21595 ns 21518 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1311721 ns
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 29250 ns 29030 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4375 ns 4250 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4104 ns 4333 ns 0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4292 ns 4313 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4500 ns 4459 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 144308.5 ns 145904.5 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 9573121.5 ns
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 150842 ns 145511 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58292 ns 58625 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 40667 ns 39750 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 40084 ns 40042 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83708 ns 83395.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37142 ns 37436 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 554486 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78351 ns 80685.5 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2029125 ns 2046125 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2082875 ns 2077896 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2076625 ns 2083625.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1999854 ns 1999104 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 226507.5 ns 229936 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 7112064 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1126831 ns 1490545 ns 0.76
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 181729 ns 162312.5 ns 1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 146124.5 ns 164083 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 174625 ns 174959 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 145917 ns 153854 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165233.5 ns 166305 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7170202 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 204932 ns 198262 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1123062 ns 1121458.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1118333 ns 1114979 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1115334 ns 1119209 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1114333.5 ns 1123521 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 696646 ns 696644 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 34659049.5 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1026230 ns 1026480.5 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4625 ns 4875 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4334 ns 4916 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6583 ns 5875 ns 1.12
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4584 ns 5375 ns 0.85
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 90392 ns 92112 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5213655 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 67531 ns 69791 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8625 ns 8875 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8583 ns 8917 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8875 ns 8959 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8833 ns 8625 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 607432 ns 596620 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 33741334 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 385194 ns 389954 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20687.5 ns 18312 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18167 ns 18104.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21166 ns 20021 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17375 ns 17771 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 65601 ns 67875.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 2651220 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 76505.5 ns 77581 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212250 ns 235917 ns 0.90
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 222167 ns 212458 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214083 ns 213667 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217000 ns 225292 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 349996 ns 353373 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 11608614 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 465179.5 ns 470510 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 708 ns 0.94
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 708 ns 625 ns 1.13
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 833 ns 959 ns 0.87
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 729.5 ns 0.91
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20444 ns 20362 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1169655 ns
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 31240 ns 32440 ns 0.96
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1521 ns 1375 ns 1.11
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1416 ns 1458 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1562.5 ns 1459 ns 1.07
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1479.5 ns 1375 ns 1.08
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 124121.5 ns 125347.5 ns 0.99
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8377334 ns
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 136902 ns 135651 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7458 ns 7458 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5417 ns 5292 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5459 ns 5458 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10250 ns 10416 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23727 ns 24280.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1272526 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47160 ns 48481 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219666 ns 256833 ns 0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 266979 ns 268834 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 240208 ns 238167 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213500 ns 213521 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 191900 ns 190543 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 30716086 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 641871 ns 644671.5 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4084 ns 4125 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4083 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4125 ns 4084 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4083 ns 4083 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23269 ns 23269 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 1962626 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 46830 ns 48260 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16583 ns 16542 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16417 ns 16542 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16834 ns 16833 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16583 ns 16583 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 192011.5 ns 195985.5 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 10074326 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 172192 ns 174616.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 511750 ns 511667 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 332166 ns 331875 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 332333 ns 332042 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 864834 ns 865458 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113501.5 ns 113196 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 392820.5 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 243122 ns 243182 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2281875 ns 2277833 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1757125 ns 1758208 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1748833 ns 1758041.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3194666 ns 3193625 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 236911 ns 242653 ns 0.98
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 9329632 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 745737 ns 741122 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6583 ns 6396 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6500 ns 7021 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6896 ns 7583 ns 0.91
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6208 ns 6084 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 90670 ns 90386 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5216804 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 65650 ns 65841 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11146 ns 11812 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12125 ns 11729.5 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12125 ns 12250 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11792 ns 10125 ns 1.16
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 654623 ns 626387 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38049525 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 410264 ns 405759 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 541 ns 542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23122 ns 23421 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2086193.5 ns
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 47250 ns 46570 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2083 ns 2083 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2208 ns 0.94
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2167 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2083 ns 2084 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 220521 ns 221475.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 11675105 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 174296.5 ns 174101.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8583 ns 9041 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9124.5 ns 9292 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10375 ns 10375 ns 1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8625 ns 9000 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 108876.5 ns 94379 ns 1.15
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 2938947.5 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 73720 ns 72281 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17958 ns 17375 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18021.5 ns 17729 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 20166.5 ns 19209 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16792 ns 17562.5 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 610846 ns 576225.5 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 16134747 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 379093 ns 378363 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 34896 ns 35667 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1175218 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 46220 ns 46061 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10292 ns 10687.5 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9333 ns 9083.5 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9520.5 ns 9750 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9937.5 ns 8666.5 ns 1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 263509 ns 258995 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18433346 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 369503 ns 366948.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398417 ns 399292 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215333 ns 215291 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 215708.5 ns 215292 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756042 ns 756083 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112043 ns 113061 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 324163 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 75161 ns 74731 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1403583 ns 1407958 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 860500 ns 860333 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 860292 ns 860854 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2356250 ns 2357500 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 204832 ns 211180.5 ns 0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 10824254 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 324103 ns 323393 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7083.5 ns 7125 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7208.5 ns 7542 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8750 ns 9000 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6937.5 ns 7250.5 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 144097 ns 143379.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5731995 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 66151 ns 66420 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14270.5 ns 15250 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13167 ns 14959 ns 0.88
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15750 ns 13687.5 ns 1.15
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14646.5 ns 12333.5 ns 1.19
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 965886 ns 942342 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41096887.5 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 426274 ns 425844 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 26250 ns 24646 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 28521 ns 28000 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27312.5 ns 26666 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 28791.5 ns 28334 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 197086.5 ns 199235 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7328154 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 114161 ns 114286.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 156458 ns 153084 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 147291 ns 157166.5 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 116187.5 ns 145958.5 ns 0.80
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 118500 ns 153417 ns 0.77
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1069974 ns 1075111 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42550755 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 588095.5 ns 585190.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 73792 ns 76625 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 77000 ns 76729 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 77083 ns 81229 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 78959 ns 79750 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 203286 ns 206416.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 6928996 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 128061 ns 129541 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 298125 ns 307729 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 317958 ns 294250 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 304042 ns 290520.5 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 318041 ns 291458 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1106723 ns 1105738.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 39233924 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 690307 ns 696697 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16875 ns 16875 ns 1
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 16750 ns 16500 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 18917 ns 18375 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 16709 ns 17584 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 144249.5 ns 145532.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5646190 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 233552 ns 232517.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27208.5 ns 27125 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27896 ns 26750 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27812.5 ns 27208 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 28166.5 ns 26604 ns 1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 976554 ns 980431.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 43689238 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 694906 ns 686517 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11042 ns 11625 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 10937.5 ns 12250 ns 0.89
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13458 ns 13875 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11125 ns 10458 ns 1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 122570.5 ns 123683.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3608787 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 235652 ns 236852 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 22292 ns 22709 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21958 ns 22063 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 22333 ns 23083 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21541 ns 21833 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 698612.5 ns 703893 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 21406667.5 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 662166 ns 673557 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 64250 ns 64250 ns 1
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 69042 ns 69208 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 67209 ns 65937.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 65375 ns 63250 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 105516 ns 107264.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3470032.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 232633 ns 232543 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 482000 ns 457334 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 450374.5 ns 450791 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 438729.5 ns 449333.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 443791 ns 488708 ns 0.91
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 510352 ns 515904.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 22832245.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 715037 ns 701456.5 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7062.5 ns 7333.5 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7292 ns 7750 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8834 ns 9208 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7500 ns 6979 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 142341 ns 144382.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5428957 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 65041 ns 65051 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13395.5 ns 14354.5 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16062.5 ns 15459 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15208 ns 15000 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14145.5 ns 15604 ns 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 944536 ns 949171 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 37204219 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 399994 ns 399874 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6182312.5 ns 6153958.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 3226458 ns 3225750 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 3224104 ns 3225687.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11924667 ns 11912750 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 351177 ns 350232.5 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/oneAPI 50010265 ns
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 318033 ns 320283 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19136666.5 ns 19165042 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 11136625 ns 11087125 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 11100292 ns 11132791 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36502291.5 ns 36531187.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1100585.5 ns 1015711 ns 1.08
batchedmm(512, Bsize=4)/zygote/GPU/oneAPI 76536164 ns
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1166521.5 ns 1168797 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 958 ns 958 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 958 ns 1000 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1000 ns 1000 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 958 ns 917 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 22979 ns 23879 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 1957397.5 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 208262 ns 206962 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3667 ns 3667 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3666 ns 3750 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3709 ns 3709 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3667 ns 3667 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 278006 ns 284113 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10709575 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 625756 ns 623016 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7937.5 ns 8312.5 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8229.5 ns 8604.5 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9375 ns 10083 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7750 ns 8146 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 119006.5 ns 119881.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3406755 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 74061 ns 71901 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 12187.5 ns 12166.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 12750 ns 12145.5 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 14083 ns 13313 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11333 ns 11395.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 637337.5 ns 642520 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 21563096 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 355258.5 ns 357894 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 291 ns 291 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 21948 ns 22935 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2084333.5 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 46950 ns 46631 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2833 ns 2917 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2833 ns 2917 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3250 ns 3167 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2833 ns 2958 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 200292.5 ns 206899.5 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9839820 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 155592 ns 161012 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12792 ns 12500 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11834 ns 11354 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12833 ns 13083 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11167 ns 10958.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 119893 ns 121271 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3241253.5 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 236063 ns 233822 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20708.5 ns 20291.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21292 ns 21083 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21250 ns 22187.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21312.5 ns 20104.5 ns 1.06
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 590891.5 ns 597659.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20607237 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 650536 ns 638656 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4375 ns 4417 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4417 ns 4417 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4375 ns 4416 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 23788 ns 24156 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2147602 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 49121 ns 47331 ns 1.04
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16584 ns 16167 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16416 ns 16375 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16416 ns 16333 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16417 ns 16333 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 327093.5 ns 333657 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12732239 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 208292 ns 207757 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 2125 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 1958 ns 2125 ns 0.92
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2125 ns 2084 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2041 ns 2041 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 35579 ns 36462 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1202245.5 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 204436.5 ns 202982 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 17042 ns 17021 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 18083.5 ns 17625 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 19125 ns 16667 ns 1.15
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 16333 ns 17083.5 ns 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 291105 ns 296284 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19886230 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 689131 ns 684797 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59229 ns 59562.5 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 60125 ns 61667 ns 0.97
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 61459 ns 61875 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51375 ns 50958 ns 1.01
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66363 ns 66679 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/oneAPI 86637572 ns
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 115381 ns 117392 ns 0.98
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 189770.5 ns 190771 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 148917 ns 149541 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 97708.5 ns 116312.5 ns 0.84
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 309208 ns 298166 ns 1.04
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 214334 ns 219498 ns 0.98
batchedmm(16, Bsize=512)/zygote/GPU/oneAPI 148722833 ns
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 614056 ns 614646 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 86604.5 ns 83166.5 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 84500 ns 83395.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86375 ns 110041.5 ns 0.78
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 85708 ns 83020.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192894.5 ns 190710.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5258216.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 204282 ns 206032 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1905792 ns 1873645.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1892750 ns 1919416 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1900750 ns 1920792 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1870625 ns 1919291.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 526027 ns 533490 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 26753507.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1068255.5 ns 1074210 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21314 ns 21800 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2080980.5 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 41780 ns 43000 ns 0.97
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1792 ns 1875 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1791 ns 1792 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 250042 ns 256181.5 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9716773.5 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 181572 ns 182412 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8500 ns 8458 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8792 ns 9958 ns 0.88
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11521 ns 11708 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9042 ns 7583 ns 1.19
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 116760 ns 119063.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3339276 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 234131 ns 234272.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9000 ns 9208 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9750 ns 9854 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9166 ns 9792 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9500 ns 8750 ns 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 520709 ns 528065 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20652008 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 627683 ns 634101 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58417 ns 58208 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 40333 ns 39375 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39542 ns 39959 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83750 ns 83291 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38752 ns 39916.5 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1328063 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75841 ns 79101 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1920770.5 ns 1906833 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1928521 ns 1969916.5 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1954583.5 ns 1979458 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1888000.5 ns 1901458 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 219080 ns 221725 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32011918 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1015755 ns 1161491.5 ns 0.87
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 415417 ns 417125 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 417584 ns 420562.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 420917 ns 422103.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 422208 ns 417979 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 206380 ns 210226 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7709399 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 282222 ns 283213 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 669542 ns 680083.5 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 741000 ns 675125 ns 1.10
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 669750 ns 672375 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 769625 ns 672542 ns 1.14
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1044128.5 ns 1049720 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42252360 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 905964 ns 908698.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3364729 ns 3405187.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3444583 ns 3449917 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3466021 ns 3463646 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3450375 ns 3430687 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167737 ns 170640 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8364086 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 445672.5 ns 450759.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6249791.5 ns 6244167 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6232833.5 ns 6219417 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6233520.5 ns 6254812 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6214208 ns 6201688 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 983988.5 ns 1001354 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 52841091 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1551067 ns 1637156.5 ns 0.95
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 474625 ns 474833 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 254209 ns 253792 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 253250 ns 253584 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 902875 ns 901250 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46316 ns 47396 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 392289 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 242651 ns 241892 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2261542 ns 2269791 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1763417 ns 1760416 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1753000 ns 1763687.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3203417 ns 3197937.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 253834.5 ns 271388 ns 0.94
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 14983409 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 769568.5 ns 765898 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58042 ns 58541 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39750 ns 39292 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39625 ns 39792 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83875 ns 84166 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 27776 ns 28606 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1322576.5 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75170 ns 73921 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2024708.5 ns 2031396 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2069229.5 ns 2088958.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2074583 ns 2084000 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1987416.5 ns 1977812.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 232148 ns 235137 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 37325860 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1193936 ns 1110895.5 ns 1.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58979.5 ns 58667 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 40042 ns 39833 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39833 ns 40000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83416 ns 83291 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 48621.5 ns 49806.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 819083.5 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75250 ns 76691 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1894958.5 ns 1930083.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1945875 ns 1967645.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1967333 ns 1961750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1890354.5 ns 1797166 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 237723.5 ns 240260.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 17306002 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1038985 ns 929734.5 ns 1.12
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 416 ns 416 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 34420 ns 35036 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1184520 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 46480 ns 46470 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7166 ns 7584 ns 0.94
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6875 ns 6875 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7000 ns 7458 ns 0.94
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6958 ns 5916 ns 1.18
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 209806 ns 213960 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 21802968 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 362502 ns 368994 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32262 ns 33302 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1231644 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 36940 ns 36481 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 2959 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2709 ns 3083 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3041 ns 3042 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 2625 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 186134 ns 192793 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 7394765 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 149051 ns 151232 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 455666.5 ns 420458.5 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 457083.5 ns 458333.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 425937 ns 443562.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 449041.5 ns 454625 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 137203 ns 138662 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5989310 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 377397 ns 376564 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3794292 ns 3808250 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3826895.5 ns 3812458 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3819042 ns 3814333.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3722562.5 ns 3779687.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 704611 ns 712866 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32030017 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1471758 ns 1464519 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49940500 ns 49902208 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 26001417 ns 26041000 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 26005875 ns 26000917 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97063333 ns 97099875 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1599184.5 ns 1600470 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/oneAPI 55216417.5 ns
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1046316 ns 1045150 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154651354.5 ns 154793291.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 89413687.5 ns 88667041.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 89414417 ns 89550541 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 295512500 ns 294974291.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6483995 ns 6495543 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/oneAPI 127440010 ns
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5576760 ns 5606170 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 17208 ns 18750 ns 0.92
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 16208 ns 15666.5 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 15375 ns 14167 ns 1.09
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15583 ns 15270.5 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 20249 ns 20352.5 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1258826 ns
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 27331 ns 25851 ns 1.06
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 11041 ns 11041 ns 1
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 7667 ns 7833 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 8166.5 ns 7958 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17562 ns 17083 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 257898 ns 261162.5 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 9716305 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 149465.5 ns 148401.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 9041.5 ns 8375 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8916.5 ns 9083 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9625 ns 10583 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8250 ns 7916.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 114056.5 ns 113294.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3436422 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 234541 ns 234072 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10083 ns 10521 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10854.5 ns 10416.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11041.5 ns 10042 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10083 ns 9666.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 616548 ns 615911 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 23702900 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 656253.5 ns 655506 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9354 ns 9625 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9146 ns 9833 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11459 ns 12042 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9708 ns 8479 ns 1.14
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 119534.5 ns 120314 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3401114 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 72100 ns 71931 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 15084 ns 13083 ns 1.15
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 17854.5 ns 15021 ns 1.19
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 14541 ns 14542 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 16958 ns 13417 ns 1.26
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 586032 ns 587303 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 18941794 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 342782 ns 344908.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 541 ns 459 ns 1.18
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 34336 ns 34757 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1224632 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 204531 ns 201632 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8333 ns 7333.5 ns 1.14
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9104 ns 9270.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10375 ns 7833 ns 1.32
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7542 ns 7229.5 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 229285 ns 231923.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22054400 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 662954 ns 657851 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 15333 ns 15875 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 14875 ns 14645.5 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 12833.5 ns 12167 ns 1.05
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10459 ns 10375 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 21521.5 ns 21214 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1153921 ns
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 188256.5 ns 184672 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 31708 ns 31375 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32187 ns 32416 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32375 ns 32270.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32229.5 ns 31541 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 273410 ns 276539 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11009636 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 593353 ns 588126 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 442750 ns 444792 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 447875 ns 484417 ns 0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 443250 ns 448792 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 443417 ns 443250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194791 ns 194813 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5825309.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 369882 ns 367924 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3838166.5 ns 3843833 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3841000 ns 3831916.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3835749.5 ns 3838417 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3829458 ns 3835042 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 539590 ns 537386 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28901903 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1361922.5 ns 1358632 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 779827458 ns 784101083 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 415957437.5 ns 418358083 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 419640250 ns 418383604.5 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1555951979 ns 1504938187.5 ns 1.03
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22737355 ns 22745060.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/oneAPI 178704645.5 ns
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14749917.5 ns 14695345 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2523536084 ns 2524662875 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1507841417 ns 1518103167 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1511376375 ns 1524361625 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4737339000 ns 4741835375 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 365203370 ns 366822106 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/oneAPI 918052378 ns
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88648218 ns 88277685 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 80375 ns 76417 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 77208 ns 76792 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79437.5 ns 80333 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76771 ns 77208 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 205797 ns 206105.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 5489913 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 106841 ns 118901 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 288667 ns 191562.5 ns 1.51
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 201604 ns 287750 ns 0.70
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 193375 ns 209417 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 284041 ns 253812.5 ns 1.12
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1042751.5 ns 1033097.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 41466377.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 632444 ns 658411 ns 0.96
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 200199833.5 ns 200015521 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 104055854 ns 103790000.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 103865958 ns 104076875 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 388748166 ns 389226000 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5829846 ns 5819295 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/oneAPI 77852403 ns
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3563321 ns 3575713 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 621227396 ns 621801500 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 353126562.5 ns 353125646 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 356165583.5 ns 354434874.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1186170667 ns 1181638875 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26447704 ns 26630294 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/oneAPI 279065786 ns
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21993454.5 ns 22185623 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7167 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 5375 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5375 ns 5375 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 10500 ns 0.94
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27665.5 ns 27436 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1245534 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48550 ns 46631 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213584 ns 212500 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221000 ns 220750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221521 ns 220458 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205853.5 ns 206104.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 221500 ns 220558 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26571515 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 522663 ns 523545 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8500 ns 10541.5 ns 0.81
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7959 ns 9541.5 ns 0.83
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10834 ns 10875 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7729 ns 8312 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 114880 ns 117824.5 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3278469.5 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 69301 ns 70451 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8959 ns 7583.5 ns 1.18
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 12583 ns 9792 ns 1.29
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8667 ns 8187.5 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7354 ns 7562.5 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 516223 ns 515354.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19023334 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 315512 ns 318733 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 25923 ns 26054 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1283315 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 46841 ns 46610 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10250 ns 9083 ns 1.13
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 12458 ns 9604 ns 1.30
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 13521 ns 8958 ns 1.51
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8834 ns 9166 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 250993 ns 252407.5 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 23768700 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 387937 ns 388539 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 106500 ns 107458.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 84042 ns 84708 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 85500 ns 86000 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146479 ns 146750 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 24236.5 ns 23950.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 772749.5 ns
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 190921 ns 191282 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 478166.5 ns 516625 ns 0.93
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 480562.5 ns 502312.5 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 479083 ns 478354.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 478000 ns 498167 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 230930 ns 232559 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11531138 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 609234 ns 606451 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5209 ns 5250 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 7167 ns 6500 ns 1.10
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7521 ns 7749.5 ns 0.97
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 6208.5 ns 5687.5 ns 1.09
batchedmm(16, Bsize=32)/forward/GPU/CUDA 16010 ns 16126.5 ns 0.99
batchedmm(16, Bsize=32)/forward/GPU/oneAPI 69713523 ns
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 79920 ns 85781 ns 0.93
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 12500 ns 11625 ns 1.08
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10500 ns 9917 ns 1.06
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11167 ns 10104.5 ns 1.11
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16562.5 ns 16584 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 211835 ns 215162.5 ns 0.98
batchedmm(16, Bsize=32)/zygote/GPU/oneAPI 94286099.5 ns
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 367732.5 ns 378354 ns 0.97
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39292 ns 38708 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 50167 ns 51125 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 51187.5 ns 52146 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 15917 ns 14417 ns 1.10
batchedmm(16, Bsize=128)/forward/GPU/CUDA 21272 ns 19504 ns 1.09
batchedmm(16, Bsize=128)/forward/GPU/oneAPI 74117102.5 ns
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 87240 ns 93401 ns 0.93
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36834 ns 36334 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 28250 ns 28167 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 29229 ns 28625 ns 1.02
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57916.5 ns 56895.5 ns 1.02
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 190232 ns 190765 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/oneAPI 107961163 ns
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 411983 ns 410848.5 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1958.5 ns 1666.5 ns 1.18
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1709 ns 2000 ns 0.85
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2209 ns 2167 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1812.5 ns 1667 ns 1.09
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 20515 ns 20338 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1206257 ns
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 32750 ns 32440 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2187.5 ns 2042 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2208 ns 2375 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2667 ns 2417 ns 1.10
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2083 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 202549.5 ns 202489 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 9048897 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 137331 ns 136411 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4770.5 ns 6750 ns 0.71
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5125 ns 4833 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6458 ns 5896 ns 1.10
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5250 ns 4916.5 ns 1.07
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 142988.5 ns 142403 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5974748 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 59740 ns 69051 ns 0.87
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8583.5 ns 8395.5 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9209 ns 8625 ns 1.07
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9166.5 ns 8542 ns 1.07
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8417 ns 8292 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 876984.5 ns 858082 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 37578358.5 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 387002.5 ns 388048.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56917 ns 56834 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 56916 ns 56916 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57000 ns 56917 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58250 ns 58291 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37250 ns 37048 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1219604 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 204526.5 ns 204772 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 507437.5 ns 484583.5 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 465749.5 ns 475541.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 468750 ns 465562.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 440500 ns 445666 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 265836 ns 263380 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26749125.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 825546 ns 819218 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3335229 ns 3332458 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1771604 ns 1767958 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 1766375 ns 1766125 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6314750 ns 6295583.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 204661 ns 206330 ns 0.99
batchedmm(128, Bsize=128)/forward/GPU/oneAPI 79139898 ns
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 210542 ns 212392 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11539645.5 ns 11495438 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 6559708 ns 6565688 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 6570000 ns 6570438 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21182291.5 ns 21167562.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 740843 ns 737845 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/oneAPI 121795434.5 ns
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1045497 ns 1062630 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5166 ns 4833 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4708 ns 5583 ns 0.84
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6354.5 ns 7333 ns 0.87
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4875 ns 4500 ns 1.08
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 136797.5 ns 136011 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5676835 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 57551 ns 56600 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9458 ns 7125 ns 1.33
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8667 ns 7500 ns 1.16
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9292 ns 7541.5 ns 1.23
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7292 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 759042 ns 746443 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 34120894 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 365442 ns 370888 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 126917 ns 155000 ns 0.82
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 99542 ns 124709 ns 0.80
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 102958 ns 98541 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 122917 ns 98709 ns 1.25
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 150316.5 ns 150159 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5774988 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 204316 ns 204262 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2033333 ns 2031188 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2041041.5 ns 2031500 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2033645.5 ns 2037125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1999500 ns 2033000 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 701696.5 ns 697162 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31730226 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1112818 ns 1208931 ns 0.92
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 33625 ns 33209 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 35541 ns 34833 ns 1.02
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 33479 ns 33042 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 687.5 ns 541 ns 1.27
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15262 ns 15393 ns 0.99
batchedmm(2, Bsize=4)/forward/GPU/oneAPI 73822850 ns
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 78871 ns 79290 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 3042 ns 2583 ns 1.18
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3333 ns 3083 ns 1.08
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3792 ns 3209 ns 1.18
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2375 ns 2125 ns 1.12
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 138789.5 ns 138753 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/oneAPI 93638716 ns
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 342702 ns 341213 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7250 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5333 ns 5416 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5375 ns 5416 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10417 ns 10458 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36210 ns 36086 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1257374 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48570 ns 49460 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220458 ns 213395.5 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221958 ns 227750 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223709 ns 220792 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206083 ns 205667 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 243944 ns 240787.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26586526.5 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 575589 ns 569246 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3959 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21476 ns 21637 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2064949.5 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 44060 ns 42161 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14625 ns 14625 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14667 ns 14750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14708 ns 14667 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14667 ns 14625 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 307143 ns 307620 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11739084 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 199661 ns 192746.5 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 148416 ns 100834 ns 1.47
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 110916 ns 118500 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 106500 ns 101833 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 118145.5 ns 102417 ns 1.15
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 136427 ns 136873 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5833156 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 204897 ns 205777 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1922624.5 ns 1916625 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1939667 ns 1916542 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1935687.5 ns 1926979 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1921833.5 ns 1898334 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 686994 ns 683667 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 34938407 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1214953.5 ns 1215256.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21500 ns 19000 ns 1.13
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19167 ns 19000 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21167 ns 22250 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17500 ns 16916 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107557 ns 107183.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3518733 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78745.5 ns 78581 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229541 ns 217813 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220625 ns 222833 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217583.5 ns 217417 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221395.5 ns 216770.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 514096.5 ns 512086.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20488413 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 477244 ns 476669.5 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 24583.5 ns 24750 ns 0.99
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 28917 ns 28937.5 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 27000 ns 26875 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1417 ns 1083 ns 1.31
batchedmm(16, Bsize=4)/forward/GPU/CUDA 15648 ns 16054 ns 0.97
batchedmm(16, Bsize=4)/forward/GPU/oneAPI 71452995 ns
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 82161 ns 81581 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4271 ns 4896.5 ns 0.87
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5292 ns 4917 ns 1.08
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5250 ns 5333 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4584 ns 4229 ns 1.08
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 205650 ns 206611 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/oneAPI 89733904.5 ns
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 381703 ns 377863 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 308167 ns 306208 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 305708.5 ns 305084 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 309250 ns 309729.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 307541 ns 307625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 225957 ns 224320 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7672365 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 272732 ns 274612 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 541542 ns 531959 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 535334 ns 543458 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 532770.5 ns 535333.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 540250 ns 542209 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1073264 ns 1058263 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43089373.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 853166 ns 853108 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22291 ns 22084 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19458 ns 21083 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22709 ns 23542 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19209 ns 19459 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 112604 ns 112165.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3598747 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78801 ns 78361 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222854.5 ns 221750 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218771 ns 217666.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214958 ns 224750 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213833 ns 222416 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 736294 ns 732048.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25058593 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 532474 ns 533125 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7042 ns 6958 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6584 ns 6958 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8125 ns 9208 ns 0.88
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6583 ns 6417 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 138568.5 ns 137815 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5792248 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 65790 ns 65160 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10000 ns 9958 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11958 ns 10792 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11041 ns 10541 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10125 ns 9875 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 827774.5 ns 815812 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 38201577.5 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 389552 ns 385314 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5166 ns 4750 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4875 ns 5208 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6750 ns 6271 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4750 ns 5000 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 142238 ns 141314 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5605547 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 68521 ns 66780 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7541 ns 7709 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8125 ns 7916 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7937.5 ns 7875 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7458 ns 7959 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 787250 ns 775695 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 39098967 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 395683 ns 388324 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14589625 ns 14550291 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 7712583 ns 7721375 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 7720979 ns 7712187.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27855709 ns 27857958 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 530491 ns 529799 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/oneAPI 97041475.5 ns
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 394933 ns 389819 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46610729 ns 46686916.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 26604584 ns 26553583 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 26681396 ns 26597104.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85852583 ns 85700209 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2667330 ns 2648481 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/oneAPI 191567319 ns
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3286084 ns 3297251 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66667 ns 66125 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 67042 ns 68667 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 69708.5 ns 70437.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 65625 ns 66917 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 117462 ns 117160.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3751126 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 230716.5 ns 233212 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 441271 ns 455375 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 442500 ns 452500 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 442875 ns 453833.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 452458 ns 441375 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 725359.5 ns 721437 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27248402.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 795126 ns 786047 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 667 ns 0.94
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32377 ns 32085 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1182026 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 47370 ns 47371 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8416 ns 8667 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10958 ns 9042 ns 1.21
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9646 ns 10000 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8708 ns 8458 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 282941 ns 282627 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21204124 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 376463 ns 375423.5 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9791.5 ns 9792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9792 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9833 ns 9792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9833 ns 9833 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 22665 ns 22901 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2028098 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 209912 ns 208212 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45583 ns 45625 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 46584 ns 45958 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46208 ns 45875 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 46458 ns 45917 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 288297 ns 288260 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 9932924.5 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 607414 ns 607426 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56458 ns 56625 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 56458 ns 56833 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 56500 ns 56834 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58000 ns 58250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28171 ns 28250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1190493 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 202902 ns 202042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 449292 ns 496854 ns 0.90
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 507333 ns 504833 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 465354 ns 482959 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 440562.5 ns 434145.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 245614.5 ns 242768 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31644665 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 878146 ns 877308 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 664042 ns 642729 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 637000 ns 659250 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 629167 ns 650437.5 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 654396 ns 609291.5 ns 1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 202857 ns 203473.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8419099 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 306212 ns 309673 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2254395.5 ns 2253979 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2240792 ns 2246042 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2250750 ns 2231375 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2243375 ns 2238292 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 969029 ns 956636.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48566408 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1353880 ns 1324473 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19792 ns 20292 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20520.5 ns 23500 ns 0.87
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23417 ns 24250 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19708 ns 19333 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 112196.5 ns 111824.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3630993 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79390 ns 80571 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223854 ns 271000 ns 0.83
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221437.5 ns 258000 ns 0.86
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221708 ns 231875 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220916 ns 221125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 723806.5 ns 720921 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 24902123 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 550509.5 ns 554706 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 667 ns 0.87
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 22882 ns 22764 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1265621.5 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 47960 ns 47580 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10042 ns 9541 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9708 ns 9625 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10250 ns 10208 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9854.5 ns 9333 ns 1.06
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 265608 ns 264550 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 26022316.5 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 400258 ns 398354 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8250 ns 10750 ns 0.77
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7896.5 ns 8875 ns 0.89
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11875 ns 11125 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8167 ns 8917 ns 0.92
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 118304 ns 117075.5 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3546887 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 70261 ns 69781 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7583.5 ns 7500 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7417 ns 7750 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8000 ns 8083 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7542 ns 7750 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 505629.5 ns 498929 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17345271 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 319743 ns 322428 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1417 ns 1458 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1500 ns 1584 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2000 ns 2000 ns 1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1459 ns 1541 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 21040 ns 20430 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1140242 ns
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 189601 ns 188361 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3292 ns 3292 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3208 ns 3458 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3604.5 ns 3541 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3208 ns 3208 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 220407.5 ns 218522.5 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10098546 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 582305 ns 578345 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 147937.5 ns 148312.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 105916.5 ns 105937.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 106750 ns 108125 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 231896 ns 226084 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 23931 ns 23769 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1227265 ns
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 41170 ns 40471 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 163229 ns 173291.5 ns 0.94
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 88104 ns 104500 ns 0.84
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 87374.5 ns 105208 ns 0.83
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 270771 ns 287062 ns 0.94
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 216773 ns 215904 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10565835 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 267432 ns 268567 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7209 ns 7250 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5375 ns 5333 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5333 ns 5416 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10209 ns 10416 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32955 ns 32778 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1203684 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50371 ns 48640 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222250 ns 226583 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229125 ns 229645.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228208.5 ns 238083 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219667 ns 213229.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 263890 ns 258784 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 29607614.5 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 595145 ns 595636 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 15708 ns 15375 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 14541.5 ns 15125 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 17083.5 ns 16959 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 15208 ns 15083 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 138261.5 ns 137028 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5487375.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 231261 ns 230152 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24292 ns 23500 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24750 ns 24208 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24500 ns 24500 ns 1
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24542 ns 24375 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 869603.5 ns 858623.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 40457803 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 681295 ns 679476 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9167 ns 9750 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9083.5 ns 10104.5 ns 0.90
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12188 ns 11000 ns 1.11
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9291 ns 9084 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 122425.5 ns 120301.5 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 4079012 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 74250 ns 74161 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13917 ns 13875 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13833 ns 14646 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14583 ns 15000 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14125 ns 13958 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 663795 ns 655428 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 22251484 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 367348 ns 366138.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8708 ns 10250 ns 0.85
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9521 ns 10625.5 ns 0.90
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10875 ns 11792 ns 0.92
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8958 ns 9125 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 120869 ns 119866.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3403090 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 73351 ns 72421 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12833 ns 12208 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12625 ns 12791.5 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13354 ns 13084 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13125 ns 12875 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 549586 ns 541791 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19602252 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 341748 ns 341643 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 28437.5 ns 30750 ns 0.92
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 34020.5 ns 32333 ns 1.05
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 29833 ns 29792 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1875 ns 1625 ns 1.15
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16088 ns 16024 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/oneAPI 76292652.5 ns
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 81070 ns 80551 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5187.5 ns 5042 ns 1.03
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 4812.5 ns 5458 ns 0.88
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5166.5 ns 5083 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 7084 ns 6209 ns 1.14
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 138637 ns 139561 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/oneAPI 106277566 ns
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 373178 ns 368314 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 25151 ns 25032 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1206450.5 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48171 ns 46980 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6291 ns 6167 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6584 ns 6666.5 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6833.5 ns 6958 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6333 ns 6125 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 186344.5 ns 184207 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 24314539 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 390138 ns 388954 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 1958 ns 2000 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 1959 ns 2042 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2042 ns 2083 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 1958 ns 1959 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 26102 ns 26042 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1240500 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 206142 ns 204582 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16666.5 ns 17083 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16334 ns 16875 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16667 ns 16896 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16667 ns 16584 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 272991 ns 271146.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 28009585 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 704126 ns 701017 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 151500 ns 147458 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 171687.5 ns 175562.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 153458 ns 153292 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 175417 ns 152541 ns 1.15
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 199266 ns 195620 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8069586.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 216712 ns 226692 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1323833 ns 1323500 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1319020.5 ns 1327791 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1331875 ns 1331125 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1323229 ns 1301042 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 902946 ns 891045 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 45771076 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 999218 ns 1116140.5 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25000 ns 25000 ns 1
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24770.5 ns 24437.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27187 ns 28250 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25000 ns 25979.5 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 235086.5 ns 231362.5 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8000668 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 113461 ns 115561 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 124708.5 ns 178562 ns 0.70
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 129270.5 ns 126166 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 129667 ns 178437.5 ns 0.73
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 117645.5 ns 157500 ns 0.75
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1071982 ns 1053949 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 47350908.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 608575 ns 608216 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 291 ns 334 ns 0.87
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22745 ns 22518 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1191968 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 47400 ns 47580 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6416 ns 6416 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6834 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7000 ns 7020.5 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6333 ns 6417 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 202054 ns 200663 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 26514268.5 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 387828 ns 396354 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5875 ns 7062.5 ns 0.83
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5875 ns 5874.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7542 ns 7791 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6187.5 ns 6791 ns 0.91
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 142993.5 ns 142964.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5749081 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 232471 ns 231792 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10083 ns 10208.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10187.5 ns 10250 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10292 ns 10500 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10334 ns 10333 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 907275.5 ns 887713 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 42579731 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 671686 ns 669276 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 666 ns 667 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 667 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 708 ns 667 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22091 ns 22120 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2176757 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 205902 ns 205382 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4583 ns 4667 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4542 ns 4833 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4833 ns 4833 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4542 ns 4584 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 224542 ns 224988.5 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9538256 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 583495 ns 575835.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7292 ns 8167 ns 0.89
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8000 ns 8437 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9896 ns 9833 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7708 ns 7958 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 120756.5 ns 119167.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3471791 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 75931 ns 74331 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8875 ns 8416 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8812.5 ns 8938 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8958.5 ns 9625 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8458 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 590021 ns 578635 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 20094165 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 343053 ns 344473 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126000 ns 126875 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 95333 ns 97229 ns 0.98
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 96187.5 ns 97333.5 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 183083 ns 183291.5 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 45611 ns 45455.5 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/oneAPI 73061834 ns
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 102841 ns 101051 ns 1.02
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 340542 ns 340292 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 182166 ns 182250 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 179000 ns 191959 ns 0.93
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 619562.5 ns 612416.5 ns 1.01
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 190070 ns 191737 ns 0.99
batchedmm(128, Bsize=4)/zygote/GPU/oneAPI 89832396 ns
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 515839.5 ns 516500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398625 ns 399042 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215375 ns 215417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 215166 ns 215333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756916 ns 756333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 42770 ns 43626 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1389564 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 80551 ns 81280 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1414208.5 ns 1398374.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 864917 ns 864000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 863417 ns 864270.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2359166.5 ns 2358708.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 247351 ns 253991.5 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 10979753.5 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 353233 ns 350903.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 662854 ns 653500 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 673021.5 ns 655916 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 660875 ns 653041.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 686750 ns 622146 ns 1.10
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 201573 ns 201217.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8278348.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 311267.5 ns 306973 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2467270.5 ns 2461125.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2476708 ns 2469625 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2475333 ns 2481375 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2444125 ns 2480333 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 988382.5 ns 998464.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 52533110.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1304020.5 ns 1392463.5 ns 0.94
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 31771.5 ns 32521 ns 0.98
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 34833.5 ns 34291 ns 1.02
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 32917 ns 33084 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 875 ns 833 ns 1.05
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15323 ns 15542.5 ns 0.99
batchedmm(2, Bsize=32)/forward/GPU/oneAPI 70325477 ns
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 83911 ns 78871 ns 1.06
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3062.5 ns 3000 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3167 ns 3417 ns 0.93
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3417 ns 3500 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3208 ns 3042 ns 1.05
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 136612 ns 141700 ns 0.96
batchedmm(2, Bsize=32)/zygote/GPU/oneAPI 91300478 ns
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 354383 ns 337663 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 407500 ns 408916 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 401709 ns 403770.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 402333 ns 404375 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 422000 ns 423959 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 43203 ns 43511.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1412474 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 241512 ns 237932 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3873791.5 ns 3878166.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3988979.5 ns 3999042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3992333 ns 4003416 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3780562 ns 3792395.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 242608 ns 245738 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 38123730 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1241085 ns 1432279 ns 0.87
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3916 ns 3958 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3916 ns 3917 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33762 ns 34288 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1274592 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 38050 ns 37921 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15416 ns 15459 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15458 ns 15666 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15625 ns 15666 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15459 ns 15458 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 251910 ns 258924 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 6676022 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 168912 ns 173651.5 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404875 ns 404583 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 221020.5 ns 220833 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 220958 ns 221125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760834 ns 760833 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112978 ns 113269 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 979708 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 87801 ns 87641 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1428104 ns 1424020.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 889125 ns 888041.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 887167 ns 888875 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2382770.5 ns 2382770.5 ns 1
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 238693 ns 245573 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 10268219 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 355883 ns 354303 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 25472 ns 25789 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1196934 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 206612 ns 204972 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7125 ns 7459 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 7667 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7959 ns 7958 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7166 ns 7250 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 210500 ns 217010.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 27163566 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 685306 ns 692821.5 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 831312.5 ns 832771 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 471083 ns 467416 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 469354 ns 470562.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1551083 ns 1544541 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129725 ns 129883 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/oneAPI 73418877 ns
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 234712 ns 229272 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2690750 ns 2692000 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1540667 ns 1540000 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1536375.5 ns 1542312.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4950854.5 ns 4931479 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 239460 ns 248014 ns 0.97
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI 99877157 ns
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 760537 ns 809797.5 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 250 ns 250 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32058 ns 32644 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1250598 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 46991 ns 47000 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6041 ns 6208 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6562.5 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6916 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6291 ns 6333 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 222876 ns 226410 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 23165966.5 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 358863 ns 357804 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2413250 ns 2407917 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2415084 ns 2401417 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2415646 ns 2386750 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2406584 ns 2392333 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 197621 ns 200791 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7999137 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 374664 ns 374543.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4645167 ns 4663875 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4655833.5 ns 4666063 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4666958 ns 4675291 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4630000 ns 4670208 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 897908 ns 902618 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49208090 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1382441 ns 1376633 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7541.5 ns 6875 ns 1.10
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 22875 ns 7542 ns 3.03
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7292 ns 7250 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6833 ns 6917 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 22723 ns 23477 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1180199 ns
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 40911 ns 39221 ns 1.04
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 66646 ns 32313 ns 2.06
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 43750 ns 49125 ns 0.89
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 48209 ns 49583 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 69604 ns 52291.5 ns 1.33
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 215130 ns 219072.5 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10574968 ns
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 267982 ns 262272 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 20625 ns 21666.5 ns 0.95
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 24770.5 ns 24541.5 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 22917 ns 22416.5 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 6084 ns 5166 ns 1.18
batchedmm(2, Bsize=512)/forward/GPU/CUDA 17467 ns 18191 ns 0.96
batchedmm(2, Bsize=512)/forward/GPU/oneAPI 87440172.5 ns
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 84121 ns 82841 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12458 ns 11979 ns 1.04
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 8895.5 ns 9645.5 ns 0.92
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 9666 ns 9541.5 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 17792 ns 18062.5 ns 0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 226861 ns 231197.5 ns 0.98
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI 140762273.5 ns
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 388443 ns 365714 ns 1.06
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406916 ns 406041 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 223292 ns 223459 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 223417 ns 223375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762917 ns 762584 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 45844 ns 46689.5 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1352402 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 90961 ns 87501 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1428229 ns 1427542 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 891833 ns 894125 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 894708.5 ns 896417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2386770.5 ns 2384229 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 286502 ns 287677.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 11860453.5 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 379253 ns 377703 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 434958 ns 434334 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 430166 ns 430229.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 430250 ns 430333 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 448375 ns 447583 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 54142 ns 55000 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1033716.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 236092 ns 233247 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3910542 ns 3915625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4015625 ns 4018146 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4017125 ns 4025959 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3810042 ns 3782667 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 263657 ns 265792.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30657851 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1222870 ns 1207206.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8708 ns 8750 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 6833 ns 6875 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 6875 ns 6875 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12375 ns 12416 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24200 ns 24680 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2206428 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 208602 ns 210232 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 44792 ns 44583 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 44958 ns 44959 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 44833 ns 44875 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45166 ns 44667 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 344948 ns 349913 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11784282 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 652455 ns 651936 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 108167 ns 119750.5 ns 0.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82916.5 ns 123750 ns 0.67
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 87833.5 ns 89667 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 106145.5 ns 81771 ns 1.30
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190064 ns 189502 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5997800 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 219032 ns 218452 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2024375 ns 2022125 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2011813 ns 2026083 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2008999.5 ns 2027729 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2027292 ns 2023895.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 532552.5 ns 540867 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27552313 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1086649 ns 1089800 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal force-pushed the ap/more_dev branch 2 times, most recently from e8b9675 to e932ee4 Compare September 21, 2024 16:22
src/impl/conv.jl (review thread: outdated, resolved)

codecov bot commented Sep 21, 2024

Codecov Report

Attention: Patch coverage is 34.54545% with 36 lines in your changes missing coverage. Please review.

Project coverage is 79.03%. Comparing base (a6c4a16) to head (0fa961d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/impl/conv.jl | 12.00% | 22 Missing ⚠️ |
| src/impl/batched_mul.jl | 51.85% | 13 Missing ⚠️ |
| src/impl/dropout.jl | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #165       +/-   ##
===========================================
+ Coverage   59.34%   79.03%   +19.68%     
===========================================
  Files          38       38               
  Lines        2022     2065       +43     
===========================================
+ Hits         1200     1632      +432     
+ Misses        822      433      -389     
Flag Coverage Δ
58.54% <34.54%> (?)

Flags with carried forward coverage won't be shown.


@avik-pal avik-pal merged commit 79ed8fe into main Sep 21, 2024
55 of 63 checks passed
@avik-pal avik-pal deleted the ap/more_dev branch September 21, 2024 22:28