This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

perf: fusing activation functions and other misc perf improvements #126

Merged 9 commits on Aug 14, 2024

@avik-pal (Member) commented Aug 12, 2024

Changes

  • bias_activation performance has been fixed.
  • The regression in batchnorm performance has been fixed.
    • check LV for 4d inputs -- not needed, simply batch suffices
    • reverse mode -- can't run in parallel here; it would lead to race conditions
    • Enzyme patch for Polyester
  • Restore running all benchmarks before merging.
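
To illustrate the race-condition point in the reverse-mode bullet above, here is a minimal sketch in plain Python. It is not LuxLib's implementation. The functions `affine_forward` and `affine_backward` are hypothetical names, and the sketch assumes a simplified per-channel affine step `y = x * scale + bias` on a (channels x batch) input rather than full batchnorm. The forward pass writes each output slot exactly once. The reverse pass accumulates per-channel gradients across the whole batch, so splitting the batch loop across threads would make several threads update the same accumulator.

```python
def affine_forward(x, scale, bias):
    """y[c][n] = x[c][n] * scale[c] + bias[c] for a (channels x batch) input."""
    return [[xv * s + b for xv in row] for row, s, b in zip(x, scale, bias)]


def affine_backward(x, scale, dy):
    """Reverse pass: per-channel gradients are reductions over the batch axis."""
    channels, batch = len(x), len(x[0])
    dx = [[0.0] * batch for _ in range(channels)]
    dscale = [0.0] * channels
    dbias = [0.0] * channels
    # The outer loop runs over the batch, the axis one might want to parallelize.
    for n in range(batch):
        for c in range(channels):
            dx[c][n] = dy[c][n] * scale[c]   # one writer per (c, n) slot: safe
            dscale[c] += dy[c][n] * x[c][n]  # shared by every n: would race
            dbias[c] += dy[c][n]             # shared by every n: would race
    return dx, dscale, dbias


# Illustrative values only (2 channels, batch of 3).
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
dy = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
scale, bias = [2.0, 3.0], [1.0, -1.0]

y = affine_forward(x, scale, bias)
dx, dscale, dbias = affine_backward(x, scale, dy)
```

Running the reverse loop serially, as here, avoids the shared writes. Per-thread accumulators followed by a final reduction would be another option, though nothing in this PR says that approach was used.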

@github-actions bot (Contributor) left a comment

LuxLib Benchmarks
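
A note on reading the table that follows: the Ratio column appears to be the Current timing divided by the Previous timing, so values below 1 mean the current commit is faster. The sketch below recomputes that ratio for three rows taken from the table. The `classify` helper and its 1.5 threshold are assumptions made for illustration and are not part of the benchmark tooling.

```python
def ratio(current_ns, previous_ns):
    """Current / Previous, matching the table's Ratio column."""
    return current_ns / previous_ns


def classify(current_ns, previous_ns, threshold=1.5):
    """Label a row; the threshold is a hypothetical cutoff, not the bot's."""
    r = ratio(current_ns, previous_ns)
    if r >= threshold:
        return "slower"
    if r <= 1.0 / threshold:
        return "faster"
    return "similar"


# (name, current ns, previous ns), copied from rows of the table.
rows = [
    ("batchnorm gelu 16x16x4x32 forward CPU/2", 56875, 35792),
    ("bias_activation relu 32x128 forward CPU/2", 1458, 14875),
    ("bias_activation relu 32x128 forward CUDA", 21663, 22126),
]

labels = [classify(cur, prev) for _, cur, prev in rows]
```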

Benchmark suite Current: 19be2ca Previous: 6426043 Ratio (Current / Previous)
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56875 ns 35792 ns 1.59
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57791 ns 29709 ns 1.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57792 ns 29583 ns 1.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58250 ns 55000 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 83410 ns 39146 ns 2.13
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 728971 ns 1178312 ns 0.62
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 506125 ns 661250 ns 0.77
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 356904 ns 206667.5 ns 1.73
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 450459 ns 234833 ns 1.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 465709 ns 249584 ns 1.87
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 498584 ns 202541 ns 2.46
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 434625 ns 345937.5 ns 1.26
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 215832 ns 280223 ns 0.77
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 13199361 ns 27356049 ns 0.48
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 4192500 ns 7836709 ns 0.53
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 817938 ns 808758 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 542 ns 0.54
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 667 ns 0.56
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 292 ns 667 ns 0.44
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 667 ns 0.44
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 26705 ns 26727 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1504569 ns 1153992 ns 1.30
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 340250 ns 454437 ns 0.75
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 49601 ns 48591 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6709 ns 10000 ns 0.67
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7834 ns 10375 ns 0.76
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7125 ns 10667 ns 0.67
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7167 ns 10083 ns 0.71
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 164243.5 ns 201412.5 ns 0.82
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 10082424 ns 24286330 ns 0.42
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 3891979.5 ns 5574313 ns 0.70
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 459015 ns 400079.5 ns 1.15
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 55583 ns 94583 ns 0.59
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46542 ns 97125 ns 0.48
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46708 ns 95625 ns 0.49
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82791 ns 135875 ns 0.61
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40421 ns 41104 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1487630 ns 1318240 ns 1.13
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1157479.5 ns 1152708 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 84251 ns 79551 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1922792 ns 1216750 ns 1.58
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1974166.5 ns 1149500 ns 1.72
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1956146 ns 1236292 ns 1.58
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1900375 ns 1131000 ns 1.68
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 178262.5 ns 239669 ns 0.74
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32212056 ns 32036283 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11372708 ns 10989604 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1021771 ns 1025085.5 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1458 ns 14875 ns 0.098
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1333.5 ns 15875 ns 0.084
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 2042 ns 15458 ns 0.13
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1750 ns 11042 ns 0.16
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21663 ns 22126 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1152516.5 ns 1470687 ns 0.78
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 190000 ns 207208 ns 0.92
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 31531 ns 31530 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4333 ns 14875 ns 0.29
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3687.5 ns 14792 ns 0.25
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4250 ns 14958 ns 0.28
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4458 ns 14625 ns 0.30
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 142909.5 ns 148820 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 9336728 ns 8661230 ns 1.08
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1457791 ns 1532875 ns 0.95
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 146772 ns 151851 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 55333 ns 82708.5 ns 0.67
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46375 ns 81958 ns 0.57
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46250 ns 81500 ns 0.57
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82583.5 ns 136166 ns 0.61
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37284 ns 38471 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 890940 ns 567295 ns 1.57
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1140958 ns 1074604 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78811 ns 85291 ns 0.92
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2024874.5 ns 1238583 ns 1.63
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2091187.5 ns 1220375 ns 1.71
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2081375 ns 1219041.5 ns 1.71
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1990834 ns 1403875 ns 1.42
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 210677 ns 239045 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28000217 ns 8329149 ns 3.36
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11050834 ns 4579959 ns 2.41
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 955840 ns 1418844 ns 0.67
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 16083 ns 0.45
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6167 ns 16209 ns 0.38
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 16084 ns 0.38
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10208 ns 16125 ns 0.63
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37274 ns 37816 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1282363.5 ns 1234872.5 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 358750 ns 390292 ns 0.92
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 46900 ns 48940 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 242041 ns 116667 ns 2.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 236187.5 ns 127250 ns 1.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 249417 ns 117375 ns 2.12
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 246000 ns 112125 ns 2.19
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 186212.5 ns 258303 ns 0.72
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25577862.5 ns 26347832 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7941042 ns 7806875 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 492325 ns 517565 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 1875 ns 3667 ns 0.51
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 1958 ns 3791 ns 0.52
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 1958 ns 3792 ns 0.52
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 1834 ns 4125 ns 0.44
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 27186 ns 27825 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1249457 ns 1170474 ns 1.07
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 310625 ns 467792 ns 0.66
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 209702 ns 209383 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16750 ns 28167 ns 0.59
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18375 ns 28646 ns 0.64
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18000 ns 29021 ns 0.62
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17417 ns 30021 ns 0.58
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 195070 ns 286300.5 ns 0.68
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24793786 ns 23897802 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5385125 ns 5772208 ns 0.93
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 708967 ns 718348 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 55917 ns 82625 ns 0.68
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46167 ns 82020.5 ns 0.56
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46125 ns 81625 ns 0.57
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82750 ns 136209 ns 0.61
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28302 ns 28938 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1451723 ns 999533 ns 1.45
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1147500.5 ns 1151208 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77140 ns 78011 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020520.5 ns 1323604.5 ns 1.53
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2093458.5 ns 1313958 ns 1.59
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2095271 ns 1307041 ns 1.60
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1996708 ns 1534208.5 ns 1.30
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 191196 ns 244351 ns 0.78
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 35662214 ns 36964728 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11437833 ns 11304583 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1046990 ns 1046480 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 542 ns 0.54
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 667 ns 0.56
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 666 ns 0.56
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 625 ns 0.47
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22574 ns 23144 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1305770.5 ns 1208983 ns 1.08
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 446333 ns 342749.5 ns 1.30
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 50210 ns 49360 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7125 ns 10708 ns 0.67
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8292 ns 11542 ns 0.72
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7917 ns 11874.5 ns 0.67
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7583 ns 10250 ns 0.74
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 167233.5 ns 204673 ns 0.82
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 23772286.5 ns 24111144 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5480125 ns 6107958 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 392159 ns 403004 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56083 ns 98375 ns 0.57
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46416 ns 95062.5 ns 0.49
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46479.5 ns 94875 ns 0.49
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82709 ns 136292 ns 0.61
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 49641 ns 51489 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 636733 ns 785195 ns 0.81
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1102375 ns 1104458 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 80306 ns 69496 ns 1.16
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1909916 ns 1136375 ns 1.68
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1972708 ns 1124416 ns 1.75
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1974687.5 ns 1151875 ns 1.71
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1890959 ns 1203396 ns 1.57
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 194221 ns 252720 ns 0.77
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30315655 ns 18838190 ns 1.61
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9917000 ns 9662667 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 921639 ns 929214.5 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 584 ns 1666 ns 0.35
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 812.5 ns 1750 ns 0.46
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 1042 ns 2167 ns 0.48
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 2042 ns 0.31
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20990 ns 20791 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1243750 ns 1175759 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 291979 ns 292959 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 33800 ns 33140 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1458.5 ns 2125 ns 0.69
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1500 ns 2458 ns 0.61
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1583 ns 2417 ns 0.65
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1458 ns 2000 ns 0.73
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 128883 ns 127770.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 9448663 ns 8845531 ns 1.07
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1694437 ns 1561146 ns 1.09
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 132216.5 ns 128606.5 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 291 ns 583 ns 0.50
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 417 ns 625 ns 0.67
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 666 ns 0.56
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 542 ns 0.54
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 35059 ns 36022 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1298446.5 ns 1250216.5 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 284375 ns 377395.5 ns 0.75
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 48351 ns 48550 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7208 ns 9083 ns 0.79
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8333 ns 9000 ns 0.93
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7541 ns 9666 ns 0.78
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7479.5 ns 8812.5 ns 0.85
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 174785 ns 214920 ns 0.81
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20938075 ns 20414049.5 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5002771 ns 4631333.5 ns 1.08
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 378793 ns 375684 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 16500 ns 0.43
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 16584 ns 0.37
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 16459 ns 0.37
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 16500 ns 0.60
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23823 ns 24522 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1334936 ns 1265913 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 604375 ns 645041.5 ns 0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49355.5 ns 47030 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 262667 ns 143416.5 ns 1.83
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 265042 ns 173208 ns 1.53
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 269000 ns 137208 ns 1.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 259417 ns 147333 ns 1.76
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 188465.5 ns 189286 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 35995208 ns 28874599.5 ns 1.25
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8975291.5 ns 8789250 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 567896 ns 615281 ns 0.92
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 14250 ns 19791 ns 0.72
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 17375 ns 19020.5 ns 0.91
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 17292 ns 18520.5 ns 0.93
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15083 ns 20791.5 ns 0.73
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 21895 ns 20764 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1263662.5 ns 1131585 ns 1.12
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 221125 ns 220479 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 31490 ns 26310 ns 1.20
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 10917 ns 18916 ns 0.58
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 9000 ns 18250 ns 0.49
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9250 ns 18292 ns 0.51
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17167 ns 23770.5 ns 0.72
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 175236.5 ns 298346 ns 0.59
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 10227843.5 ns 9570516 ns 1.07
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1611666.5 ns 1582250 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 153642 ns 154912 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 405875 ns 225084 ns 1.80
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 409042 ns 179542 ns 2.28
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 408458 ns 178875 ns 2.28
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 421041 ns 413125 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 43509.5 ns 44222 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1529291 ns 1389820 ns 1.10
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1176145.5 ns 1171416 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 243032.5 ns 240732 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3868354.5 ns 2228062.5 ns 1.74
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3997334 ns 1923687.5 ns 2.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3989791 ns 1926520.5 ns 2.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3776125 ns 3170083 ns 1.19
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 237696 ns 250699.5 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 34681235 ns 36821504.5 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11474291 ns 11786563 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1243523 ns 1238162.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 416 ns 2916 ns 0.14
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 541 ns 2916 ns 0.19
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 500 ns 3000 ns 0.17
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 458 ns 2875 ns 0.16
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 35793 ns 36899 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1251727 ns 1166121 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 284417 ns 277167 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 49510 ns 46391 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9416 ns 18666.5 ns 0.50
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10125 ns 19459 ns 0.52
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10416.5 ns 21333 ns 0.49
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9458 ns 19250.5 ns 0.49
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 253222 ns 269415 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 22069562 ns 17400566.5 ns 1.27
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5234959 ns 5085437 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 377774 ns 377589 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 417 ns 3042 ns 0.14
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 3167 ns 0.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 542 ns 3208 ns 0.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 458 ns 2917 ns 0.16
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31907 ns 33391.5 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1262816 ns 1149211 ns 1.10
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 352604 ns 291209 ns 1.21
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 47470 ns 49490 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10375 ns 19791 ns 0.52
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9875 ns 21125 ns 0.47
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11250 ns 21875 ns 0.51
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10542 ns 19208 ns 0.55
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 276016 ns 302281 ns 0.91
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 22141088 ns 22137697 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5497041.5 ns 5389542 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 394104 ns 393329 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 11750 ns 17416.5 ns 0.67
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 17167 ns 17562.5 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 15916 ns 17458 ns 0.91
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10375 ns 16208 ns 0.64
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 21529 ns 22007 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1147174 ns 1195107 ns 0.96
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 214812.5 ns 212083 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 190712 ns 190302 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32083 ns 27750 ns 1.16
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 31750 ns 24209 ns 1.31
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 31959 ns 24417 ns 1.31
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 31708 ns 40479 ns 0.78
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 287170.5 ns 314835.5 ns 0.91
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11042817 ns 11155150.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1824374.5 ns 1711854 ns 1.07
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 606726 ns 608191.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 500 ns 708 ns 0.71
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 584 ns 792 ns 0.74
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 583 ns 792 ns 0.74
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 500 ns 792 ns 0.63
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 35515 ns 36290 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1229283 ns 1204951 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 275209 ns 278645.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 209282 ns 208802 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8541 ns 10208 ns 0.84
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9270.5 ns 10667 ns 0.87
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8958 ns 10959 ns 0.82
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8000 ns 10042 ns 0.80
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 235494 ns 239441 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21421544.5 ns 21307139.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4941770.5 ns 4985250 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 681512 ns 675322 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 459 ns 750 ns 0.61
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 584 ns 833 ns 0.70
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 583 ns 792 ns 0.74
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 459 ns 792 ns 0.58
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 26600 ns 26516 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1249818.5 ns 1204061.5 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 459896 ns 406521 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 207602 ns 209672 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8229.5 ns 11250 ns 0.73
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9208 ns 12000 ns 0.77
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9041 ns 12458 ns 0.73
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8541 ns 11583 ns 0.74
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 213419 ns 212072 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25815525 ns 25515071 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6065250 ns 5682813 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 712268 ns 706477 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56500 ns 35333 ns 1.60
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57166 ns 29500 ns 1.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57125 ns 29333 ns 1.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57916 ns 54500 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 29232 ns 29741 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1182325 ns 1192558 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 683834 ns 675459 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 206402 ns 206787.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 488792 ns 265687.5 ns 1.84
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 515042 ns 241187.5 ns 2.14
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 473542 ns 218666.5 ns 2.17
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 470250.5 ns 410125 ns 1.15
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 253111 ns 259339 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 35487764 ns 31700012 ns 1.12
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9155875 ns 9604021 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 861719 ns 854999 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 250 ns 542 ns 0.46
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 417 ns 625 ns 0.67
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 625 ns 0.60
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 625 ns 0.47
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32164 ns 32274 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1256267 ns 1211786 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 326770.5 ns 420709 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 49541 ns 52000 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 9500 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7959 ns 10312.5 ns 0.77
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 10687 ns 0.74
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7625 ns 9291.5 ns 0.82
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 227699 ns 229200 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22647102 ns 21880902 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4872083.5 ns 5109250 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 371894 ns 376694 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7166 ns 181375 ns 0.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7250 ns 187729 ns 0.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7416 ns 188875 ns 0.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6750 ns 143459 ns 0.05
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 24162 ns 23468 ns 1.03
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1203777 ns 1196290 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 301979 ns 273792 ns 1.10
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 33170 ns 39211 ns 0.85
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 33625 ns 193458 ns 0.17
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 66396 ns 191708.5 ns 0.35
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33500 ns 204937.5 ns 0.16
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 33000 ns 226500 ns 0.15
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 233166.5 ns 238377 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10575996 ns 10900737.5 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2017104 ns 2034417 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 234692.5 ns 226092 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7125 ns 15792 ns 0.45
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6083 ns 16209 ns 0.38
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 15834 ns 0.38
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10229.5 ns 15667 ns 0.65
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28701.5 ns 28554 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1196816.5 ns 1218212.5 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 592416.5 ns 596542 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49390.5 ns 50301 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219000 ns 124792 ns 1.75
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 225084 ns 126229 ns 1.78
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 224750 ns 127334 ns 1.77
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 246625 ns 175500.5 ns 1.41
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 230047 ns 242364.5 ns 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 32871289 ns 30677520 ns 1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9026938 ns 9028500 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 555871 ns 534385 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 432917 ns 240250 ns 1.80
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 436750 ns 190333 ns 2.29
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 436334 ns 190709 ns 2.29
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 446875 ns 442042 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 54568 ns 55862 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 635923 ns 1016111.5 ns 0.63
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1100979.5 ns 1126667 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 232652 ns 237223 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3899020.5 ns 2144084 ns 1.82
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4028958 ns 1859750 ns 2.17
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4022646 ns 1848416 ns 2.18
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3809125 ns 2959083.5 ns 1.29
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 269010 ns 270815 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31746518 ns 31291796.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10010333 ns 10183875 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1235103 ns 1241163 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 458 ns 2833 ns 0.16
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 542 ns 2916 ns 0.19
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 459 ns 2917 ns 0.16
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 458 ns 2917 ns 0.16
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 26952 ns 27379 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1263081 ns 1192304 ns 1.06
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 466396.5 ns 454792 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 50560 ns 48561 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10250 ns 23625.5 ns 0.43
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10125 ns 23375 ns 0.43
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10208 ns 23750 ns 0.43
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10250 ns 23083.5 ns 0.44
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 269464 ns 273919 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 24949123 ns 22736189.5 ns 1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5831896 ns 5829834 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 390684 ns 399054 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 104334 ns 212479.5 ns 0.49
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 99271 ns 214125 ns 0.46
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 100542 ns 214333 ns 0.47
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146750 ns 221166.5 ns 0.66
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 24354 ns 24897 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1176994 ns 1188077 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 306874.5 ns 265291 ns 1.16
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 190281.5 ns 190962 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 478687.5 ns 417917 ns 1.15
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 527979 ns 341625 ns 1.55
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 520333 ns 334687.5 ns 1.55
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 529708.5 ns 602250 ns 0.88
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 251460.5 ns 255547 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11632878 ns 12372769.5 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2160750 ns 2125916.5 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 591085 ns 625836 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 3083 ns 0.16
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 3208 ns 0.18
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 3209 ns 0.18
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 459 ns 3000 ns 0.15
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 22776 ns 23558 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1211728 ns 1247062 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 473270.5 ns 331458 ns 1.43
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 50160 ns 50101 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10334 ns 24625 ns 0.42
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 11625 ns 25333 ns 0.46
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 11042 ns 26791 ns 0.41
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10937.5 ns 23854.5 ns 0.46
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 275538 ns 279705 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 28784070 ns 24035981 ns 1.20
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6178625 ns 5959937.5 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 400298 ns 416944 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1416 ns 2208.5 ns 0.64
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1584 ns 2292 ns 0.69
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 2500 ns 0.75
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1417 ns 2250 ns 0.63
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 20996 ns 21285 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1185953 ns 1169943.5 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 321396 ns 299791 ns 1.07
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 192441 ns 192672 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3250 ns 3250 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3417 ns 3292 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3458 ns 3250 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3291.5 ns 3750 ns 0.88
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 240693 ns 241487 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10013228.5 ns 10275500 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1819625 ns 1643396 ns 1.11
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 596385 ns 596106 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 148709 ns 250417 ns 0.59
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 128521 ns 243208.5 ns 0.53
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 129896 ns 239875 ns 0.54
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 241333 ns 287979 ns 0.84
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 24268 ns 24789 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1208779 ns 1167987 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 304437.5 ns 269958 ns 1.13
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 33520 ns 36631 ns 0.92
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 143459 ns 265312.5 ns 0.54
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 126458 ns 242729 ns 0.52
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 110708 ns 240334 ns 0.46
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 270187.5 ns 353541 ns 0.76
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 232829 ns 241464 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10416372 ns 10648580.5 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2083666 ns 1971396 ns 1.06
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 235677 ns 223693 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 16792 ns 0.43
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 16750 ns 0.37
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 16750 ns 0.36
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10208 ns 16208 ns 0.63
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32749 ns 33837 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1189766 ns 1266052 ns 0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 692875 ns 614000 ns 1.13
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49550 ns 50530 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221312.5 ns 125750 ns 1.76
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 241041 ns 152812 ns 1.58
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 267792 ns 127979.5 ns 2.09
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214187.5 ns 134208.5 ns 1.60
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 267710.5 ns 273789 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 28987559 ns 27569153 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7979708 ns 8173209 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 513334 ns 534905 ns 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 1958 ns 3666 ns 0.53
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2041 ns 3750 ns 0.54
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2042 ns 3791 ns 0.54
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 1917 ns 4125 ns 0.46
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 36286 ns 37434 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1191308 ns 1210446.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 445584 ns 432542 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 209522 ns 209622 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 16750 ns 23667 ns 0.71
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 17000 ns 23709 ns 0.72
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 17375 ns 24479.5 ns 0.71
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 16792 ns 25542 ns 0.66
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 305312 ns 309240 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 23922335 ns 20107132 ns 1.19
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5404354.5 ns 4810166 ns 1.12
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 694705.5 ns 692227 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1791 ns 2250 ns 0.80
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1958 ns 2542 ns 0.77
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2125 ns 2625 ns 0.81
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1750 ns 2375 ns 0.74
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 21206.5 ns 20498 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1124010 ns 1138479 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 314500 ns 312375 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 28431 ns 29111 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2166 ns 2667 ns 0.81
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 3145.5 ns 0.69
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2250 ns 3125 ns 0.72
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2667 ns 0.80
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 223080 ns 224596.5 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 9277099 ns 9000358 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1575354.5 ns 1476583.5 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 141301 ns 138381.5 ns 1.02

This comment was automatically generated by workflow using github-action-benchmark.

codecov bot commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 88.19876% with 19 lines in your changes missing coverage. Please review.

Project coverage is 82.78%. Comparing base (6a33ba1) to head (fed6eac).

Files Patch % Lines
src/impl/batchnorm.jl 86.79% 14 Missing ⚠️
src/impl/bias_activation.jl 88.37% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
+ Coverage   80.59%   82.78%   +2.18%     
==========================================
  Files          37       37              
  Lines        1737     1795      +58     
==========================================
+ Hits         1400     1486      +86     
+ Misses        337      309      -28     

@avik-pal avik-pal force-pushed the ap/act_fuse2 branch 3 times, most recently from b88a219 to 1cee769 August 13, 2024 15:00
@avik-pal avik-pal marked this pull request as ready for review August 14, 2024 05:31
@avik-pal avik-pal merged commit 8caed34 into main Aug 14, 2024
59 of 68 checks passed
@avik-pal avik-pal deleted the ap/act_fuse2 branch August 14, 2024 06:01