This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
feat: use fallback GPU implementations with warnings #165
Merged
avik-pal changed the title from "ci: update buildkite settings" to "feat: use fallback GPU implementations with warnings" on Sep 21, 2024
LuxLib Benchmarks
Benchmark suite | Current: 0fa961d | Previous: a6c4a16 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7000 ns | 5666 ns | 1.24 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5667 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8125 ns | 7062.5 ns | 1.15 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5833 ns | 5541.5 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 114370 ns | 117778 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2728634 ns | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 403974 ns | 404275 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9854.5 ns | 9937.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10062 ns | 10041 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10083.5 ns | 10291 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10021 ns | 9875 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 542155 ns | 544239 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18579997 ns | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 688097 ns | 11501326 ns | 0.059827623354037615 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1895.5 ns | 1416.5 ns | 1.34 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 2458 ns | 1479 ns | 1.66 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1833 ns | 1625 ns | 1.13 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3312.5 ns | 1542 ns | 2.15 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 21595 ns | 21518 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1311721 ns | | |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 29250 ns | 29030 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4375 ns | 4250 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4104 ns | 4333 ns | 0.95 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4292 ns | 4313 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4500 ns | 4459 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 144308.5 ns | 145904.5 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 9573121.5 ns | | |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 150842 ns | 145511 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58292 ns | 58625 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 40667 ns | 39750 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 40084 ns | 40042 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83708 ns | 83395.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37142 ns | 37436 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 554486 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 78351 ns | 80685.5 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2029125 ns | 2046125 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2082875 ns | 2077896 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2076625 ns | 2083625.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1999854 ns | 1999104 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 226507.5 ns | 229936 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 7112064 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1126831 ns | 1490545 ns | 0.76 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 181729 ns | 162312.5 ns | 1.12 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 146124.5 ns | 164083 ns | 0.89 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 174625 ns | 174959 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 145917 ns | 153854 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165233.5 ns | 166305 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7170202 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 204932 ns | 198262 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1123062 ns | 1121458.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1118333 ns | 1114979 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1115334 ns | 1119209 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1114333.5 ns | 1123521 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 696646 ns | 696644 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 34659049.5 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1026230 ns | 1026480.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4625 ns | 4875 ns | 0.95 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4334 ns | 4916 ns | 0.88 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6583 ns | 5875 ns | 1.12 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4584 ns | 5375 ns | 0.85 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 90392 ns | 92112 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5213655 ns | | |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 67531 ns | 69791 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8875 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8583 ns | 8917 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8875 ns | 8959 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8833 ns | 8625 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 607432 ns | 596620 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 33741334 ns | | |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 385194 ns | 389954 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20687.5 ns | 18312 ns | 1.13 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18167 ns | 18104.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21166 ns | 20021 ns | 1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17375 ns | 17771 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 65601 ns | 67875.5 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 2651220 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76505.5 ns | 77581 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212250 ns | 235917 ns | 0.90 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222167 ns | 212458 ns | 1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214083 ns | 213667 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217000 ns | 225292 ns | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 349996 ns | 353373 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 11608614 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 465179.5 ns | 470510 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 667 ns | 708 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 708 ns | 625 ns | 1.13 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 833 ns | 959 ns | 0.87 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 729.5 ns | 0.91 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 20444 ns | 20362 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1169655 ns | | |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 31240 ns | 32440 ns | 0.96 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1521 ns | 1375 ns | 1.11 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1416 ns | 1458 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1562.5 ns | 1459 ns | 1.07 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1479.5 ns | 1375 ns | 1.08 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 124121.5 ns | 125347.5 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 8377334 ns | | |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 136902 ns | 135651 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7458 ns | 7458 ns | 1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5417 ns | 5292 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5459 ns | 5458 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10250 ns | 10416 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23727 ns | 24280.5 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1272526 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 47160 ns | 48481 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219666 ns | 256833 ns | 0.86 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 266979 ns | 268834 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 240208 ns | 238167 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213500 ns | 213521 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 191900 ns | 190543 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 30716086 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 641871 ns | 644671.5 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4084 ns | 4125 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4125 ns | 4083 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4125 ns | 4084 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4083 ns | 4083 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23269 ns | 23269 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 1962626 ns | | |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 46830 ns | 48260 ns | 0.97 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16583 ns | 16542 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16417 ns | 16542 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16834 ns | 16833 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16583 ns | 16583 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 192011.5 ns | 195985.5 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 10074326 ns | | |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 172192 ns | 174616.5 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 511750 ns | 511667 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 332166 ns | 331875 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 332333 ns | 332042 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 864834 ns | 865458 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113501.5 ns | 113196 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 392820.5 ns | | |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 243122 ns | 243182 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2281875 ns | 2277833 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1757125 ns | 1758208 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1748833 ns | 1758041.5 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3194666 ns | 3193625 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 236911 ns | 242653 ns | 0.98 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 9329632 ns | | |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 745737 ns | 741122 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6583 ns | 6396 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6500 ns | 7021 ns | 0.93 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6896 ns | 7583 ns | 0.91 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6208 ns | 6084 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 90670 ns | 90386 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5216804 ns | | |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 65650 ns | 65841 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11146 ns | 11812 ns | 0.94 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12125 ns | 11729.5 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12125 ns | 12250 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11792 ns | 10125 ns | 1.16 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 654623 ns | 626387 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 38049525 ns | | |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 410264 ns | 405759 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 541 ns | 542 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23122 ns | 23421 ns | 0.99 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2086193.5 ns | | |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 47250 ns | 46570 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2083 ns | 2083 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2084 ns | 2208 ns | 0.94 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2167 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2083 ns | 2084 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 220521 ns | 221475.5 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 11675105 ns | | |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 174296.5 ns | 174101.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8583 ns | 9041 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9124.5 ns | 9292 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10375 ns | 10375 ns | 1 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8625 ns | 9000 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 108876.5 ns | 94379 ns | 1.15 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 2938947.5 ns | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 73720 ns | 72281 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17958 ns | 17375 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 18021.5 ns | 17729 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 20166.5 ns | 19209 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16792 ns | 17562.5 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 610846 ns | 576225.5 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 16134747 ns | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 379093 ns | 378363 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 542 ns | 0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 34896 ns | 35667 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1175218 ns | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 46220 ns | 46061 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10292 ns | 10687.5 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9333 ns | 9083.5 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9520.5 ns | 9750 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9937.5 ns | 8666.5 ns | 1.15 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 263509 ns | 258995 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18433346 ns | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 369503 ns | 366948.5 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 398417 ns | 399292 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 215333 ns | 215291 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215708.5 ns | 215292 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756042 ns | 756083 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112043 ns | 113061 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 324163 ns | | |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 75161 ns | 74731 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1403583 ns | 1407958 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 860500 ns | 860333 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 860292 ns | 860854 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2356250 ns | 2357500 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 204832 ns | 211180.5 ns | 0.97 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10824254 ns | | |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 324103 ns | 323393 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7083.5 ns | 7125 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7208.5 ns | 7542 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8750 ns | 9000 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6937.5 ns | 7250.5 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 144097 ns | 143379.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5731995 ns | | |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 66151 ns | 66420 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14270.5 ns | 15250 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13167 ns | 14959 ns | 0.88 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15750 ns | 13687.5 ns | 1.15 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14646.5 ns | 12333.5 ns | 1.19 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 965886 ns | 942342 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41096887.5 ns | | |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 426274 ns | 425844 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26250 ns | 24646 ns | 1.07 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 28521 ns | 28000 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27312.5 ns | 26666 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 28791.5 ns | 28334 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 197086.5 ns | 199235 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7328154 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 114161 ns | 114286.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 156458 ns | 153084 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 147291 ns | 157166.5 ns | 0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 116187.5 ns | 145958.5 ns | 0.80 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 118500 ns | 153417 ns | 0.77 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1069974 ns | 1075111 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42550755 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 588095.5 ns | 585190.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 73792 ns | 76625 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 77000 ns | 76729 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 77083 ns | 81229 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 78959 ns | 79750 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 203286 ns | 206416.5 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 6928996 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 128061 ns | 129541 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 298125 ns | 307729 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 317958 ns | 294250 ns | 1.08 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 304042 ns | 290520.5 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 318041 ns | 291458 ns | 1.09 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1106723 ns | 1105738.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 39233924 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 690307 ns | 696697 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16875 ns | 16875 ns | 1 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 16750 ns | 16500 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 18917 ns | 18375 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 16709 ns | 17584 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 144249.5 ns | 145532.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5646190 ns | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 233552 ns | 232517.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27208.5 ns | 27125 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27896 ns | 26750 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27812.5 ns | 27208 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 28166.5 ns | 26604 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 976554 ns | 980431.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 43689238 ns | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 694906 ns | 686517 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11042 ns | 11625 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 10937.5 ns | 12250 ns | 0.89 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13458 ns | 13875 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11125 ns | 10458 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 122570.5 ns | 123683.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3608787 ns | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 235652 ns | 236852 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 22292 ns | 22709 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21958 ns | 22063 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22333 ns | 23083 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21541 ns | 21833 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 698612.5 ns | 703893 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21406667.5 ns | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 662166 ns | 673557 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 64250 ns | 64250 ns | 1 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 69042 ns | 69208 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 67209 ns | 65937.5 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 65375 ns | 63250 ns | 1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 105516 ns | 107264.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3470032.5 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 232633 ns | 232543 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 482000 ns | 457334 ns | 1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 450374.5 ns | 450791 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 438729.5 ns | 449333.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 443791 ns | 488708 ns | 0.91 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 510352 ns | 515904.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 22832245.5 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 715037 ns | 701456.5 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7062.5 ns | 7333.5 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7292 ns | 7750 ns | 0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8834 ns | 9208 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7500 ns | 6979 ns | 1.07 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 142341 ns | 144382.5 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5428957 ns | | |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 65041 ns | 65051 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13395.5 ns | 14354.5 ns | 0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16062.5 ns | 15459 ns | 1.04 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15208 ns | 15000 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14145.5 ns | 15604 ns | 0.91 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 944536 ns | 949171 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 37204219 ns | | |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 399994 ns | 399874 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6182312.5 ns | 6153958.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 3226458 ns | 3225750 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 3224104 ns | 3225687.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11924667 ns | 11912750 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 351177 ns | 350232.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/oneAPI | 50010265 ns | | |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 318033 ns | 320283 ns | 0.99 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19136666.5 ns | 19165042 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 11136625 ns | 11087125 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 11100292 ns | 11132791 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36502291.5 ns | 36531187.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1100585.5 ns | 1015711 ns | 1.08 |
batchedmm(512, Bsize=4)/zygote/GPU/oneAPI | 76536164 ns | | |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1166521.5 ns | 1168797 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
958 ns |
958 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
958 ns |
1000 ns |
0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1000 ns |
1000 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
958 ns |
917 ns |
1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
22979 ns |
23879 ns |
0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI |
1957397.5 ns |
||
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
208262 ns |
206962 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
3667 ns |
3667 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
3666 ns |
3750 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
3709 ns |
3709 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
3667 ns |
3667 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
278006 ns |
284113 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
10709575 ns |
||
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
625756 ns |
623016 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7937.5 ns |
8312.5 ns |
0.95 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8229.5 ns |
8604.5 ns |
0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
9375 ns |
10083 ns |
0.93 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7750 ns |
8146 ns |
0.95 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
119006.5 ns |
119881.5 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3406755 ns |
||
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
74061 ns |
71901 ns |
1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
12187.5 ns |
12166.5 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
12750 ns |
12145.5 ns |
1.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
14083 ns |
13313 ns |
1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
11333 ns |
11395.5 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
637337.5 ns |
642520 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
21563096 ns |
||
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
355258.5 ns |
357894 ns |
0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
291 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
291 ns |
291 ns |
1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
21948 ns |
22935 ns |
0.96 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI |
2084333.5 ns |
||
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU |
46950 ns |
46631 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2833 ns |
2917 ns |
0.97 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2833 ns |
2917 ns |
0.97 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
3250 ns |
3167 ns |
1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2833 ns |
2958 ns |
0.96 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
200292.5 ns |
206899.5 ns |
0.97 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI |
9839820 ns |
||
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
155592 ns |
161012 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
12792 ns |
12500 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
11834 ns |
11354 ns |
1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
12833 ns |
13083 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
11167 ns |
10958.5 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
119893 ns |
121271 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3241253.5 ns |
||
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
236063 ns |
233822 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
20708.5 ns |
20291.5 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
21292 ns |
21083 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
21250 ns |
22187.5 ns |
0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
21312.5 ns |
20104.5 ns |
1.06 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
590891.5 ns |
597659.5 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
20607237 ns |
||
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
650536 ns |
638656 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4375 ns |
4375 ns |
1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4375 ns |
4417 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4417 ns |
4417 ns |
1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4375 ns |
4416 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
23788 ns |
24156 ns |
0.98 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI |
2147602 ns |
||
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU |
49121 ns |
47331 ns |
1.04 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16584 ns |
16167 ns |
1.03 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16416 ns |
16375 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16416 ns |
16333 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16417 ns |
16333 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
327093.5 ns |
333657 ns |
0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI |
12732239 ns |
||
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
208292 ns |
207757 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
2000 ns |
2125 ns |
0.94 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
1958 ns |
2125 ns |
0.92 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
2125 ns |
2084 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
2041 ns |
2041 ns |
1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
35579 ns |
36462 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1202245.5 ns |
||
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
204436.5 ns |
202982 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
17042 ns |
17021 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
18083.5 ns |
17625 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
19125 ns |
16667 ns |
1.15 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
16333 ns |
17083.5 ns |
0.96 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
291105 ns |
296284 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19886230 ns |
||
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
689131 ns |
684797 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
59229 ns |
59562.5 ns |
0.99 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
60125 ns |
61667 ns |
0.97 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
61459 ns |
61875 ns |
0.99 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
51375 ns |
50958 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
66363 ns |
66679 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/GPU/oneAPI |
86637572 ns |
||
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU |
115381 ns |
117392 ns |
0.98 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
189770.5 ns |
190771 ns |
0.99 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
148917 ns |
149541 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
97708.5 ns |
116312.5 ns |
0.84 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
309208 ns |
298166 ns |
1.04 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
214334 ns |
219498 ns |
0.98 |
batchedmm(16, Bsize=512)/zygote/GPU/oneAPI |
148722833 ns |
||
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU |
614056 ns |
614646 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
86604.5 ns |
83166.5 ns |
1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
84500 ns |
83395.5 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
86375 ns |
110041.5 ns |
0.78 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
85708 ns |
83020.5 ns |
1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
192894.5 ns |
190710.5 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5258216.5 ns |
||
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
204282 ns |
206032 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1905792 ns |
1873645.5 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1892750 ns |
1919416 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1900750 ns |
1920792 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1870625 ns |
1919291.5 ns |
0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
526027 ns |
533490 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
26753507.5 ns |
||
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1068255.5 ns |
1074210 ns |
0.99 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
21314 ns |
21800 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI |
2080980.5 ns |
||
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU |
41780 ns |
43000 ns |
0.97 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1792 ns |
1792 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1792 ns |
1875 ns |
0.96 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1833 ns |
1833 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1791 ns |
1792 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
250042 ns |
256181.5 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI |
9716773.5 ns |
||
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
181572 ns |
182412 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
8500 ns |
8458 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
8792 ns |
9958 ns |
0.88 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
11521 ns |
11708 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
9042 ns |
7583 ns |
1.19 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
116760 ns |
119063.5 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3339276 ns |
||
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
234131 ns |
234272.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9000 ns |
9208 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9750 ns |
9854 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9166 ns |
9792 ns |
0.94 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9500 ns |
8750 ns |
1.09 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
520709 ns |
528065 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20652008 ns |
||
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
627683 ns |
634101 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58417 ns |
58208 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
40333 ns |
39375 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
39542 ns |
39959 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83750 ns |
83291 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
38752 ns |
39916.5 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1328063 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
75841 ns |
79101 ns |
0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1920770.5 ns |
1906833 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1928521 ns |
1969916.5 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1954583.5 ns |
1979458 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1888000.5 ns |
1901458 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
219080 ns |
221725 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
32011918 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1015755 ns |
1161491.5 ns |
0.87 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
415417 ns |
417125 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
417584 ns |
420562.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
420917 ns |
422103.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
422208 ns |
417979 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
206380 ns |
210226 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7709399 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
282222 ns |
283213 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
669542 ns |
680083.5 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
741000 ns |
675125 ns |
1.10 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
669750 ns |
672375 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
769625 ns |
672542 ns |
1.14 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1044128.5 ns |
1049720 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
42252360 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
905964 ns |
908698.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
3364729 ns |
3405187.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
3444583 ns |
3449917 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
3466021 ns |
3463646 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
3450375 ns |
3430687 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
167737 ns |
170640 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8364086 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
445672.5 ns |
450759.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
6249791.5 ns |
6244167 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
6232833.5 ns |
6219417 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
6233520.5 ns |
6254812 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
6214208 ns |
6201688 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
983988.5 ns |
1001354 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
52841091 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1551067 ns |
1637156.5 ns |
0.95 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
474625 ns |
474833 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
254209 ns |
253792 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
253250 ns |
253584 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
902875 ns |
901250 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
46316 ns |
47396 ns |
0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI |
392289 ns |
||
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
242651 ns |
241892 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
2261542 ns |
2269791 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1763417 ns |
1760416 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1753000 ns |
1763687.5 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
3203417 ns |
3197937.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
253834.5 ns |
271388 ns |
0.94 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
14983409 ns |
||
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
769568.5 ns |
765898 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58042 ns |
58541 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
39750 ns |
39292 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
39625 ns |
39792 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83875 ns |
84166 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
27776 ns |
28606 ns |
0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1322576.5 ns |
||
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
75170 ns |
73921 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2024708.5 ns |
2031396 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2069229.5 ns |
2088958.5 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2074583 ns |
2084000 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1987416.5 ns |
1977812.5 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
232148 ns |
235137 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
37325860 ns |
||
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1193936 ns |
1110895.5 ns |
1.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58979.5 ns |
58667 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
40042 ns |
39833 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
39833 ns |
40000 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83416 ns |
83291 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
48621.5 ns |
49806.5 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
819083.5 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
75250 ns |
76691 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1894958.5 ns |
1930083.5 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1945875 ns |
1967645.5 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1967333 ns |
1961750 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1890354.5 ns |
1797166 ns |
1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
237723.5 ns |
240260.5 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
17306002 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1038985 ns |
929734.5 ns |
1.12 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
250 ns |
1.17 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
333 ns |
375 ns |
0.89 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
416 ns |
416 ns |
1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
34420 ns |
35036 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
1184520 ns |
||
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
46480 ns |
46470 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7166 ns |
7584 ns |
0.94 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6875 ns |
6875 ns |
1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7000 ns |
7458 ns |
0.94 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6958 ns |
5916 ns |
1.18 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
209806 ns |
213960 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
21802968 ns |
||
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
362502 ns |
368994 ns |
0.98 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
250 ns |
250 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
250 ns |
292 ns |
0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
250 ns |
250 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
32262 ns |
33302 ns |
0.97 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI |
1231644 ns |
||
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU |
36940 ns |
36481 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
2917 ns |
2959 ns |
0.99 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
2709 ns |
3083 ns |
0.88 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
3041 ns |
3042 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
2750 ns |
2625 ns |
1.05 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
186134 ns |
192793 ns |
0.97 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI |
7394765 ns |
||
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
149051 ns |
151232 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
455666.5 ns |
420458.5 ns |
1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
457083.5 ns |
458333.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
425937 ns |
443562.5 ns |
0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
449041.5 ns |
454625 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
137203 ns |
138662 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5989310 ns |
||
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
377397 ns |
376564 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3794292 ns |
3808250 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3826895.5 ns |
3812458 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3819042 ns |
3814333.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3722562.5 ns |
3779687.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
704611 ns |
712866 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
32030017 ns |
||
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1471758 ns |
1464519 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
49940500 ns |
49902208 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
26001417 ns |
26041000 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
26005875 ns |
26000917 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
97063333 ns |
97099875 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1599184.5 ns |
1600470 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/oneAPI |
55216417.5 ns |
||
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU |
1046316 ns |
1045150 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
154651354.5 ns |
154793291.5 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
89413687.5 ns |
88667041.5 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
89414417 ns |
89550541 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
295512500 ns |
294974291.5 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6483995 ns |
6495543 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/oneAPI |
127440010 ns |
||
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU |
5576760 ns |
5606170 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
17208 ns |
18750 ns |
0.92 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
16208 ns |
15666.5 ns |
1.03 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
15375 ns |
14167 ns |
1.09 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
15583 ns |
15270.5 ns |
1.02 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
20249 ns |
20352.5 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI |
1258826 ns |
||
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU |
27331 ns |
25851 ns |
1.06 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
11041 ns |
11041 ns |
1 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
7667 ns |
7833 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
8166.5 ns |
7958 ns |
1.03 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
17562 ns |
17083 ns |
1.03 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
257898 ns |
261162.5 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI |
9716305 ns |
||
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU |
149465.5 ns |
148401.5 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
9041.5 ns |
8375 ns |
1.08 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8916.5 ns |
9083 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
9625 ns |
10583 ns |
0.91 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
8250 ns |
7916.5 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
114056.5 ns |
113294.5 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3436422 ns |
||
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
234541 ns |
234072 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10083 ns |
10521 ns |
0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10854.5 ns |
10416.5 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
11041.5 ns |
10042 ns |
1.10 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10083 ns |
9666.5 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
616548 ns |
615911 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
23702900 ns |
||
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
656253.5 ns |
655506 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
9354 ns |
9625 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
9146 ns |
9833 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
11459 ns |
12042 ns |
0.95 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
9708 ns |
8479 ns |
1.14 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
119534.5 ns |
120314 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3401114 ns |
||
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
72100 ns |
71931 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
15084 ns |
13083 ns |
1.15 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
17854.5 ns |
15021 ns |
1.19 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
14541 ns |
14542 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
16958 ns |
13417 ns |
1.26 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
586032 ns |
587303 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
18941794 ns |
||
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
342782 ns |
344908.5 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
541 ns |
459 ns |
1.18 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
500 ns |
583 ns |
0.86 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
583 ns |
542 ns |
1.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
500 ns |
459 ns |
1.09 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
34336 ns |
34757 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1224632 ns |
||
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
204531 ns |
201632 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8333 ns |
7333.5 ns |
1.14 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9104 ns |
9270.5 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
10375 ns |
7833 ns |
1.32 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7542 ns |
7229.5 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
229285 ns |
231923.5 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
22054400 ns |
||
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
662954 ns |
657851 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
15333 ns |
15875 ns |
0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
14875 ns |
14645.5 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
12833.5 ns |
12167 ns |
1.05 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
10459 ns |
10375 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
21521.5 ns |
21214 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1153921 ns |
||
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
188256.5 ns |
184672 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
31708 ns |
31375 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
32187 ns |
32416 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
32375 ns |
32270.5 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
32229.5 ns |
31541 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
273410 ns |
276539 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11009636 ns |
||
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
593353 ns |
588126 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
442750 ns |
444792 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
447875 ns |
484417 ns |
0.92 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
443250 ns |
448792 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
443417 ns |
443250 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
194791 ns |
194813 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5825309.5 ns |
||
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
369882 ns |
367924 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3838166.5 ns |
3843833 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3841000 ns |
3831916.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3835749.5 ns |
3838417 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3829458 ns |
3835042 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
539590 ns |
537386 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
28901903 ns |
||
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1361922.5 ns |
1358632 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
779827458 ns |
784101083 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
415957437.5 ns |
418358083 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
419640250 ns |
418383604.5 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
1555951979 ns |
1504938187.5 ns |
1.03 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22737355 ns |
22745060.5 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/oneAPI |
178704645.5 ns |
||
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU |
14749917.5 ns |
14695345 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
2523536084 ns |
2524662875 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
1507841417 ns |
1518103167 ns |
0.99 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
1511376375 ns |
1524361625 ns |
0.99 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
4737339000 ns |
4741835375 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
365203370 ns |
366822106 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/oneAPI |
918052378 ns |
||
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU |
88648218 ns |
88277685 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
80375 ns |
76417 ns |
1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
77208 ns |
76792 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
79437.5 ns |
80333 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
76771 ns |
77208 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
205797 ns |
206105.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
5489913 ns |
||
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
106841 ns |
118901 ns |
0.90 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
288667 ns |
191562.5 ns |
1.51 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
201604 ns |
287750 ns |
0.70 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
193375 ns |
209417 ns |
0.92 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
284041 ns |
253812.5 ns |
1.12 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1042751.5 ns |
1033097.5 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
41466377.5 ns |
||
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
632444 ns |
658411 ns |
0.96 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
200199833.5 ns |
200015521 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
104055854 ns |
103790000.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
103865958 ns |
104076875 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
388748166 ns |
389226000 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5829846 ns |
5819295 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/oneAPI |
77852403 ns |
||
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU |
3563321 ns |
3575713 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
621227396 ns |
621801500 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
353126562.5 ns |
353125646 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
356165583.5 ns |
354434874.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
1186170667 ns |
1181638875 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
26447704 ns |
26630294 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/GPU/oneAPI |
279065786 ns |
||
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU |
21993454.5 ns |
22185623 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7208 ns |
7167 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5292 ns |
5375 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5375 ns |
5375 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9917 ns |
10500 ns |
0.94 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
27665.5 ns |
27436 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1245534 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
48550 ns |
46631 ns |
1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213584 ns |
212500 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221000 ns |
220750 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
221521 ns |
220458 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
205853.5 ns |
206104.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
221500 ns |
220558 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26571515 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
522663 ns |
523545 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
8500 ns |
10541.5 ns |
0.81 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
7959 ns |
9541.5 ns |
0.83 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10834 ns |
10875 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
7729 ns |
8312 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
114880 ns |
117824.5 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3278469.5 ns |
||
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
69301 ns |
70451 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8959 ns |
7583.5 ns |
1.18 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
12583 ns |
9792 ns |
1.29 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8667 ns |
8187.5 ns |
1.06 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7354 ns |
7562.5 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
516223 ns |
515354.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
19023334 ns |
||
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
315512 ns |
318733 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
500 ns |
500 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
500 ns |
583 ns |
0.86 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
625 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
500 ns |
459 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
25923 ns |
26054 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
1283315 ns |
||
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
46841 ns |
46610 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
10250 ns |
9083 ns |
1.13 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
12458 ns |
9604 ns |
1.30 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
13521 ns |
8958 ns |
1.51 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
8834 ns |
9166 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
250993 ns |
252407.5 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
23768700 ns |
||
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
387937 ns |
388539 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
106500 ns |
107458.5 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
84042 ns |
84708 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
85500 ns |
86000 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
146479 ns |
146750 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
24236.5 ns |
23950.5 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI |
772749.5 ns |
||
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
190921 ns |
191282 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
478166.5 ns |
516625 ns |
0.93 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
480562.5 ns |
502312.5 ns |
0.96 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
479083 ns |
478354.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
478000 ns |
498167 ns |
0.96 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
230930 ns |
232559 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
11531138 ns |
||
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
609234 ns |
606451 ns |
1.00 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
5209 ns |
5250 ns |
0.99 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
7167 ns |
6500 ns |
1.10 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
7521 ns |
7749.5 ns |
0.97 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
6208.5 ns |
5687.5 ns |
1.09 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
16010 ns |
16126.5 ns |
0.99 |
batchedmm(16, Bsize=32)/forward/GPU/oneAPI |
69713523 ns |
||
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU |
79920 ns |
85781 ns |
0.93 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
12500 ns |
11625 ns |
1.08 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
10500 ns |
9917 ns |
1.06 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
11167 ns |
10104.5 ns |
1.11 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
16562.5 ns |
16584 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
211835 ns |
215162.5 ns |
0.98 |
batchedmm(16, Bsize=32)/zygote/GPU/oneAPI |
94286099.5 ns |
||
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU |
367732.5 ns |
378354 ns |
0.97 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
39292 ns |
38708 ns |
1.02 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
50167 ns |
51125 ns |
0.98 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
51187.5 ns |
52146 ns |
0.98 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
15917 ns |
14417 ns |
1.10 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
21272 ns |
19504 ns |
1.09 |
batchedmm(16, Bsize=128)/forward/GPU/oneAPI |
74117102.5 ns |
||
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU |
87240 ns |
93401 ns |
0.93 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
36834 ns |
36334 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
28250 ns |
28167 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
29229 ns |
28625 ns |
1.02 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
57916.5 ns |
56895.5 ns |
1.02 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
190232 ns |
190765 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/oneAPI |
107961163 ns |
||
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU |
411983 ns |
410848.5 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
1958.5 ns |
1666.5 ns |
1.18 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
1709 ns |
2000 ns |
0.85 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
2209 ns |
2167 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
1812.5 ns |
1667 ns |
1.09 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
20515 ns |
20338 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI |
1206257 ns |
||
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
32750 ns |
32440 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
2187.5 ns |
2042 ns |
1.07 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
2208 ns |
2375 ns |
0.93 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
2667 ns |
2417 ns |
1.10 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2083 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
202549.5 ns |
202489 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI |
9048897 ns |
||
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
137331 ns |
136411 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
4770.5 ns |
6750 ns |
0.71 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
5125 ns |
4833 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6458 ns |
5896 ns |
1.10 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5250 ns |
4916.5 ns |
1.07 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
142988.5 ns |
142403 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
5974748 ns |
||
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
59740 ns |
69051 ns |
0.87 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8583.5 ns |
8395.5 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9209 ns |
8625 ns |
1.07 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9166.5 ns |
8542 ns |
1.07 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8417 ns |
8292 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
876984.5 ns |
858082 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
37578358.5 ns |
||
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
387002.5 ns |
388048.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56917 ns |
56834 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
56916 ns |
56916 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
57000 ns |
56917 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
58250 ns |
58291 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
37250 ns |
37048 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1219604 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
204526.5 ns |
204772 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
507437.5 ns |
484583.5 ns |
1.05 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
465749.5 ns |
475541.5 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
468750 ns |
465562.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
440500 ns |
445666 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
265836 ns |
263380 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26749125.5 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
825546 ns |
819218 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
3335229 ns |
3332458 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
1771604 ns |
1767958 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
1766375 ns |
1766125 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
6314750 ns |
6295583.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
204661 ns |
206330 ns |
0.99 |
batchedmm(128, Bsize=128)/forward/GPU/oneAPI |
79139898 ns |
||
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU |
210542 ns |
212392 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
11539645.5 ns |
11495438 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
6559708 ns |
6565688 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
6570000 ns |
6570438 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
21182291.5 ns |
21167562.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
740843 ns |
737845 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/oneAPI |
121795434.5 ns |
||
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU |
1045497 ns |
1062630 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5166 ns |
4833 ns |
1.07 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
4708 ns |
5583 ns |
0.84 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6354.5 ns |
7333 ns |
0.87 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4875 ns |
4500 ns |
1.08 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
136797.5 ns |
136011 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
5676835 ns |
||
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
57551 ns |
56600 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9458 ns |
7125 ns |
1.33 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8667 ns |
7500 ns |
1.16 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9292 ns |
7541.5 ns |
1.23 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7250 ns |
7292 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
759042 ns |
746443 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
34120894 ns |
||
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
365442 ns |
370888 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
126917 ns |
155000 ns |
0.82 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
99542 ns |
124709 ns |
0.80 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
102958 ns |
98541 ns |
1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
122917 ns |
98709 ns |
1.25 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
150316.5 ns |
150159 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5774988 ns |
||
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
204316 ns |
204262 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2033333 ns |
2031188 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2041041.5 ns |
2031500 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2033645.5 ns |
2037125 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1999500 ns |
2033000 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
701696.5 ns |
697162 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31730226 ns |
||
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1112818 ns |
1208931 ns |
0.92 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
33625 ns |
33209 ns |
1.01 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
35541 ns |
34833 ns |
1.02 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
33479 ns |
33042 ns |
1.01 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
687.5 ns |
541 ns |
1.27 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
15262 ns |
15393 ns |
0.99 |
batchedmm(2, Bsize=4)/forward/GPU/oneAPI |
73822850 ns |
||
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU |
78871 ns |
79290 ns |
0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
3042 ns |
2583 ns |
1.18 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
3333 ns |
3083 ns |
1.08 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
3792 ns |
3209 ns |
1.18 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
2375 ns |
2125 ns |
1.12 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
138789.5 ns |
138753 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/GPU/oneAPI |
93638716 ns |
||
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU |
342702 ns |
341213 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7250 ns |
7250 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5333 ns |
5416 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5375 ns |
5416 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10417 ns |
10458 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
36210 ns |
36086 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1257374 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
48570 ns |
49460 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
220458 ns |
213395.5 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221958 ns |
227750 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
223709 ns |
220792 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
206083 ns |
205667 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
243944 ns |
240787.5 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26586526.5 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
575589 ns |
569246 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3917 ns |
3917 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3958 ns |
3959 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3958 ns |
3917 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3917 ns |
3917 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
21476 ns |
21637 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI |
2064949.5 ns |
||
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU |
44060 ns |
42161 ns |
1.05 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14625 ns |
14625 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14667 ns |
14750 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14708 ns |
14667 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14667 ns |
14625 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
307143 ns |
307620 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI |
11739084 ns |
||
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
199661 ns |
192746.5 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
148416 ns |
100834 ns |
1.47 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
110916 ns |
118500 ns |
0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
106500 ns |
101833 ns |
1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
118145.5 ns |
102417 ns |
1.15 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
136427 ns |
136873 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5833156 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
204897 ns |
205777 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1922624.5 ns |
1916625 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1939667 ns |
1916542 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1935687.5 ns |
1926979 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1921833.5 ns |
1898334 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
686994 ns |
683667 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
34938407 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1214953.5 ns |
1215256.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
21500 ns |
19000 ns |
1.13 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19167 ns |
19000 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21167 ns |
22250 ns |
0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
17500 ns |
16916 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
107557 ns |
107183.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3518733 ns |
||
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
78745.5 ns |
78581 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
229541 ns |
217813 ns |
1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220625 ns |
222833 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
217583.5 ns |
217417 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
221395.5 ns |
216770.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
514096.5 ns |
512086.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
20488413 ns |
||
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
477244 ns |
476669.5 ns |
1.00 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
24583.5 ns |
24750 ns |
0.99 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
28917 ns |
28937.5 ns |
1.00 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
27000 ns |
26875 ns |
1.00 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
1417 ns |
1083 ns |
1.31 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
15648 ns |
16054 ns |
0.97 |
batchedmm(16, Bsize=4)/forward/GPU/oneAPI |
71452995 ns |
||
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU |
82161 ns |
81581 ns |
1.01 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
4271 ns |
4896.5 ns |
0.87 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
5292 ns |
4917 ns |
1.08 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
5250 ns |
5333 ns |
0.98 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
4584 ns |
4229 ns |
1.08 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
205650 ns |
206611 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/oneAPI |
89733904.5 ns |
||
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU |
381703 ns |
377863 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 308167 ns | 306208 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 305708.5 ns | 305084 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 309250 ns | 309729.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 307541 ns | 307625 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 225957 ns | 224320 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7672365 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 272732 ns | 274612 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 541542 ns | 531959 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 535334 ns | 543458 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 532770.5 ns | 535333.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 540250 ns | 542209 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1073264 ns | 1058263 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43089373.5 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 853166 ns | 853108 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22291 ns | 22084 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19458 ns | 21083 ns | 0.92 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22709 ns | 23542 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19209 ns | 19459 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 112604 ns | 112165.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3598747 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78801 ns | 78361 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222854.5 ns | 221750 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218771 ns | 217666.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214958 ns | 224750 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213833 ns | 222416 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 736294 ns | 732048.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25058593 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 532474 ns | 533125 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7042 ns | 6958 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6584 ns | 6958 ns | 0.95 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8125 ns | 9208 ns | 0.88 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6583 ns | 6417 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 138568.5 ns | 137815 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5792248 ns | | |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 65790 ns | 65160 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10000 ns | 9958 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11958 ns | 10792 ns | 1.11 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11041 ns | 10541 ns | 1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 9875 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 827774.5 ns | 815812 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 38201577.5 ns | | |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 389552 ns | 385314 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5166 ns | 4750 ns | 1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4875 ns | 5208 ns | 0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6750 ns | 6271 ns | 1.08 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4750 ns | 5000 ns | 0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 142238 ns | 141314 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5605547 ns | | |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 68521 ns | 66780 ns | 1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7541 ns | 7709 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8125 ns | 7916 ns | 1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7937.5 ns | 7875 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7458 ns | 7959 ns | 0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 787250 ns | 775695 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 39098967 ns | | |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 395683 ns | 388324 ns | 1.02 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14589625 ns | 14550291 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 7712583 ns | 7721375 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 7720979 ns | 7712187.5 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27855709 ns | 27857958 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 530491 ns | 529799 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/oneAPI | 97041475.5 ns | | |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 394933 ns | 389819 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46610729 ns | 46686916.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 26604584 ns | 26553583 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 26681396 ns | 26597104.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85852583 ns | 85700209 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2667330 ns | 2648481 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/oneAPI | 191567319 ns | | |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3286084 ns | 3297251 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 66667 ns | 66125 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 67042 ns | 68667 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 69708.5 ns | 70437.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 65625 ns | 66917 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 117462 ns | 117160.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3751126 ns | | |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 230716.5 ns | 233212 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 441271 ns | 455375 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 442500 ns | 452500 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 442875 ns | 453833.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 452458 ns | 441375 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 725359.5 ns | 721437 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27248402.5 ns | | |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 795126 ns | 786047 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 542 ns | 0.92 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 583 ns | 0.86 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32377 ns | 32085 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1182026 ns | | |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47370 ns | 47371 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8416 ns | 8667 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10958 ns | 9042 ns | 1.21 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9646 ns | 10000 ns | 0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8708 ns | 8458 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 282941 ns | 282627 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21204124 ns | | |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 376463 ns | 375423.5 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9791.5 ns | 9792 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9792 ns | 9833 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9833 ns | 9792 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9833 ns | 9833 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 22665 ns | 22901 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2028098 ns | | |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 209912 ns | 208212 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45583 ns | 45625 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 46584 ns | 45958 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46208 ns | 45875 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 46458 ns | 45917 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 288297 ns | 288260 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 9932924.5 ns | | |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 607414 ns | 607426 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56458 ns | 56625 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56458 ns | 56833 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 56500 ns | 56834 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58000 ns | 58250 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28171 ns | 28250 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1190493 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 202902 ns | 202042 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 449292 ns | 496854 ns | 0.90 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 507333 ns | 504833 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 465354 ns | 482959 ns | 0.96 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 440562.5 ns | 434145.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 245614.5 ns | 242768 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31644665 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 878146 ns | 877308 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 664042 ns | 642729 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 637000 ns | 659250 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 629167 ns | 650437.5 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 654396 ns | 609291.5 ns | 1.07 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 202857 ns | 203473.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8419099 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 306212 ns | 309673 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2254395.5 ns | 2253979 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2240792 ns | 2246042 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2250750 ns | 2231375 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2243375 ns | 2238292 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 969029 ns | 956636.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 48566408 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1353880 ns | 1324473 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19792 ns | 20292 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20520.5 ns | 23500 ns | 0.87 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23417 ns | 24250 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19708 ns | 19333 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 112196.5 ns | 111824.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3630993 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79390 ns | 80571 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223854 ns | 271000 ns | 0.83 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221437.5 ns | 258000 ns | 0.86 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221708 ns | 231875 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220916 ns | 221125 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 723806.5 ns | 720921 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 24902123 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 550509.5 ns | 554706 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 583 ns | 0.86 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 667 ns | 0.87 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 22882 ns | 22764 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1265621.5 ns | | |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47960 ns | 47580 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 9541 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9708 ns | 9625 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10208 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9854.5 ns | 9333 ns | 1.06 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 265608 ns | 264550 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 26022316.5 ns | | |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 400258 ns | 398354 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8250 ns | 10750 ns | 0.77 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7896.5 ns | 8875 ns | 0.89 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11875 ns | 11125 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8167 ns | 8917 ns | 0.92 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 118304 ns | 117075.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3546887 ns | | |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 70261 ns | 69781 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7583.5 ns | 7500 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7417 ns | 7750 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8000 ns | 8083 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7542 ns | 7750 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 505629.5 ns | 498929 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17345271 ns | | |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 319743 ns | 322428 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1417 ns | 1458 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1500 ns | 1584 ns | 0.95 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2000 ns | 2000 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1459 ns | 1541 ns | 0.95 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21040 ns | 20430 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1140242 ns | | |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 189601 ns | 188361 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3292 ns | 3292 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3208 ns | 3458 ns | 0.93 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3604.5 ns | 3541 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3208 ns | 3208 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 220407.5 ns | 218522.5 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10098546 ns | | |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 582305 ns | 578345 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 147937.5 ns | 148312.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 105916.5 ns | 105937.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 106750 ns | 108125 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 231896 ns | 226084 ns | 1.03 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 23931 ns | 23769 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1227265 ns | | |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 41170 ns | 40471 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 163229 ns | 173291.5 ns | 0.94 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 88104 ns | 104500 ns | 0.84 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 87374.5 ns | 105208 ns | 0.83 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 270771 ns | 287062 ns | 0.94 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 216773 ns | 215904 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10565835 ns | | |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 267432 ns | 268567 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7209 ns | 7250 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5375 ns | 5333 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5333 ns | 5416 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10209 ns | 10416 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32955 ns | 32778 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1203684 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50371 ns | 48640 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222250 ns | 226583 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229125 ns | 229645.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228208.5 ns | 238083 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219667 ns | 213229.5 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 263890 ns | 258784 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 29607614.5 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 595145 ns | 595636 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15708 ns | 15375 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 14541.5 ns | 15125 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 17083.5 ns | 16959 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15208 ns | 15083 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 138261.5 ns | 137028 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5487375.5 ns | | |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 231261 ns | 230152 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24292 ns | 23500 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24750 ns | 24208 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24500 ns | 24500 ns | 1 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24542 ns | 24375 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 869603.5 ns | 858623.5 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 40457803 ns | | |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 681295 ns | 679476 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9167 ns | 9750 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9083.5 ns | 10104.5 ns | 0.90 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12188 ns | 11000 ns | 1.11 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9291 ns | 9084 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 122425.5 ns | 120301.5 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 4079012 ns | | |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 74250 ns | 74161 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13917 ns | 13875 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13833 ns | 14646 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14583 ns | 15000 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14125 ns | 13958 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 663795 ns | 655428 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 22251484 ns | | |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 367348 ns | 366138.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8708 ns | 10250 ns | 0.85 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9521 ns | 10625.5 ns | 0.90 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10875 ns | 11792 ns | 0.92 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 8958 ns | 9125 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 120869 ns | 119866.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3403090 ns | | |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 73351 ns | 72421 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12833 ns | 12208 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12625 ns | 12791.5 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13354 ns | 13084 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13125 ns | 12875 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 549586 ns | 541791 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19602252 ns | | |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 341748 ns | 341643 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 28437.5 ns | 30750 ns | 0.92 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 34020.5 ns | 32333 ns | 1.05 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 29833 ns | 29792 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1875 ns | 1625 ns | 1.15 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16088 ns | 16024 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/GPU/oneAPI | 76292652.5 ns | | |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 81070 ns | 80551 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5187.5 ns | 5042 ns | 1.03 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 4812.5 ns | 5458 ns | 0.88 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5166.5 ns | 5083 ns | 1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 7084 ns | 6209 ns | 1.14 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 138637 ns | 139561 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/GPU/oneAPI | 106277566 ns | | |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 373178 ns | 368314 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 25151 ns | 25032 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1206450.5 ns | | |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48171 ns | 46980 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6291 ns | 6167 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6584 ns | 6666.5 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6833.5 ns | 6958 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6333 ns | 6125 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 186344.5 ns | 184207 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 24314539 ns | | |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 390138 ns | 388954 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 1958 ns | 2000 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 1959 ns | 2042 ns | 0.96 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2042 ns | 2083 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 1958 ns | 1959 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 26102 ns | 26042 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1240500 ns | | |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 206142 ns | 204582 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16666.5 ns | 17083 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16334 ns | 16875 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16667 ns | 16896 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16667 ns | 16584 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 272991 ns | 271146.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 28009585 ns | | |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 704126 ns | 701017 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 151500 ns | 147458 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 171687.5 ns | 175562.5 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 153458 ns | 153292 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 175417 ns | 152541 ns | 1.15 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 199266 ns | 195620 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8069586.5 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 216712 ns | 226692 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1323833 ns | 1323500 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1319020.5 ns | 1327791 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1331875 ns | 1331125 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1323229 ns | 1301042 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 902946 ns | 891045 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 45771076 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 999218 ns | 1116140.5 ns | 0.90 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25000 ns | 25000 ns | 1 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24770.5 ns | 24437.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27187 ns | 28250 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25000 ns | 25979.5 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 235086.5 ns | 231362.5 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8000668 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 113461 ns | 115561 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 124708.5 ns | 178562 ns | 0.70 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 129270.5 ns | 126166 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 129667 ns | 178437.5 ns | 0.73 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 117645.5 ns | 157500 ns | 0.75 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1071982 ns | 1053949 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 47350908.5 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 608575 ns | 608216 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
291 ns |
334 ns |
0.87 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
250 ns |
1.17 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
22745 ns |
22518 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1191968 ns |
||
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
47400 ns |
47580 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6416 ns |
6416 ns |
1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6625 ns |
6834 ns |
0.97 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7000 ns |
7020.5 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6333 ns |
6417 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
202054 ns |
200663 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
26514268.5 ns |
||
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
387828 ns |
396354 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5875 ns |
7062.5 ns |
0.83 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
5875 ns |
5874.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7542 ns |
7791 ns |
0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
6187.5 ns |
6791 ns |
0.91 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
142993.5 ns |
142964.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
5749081 ns |
||
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
232471 ns |
231792 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10083 ns |
10208.5 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10187.5 ns |
10250 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10292 ns |
10500 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10334 ns |
10333 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
907275.5 ns |
887713 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
42579731 ns |
||
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
671686 ns |
669276 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
666 ns |
667 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
625 ns |
667 ns |
0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
708 ns |
667 ns |
1.06 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
625 ns |
625 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
22091 ns |
22120 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI |
2176757 ns |
||
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
205902 ns |
205382 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
4583 ns |
4667 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4542 ns |
4833 ns |
0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
4833 ns |
4833 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4542 ns |
4584 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
224542 ns |
224988.5 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
9538256 ns |
||
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
583495 ns |
575835.5 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7292 ns |
8167 ns |
0.89 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8000 ns |
8437 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
9896 ns |
9833 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7708 ns |
7958 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
120756.5 ns |
119167.5 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
3471791 ns |
||
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
75931 ns |
74331 ns |
1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8875 ns |
8416 ns |
1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8812.5 ns |
8938 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8958.5 ns |
9625 ns |
0.93 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8500 ns |
8458 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
590021 ns |
578635 ns |
1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
20094165 ns |
||
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
343053 ns |
344473 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
126000 ns |
126875 ns |
0.99 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
95333 ns |
97229 ns |
0.98 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
96187.5 ns |
97333.5 ns |
0.99 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
183083 ns |
183291.5 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
45611 ns |
45455.5 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/GPU/oneAPI |
73061834 ns |
||
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU |
102841 ns |
101051 ns |
1.02 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
340542 ns |
340292 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
182166 ns |
182250 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
179000 ns |
191959 ns |
0.93 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
619562.5 ns |
612416.5 ns |
1.01 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
190070 ns |
191737 ns |
0.99 |
batchedmm(128, Bsize=4)/zygote/GPU/oneAPI |
89832396 ns |
||
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU |
515839.5 ns |
516500 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
398625 ns |
399042 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
215375 ns |
215417 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
215166 ns |
215333 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
756916 ns |
756333 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA |
42770 ns |
43626 ns |
0.98 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI |
1389564 ns |
||
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU |
80551 ns |
81280 ns |
0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
1414208.5 ns |
1398374.5 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
864917 ns |
864000 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
863417 ns |
864270.5 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
2359166.5 ns |
2358708.5 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA |
247351 ns |
253991.5 ns |
0.97 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI |
10979753.5 ns |
||
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU |
353233 ns |
350903.5 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
662854 ns |
653500 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
673021.5 ns |
655916 ns |
1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
660875 ns |
653041.5 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
686750 ns |
622146 ns |
1.10 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
201573 ns |
201217.5 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8278348.5 ns |
||
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
311267.5 ns |
306973 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2467270.5 ns |
2461125.5 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2476708 ns |
2469625 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2475333 ns |
2481375 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2444125 ns |
2480333 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
988382.5 ns |
998464.5 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
52533110.5 ns |
||
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1304020.5 ns |
1392463.5 ns |
0.94 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) |
31771.5 ns |
32521 ns |
0.98 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) |
34833.5 ns |
34291 ns |
1.02 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) |
32917 ns |
33084 ns |
0.99 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) |
875 ns |
833 ns |
1.05 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA |
15323 ns |
15542.5 ns |
0.99 |
batchedmm(2, Bsize=32)/forward/GPU/oneAPI |
70325477 ns |
||
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU |
83911 ns |
78871 ns |
1.06 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) |
3062.5 ns |
3000 ns |
1.02 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) |
3167 ns |
3417 ns |
0.93 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) |
3417 ns |
3500 ns |
0.98 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) |
3208 ns |
3042 ns |
1.05 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA |
136612 ns |
141700 ns |
0.96 |
batchedmm(2, Bsize=32)/zygote/GPU/oneAPI |
91300478 ns |
||
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU |
354383 ns |
337663 ns |
1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
407500 ns |
408916 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
401709 ns |
403770.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
402333 ns |
404375 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
422000 ns |
423959 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
43203 ns |
43511.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1412474 ns |
||
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
241512 ns |
237932 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3873791.5 ns |
3878166.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3988979.5 ns |
3999042 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3992333 ns |
4003416 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3780562 ns |
3792395.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
242608 ns |
245738 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
38123730 ns |
||
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1241085 ns |
1432279 ns |
0.87 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3916 ns |
3958 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3958 ns |
3917 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3916 ns |
3917 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33762 ns |
34288 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI |
1274592 ns |
||
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU |
38050 ns |
37921 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15416 ns |
15459 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15458 ns |
15666 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15625 ns |
15666 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15459 ns |
15458 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
251910 ns |
258924 ns |
0.97 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI |
6676022 ns |
||
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
168912 ns |
173651.5 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
404875 ns |
404583 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
221020.5 ns |
220833 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
220958 ns |
221125 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
760834 ns |
760833 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
112978 ns |
113269 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI |
979708 ns |
||
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU |
87801 ns |
87641 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1428104 ns |
1424020.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
889125 ns |
888041.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
887167 ns |
888875 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2382770.5 ns |
2382770.5 ns |
1 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
238693 ns |
245573 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI |
10268219 ns |
||
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
355883 ns |
354303 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
500 ns |
500 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
500 ns |
583 ns |
0.86 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
542 ns |
583 ns |
0.93 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
500 ns |
500 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
25472 ns |
25789 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1196934 ns |
||
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
206612 ns |
204972 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
7125 ns |
7459 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7625 ns |
7667 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7959 ns |
7958 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7166 ns |
7250 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
210500 ns |
217010.5 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
27163566 ns |
||
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
685306 ns |
692821.5 ns |
0.99 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
831312.5 ns |
832771 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
471083 ns |
467416 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
469354 ns |
470562.5 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
1551083 ns |
1544541 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
129725 ns |
129883 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/GPU/oneAPI |
73418877 ns |
||
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU |
234712 ns |
229272 ns |
1.02 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
2690750 ns |
2692000 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1540667 ns |
1540000 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1536375.5 ns |
1542312.5 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
4950854.5 ns |
4931479 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
239460 ns |
248014 ns |
0.97 |
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI |
99877157 ns |
||
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU |
760537 ns |
809797.5 ns |
0.94 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
250 ns |
250 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
291 ns |
375 ns |
0.78 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
334 ns |
375 ns |
0.89 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
291 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
32058 ns |
32644 ns |
0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1250598 ns |
||
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
46991 ns |
47000 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6041 ns |
6208 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6458 ns |
6562.5 ns |
0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6708 ns |
6916 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6291 ns |
6333 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
222876 ns |
226410 ns |
0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
23165966.5 ns |
||
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
358863 ns |
357804 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2413250 ns |
2407917 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2415084 ns |
2401417 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2415646 ns |
2386750 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2406584 ns |
2392333 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
197621 ns |
200791 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
7999137 ns |
||
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
374664 ns |
374543.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4645167 ns |
4663875 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4655833.5 ns |
4666063 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4666958 ns |
4675291 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4630000 ns |
4670208 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
897908 ns |
902618 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
49208090 ns |
||
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1382441 ns |
1376633 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
7541.5 ns |
6875 ns |
1.10 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
22875 ns |
7542 ns |
3.03 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7292 ns |
7250 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
6833 ns |
6917 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
22723 ns |
23477 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI |
1180199 ns |
||
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU |
40911 ns |
39221 ns |
1.04 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
66646 ns |
32313 ns |
2.06 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
43750 ns |
49125 ns |
0.89 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
48209 ns |
49583 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
69604 ns |
52291.5 ns |
1.33 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
215130 ns |
219072.5 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI |
10574968 ns |
||
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
267982 ns |
262272 ns |
1.02 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
20625 ns |
21666.5 ns |
0.95 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
24770.5 ns |
24541.5 ns |
1.01 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
22917 ns |
22416.5 ns |
1.02 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
6084 ns |
5166 ns |
1.18 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
17467 ns |
18191 ns |
0.96 |
batchedmm(2, Bsize=512)/forward/GPU/oneAPI |
87440172.5 ns |
||
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU |
84121 ns |
82841 ns |
1.02 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
12458 ns |
11979 ns |
1.04 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
8895.5 ns |
9645.5 ns |
0.92 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
9666 ns |
9541.5 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
17792 ns |
18062.5 ns |
0.99 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
226861 ns |
231197.5 ns |
0.98 |
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI |
140762273.5 ns |
||
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU |
388443 ns |
365714 ns |
1.06 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
406916 ns |
406041 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
223292 ns |
223459 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
223417 ns |
223375 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
762917 ns |
762584 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
45844 ns |
46689.5 ns |
0.98 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI |
1352402 ns |
||
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU |
90961 ns |
87501 ns |
1.04 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1428229 ns |
1427542 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
891833 ns |
894125 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
894708.5 ns |
896417 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2386770.5 ns |
2384229 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
286502 ns |
287677.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI |
11860453.5 ns |
||
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
379253 ns |
377703 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
434958 ns |
434334 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
430166 ns |
430229.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
430250 ns |
430333 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
448375 ns |
447583 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
54142 ns |
55000 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1033716.5 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
236092 ns |
233247 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3910542 ns |
3915625 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4015625 ns |
4018146 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4017125 ns |
4025959 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3810042 ns |
3782667 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
263657 ns |
265792.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
30657851 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1222870 ns |
1207206.5 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
8708 ns |
8750 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
6833 ns |
6875 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
6875 ns |
6875 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
12375 ns |
12416 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
24200 ns |
24680 ns |
0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2206428 ns |
||
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
208602 ns |
210232 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
44792 ns |
44583 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
44958 ns |
44959 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
44833 ns |
44875 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
45166 ns |
44667 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
344948 ns |
349913 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11784282 ns |
||
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
652455 ns |
651936 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
108167 ns |
119750.5 ns |
0.90 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
82916.5 ns |
123750 ns |
0.67 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
87833.5 ns |
89667 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
106145.5 ns |
81771 ns |
1.30 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
190064 ns |
189502 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5997800 ns |
||
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
219032 ns |
218452 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2024375 ns |
2022125 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2011813 ns |
2026083 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2008999.5 ns |
2027729 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2027292 ns |
2023895.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
532552.5 ns |
540867 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
27552313 ns |
||
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1086649 ns |
1089800 ns |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
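The Ratio column above appears to be the current timing divided by the previous timing, rounded to two decimals, and left blank where no previous measurement exists (the oneAPI rows). A minimal sketch under that assumption follows; `benchmark_ratio` is a hypothetical helper written for illustration, not part of github-action-benchmark or LuxLib.

```python
from typing import Optional


def benchmark_ratio(current_ns: float, previous_ns: Optional[float]) -> Optional[float]:
    """Return current/previous rounded to two decimals.

    Returns None when there is no usable baseline, which mirrors the
    empty Previous and Ratio cells of the oneAPI rows.
    """
    if previous_ns is None or previous_ns == 0:
        return None
    return round(current_ns / previous_ns, 2)


# Timings copied from rows of the table above.
print(benchmark_ratio(151500, 147458))   # layernorm forward, CPU, 2 threads -> 1.03
print(benchmark_ratio(22875, 7542))      # bias_activation forward, CPU, 4 threads -> 3.03
print(benchmark_ratio(8069586.5, None))  # oneAPI row with no previous run -> None
```

A ratio above 1 means the current commit was slower than the previous one on that benchmark. Single outliers such as the 3.03 row are likely worth re-running before being treated as regressions, since neighbouring thread counts for the same benchmark stay close to 1.00.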
avik-pal force-pushed the ap/more_dev branch 2 times, most recently from e8b9675 to e932ee4 (September 21, 2024 16:22)
avik-pal force-pushed the ap/more_dev branch from e932ee4 to 1d266fa (September 21, 2024 16:31)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #165 +/- ##
===========================================
+ Coverage 59.34% 79.03% +19.68%
===========================================
Files 38 38
Lines 2022 2065 +43
===========================================
+ Hits 1200 1632 +432
+ Misses 822 433 -389
Flags with carried forward coverage won't be shown.
New Additions