Use prescribed thread-block configurations #1969

charleskawczynski · 2024-09-04T17:59:33Z

Based on a comparison of #1963 against our main branch, this PR removes the conversions from linear to Cartesian indices and instead uses partitioned thread-block configurations to improve the performance of some kernels. xref: JuliaGPU/KernelAbstractions.jl#470.
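
As a rough illustration of the difference (a Python sketch, not this package's kernel code): recovering a Cartesian index from a linear thread index costs one integer division and remainder per dimension, whereas a partitioned launch lets each thread read its coordinates directly from its thread and block indices. The function names and the particular split used here (`i, j` over threads; `v, h` over blocks) are assumptions made for the example.

```python
def linear_to_cartesian(idx, shape):
    """Convert a 0-based linear index into a column-major Cartesian index.

    Costs one divmod per dimension; this is the kind of work the PR avoids.
    """
    coords = []
    for n in shape:
        idx, r = divmod(idx, n)  # quotient carries over, remainder is this coordinate
        coords.append(r)
    return tuple(coords)


def partitioned_indices(Ni, Nj, Nv, Nh):
    """Emulate a partitioned launch (hypothetical split, for illustration).

    (i, j) play the role of thread indices within a block and (v, h) the role
    of block indices, so no division is needed to recover the coordinates.
    """
    for h in range(Nh):              # block index, second grid dimension
        for v in range(Nv):          # block index, first grid dimension
            for j in range(Nj):      # thread index, second block dimension
                for i in range(Ni):  # thread index, first block dimension
                    yield (i, j, v, h)


shape = (4, 4, 63, 5)
n = 4 * 4 * 63 * 5
from_linear = [linear_to_cartesian(k, shape) for k in range(n)]
from_partition = list(partitioned_indices(*shape))
print(from_linear == from_partition)
```

Both approaches visit the same set of points; the partitioned one just gets there without the per-thread index arithmetic.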

Also, these launch configurations all start from CUDA's occupancy API to get a safer upper bound on the number of threads to launch. I've seen some errors caused by launching kernels with too many threads on the main branch (in some of the builds from this PR).
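
The clamping idea can be sketched as follows. This is a plain-Python illustration, not the code in this PR: `max_threads` stands in for whatever bound the occupancy query returns (in CUDA.jl that is typically obtained through `launch_configuration`), and `launch_config` is a hypothetical helper name.

```python
import math


def launch_config(n_items, max_threads):
    """Return (threads, blocks) that cover n_items work items.

    max_threads is assumed to come from an occupancy query, so the
    per-block thread count never exceeds what the kernel can launch with.
    """
    if n_items <= 0 or max_threads <= 0:
        raise ValueError("n_items and max_threads must be positive")
    threads = min(n_items, max_threads)     # never request more than the bound
    blocks = math.ceil(n_items / threads)   # enough blocks to cover every item
    return threads, blocks


# Example: the (4, 4, 1, 63, 5400) problem size with an assumed bound of 1024.
print(launch_config(4 * 4 * 1 * 63 * 5400, 1024))
```

Starting from the queried bound, rather than a hard-coded thread count, is what guards against "too many threads" launch failures for kernels with high register usage.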

I'll try to make the comparison easier to read, but for now I'm just going to paste the results:

Main

fill!

N reads-writes: 1, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬─────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %        │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ DataF │ 10 microseconds, 651 nanoseconds │ 3.43102e-5  │ 0.000699585 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 12 microseconds, 51 nanoseconds  │ 2.61999     │ 53.4216     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 11 microseconds, 310 nanoseconds │ 0.697853    │ 14.2292     │ (4, 1, 1, 1, 5400)  │
│ IJF   │ 10 microseconds, 440 nanoseconds │ 0.000560059 │ 0.0114196   │ (4, 4, 1, 1, 1)     │
│ IF    │ 10 microseconds, 900 nanoseconds │ 0.000134093 │ 0.00273416  │ (4, 1, 1, 1, 1)     │
│ VF    │ 10 microseconds, 960 nanoseconds │ 0.0021004   │ 0.0428272   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 73 microseconds, 639 nanoseconds │ 27.0097     │ 550.727     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 18 microseconds, 620 nanoseconds │ 26.7047     │ 544.509     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴─────────────┴─────────────┴─────────────────────┘
N reads-writes: 1, N-reps: 10000,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 10 microseconds, 710 nanoseconds │ 1.7059e-5  │ 0.000347833 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 12 microseconds, 251 nanoseconds │ 1.28861    │ 26.2747     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 11 microseconds, 760 nanoseconds │ 0.335575   │ 6.84237     │ (4, 1, 1, 1, 5400)  │
│ IJF   │ 10 microseconds, 410 nanoseconds │ 0.00028081 │ 0.00572571  │ (4, 4, 1, 1, 1)     │
│ IF    │ 10 microseconds, 410 nanoseconds │ 7.02024e-5 │ 0.00143143  │ (4, 1, 1, 1, 1)     │
│ VF    │ 11 microseconds, 90 nanoseconds  │ 0.00103789 │ 0.0211626   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 53 microseconds, 541 nanoseconds │ 18.5746    │ 378.736     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 17 microseconds, 141 nanoseconds │ 14.5053    │ 295.763     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

copyto!

N reads-writes: 2, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬───────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                     │ bw %       │ achieved bw │ problem size        │
├───────┼───────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 11 microseconds, 970 nanoseconds  │ 6.10532e-5 │ 0.00124488  │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 18 microseconds, 459 nanoseconds  │ 3.42065    │ 69.747      │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 13 microseconds, 560 nanoseconds  │ 1.16412    │ 23.7364     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 12 microseconds, 900 nanoseconds  │ 0.00356934 │ 0.0727788   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 114 microseconds, 770 nanoseconds │ 34.6603    │ 706.724     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 22 microseconds, 900 nanoseconds  │ 43.4272    │ 885.48      │ (4, 1, 1, 63, 5400) │
└───────┴───────────────────────────────────┴────────────┴─────────────┴─────────────────────┘
N reads-writes: 2, N-reps: 10000,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 12 microseconds, 941 nanoseconds │ 2.82383e-5 │ 0.000575779 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 18 microseconds, 490 nanoseconds │ 1.70746    │ 34.815      │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 13 microseconds, 329 nanoseconds │ 0.592146   │ 12.0739     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 12 microseconds, 971 nanoseconds │ 0.0017749  │ 0.0361902   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 68 microseconds, 559 nanoseconds │ 29.011     │ 591.534     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 21 microseconds, 630 nanoseconds │ 22.9885    │ 468.736     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

stencils

Problem size: (1, 1, 1, 63, 1), N-reps: 1,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                    │ bw %        │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 13 microseconds, 451 nanoseconds │ 0.00342311  │ 0.0697973   │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 14 microseconds, 31 nanoseconds  │ 0.0032816   │ 0.0669118   │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 14 microseconds, 191 nanoseconds │ 0.0032446   │ 0.0661574   │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 13 microseconds, 801 nanoseconds │ 0.00333629  │ 0.068027    │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 13 microseconds, 740 nanoseconds │ 0.00502629  │ 0.102486    │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 14 microseconds, 201 nanoseconds │ 0.00486347  │ 0.0991662   │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 13 microseconds, 990 nanoseconds │ 0.00493683  │ 0.100662    │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 13 microseconds, 580 nanoseconds │ 0.00339034  │ 0.0691291   │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 13 microseconds, 890 nanoseconds │ 0.00331468  │ 0.0675863   │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 13 microseconds, 871 nanoseconds │ 0.00331946  │ 0.0676837   │ 2              │
│ (op_broadcast_example2!, :none)                               │ 12 microseconds, 500 nanoseconds │ 0.00736654  │ 0.150204    │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 13 microseconds, 479 nanoseconds │ 0.00341575  │ 0.0696471   │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 13 microseconds, 410 nanoseconds │ 0.00343332  │ 0.0700055   │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 13 microseconds, 720 nanoseconds │ 0.00335575  │ 0.0684237   │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 13 microseconds, 801 nanoseconds │ 0.00333629  │ 0.068027    │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 13 microseconds, 631 nanoseconds │ 0.00337791  │ 0.0688755   │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 13 microseconds, 709 nanoseconds │ 0.00335844  │ 0.0684786   │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 14 microseconds, 941 nanoseconds │ 0.00462258  │ 0.0942543   │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 13 microseconds, 631 nanoseconds │ 0.00506686  │ 0.103313    │ 3              │
└───────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────────┴─────────────┴────────────────┘
Problem size: (4, 4, 1, 63, 5400), N-reps: 1,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %     │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 67 microseconds, 610 nanoseconds  │ 58.8373  │ 1199.69     │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 93 microseconds, 840 nanoseconds  │ 42.391   │ 864.353     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 101 microseconds, 220 nanoseconds │ 39.3002  │ 801.332     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 87 microseconds, 750 nanoseconds  │ 45.3325  │ 924.33      │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 202 microseconds, 548 nanoseconds │ 29.4592  │ 600.672     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 260 microseconds, 169 nanoseconds │ 22.9348  │ 467.64      │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 191 microseconds, 289 nanoseconds │ 31.1931  │ 636.027     │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 67 microseconds, 570 nanoseconds  │ 58.8713  │ 1200.38     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 70 microseconds, 140 nanoseconds  │ 56.7141  │ 1156.4      │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 78 microseconds, 459 nanoseconds  │ 50.7008  │ 1033.79     │ 2              │
│ (op_broadcast_example0!, :none)                               │ 129 microseconds, 829 nanoseconds │ 45.9597  │ 937.117     │ 3              │
│ (op_broadcast_example1!, :none)                               │ 248 microseconds, 288 nanoseconds │ 32.0429  │ 653.354     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 248 microseconds, 158 nanoseconds │ 32.0597  │ 653.696     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 66 microseconds, 850 nanoseconds  │ 59.5062  │ 1213.33     │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 66 microseconds, 579 nanoseconds  │ 59.7475  │ 1218.25     │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 68 microseconds, 119 nanoseconds  │ 58.3968  │ 1190.71     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 67 microseconds, 20 nanoseconds   │ 59.3544  │ 1210.24     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 66 microseconds, 469 nanoseconds  │ 59.8464  │ 1220.27     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 68 microseconds, 700 nanoseconds  │ 57.9029  │ 1180.64     │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 437 microseconds, 307 nanoseconds │ 13.6446  │ 278.214     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 143 microseconds, 478 nanoseconds │ 41.5875  │ 847.97      │ 3              │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘
Problem size: (4, 4, 1, 63, 5400), N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %     │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 62 microseconds, 569 nanoseconds  │ 31.7883  │ 648.164     │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 86 microseconds, 931 nanoseconds  │ 22.8801  │ 466.525     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 92 microseconds, 740 nanoseconds  │ 21.4469  │ 437.303     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 81 microseconds, 360 nanoseconds  │ 24.4465  │ 498.464     │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 119 microseconds, 759 nanoseconds │ 24.9121  │ 507.958     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 224 microseconds, 729 nanoseconds │ 13.2758  │ 270.694     │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 129 microseconds, 610 nanoseconds │ 23.0188  │ 469.354     │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 62 microseconds, 160 nanoseconds  │ 31.9975  │ 652.429     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 64 microseconds, 470 nanoseconds  │ 30.8515  │ 629.062     │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 77 microseconds, 920 nanoseconds  │ 25.5257  │ 520.47      │ 2              │
│ (op_broadcast_example0!, :none)                               │ 75 microseconds, 70 nanoseconds   │ 39.7427  │ 810.354     │ 3              │
│ (op_broadcast_example1!, :none)                               │ 134 microseconds, 119 nanoseconds │ 29.6597  │ 604.761     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 134 microseconds, 769 nanoseconds │ 29.5167  │ 601.845     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 58 microseconds, 701 nanoseconds  │ 33.8836  │ 690.886     │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 60 microseconds, 410 nanoseconds  │ 32.925   │ 671.34      │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 62 microseconds, 541 nanoseconds  │ 31.8031  │ 648.465     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 61 microseconds, 250 nanoseconds  │ 32.4734  │ 662.133     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 57 microseconds, 510 nanoseconds  │ 34.5847  │ 705.182     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 64 microseconds, 330 nanoseconds  │ 30.9182  │ 630.421     │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 281 microseconds, 118 nanoseconds │ 10.6128  │ 216.395     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 115 microseconds, 300 nanoseconds │ 25.8757  │ 527.606     │ 3              │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘

This PR

fill!

N reads-writes: 1, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬─────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %        │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ DataF │ 10 microseconds, 641 nanoseconds │ 3.43424e-5  │ 0.000700243 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 11 microseconds, 901 nanoseconds │ 2.65301     │ 54.095      │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 11 microseconds, 719 nanoseconds │ 0.673498    │ 13.7326     │ (4, 1, 1, 1, 5400)  │
│ IJF   │ 10 microseconds, 980 nanoseconds │ 0.000532464 │ 0.0108569   │ (4, 4, 1, 1, 1)     │
│ IF    │ 10 microseconds, 930 nanoseconds │ 0.000133725 │ 0.00272665  │ (4, 1, 1, 1, 1)     │
│ VF    │ 10 microseconds, 950 nanoseconds │ 0.00210232  │ 0.0428664   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 35 microseconds, 270 nanoseconds │ 56.3925     │ 1149.84     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 15 microseconds, 111 nanoseconds │ 32.9081     │ 670.996     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴─────────────┴─────────────┴─────────────────────┘
N reads-writes: 1, N-reps: 10000,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬─────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %        │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ DataF │ 11 microseconds, 21 nanoseconds  │ 1.65791e-5  │ 0.000338048 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 11 microseconds, 820 nanoseconds │ 1.33549     │ 27.2305     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 11 microseconds, 800 nanoseconds │ 0.334437    │ 6.81918     │ (4, 1, 1, 1, 5400)  │
│ IJF   │ 11 microseconds, 130 nanoseconds │ 0.000262644 │ 0.00535531  │ (4, 4, 1, 1, 1)     │
│ IF    │ 10 microseconds, 861 nanoseconds │ 6.72935e-5  │ 0.00137211  │ (4, 1, 1, 1, 1)     │
│ VF    │ 10 microseconds, 910 nanoseconds │ 0.00105502  │ 0.0215118   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 31 microseconds, 491 nanoseconds │ 31.5809     │ 643.935     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 15 microseconds                  │ 16.5747     │ 337.958     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴─────────────┴─────────────┴─────────────────────┘

copyto!

N reads-writes: 2, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 11 microseconds, 840 nanoseconds │ 6.17236e-5 │ 0.00125854  │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 13 microseconds, 631 nanoseconds │ 4.63256    │ 94.4578     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 12 microseconds, 271 nanoseconds │ 1.28651    │ 26.2319     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 12 microseconds, 560 nanoseconds │ 0.00366567 │ 0.0747431   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 62 microseconds, 181 nanoseconds │ 63.9744    │ 1304.44     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 19 microseconds, 409 nanoseconds │ 51.2382    │ 1044.75     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘
N reads-writes: 2, N-reps: 10000,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 12 microseconds, 160 nanoseconds │ 3.00496e-5 │ 0.000612712 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 13 microseconds, 501 nanoseconds │ 2.33858    │ 47.6837     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 12 microseconds, 111 nanoseconds │ 0.651752   │ 13.2892     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 11 microseconds, 901 nanoseconds │ 0.00193449 │ 0.0394443   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 47 microseconds, 9 nanoseconds   │ 42.3103    │ 862.707     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 19 microseconds, 80 nanoseconds  │ 26.0609    │ 531.381     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

stencils

Problem size: (1, 1, 1, 63, 1), N-reps: 1,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                    │ bw %        │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 13 microseconds, 321 nanoseconds │ 0.00345652  │ 0.0704785   │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 13 microseconds, 851 nanoseconds │ 0.00332425  │ 0.0677815   │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 13 microseconds, 960 nanoseconds │ 0.00329806  │ 0.0672474   │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 14 microseconds, 51 nanoseconds  │ 0.00327693  │ 0.0668166   │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 13 microseconds, 681 nanoseconds │ 0.00504834  │ 0.102936    │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 13 microseconds, 970 nanoseconds │ 0.00494354  │ 0.100799    │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 13 microseconds, 780 nanoseconds │ 0.0050117   │ 0.102189    │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 13 microseconds, 331 nanoseconds │ 0.00345393  │ 0.0704256   │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 13 microseconds, 631 nanoseconds │ 0.00337791  │ 0.0688755   │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 13 microseconds, 550 nanoseconds │ 0.00339785  │ 0.0692822   │ 2              │
│ (op_broadcast_example2!, :none)                               │ 13 microseconds, 20 nanoseconds  │ 0.00707233  │ 0.144205    │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 13 microseconds, 530 nanoseconds │ 0.00340287  │ 0.0693846   │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 13 microseconds, 370 nanoseconds │ 0.00344359  │ 0.0702149   │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 13 microseconds, 450 nanoseconds │ 0.00342337  │ 0.0698025   │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 13 microseconds, 440 nanoseconds │ 0.00342566  │ 0.0698492   │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 12 microseconds, 601 nanoseconds │ 0.00365404  │ 0.0745058   │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 13 microseconds, 529 nanoseconds │ 0.00340312  │ 0.0693897   │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 14 microseconds, 181 nanoseconds │ 0.00487033  │ 0.099306    │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 13 microseconds, 710 nanoseconds │ 0.00503729  │ 0.10271     │ 3              │
└───────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────────┴─────────────┴────────────────┘
Problem size: (4, 4, 1, 63, 5400), N-reps: 1,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %     │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 68 microseconds, 669 nanoseconds  │ 57.9291  │ 1181.17     │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 80 microseconds, 229 nanoseconds  │ 49.5822  │ 1010.98     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 92 microseconds, 329 nanoseconds  │ 43.0843  │ 878.489     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 80 microseconds, 90 nanoseconds   │ 49.6683  │ 1012.74     │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 203 microseconds, 498 nanoseconds │ 29.3216  │ 597.868     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 216 microseconds, 259 nanoseconds │ 27.5916  │ 562.592     │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 197 microseconds, 838 nanoseconds │ 30.1605  │ 614.973     │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 68 microseconds, 429 nanoseconds  │ 58.1322  │ 1185.32     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 71 microseconds, 69 nanoseconds   │ 55.9728  │ 1141.29     │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 74 microseconds, 249 nanoseconds  │ 53.5755  │ 1092.41     │ 2              │
│ (op_broadcast_example0!, :none)                               │ 69 microseconds, 829 nanoseconds  │ 85.4501  │ 1742.33     │ 3              │
│ (op_broadcast_example1!, :none)                               │ 165 microseconds, 438 nanoseconds │ 48.0897  │ 980.549     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 164 microseconds, 739 nanoseconds │ 48.294   │ 984.715     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 67 microseconds, 589 nanoseconds  │ 58.8547  │ 1200.05     │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 68 microseconds, 69 nanoseconds   │ 58.4397  │ 1191.59     │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 68 microseconds, 769 nanoseconds  │ 57.8448  │ 1179.46     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 67 microseconds, 839 nanoseconds  │ 58.6378  │ 1195.62     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 68 microseconds, 689 nanoseconds  │ 57.9122  │ 1180.83     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 69 microseconds, 730 nanoseconds  │ 57.0484  │ 1163.22     │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 288 microseconds, 927 nanoseconds │ 20.6519  │ 421.093     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 145 microseconds, 699 nanoseconds │ 40.9536  │ 835.043     │ 3              │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘
Problem size: (4, 4, 1, 63, 5400), N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %     │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 53 microseconds, 300 nanoseconds  │ 37.3164  │ 760.882     │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 70 microseconds, 899 nanoseconds  │ 28.0535  │ 572.011     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 80 microseconds, 59 nanoseconds   │ 24.8437  │ 506.564     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 72 microseconds, 870 nanoseconds  │ 27.2951  │ 556.547     │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 114 microseconds, 439 nanoseconds │ 26.0702  │ 531.571     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 162 microseconds, 79 nanoseconds  │ 18.4075  │ 375.329     │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 124 microseconds, 449 nanoseconds │ 23.9733  │ 488.815     │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 54 microseconds, 950 nanoseconds  │ 36.1966  │ 738.048     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 62 microseconds, 750 nanoseconds  │ 31.6972  │ 646.305     │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 71 microseconds, 350 nanoseconds  │ 27.8762  │ 568.395     │ 2              │
│ (op_broadcast_example0!, :none)                               │ 66 microseconds, 101 nanoseconds  │ 45.1354  │ 920.31      │ 3              │
│ (op_broadcast_example1!, :none)                               │ 94 microseconds, 239 nanoseconds  │ 42.2111  │ 860.684     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 93 microseconds, 700 nanoseconds  │ 42.4544  │ 865.644     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 60 microseconds, 90 nanoseconds   │ 33.0998  │ 674.904     │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 51 microseconds, 369 nanoseconds  │ 38.7192  │ 789.484     │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 58 microseconds, 879 nanoseconds  │ 33.7806  │ 688.785     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 59 microseconds, 200 nanoseconds  │ 33.5974  │ 685.051     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 50 microseconds, 309 nanoseconds  │ 39.535   │ 806.118     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 59 microseconds, 49 nanoseconds   │ 33.6833  │ 686.802     │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 189 microseconds, 938 nanoseconds │ 15.7075  │ 320.276     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 112 microseconds, 570 nanoseconds │ 26.5033  │ 540.402     │ 3              │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘
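
For reference, the `achieved bw` and `bw %` columns above appear to be consistent with the following calculation. This is a sketch inferred from the reported numbers, not taken from the benchmark source; in particular, the bytes-to-GiB conversion (division by 1024^3) is an assumption that happens to reproduce the table values.

```python
from math import prod

GIB = 1024**3  # assumed unit conversion; it matches the reported values


def achieved_bw(problem_size, n_reads_writes, float_bytes, seconds):
    """Bytes moved per second, expressed in GiB/s."""
    nbytes = prod(problem_size) * n_reads_writes * float_bytes
    return nbytes / seconds / GIB


def bw_percent(achieved, device_bw=2039):
    """Achieved bandwidth as a percentage of the device bandwidth."""
    return 100 * achieved / device_bw


# fill!, VIJFH, Float64 on main: 73 microseconds, 639 nanoseconds per call.
bw = achieved_bw((4, 4, 1, 63, 5400), 1, 8, 73.639e-6)
print(round(bw, 1), round(bw_percent(bw), 1))
```

Because the byte count is fixed for a given layout and float type, `bw %` is inversely proportional to the time per call, so the two columns tell the same story.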

charleskawczynski · 2024-09-05T15:42:36Z

Here is a compressed summary of the results.

Global specs:

Problem size, matched across all datalayouts: (4, 4, 1, 1, 5400)
Device bandwidth (GBs): 2039

fill!

N reads-writes: 1
        |----------- Main ---------|----------- this PR -------|
┌───────┬─────────────┬────────────┬─────────────┬─────────────┐
│ funcs │ bw % F64    │ bw % F32   │ bw % F64    │ bw % F32    │
├───────┼─────────────┼────────────┼─────────────┼─────────────┤
│ DataF │ 3.43102e-5  │ 1.7059e-5  │ 3.43424e-5  │ 1.65791e-5  │
│ IJFH  │ 2.61999     │ 1.28861    │ 2.65301     │ 1.33549     │
│ IFH   │ 0.697853    │ 0.335575   │ 0.673498    │ 0.334437    │
│ IJF   │ 0.000560059 │ 0.00028081 │ 0.000532464 │ 0.000262644 │
│ IF    │ 0.000134093 │ 7.02024e-5 │ 0.000133725 │ 6.72935e-5  │
│ VF    │ 0.0021004   │ 0.00103789 │ 0.00210232  │ 0.00105502  │
│ VIJFH │ 27.0097     │ 18.5746    │ 56.3925     │ 31.5809     │
│ VIFH  │ 26.7047     │ 14.5053    │ 32.9081     │ 16.5747     │
└───────┴─────────────┴────────────┴─────────────┴─────────────┘

copyto!

N reads-writes: 2
        |----------- Main --------|----------- this PR -----|
┌───────┬────────────┬────────────┬────────────┬────────────┐
│ funcs │ bw % F64   │ bw % F32   │ bw % F64   │ bw % F32   │
├───────┼────────────┼────────────┼────────────┼────────────┤
│ DataF │ 6.10532e-5 │ 2.82383e-5 │ 6.17236e-5 │ 3.00496e-5 │
│ IJFH  │ 3.42065    │ 1.70746    │ 4.63256    │ 2.33858    │
│ IFH   │ 1.16412    │ 0.592146   │ 1.28651    │ 0.651752   │
│ VF    │ 0.00356934 │ 0.0017749  │ 0.00366567 │ 0.00193449 │
│ VIJFH │ 34.6603    │ 29.011     │ 63.9744    │ 42.3103    │
│ VIFH  │ 43.4272    │ 22.9885    │ 51.2382    │ 26.0609    │
└───────┴────────────┴────────────┴────────────┴────────────┘

stencils

                                                         |------- Main --------|------- this PR -----|
┌────────────────────────────────────────────────────────┬──────────┬──────────┬──────────┬──────────┐
│ funcs                                                  │ bw % F64 │ bw % F32 │ bw % F64 │ bw % F32 │
├────────────────────────────────────────────────────────┼──────────┼──────────┼──────────┼──────────┤
│ (op_GradientF2C!, :none)                               │ 58.8373  │ 31.7883  │ 57.9291  │ 37.3164  │
│ (op_GradientF2C!, :SetValue, :SetValue)                │ 42.391   │ 22.8801  │ 49.5822  │ 28.0535  │
│ (op_GradientC2F!, :SetGradient, :SetGradient)          │ 39.3002  │ 21.4469  │ 43.0843  │ 24.8437  │
│ (op_GradientC2F!, :SetValue, :SetValue)                │ 45.3325  │ 24.4465  │ 49.6683  │ 27.2951  │
│ (op_DivergenceF2C!, :none)                             │ 29.4592  │ 24.9121  │ 29.3216  │ 26.0702  │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)        │ 22.9348  │ 13.2758  │ 27.5916  │ 18.4075  │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)    │ 31.1931  │ 23.0188  │ 30.1605  │ 23.9733  │
│ (op_InterpolateF2C!, :none)                            │ 58.8713  │ 31.9975  │ 58.1322  │ 36.1966  │
│ (op_InterpolateC2F!, :SetValue, :SetValue)             │ 56.7141  │ 30.8515  │ 55.9728  │ 31.6972  │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)       │ 50.7008  │ 25.5257  │ 53.5755  │ 27.8762  │
│ (op_broadcast_example0!, :none)                        │ 45.9597  │ 39.7427  │ 85.4501  │ 45.1354  │
│ (op_broadcast_example1!, :none)                        │ 32.0429  │ 29.6597  │ 48.0897  │ 42.2111  │
│ (op_broadcast_example2!, :none)                        │ 32.0597  │ 29.5167  │ 48.294   │ 42.4544  │
│ (op_LeftBiasedC2F!, :SetValue)                         │ 59.5062  │ 33.8836  │ 58.8547  │ 33.0998  │
│ (op_LeftBiasedF2C!, :none)                             │ 59.7475  │ 32.925   │ 58.4397  │ 38.7192  │
│ (op_LeftBiasedF2C!, :SetValue)                         │ 58.3968  │ 31.8031  │ 57.8448  │ 33.7806  │
│ (op_RightBiasedC2F!, :SetValue)                        │ 59.3544  │ 32.4734  │ 58.6378  │ 33.5974  │
│ (op_RightBiasedF2C!, :none)                            │ 59.8464  │ 34.5847  │ 57.9122  │ 39.535   │
│ (op_RightBiasedF2C!, :SetValue)                        │ 57.9029  │ 30.9182  │ 57.0484  │ 33.6833  │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)          │ 13.6446  │ 10.6128  │ 20.6519  │ 15.7075  │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)│ 41.5875  │ 25.8757  │ 40.9536  │ 26.5033  │
└────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴──────────┘

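So, the notable improvements here are in the fill! calls (e.g., @. x = 1) and copyto! (all pointwise) kernels, most visibly for the VIJFH and VIFH layouts. Most stencils don't see much of an improvement in Float64, apart from the broadcast examples and op_divgrad_CC!. Perhaps there's a different issue with the stencils.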
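
To put rough numbers on this, here is a small sketch that divides the "this PR" bw % by the "Main" bw % for a few Float64 rows copied from the tables above. Treating that ratio as a speedup assumes both runs move the same number of bytes per call, so that bw % is inversely proportional to time. The `speedup` helper and the dictionary keys are just for illustration.

```python
# bw % (Float64) copied from the summary tables above: (Main, this PR)
bw_f64 = {
    "fill! VIJFH": (27.0097, 56.3925),
    "fill! VIFH": (26.7047, 32.9081),
    "copyto! VIJFH": (34.6603, 63.9744),
    "copyto! VIFH": (43.4272, 51.2382),
    "op_GradientF2C!, :none": (58.8373, 57.9291),
    "op_broadcast_example0!, :none": (45.9597, 85.4501),
}


def speedup(main_pct, pr_pct):
    """Ratio of bandwidth percentages (this PR over Main).

    Assumption: equal bytes moved in both runs, so this ratio equals the
    ratio of times per call (Main over this PR).
    """
    return pr_pct / main_pct


ratios = {name: speedup(*pair) for name, pair in bw_f64.items()}
for name, ratio in ratios.items():
    print(f"{name:32s} {ratio:.2f}x")
```

The VIJFH fill! and copyto! rows come out at roughly twice the bandwidth of main, while a typical stencil such as `op_GradientF2C!` stays close to 1x.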

charleskawczynski · 2024-09-06T01:20:30Z

Closes #1854.

sriharshakandala · 2024-09-09T16:18:34Z

Can we also add the usual benchmarks for these kernels with a sync statement at the end? Might be easier to use benchmark tools for gathering timing information!

charleskawczynski · 2024-09-09T17:31:40Z

> Can we also add the usual benchmarks for these kernels with a sync statement at the end? Might be easier to use benchmark tools for gathering timing information!

The fill!, copyto!, and stencil benchmarks all use BenchmarkTools.@benchmark CUDA.@cuda_sync to gather timings. Is that what you're referring to?

sriharshakandala · 2024-09-09T17:54:56Z

Overall, looks good to me!

charleskawczynski requested a review from sriharshakandala September 4, 2024 17:59

charleskawczynski mentioned this pull request Sep 4, 2024

Fix fill benchmark n-reads-writes #1970

Merged

charleskawczynski force-pushed the ck/thread_blocks branch 9 times, most recently from 78058a0 to 4417516 September 6, 2024 01:00

charleskawczynski force-pushed the ck/thread_blocks branch 2 times, most recently from 226bec3 to d34f647 September 6, 2024 12:17

charleskawczynski added the performance label Sep 6, 2024

Use prescribed thread-block configurations

7ed62c9

charleskawczynski force-pushed the ck/thread_blocks branch from d34f647 to 7ed62c9 September 9, 2024 17:32

sriharshakandala approved these changes Sep 9, 2024

charleskawczynski merged commit 3bc75d1 into main Sep 9, 2024
22 of 23 checks passed

charleskawczynski deleted the ck/thread_blocks branch September 9, 2024 20:16

This was referenced Sep 9, 2024

Higher resolution column cases cannot be run on GPU #1854

Closed

Fix uncoalesced memory reads #1910

Closed

charleskawczynski mentioned this pull request Sep 23, 2024

Define a linear partition, and use in FD stencils #2002

Closed

charleskawczynski mentioned this pull request Oct 31, 2024

Revert most of 1969 #2066

Open
