Use prescribed thread-block configurations #1969

Merged Sep 9, 2024 (1 commit)

@charleskawczynski (Member) commented Sep 4, 2024

Based on a comparison of #1963 against our main branch, this PR removes the conversions from linear to Cartesian indices and instead uses prescribed (partitioned) thread-block configurations, which improves the performance of some kernels. xref: JuliaGPU/KernelAbstractions.jl#470.
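To illustrate the difference between the two indexing strategies, here is a minimal CPU-only sketch in Python. It is not the code from this PR: the function names, the `(Ni, Nj, Nv, Nh)` shape ordering, and the choice of mapping threads to `(i, j, v)` and blocks to `h` are assumptions made for the example. The first launch derives each coordinate from a linear index with repeated integer division; the second reads the coordinates directly from the (emulated) thread and block indices.

```python
from itertools import product


def linear_to_cartesian(idx, shape):
    """Convert a 0-based linear index to a column-major Cartesian index."""
    coords = []
    for n in shape:
        idx, r = divmod(idx, n)  # one integer division per dimension
        coords.append(r)
    return tuple(coords)


def linear_launch(shape, threads_per_block):
    """Emulate a 1D launch: every thread recovers (i, j, v, h) via divmod."""
    total = 1
    for n in shape:
        total *= n
    n_blocks = -(-total // threads_per_block)  # ceiling division
    visited = []
    for block in range(n_blocks):
        for thread in range(threads_per_block):
            idx = block * threads_per_block + thread
            if idx < total:  # bounds check for the last, partially filled block
                visited.append(linear_to_cartesian(idx, shape))
    return visited


def partitioned_launch(shape):
    """Emulate a prescribed launch: threads span (i, j, v), blocks span h."""
    ni, nj, nv, nh = shape
    visited = []
    for h in range(nh):  # block index
        # Thread indices; i varies fastest, matching the column-major order above.
        for v, j, i in product(range(nv), range(nj), range(ni)):
            visited.append((i, j, v, h))  # no division needed
    return visited


shape = (4, 4, 63, 2)  # small hypothetical problem: 4 x 4 x 63 threads, 2 blocks
same = linear_launch(shape, 256) == partitioned_launch(shape)
print("both launches visit the same points in the same order:", same)
```

Both launches cover the same index space; the partitioned form simply avoids the per-thread integer divisions, which is the kind of overhead this PR targets.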

Also, these launch configurations all start from CUDA's occupancy API, which gives a safer upper bound on the number of threads to use. I've seen some errors caused by launching kernels with too many threads on the main branch (in some of the builds from this PR).
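As a rough sketch of how an occupancy-derived bound can feed into a prescribed configuration, the Python function below takes a maximum thread count as an input and partitions the vertical dimension so that the threads per block never exceed it. The function name, the argument order, and the `(Nv_thread, Ni, Nj)` / `(Nv_blocks, Nh)` layout are assumptions for illustration, not necessarily what this PR implements; on a real device the bound would come from an occupancy query (for example, CUDA.jl's `launch_configuration`).

```python
def partition(ni, nj, nv, nh, n_max_threads):
    """Pick (threads, blocks) so that threads per block stay within n_max_threads.

    n_max_threads stands in for the value returned by an occupancy query;
    here it is just a plain integer argument.
    """
    if ni * nj > n_max_threads:
        raise ValueError("horizontal nodes alone exceed the thread bound")
    # Fit as many vertical levels per block as the bound allows.
    nv_thread = min(n_max_threads // (ni * nj), nv)
    # Remaining vertical levels are spread over extra blocks (ceiling division).
    nv_blocks = -(-nv // nv_thread)
    threads = (nv_thread, ni, nj)
    blocks = (nv_blocks, nh)
    return threads, blocks


# With a bound of 1024 threads, all 63 levels fit in one block.
print(partition(4, 4, 63, 5400, 1024))
# With a tighter bound of 512 threads, the levels are split over two blocks.
print(partition(4, 4, 63, 5400, 512))
```

When the vertical dimension is split across blocks, `nv_thread * nv_blocks` can exceed `nv`, so a kernel using such a configuration would still need a bounds check on the vertical index.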

I'll try to make the comparison easier to read, but for now I'm just pasting the results:

Main

fill!

N reads-writes: 1, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬─────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %        │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ DataF │ 10 microseconds, 651 nanoseconds │ 3.43102e-5  │ 0.000699585 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 12 microseconds, 51 nanoseconds  │ 2.61999     │ 53.4216     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 11 microseconds, 310 nanoseconds │ 0.697853    │ 14.2292     │ (4, 1, 1, 1, 5400)  │
│ IJF   │ 10 microseconds, 440 nanoseconds │ 0.000560059 │ 0.0114196   │ (4, 4, 1, 1, 1)     │
│ IF    │ 10 microseconds, 900 nanoseconds │ 0.000134093 │ 0.00273416  │ (4, 1, 1, 1, 1)     │
│ VF    │ 10 microseconds, 960 nanoseconds │ 0.0021004   │ 0.0428272   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 73 microseconds, 639 nanoseconds │ 27.0097     │ 550.727     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 18 microseconds, 620 nanoseconds │ 26.7047     │ 544.509     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴─────────────┴─────────────┴─────────────────────┘
N reads-writes: 1, N-reps: 10000,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 10 microseconds, 710 nanoseconds │ 1.7059e-5  │ 0.000347833 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 12 microseconds, 251 nanoseconds │ 1.28861    │ 26.2747     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 11 microseconds, 760 nanoseconds │ 0.335575   │ 6.84237     │ (4, 1, 1, 1, 5400)  │
│ IJF   │ 10 microseconds, 410 nanoseconds │ 0.00028081 │ 0.00572571  │ (4, 4, 1, 1, 1)     │
│ IF    │ 10 microseconds, 410 nanoseconds │ 7.02024e-5 │ 0.00143143  │ (4, 1, 1, 1, 1)     │
│ VF    │ 11 microseconds, 90 nanoseconds  │ 0.00103789 │ 0.0211626   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 53 microseconds, 541 nanoseconds │ 18.5746    │ 378.736     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 17 microseconds, 141 nanoseconds │ 14.5053    │ 295.763     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

copyto!

N reads-writes: 2, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬───────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                     │ bw %       │ achieved bw │ problem size        │
├───────┼───────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 11 microseconds, 970 nanoseconds  │ 6.10532e-5 │ 0.00124488  │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 18 microseconds, 459 nanoseconds  │ 3.42065    │ 69.747      │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 13 microseconds, 560 nanoseconds  │ 1.16412    │ 23.7364     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 12 microseconds, 900 nanoseconds  │ 0.00356934 │ 0.0727788   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 114 microseconds, 770 nanoseconds │ 34.6603    │ 706.724     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 22 microseconds, 900 nanoseconds  │ 43.4272    │ 885.48      │ (4, 1, 1, 63, 5400) │
└───────┴───────────────────────────────────┴────────────┴─────────────┴─────────────────────┘
N reads-writes: 2, N-reps: 10000,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 12 microseconds, 941 nanoseconds │ 2.82383e-5 │ 0.000575779 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 18 microseconds, 490 nanoseconds │ 1.70746    │ 34.815      │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 13 microseconds, 329 nanoseconds │ 0.592146   │ 12.0739     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 12 microseconds, 971 nanoseconds │ 0.0017749  │ 0.0361902   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 68 microseconds, 559 nanoseconds │ 29.011     │ 591.534     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 21 microseconds, 630 nanoseconds │ 22.9885    │ 468.736     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

stencils

Problem size: (1, 1, 1, 63, 1), N-reps: 1,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                    │ bw %        │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 13 microseconds, 451 nanoseconds │ 0.00342311  │ 0.0697973   │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 14 microseconds, 31 nanoseconds  │ 0.0032816   │ 0.0669118   │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 14 microseconds, 191 nanoseconds │ 0.0032446   │ 0.0661574   │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 13 microseconds, 801 nanoseconds │ 0.00333629  │ 0.068027    │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 13 microseconds, 740 nanoseconds │ 0.00502629  │ 0.102486    │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 14 microseconds, 201 nanoseconds │ 0.00486347  │ 0.0991662   │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 13 microseconds, 990 nanoseconds │ 0.00493683  │ 0.100662    │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 13 microseconds, 580 nanoseconds │ 0.00339034  │ 0.0691291   │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 13 microseconds, 890 nanoseconds │ 0.00331468  │ 0.0675863   │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 13 microseconds, 871 nanoseconds │ 0.00331946  │ 0.0676837   │ 2              │
│ (op_broadcast_example2!, :none)                               │ 12 microseconds, 500 nanoseconds │ 0.00736654  │ 0.150204    │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 13 microseconds, 479 nanoseconds │ 0.00341575  │ 0.0696471   │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 13 microseconds, 410 nanoseconds │ 0.00343332  │ 0.0700055   │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 13 microseconds, 720 nanoseconds │ 0.00335575  │ 0.0684237   │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 13 microseconds, 801 nanoseconds │ 0.00333629  │ 0.068027    │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 13 microseconds, 631 nanoseconds │ 0.00337791  │ 0.0688755   │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 13 microseconds, 709 nanoseconds │ 0.00335844  │ 0.0684786   │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 14 microseconds, 941 nanoseconds │ 0.00462258  │ 0.0942543   │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 13 microseconds, 631 nanoseconds │ 0.00506686  │ 0.103313    │ 3              │
└───────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────────┴─────────────┴────────────────┘
Problem size: (4, 4, 1, 63, 5400), N-reps: 1,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %     │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 67 microseconds, 610 nanoseconds  │ 58.8373  │ 1199.69     │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 93 microseconds, 840 nanoseconds  │ 42.391   │ 864.353     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 101 microseconds, 220 nanoseconds │ 39.3002  │ 801.332     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 87 microseconds, 750 nanoseconds  │ 45.3325  │ 924.33      │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 202 microseconds, 548 nanoseconds │ 29.4592  │ 600.672     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 260 microseconds, 169 nanoseconds │ 22.9348  │ 467.64      │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 191 microseconds, 289 nanoseconds │ 31.1931  │ 636.027     │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 67 microseconds, 570 nanoseconds  │ 58.8713  │ 1200.38     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 70 microseconds, 140 nanoseconds  │ 56.7141  │ 1156.4      │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 78 microseconds, 459 nanoseconds  │ 50.7008  │ 1033.79     │ 2              │
│ (op_broadcast_example0!, :none)                               │ 129 microseconds, 829 nanoseconds │ 45.9597  │ 937.117     │ 3              │
│ (op_broadcast_example1!, :none)                               │ 248 microseconds, 288 nanoseconds │ 32.0429  │ 653.354     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 248 microseconds, 158 nanoseconds │ 32.0597  │ 653.696     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 66 microseconds, 850 nanoseconds  │ 59.5062  │ 1213.33     │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 66 microseconds, 579 nanoseconds  │ 59.7475  │ 1218.25     │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 68 microseconds, 119 nanoseconds  │ 58.3968  │ 1190.71     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 67 microseconds, 20 nanoseconds   │ 59.3544  │ 1210.24     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 66 microseconds, 469 nanoseconds  │ 59.8464  │ 1220.27     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 68 microseconds, 700 nanoseconds  │ 57.9029  │ 1180.64     │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 437 microseconds, 307 nanoseconds │ 13.6446  │ 278.214     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 143 microseconds, 478 nanoseconds │ 41.5875  │ 847.97      │ 3              │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘
Problem size: (4, 4, 1, 63, 5400), N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %     │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 62 microseconds, 569 nanoseconds  │ 31.7883  │ 648.164     │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 86 microseconds, 931 nanoseconds  │ 22.8801  │ 466.525     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 92 microseconds, 740 nanoseconds  │ 21.4469  │ 437.303     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 81 microseconds, 360 nanoseconds  │ 24.4465  │ 498.464     │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 119 microseconds, 759 nanoseconds │ 24.9121  │ 507.958     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 224 microseconds, 729 nanoseconds │ 13.2758  │ 270.694     │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 129 microseconds, 610 nanoseconds │ 23.0188  │ 469.354     │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 62 microseconds, 160 nanoseconds  │ 31.9975  │ 652.429     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 64 microseconds, 470 nanoseconds  │ 30.8515  │ 629.062     │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 77 microseconds, 920 nanoseconds  │ 25.5257  │ 520.47      │ 2              │
│ (op_broadcast_example0!, :none)                               │ 75 microseconds, 70 nanoseconds   │ 39.7427  │ 810.354     │ 3              │
│ (op_broadcast_example1!, :none)                               │ 134 microseconds, 119 nanoseconds │ 29.6597  │ 604.761     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 134 microseconds, 769 nanoseconds │ 29.5167  │ 601.845     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 58 microseconds, 701 nanoseconds  │ 33.8836  │ 690.886     │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 60 microseconds, 410 nanoseconds  │ 32.925   │ 671.34      │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 62 microseconds, 541 nanoseconds  │ 31.8031  │ 648.465     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 61 microseconds, 250 nanoseconds  │ 32.4734  │ 662.133     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 57 microseconds, 510 nanoseconds  │ 34.5847  │ 705.182     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 64 microseconds, 330 nanoseconds  │ 30.9182  │ 630.421     │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 281 microseconds, 118 nanoseconds │ 10.6128  │ 216.395     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 115 microseconds, 300 nanoseconds │ 25.8757  │ 527.606     │ 3              │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘

This PR

fill!

N reads-writes: 1, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬─────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %        │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ DataF │ 10 microseconds, 641 nanoseconds │ 3.43424e-5  │ 0.000700243 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 11 microseconds, 901 nanoseconds │ 2.65301     │ 54.095      │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 11 microseconds, 719 nanoseconds │ 0.673498    │ 13.7326     │ (4, 1, 1, 1, 5400)  │
│ IJF   │ 10 microseconds, 980 nanoseconds │ 0.000532464 │ 0.0108569   │ (4, 4, 1, 1, 1)     │
│ IF    │ 10 microseconds, 930 nanoseconds │ 0.000133725 │ 0.00272665  │ (4, 1, 1, 1, 1)     │
│ VF    │ 10 microseconds, 950 nanoseconds │ 0.00210232  │ 0.0428664   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 35 microseconds, 270 nanoseconds │ 56.3925     │ 1149.84     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 15 microseconds, 111 nanoseconds │ 32.9081     │ 670.996     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴─────────────┴─────────────┴─────────────────────┘
N reads-writes: 1, N-reps: 10000,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬─────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %        │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ DataF │ 11 microseconds, 21 nanoseconds  │ 1.65791e-5  │ 0.000338048 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 11 microseconds, 820 nanoseconds │ 1.33549     │ 27.2305     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 11 microseconds, 800 nanoseconds │ 0.334437    │ 6.81918     │ (4, 1, 1, 1, 5400)  │
│ IJF   │ 11 microseconds, 130 nanoseconds │ 0.000262644 │ 0.00535531  │ (4, 4, 1, 1, 1)     │
│ IF    │ 10 microseconds, 861 nanoseconds │ 6.72935e-5  │ 0.00137211  │ (4, 1, 1, 1, 1)     │
│ VF    │ 10 microseconds, 910 nanoseconds │ 0.00105502  │ 0.0215118   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 31 microseconds, 491 nanoseconds │ 31.5809     │ 643.935     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 15 microseconds                  │ 16.5747     │ 337.958     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴─────────────┴─────────────┴─────────────────────┘

copyto!

N reads-writes: 2, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 11 microseconds, 840 nanoseconds │ 6.17236e-5 │ 0.00125854  │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 13 microseconds, 631 nanoseconds │ 4.63256    │ 94.4578     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 12 microseconds, 271 nanoseconds │ 1.28651    │ 26.2319     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 12 microseconds, 560 nanoseconds │ 0.00366567 │ 0.0747431   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 62 microseconds, 181 nanoseconds │ 63.9744    │ 1304.44     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 19 microseconds, 409 nanoseconds │ 51.2382    │ 1044.75     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘
N reads-writes: 2, N-reps: 10000,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 12 microseconds, 160 nanoseconds │ 3.00496e-5 │ 0.000612712 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 13 microseconds, 501 nanoseconds │ 2.33858    │ 47.6837     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 12 microseconds, 111 nanoseconds │ 0.651752   │ 13.2892     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 11 microseconds, 901 nanoseconds │ 0.00193449 │ 0.0394443   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 47 microseconds, 9 nanoseconds   │ 42.3103    │ 862.707     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 19 microseconds, 80 nanoseconds  │ 26.0609    │ 531.381     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

stencils

Problem size: (1, 1, 1, 63, 1), N-reps: 1,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                    │ bw %        │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 13 microseconds, 321 nanoseconds │ 0.00345652  │ 0.0704785   │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 13 microseconds, 851 nanoseconds │ 0.00332425  │ 0.0677815   │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 13 microseconds, 960 nanoseconds │ 0.00329806  │ 0.0672474   │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 14 microseconds, 51 nanoseconds  │ 0.00327693  │ 0.0668166   │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 13 microseconds, 681 nanoseconds │ 0.00504834  │ 0.102936    │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 13 microseconds, 970 nanoseconds │ 0.00494354  │ 0.100799    │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 13 microseconds, 780 nanoseconds │ 0.0050117   │ 0.102189    │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 13 microseconds, 331 nanoseconds │ 0.00345393  │ 0.0704256   │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 13 microseconds, 631 nanoseconds │ 0.00337791  │ 0.0688755   │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 13 microseconds, 550 nanoseconds │ 0.00339785  │ 0.0692822   │ 2              │
│ (op_broadcast_example2!, :none)                               │ 13 microseconds, 20 nanoseconds  │ 0.00707233  │ 0.144205    │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 13 microseconds, 530 nanoseconds │ 0.00340287  │ 0.0693846   │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 13 microseconds, 370 nanoseconds │ 0.00344359  │ 0.0702149   │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 13 microseconds, 450 nanoseconds │ 0.00342337  │ 0.0698025   │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 13 microseconds, 440 nanoseconds │ 0.00342566  │ 0.0698492   │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 12 microseconds, 601 nanoseconds │ 0.00365404  │ 0.0745058   │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 13 microseconds, 529 nanoseconds │ 0.00340312  │ 0.0693897   │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 14 microseconds, 181 nanoseconds │ 0.00487033  │ 0.099306    │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 13 microseconds, 710 nanoseconds │ 0.00503729  │ 0.10271     │ 3              │
└───────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────────┴─────────────┴────────────────┘
Problem size: (4, 4, 1, 63, 5400), N-reps: 1,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %     │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 68 microseconds, 669 nanoseconds  │ 57.9291  │ 1181.17     │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 80 microseconds, 229 nanoseconds  │ 49.5822  │ 1010.98     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 92 microseconds, 329 nanoseconds  │ 43.0843  │ 878.489     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 80 microseconds, 90 nanoseconds   │ 49.6683  │ 1012.74     │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 203 microseconds, 498 nanoseconds │ 29.3216  │ 597.868     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 216 microseconds, 259 nanoseconds │ 27.5916  │ 562.592     │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 197 microseconds, 838 nanoseconds │ 30.1605  │ 614.973     │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 68 microseconds, 429 nanoseconds  │ 58.1322  │ 1185.32     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 71 microseconds, 69 nanoseconds   │ 55.9728  │ 1141.29     │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 74 microseconds, 249 nanoseconds  │ 53.5755  │ 1092.41     │ 2              │
│ (op_broadcast_example0!, :none)                               │ 69 microseconds, 829 nanoseconds  │ 85.4501  │ 1742.33     │ 3              │
│ (op_broadcast_example1!, :none)                               │ 165 microseconds, 438 nanoseconds │ 48.0897  │ 980.549     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 164 microseconds, 739 nanoseconds │ 48.294   │ 984.715     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 67 microseconds, 589 nanoseconds  │ 58.8547  │ 1200.05     │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 68 microseconds, 69 nanoseconds   │ 58.4397  │ 1191.59     │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 68 microseconds, 769 nanoseconds  │ 57.8448  │ 1179.46     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 67 microseconds, 839 nanoseconds  │ 58.6378  │ 1195.62     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 68 microseconds, 689 nanoseconds  │ 57.9122  │ 1180.83     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 69 microseconds, 730 nanoseconds  │ 57.0484  │ 1163.22     │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 288 microseconds, 927 nanoseconds │ 20.6519  │ 421.093     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 145 microseconds, 699 nanoseconds │ 40.9536  │ 835.043     │ 3              │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘
Problem size: (4, 4, 1, 63, 5400), N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %     │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 53 microseconds, 300 nanoseconds  │ 37.3164  │ 760.882     │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 70 microseconds, 899 nanoseconds  │ 28.0535  │ 572.011     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 80 microseconds, 59 nanoseconds   │ 24.8437  │ 506.564     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 72 microseconds, 870 nanoseconds  │ 27.2951  │ 556.547     │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 114 microseconds, 439 nanoseconds │ 26.0702  │ 531.571     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 162 microseconds, 79 nanoseconds  │ 18.4075  │ 375.329     │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 124 microseconds, 449 nanoseconds │ 23.9733  │ 488.815     │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 54 microseconds, 950 nanoseconds  │ 36.1966  │ 738.048     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 62 microseconds, 750 nanoseconds  │ 31.6972  │ 646.305     │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 71 microseconds, 350 nanoseconds  │ 27.8762  │ 568.395     │ 2              │
│ (op_broadcast_example0!, :none)                               │ 66 microseconds, 101 nanoseconds  │ 45.1354  │ 920.31      │ 3              │
│ (op_broadcast_example1!, :none)                               │ 94 microseconds, 239 nanoseconds  │ 42.2111  │ 860.684     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 93 microseconds, 700 nanoseconds  │ 42.4544  │ 865.644     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 60 microseconds, 90 nanoseconds   │ 33.0998  │ 674.904     │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 51 microseconds, 369 nanoseconds  │ 38.7192  │ 789.484     │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 58 microseconds, 879 nanoseconds  │ 33.7806  │ 688.785     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 59 microseconds, 200 nanoseconds  │ 33.5974  │ 685.051     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 50 microseconds, 309 nanoseconds  │ 39.535   │ 806.118     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 59 microseconds, 49 nanoseconds   │ 33.6833  │ 686.802     │ 2              │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 189 microseconds, 938 nanoseconds │ 15.7075  │ 320.276     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 112 microseconds, 570 nanoseconds │ 26.5033  │ 540.402     │ 3              │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘

@charleskawczynski (Member, Author) commented:

Here is a compressed summary of the results.

Global specs:

  • Problem size, matched across all datalayouts: (4, 4, 1, 1, 5400)
  • Device bandwidth (GB/s): 2039
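
For reference, the bw % columns below appear to be the achieved bandwidth divided by the device bandwidth. The following is a minimal sketch under two assumptions that are not stated in this PR: the bytes moved are N reads-writes × number of elements × sizeof(float type), and the achieved bandwidth is expressed in units of 2^30 bytes per second. With those assumptions the IJFH fill! row for main (12 microseconds, 51 nanoseconds; Float64) is reproduced to within the rounding of the printed time. The helper names achieved_bw and bw_percent are hypothetical and are not part of ClimaCore.

```julia
# Hypothetical helpers (not ClimaCore API) that reproduce the
# "achieved bw" and "bw %" columns of the benchmark tables.
# Assumption: bytes moved = n_reads_writes * prod(problem_size) * sizeof(FT),
# and bandwidth is reported in units of 2^30 bytes per second.
achieved_bw(problem_size, ::Type{FT}, n_reads_writes, time_s) where {FT} =
    n_reads_writes * prod(problem_size) * sizeof(FT) / time_s / 2^30

bw_percent(bw, device_bw) = 100 * bw / device_bw

# IJFH fill! on main: Float64, 1 read-write, 12.051 microseconds per call
bw = achieved_bw((4, 4, 1, 1, 5400), Float64, 1, 12.051e-6)
pct = bw_percent(bw, 2039)
println((bw, pct))  # roughly (53.42, 2.62)
```

The percentages are tiny for the small layouts (DataF, IJF, IF) because so few bytes are moved per call that the time is dominated by the fixed per-call cost of roughly 10 microseconds.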

fill!

N reads-writes: 1
        |----------- Main ---------|----------- this PR -------|
┌───────┬─────────────┬────────────┬─────────────┬─────────────┐
│ funcs │ bw % F64    │ bw % F32   │ bw % F64    │ bw % F32    │
├───────┼─────────────┼────────────┼─────────────┼─────────────┤
│ DataF │ 3.43102e-5  │ 1.7059e-5  │ 3.43424e-5  │ 1.65791e-5  │
│ IJFH  │ 2.61999     │ 1.28861    │ 2.65301     │ 1.33549     │
│ IFH   │ 0.697853    │ 0.335575   │ 0.673498    │ 0.334437    │
│ IJF   │ 0.000560059 │ 0.00028081 │ 0.000532464 │ 0.000262644 │
│ IF    │ 0.000134093 │ 7.02024e-5 │ 0.000133725 │ 6.72935e-5  │
│ VF    │ 0.0021004   │ 0.00103789 │ 0.00210232  │ 0.00105502  │
│ VIJFH │ 27.0097     │ 18.5746    │ 56.3925     │ 31.5809     │
│ VIFH  │ 26.7047     │ 14.5053    │ 32.9081     │ 16.5747     │
└───────┴─────────────┴────────────┴─────────────┴─────────────┘

copyto!

N reads-writes: 2
        |----------- Main --------|----------- this PR -----|
┌───────┬────────────┬────────────┬────────────┬────────────┐
│ funcs │ bw % F64   │ bw % F32   │ bw % F64   │ bw % F32   │
├───────┼────────────┼────────────┼────────────┼────────────┤
│ DataF │ 6.10532e-5 │ 2.82383e-5 │ 6.17236e-5 │ 3.00496e-5 │
│ IJFH  │ 3.42065    │ 1.70746    │ 4.63256    │ 2.33858    │
│ IFH   │ 1.16412    │ 0.592146   │ 1.28651    │ 0.651752   │
│ VF    │ 0.00356934 │ 0.0017749  │ 0.00366567 │ 0.00193449 │
│ VIJFH │ 34.6603    │ 29.011     │ 63.9744    │ 42.3103    │
│ VIFH  │ 43.4272    │ 22.9885    │ 51.2382    │ 26.0609    │
└───────┴────────────┴────────────┴────────────┴────────────┘

stencils

                                                         |------- Main --------|------- this PR -----|
┌────────────────────────────────────────────────────────┬──────────┬──────────┬──────────┬──────────┐
│ funcs                                                  │ bw % F64 │ bw % F32 │ bw % F64 │ bw % F32 │
├────────────────────────────────────────────────────────┼──────────┼──────────┼──────────┼──────────┤
│ (op_GradientF2C!, :none)                               │ 58.8373  │ 31.7883  │ 57.9291  │ 37.3164  │
│ (op_GradientF2C!, :SetValue, :SetValue)                │ 42.391   │ 22.8801  │ 49.5822  │ 28.0535  │
│ (op_GradientC2F!, :SetGradient, :SetGradient)          │ 39.3002  │ 21.4469  │ 43.0843  │ 24.8437  │
│ (op_GradientC2F!, :SetValue, :SetValue)                │ 45.3325  │ 24.4465  │ 49.6683  │ 27.2951  │
│ (op_DivergenceF2C!, :none)                             │ 29.4592  │ 24.9121  │ 29.3216  │ 26.0702  │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)        │ 22.9348  │ 13.2758  │ 27.5916  │ 18.4075  │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)    │ 31.1931  │ 23.0188  │ 30.1605  │ 23.9733  │
│ (op_InterpolateF2C!, :none)                            │ 58.8713  │ 31.9975  │ 58.1322  │ 36.1966  │
│ (op_InterpolateC2F!, :SetValue, :SetValue)             │ 56.7141  │ 30.8515  │ 55.9728  │ 31.6972  │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)       │ 50.7008  │ 25.5257  │ 53.5755  │ 27.8762  │
│ (op_broadcast_example0!, :none)                        │ 45.9597  │ 39.7427  │ 85.4501  │ 45.1354  │
│ (op_broadcast_example1!, :none)                        │ 32.0429  │ 29.6597  │ 48.0897  │ 42.2111  │
│ (op_broadcast_example2!, :none)                        │ 32.0597  │ 29.5167  │ 48.294   │ 42.4544  │
│ (op_LeftBiasedC2F!, :SetValue)                         │ 59.5062  │ 33.8836  │ 58.8547  │ 33.0998  │
│ (op_LeftBiasedF2C!, :none)                             │ 59.7475  │ 32.925   │ 58.4397  │ 38.7192  │
│ (op_LeftBiasedF2C!, :SetValue)                         │ 58.3968  │ 31.8031  │ 57.8448  │ 33.7806  │
│ (op_RightBiasedC2F!, :SetValue)                        │ 59.3544  │ 32.4734  │ 58.6378  │ 33.5974  │
│ (op_RightBiasedF2C!, :none)                            │ 59.8464  │ 34.5847  │ 57.9122  │ 39.535   │
│ (op_RightBiasedF2C!, :SetValue)                        │ 57.9029  │ 30.9182  │ 57.0484  │ 33.6833  │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)          │ 13.6446  │ 10.6128  │ 20.6519  │ 15.7075  │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)│ 41.5875  │ 25.8757  │ 40.9536  │ 26.5033  │
└────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴──────────┘

So the notable improvements here are in the fill! calls (e.g., @. x = 1) and the copyto! (all pointwise) kernels. The stencils don't see much of an improvement; perhaps a different issue is limiting them.
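
To put numbers on this, the ratio of bw % on this PR to bw % on main can be computed for a few representative rows. This is only a sketch: the values are copied from the Float64 columns of the summary tables above, and the speedup helper is a hypothetical name introduced for illustration.

```julia
# Ratio of achieved bandwidth fraction: this PR relative to main.
# `speedup` is a hypothetical helper, not part of ClimaCore.
speedup(on_main, on_pr) = on_pr / on_main

# bw % (Float64), copied from the summary tables: name => (main, this PR)
rows = [
    "fill! VIJFH"            => (27.0097, 56.3925),
    "copyto! VIJFH"          => (34.6603, 63.9744),
    "op_broadcast_example0!" => (45.9597, 85.4501),
    "op_GradientF2C! :none"  => (58.8373, 57.9291),
]

for (name, (on_main, on_pr)) in rows
    println(rpad(name, 24), round(speedup(on_main, on_pr); digits = 2), "x")
end
```

The pointwise VIJFH kernels come out at roughly 1.8x to 2.1x, while a typical stencil such as op_GradientF2C! stays near 1x. The broadcast example rows, although listed in the stencil table, improve by a similar factor to the pointwise kernels.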

@charleskawczynski force-pushed the ck/thread_blocks branch 9 times, most recently from 78058a0 to 4417516, September 6, 2024 01:00
@charleskawczynski
Member Author

Closes #1854.

@charleskawczynski force-pushed the ck/thread_blocks branch 2 times, most recently from 226bec3 to d34f647, September 6, 2024 12:17
@sriharshakandala
Member

Can we also add the usual benchmarks for these kernels with a sync statement at the end? Might be easier to use benchmark tools for gathering timing information!

@charleskawczynski
Member Author

> Can we also add the usual benchmarks for these kernels with a sync statement at the end? Might be easier to use benchmark tools for gathering timing information!

The fill!, copyto!, and stencil benchmarks all use BenchmarkTools.@benchmark CUDA.@cuda_sync to gather timings. Is that what you're referring to?

@sriharshakandala
Member

Overall, looks good to me!

@charleskawczynski charleskawczynski merged commit 3bc75d1 into main Sep 9, 2024
22 of 23 checks passed
@charleskawczynski charleskawczynski deleted the ck/thread_blocks branch September 9, 2024 20:16