Define a linear partition, and use in FD stencils #2002

Closed
wants to merge 2 commits into from


charleskawczynski
Member

A performance regression was found in #1969. Concretely,

Main
Problem size: (4, 4, 1, 63, 1536), N-reps: 1,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬───────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %      │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼───────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 31 microseconds, 280 nanoseconds  │ 36.1744   │ 737.597     │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 45 microseconds, 911 nanoseconds  │ 24.6461   │ 502.533     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 49 microseconds, 740 nanoseconds  │ 22.7483   │ 463.838     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 41 microseconds, 890 nanoseconds  │ 27.0119   │ 550.772     │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 2 milliseconds, 101 microseconds  │ 0.807571  │ 16.4664     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 496 microseconds, 567 nanoseconds │ 3.41798   │ 69.6925     │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 77 microseconds, 619 nanoseconds  │ 21.8664   │ 445.856     │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 31 microseconds, 831 nanoseconds  │ 35.5482   │ 724.828     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 43 microseconds, 859 nanoseconds  │ 25.7986   │ 526.033     │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 39 microseconds, 699 nanoseconds  │ 28.502    │ 581.155     │ 2              │
│ (op_broadcast_example0!, :none)                               │ 33 microseconds, 870 nanoseconds  │ 50.1122   │ 1021.79     │ 3              │
│ (op_broadcast_example1!, :none)                               │ 63 microseconds, 279 nanoseconds  │ 35.7623   │ 729.193     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 62 microseconds, 560 nanoseconds  │ 36.1739   │ 737.585     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 33 microseconds, 590 nanoseconds  │ 33.6856   │ 686.85      │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 30 microseconds, 740 nanoseconds  │ 36.8087   │ 750.53      │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 32 microseconds, 999 nanoseconds  │ 34.2889   │ 699.151     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 33 microseconds, 980 nanoseconds  │ 33.299    │ 678.967     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 30 microseconds, 761 nanoseconds  │ 36.7848   │ 750.042     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 33 microseconds, 910 nanoseconds  │ 33.3677   │ 680.368     │ 2              │
│ (op_CurlC2F!, :SetCurl, :SetCurl)                             │ 56 microseconds, 891 nanoseconds  │ -9.94463  │ -202.771    │ -1             │
│ (op_CurlC2F!, :SetValue, :SetValue)                           │ 1 millisecond, 225 microseconds   │ -0.461728 │ -9.41462    │ -1             │
│ (op_UBPC2F!, :SetValue, :SetValue)                            │ 68 microseconds, 790 nanoseconds  │ -8.22443  │ -167.696    │ -1             │
│ (op_UBPC2F!, :Extrapolate, :Extrapolate)                      │ 69 microseconds, 570 nanoseconds  │ -8.13222  │ -165.816    │ -1             │
│ (op_divO3UBPC2F!, :1SidedO3, :1SidedO3, :SetValue, :SetValue) │ 244 microseconds, 58 nanoseconds  │ -2.3181   │ -47.266     │ -1             │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 134 microseconds, 919 nanoseconds │ 12.5798   │ 256.502     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 59 microseconds, 341 nanoseconds  │ 28.6021   │ 583.197     │ 3              │
│ (op_div_interp_CC!, :SetValue, :SetValue, :none)              │ 586 microseconds, 486 nanoseconds │ -0.964645 │ -19.6691    │ -1             │
│ (op_div_interp_FF!, :none, :SetValue, :SetValue)              │ 69 microseconds, 470 nanoseconds  │ -8.14392  │ -166.055    │ -1             │
│ (op_divgrad_uₕ!, :none, :SetValue, :Extrapolate)              │ 1 millisecond, 844 microseconds   │ -0.306743 │ -6.2545     │ -1             │
│ (op_divgrad_uₕ!, :none, :SetValue, :SetValue)                 │ 114 microseconds, 140 nanoseconds │ -4.95668  │ -101.067    │ -1             │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴───────────┴─────────────┴────────────────┘
3fd62e1
Problem size: (4, 4, 1, 63, 1536), N-reps: 1,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %     │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 28 microseconds, 900 nanoseconds  │ 39.1536  │ 798.342     │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 39 microseconds, 41 nanoseconds   │ 28.9831  │ 590.965     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 43 microseconds, 701 nanoseconds  │ 25.8925  │ 527.947     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 34 microseconds, 959 nanoseconds  │ 32.3665  │ 659.953     │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 68 microseconds, 559 nanoseconds  │ 24.7561  │ 504.776     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 88 microseconds, 871 nanoseconds  │ 19.0981  │ 389.411     │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 67 microseconds, 650 nanoseconds  │ 25.0887  │ 511.559     │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 29 microseconds, 61 nanoseconds   │ 38.9367  │ 793.919     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 30 microseconds, 320 nanoseconds  │ 37.3186  │ 760.926     │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 33 microseconds, 530 nanoseconds  │ 33.7469  │ 688.1       │ 2              │
│ (op_broadcast_example0!, :none)                               │ 47 microseconds, 409 nanoseconds  │ 35.8002  │ 729.965     │ 3              │
│ (op_broadcast_example1!, :none)                               │ 82 microseconds, 741 nanoseconds  │ 27.3507  │ 557.682     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 82 microseconds, 599 nanoseconds  │ 27.3974  │ 558.634     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 28 microseconds, 440 nanoseconds  │ 39.7869  │ 811.255     │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 28 microseconds, 330 nanoseconds  │ 39.94    │ 814.377     │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 29 microseconds, 461 nanoseconds  │ 38.408   │ 783.139     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 28 microseconds, 820 nanoseconds  │ 39.2623  │ 800.558     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 27 microseconds, 511 nanoseconds  │ 41.1305  │ 838.651     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 30 microseconds, 19 nanoseconds   │ 37.6928  │ 768.556     │ 2              │
│ (op_CurlC2F!, :SetCurl, :SetCurl)                             │ 51 microseconds, 860 nanoseconds  │ -10.9092 │ -222.438    │ -1             │
│ (op_CurlC2F!, :SetValue, :SetValue)                           │ 54 microseconds, 509 nanoseconds  │ -10.379  │ -211.628    │ -1             │
│ (op_UBPC2F!, :SetValue, :SetValue)                            │ 54 microseconds, 660 nanoseconds  │ -10.3503 │ -211.044    │ -1             │
│ (op_UBPC2F!, :Extrapolate, :Extrapolate)                      │ 58 microseconds, 159 nanoseconds  │ -9.72764 │ -198.347    │ -1             │
│ (op_divO3UBPC2F!, :1SidedO3, :1SidedO3, :SetValue, :SetValue) │ 201 microseconds, 219 nanoseconds │ -2.81163 │ -57.3291    │ -1             │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 135 microseconds, 579 nanoseconds │ 12.5185  │ 255.253     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 54 microseconds, 910 nanoseconds  │ 30.9102  │ 630.26      │ 3              │
│ (op_div_interp_CC!, :SetValue, :SetValue, :none)              │ 88 microseconds, 690 nanoseconds  │ -6.37896 │ -130.067    │ -1             │
│ (op_div_interp_FF!, :none, :SetValue, :SetValue)              │ 62 microseconds, 511 nanoseconds  │ -9.05055 │ -184.541    │ -1             │
│ (op_divgrad_uₕ!, :none, :SetValue, :Extrapolate)              │ 105 microseconds, 959 nanoseconds │ -5.33933 │ -108.869    │ -1             │
│ (op_divgrad_uₕ!, :none, :SetValue, :SetValue)                 │ 92 microseconds, 210 nanoseconds  │ -6.13552 │ -125.103    │ -1             │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘
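
For readers checking the bandwidth columns: the sketch below reproduces the first row of the main table from the problem size, the read/write count, and the time per call. The formula (bytes moved divided by time, scaled by 2^30) is an assumption inferred from the numbers, not taken from the benchmark source, and `achieved_bandwidth` is a hypothetical helper name. Under the same formula, rows with an N reads-writes of -1 come out negative, which suggests the count was simply not specified for those kernels.

```julia
# Assumed formula, inferred from the table values rather than from the benchmark code:
# achieved bandwidth = (elements * bytes per element * reads-writes) / time / 2^30
function achieved_bandwidth(problem_size, n_reads_writes, time_s; T = Float64)
    bytes_moved = prod(problem_size) * sizeof(T) * n_reads_writes
    return bytes_moved / time_s / 2^30
end

device_bw = 2039                    # Device_bandwidth_GBs from the table header
problem_size = (4, 4, 1, 63, 1536)  # Problem size from the table header

# (op_GradientF2C!, :none) on main: 31 microseconds, 280 nanoseconds, 2 reads-writes
bw = achieved_bandwidth(problem_size, 2, 31.280e-6)
bw_percent = 100 * bw / device_bw

println("achieved bw ≈ ", round(bw; digits = 1), ", bw % ≈ ", round(bw_percent; digits = 1))
```

The result agrees with the tabulated 737.597 and 36.1744 to within the rounding of the printed time.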

Most notably:

main
│ (op_DivergenceF2C!, :none)                                    │ 2 milliseconds, 101 microseconds  │ 0.807571  │ 16.4664     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 496 microseconds, 567 nanoseconds │ 3.41798   │ 69.6925     │ 3              │
│ (op_CurlC2F!, :SetValue, :SetValue)                           │ 1 millisecond, 225 microseconds   │ -0.461728 │ -9.41462    │ -1             │
│ (op_divgrad_uₕ!, :none, :SetValue, :Extrapolate)              │ 1 millisecond, 844 microseconds   │ -0.306743 │ -6.2545     │ -1             │
3fd62e1
│ (op_DivergenceF2C!, :none)                                    │ 68 microseconds, 559 nanoseconds  │ 24.7561  │ 504.776     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 88 microseconds, 871 nanoseconds  │ 19.0981  │ 389.411     │ 3              │
│ (op_CurlC2F!, :SetValue, :SetValue)                           │ 54 microseconds, 509 nanoseconds  │ -10.379  │ -211.628    │ -1             │
│ (op_divgrad_uₕ!, :none, :SetValue, :Extrapolate)              │ 105 microseconds, 959 nanoseconds │ -5.33933 │ -108.869    │ -1             │
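
To put a number on these four rows, here is a small sketch that computes the slowdown of main relative to 3fd62e1. The times are copied from the two excerpts and converted to microseconds; the field names are shorthand chosen for this example only.

```julia
# Times per call in microseconds, copied from the excerpts above
main_us = (
    DivergenceF2C_none = 2101.0,
    DivergenceF2C_extrapolate = 496.567,
    CurlC2F_setvalue = 1225.0,
    divgrad_uh_extrapolate = 1844.0,
)
ref_us = (
    DivergenceF2C_none = 68.559,
    DivergenceF2C_extrapolate = 88.871,
    CurlC2F_setvalue = 54.509,
    divgrad_uh_extrapolate = 105.959,
)

# Element-wise ratio main / 3fd62e1; values above 1 mean main is slower
slowdown = map(/, main_us, ref_us)

for (name, ratio) in pairs(slowdown)
    println(rpad(String(name), 28), round(ratio; digits = 1), "x")
end
```

The ratios range from roughly 5.6x to roughly 30.6x, so these kernels dominate the regression in the microbenchmarks.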

This PR is an attempt to fix this regression by reverting the prescribed thread-block configuration for stencils to use a linear partition.
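
As a rough illustration of what a linear partition means here, the CPU-only sketch below picks a one-dimensional launch configuration and maps each (block, thread) pair back to a multi-dimensional index. The helper names `linear_partition` and `unravel` and the default of 256 threads per block are assumptions made for this example; they are not ClimaCore's API or the values used in this PR.

```julia
# Hypothetical helper: 1-D launch configuration covering `nitems` work items
function linear_partition(nitems::Integer, max_threads::Integer = 256)
    threads = max(1, min(max_threads, nitems))
    blocks = cld(nitems, threads)  # round up so every item is covered
    return (; threads, blocks)
end

# Hypothetical helper: map a 1-based (block, thread) pair to a CartesianIndex,
# or `nothing` when the thread falls past the end of the index space
function unravel(us::CartesianIndices, block::Integer, thread::Integer, threads::Integer)
    i = (block - 1) * threads + thread
    return i <= length(us) ? us[i] : nothing
end

# Index space with the shape of the benchmark problem size
us = CartesianIndices((4, 4, 1, 63, 1536))
p = linear_partition(length(us))

first_idx = unravel(us, 1, 1, p.threads)
last_idx = unravel(us, p.blocks, p.threads, p.threads)
println(p, " ", first_idx, " ", last_idx)
```

The conversion `us[i]` is where the integer divisions by the extents happen, which is the cost that the multi-dimensional thread-block layout avoids.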

@charleskawczynski charleskawczynski force-pushed the ck/linear_partition branch 3 times, most recently from 3a8464b to eaa7c4a Compare September 23, 2024 20:48
@charleskawczynski
Member Author

@Sbozzolo or @szy21, can one of you please try this out to see if this fixes the performance regression in the coupler job?

@charleskawczynski
Member Author

For now, this fully reverts from the multi-dimensional thread-block to a dynamic CartesianIndices. However, we could try passing CartesianIndices through Val into the kernel to see if this offers any speedup (in case the compiler can optimize division by constant divisors).
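
A minimal CPU-only sketch of the Val idea, under the assumption that the extents are known when the kernel is launched: `dynamic_index` receives the extents as runtime data, while `static_index` receives them as a type parameter, so they are compile-time constants inside the method. Both function names are hypothetical, and this variation lifts the extents tuple (rather than the CartesianIndices object itself) into Val. Whether this actually speeds up the GPU kernels is exactly the open question above.

```julia
# Extents are runtime values: the linear-to-Cartesian conversion divides
# by numbers the compiler cannot see
dynamic_index(us::CartesianIndices, i::Integer) = us[i]

# Extents are part of the type: each distinct `dims` compiles a specialized
# method in which the divisors are constants
static_index(::Val{dims}, i::Integer) where {dims} = CartesianIndices(dims)[i]

dims = (4, 4, 1, 63, 1536)
us = CartesianIndices(dims)

a = dynamic_index(us, 17)
b = static_index(Val(dims), 17)
println(a, " ", b)
```

One trade-off to keep in mind is that every new problem size triggers a fresh compilation of the specialized method.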

@Sbozzolo
Member

> @Sbozzolo or @szy21, can one of you please try this out to see if this fixes the performance regression in the coupler job?

Here: https://buildkite.com/clima/climacoupler-amip/builds/56

@charleskawczynski
Member Author

> > @Sbozzolo or @szy21, can one of you please try this out to see if this fixes the performance regression in the coupler job?
>
> Here: https://buildkite.com/clima/climacoupler-amip/builds/56

Looks like it's broken on an unrelated land component.

@Sbozzolo
Member

> > > @Sbozzolo or @szy21, can one of you please try this out to see if this fixes the performance regression in the coupler job?
> >
> > Here: https://buildkite.com/clima/climacoupler-amip/builds/56
>
> Looks like it's broken on an unrelated land component.

The update in Thermodynamics broke the coupler tests, so we cannot (trivially) merge the fix for this. I'll merge them manually.

@Sbozzolo
Member

I fixed it, but now it crashes because of the new Thermodynamics.

@Sbozzolo
Member

Sbozzolo commented Sep 25, 2024

This is running: https://buildkite.com/clima/climacoupler-amip/builds/64#019229f8-e641-4bc4-b733-99e276fb74d9 (at the time of writing, this is the only job running and only GPU being used)

@szy21
Member

szy21 commented Sep 25, 2024

> This is running: https://buildkite.com/clima/climacoupler-amip/builds/64#019229f8-e641-4bc4-b733-99e276fb74d9 (at the time of writing, this is the only job running and only GPU being used)

There seems to be a ~2% increase in SYPD in this run compared with the current main. Not sure if it is significant?

@charleskawczynski
Member Author

charleskawczynski commented Sep 26, 2024

> There seems to be a ~2% increase in SYPD in this run compared with the current main. Not sure if it is significant?

I can try reverting more kernel launches; we don't have any microbenchmarks for the TDMA and multiple field solves. That should only account for an additional 4%, though, since Atmos was partially responsible for the slowdown (I believe @szy21 mentioned 8% total: 6% from ClimaCore and 2% from Atmos). If that fails to recover the last few percent, then we can just revert all of ClimaCore back to that commit and reapply the changes made since.

@charleskawczynski
Member Author

> This is running: https://buildkite.com/clima/climacoupler-amip/builds/64#019229f8-e641-4bc4-b733-99e276fb74d9

I'd like to iterate on this more quickly. How can I run this job interactively?

@Sbozzolo
Member

> > There seems to be a ~2% increase in SYPD in this run compared with the current main. Not sure if it is significant?
>
> I can try reverting more kernel launches, we don't have any microbenchmarks for the TDMA and multiple field solves. That should only account for an additional 4%, though, since Atmos was partially responsible for the slowdown (I believe @szy21 mentioned 8% total, 6% from ClimaCore, and 2% from Atmos). If that fails to recover the last few percent, and people are still unhappy with the regression at helem=16, then we can just revert all of ClimaCore back to that commit and reapply changes since.

Zhaoyi is already factoring out the contribution from ClimaAtmos here.

This job has an SYPD of 0.892 at week 13. An identical job with ClimaCore 0.14.12 had an SYPD of 0.956 (7% faster), and an identical job with ClimaCore 0.14.13 had an SYPD of 0.869 (2.5% slower).
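
For reference, the percentages can be recomputed from the three SYPD values, taking this job (0.892) as the baseline. `percent_change` is a helper defined only for this sketch.

```julia
# Relative difference of `new` with respect to `ref`, in percent
percent_change(new, ref) = 100 * (new - ref) / ref

sypd_this_job = 0.892  # this PR, at week 13
faster = percent_change(0.956, sypd_this_job)  # ClimaCore 0.14.12
slower = percent_change(0.869, sypd_this_job)  # ClimaCore 0.14.13

println(round(faster; digits = 1), "% and ", round(slower; digits = 1), "%")
```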

> I'd like to iterate on this more quickly, how can I run this job interactively?

julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/run_amip.jl --config_file config/amip_configs/amip.yml --job_id amip

@charleskawczynski
Member Author

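Local notes:

# Start the REPL with: julia --project=experiments/ClimaEarth/
empty!(ARGS)  # clear any arguments left over from a previous run
ENV["CLIMACOMMS_DEVICE"] = "CUDA";  # select the GPU device
# Same arguments as the command-line invocation
push!(ARGS, "--config_file", "config/amip_configs/amip.yml")
push!(ARGS, "--job_id", "amip")
# Load Revise first so edits to loaded packages are tracked, then run the driver
using Revise; include("experiments/ClimaEarth/run_amip.jl")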

@charleskawczynski
Member Author

Superseded by #2055.
