Specialize cases in run_field_matrix_solver, add debug info #1732

Merged
merged 1 commit into from
May 17, 2024

Conversation

charleskawczynski
Member

This PR specializes cases in run_field_matrix_solver! and adds the debug info that I used to better understand its performance issues.

For reference/documentation, I collected some stats in run_field_matrix_solver!:

# empty!(ClimaCore.MatrixFields.ATypeStats)
# values(ClimaCore.MatrixFields.ATypeStats)
const ATypeStats = Dict()

NVTX.@annotate function run_field_matrix_solver!(::BlockDiagonalSolve, cache, x, A, b)
    outerkey = (typeof(cache), typeof(x), typeof(A), typeof(b))
    if !haskey(ATypeStats, outerkey)
        ATypeStats[outerkey] = Dict()
    end
    innerkey = map(matrix_row_keys(keys(A))) do name
        if A[name,name] isa UniformScaling
            :UniformScaling
        else
            eltype(A[name,name])
        end
    end
    e = Main.CUDA.@elapsed begin
        names = matrix_row_keys(keys(A))
        if length(names) == 1 || all(name -> A[name,name] isa UniformScaling, names)
            foreach(matrix_row_keys(keys(A))) do name
                single_field_solve!(cache[name], x[name], A[name, name], b[name])
            end
        else
            multiple_field_solve!(cache, x, A, b)
        end
    end
    if haskey(ATypeStats[outerkey], innerkey)
        ATypeStats[outerkey][innerkey] = (ATypeStats[outerkey][innerkey][1]+1, ATypeStats[outerkey][innerkey][2]+e)
    else
        ATypeStats[outerkey][innerkey] = (1, e)
    end
end

Inside a single call to ldiv!, here are the results:

values(ClimaCore.MatrixFields.ATypeStats)
Dict{Any, Any}((TridiagonalMatrixRow{FT},) => (52, 0.11885731f0))
Dict{Any, Any}((TridiagonalMatrixRow{FT}, TridiagonalMatrixRow{FT}) => (52, 0.25513232f0))
Dict{Any, Any}((:UniformScaling,) => (13, 0.001842112f0))
Dict{Any, Any}((TridiagonalMatrixRow{FT},) => (52, 0.118791126f0))
Dict{Any, Any}((TridiagonalMatrixRow{FT},) => (52, 0.119263805f0))
Dict{Any, Any}((:UniformScaling,) => (39, 0.0034500798f0))
Dict{Any, Any}((TridiagonalMatrixRow{AxisTensor{FT, 2, Tuple{CovariantAxis{(3,)}, ContravariantAxis{(3,)}}, SMatrix{1, 1, FT, 1}}},) => (26, 0.059818085f0))
Dict{Any, Any}((TridiagonalMatrixRow{FT}, TridiagonalMatrixRow{AxisTensor{FT, 2, Tuple{CovariantAxis{(3,)}, ContravariantAxis{(3,)}}, SMatrix{1, 1, FT, 1}}}) => (52, 0.2534428f0))
Dict{Any, Any}((TridiagonalMatrixRow{FT},) => (26, 0.10939651f0))

A few important notes here:

  • A single ldiv! call in prognostic EDMF makes 364 calls to run_field_matrix_solver! (computed as sum(map(x->collect(values(x))[1][1], collect(values(ClimaCore.MatrixFields.ATypeStats))))).
  • Perhaps obviously, the TridiagonalMatrixRow solves are the expensive ones.
  • The multiple-field solve was likely not suffering from (too much) branch divergence, since eltype(A[name,name]) dispatches into the same methods for all names (as revealed by ClimaCore.MatrixFields.ATypeStats).
  • One thing is clear: we can specialize on two cases, length(names) == 1 and all(name -> A[name,name] isa UniformScaling, names). Calling single_field_solve! directly results in a simpler kernel in the first case and a significantly more efficient (and fusible) kernel in the second.
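The branching rule from the last note can be illustrated in isolation. This is only a minimal sketch: `solve_path` is a hypothetical helper, and the NamedTuple of diagonal blocks is a stand-in for the field matrix, not a ClimaCore type. It returns which path would be taken rather than launching any kernel.

```julia
using LinearAlgebra: UniformScaling, I, Tridiagonal

# Hypothetical helper: `blocks` maps each field name to its diagonal block.
# Mirrors the condition `length(names) == 1 || all(... isa UniformScaling ...)`.
function solve_path(blocks)
    if length(blocks) == 1 || all(B -> B isa UniformScaling, values(blocks))
        return :single_field_solve    # one simple kernel per block
    else
        return :multiple_field_solve  # one kernel covering all blocks
    end
end

T = Tridiagonal(ones(2), fill(4.0, 3), ones(2))

solve_path((; a = T))            # single name
solve_path((; a = 2I, b = 3I))   # all blocks are UniformScaling
solve_path((; a = T, b = T))     # several non-trivial blocks
```

Only the third call falls through to the multiple-field path, which matches the two-tridiagonal-block entries in the stats above.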

More notes: normalizing ClimaCore.MatrixFields.ATypeStats yields:

time_per_call = [
0.11885731f0/52,   # 0.0022857175
0.25513232f0/52,   # 0.004906391
0.001842112f0/13,  # 0.00014170093
0.118791126f0/52,  # 0.0022844446
0.119263805f0/52,  # 0.0022935348
0.0034500798f0/39, # 8.846359f-5
0.059818085f0/26,  # 0.0023006955
0.2534428f0/52,    # 0.0048738997
0.10939651f0/26,   # 0.004207558
]
relative_cost = time_per_call ./ sum(time_per_call)

relative_cost = time_per_call ./ sum(time_per_call)
9-element Vector{Float32}:
 0.09775373
 0.20983258
 0.006060152
 0.09769929
 0.098088056
 0.0037833396
 0.0983943
 0.20844303
 0.17994547
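The per-call normalization weights every entry equally regardless of how often it is called. As a complementary view, the same numbers can be aggregated by accumulated time. This is a sketch only: the counts and times are copied from the stats above, while the labels `:TriScalar` and `:TriTensor` are shorthand invented here for the two TridiagonalMatrixRow element types.

```julia
# (block kinds, call count, accumulated seconds), copied from ATypeStats above.
stats = [
    ((:TriScalar,), 52, 0.11885731f0),
    ((:TriScalar, :TriScalar), 52, 0.25513232f0),
    ((:UniformScaling,), 13, 0.001842112f0),
    ((:TriScalar,), 52, 0.118791126f0),
    ((:TriScalar,), 52, 0.119263805f0),
    ((:UniformScaling,), 39, 0.0034500798f0),
    ((:TriTensor,), 26, 0.059818085f0),
    ((:TriScalar, :TriTensor), 52, 0.2534428f0),
    ((:TriScalar,), 26, 0.10939651f0),
]

total_calls = sum(s -> s[2], stats)   # 364
total_time = sum(s -> s[3], stats)

# Share of accumulated time spent in all-UniformScaling solves.
is_uniform(s) = all(==(:UniformScaling), s[1])
uniform_share = sum(s -> s[3], filter(is_uniform, stats)) / total_time

# Share of accumulated time spent in solves with more than one block.
multi_share = sum(s -> s[3], filter(s -> length(s[1]) > 1, stats)) / total_time
```

By this measure the UniformScaling solves account for well under 1% of the accumulated time, and the two-block solves for roughly half of it.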

For developing/debugging from ClimaCore, I've been using

# ]dev ../ClimaCore.jl/
# julia --project=perf
using Revise
empty!(ARGS);
push!(ARGS, "--config_file", "config/model_configs/aquaplanet_progedmf.yml");
import Random
Random.seed!(1234);
import ClimaAtmos as CA
include(joinpath(pkgdir(CA), "perf", "common.jl"))
using CUDA, BenchmarkTools, OrderedCollections, StatsBase, PrettyTables # needed for CTS.benchmark_step
using ClimaComms
import ClimaTimeSteppers as CTS
(; config_file, job_id) = CA.commandline_kwargs();
config = CA.AtmosConfig(config_file; job_id);
simulation = CA.get_simulation(config);

import ClimaCore; ClimaCore.DataLayouts.empty_kernel_stats()
(; table_summary, trials) = CTS.benchmark_step(
    simulation.integrator,
    ClimaComms.device(config.comms_ctx);
    trace = false,
    with_cu_prof = :profile,
    only = ["ldiv!"],
    crop = :both
);

as the benchmark.

This also generated useful output:

--------------- Benchmarking/profiling ldiv!...
Profiler ran for 100.92 ms, capturing 472 events.

Host-side activity: calling CUDA APIs took 810.62 µs (0.80% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name           │
├──────────┼────────────┼───────┼─────────────────────────────────────┼────────────────┤
│    0.79% │   798.7 µs │   157 │   5.09 µs ± 5.24   (   3.1 ‥ 67.23) │ cuLaunchKernel │
└──────────┴────────────┴───────┴─────────────────────────────────────┴────────────────┘

Device-side activity: GPU was busy for 100.63 ms (99.71% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                                                                                                                                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   26.98% │   27.23 ms │    12 │   2.27 ms ± 0.01   (  2.25 ‥ 2.28)   │ _Z26single_field_solve_kernel_10CUDADevice5FieldI5VIJFHI5TupleI7Float32S3_ELi4E13CuDeviceArrayIS3_Li5ELi1EEE29ExtrudedFiniteDifferenceSpaceI34DeviceExtrudedFiniteDifferenceGridI22DeviceIntervalTopologyI10NamedTupleI15__bottom___top_S2_I5Int6 ⋯
│   19.55% │   19.73 ms │     4 │   4.93 ms ± 0.05   (  4.86 ‥ 4.97)   │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleI5FieldI5VIJFHIS0_I7Float32S3_ELi4E13CuDeviceArrayIS3_Li5ELi1EEE16PlaceholderSpaceES1_IS2_IS0_I10AxisTensorIS3_Li1ES0_I13CovariantAxisI4_3__EE6SArrayIS0_ILi1EES3_Li1ELi1EEES6_IS3_Li2ES0_IS7_I ⋯
│   19.41% │   19.59 ms │     4 │    4.9 ms ± 0.03   (  4.87 ‥ 4.93)   │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleI5FieldI5VIJFHIS0_I7Float32S3_ELi4E13CuDeviceArrayIS3_Li5ELi1EEE16PlaceholderSpaceES1_IS2_IS0_IS3_S3_ELi4ES4_IS3_Li5ELi1EEES5_EES0_IS1_IS2_IS3_Li4E8SubArrayIS3_Li5ES4_IS3_Li5ELi1EES0_I5SliceI ⋯
│    7.45% │    7.52 ms │     2 │   3.76 ms ± 0.03   (  3.74 ‥ 3.78)   │ _Z26single_field_solve_kernel_10CUDADevice5FieldI5VIJFHI5TupleI10AxisTensorI7Float32Li1ES2_I13CovariantAxisI6_1__2_EE6SArrayIS2_ILi2EES4_Li1ELi2EEES4_ELi4E13CuDeviceArrayIS4_Li5ELi1EEE29ExtrudedFiniteDifferenceSpaceI34DeviceExtrudedFiniteDif ⋯
│    5.86% │    5.91 ms │    21 │ 281.47 µs ± 0.59   (280.38 ‥ 282.53) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleIS4_IS5_S3_4rsubS7_IS_IS0_IS1_Li4ES2_IS1_Li5ELi1EEES3_E18StencilBroadcastedIS5_33Mult ⋯
│    4.27% │    4.31 ms │     2 │   2.15 ms ± 0.02   (  2.14 ‥ 2.17)   │ _Z26single_field_solve_kernel_10CUDADevice5FieldI5VIJFHI5TupleI10AxisTensorI7Float32Li1ES2_I13CovariantAxisI4_3__EE6SArrayIS2_ILi1EES4_Li1ELi1EEES3_IS4_Li2ES2_IS5_I4_3__E17ContravariantAxisI4_3__EES6_IS2_ILi1ELi1EES4_Li2ELi1EEEELi4E13CuDevic ⋯
│    2.63% │    2.65 ms │    11 │ 240.98 µs ± 0.34   (240.56 ‥ 241.76) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S7_ ⋯
│    2.63% │    2.65 ms │    42 │  63.11 µs ± 2.87   ( 57.22 ‥ 66.76)  │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NE5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_ES5_IS6_ES5_IS6_EE8identityS4_IS_IS0_Li4ES1_IS0_Li5ELi1EEE ⋯
│    1.63% │    1.64 ms │     2 │ 821.59 µs ± 1.01   (820.88 ‥ 822.31) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S7_ ⋯
│    1.25% │    1.27 ms │     1 │                                      │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowILin1ELi3E10AxisTensorI7Float32Li2E5TupleI13CovariantAxisI4_3__E17ContravariantAxisI4_3__EE6SArrayIS4_ILi1ELi1EES3_Li2ELi1EEEELi4E13CuDeviceArrayIS3_Li5ELi1EEE16PlaceholderSpaceE11Broadc ⋯
│    0.96% │  969.41 µs │     7 │ 138.49 µs ± 0.16   (138.28 ‥ 138.76) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleIS4_IS5_S3_4rsubS7_IS_IS0_IS1_Li4ES2_IS1_Li5ELi1EEES3_E18StencilBroadcastedIS5_33Mult ⋯
│    0.89% │  897.88 µs │     6 │ 149.65 µs ± 0.25   (149.25 ‥ 149.97) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleI18StencilBroadcastedIS5_33MultiplyColumnwiseBandMatrixFieldS7_IS_IS0_I13BandMatrixRo ⋯
│    0.82% │  828.74 µs │     3 │ 276.25 µs ± 0.36   (275.85 ‥ 276.57) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleIS4_IS5_S3_4rsubS7_IS_IS0_IS1_Li4E8SubArrayIS1_Li5ES2_IS1_Li5ELi1EES7_I5SliceI5OneToI ⋯
│    0.68% │  686.41 µs │     2 │  343.2 µs ± 0.51   (342.85 ‥ 343.56) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS3_ILi2EES2_Li1ELi2EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S ⋯
│    0.57% │  577.45 µs │     2 │ 288.72 µs ± 0.0    (288.72 ‥ 288.72) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S7_ ⋯
│    0.47% │  475.65 µs │     1 │                                      │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S7_ ⋯
│    0.45% │  459.19 µs │     3 │ 153.06 µs ± 0.0    (153.06 ‥ 153.06) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleIS4_IS5_S3_4rsubS7_IS_IS0_IS1_Li4E8SubArrayIS1_Li5ES2_IS1_Li5ELi1EES7_I5SliceI5OneToI ⋯
│    0.43% │  429.63 µs │     2 │ 214.82 µs ± 2.02   (213.38 ‥ 216.25) │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ILi2EES1_Li1ELi2EEELi4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EES2_I5SliceI5OneToI5Int64EES7_IS8_IS9_EES7_IS8_IS9_EE9UnitRangeIS9_ES7_IS8_IS9_EEELi ⋯
│    0.39% │  393.39 µs │     8 │  49.17 µs ± 3.51   ( 43.15 ‥ 51.5)   │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NE5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_ES5_IS6_ES5_IS6_EE8identityS4_IS2_IS3_ILi4E50CuArray_Float ⋯
│    0.38% │  386.71 µs │     6 │  64.45 µs ± 3.04   ( 60.08 ‥ 67.47)  │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NE5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_ES5_IS6_ES5_IS6_EE8identityS4_IS_IS0_Li4E8SubArrayIS0_Li5E ⋯
│    0.34% │  339.03 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI13BandMatrixRowILin1ELi3E7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NE5TupleI5OneToI5Int64ES6_IS7_ES6_IS7_ES6_IS7_ES6_IS7_EE8identityS5_IS ⋯
│    0.29% │  293.49 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E10AxisTensorI7Float32Li2E5TupleI13CovariantAxisI4_3__E17ContravariantAxisI6_1__2_EE6SArrayIS3_ILi1ELi2EES2_Li2ELi2EEEELi4E13CuDeviceArrayIS2_Li5ELi1EEE11Broad ⋯
│    0.25% │  255.82 µs │     3 │  85.27 µs ± 2.98   ( 82.25 ‥ 88.21)  │ _Z11knl_copyto_5VIJFHI7Float32Li4E8SubArrayIS0_Li5E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5SliceI5OneToI5Int64EES4_IS5_IS6_EES4_IS5_IS6_EE9UnitRangeIS6_ES4_IS5_IS6_EEELinfalseEEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_Devic ⋯
│    0.24% │  245.33 µs │     1 │                                      │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S7_ ⋯
│    0.21% │  207.42 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ILi2EES1_Li1ELi2EEELi4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EES2_I5SliceI5OneToI5Int64EES7_IS8_IS9_EES7_IS8_IS9_EE9UnitRangeIS9_ES7_IS8_IS9_EEELi ⋯
│    0.18% │  178.58 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ILi2EES1_Li1ELi2EEELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NES2_I5OneToI5Int ⋯
│    0.14% │  142.81 µs │     1 │                                      │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleIS4_IS5_S3_4rsubS7_IS_IS0_IS1_Li4E8SubArrayIS1_Li5ES2_IS1_Li5ELi1EES7_I5SliceI5OneToI ⋯
│    0.13% │  132.32 µs │     2 │  66.16 µs ± 0.17   ( 66.04 ‥ 66.28)  │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS2_ILi1EES1_Li1ELi1EEELi4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EES2_I5SliceI5OneToI5Int64EES7_IS8_IS9_EES7_IS8_IS9_EE9UnitRangeIS9_ES7_IS8_IS9_EEELinf ⋯
│    0.09% │   87.74 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI7Float32Li4E8SubArrayIS0_Li5E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5SliceI5OneToI5Int64EES4_IS5_IS6_EES4_IS5_IS6_EE9UnitRangeIS6_ES4_IS5_IS6_EEELinfalseEEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_Devic ⋯
│    0.07% │   66.04 µs │     2 │  33.02 µs ± 0.51   ( 32.66 ‥ 33.38)  │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS2_ILi1EES1_Li1ELi1EEELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NES2_I5OneToI5Int64 ⋯
│    0.05% │   48.64 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS2_ILi1EES1_Li1ELi1EEELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NES2_I5OneToI5Int64 ⋯
│    0.04% │    36.0 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS2_ILi1EES1_Li1ELi1EEELi4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EES2_I5SliceI5OneToI5Int64EES7_IS8_IS9_EES7_IS8_IS9_EE9UnitRangeIS9_ES7_IS8_IS9_EEELinf ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                                                                                                                                            1 column omitted

NVTX ranges:
┌──────────┬────────────┬───────┬──────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                                                                                                                                             │
├──────────┼────────────┼───────┼──────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│    2.16% │    2.18 ms │    22 │  99.04 µs ± 85.77  ( 20.74 ‥ 388.86) │ ClimaCore.MatrixFields.run_field_matrix_solver!(alg::BlockLowerTriangularSolve, cache, x, A, b) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_solver.jl:315                                    │
│    1.35% │    1.36 ms │     1 │                                      │ ClimaAtmos.ldiv!(x::Fields.FieldVector, A::ImplicitEquationJacobian, b::Fields.FieldVector) /home/charliek/CliMA/ClimaAtmos.jl/src/prognostic_equations/implicit/implicit_solver.jl:368                          │
│    1.35% │    1.36 ms │     1 │                                      │ ClimaCore.MatrixFields.field_matrix_solve!(solver::FieldMatrixSolver, x::Fields.FieldVector, A::FieldMatrix, b::Fields.FieldVector) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_solver.jl:74 │
│    1.34% │    1.35 ms │     1 │                                      │ ClimaCore.MatrixFields.run_field_matrix_solver!(alg::SchurComplementReductionSolve, cache, x, A, b) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_solver.jl:452                                │
│    0.96% │  973.22 µs │    32 │  30.41 µs ± 20.3   (  6.44 ‥ 91.31)  │ ClimaCore.MatrixFields.Base.Broadcast.materialize!(dest::FieldNameDict, vector_or_matrix::FieldNameDict) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_name_dict.jl:541                               │
│    0.96% │  964.88 µs │    32 │  30.15 µs ± 20.18  (   6.2 ‥ 90.6)   │ ClimaCore.MatrixFields.copyto_foreach!(dest::FieldNameDict, vector_or_matrix::FieldNameDict) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_name_dict.jl:525                                           │
│    0.69% │  699.28 µs │     1 │                                      │ ClimaCore.MatrixFields.run_field_matrix_solver!(alg::StationaryIterativeSolve, cache, x, A, b) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_iterative_solver.jl:431                           │
│    0.47% │  475.41 µs │     2 │  237.7 µs ± 26.97  (218.63 ‥ 256.78) │ ClimaCore.MatrixFields.lazy_mul(A₂₂′::LazySchurComplement, x₂) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_solver.jl:171                                                                     │
│    0.35% │  349.52 µs │    28 │  12.48 µs ± 18.95  (   6.2 ‥ 107.53) │ ClimaCore.MatrixFields.run_field_matrix_solver!(::BlockDiagonalSolve, cache, x, A, b) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_solver.jl:250                                              │
│    0.11% │  114.92 µs │     2 │  57.46 µs ± 24.61  ( 40.05 ‥ 74.86)  │ ClimaCore.MatrixFields.apply_preconditioner(P_alg, P_cache, P, lazy_b) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_iterative_solver.jl:101                                                   │
│    0.08% │   76.77 µs │     8 │    9.6 µs ± 3.99   (  6.68 ‥ 17.88)  │ ClimaCoreCUDAExt.multiple_field_solve!(::ClimaComms.CUDADevice, cache, x, A, b, x1) /home/charliek/CliMA/ClimaCore.jl/ext/cuda/matrix_fields_multiple_field_solve.jl:17                                          │
│    0.05% │   45.54 µs │     1 │                                      │ ClimaCore.MatrixFields.lazy_or_concrete_preconditioner(P_alg, P_cache, A) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_iterative_solver.jl:86                                                 │
└──────────┴────────────┴───────┴──────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Together with the stats collected above, it's clear that band_matrix_solve!(::Type{<:TridiagonalMatrixRow}, cache, x, Aⱼs, b) is the hotspot in ldiv!. I think the issue is that a single thread traverses each vertical column, reading and writing global memory along the way. Using shared memory in band_matrix_solve! is probably a relatively easy thing to try.
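To make the sequential column traversal concrete, here is a minimal CPU sketch of the textbook Thomas algorithm for one tridiagonal system. It is not ClimaCore's band_matrix_solve!; `thomas_solve!` and its argument layout are assumptions for illustration, and it assumes at least two rows. Each step of both sweeps depends on the previous one, which is why one thread ends up walking the whole column.

```julia
using LinearAlgebra: Tridiagonal

# Hypothetical helper, for illustration only.
# a: sub-diagonal (n-1), b: diagonal (n), c: super-diagonal (n-1), d: right-hand side (n)
function thomas_solve!(x, a, b, c, d)
    n = length(b)
    cp = similar(c)  # modified super-diagonal
    # Forward sweep: eliminate the sub-diagonal, one level at a time.
    cp[1] = c[1] / b[1]
    x[1] = d[1] / b[1]
    for i in 2:n
        denom = b[i] - a[i - 1] * cp[i - 1]
        if i < n
            cp[i] = c[i] / denom
        end
        x[i] = (d[i] - a[i - 1] * x[i - 1]) / denom
    end
    # Back substitution: each level needs the result from the level above it.
    for i in (n - 1):-1:1
        x[i] -= cp[i] * x[i + 1]
    end
    return x
end

a = [1.0, 1.0]
b = [4.0, 4.0, 4.0]
c = [1.0, 1.0]
d = [1.0, 2.0, 3.0]
x = thomas_solve!(zeros(3), a, b, c, d)
```

In a GPU kernel, arrays like `cp` and `x` for one column are the natural candidates to stage in shared memory, since every level reads the values written at the neighboring level.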

charleskawczynski force-pushed the ck/specialize_multiple_field_solve branch from 7944cc9 to 5f2e58e on May 17, 2024 19:10
charleskawczynski merged commit db8780f into main on May 17, 2024
13 of 15 checks passed
charleskawczynski deleted the ck/specialize_multiple_field_solve branch on May 17, 2024 21:10