Specialize cases in run_field_matrix_solver, add debug info #1732

charleskawczynski · 2024-05-17T16:28:22Z

This PR specializes cases in run_field_matrix_solver! and adds the debug info that I used to better understand its performance issues.

For reference/documentation, I collected some stats in run_field_matrix_solver!:

# empty!(ClimaCore.MatrixFields.ATypeStats)
# values(ClimaCore.MatrixFields.ATypeStats)
const ATypeStats = Dict()

NVTX.@annotate function run_field_matrix_solver!(::BlockDiagonalSolve, cache, x, A, b)
    outerkey = (typeof(cache), typeof(x), typeof(A), typeof(b))
    if !haskey(ATypeStats, outerkey)
        ATypeStats[outerkey] = Dict()
    end
    innerkey = map(matrix_row_keys(keys(A))) do name
        if A[name,name] isa UniformScaling
            :UniformScaling
        else
            eltype(A[name,name])
        end
    end
    e = Main.CUDA.@elapsed begin
        names = matrix_row_keys(keys(A))
        if length(names) == 1 || all(name -> A[name,name] isa UniformScaling, names)
            foreach(matrix_row_keys(keys(A))) do name
                single_field_solve!(cache[name], x[name], A[name, name], b[name])
            end
        else
            multiple_field_solve!(cache, x, A, b)
        end
    end
    if haskey(ATypeStats[outerkey], innerkey)
        ATypeStats[outerkey][innerkey] = (ATypeStats[outerkey][innerkey][1]+1, ATypeStats[outerkey][innerkey][2]+e)
    else
        ATypeStats[outerkey][innerkey] = (1, e)
    end
end

Inside a single call to ldiv!, here are the results:

values(ClimaCore.MatrixFields.ATypeStats)
Dict{Any, Any}((TridiagonalMatrixRow{FT},) => (52, 0.11885731f0))
Dict{Any, Any}((TridiagonalMatrixRow{FT}, TridiagonalMatrixRow{FT}) => (52, 0.25513232f0))
Dict{Any, Any}((:UniformScaling,) => (13, 0.001842112f0))
Dict{Any, Any}((TridiagonalMatrixRow{FT},) => (52, 0.118791126f0))
Dict{Any, Any}((TridiagonalMatrixRow{FT},) => (52, 0.119263805f0))
Dict{Any, Any}((:UniformScaling,) => (39, 0.0034500798f0))
Dict{Any, Any}((TridiagonalMatrixRow{AxisTensor{FT, 2, Tuple{CovariantAxis{(3,)}, ContravariantAxis{(3,)}}, SMatrix{1, 1, FT, 1}}},) => (26, 0.059818085f0))
Dict{Any, Any}((TridiagonalMatrixRow{FT}, TridiagonalMatrixRow{AxisTensor{FT, 2, Tuple{CovariantAxis{(3,)}, ContravariantAxis{(3,)}}, SMatrix{1, 1, FT, 1}}}) => (52, 0.2534428f0))
Dict{Any, Any}((TridiagonalMatrixRow{FT},) => (26, 0.10939651f0))

A few important notes here:

There are 364 calls (sum(map(x->collect(values(x))[1][1], collect(values(ClimaCore.MatrixFields.ATypeStats))))) to run_field_matrix_solver! in a single ldiv! call in prognostic EDMF.
Perhaps obviously, the TridiagonalMatrixRow solves are the expensive ones.
The multiple field solve was likely not suffering from (too much) branch divergence, since eltype(A[name,name]) dispatches into the same methods for all names (as revealed by ClimaCore.MatrixFields.ATypeStats).
One thing is clear: we can specialize on two cases, length(names) == 1 and all(name -> A[name,name] isa UniformScaling, names). Calling single_field_solve! directly results in a simpler kernel when length(names) == 1, and in a significantly more efficient (and fusible) kernel when all(name -> A[name,name] isa UniformScaling, names).
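
To illustrate why the all-UniformScaling case is so cheap, here is a minimal sketch in plain Julia. It is not ClimaCore code: solve_uniform_scaling! is a hypothetical name, and plain vectors stand in for fields. When a diagonal block is λ*I, the solve reduces to an elementwise division, which broadcasting can fuse into a single loop:

```julia
using LinearAlgebra: UniformScaling, I

# Hypothetical helper (not part of ClimaCore): solve (λ * I) * x = b in place.
# The whole solve is one broadcast expression, so there is no column traversal.
function solve_uniform_scaling!(x, A::UniformScaling, b)
    x .= b ./ A.λ
    return x
end

b = [2.0, 4.0, 6.0]
x = similar(b)
solve_uniform_scaling!(x, 2.0 * I, b)
```

A tridiagonal block, by contrast, needs a sequential sweep along each column, which is presumably why it dominates the timings above.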

More notes: normalizing ClimaCore.MatrixFields.ATypeStats by the number of calls yields:

time_per_call = [
0.11885731f0/52,   # 0.0022857175
0.25513232f0/52,   # 0.004906391
0.001842112f0/13,  # 0.00014170093
0.118791126f0/52,  # 0.0022844446
0.119263805f0/52,  # 0.0022935348
0.0034500798f0/39, # 8.846359f-5
0.059818085f0/26,  # 0.0023006955
0.2534428f0/52,    # 0.0048738997
0.10939651f0/26,   # 0.004207558
]
relative_cost = time_per_call ./ sum(time_per_call)

relative_cost = time_per_call ./ sum(time_per_call)
9-element Vector{Float32}:
 0.09775373
 0.20983258
 0.006060152
 0.09769929
 0.098088056
 0.0037833396
 0.0983943
 0.20844303
 0.17994547
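
The relative costs above are per call, so they do not reflect how often each case occurs. As a complementary view, here is a minimal sketch that weights by total elapsed time instead, using the (calls, elapsed) pairs copied from the stats output earlier in this description. The variable names are illustrative only, and the elapsed values are taken to be seconds, as returned by CUDA.@elapsed:

```julia
# (calls, total elapsed seconds) pairs copied from the ATypeStats output.
stats = [
    (52, 0.11885731f0),
    (52, 0.25513232f0),
    (13, 0.001842112f0),
    (52, 0.118791126f0),
    (52, 0.119263805f0),
    (39, 0.0034500798f0),
    (26, 0.059818085f0),
    (52, 0.2534428f0),
    (26, 0.10939651f0),
]

total_calls = sum(first, stats)  # total number of run_field_matrix_solver! calls
total_time = sum(last, stats)    # total elapsed time across all cases
time_share = [t / total_time for (_, t) in stats]  # share of total time per case
```

Here total_calls reproduces the 364 calls quoted in the notes, and time_share shows each case's share of the accumulated solver time.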

For developing/debugging from ClimaCore, I've been using

# ]dev ../ClimaCore.jl/
# julia --project=perf
using Revise
empty!(ARGS);
push!(ARGS, "--config_file", "config/model_configs/aquaplanet_progedmf.yml");
import Random
Random.seed!(1234);
import ClimaAtmos as CA
include(joinpath(pkgdir(CA), "perf", "common.jl"))
using CUDA, BenchmarkTools, OrderedCollections, StatsBase, PrettyTables # needed for CTS.benchmark_step
using ClimaComms
import ClimaTimeSteppers as CTS
(; config_file, job_id) = CA.commandline_kwargs();
config = CA.AtmosConfig(config_file; job_id);
simulation = CA.get_simulation(config);

import ClimaCore; ClimaCore.DataLayouts.empty_kernel_stats()
(; table_summary, trials) = CTS.benchmark_step(
simulation.integrator,
ClimaComms.device(config.comms_ctx);
trace=false,
with_cu_prof=:profile,
only=["ldiv!"],
crop = :both
);

as the benchmark.

This also generated useful output:

--------------- Benchmarking/profiling ldiv!...
Profiler ran for 100.92 ms, capturing 472 events.

Host-side activity: calling CUDA APIs took 810.62 µs (0.80% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name           │
├──────────┼────────────┼───────┼─────────────────────────────────────┼────────────────┤
│    0.79% │   798.7 µs │   157 │   5.09 µs ± 5.24   (   3.1 ‥ 67.23) │ cuLaunchKernel │
└──────────┴────────────┴───────┴─────────────────────────────────────┴────────────────┘

Device-side activity: GPU was busy for 100.63 ms (99.71% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                                                                                                                                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   26.98% │   27.23 ms │    12 │   2.27 ms ± 0.01   (  2.25 ‥ 2.28)   │ _Z26single_field_solve_kernel_10CUDADevice5FieldI5VIJFHI5TupleI7Float32S3_ELi4E13CuDeviceArrayIS3_Li5ELi1EEE29ExtrudedFiniteDifferenceSpaceI34DeviceExtrudedFiniteDifferenceGridI22DeviceIntervalTopologyI10NamedTupleI15__bottom___top_S2_I5Int6 ⋯
│   19.55% │   19.73 ms │     4 │   4.93 ms ± 0.05   (  4.86 ‥ 4.97)   │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleI5FieldI5VIJFHIS0_I7Float32S3_ELi4E13CuDeviceArrayIS3_Li5ELi1EEE16PlaceholderSpaceES1_IS2_IS0_I10AxisTensorIS3_Li1ES0_I13CovariantAxisI4_3__EE6SArrayIS0_ILi1EES3_Li1ELi1EEES6_IS3_Li2ES0_IS7_I ⋯
│   19.41% │   19.59 ms │     4 │    4.9 ms ± 0.03   (  4.87 ‥ 4.93)   │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleI5FieldI5VIJFHIS0_I7Float32S3_ELi4E13CuDeviceArrayIS3_Li5ELi1EEE16PlaceholderSpaceES1_IS2_IS0_IS3_S3_ELi4ES4_IS3_Li5ELi1EEES5_EES0_IS1_IS2_IS3_Li4E8SubArrayIS3_Li5ES4_IS3_Li5ELi1EES0_I5SliceI ⋯
│    7.45% │    7.52 ms │     2 │   3.76 ms ± 0.03   (  3.74 ‥ 3.78)   │ _Z26single_field_solve_kernel_10CUDADevice5FieldI5VIJFHI5TupleI10AxisTensorI7Float32Li1ES2_I13CovariantAxisI6_1__2_EE6SArrayIS2_ILi2EES4_Li1ELi2EEES4_ELi4E13CuDeviceArrayIS4_Li5ELi1EEE29ExtrudedFiniteDifferenceSpaceI34DeviceExtrudedFiniteDif ⋯
│    5.86% │    5.91 ms │    21 │ 281.47 µs ± 0.59   (280.38 ‥ 282.53) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleIS4_IS5_S3_4rsubS7_IS_IS0_IS1_Li4ES2_IS1_Li5ELi1EEES3_E18StencilBroadcastedIS5_33Mult ⋯
│    4.27% │    4.31 ms │     2 │   2.15 ms ± 0.02   (  2.14 ‥ 2.17)   │ _Z26single_field_solve_kernel_10CUDADevice5FieldI5VIJFHI5TupleI10AxisTensorI7Float32Li1ES2_I13CovariantAxisI4_3__EE6SArrayIS2_ILi1EES4_Li1ELi1EEES3_IS4_Li2ES2_IS5_I4_3__E17ContravariantAxisI4_3__EES6_IS2_ILi1ELi1EES4_Li2ELi1EEEELi4E13CuDevic ⋯
│    2.63% │    2.65 ms │    11 │ 240.98 µs ± 0.34   (240.56 ‥ 241.76) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S7_ ⋯
│    2.63% │    2.65 ms │    42 │  63.11 µs ± 2.87   ( 57.22 ‥ 66.76)  │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NE5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_ES5_IS6_ES5_IS6_EE8identityS4_IS_IS0_Li4ES1_IS0_Li5ELi1EEE ⋯
│    1.63% │    1.64 ms │     2 │ 821.59 µs ± 1.01   (820.88 ‥ 822.31) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S7_ ⋯
│    1.25% │    1.27 ms │     1 │                                      │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowILin1ELi3E10AxisTensorI7Float32Li2E5TupleI13CovariantAxisI4_3__E17ContravariantAxisI4_3__EE6SArrayIS4_ILi1ELi1EES3_Li2ELi1EEEELi4E13CuDeviceArrayIS3_Li5ELi1EEE16PlaceholderSpaceE11Broadc ⋯
│    0.96% │  969.41 µs │     7 │ 138.49 µs ± 0.16   (138.28 ‥ 138.76) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleIS4_IS5_S3_4rsubS7_IS_IS0_IS1_Li4ES2_IS1_Li5ELi1EEES3_E18StencilBroadcastedIS5_33Mult ⋯
│    0.89% │  897.88 µs │     6 │ 149.65 µs ± 0.25   (149.25 ‥ 149.97) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleI18StencilBroadcastedIS5_33MultiplyColumnwiseBandMatrixFieldS7_IS_IS0_I13BandMatrixRo ⋯
│    0.82% │  828.74 µs │     3 │ 276.25 µs ± 0.36   (275.85 ‥ 276.57) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleIS4_IS5_S3_4rsubS7_IS_IS0_IS1_Li4E8SubArrayIS1_Li5ES2_IS1_Li5ELi1EES7_I5SliceI5OneToI ⋯
│    0.68% │  686.41 µs │     2 │  343.2 µs ± 0.51   (342.85 ‥ 343.56) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS3_ILi2EES2_Li1ELi2EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S ⋯
│    0.57% │  577.45 µs │     2 │ 288.72 µs ± 0.0    (288.72 ‥ 288.72) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S7_ ⋯
│    0.47% │  475.65 µs │     1 │                                      │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S7_ ⋯
│    0.45% │  459.19 µs │     3 │ 153.06 µs ± 0.0    (153.06 ‥ 153.06) │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleIS4_IS5_S3_4rsubS7_IS_IS0_IS1_Li4E8SubArrayIS1_Li5ES2_IS1_Li5ELi1EES7_I5SliceI5OneToI ⋯
│    0.43% │  429.63 µs │     2 │ 214.82 µs ± 2.02   (213.38 ‥ 216.25) │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ILi2EES1_Li1ELi2EEELi4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EES2_I5SliceI5OneToI5Int64EES7_IS8_IS9_EES7_IS8_IS9_EE9UnitRangeIS9_ES7_IS8_IS9_EEELi ⋯
│    0.39% │  393.39 µs │     8 │  49.17 µs ± 3.51   ( 43.15 ‥ 51.5)   │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NE5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_ES5_IS6_ES5_IS6_EE8identityS4_IS2_IS3_ILi4E50CuArray_Float ⋯
│    0.38% │  386.71 µs │     6 │  64.45 µs ± 3.04   ( 60.08 ‥ 67.47)  │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NE5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_ES5_IS6_ES5_IS6_EE8identityS4_IS_IS0_Li4E8SubArrayIS0_Li5E ⋯
│    0.34% │  339.03 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI13BandMatrixRowILin1ELi3E7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NE5TupleI5OneToI5Int64ES6_IS7_ES6_IS7_ES6_IS7_ES6_IS7_EE8identityS5_IS ⋯
│    0.29% │  293.49 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E10AxisTensorI7Float32Li2E5TupleI13CovariantAxisI4_3__E17ContravariantAxisI6_1__2_EE6SArrayIS3_ILi1ELi2EES2_Li2ELi2EEEELi4E13CuDeviceArrayIS2_Li5ELi1EEE11Broad ⋯
│    0.25% │  255.82 µs │     3 │  85.27 µs ± 2.98   ( 82.25 ‥ 88.21)  │ _Z11knl_copyto_5VIJFHI7Float32Li4E8SubArrayIS0_Li5E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5SliceI5OneToI5Int64EES4_IS5_IS6_EES4_IS5_IS6_EE9UnitRangeIS6_ES4_IS5_IS6_EEELinfalseEEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_Devic ⋯
│    0.24% │  245.33 µs │     1 │                                      │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDeviceArrayIS2_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS7_8identityS3_IS8_IS9_S7_ ⋯
│    0.21% │  207.42 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ILi2EES1_Li1ELi2EEELi4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EES2_I5SliceI5OneToI5Int64EES7_IS8_IS9_EES7_IS8_IS9_EE9UnitRangeIS9_ES7_IS8_IS9_EEELi ⋯
│    0.18% │  178.58 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ILi2EES1_Li1ELi2EEELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NES2_I5OneToI5Int ⋯
│    0.14% │  142.81 µs │     1 │                                      │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_8identity5TupleIS4_IS5_S3_4rsubS7_IS_IS0_IS1_Li4E8SubArrayIS1_Li5ES2_IS1_Li5ELi1EES7_I5SliceI5OneToI ⋯
│    0.13% │  132.32 µs │     2 │  66.16 µs ± 0.17   ( 66.04 ‥ 66.28)  │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS2_ILi1EES1_Li1ELi1EEELi4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EES2_I5SliceI5OneToI5Int64EES7_IS8_IS9_EES7_IS8_IS9_EE9UnitRangeIS9_ES7_IS8_IS9_EEELinf ⋯
│    0.09% │   87.74 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI7Float32Li4E8SubArrayIS0_Li5E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5SliceI5OneToI5Int64EES4_IS5_IS6_EES4_IS5_IS6_EE9UnitRangeIS6_ES4_IS5_IS6_EEELinfalseEEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_Devic ⋯
│    0.07% │   66.04 µs │     2 │  33.02 µs ± 0.51   ( 32.66 ‥ 33.38)  │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS2_ILi1EES1_Li1ELi1EEELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NES2_I5OneToI5Int64 ⋯
│    0.05% │   48.64 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS2_ILi1EES1_Li1ELi1EEELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__where_NES2_I5OneToI5Int64 ⋯
│    0.04% │    36.0 µs │     1 │                                      │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS2_ILi1EES1_Li1ELi1EEELi4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EES2_I5SliceI5OneToI5Int64EES7_IS8_IS9_EES7_IS8_IS9_EE9UnitRangeIS9_ES7_IS8_IS9_EEELinf ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                                                                                                                                            1 column omitted

NVTX ranges:
┌──────────┬────────────┬───────┬──────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                                                                                                                                             │
├──────────┼────────────┼───────┼──────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│    2.16% │    2.18 ms │    22 │  99.04 µs ± 85.77  ( 20.74 ‥ 388.86) │ ClimaCore.MatrixFields.run_field_matrix_solver!(alg::BlockLowerTriangularSolve, cache, x, A, b) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_solver.jl:315                                    │
│    1.35% │    1.36 ms │     1 │                                      │ ClimaAtmos.ldiv!(x::Fields.FieldVector, A::ImplicitEquationJacobian, b::Fields.FieldVector) /home/charliek/CliMA/ClimaAtmos.jl/src/prognostic_equations/implicit/implicit_solver.jl:368                          │
│    1.35% │    1.36 ms │     1 │                                      │ ClimaCore.MatrixFields.field_matrix_solve!(solver::FieldMatrixSolver, x::Fields.FieldVector, A::FieldMatrix, b::Fields.FieldVector) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_solver.jl:74 │
│    1.34% │    1.35 ms │     1 │                                      │ ClimaCore.MatrixFields.run_field_matrix_solver!(alg::SchurComplementReductionSolve, cache, x, A, b) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_solver.jl:452                                │
│    0.96% │  973.22 µs │    32 │  30.41 µs ± 20.3   (  6.44 ‥ 91.31)  │ ClimaCore.MatrixFields.Base.Broadcast.materialize!(dest::FieldNameDict, vector_or_matrix::FieldNameDict) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_name_dict.jl:541                               │
│    0.96% │  964.88 µs │    32 │  30.15 µs ± 20.18  (   6.2 ‥ 90.6)   │ ClimaCore.MatrixFields.copyto_foreach!(dest::FieldNameDict, vector_or_matrix::FieldNameDict) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_name_dict.jl:525                                           │
│    0.69% │  699.28 µs │     1 │                                      │ ClimaCore.MatrixFields.run_field_matrix_solver!(alg::StationaryIterativeSolve, cache, x, A, b) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_iterative_solver.jl:431                           │
│    0.47% │  475.41 µs │     2 │  237.7 µs ± 26.97  (218.63 ‥ 256.78) │ ClimaCore.MatrixFields.lazy_mul(A₂₂′::LazySchurComplement, x₂) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_solver.jl:171                                                                     │
│    0.35% │  349.52 µs │    28 │  12.48 µs ± 18.95  (   6.2 ‥ 107.53) │ ClimaCore.MatrixFields.run_field_matrix_solver!(::BlockDiagonalSolve, cache, x, A, b) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_solver.jl:250                                              │
│    0.11% │  114.92 µs │     2 │  57.46 µs ± 24.61  ( 40.05 ‥ 74.86)  │ ClimaCore.MatrixFields.apply_preconditioner(P_alg, P_cache, P, lazy_b) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_iterative_solver.jl:101                                                   │
│    0.08% │   76.77 µs │     8 │    9.6 µs ± 3.99   (  6.68 ‥ 17.88)  │ ClimaCoreCUDAExt.multiple_field_solve!(::ClimaComms.CUDADevice, cache, x, A, b, x1) /home/charliek/CliMA/ClimaCore.jl/ext/cuda/matrix_fields_multiple_field_solve.jl:17                                          │
│    0.05% │   45.54 µs │     1 │                                      │ ClimaCore.MatrixFields.lazy_or_concrete_preconditioner(P_alg, P_cache, A) /home/charliek/CliMA/ClimaCore.jl/src/MatrixFields/field_matrix_iterative_solver.jl:86                                                 │
└──────────┴────────────┴───────┴──────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Together with the information collected above, it's clear that band_matrix_solve!(::Type{<:TridiagonalMatrixRow}, cache, x, Aⱼs, b) is the real hotspot in ldiv!. I think the issue in band_matrix_solve! is that a single thread traverses the vertical column, using global memory along the way. We can probably try using shared memory in band_matrix_solve! without too much effort.
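
For context on why a column solve is inherently sequential, here is a generic textbook Thomas algorithm for a single tridiagonal system, written as a CPU sketch in plain Julia. This is an illustration under my own assumptions, not ClimaCore's band_matrix_solve! implementation; thomas_solve and its argument layout are hypothetical. Each step of both sweeps depends on the previous one, so one thread walks the whole column:

```julia
using LinearAlgebra: Tridiagonal

# Hypothetical sketch: Thomas algorithm for one column, assuming n >= 2.
# lower and upper have length n - 1 (sub/superdiagonal), diag and b have length n.
function thomas_solve(lower, diag, upper, b)
    n = length(diag)
    c = similar(upper, float(eltype(upper)))  # modified superdiagonal
    d = similar(b, float(eltype(b)))          # modified right-hand side
    x = similar(d)

    # Forward sweep: entry i needs the already-updated entry i - 1.
    c[1] = upper[1] / diag[1]
    d[1] = b[1] / diag[1]
    for i in 2:n
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n
            c[i] = upper[i] / denom
        end
        d[i] = (b[i] - lower[i - 1] * d[i - 1]) / denom
    end

    # Backward sweep: entry i needs the already-computed entry i + 1.
    x[n] = d[n]
    for i in (n - 1):-1:1
        x[i] = d[i] - c[i] * x[i + 1]
    end
    return x
end

lower = [1.0, 1.0]
diag = [4.0, 4.0, 4.0]
upper = [1.0, 1.0]
b = [5.0, 6.0, 5.0]
x = thomas_solve(lower, diag, upper, b)

# Cross-check against the stdlib tridiagonal solver.
x_ref = Tridiagonal(lower, diag, upper) \ b
```

In a GPU kernel, every read and write of c, d, and x in these loops would hit global memory unless the column is staged somewhere faster, which is the motivation for trying shared memory.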

Try unrolled_all Fix

charleskawczynski requested a review from dennisYatunin May 17, 2024 16:28

charleskawczynski mentioned this pull request May 17, 2024

Revert multiple field solve #1728

Closed

Specialize cases in run_field_matrix_solver, add debug info

5f2e58e

Try unrolled_all Fix

charleskawczynski force-pushed the ck/specialize_multiple_field_solve branch from 7944cc9 to 5f2e58e Compare May 17, 2024 19:10

charleskawczynski merged commit db8780f into main May 17, 2024
13 of 15 checks passed

charleskawczynski deleted the ck/specialize_multiple_field_solve branch May 17, 2024 21:10
