Specialize cases in run_field_matrix_solver, add debug info #1732
This PR specializes cases in `run_field_matrix_solver!`, and adds debug info that I used to better understand its performance issues.

For reference/documentation, I collected some stats in `run_field_matrix_solver!`:

Inside a single call to `ldiv!`, here are the results:

A few important notes here:
- There are `364` (`sum(map(x->collect(values(x))[1][1], collect(values(ClimaCore.MatrixFields.ATypeStats))))`) calls to `run_field_matrix_solver!` in a single `ldiv!` call in prognostic edmf.
- The `TridiagonalMatrixRow` cases are the expensive ones.
- `eltype(A[name,name])` for all `names` dispatches into the same methods (revealed by `ClimaCore.MatrixFields.ATypeStats`).
- Two cases are worth specializing: `length(names) == 1` and `all(name -> A[name,name] isa UniformScaling, names)`, where calling `single_field_solve!` directly will result in a simpler kernel when `length(names) == 1` and a significantly more efficient (and fusible) kernel when `all(name -> A[name,name] isa UniformScaling, names)`.
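
The two special cases in the last note can be sketched as a plain branch on the diagonal blocks. This is only an illustration under stated assumptions, not the code in this PR: `single_solve` and `solve_diagonal_blocks` are hypothetical names, and named tuples of small arrays stand in for ClimaCore's field matrices and fields.

```julia
using LinearAlgebra: UniformScaling, I

# Hypothetical stand-ins, not ClimaCore's actual definitions.
# Solve A * x = b for one field; a UniformScaling block is a scalar division.
single_solve(A::UniformScaling, b) = b ./ A.λ
single_solve(A::AbstractMatrix, b) = A \ b

# `blocks` maps each field name to its diagonal block A[name, name];
# `rhs` maps each field name to the matching right-hand side.
function solve_diagonal_blocks(blocks::NamedTuple, rhs::NamedTuple)
    names = keys(blocks)
    if length(names) == 1
        # Single field: call the single-field solve directly.
        name = first(names)
        return NamedTuple{names}((single_solve(blocks[name], rhs[name]),))
    elseif all(name -> blocks[name] isa UniformScaling, names)
        # Every block is a scaling: each solve is a pointwise division,
        # which is the kind of operation that is easy to fuse.
        return map((A, b) -> b ./ A.λ, blocks, rhs)
    else
        # General case: solve each field with whatever method applies.
        return map(single_solve, blocks, rhs)
    end
end
```

For example, `solve_diagonal_blocks((; a = 2I), (; a = [2.0, 4.0]))` takes the first branch, and passing two `UniformScaling` blocks takes the second.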

More notes: normalizing `ClimaCore.MatrixFields.ATypeStats` yields:

For developing/debugging from ClimaCore, I've been using
as the benchmark.
This also generated useful output:
Together with the above collected information, it's clear that `band_matrix_solve!(::Type{<:TridiagonalMatrixRow}, cache, x, Aⱼs, b)` is truly the hotspot in `ldiv!`. I think the issue in `band_matrix_solve!` is that a single thread traverses the vertical column, using global memory along the way. I think we can probably, relatively easily, try to use shared memory in `band_matrix_solve!`.
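
To illustrate why a tridiagonal solve ends up as a sequential walk along the column, here is a minimal CPU sketch of the standard Thomas algorithm. It is not the `band_matrix_solve!` implementation; `thomas_solve` and its argument layout are assumptions made for this example. Each loop iteration depends on the previous one, so a single thread handling one column reads every entry of the diagonals and right-hand side in order, which on a GPU means repeated global-memory accesses unless the column is staged somewhere faster.

```julia
# Solve a tridiagonal system for a single column (Thomas algorithm).
# lower[i] multiplies x[i-1] (lower[1] is unused), diag[i] multiplies x[i],
# upper[i] multiplies x[i+1] (upper[end] is unused).
function thomas_solve(lower, diag, upper, b)
    n = length(b)
    c = similar(b, float(eltype(b)))  # modified upper diagonal
    d = similar(b, float(eltype(b)))  # modified right-hand side
    x = similar(d)
    c[1] = upper[1] / diag[1]
    d[1] = b[1] / diag[1]
    # Forward sweep: step i needs c[i-1] and d[i-1].
    for i in 2:n
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (b[i] - lower[i] * d[i - 1]) / denom
    end
    # Back substitution: step i needs x[i+1].
    x[n] = d[n]
    for i in (n - 1):-1:1
        x[i] = d[i] - c[i] * x[i + 1]
    end
    return x
end
```

Both sweeps touch each level of the column exactly once per array, so the memory traffic per column scales with the number of vertical levels.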