
Split up gemm_wrapper and stabilize MulAddMul strategically #47206

Closed
wants to merge 13 commits

Conversation

amilsted
Contributor

This is a hybrid of #47088 and #47026. See also #46865. Like #47088, it introduces a macro @stable_muladdmul to construct MulAddMul(alpha, beta) without type instability when alpha and beta are not compile-time constants. This helps us avoid runtime dispatch in some performance hot paths, including native multiplication of small matrices, as well as calls out to gemm!().

To avoid blowing up compile times, this splits gemm_wrapper!() and friends into an inlined and a non-inlined part. The inlined part contains the stabilized calls to native small-matrix multiplication. The non-inlined part calls out to gemm!() and, failing that, generic_matmatmul!(), which is called without the @stable_muladdmul wrapper, introducing an inference barrier in case alpha and beta are not constant and preventing 4 different versions of (heavy) generic_matmatmul!() from being compiled in those cases.
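
For reference, a minimal sketch of the idea (hypothetical, function-based; the actual PR uses a macro, and MulAddMul is an internal, unexported LinearAlgebra type):

using LinearAlgebra: MulAddMul

# Branch on the runtime values so that each branch constructs a MulAddMul
# whose type parameters are compile-time constants; every call to f is then
# statically dispatched, even when alpha and beta are runtime values.
function stable_muladdmul_sketch(f::F, alpha, beta, args...) where {F}
    if isone(alpha) && iszero(beta)
        f(MulAddMul{true, true, typeof(alpha), typeof(beta)}(alpha, beta), args...)
    elseif isone(alpha)
        f(MulAddMul{true, false, typeof(alpha), typeof(beta)}(alpha, beta), args...)
    elseif iszero(beta)
        f(MulAddMul{false, true, typeof(alpha), typeof(beta)}(alpha, beta), args...)
    else
        f(MulAddMul{false, false, typeof(alpha), typeof(beta)}(alpha, beta), args...)
    end
end

The trade-off is visible in the sketch: f gets compiled once per branch, which is why the PR limits where the stabilization is applied.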

@dkarrasch @N5N3 @Micket

@giordano added the performance (Must go faster) and linear algebra labels Oct 17, 2022
@Micket
Contributor

Micket commented Oct 17, 2022

I was pinged, so here I go. Some background thoughts:

My understanding is that the original goal of using the MulAddMul type was to intentionally dispatch out to 4 different variants, presumably so that the variants with 1 and 0 coefficients get optimized slightly. But I seriously doubt that optimization was ever noticeable, because all it would do is simplify part of some logical expressions. So even if the constant-propagation optimization worked as intended, this was probably not worth it, at least not for gemm. Disclaimer: I didn't benchmark this.
Worse yet, in practice the optimization frequently fails here, and one ends up with a far more costly type instability. To me this looked like a high-risk, low-reward strategy that didn't pay off.
It also has the cost of requiring more compilation, though I think this is a strictly secondary concern.

I think @amilsted shares these general conclusions (please correct me if I'm wrong).
It would be nice to have some benchmarks across different matrix sizes and zero/one/arbitrary coefficients to back these claims up with numbers.


I think I understand the core goal of this PR and the macro it introduces, but I think some choices are motivated only by how ubiquitous the use of MulAddMul was.
So I have two general comments:

  1. I think the added complexity tax here is quite significant. Code gets pretty cryptic (doubly so for macros) for anyone not versed in the entire history of this issue.
  2. It needs to touch almost all the function calls and definitions anyway, so I think just removing MulAddMul and passing the alpha and beta arguments directly everywhere would actually be a smaller diff.

@amilsted
Contributor Author

I think it's not quite right that dispatching on MulAddMul only simplifies some logical expressions. If you look at the native matmul code, including the 2x2 and 3x3 cases, which are supposed to be fast (faster than calling out to BLAS), we are able to write generic code that avoids unnecessary arithmetic ops.

Also, avoiding any ops that take inputs from C is important for correctness in some cases when beta is zero, as C may have undefined entries.
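
A quick illustration of the hazard (hypothetical snippet; fill(NaN, ...) stands in for an output buffer with undefined entries):

C = fill(NaN, 2, 2)   # stand-in for a freshly allocated, undefined C
beta = 0.0
beta .* C             # all NaN: 0.0 * NaN == NaN, so beta == 0 must mean "never read C"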

I agree that the code is getting rather convoluted. Just duplicating these native matmul functions for a few cases is likely more readable.

@Micket
Contributor

Micket commented Oct 18, 2022

Correction: Upon a second read-through, I have clearly misunderstood a lot of things in this PR.

To avoid blowing up compile times, this splits gemm_wrapper!() and friends into an inlined and a non-inlined part

(No)inlining annotations are just hints, which are most often ignored (especially the noinline).

The inlined part contains the stabilized calls to native small-matrix multiplication.

Though the code here generates a branch for all 4 variants.

The non-inlined part calls out to gemm!() and, failing that, generic_matmatmul!(), which is called without the @stable_muladdmul wrapper, introducing an inference barrier in case alpha and beta are not constant and preventing 4 different versions of (heavy) generic_matmatmul!() from being compiled in those cases.

but then you'll have the type instability there instead?

@Micket
Contributor

Micket commented Oct 18, 2022

Sorry this turned into a bit of rambling. Feel free to ignore this.

I think it's not quite right that dispatching on MulAddMul only simplifies some logical expressions. If you look at the native matmul code, including the 2x2 and 3x3 cases, which are supposed to be fast (faster than calling out to BLAS), we are able to write generic code that avoids unnecessary arithmetic ops.

Yeah, I'm a bit sleepy apparently. I was focused on the direct use of ais1 and bis0 and overlooked the dispatch on

@inline (::MulAddMul{true})(x) = x
@inline (p::MulAddMul{false})(x) = x * p.alpha
@inline (::MulAddMul{true, true})(x, _) = x
@inline (p::MulAddMul{false, true})(x, _) = x * p.alpha
@inline (p::MulAddMul{true, false})(x, y) = x + y * p.beta
@inline (p::MulAddMul{false, false})(x, y) = x * p.alpha + y * p.beta

I tried to think twice as hard this time, and the current dispatch seems doomed from the start (even if constant propagation worked as intended).
I mean, the general case beta = rand() could be zero, and then it needs to be able to dispatch out to a specialized method, according to the code.
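
Concretely, the value-based constructor bakes the runtime values into the type (a sketch using the internal, unexported constructor; its exact definition has varied across versions):

using LinearAlgebra: MulAddMul

beta = rand()                  # value unknown to the compiler
typeof(MulAddMul(1.0, beta))   # MulAddMul{true, false, Float64, Float64} for almost every draw,
                               # but MulAddMul{true, true, Float64, Float64} if beta happens to be 0.0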

So, I've now come to think that the problem stems from not being able to select what you really want.
I might know that for my algorithm I always want the general case of e.g. mul(C, tA, tB, A, alpha, beta) for my 3x3 matrices, but there is a theoretically non-zero chance that alpha happens to be 1.0 during one iteration. You are simply not given the option to specify this information; you only get to use the runtime values of alpha and beta, and this function also needs to be the default implementation for the specialized C = matmul3x3(tA, tB, A, B).

So, in my philosophical view of this problem, I think that's the root of all trouble: two different properties coupled to the same parameter.
These could have been separate arguments, with dispatch on those.
This is basically what MulAddMul does, just bundled into one type. In my example above, I could achieve that control via:

add = MulAddMul{false, false, Float64, Float64}(alpha, beta)  # values might be one or zero, but I don't want to dispatch on that
whichever_matmul_i_wanted!(C, A, B, ...., add)
# it could technically also be separate arguments:
whichever_matmul_i_wanted!(C, A, B, ...., alpha, beta, IsOne{false}, IsZero{false})

(though obviously, a bit more polished..)

I suppose it's a bit of a foot-gun to specify this apart from the values of the coefficients, but this does let me have this control, and there is no unnecessary compilation happening, no extra runtime branch checks, no switching of behaviour based on what values alpha and beta happen to take.

@amilsted
Contributor Author

amilsted commented Oct 18, 2022

The non-inlined part calls out to gemm!() and, failing that, generic_matmatmul!(), which is called without the @stable_muladdmul wrapper, introducing an inference barrier in case alpha and beta are not constant and preventing 4 different versions of (heavy) generic_matmatmul!() from being compiled in those cases.

but then you'll have the type instability there instead?

Yes, that seems to be the only way to prevent generic_matmatmul!() from being compiled 4 times in this case (see the discussion here). It often won't matter, as that function is quite slow and appears to allocate anyway, even for dense, BLAS-compatible matrices.

@amilsted
Contributor Author

amilsted commented Oct 18, 2022

I suppose it's a bit of a foot-gun to specify this apart from the values of the coefficients, but this does let me have this control, and there is no unnecessary compilation happening, no extra runtime branch checks, no switching of behaviour based on what values alpha and beta happen to take.

Yeah, it's tricky. I thought about detecting, at compile time, that alpha and beta are not Const{}. If this were possible, one could just always build MulAddMul{false, false}(alpha, beta) in that case. However, even that is probably too fragile, as const-prop can indeed fail in unexpected circumstances. And if we do false, false with an undef C, even though the user said beta=0, we could cause errors (in the non-BLAS paths).

@amilsted
Contributor Author

Regarding convoluted code: The @stable_muladdmul macro is just a convenient way of avoiding rewriting the same if ... else statements many times. One could replace it with a bunch of @inline-hinted wrappers around the functions that still consume MulAddMul(), but I'm not sure that would be cleaner. One could also put in the branches explicitly everywhere, but if I add type-stabilization to all of the many uses of MulAddMul() that can involve non-const alpha and beta (except those involving generic_matmatmul!(), deemed too heavy), that will add many lines of code to LinearAlgebra, each vulnerable to typos.

@dkarrasch
Member

@nanosoldier runbenchmarks("linalg", vs = "master")

@dkarrasch
Member

@nanosoldier runbenchmarks("linalg", vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@dkarrasch
Member

If I read the benchmark correctly, then 5-arg mul! for small (2x2 and 3x3) matrices is faster and avoids any allocations, whereas * of two Diagonals is slower. Can this be reproduced locally?

@PallHaraldsson
Contributor

Is the Diagonal case supposed to NOT allocate now? It seems to do (as many) allocations, which is the main worry. But if the allocations are expected, is the 45% slowdown real, or could it be a fluke with small data?

@N5N3
Member

N5N3 commented Oct 18, 2022

If I read the benchmark correctly, then 5-arg mul! for small (2x2 and 3x3) matrices is faster and avoids any allocations, whereas * of two Diagonals is slower. Can this be reproduced locally?

Looks like Diagonal * Diagonal doesn't use MulAddMul, so this PR should not cause a regression there. I believe it's just noise.

@amilsted
Contributor Author

It occurs to me that one could do a "light" runtime dispatch by hand for generic_matmatmul!(), if we don't like always doing full runtime dispatch in that case: keep 4 versions in a global tuple (or vector?) and look up the correct one at runtime, depending on the alpha-is-one/beta-is-zero case (sketched below). Ugly, but presumably faster.
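
A rough sketch of that idea (hypothetical names; kernel! stands in for generic_matmatmul!, and MulAddMul is the internal LinearAlgebra type):

using LinearAlgebra: MulAddMul

# `kernel!` is a hypothetical stand-in for generic_matmatmul!
kernel!(C, A, B, add::MulAddMul) = C

# One entry per (alpha-is-one, beta-is-zero) combination.
const VARIANTS = (
    (C, A, B, α, β) -> kernel!(C, A, B, MulAddMul{true,  true,  typeof(α), typeof(β)}(α, β)),
    (C, A, B, α, β) -> kernel!(C, A, B, MulAddMul{true,  false, typeof(α), typeof(β)}(α, β)),
    (C, A, B, α, β) -> kernel!(C, A, B, MulAddMul{false, true,  typeof(α), typeof(β)}(α, β)),
    (C, A, B, α, β) -> kernel!(C, A, B, MulAddMul{false, false, typeof(α), typeof(β)}(α, β)),
)

# A single indexed lookup replaces full dynamic dispatch on MulAddMul's type.
light_dispatch!(C, A, B, α, β) =
    VARIANTS[2 * !isone(α) + !iszero(β) + 1](C, A, B, α, β)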

@Micket
Contributor

Micket commented Oct 19, 2022

I still really think the problem is tying the algorithm to the values here; I can't even imagine code where you don't know which version you really, really want (regardless of the values or their const-ness).
After all, there is no doubt about exactly what algorithm/arithmetic must occur when mul!(C, A, B) vs mul!(C, A, B, 2+rand(), 1+rand()) is called.

Basically expanding upon what e.g. the 3-argument mul! already does (just as a rough concept):

mul!(C, A, B, alpha::Union{Number, Nothing}=nothing, beta::Union{Number, Nothing}=nothing) = gemm_wrapper!(C, A, B, MulAddMulX(alpha, beta))

# where this type-stable MulAddMulX simply uses the Nothing type to determine things, rather than iszero
@inline (::MulAddMulX{Nothing, Nothing})(x, _) = x
@inline (p::MulAddMulX{T, Nothing})(x, _) where T = x * p.alpha
@inline (p::MulAddMulX{Nothing, T})(x, y) where T = x + y * p.beta
@inline (p::MulAddMulX{TA, TB})(x, y) where {TA, TB} = x * p.alpha + y * p.beta

This would be a breaking change for users, though. A change for the better, IMHO.
If you don't want beta, then set it to nothing. If you specify 0, then you'll get 0*C.

There should be no type instability, no redundant compilation, and no extra runtime checks or branches. It is simpler than the original code, probably shorter with some handy default arguments, and it also allows specifying only 4 arguments.
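
For illustration, call sites under this hypothetical MulAddMulX scheme would read like:

mul!(C, A, B)              # alpha === beta === nothing: C = A*B, C is never read
mul!(C, A, B, 2.0)         # C = 2*A*B
mul!(C, A, B, 2.0, 3.0)    # C = 2*A*B + 3*C
mul!(C, A, B, 2.0, 0.0)    # C = 2*A*B + 0*C: C is read, so NaN/undef entries would propagate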

@Micket
Contributor

Micket commented Oct 31, 2022

I didn't follow all the details, but I just want to point out that, in a long discussion, it was decided that in case either alpha or beta equals zero, the corresponding matrix is not touched.

The zero check is just for beta; there are additional checks for alpha=1, but those just (sometimes) impact performance (very, very little in most cases).

For instance, if beta is zero, then 0*C is not even to be computed. The latter may be actually impossible because C's elements may not be defined when it is generated from A*B where the return type is BigFloat.

Yes, I don't think this was ever in doubt.

So the call chain is A*B -> mul!(similar(B, ...), A, B) -> mul!(C, A, B, true, false), and the last false means that C is not to be read from, only written to.

The fact that this is based on the value of beta (and alpha) is what led to this type instability.
I proposed determining the choice of arithmetic by type instead of by value, i.e. beta=nothing vs beta=0 (the latter of which constructs a type-unstable MulAddMul, which in turn dispatches).

Similar things may happen for matrices of matrices and other custom types in the ecosystem. If we were to break that, we would need to provide a whole bunch of 3-arg mul! methods with lots of code duplication.

Well, 3-arg mul! is already provided today, but regardless, no approach here suggests a solution that would require code duplication.

Having said that, I think pretty much anything is acceptable as a solution that respects these design decisions:

I understand changing the behavior is difficult, but I believe the design decision was wrong.

But adding a wrapper (at least for the beta=0 check) to preserve backwards compatibility is doable without significant code additions (probably still fewer than what is there today), though this prevents the user from being able to explicitly say what arithmetic they actually want.

@amilsted
Contributor Author

amilsted commented Oct 31, 2022

But, adding a wrapper (at least for beta=0 check) to preserve backwards compatibility is doable without much significant code addition (probably still less than what is there today), though this prevents the user from being able to explicitly say what arithmetic they actually want.

@Micket I don't think this is true (see my message above).

@Micket
Contributor

Micket commented Nov 1, 2022

@Micket I don't think this is true (see my message above).

Perhaps I was misleading in my attempt to write a simple example, since I used mul! directly. Regardless of whether there is ambiguity there or not, it would not be an appropriate place to check for this, due to the sheer number of mul! methods, so I meant in appropriate (new) wrappers.

Using the first code snippet I saw in this PR, since it was the cleanest case: bidiag has a bunch of mul! methods defined

@inline mul!(C::AbstractMatrix, A::SymTridiagonal, B::BiTriSym, alpha::Number, beta::Number) = A_mul_B_td!(C, A, B, MulAddMul(alpha, beta))
...

with a single wrapper for MulAddMul:

@inline A_mul_B_td_wrapper!(C, A, B, alpha, beta::Nothing) = A_mul_B_td!(C, A, B, MulAddMul(alpha, beta))
@inline A_mul_B_td_wrapper!(C, A, B, alpha, beta::Number) = begin # could also check that alpha is one here if we want to preserve the entire old behavior
    if iszero(beta)
        A_mul_B_td!(C, A, B, MulAddMul(alpha, nothing))
    else
        A_mul_B_td!(C, A, B, MulAddMul(alpha, beta))
    end
end

@inline mul!(C::AbstractMatrix, A::SymTridiagonal, B::BiTriSym, alpha::Union{Number,Nothing}, beta::Union{Number,Nothing}) = A_mul_B_td_wrapper!(C, A, B, alpha, beta)
... etc.

so I guess there would be a need for a couple of such wrappers (e.g. one for generic_matvecmul!, possibly combined with the existing {gemm,herk,syrk}_wrapper's, which would certainly need some minor code tweaks).
No ambiguity here.

I don't think this would just be a matter of internal implementation details either. It would make tangible differences:

  1. It allows at least internal use of the methods to specify what they truly mean, with alpha or beta set to nothing where applicable (I admit I haven't read through all the code to check whether this would actually ever be applicable).
  2. It allows code to actually start using nothing (which would be a prerequisite if deprecation of the old way is ever considered).
  3. Any code path that uses nothing circumvents the additional conditionals and compilations.

In a way, these are the benefits that would stem from not expanding the macro into an inline if-else block, but instead exposing that choice as a function, allowing the dispatch mechanics to work their magic (which brings benefits, even if they are hampered by some backwards compatibility).

And it would bring the code more in line with the possible future approach, if this is what people want. It would only require dropping some wrappers.

@dkarrasch
Member

I think I might understand the proposal a little better now. The idea is to turn the default case ("alpha = 1, beta = 0") into an alpha, beta = nothing, nothing case, to make mul!(C, A, B) = mul!(C, A, B, nothing, nothing), and to have all other cases compute C <- C*beta + A*B*alpha explicitly? So that the type of computation no longer depends on the values of alpha and beta, but on their types? That sounds reasonable, but it means a breaking change, see

if isnanfillable(C)
    @testset "β = 0 ignores C .= NaN" begin
        parent(C) .= NaN
        Ac = Matrix(A)
        Bc = Matrix(B)
        returned_mat = mul!(C, A, B, α, zero(eltype(C)))
        @test returned_mat === C
        @test collect(returned_mat) ≈ α * Ac * Bc rtol=rtol
    end
end
if isnanfillable(A)
    @testset "α = 0 ignores A .= NaN" begin
        parent(A) .= NaN
        Cc = copy(C)
        returned_mat = mul!(C, A, B, zero(eltype(A)), β)
        @test returned_mat === C
        @test collect(returned_mat) ≈ β * Cc rtol=rtol
    end
end
@Micket
Contributor

Micket commented Nov 1, 2022

I think I might understand the proposal a little better now. The idea is to turn the default case ("alpha = 1, beta = 0") into an alpha, beta = nothing, nothing case, to make mul!(C, A, B) = mul!(C, A, B, nothing, nothing), and to have all other cases compute C <- C*beta + A*B*alpha explicitly? So that the type of computation no longer depends on the values of alpha and beta, but on their types?

Yes

That sounds reasonable, but it means a breaking change

Indeed. To quote myself from earlier in the thread:

This would be a breaking change for users, though. A change for the better, IMHO.
If you don't want beta, then set it to nothing. If you specify 0, then you'll get 0*C.

But of course, one can't make such a breaking change out of nowhere. I really think the old behavior could be preserved with some wrapping ( #47206 (comment) ) while still offering partial benefits (if this is the approach we ultimately want to take).

I'm not super happy about the Nothing type I use in my examples (the meaning becomes inconsistent, since it's a=1 vs b=0, but with sufficient documentation I guess it might be acceptable).
Perhaps there is a nicer way to expose this choice to the caller; the only thing that matters is, wherever MulAddMul is used, to instead expose the choice of "algorithm" to the caller via normal dispatch directly, by whatever means. I think callers always know what they want.

@amilsted
Contributor Author

amilsted commented Nov 1, 2022

@Micket It occurs to me that there's no need to have two definitions for each mul!() to implement the beta branching. The following would be equivalent:

@inline mul!(C, A, B, alpha, beta) = begin
    if beta !== nothing && iszero(beta)
        A_mul_B_td!(C, A, B, MulAddMul(alpha, nothing))
    else
        A_mul_B_td!(C, A, B, MulAddMul(alpha, beta))
    end
end

The first term in the condition can be evaluated at compile time.
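
This can be checked on a toy function with @code_typed:

using InteractiveUtils  # for @code_typed outside the REPL

f(beta) = (beta !== nothing && iszero(beta)) ? :zero_path : :general_path
@code_typed f(nothing)  # the body folds to `return :general_path`; no runtime check remains
@code_typed f(0.0)      # only the iszero(beta) comparison survives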

Why is this better? Because, if we wanted to, we could implement this change via a macro like the one in this PR, rather than adding a branch to every mul!() variant.

@N5N3
Member

N5N3 commented Nov 2, 2022

Using Nothing here might be too breaking, as most packages extend 5-arg mul! as suggested by the docs.
And I believe most of them restrict the argument types to avoid possible ambiguity.

Of course, we can revisit this once we have static bool support in Base.

@amilsted
Contributor Author

amilsted commented Nov 3, 2022

@N5N3 by "breaking" do you mean that there might be issues with method ambiguity, or just that users may be surprised if nothing works in some cases, but not others?

@N5N3
Member

N5N3 commented Nov 4, 2022

I mean the ecosystem has lived with alpha::Number and beta::Number for a long time. We would have to update all user extensions if we used nothing here. I'm not objecting to the idea, but there would be less mess if we had static bools in Base.

@amilsted
Contributor Author

amilsted commented Nov 4, 2022

Ah, okay. So StaticBool <: Number and wouldn't require changing method signatures. @N5N3, is there a timeline for this?
Do you think we could get something like this PR merged (and maybe backported to 1.8.x) in the meantime?

@N5N3
Member

N5N3 commented Nov 5, 2022

is there a timeline for this?

AFAIK, no.

I think we can merge this if there's no further objection, although the compile overhead for A_mul_B_td! is not that light.
Ping @tkf @Jutho for a last review if they have time to take a look.

Comment on lines +693 to +695
# Not using @stable_muladdmul here deliberately to create an inference
# barrier in case α, β are not compile-time constants. This avoids compiling
# four versions of generic_matmatmul!() in those cases.
Contributor


As already questioned in the comments, I think these changes shouldn't be here (i.e. all the changes that try to give special treatment to the general method, for the syrk, herk, and gemm wrappers).

  1. I see no reason why one would want to preserve the type instability this definitely introduces. It basically just solves the problem for 2x2 and 3x3 matrices (for which one should probably use a static array instead). The code is much more difficult to understand, since it makes this strange exception to the pattern otherwise used for all other types.
  2. The inline and especially noinline are just hints (the compiler can and often will decide what to do regardless), though here they are used as if they introduced some strong guarantees.
  3. All of these changes will have to be reverted once a nicer solution with StaticBool can be used.

Member


As for the first concern, runtime dispatch is not a bad idea when the inner kernel is big. Although we can make it stable, the compile cost would be high.

I agree that 2x2 and 3x3 are not that important, and in most cases these repeated branches would never be hit.
But the current design was motivated by their performance regression, so I'm not sure it is OK to re-introduce that regression.

As for the 3rd concern, there's no timeline for StaticBool and the related compiler improvements. #46471 might help here, but I don't think that's a blocker for a temporary improvement.

Contributor

Micket commented Nov 6, 2022


The only purpose of this PR is to replace the runtime dispatch with a static if-else block, at the cost of increased compilation times. Except it excludes the one scenario that everyone cares about: gemm for matrices larger than 3x3.

I'm not suggesting dropping the small-matrix optimizations, but this PR would literally keep the instability in the issue that actually started this, #46865, only making it slightly worse, since it needs to compile more for the small-matrix variants. And it moves the type instability deeper, so it seems less likely that optimizations would take the new instability away.

I think const-prop is a red herring; it would of course not do anything in the common case where alpha and beta aren't constant.

Contributor Author


Except that it does not exclude the case of gemm for matrices larger than 3x3. The BLAS case is explicitly included: it no longer involves MulAddMul at all in this PR. This PR fixes the allocations seen in #46865; it even includes an explicit test for this!

Also, compile times are only worse in cases that previously involved runtime dispatch, since const-prop should still eliminate the branches (assuming the @inline hint is successful; see below).

generic_matmatmul is the only case that now involves runtime dispatch (likely always), but it is also not the hot path everyone cares about, as far as I am aware. When I tested it with dense matrices, it did allocations all by itself and the runtime dispatch overhead was negligible.

Regarding inlining hints: they are not required to be respected for this approach to work. The @inline is the more important one, as it makes it more likely for const-prop to still happen in the small-matrix cases and eliminate the new branches (keeping compile time down). The @noinline doesn't really serve a useful purpose beyond showing the reader why the gemm_wrapper functions are split the way they are, and could just be removed.

Contributor Author

amilsted commented Nov 28, 2022


If there are no replies, can I assume this is resolved? @Micket @N5N3

Contributor Author


@Micket Reading this comment made me realize that the changes in this PR, plus the use of Static.jl, pretty much give you the interface you have been suggesting. If you use Static.jl, you can already supply alpha and beta as static bools or ints, which will lead to compile-time selection of the correct methods, as the branches added in this PR will be evaluated at compile time. The only sacrifice is the generic_matmatmul inference barrier, which doesn't appear to hurt performance noticeably.
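
For example (an untested sketch, assuming Static.jl's static floats are accepted wherever alpha::Number is):

using LinearAlgebra, Static

C, A, B = zeros(3, 3), rand(3, 3), rand(3, 3)
# alpha and beta carry their values in their types, so the branches added
# in this PR can constant-fold and only one MulAddMul variant compiles.
mul!(C, A, B, static(1.0), static(0.0))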

Contributor


Yes, but introducing the static half of the problem was never really an issue (whether you used Union{Nothing,Number}, Val{x}, static(x), or just introduced your own custom type, which I played around with; it only takes a few lines of code). The point was to stop doing all the extra stuff for the dynamic case; that is what brought the real benefits, IMHO.

A beta=static(0) would allow the compiler to maybe optimize away a branch, but for the general case, even if we know we have beta=rand()+1 (something strictly > 0), the extra compilations for all those possible runtime branches will still be there (probably; I don't think the optimizer would see through that).
But in both of those cases, if your matrices are small enough to notice a runtime cost from this, you should surely be using StaticArrays anyway.

I just think the interface was wrong from the start: selecting the algorithm based on value when it could/should have been based only on type. Fixing that is necessarily a breaking change.
The main benefits I see from that are code simplification, no special macros, no extra branches, and no special rules. You simply get what you call for, and only ever pay for that.

(I'm also not sure static(1)/static(0) is better than Nothing here, since one is really selecting a different algorithm in a way, not just some optimization: it's not really equal to multiplication by 0 anymore, in terms of IEEE-754 math.)

Note: I've gotten beyond busy recently, so I'm unlikely to write more in this issue. I don't have any authority here anyway, and anyone should feel free to dismiss my comments.

@amilsted
Contributor Author

amilsted commented Nov 6, 2022 via email

@N5N3
Member

N5N3 commented Dec 2, 2022

Sorry for the late reply, @amilsted.
I don't think I'm the right person to decide the proper trade-off, especially because of the tortuous commit history.
Adding the triage label for discoverability.
List for triage:

  1. Is the performance of 2x2/3x3 matmatmul that important in the stdlib?
    I personally think we can ignore them, as users would select StaticArrays if they want better performance.
    If we do think the regression on 2x2/3x3 is acceptable, then perhaps the usage of @stable_muladdmul should be (more) limited.

@N5N3 added the triage (This should be discussed on a triage call) label Dec 2, 2022
@amilsted
Contributor Author

amilsted commented Dec 9, 2022

I would suggest that, if the performance of 2x2 and 3x3 matmul is not important in the stdlib, those native functions should just be removed. The only reason to have them at all is performance; otherwise we can just use generic_matmatmul.

amilsted mentioned this pull request Jan 13, 2023
@amilsted
Contributor Author

I could separate out the tests from this PR and mark them as broken. @N5N3 @dkarrasch, do you think they might be easier to merge? At least then the problems with non-const alpha, beta would become more visible. I could also add a benchmark...

@LilithHafner added the linalg triage label and removed the triage (This should be discussed on a triage call) label Mar 2, 2023
amilsted added a commit to amilsted/julia that referenced this pull request Mar 31, 2023
@amilsted
Contributor Author

amilsted commented Jun 12, 2023

Is it worth me resolving the conflicts here? Is this still being considered?

If not, maybe we can merge #49210?

@dkarrasch
Member

Resolving the conflicts is not worth it. Recent changes reduced the number of mul! methods dramatically, so such a change would affect fewer methods. We could take the macro from here and apply it to whatever is left from the earlier work. That should be much easier than resolving the conflicts.

@amilsted
Contributor Author

@dkarrasch Would it be helpful if I went ahead and did that?

@ViralBShah
Member

@dkarrasch Should we try to get this in as you proposed?
