Split up gemm_wrapper and stabilize MulAddMul strategically #47206
Conversation
Do the same for herk and syrk. Also remove @stable_muladdmul from generic_matmatmul!.
I was pinged, so here I go. Some background thoughts: my understanding is that the original goal with using the

I think @amilsted shares these general conclusions (please correct me if I'm wrong).

I think I understand the core goal of this PR and the macro it introduces, but I think some choices are motivated only by how ubiquitous the use of MulAddMul was.
I hope.
I think it's not quite right that dispatching on MulAddMul only simplifies some logical expressions. If you look at the native matmul code, including the 2x2 and 3x3 cases, which are supposed to be fast (faster than calling out to BLAS), we are able to write generic code that avoids unnecessary arithmetic ops, as well as avoiding any ops that take inputs from

I agree that the code is getting rather convoluted. Just duplicating these native matmul functions for a few cases is likely more readable.
Correction: upon a second read-through, I have clearly misunderstood things in this PR a lot.
(no)inline annotations are just hints, which are most often ignored, especially the noinline.
Though the code here generates a branch for all 4 variants.
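For concreteness, here is a toy sketch (hypothetical names, not the PR's actual code) of that 4-variant branch pattern: branch on the runtime values so that each arm hands the kernel a concretely-typed flag, making each arm type-stable even when alpha and beta are not compile-time constants.

```julia
# Toy sketch (hypothetical names): value branches that pass a concrete
# flag type to the kernel, one branch per (alpha == 1?, beta == 0?) variant.
struct Flags{ais1, bis0} end

@inline function stabilized_call(f, alpha, beta)
    if isone(alpha)
        iszero(beta) ? f(Flags{true, true}())  : f(Flags{true, false}())
    else
        iszero(beta) ? f(Flags{false, true}()) : f(Flags{false, false}())
    end
end
```

The cost is that `f` gets compiled for all four flag types whenever the branches cannot be folded away.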
But then you'll have the type instability there instead?
Sorry this turned into a bit of rambling. Feel free to ignore this.
Yeah, I'm a bit sleepy apparently. I was focused on the use of
I tried to think twice as hard this time, and the current dispatch seems doomed from the start (even if constant propagation worked as intended). So I've now come to think that the problem stems from not being able to select what you really want. In my philosophical view of this problem, that's the root of all trouble: two different properties coupled to the same parameter.
(Though obviously, a bit more polished.) I suppose it's a bit of a foot-gun to specify this apart from the value of the coefficients, but this does let me have this control, and there are no unnecessary compilations happening, no extra runtime branch checks, no switching of behaviour based on what values
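A hypothetical sketch of the kind of custom-type approach described here (illustrative names, not code from this thread): encode "never read C" as a type, so the algorithm choice is independent of the runtime values of the coefficients.

```julia
# Hypothetical sketch: the caller selects the algorithm by type alone.
struct Overwrite end   # plays the role of beta == 0: C is never read

scale_add!(C, X, alpha, beta::Number) = (C .= alpha .* X .+ beta .* C)
scale_add!(C, X, alpha, ::Overwrite)  = (C .= alpha .* X)  # safe for NaN/undef C
```

Each method is fully type-stable on its own, and nothing is compiled for variants the caller never asks for.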
Yes, that seems to be the only way to prevent
Yeah, it's tricky. I thought about detecting, at compile time, that
Regarding convoluted code: The
@nanosoldier
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
If I read the benchmark correctly, then 5-arg
Is the Diagonal case supposed to NOT allocate now? It seems to do (as many) allocations, which is the main worry. But if allocations are expected, is the 45% slowdown real, or could it be a fluke with small data?
Looks like
It occurs to me one could do a "light" runtime dispatch by hand for
I still really think the problem is tying the algorithm to the values here; I can't even imagine any code where you don't know which version you really want (regardless of values or const-ness). Basically expanding upon what e.g. the 3-argument
This would be a breaking change for the users, though a change for the better IMHO. There would be no type instability, no redundant compilation, no extra runtime checks and branches. It would be simpler than the original code, probably shorter, and with some handy default arguments it would also allow specifying only 4 arguments.
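As a hedged sketch (my reading of the proposal being argued for, not an agreed design), the method called, rather than the values passed, would fix which arithmetic runs, with default arguments supplying the shorter forms:

```julia
# Hypothetical interface sketch: the arity/method selects the algorithm.
mymul!(C, A, B, alpha = true) = (C .= alpha .* (A * B))               # C never read
mymul!(C, A, B, alpha, beta) = (C .= alpha .* (A * B) .+ beta .* C)  # C accumulated
```

A caller who wants plain `C = A*B` calls the 3-argument form and never pays for, or compiles, the accumulating variant.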
The zero check is just for beta; there are additional checks for alpha = 1, but those just (sometimes) impact the performance (very little in most cases).
Yes, I don't think this was ever in doubt.
The fact that this is based on the value of beta (and alpha) is what led to this type instability.
Well, 3-arg mul! is already provided today, but regardless, no approach suggested here would require any code duplication.
I understand changing the behavior is difficult, but I believe the design decision was wrong. Still, adding a wrapper (at least for the beta = 0 check) to preserve backwards compatibility is doable without much code addition (probably still less than what is there today), though this prevents the user from being able to explicitly say what arithmetic they actually want.
@Micket I don't think this is true (see my message above).
Perhaps I was misleading in my attempt to write a simple example, since I used mul! directly; regardless of whether there is ambiguity there or not, it would not be an appropriate place to check for this due to the sheer number of mul! methods, so I meant in appropriate (new) wrappers. Using the first code snippet I saw in this PR since it was the cleanest case: bidiag has a bunch of mul!'s defined
with a single wrapper for MulAddMul
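Such a wrapper might look like the following sketch (assuming the `A_mul_B_td!` and `MulAddMul(alpha, nothing)` spellings used elsewhere in this thread): one entry point that turns the value-level beta check into a type-level distinction before reaching the kernels.

```julia
# Sketch (names assumed from this thread): a single wrapper that resolves
# the beta == 0 question once, so the kernels only ever see a concrete type.
function mul!(C, A::Bidiagonal, B::AbstractMatrix, alpha::Number, beta::Number)
    if iszero(beta)
        A_mul_B_td!(C, A, B, MulAddMul(alpha, nothing))  # "overwrite C" variant
    else
        A_mul_B_td!(C, A, B, MulAddMul(alpha, beta))     # accumulating variant
    end
end
```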
So I guess there would be a need for a couple of such wrappers (e.g. one for

I don't think this would just be a matter of internal implementation details either. This would make tangible differences:
In a way, these are the benefits that would stem from not expanding the macro into an inline if-else block, but instead exposing that code as a function, allowing the dispatch mechanics to work their magic (which brings benefits, even if they are hampered by some backwards compatibility). And it would bring the code more in line with the possible future approach, if this is what people want. It would only require dropping some wrappers.
I think I might understand the proposal a little better now. The idea is to turn the default case (alpha = 1, beta = 0) into a

(see julia/stdlib/LinearAlgebra/test/addmul.jl, lines 197 to 215 at ebe3159)
Yes
Indeed. To quote myself from earlier in the thread:
But of course, we can't make such a breaking change out of nowhere; still, I really think the old behavior could be preserved with some wrapping (#47206 (comment)) while still offering partial benefits (if this is the approach where we ultimately want to go). I'm not super happy about the Nothing type I use in my examples (the meaning becomes inconsistent, since it's alpha = 1 vs. beta = 0), but with sufficient documentation I guess it might be acceptable.
@Micket It occurs to me that there's no need to have two definitions for each:

```julia
@inline mul!(C, A, B, alpha, beta) = begin
    if beta !== nothing && iszero(beta)
        A_mul_B_td!(C, A, B, MulAddMul(alpha, nothing))
    else
        A_mul_B_td!(C, A, B, MulAddMul(alpha, beta))
    end
end
```

The first term in the condition can be evaluated at compile time. Why is this better? If we wanted to, we could implement this change via a macro like the one in this PR, rather than adding a branch to every
Use

Of course we can revisit it once we have static bool support in
@N5N3 by "breaking" do you mean that there might be issues with method ambiguity, or just that users may be surprised if
I mean the ecosystem has lived with
Ah, okay. So
```julia
# Not using @stable_muladdmul here deliberately to create an inference
# barrier in case α, β are not compile-time constants. This avoids compiling
# four versions of generic_matmatmul!() in those cases.
```
As already questioned in the comments, I think these changes shouldn't be here (i.e. all the changes that try to give special treatment to the general method, for the syrk, herk, and gemm wrappers).
- I see no reason why one would want to preserve the type instability it definitely introduces; it basically just solves the problem for 2x2 and 3x3 matrices (for which one should probably use a static array instead). The code is much more difficult to understand, since it makes this a strange exception to the pattern otherwise used for all other types.
- The inline and especially the noinline are just hints (the compiler can, and often will, decide what to do regardless), though here they are used as if they introduced strong guarantees.
- All of these changes will have to be reverted once a nicer solution with StaticBool can be used.
As for the first concern, runtime dispatch is not a bad idea when the inner kernel is big. Although we could make it stable, the compile-time cost would be high.
I agree that 2x2 and 3x3 are not that important, and in most cases these repeated branches would never be hit.
But the current design was motivated by their performance regression, so I'm not sure it's OK to re-introduce that regression.
As for the 3rd concern, there's no timeline for StaticBool and the related compiler improvements. #46471 might help here, but I don't think it's a blocker for a temporary improvement.
The only purpose of this PR is to replace the runtime dispatch with a static if-else block, at the cost of increased compilation times. Except it excludes the one scenario everyone cares about: gemm for matrices larger than 3x3.
I'm not suggesting dropping the small-matrix optimizations, but this PR would literally keep the instability in the issue that actually started this, #46865, only making it slightly worse since it needs to compile more for the small-matrix variants. And it moves the type instability deeper, so it seems less likely that optimizations would take the new instability away.
I think const-prop is a red herring; it would of course not do anything in the common case where alpha and beta aren't constant.
Except that it does not exclude the case of gemm for matrices larger than 3x3. The BLAS case is explicitly included: it no longer involves MulAddMul at all in this PR. This PR fixes the allocations seen in #46865; it even includes an explicit test for this!

Also, compile times are only worse in cases that previously involved runtime dispatch, since const-prop should still eliminate the branches (assuming the @inline hint is successful; see below). generic_matmatmul is the only case that now involves runtime dispatch (likely always), but it is also not the hot path everyone cares about, as far as I am aware. When I tested it with dense matrices, it did allocations all by itself and the runtime dispatch overhead was negligible.

Regarding inlining hints: they are not required to be respected for this approach to work. The @inline is the more important one, as it makes it more likely for const-prop to still happen in the small-matrix cases and eliminate the new branches (keeping compile time down). The @noinline doesn't really serve a useful purpose beyond showing the reader why the gemm_wrapper functions are split the way they are, and could just be removed.
@Micket Reading this comment made me realize that the changes in this PR, plus use of Static.jl, pretty much give you the interface you have been suggesting. If you use Static.jl you can already supply alpha and beta as static bools or ints, which will lead to compile-time selection of the correct methods, as the branches added in this PR will be evaluated at compile time. The only sacrifice is the generic_matmatmul inference barrier, which doesn't appear to hurt performance noticeably.
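If I follow, the usage would look roughly like this (a sketch; it assumes, as the comment above claims, that Static.jl's `static` values are accepted wherever a Number coefficient is, and that the PR's branches then fold away at compile time):

```julia
# Sketch of the Static.jl route described above.
using LinearAlgebra, Static

C = zeros(3, 3); A = rand(3, 3); B = rand(3, 3)
# alpha = 1, beta = 0, both known statically to the compiler:
mul!(C, A, B, static(true), static(false))
```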
Yes, but introducing the static half of the problem was never really any issue (whether you use Union{Nothing,Number}, Val{x}, static(x), or just introduce your own custom type, which I played around with; it only takes a few lines of code). It was stopping all the extra stuff for the dynamic case that brought the real benefits, IMHO.

A beta = static(0) would allow the compiler to maybe optimize away a branch, but for the general case, even if we know we have beta = rand() + 1 (something strictly > 0), the extra compilations for all those possible runtime branches will still be there (probably; I don't think the optimizer would see through that).

But in both of those cases, if your matrices are small enough to notice a runtime cost from this, you should surely be using StaticArrays anyway.

I just think the interface was wrong from the start: selecting the algorithm based on value when it could/should have been based only on type, which is necessarily a breaking change.

The main benefits I see from that are code simplification, no special macros, no extra branches and special rules. You simply get what you call for, and only ever pay for that.

(I'm also not sure static(1)/static(0) is better than Nothing here, since what one is really selecting is a different algorithm in a way, not just some optimization; it's not really equal to multiplication by 0 anymore, in terms of IEEE-754 math.)
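The IEEE-754 point is easy to demonstrate: with beta = 0, mul! overwrites C rather than computing beta * C, so a NaN already sitting in C does not poison the result the way a literal multiplication by zero would.

```julia
using LinearAlgebra

# beta = 0 means "do not read C", not "add 0.0 * C": under IEEE-754,
# 0.0 * NaN is NaN, so the literal formula alpha*A*B + 0.0*C would
# propagate the NaNs already in C.
C = fill(NaN, 2, 2); A = ones(2, 2); B = ones(2, 2)
mul!(C, A, B, 1.0, 0.0)
@assert all(C .== 2.0)   # NaN gone: C was overwritten, not accumulated into
```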
Note: I've gotten beyond busy recently, so I'm unlikely to write more in this issue. I don't have any authority here anyway and anyone should feel free to dismiss my comments.
Edit: Email replies don't work too well. I moved this to where I intended it to be, in the above thread.
Sorry for the late reply @amilsted.
I would suggest that, if performance of 2x2 and 3x3 matmul is not important in stdlib, those native functions should just be removed. The only reason to have them at all is performance; otherwise we can just use
I could separate out the tests from this PR and mark them as broken. @N5N3 @dkarrasch do you think they might be easier to merge? At least then the problems with non-const alpha, beta become more visible. Could also add a benchmark...
See JuliaLang#47206 for a fix proposal.
Is it worth me resolving the conflicts here? Is this still being considered? If not, maybe we can merge #49210?
Resolving the conflicts is not worth it. Recent changes reduced the number of
@dkarrasch Will it be helpful if I go ahead and do that?
@dkarrasch Should we try to get this in as you proposed?
This is a hybrid of #47088 and #47026. See also #46865. Like #47088, it introduces a macro @stable_muladdmul to construct MulAddMul(alpha, beta) without type instability in case alpha and beta are not constant. This helps us avoid runtime dispatch in some performance hot paths, including native multiplication of small matrices, as well as calls out to gemm!().

To avoid blowing up compile times a lot, this splits gemm_wrapper!() and friends into an inlined and a non-inlined part. The inlined part contains the stabilized calls to native small-matrix multiplication. The non-inlined part calls out to gemm!() and, failing that, generic_matmatmul!(), which is called without the @stable_muladdmul wrapper, introducing an inference barrier in case alpha and beta are not constant, preventing 4 different versions of (heavy) generic_matmatmul!() being compiled in those cases.

@dkarrasch @N5N3 @Micket