Fuse tensor.pad with producers. #9194
Conversation
Allow the `tensor.insert_slice` operation to get folded with its producers. This effectively is a fusion of pad operations with their producers (with an additional fill of the result buffer with the padded value).
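To make the rewrite concrete, a minimal sketch of the kind of IR involved; shapes, values, and SSA names are illustrative rather than taken from the patch, and op spellings follow current upstream MLIR (e.g. `tensor.empty` was `linalg.init_tensor` in older revisions):

```mlir
// A pad of a producer's result (illustrative shapes):
%padded = tensor.pad %producer_result low[1, 1] high[1, 1] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : f32
} : tensor<4x4xf32> to tensor<6x6xf32>

// The equivalent fill + insert_slice form; the fill and the insert_slice are
// what can then be pulled into the producer's dispatch region:
%init    = tensor.empty() : tensor<6x6xf32>
%filled  = linalg.fill ins(%cst : f32) outs(%init : tensor<6x6xf32>) -> tensor<6x6xf32>
%padded2 = tensor.insert_slice %producer_result into %filled[1, 1] [4, 4] [1, 1]
    : tensor<4x4xf32> into tensor<6x6xf32>
```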
I think we want to be more restrictive and invert the matching logic such that it handles only non-contiguous/padded insertions to start, since fusing the insertion otherwise introduces a false dependency on the entire output tensor (a big hammer). The check being too broad is risky mostly because it will be really hard to track down as a root cause of concurrency issues later on, and I've seen sequences for concats with interleaved ops that would not match this (and it'll get more complex once we have concurrent loop regions and such that will require non-local analysis to detect, etc).
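A rough sketch of the distinction being drawn here, with made-up shapes and SSA names:

```mlir
// Padded/non-contiguous insertion: the 4x4 source lands in the interior of an
// 8x8 destination, so every destination row is only partially overwritten and
// the surrounding fill/background values matter.
%pad_like = tensor.insert_slice %src into %dst[2, 2] [4, 4] [1, 1]
    : tensor<4x4xf32> into tensor<8x8xf32>

// Concat-style insertion: a full-width 4x8 slab written at a row offset; the
// written elements form one contiguous range of the destination.
%concat_like = tensor.insert_slice %slab into %dst2[4, 0] [4, 8] [1, 1]
    : tensor<4x8xf32> into tensor<8x8xf32>
```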
Actually seeing the behavior change is on me; I still have #7729 open and really need to fix it. That should handle the non-padding/concat-style insertions that aren't matched while avoiding the false dependency.
Ah, non-contiguous is a good idea, but there might be non-contiguous inserts that exhibit the same issue. I did try to account for that case; trying to see if my reasoning matches your expectation.
I can make it even more restrictive. I can check that the
I think if I can actually do the insertions in-place in the stream allocator we should stick to always preferring to have them unfused unless there's a case-specific benefit (like with internal padding). Doing the single-use checks may cover some cases (not control flow/etc) but even then fusing the insert is still a lossy operation from the perspective of the host side: in flow/stream we would no longer know that only a subset of the resource is being modified and as such we can't tell the runtime/drivers/devices that (as a specific example we need to be able to fill out a VkBufferMemoryBarrier). If we lose range information on the host the result tensors will just be seen as a read/write over their entirety and will need to be transferred and flushed to start a dispatch, and any subsequent read would need a barrier/invalidation of the full tensor vs. just the updated range.

Are there other cases beyond padding where having the insert_slice on the inside of a dispatch would be useful? Wondering if there's analysis information you use or whatnot; conceptually having the buffer pre-offset should make the dispatches simpler (half the offset math is hoisted), but just as we don't want the host to be blind to subranges maybe we also don't want the device to be blind too? If it's useful to have the full extract/insert slice ranges on dispatches we could extend dispatch regions to capture them explicitly.

Maybe fixing #7729 is what I get around to next as it's been bugging me since I wrote the stream stuff and it'd feel really good to close it out :P
Ok, well, let me see if I can handle pad specifically. I thought this would be a general solution to take anything that writes into a
Oh I think it's a useful fusion - just think it should start opt-in for specific known uses given the broader implications vs the norm (dispatch count, etc). We can keep an eye out for more scenarios and see if we can find matches (positive and negative) that help widen the net. Incidentally the reasoning is the same as why having extracts outside of dispatches is useful; I'm not actually sure what's happening to those nowadays. Outside of dispatch regions ideally we'd have an extract of the minimal contiguous ND subrange on every operand and an insert for every result - whether that best comes by construction (we don't fuse them in) or by analysis (we do all the fusions and then extract that information) or something in-between I don't know. Maybe there's something we can do that lets us leave this on but still get the info.

Just riffing RE this particular fusion: if there's any min/max info during fusion that can be derived from the fused insert_slice we can split things such that we do a coarse subrange on the outside via insert_slice and then also put a smaller/tighter insert_slice on the inside. It'd be computationally simple (we have to compute offsets on the outside anyway) and would still give the information required for the host side. Effectively just rebasing the fused insert_slice to 0,0,... and leaving an insert_slice with the original offsets outside. That feels trickier to do after fusion (would have to hoist the offset values and those may be dynamically computed inside the region) but maybe it falls out of other analysis you're doing already.
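A hedged sketch of that split, with invented shapes/offsets/names and only an approximation of IREE's `flow.dispatch.region` spelling (the exact syntax differs across revisions):

```mlir
// Inside the dispatch the insert is rebased to the window's origin; a coarse
// insert with the original offsets stays outside, so the host still sees
// exactly which subrange of %dest is written.
%dispatch = flow.dispatch.region -> (tensor<6x6xf32>) {
  %fill  = linalg.fill ins(%pad_value : f32) outs(%window : tensor<6x6xf32>) -> tensor<6x6xf32>
  %inner = tensor.insert_slice %computed into %fill[1, 1] [4, 4] [1, 1]
      : tensor<4x4xf32> into tensor<6x6xf32>
  flow.return %inner : tensor<6x6xf32>
}
%result = tensor.insert_slice %dispatch into %dest[10, 10] [6, 6] [1, 1]
    : tensor<6x6xf32> into tensor<64x64xf32>
```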
It is hard to carve a special path by looking at uses of
Right now all
I think the easiest place is to look at the offsets and sizes of the
I think the issue with looking at the flow.dispatch.tensor.load/stores is that it's a phase ordering inversion that creates a complex hoisting/recovery/raising - if after a round or two of canonicalizations/transforms they end up not just using dispatch operands, or the loads/stores end up in conditional regions, things get crazy. Conceptually it feels better to construct things with the right ranges instead.

What are the cases where folding in a non-workgroup-id-dependent contiguous extract/insert helps codegen? Could they be solved by some better extract/insert propagation (if the issue is that you have a chained use -> insert -> use you want fused)? I'm trying to reason about what the information we need is (the minimal contiguous subranges of I/O) and how that matches with the execution model if we remove it (a particular dispatch across all workgroups must only access those minimal contiguous subranges). Basically, it should always be safe and more efficient to have the extracts/inserts on the outside, so what do we gain by fusing them in and how might we get that without doing it?

If it's really just fixed-point iteration while building up a dispatch region this may be something the in_parallel stuff helps with by making the ordering clearer (fusion happens there, then when lowering those to dispatch regions we hoist whatever we can to extracts/slices on the outside)?
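For reference, a rough sketch of where that subrange information lives inside a dispatch function; the function name and shapes are invented, and the `!flow.dispatch.tensor` type syntax varies between IREE revisions:

```mlir
func.func @pad_like_dispatch(%in: !flow.dispatch.tensor<readonly:4x4xf32>,
                             %out: !flow.dispatch.tensor<readwrite:8x8xf32>) {
  // The offsets/sizes/strides on the boundary load/store are the per-dispatch
  // subrange information being discussed.
  %t = flow.dispatch.tensor.load %in, offsets = [0, 0], sizes = [4, 4], strides = [1, 1]
      : !flow.dispatch.tensor<readonly:4x4xf32> -> tensor<4x4xf32>
  flow.dispatch.tensor.store %t, %out, offsets = [2, 2], sizes = [4, 4], strides = [1, 1]
      : tensor<4x4xf32> -> !flow.dispatch.tensor<readwrite:8x8xf32>
  return
}
```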
I am not sure I follow the question (and not sure
You'd not be making this change if it didn't provide value to something you're working on, and I'm saying that fusing extract/insert into dispatches in a way that widens the subrange a dispatch operates on is a negative to the system as a whole if applied by default. I haven't articulated these expectations around extract/insert before and we have no code today explicitly trying to optimize for that - this PR is the first I've seen that potentially pessimizes things and hence the discussion. I'm trying to explore whether solving for whatever it is that caused you to make this PR is something that can also help us better align with the expectations around extract/insert being as narrow as possible.

in_parallel & co intersect as that may be where fusion decision making moves to (vs your greedy stuff here) and it has extracts/inserts as part of its structure - thus if we're going to work on extract/insert fusion it seems relevant to talk about that. Happy to pick this up in our 1:1 tomorrow.
I do understand the overall reason why fusing
I was trying to remove dispatches that are only
So some questions for the future around how to handle
(just to restate, I understand that having the
Changed the title from "Fuse tensor.insert_slice with producers." to "Fuse tensor.pad with producers."
Fixes #2783.