
Fuse tensor.pad with producers. #9194

Conversation

@MaheshRavishankar (Contributor) commented May 24, 2022

Allow the `tensor.insert_slice` operation to get folded with its
producers. This effectively is a fusion of pad operations with their
producers (with an additional fill of the result buffer with the padded
value).

Fixes #2783.

@benvanik (Collaborator) left a comment


I think we want to be more restrictive and invert the matching logic such that it handles only non-contiguous/padded insertions to start since fusing the insertion otherwise introduces a false dependency on the entire output tensor (a big hammer). The check being too broad is risky mostly because it will be really hard to track down as a root cause of concurrency issues later on and I've seen sequences for concats that have interleaved ops that would not match this (and it'll get more complex once we have concurrent loop regions and such that will require non-local analysis to detect, etc).

Actually seeing the behavior change is on me; I still have #7729 open and really need to fix it. That should handle the non-padding/concat-style insertions that aren't matched while avoiding the false dependency.

@MaheshRavishankar (Contributor, Author)

> I think we want to be more restrictive and invert the matching logic such that it handles only non-contiguous/padded insertions to start since fusing the insertion otherwise introduces a false dependency on the entire output tensor (a big hammer). The check being too broad is risky mostly because it will be really hard to track down as a root cause of concurrency issues later on and I've seen sequences for concats that have interleaved ops that would not match this (and it'll get more complex once we have concurrent loop regions and such that will require non-local analysis to detect, etc).
>
> Actually seeing the behavior change is on me; I still have #7729 open and really need to fix it. That should handle the non-padding/concat-style insertions that aren't matched while avoiding the false dependency.

Ah, non-contiguous is a good idea, but there might be non-contiguous inserts that exhibit the same issue. I did try to account for that case. Trying to see if my reasoning matches your expectation:

  1. If the dest is from a tensor.insert_slice, then it's not fused.
  2. If the result is used in a tensor.insert_slice, then it is also not fused.

I can make it even more restrictive. I can check that the dest of the tensor.insert_slice has a single use and the result of the tensor.insert_slice has a single use that is not the dest of another tensor.insert_slice.
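As a hypothetical illustration of the chain these checks are meant to exclude, a concat built out of two inserts trips both conditions, since the middle value is both produced by and consumed as the dest of a tensor.insert_slice (names and shapes invented for the example):

```mlir
// Concat of two 4x8 tensors along dim 0, expressed as an insert_slice chain.
// %0 fails check 2 (its result feeds another insert_slice);
// %1 fails check 1 (its dest comes from an insert_slice).
%0 = tensor.insert_slice %a into %init[0, 0] [4, 8] [1, 1]
       : tensor<4x8xf32> into tensor<8x8xf32>
%1 = tensor.insert_slice %b into %0[4, 0] [4, 8] [1, 1]
       : tensor<4x8xf32> into tensor<8x8xf32>
```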

@benvanik (Collaborator)

I think if I can actually do the insertions in-place in the stream allocator we should stick to always preferring to have them unfused unless there's a case-specific benefit (like with internal padding).

Doing the single use checks may cover some cases (not control flow/etc) but even then fusing the insert is still a lossy operation from the perspective of the host side: in flow/stream we would no longer know that only a subset of the resource is being modified and as such we can't tell the runtime/drivers/devices that (as a specific example we need to be able to fill out a VkBufferMemoryBarrier). If we lose range information on the host the result tensors will just be seen as a read/write over their entirety and will need to be transferred and flushed to start a dispatch and any subsequent read would need a barrier/invalidation of the full tensor vs. just the updated range.

Are there other cases beyond padding where having the insert_slice on the inside of a dispatch would be useful? Wondering if there's analysis information you use or whatnot; conceptually having the buffer pre-offset should make the dispatches simpler (half the offset math is hoisted), but just as we don't want the host to be blind to subranges maybe we also don't want to have the device be blind too? If it's useful to have the full extract/insert slice ranges on dispatches we could extend dispatch regions to capture them explicitly.

Maybe fixing #7729 is what I get around to next as it's been bugging me since I wrote the stream stuff and it'd feel really good to close it out :P

@MaheshRavishankar (Contributor, Author)

> I think if I can actually do the insertions in-place in the stream allocator we should stick to always preferring to have them unfused unless there's a case-specific benefit (like with internal padding).
>
> Doing the single use checks may cover some cases (not control flow/etc) but even then fusing the insert is still a lossy operation from the perspective of the host side: in flow/stream we would no longer know that only a subset of the resource is being modified and as such we can't tell the runtime/drivers/devices that (as a specific example we need to be able to fill out a VkBufferMemoryBarrier). If we lose range information on the host the result tensors will just be seen as a read/write over their entirety and will need to be transferred and flushed to start a dispatch and any subsequent read would need a barrier/invalidation of the full tensor vs. just the updated range.
>
> Are there other cases beyond padding where having the insert_slice on the inside of a dispatch would be useful? Wondering if there's analysis information you use or whatnot; conceptually having the buffer pre-offset should make the dispatches simpler (half the offset math is hoisted), but just as we don't want the host to be blind to subranges maybe we also don't want to have the device be blind too? If it's useful to have the full extract/insert slice ranges on dispatches we could extend dispatch regions to capture them explicitly.
>
> Maybe fixing #7729 is what I get around to next as it's been bugging me since I wrote the stream stuff and it'd feel really good to close it out :P

Ok, well, let me see if I can handle pad specifically. I thought this would be a general solution to take anything that writes into a tensor.insert_slice and fuse it with the dispatch that produces it, as long as the dest is not part of a tensor.insert_slice chain. I can try to carve out a special path for the fusion with pad operations.

@benvanik (Collaborator)

Oh I think it's a useful fusion - just think it should start opt-in for specific known uses given the broader implications vs the norm (dispatch count, etc). We can keep an eye out for more scenarios and see if we can find matches (positive and negative) that help widen the net.

Incidentally the reasoning is the same as why having extracts outside of dispatches is useful; I'm not actually sure what's happening to those nowadays. Outside of dispatch regions ideally we'd have an extract of the minimal contiguous ND subrange on every operand and an insert for every result - if that best comes by construction (we don't fuse them in) or by analysis (we do all the fusions and then extract that information) or something in-between I don't know. Maybe there's something we can do that lets us leave this on but still get the info.

Just riffing RE this particular fusion: if there's any min/max info during fusion that can be derived from the fused insert_slice we can split things such that we do a coarse subrange on the outside via insert_slice and then also put a smaller/tighter insert_slice on the inside. It'd be computationally simple (we have to compute offsets on the outside anyway) and would still give the information required for the host side. Effectively just rebasing the fused insert_slice to 0,0,... and leaving an insert_slice with the original offsets outside. That feels trickier to do after fusion (we would have to hoist the offset values, and those may be dynamically computed inside the region), but maybe it falls out of other analysis you're doing already.
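A rough sketch of that split, with invented value names and dynamic offsets/sizes, assuming the inner insert can be rebased to zero offsets:

```mlir
// Inside the dispatch: the fused insert, rebased to offsets 0,0 into a
// tight local tensor of exactly the updated size.
%inner = tensor.insert_slice %computed into %local[0, 0] [%sz0, %sz1] [1, 1]
           : tensor<?x?xf32> into tensor<?x?xf32>

// Outside the dispatch: a coarse insert with the original offsets, which
// preserves the modified-subrange information for the host side.
%result = tensor.insert_slice %inner into %dest[%off0, %off1] [%sz0, %sz1] [1, 1]
            : tensor<?x?xf32> into tensor<?x?xf32>
```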

@MaheshRavishankar (Contributor, Author)

> Oh I think it's a useful fusion - just think it should start opt-in for specific known uses given the broader implications vs the norm (dispatch count, etc). We can keep an eye out for more scenarios and see if we can find matches (positive and negative) that help widen the net.

It is hard to carve out a special path by looking at uses of tensor.insert_slice (actually, that's basically what I tried: if the tensor.insert_slice is not part of a chain of inserts, then I thought it avoided the issues you were highlighting).

> Incidentally the reasoning is the same as why having extracts outside of dispatches is useful; I'm not actually sure what's happening to those nowadays. Outside of dispatch regions ideally we'd have an extract of the minimal contiguous ND subrange on every operand and an insert for every result - if that best comes by construction (we don't fuse them in) or by analysis (we do all the fusions and then extract that information) or something in-between I don't know. Maybe there's something we can do that lets us leave this on but still get the info.

Right now all tensor.extract_slice ops get pulled in. One of the issues is that non-contiguous extracts need to end up in dispatches, so currently it's biased towards pulling in all the extracts. It is easier to not pull in extracts, but that has a cascading effect of some IR being left out of dispatches, or of creating more dispatches just because the extract forms a barrier point. What we do have, though, is that the extract/insert get fused with the flow.dispatch.tensor.load/store where possible. So that might give an idea of the extent of the tensor accessed within each dispatch.

> Just riffing RE this particular fusion: if there's any min/max info during fusion that can be derived from the fused insert_slice we can split things such that we do a coarse subrange on the outside via insert_slice and then also put a smaller/tighter insert_slice on the inside. It'd be computationally simple (we have to compute offsets on the outside anyway) and would still give the information required for the host side. Effectively just rebasing the fused insert_slice to 0,0,... and leaving an insert_slice with the original offsets outside. That feels trickier to do after fusion (would have to hoist the offset values and those may be dynamically computed inside the region) but maybe it falls out of other analysis you're doing already.

I think the easiest place is to look at the offsets and sizes of the flow.dispatch.tensor.load/store. That should carry all the necessary information.
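For illustration, a hedged sketch of what those ops carry (assembly format approximated from upstream IREE around this time; exact syntax may differ): the load/store record explicit offsets, sizes, and strides against the dispatch-bound tensor, which is the subrange information in question.

```mlir
// Load a 4x8 subrange starting at row %off of a dispatch input.
%t = flow.dispatch.tensor.load %in, offsets = [%off, 0], sizes = [4, 8],
       strides = [1, 1]
       : !flow.dispatch.tensor<readonly:16x8xf32> -> tensor<4x8xf32>

// Store the result back to the same subrange of a dispatch output.
flow.dispatch.tensor.store %t, %out, offsets = [%off, 0], sizes = [4, 8],
  strides = [1, 1]
  : tensor<4x8xf32> -> !flow.dispatch.tensor<writeonly:16x8xf32>
```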

@benvanik (Collaborator) commented May 24, 2022

I think the issue with looking at the flow.dispatch.tensor.load/stores is that it's a phase ordering inversion that creates a complex hoisting/recovery/raising - if after a round or two of canonicalizations/transforms they end up not just using dispatch operands or the loads/stores end up in conditional regions things get crazy. Conceptually it feels better to construct things with the right ranges instead.

What are the cases where folding in a non-workgroup-id-dependent contiguous extract/insert helps codegen? Could they be solved by some better extract/insert propagation (if the issue is that you have chained use -> insert -> use you want fused)? I'm trying to reason about what the information we need is (the minimal contiguous subranges of I/O) and how that matches with the execution model if we remove it (a particular dispatch across all workgroups must only access those minimal contiguous subranges). Basically, it should always be safe and more efficient to have the extracts/inserts on the outside, so what do we gain by fusing them in and how might we get that without doing it. If it's really just fixed-point iteration while building up a dispatch region this may be something the in_parallel stuff helps with by making the ordering clearer (fusion happens there, then when lowering those to dispatch regions we hoist whatever we can to extracts/slices on the outside)?

@MaheshRavishankar (Contributor, Author)

> I think the issue with looking at the flow.dispatch.tensor.load/stores is that it's a phase ordering inversion that creates a complex hoisting/recovery/raising - if after a round or two of canonicalizations/transforms they end up not just using dispatch operands or the loads/stores end up in conditional regions things get crazy. Conceptually it feels better to construct things with the right ranges instead.
>
> What are the cases where folding in a non-workgroup-id-dependent contiguous extract/insert helps codegen? Could they be solved by some better extract/insert propagation (if the issue is that you have chained use -> insert -> use you want fused)? I'm trying to reason about what the information we need is (the minimal contiguous subranges of I/O) and how that matches with the execution model if we remove it (a particular dispatch across all workgroups must only access those minimal contiguous subranges). Basically, it should always be safe and more efficient to have the extracts/inserts on the outside, so what do we gain by fusing them in and how might we get that without doing it. If it's really just fixed-point iteration while building up a dispatch region this may be something the in_parallel stuff helps with by making the ordering clearer (fusion happens there, then when lowering those to dispatch regions we hoist whatever we can to extracts/slices on the outside)?

I am not sure I follow the question (and I'm not sure the in_parallel stuff helps with these anyway). There is no fixed-point iteration happening w.r.t. fusion (it's a greedy fusion approach; it pulls in whatever it can). If the requirement here is that it is better to leave tensor.extract_slice/tensor.insert_slice outside dispatches, then that is easy to do too (it's literally a one-line change; I'm just not sure what the requirement/ask is, because previously, when these were not fused, there were a lot of comments/issues about this not happening, so I am a bit confused about the expectation here). It's worth experimenting on that separately.

@benvanik (Collaborator)

You'd not be making this change if it didn't provide value to something you're working on and I'm saying that fusing extract/insert into dispatches that widen the subrange a dispatch operates on is a negative to the system as a whole if applied by default. I haven't articulated these expectations around extract/insert before and we have no code today explicitly trying to optimize for that - this PR is the first I've seen that potentially pessimizes things and hence the discussion. I'm trying to explore whether solving for whatever it is that caused you to make this PR is something that can also help us better align with the expectations around extract/insert being as narrow as possible. in_parallel & co intersect as that may be where fusion decision making moves to (vs your greedy stuff here) and it has extracts/inserts as part of its structure - thus if we're going to work on extract/insert fusion it seems relevant to talk about that.

Happy to pick this up in our 1:1 tomorrow.

@MaheshRavishankar (Contributor, Author)

> You'd not be making this change if it didn't provide value to something you're working on and I'm saying that fusing extract/insert into dispatches that widen the subrange a dispatch operates on is a negative to the system as a whole if applied by default. I haven't articulated these expectations around extract/insert before and we have no code today explicitly trying to optimize for that - this PR is the first I've seen that potentially pessimizes things and hence the discussion. I'm trying to explore whether solving for whatever it is that caused you to make this PR is something that can also help us better align with the expectations around extract/insert being as narrow as possible. in_parallel & co intersect as that may be where fusion decision making moves to (vs your greedy stuff here) and it has extracts/inserts as part of its structure - thus if we're going to work on extract/insert fusion it seems relevant to talk about that.
>
> Happy to pick this up in our 1:1 tomorrow.

I do understand the overall reason why fusing tensor.extract_slice/insert_slice into dispatches pessimizes things. Let me first write down what happens today with these ops:

  1. tensor.extract_slice is always fused into the dispatch where it is used. If any tensor.extract_slice is not pulled into a dispatch:
    a) Contiguous tensor.extract_slice ops get converted to flow.slice operations.
    b) Non-contiguous tensor.extract_slice ops get pulled into their own dispatch (this is a fallback needed to avoid having any tensor ops outside dispatches; I think it is rarely hit).
  2. tensor.insert_slice is never fused into the dispatches of its consumers/producers.
    a) Contiguous insert_slice ops get converted to flow.update operations.
    b) Non-contiguous insert_slice ops get pulled into their own dispatches.

I was trying to remove dispatches that contain only a tensor.insert_slice by "fusing" it with its producers. Incidentally, that also makes padding fused into its producers. Understanding that this has implications, I'll try to change this PR to be more targeted and work on pad alone.

So, some questions for the future around how to handle tensor.extract_slice/tensor.insert_slice:

  • Is there a distinction to be made w.r.t. contiguous vs. non-contiguous operations, since non-contiguous operations cannot lower to flow ops?
  • There might be (corner) cases where not fusing tensor.extract_slice operations into dispatches means some other ops that are producers of the slice op either get converted to flow.tensor ops (like reshape) or, as a fallback, need to get pulled into their own dispatch. Off the top of my head I don't know if the latter actually happens. Worth making the change and trying it out.
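As a hypothetical example of the contiguous/non-contiguous distinction for a row-major tensor (shapes invented):

```mlir
// Contiguous: full rows 2..5 with unit strides; this is one dense span
// of memory and can lower to a host-side flow op.
%c = tensor.extract_slice %t[2, 0] [4, 8] [1, 1]
       : tensor<16x8xf32> to tensor<4x8xf32>

// Non-contiguous: columns 2..5 of every row; each row contributes a
// separate strided fragment, so this must live inside a dispatch.
%n = tensor.extract_slice %t[0, 2] [16, 4] [1, 1]
       : tensor<16x8xf32> to tensor<16x4xf32>
```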

(Just to restate: I understand that having the tensor.extract_slice/insert_slice ops out of dispatches to get a better idea of the ranges being operated on is helpful, so I'm happy to make that information available.)

@MaheshRavishankar MaheshRavishankar marked this pull request as draft May 25, 2022 05:29
@MaheshRavishankar MaheshRavishankar changed the title Fuse tensor.insert_slice with producers. Fuse tensor.pad with producers. Jun 6, 2022
@MaheshRavishankar MaheshRavishankar deleted the fuse_pad_with_producer branch November 17, 2022 21:19
@MaheshRavishankar MaheshRavishankar restored the fuse_pad_with_producer branch November 17, 2022 21:19
Successfully merging this pull request may close these issues: Fusion for pad op in Linalg