
Fuse tensor.pad with producers. #9194

Conversation

@MaheshRavishankar (Contributor) commented May 24, 2022

Allow the `tensor.insert_slice` operation to get folded with its
producers. This effectively is a fusion of pad operations with their
producers (with an additional fill of the result buffer with the padded
value).

Fixes #2783.

@benvanik (Collaborator) left a comment


I think we want to be more restrictive and invert the matching logic such that it handles only non-contiguous/padded insertions to start since fusing the insertion otherwise introduces a false dependency on the entire output tensor (a big hammer). The check being too broad is risky mostly because it will be really hard to track down as a root cause of concurrency issues later on and I've seen sequences for concats that have interleaved ops that would not match this (and it'll get more complex once we have concurrent loop regions and such that will require non-local analysis to detect, etc).

Actually seeing the behavior change is on me; I still have #7729 open and really need to fix it. That should handle the non-padding/concat-style insertions that aren't matched while avoiding the false dependency.

@MaheshRavishankar (Contributor, Author)

> I think we want to be more restrictive and invert the matching logic such that it handles only non-contiguous/padded insertions to start since fusing the insertion otherwise introduces a false dependency on the entire output tensor (a big hammer). The check being too broad is risky mostly because it will be really hard to track down as a root cause of concurrency issues later on and I've seen sequences for concats that have interleaved ops that would not match this (and it'll get more complex once we have concurrent loop regions and such that will require non-local analysis to detect, etc).
>
> Actually seeing the behavior change is on me; I still have #7729 open and really need to fix it. That should handle the non-padding/concat-style insertions that aren't matched while avoiding the false dependency.

Ah, non-contiguous is a good idea, but there might be non-contiguous inserts that exhibit the same issue. I did try to account for that case. Trying to see if my reasoning matches your expectation:

  1. If the dest is from a tensor.insert_slice, then it's not fused.
  2. If the result is used in a tensor.insert_slice, then it is also not fused.

I can make it even more restrictive. I can check that the dest of the tensor.insert_slice has a single use and the result of the tensor.insert_slice has a single use that is not the dest of another tensor.insert_slice.
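As a hypothetical illustration of the chain these checks are meant to exclude, a concat built out of two inserts trips both conditions, since the middle value is both produced by and consumed as the dest of a tensor.insert_slice (names and shapes invented for the example):

```mlir
// Concat of two 4x8 tensors along dim 0, expressed as an insert_slice chain.
// %0 fails check 2 (its result feeds another insert_slice);
// %1 fails check 1 (its dest comes from an insert_slice).
%0 = tensor.insert_slice %a into %init[0, 0] [4, 8] [1, 1]
       : tensor<4x8xf32> into tensor<8x8xf32>
%1 = tensor.insert_slice %b into %0[4, 0] [4, 8] [1, 1]
       : tensor<4x8xf32> into tensor<8x8xf32>
```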

@benvanik (Collaborator)

I think if I can actually do the insertions in-place in the stream allocator we should stick to always preferring to have them unfused unless there's a case-specific benefit (like with internal padding).

Doing the single use checks may cover some cases (not control flow/etc) but even then fusing the insert is still a lossy operation from the perspective of the host side: in flow/stream we would no longer know that only a subset of the resource is being modified and as such we can't tell the runtime/drivers/devices that (as a specific example we need to be able to fill out a VkBufferMemoryBarrier). If we lose range information on the host the result tensors will just be seen as a read/write over their entirety and will need to be transferred and flushed to start a dispatch and any subsequent read would need a barrier/invalidation of the full tensor vs. just the updated range.

Are there other cases beyond padding where having the insert_slice on the inside of a dispatch would be useful? Wondering if there's analysis information you use or whatnot; conceptually having the buffer pre-offset should make the dispatches simpler (half the offset math is hoisted), but just as we don't want the host to be blind to subranges maybe we also don't want to have the device be blind too? If it's useful to have the full extract/insert slice ranges on dispatches we could extend dispatch regions to capture them explicitly.

Maybe fixing #7729 is what I get around to next as it's been bugging me since I wrote the stream stuff and it'd feel really good to close it out :P

@MaheshRavishankar (Contributor, Author)

> I think if I can actually do the insertions in-place in the stream allocator we should stick to always preferring to have them unfused unless there's a case-specific benefit (like with internal padding).
>
> Doing the single use checks may cover some cases (not control flow/etc) but even then fusing the insert is still a lossy operation from the perspective of the host side: in flow/stream we would no longer know that only a subset of the resource is being modified and as such we can't tell the runtime/drivers/devices that (as a specific example we need to be able to fill out a VkBufferMemoryBarrier). If we lose range information on the host the result tensors will just be seen as a read/write over their entirety and will need to be transferred and flushed to start a dispatch and any subsequent read would need a barrier/invalidation of the full tensor vs. just the updated range.
>
> Are there other cases beyond padding where having the insert_slice on the inside of a dispatch would be useful? Wondering if there's analysis information you use or whatnot; conceptually having the buffer pre-offset should make the dispatches simpler (half the offset math is hoisted), but just as we don't want the host to be blind to subranges maybe we also don't want to have the device be blind too? If it's useful to have the full extract/insert slice ranges on dispatches we could extend dispatch regions to capture them explicitly.
>
> Maybe fixing #7729 is what I get around to next as it's been bugging me since I wrote the stream stuff and it'd feel really good to close it out :P

Ok, well, let me see if I can handle pad specifically. I thought this would be a general solution to take anything that writes into a tensor.insert_slice and fuse it with the dispatch that produces it, as long as the dest is not part of a tensor.insert_slice chain. I can try to carve out a special path for the fusion with pad operations.

@benvanik (Collaborator)

Oh I think it's a useful fusion - just think it should start opt-in for specific known uses given the broader implications vs the norm (dispatch count, etc). We can keep an eye out for more scenarios and see if we can find matches (positive and negative) that help widen the net.

Incidentally the reasoning is the same as why having extracts outside of dispatches is useful; I'm not actually sure what's happening to those nowadays. Outside of dispatch regions ideally we'd have an extract of the minimal contiguous ND subrange on every operand and an insert for every result - if that best comes by construction (we don't fuse them in) or by analysis (we do all the fusions and then extract that information) or something in-between I don't know. Maybe there's something we can do that lets us leave this on but still get the info.

Just riffing RE this particular fusion: if there's any min/max info during fusion that can be derived from the fused insert_slice we can split things such that we do a coarse subrange on the outside via insert_slice and then also put a smaller/tighter insert_slice on the inside. It'd be computationally simple (we have to compute offsets on the outside anyway) and would still give the information required for the host side. Effectively just rebasing the fused insert_slice to 0,0,... and leaving an insert_slice with the original offsets outside. That feels trickier to do after fusion (we would have to hoist the offset values, and those may be dynamically computed inside the region), but maybe it falls out of other analysis you're doing already.
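A rough sketch of that split, with invented value names and dynamic offsets/sizes, assuming the inner insert can be rebased to zero offsets:

```mlir
// Inside the dispatch: the fused insert, rebased to offsets 0,0 into a
// tight local tensor of exactly the updated size.
%inner = tensor.insert_slice %computed into %local[0, 0] [%sz0, %sz1] [1, 1]
           : tensor<?x?xf32> into tensor<?x?xf32>

// Outside the dispatch: a coarse insert with the original offsets, which
// preserves the modified-subrange information for the host side.
%result = tensor.insert_slice %inner into %dest[%off0, %off1] [%sz0, %sz1] [1, 1]
            : tensor<?x?xf32> into tensor<?x?xf32>
```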

@MaheshRavishankar (Contributor, Author)

> Oh I think it's a useful fusion - just think it should start opt-in for specific known uses given the broader implications vs the norm (dispatch count, etc). We can keep an eye out for more scenarios and see if we can find matches (positive and negative) that help widen the net.

It is hard to carve out a special path by looking at uses of tensor.insert_slice (actually, that's basically what I tried: if the tensor.insert_slice is not part of a chain of inserts, then I thought it avoided the issues you were highlighting).

> Incidentally the reasoning is the same as why having extracts outside of dispatches is useful; I'm not actually sure what's happening to those nowadays. Outside of dispatch regions ideally we'd have an extract of the minimal contiguous ND subrange on every operand and an insert for every result - if that best comes by construction (we don't fuse them in) or by analysis (we do all the fusions and then extract that information) or something in-between I don't know. Maybe there's something we can do that lets us leave this on but still get the info.

Right now all tensor.extract_slice ops get pulled in. One of the issues is that non-contiguous extracts need to end up in dispatches, so currently it's biased towards pulling in all the extracts. It is easier to not pull in extracts, but that has a cascading effect of some IR being left out of dispatches, or of creating more dispatches just because the extract forms a barrier point. What we do have, though, is that the extract/insert get fused with the flow.dispatch.tensor.load/store where possible. So that might give an idea of the extent of the tensor accessed within each dispatch.

> Just riffing RE this particular fusion: if there's any min/max info during fusion that can be derived from the fused insert_slice we can split things such that we do a coarse subrange on the outside via insert_slice and then also put a smaller/tighter insert_slice on the inside. It'd be computationally simple (we have to compute offsets on the outside anyway) and would still give the information required for the host side. Effectively just rebasing the fused insert_slice to 0,0,... and leaving an insert_slice with the original offsets outside. That feels trickier to do after fusion (would have to hoist the offset values and those may be dynamically computed inside the region) but maybe it falls out of other analysis you're doing already.

I think the easiest place is to look at the offsets and sizes of the flow.dispatch.tensor.load/store. That should carry all the necessary information.
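For illustration, a hedged sketch of what those ops carry (assembly format approximated from upstream IREE around this time; exact syntax may differ): the load/store record explicit offsets, sizes, and strides against the dispatch-bound tensor, which is the subrange information in question.

```mlir
// Load a 4x8 subrange starting at row %off of a dispatch input.
%t = flow.dispatch.tensor.load %in, offsets = [%off, 0], sizes = [4, 8],
       strides = [1, 1]
       : !flow.dispatch.tensor<readonly:16x8xf32> -> tensor<4x8xf32>

// Store the result back to the same subrange of a dispatch output.
flow.dispatch.tensor.store %t, %out, offsets = [%off, 0], sizes = [4, 8],
  strides = [1, 1]
  : tensor<4x8xf32> -> !flow.dispatch.tensor<writeonly:16x8xf32>
```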

@benvanik (Collaborator) commented May 24, 2022

I think the issue with looking at the flow.dispatch.tensor.load/stores is that it's a phase ordering inversion that creates a complex hoisting/recovery/raising - if after a round or two of canonicalizations/transforms they end up not just using dispatch operands or the loads/stores end up in conditional regions things get crazy. Conceptually it feels better to construct things with the right ranges instead.

What are the cases where folding in a non-workgroup-id-dependent contiguous extract/insert helps codegen? Could they be solved by some better extract/insert propagation (if the issue is that you have chained use -> insert -> use you want fused)? I'm trying to reason about what the information we need is (the minimal contiguous subranges of I/O) and how that matches with the execution model if we remove it (a particular dispatch across all workgroups must only access those minimal contiguous subranges). Basically, it should always be safe and more efficient to have the extracts/inserts on the outside, so what do we gain by fusing them in and how might we get that without doing it. If it's really just fixed-point iteration while building up a dispatch region this may be something the in_parallel stuff helps with by making the ordering clearer (fusion happens there, then when lowering those to dispatch regions we hoist whatever we can to extracts/slices on the outside)?

@MaheshRavishankar (Contributor, Author)

> I think the issue with looking at the flow.dispatch.tensor.load/stores is that it's a phase ordering inversion that creates a complex hoisting/recovery/raising - if after a round or two of canonicalizations/transforms they end up not just using dispatch operands or the loads/stores end up in conditional regions things get crazy. Conceptually it feels better to construct things with the right ranges instead.
>
> What are the cases where folding in a non-workgroup-id-dependent contiguous extract/insert helps codegen? Could they be solved by some better extract/insert propagation (if the issue is that you have chained use -> insert -> use you want fused)? I'm trying to reason about what the information we need is (the minimal contiguous subranges of I/O) and how that matches with the execution model if we remove it (a particular dispatch across all workgroups must only access those minimal contiguous subranges). Basically, it should always be safe and more efficient to have the extracts/inserts on the outside, so what do we gain by fusing them in and how might we get that without doing it. If it's really just fixed-point iteration while building up a dispatch region this may be something the in_parallel stuff helps with by making the ordering clearer (fusion happens there, then when lowering those to dispatch regions we hoist whatever we can to extracts/slices on the outside)?

I am not sure I follow the question (and I'm not sure the in_parallel stuff helps with these anyway). There is no fixed-point iteration happening w.r.t. fusion (it's a greedy fusion approach; it pulls in whatever it can). If the requirement here is that it is better to leave tensor.extract_slice/tensor.insert_slice outside dispatches, then that is easy to do too (it's literally a one-line change; I'm just not sure what the requirement/ask is, because previously, when these were not fused, there were a lot of comments/issues about this not happening, so I am a bit confused about the expectation here). It's worth experimenting on that separately.

@benvanik (Collaborator)

You'd not be making this change if it didn't provide value to something you're working on and I'm saying that fusing extract/insert into dispatches that widen the subrange a dispatch operates on is a negative to the system as a whole if applied by default. I haven't articulated these expectations around extract/insert before and we have no code today explicitly trying to optimize for that - this PR is the first I've seen that potentially pessimizes things and hence the discussion. I'm trying to explore whether solving for whatever it is that caused you to make this PR is something that can also help us better align with the expectations around extract/insert being as narrow as possible. in_parallel & co intersect as that may be where fusion decision making moves to (vs your greedy stuff here) and it has extracts/inserts as part of its structure - thus if we're going to work on extract/insert fusion it seems relevant to talk about that.

Happy to pick this up in our 1:1 tomorrow.

@MaheshRavishankar (Contributor, Author)

> You'd not be making this change if it didn't provide value to something you're working on and I'm saying that fusing extract/insert into dispatches that widen the subrange a dispatch operates on is a negative to the system as a whole if applied by default. I haven't articulated these expectations around extract/insert before and we have no code today explicitly trying to optimize for that - this PR is the first I've seen that potentially pessimizes things and hence the discussion. I'm trying to explore whether solving for whatever it is that caused you to make this PR is something that can also help us better align with the expectations around extract/insert being as narrow as possible. in_parallel & co intersect as that may be where fusion decision making moves to (vs your greedy stuff here) and it has extracts/inserts as part of its structure - thus if we're going to work on extract/insert fusion it seems relevant to talk about that.
>
> Happy to pick this up in our 1:1 tomorrow.

I do understand the overall reason why fusing tensor.extract_slice/insert_slice into dispatches pessimizes things. Let me first write down what happens today with these ops:

  1. tensor.extract_slice is always fused into the dispatch where it is used. If any tensor.extract_slice is not pulled into a dispatch:
    a) Contiguous tensor.extract_slice ops get converted to flow.slice operations.
    b) Non-contiguous tensor.extract_slice ops get pulled into their own dispatch (this is a fallback needed to avoid having any tensor ops outside dispatches; I think it is rarely hit).
  2. tensor.insert_slice is never fused into the dispatches of its consumers/producers.
    a) Contiguous insert_slice ops get converted to flow.update operations.
    b) Non-contiguous insert_slice ops get pulled into their own dispatches.

I was trying to remove dispatches that contain only a tensor.insert_slice by "fusing" it with its producers. Incidentally, that also makes padding fused into its producers. Understanding that this has implications, I'll try to change this PR to be more targeted and work on pad alone.

So, some questions for the future around how to handle tensor.extract_slice/tensor.insert_slice:

  • Is there a distinction to be made w.r.t. contiguous vs. non-contiguous operations, since non-contiguous operations cannot lower to flow ops?
  • There might be (corner) cases where not fusing tensor.extract_slice operations into dispatches means some other ops that are producers of the slice op either get converted to flow.tensor ops (like reshape) or, as a fallback, need to get pulled into their own dispatch. Off the top of my head I don't know if the latter actually happens. Worth making the change and trying it out.
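As a hypothetical example of the contiguous/non-contiguous distinction for a row-major tensor (shapes invented):

```mlir
// Contiguous: full rows 2..5 with unit strides; this is one dense span
// of memory and can lower to a host-side flow op.
%c = tensor.extract_slice %t[2, 0] [4, 8] [1, 1]
       : tensor<16x8xf32> to tensor<4x8xf32>

// Non-contiguous: columns 2..5 of every row; each row contributes a
// separate strided fragment, so this must live inside a dispatch.
%n = tensor.extract_slice %t[0, 2] [16, 4] [1, 1]
       : tensor<16x8xf32> to tensor<16x4xf32>
```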

(Just to restate: I understand that having the tensor.extract_slice/insert_slice ops out of dispatches to get a better idea of the ranges being operated on is helpful, so I'm happy to make that information available.)

@MaheshRavishankar MaheshRavishankar marked this pull request as draft May 25, 2022 05:29
@MaheshRavishankar MaheshRavishankar changed the title Fuse tensor.insert_slice with producers. Fuse tensor.pad with producers. Jun 6, 2022
@MaheshRavishankar MaheshRavishankar deleted the fuse_pad_with_producer branch November 17, 2022 21:19
@MaheshRavishankar MaheshRavishankar restored the fuse_pad_with_producer branch November 17, 2022 21:19
Successfully merging this pull request may close these issues: Fusion for pad op in Linalg