
[VectorDistribution] Add support for multi-subgroup attention #18188

Merged: 4 commits merged into main on Sep 11, 2024

Conversation

@Groverkss (Contributor) commented Aug 10, 2024

This patch adds support for distributing attention across multiple subgroups.

Currently, we distinguish the two matmuls in attention by setting a discardable attribute on them during decomposition; layout anchoring uses this attribute as a hint for how to handle these matmuls when it encounters them. (Note that even if these hints were dropped, the only consequence would be a drop in performance, because layout anchoring would no longer know it is looking at attention.) The correct way to handle these matmuls would be to make mma_schedule an operation-specific lowering config and teach decomposition to propagate that config to the two matmuls it produces. That is blocked by work on TileAndDistributeToWorkgroups supporting consumer fusion, and needs some heavy lifting.
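For context, here is a minimal sketch (not the actual pass code) of how such a discardable hint can be attached during decomposition and optionally consumed by layout anchoring; the consumer-side helper name `isAttentionQKMatmul` is an assumption made for illustration:

```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// Producer side (during decomposition): tag the QK matmul with a discardable
// unit attribute. The PV matmul gets an analogous hint.
static void tagQKMatmul(OpBuilder &b, Operation *qkMatmul) {
  qkMatmul->setAttr("attention_qk_matmul", b.getUnitAttr());
}

// Consumer side (during layout anchoring): the hint is optional. If it was
// dropped somewhere along the way, anchoring simply falls back to its
// generic (slower) behavior instead of failing.
static bool isAttentionQKMatmul(Operation *op) {
  return op->hasAttr("attention_qk_matmul");
}
```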

@Groverkss Groverkss force-pushed the users/Groverkss/set-convolution-anchor branch from 80e1d84 to 4cb027b Compare August 26, 2024 12:13
@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch from dc29782 to 8073687 Compare August 26, 2024 12:20
@Groverkss Groverkss force-pushed the users/Groverkss/set-convolution-anchor branch 3 times, most recently from 14bad9d to 21dddcf Compare August 28, 2024 12:06
@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch from 8073687 to e312fa7 Compare August 28, 2024 14:45
Base automatically changed from users/Groverkss/set-convolution-anchor to main August 28, 2024 14:59
@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch from e312fa7 to 648d7ca Compare August 28, 2024 15:02
@Groverkss Groverkss requested a review from raikonenfnu August 28, 2024 15:02
@Groverkss Groverkss marked this pull request as ready for review August 28, 2024 15:40
@raikonenfnu (Collaborator) left a comment

Awesome work, overall this looks great. Just a quick question/nit, but not blocking.

@@ -297,6 +297,7 @@ OnlineAttentionOp::decomposeOperation(OpBuilder &b) {
Value sZero = b.create<arith::ConstantOp>(loc, b.getZeroAttr(elementType));
Value s = b.create<linalg::FillOp>(loc, sZero, emptyS).getResult(0);
s = computeMatmul(b, loc, getQueryMap(), getKeyMap(), sMap, query, key, s);
s.getDefiningOp()->setAttr("attention_qk_matmul", b.getUnitAttr());
@raikonenfnu (Collaborator) commented on the diff:

NIT: Any thoughts on making this a standardized/registered attribute in the LinalgExt dialect?

@MaheshRavishankar (Contributor) replied:

This is an anti-pattern. We shouldn't rely on such attributes, so I don't want to "codify" them. Let's land this for now, but plan to unwind it in the medium term. Could you add a note here that we shouldn't be relying on such attributes?
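A note of the kind being requested might look roughly like the following (the wording is only a suggestion, not the comment that actually landed):

```cpp
// TODO: This discardable attribute is only a hint for layout anchoring and
// must not be relied upon for correctness. The longer-term plan is to make
// mma_schedule an operation-specific lowering config and have decomposition
// propagate it to the two matmuls, once TileAndDistributeToWorkgroups
// supports consumer fusion.
s.getDefiningOp()->setAttr("attention_qk_matmul", b.getUnitAttr());
```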

@MaheshRavishankar (Contributor) left a comment

Is there a way you could avoid decomposing the attention operation until vector distribution, and handle the layout distribution for attention directly?

@raikonenfnu (Collaborator)

> Is there a way you could avoid decomposing the attention operation until vector distribution, and handle the layout distribution for attention directly?

Today, IIUC, we'd need to decompose earlier than vector distribution because attention decomposes into non-trivial ops such as matmuls and shuffles/reductions (and, to a lesser extent, reads and broadcasts), which require layout analysis and vector distribution to ensure the thread-distributed shapes play nicely with each other.
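(For reference, the decomposition being discussed is roughly the standard online-softmax attention update, which is where the matmuls and row-wise reductions, and hence the cross-lane shuffles, come from; the exact form produced by OnlineAttentionOp::decomposeOperation may differ in detail.)

$$
\begin{aligned}
S &= Q K^{T},\\
m' &= \max\bigl(m,\ \operatorname{rowmax}(S)\bigr),\\
P &= \exp\bigl(S - m'\bigr),\\
\ell' &= e^{\,m - m'}\,\ell + \operatorname{rowsum}(P),\\
O' &= e^{\,m - m'}\,O + P\,V.
\end{aligned}
$$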

@Groverkss (Contributor, Author)

> Is there a way you could avoid decomposing the attention operation until vector distribution, and handle the layout distribution for attention directly?

Probably not… if we did that, we would effectively be writing microkernels for attention, hardcoded for each intrinsic type at thread level. Which is fine… but I'm not sure we want to do that.

One thing we could do is perform subgroup distribution at the attention-op level and do thread distribution after decomposition. This would require a major rewrite of vector distribution, splitting it into subgroup-level and thread-level distribution, and I'm not sure we can cleanly split things up that way.

I'd rather land this patch and invest effort in teaching TileAndFuse to handle attention instead of rewriting VectorDistribution.

@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch 2 times, most recently from 6166c83 to 89a46e9 Compare August 29, 2024 15:24

@MaheshRavishankar (Contributor)

> Is there a way you could avoid decomposing the attention operation until vector distribution, and handle the layout distribution for attention directly?
>
> Probably not… if we did that, we would effectively be writing microkernels for attention, hardcoded for each intrinsic type at thread level. Which is fine… but I'm not sure we want to do that.
>
> One thing we could do is perform subgroup distribution at the attention-op level and do thread distribution after decomposition. This would require a major rewrite of vector distribution, splitting it into subgroup-level and thread-level distribution, and I'm not sure we can cleanly split things up that way.

Well, we could also just decompose within the pass as a "pre-processing" step. Then the attribute becomes an internal detail of the pass.

> I'd rather land this patch and invest effort in teaching TileAndFuse to handle attention instead of rewriting VectorDistribution.

Ok, I stamped it, but please add TODOs/warnings noting that this is unstable.

@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch from 89a46e9 to 0ff52b3 Compare September 2, 2024 10:39
@Groverkss (Contributor, Author)

There are some tests that exceed shared memory, so I'm going to wait for #18415 to land before I land this.

@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch 2 times, most recently from 7b2e0cd to eb7d6ea Compare September 10, 2024 18:42
[VectorDistribution] Add support for multi-subgroup attention

No tech debt

Add TODO comment

Add configuration heuristics for attention

address comments

address more comments

Update tests
@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch from eb7d6ea to 0b58ffc Compare September 11, 2024 14:52
@Groverkss Groverkss merged commit 60843ec into main Sep 11, 2024
40 checks passed
@Groverkss Groverkss deleted the users/Groverkss/attention-multi-subgroup branch September 11, 2024 15:33
josemonsalve2 pushed a commit to josemonsalve2/iree that referenced this pull request Sep 14, 2024
[VectorDistribution] Add support for multi-subgroup attention (iree-org#18188)
