
[VectorDistribution] Add support for multi-subgroup attention #18188

Merged: 4 commits merged into main on Sep 11, 2024

Conversation

@Groverkss (Contributor) commented Aug 10, 2024

This patch adds support for distributing attention across multiple subgroups.

Currently, we distinguish the two matmuls in attention by setting a discardable attribute on them during decomposition; layout anchoring uses this attribute as a hint for how to handle these matmuls when it encounters them. (Note that even if these hints were dropped, the only consequence would be a drop in performance, because layout anchoring would no longer know it is looking at attention.) The correct way to handle these matmuls would be to make mma_schedule an operation-specific lowering config and teach decomposition to propagate that config to the two matmuls it produces. That is blocked by work on TileAndDistributeToWorkgroups supporting consumer fusion, and needs some heavy lifting.
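For context, here is a minimal sketch (not the actual pass code) of how such a discardable hint can be attached during decomposition and optionally consumed by layout anchoring; the consumer-side helper name `isAttentionQKMatmul` is an assumption made for illustration:

```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// Producer side (during decomposition): tag the QK matmul with a discardable
// unit attribute. The PV matmul gets an analogous hint.
static void tagQKMatmul(OpBuilder &b, Operation *qkMatmul) {
  qkMatmul->setAttr("attention_qk_matmul", b.getUnitAttr());
}

// Consumer side (during layout anchoring): the hint is optional. If it was
// dropped somewhere along the way, anchoring simply falls back to its
// generic (slower) behavior instead of failing.
static bool isAttentionQKMatmul(Operation *op) {
  return op->hasAttr("attention_qk_matmul");
}
```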

@Groverkss Groverkss force-pushed the users/Groverkss/set-convolution-anchor branch from 80e1d84 to 4cb027b Compare August 26, 2024 12:13
@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch from dc29782 to 8073687 Compare August 26, 2024 12:20
@Groverkss Groverkss force-pushed the users/Groverkss/set-convolution-anchor branch 3 times, most recently from 14bad9d to 21dddcf Compare August 28, 2024 12:06
@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch from 8073687 to e312fa7 Compare August 28, 2024 14:45
Base automatically changed from users/Groverkss/set-convolution-anchor to main August 28, 2024 14:59
@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch from e312fa7 to 648d7ca Compare August 28, 2024 15:02
@Groverkss Groverkss requested a review from raikonenfnu August 28, 2024 15:02
@Groverkss Groverkss marked this pull request as ready for review August 28, 2024 15:40
@raikonenfnu (Collaborator) left a comment

Awesome work, overall this looks great. Just a quick question/nit, but not blocking.

@@ -297,6 +297,7 @@ OnlineAttentionOp::decomposeOperation(OpBuilder &b) {
Value sZero = b.create<arith::ConstantOp>(loc, b.getZeroAttr(elementType));
Value s = b.create<linalg::FillOp>(loc, sZero, emptyS).getResult(0);
s = computeMatmul(b, loc, getQueryMap(), getKeyMap(), sMap, query, key, s);
s.getDefiningOp()->setAttr("attention_qk_matmul", b.getUnitAttr());
@raikonenfnu (Collaborator) commented on the diff:

NIT: Any thoughts on making this a standardized/registered attribute in the LinalgExt dialect?

@MaheshRavishankar (Contributor) replied:

This is an anti-pattern. We shouldn't rely on such attributes, so I don't want to "codify" them. Let's land this for now, but plan to unwind it in the medium term. Could you add a note here that we shouldn't be relying on such attributes?
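A note of the kind being requested might look roughly like the following (the wording is only a suggestion, not the comment that actually landed):

```cpp
// TODO: This discardable attribute is only a hint for layout anchoring and
// must not be relied upon for correctness. The longer-term plan is to make
// mma_schedule an operation-specific lowering config and have decomposition
// propagate it to the two matmuls, once TileAndDistributeToWorkgroups
// supports consumer fusion.
s.getDefiningOp()->setAttr("attention_qk_matmul", b.getUnitAttr());
```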

@MaheshRavishankar (Contributor) left a comment

Is there a way you could avoid decomposing the attention operation until vector distribution, and handle the layout distribution for attention directly?

@raikonenfnu (Collaborator)

> Is there a way you could avoid decomposing the attention operation until vector distribution, and handle the layout distribution for attention directly?

Today, IIUC, we'd need to decompose earlier than vector distribution because attention decomposes into non-trivial ops such as matmuls and shuffles/reductions (and, to a lesser extent, reads and broadcasts), which require layout analysis and vector distribution to ensure the thread-distributed shapes play nicely with each other.
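(For reference, the decomposition being discussed is roughly the standard online-softmax attention update, which is where the matmuls and row-wise reductions, and hence the cross-lane shuffles, come from; the exact form produced by OnlineAttentionOp::decomposeOperation may differ in detail.)

$$
\begin{aligned}
S &= Q K^{T},\\
m' &= \max\bigl(m,\ \operatorname{rowmax}(S)\bigr),\\
P &= \exp\bigl(S - m'\bigr),\\
\ell' &= e^{\,m - m'}\,\ell + \operatorname{rowsum}(P),\\
O' &= e^{\,m - m'}\,O + P\,V.
\end{aligned}
$$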

@Groverkss (Contributor, Author)

> Is there a way you could avoid decomposing the attention operation until vector distribution, and handle the layout distribution for attention directly?

Probably not… if we did that, we would effectively be writing microkernels for attention, hardcoded for each intrinsic type at thread level. Which is fine… but I'm not sure we want to do that.

One thing we could do is perform subgroup distribution at the attention-op level and do thread distribution after decomposition. This would require a major rewrite of vector distribution, splitting it into subgroup-level and thread-level distribution, and I'm not sure we can cleanly split things up that way.

I'd rather land this patch and invest effort in teaching TileAndFuse to handle attention instead of rewriting VectorDistribution.

@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch 2 times, most recently from 6166c83 to 89a46e9 Compare August 29, 2024 15:24

@MaheshRavishankar (Contributor)

> Is there a way you could avoid decomposing the attention operation until vector distribution, and handle the layout distribution for attention directly?
>
> Probably not… if we did that, we would effectively be writing microkernels for attention, hardcoded for each intrinsic type at thread level. Which is fine… but I'm not sure we want to do that.
>
> One thing we could do is perform subgroup distribution at the attention-op level and do thread distribution after decomposition. This would require a major rewrite of vector distribution, splitting it into subgroup-level and thread-level distribution, and I'm not sure we can cleanly split things up that way.

Well, we could also just decompose within the pass as a "pre-processing" step. Then the attribute becomes an internal detail of the pass.

> I'd rather land this patch and invest effort in teaching TileAndFuse to handle attention instead of rewriting VectorDistribution.

Ok, I stamped it, but please add TODOs/warnings noting that this is unstable.

@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch from 89a46e9 to 0ff52b3 Compare September 2, 2024 10:39
@Groverkss (Contributor, Author)

There are some tests that exceed shared memory, so I'm going to wait for #18415 to land before I land this.

@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch 2 times, most recently from 7b2e0cd to eb7d6ea Compare September 10, 2024 18:42
[VectorDistribution] Add support for multi-subgroup attention

No tech debt

Add TODO comment

Add configuration heuristics for attention

address comments

address more comments

Update tests
@Groverkss Groverkss force-pushed the users/Groverkss/attention-multi-subgroup branch from eb7d6ea to 0b58ffc Compare September 11, 2024 14:52
@Groverkss Groverkss merged commit 60843ec into main Sep 11, 2024
40 checks passed
@Groverkss Groverkss deleted the users/Groverkss/attention-multi-subgroup branch September 11, 2024 15:33
josemonsalve2 pushed a commit to josemonsalve2/iree that referenced this pull request Sep 14, 2024
[VectorDistribution] Add support for multi-subgroup attention (iree-org#18188)
