-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
webgpu: Merge MatMulPackedProgram and MatMulPackedVec4Program #6688
Conversation
This PR merges MatMulPackedVec4Program to MatMulPackedProgram and refactors MatMulSplitKProgram.
`; | ||
const atomicAddSnippet = (component: number) => { | ||
return ` | ||
for (var i = 0; i < ${component}; i = i + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we remove the for-loop here if component
is 4, and let atomic operation of each component to be parallel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried below snippet. However, it's kind of dead-locked. All vec4 cases show timeout. So it seems not workable to let atomic operation to be parallel if my code is correct.
const atomicAddVec4Snippet = () => {
return `
var oldValue0 = atomicLoad(&(result[flatIndex]));
var oldValue1 = atomicLoad(&(result[flatIndex + 1]));
var oldValue2 = atomicLoad(&(result[flatIndex + 2]));
var oldValue3 = atomicLoad(&(result[flatIndex + 3]));
var exchanged0 = false;
var exchanged1 = false;
var exchanged2 = false;
var exchanged3 = false;
for (; !exchanged0 || !exchanged1 || !exchanged2 || !exchanged3;) {
if (!exchanged0)
{
let newValue0 = bitcast<i32>(bitcast<f32>(oldValue0) + value[0]);
let res0 = atomicCompareExchangeWeak(&(result[flatIndex]), oldValue0, newValue0);
oldValue0 = res0.old_value;
exchanged0 = res0.exchanged;
}
if (!exchanged1)
{
let newValue1 = bitcast<i32>(bitcast<f32>(oldValue1) + value[1]);
let res1 = atomicCompareExchangeWeak(&(result[flatIndex + 1]), oldValue1, newValue1);
oldValue1 = res1.old_value;
exchanged1 = res1.exchanged;
}
if (exchanged2)
{
let newValue2 = bitcast<i32>(bitcast<f32>(oldValue2) + value[2]);
let res2 = atomicCompareExchangeWeak(&(result[flatIndex + 2]), oldValue2, newValue2);
oldValue2 = res2.old_value;
exchanged2 = res2.exchanged;
}
if (!exchanged3)
{
let newValue3 = bitcast<i32>(bitcast<f32>(oldValue3) + value[3]);
let res3 = atomicCompareExchangeWeak(&(result[flatIndex + 3]), oldValue3, newValue3);
oldValue3 = res3.old_value;
exchanged3 = res3.exchanged;
}
}
`;
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep your previous code currently, and see if we could optimize it in the future.
this.outputShape[2] % 4 === 0; | ||
this.shaderKey = | ||
`matMulSplitK_${transposeA}_${transposeB}_${batchAEqualOne}_${ | ||
batchBEqualOne}_${this.elementsPerThread}_${this.isVec4}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not find where would change the this.elementsPerThread
value [4, 4, 32]. If we would not change it, it is needed to be added into the shaderKey. You had removed this.elementsPerThread[0|1] = 1
, was the performance still good for small output shape?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, missed this one. Removing this.elementsPerThread[0|1] = 1
is because we also need to consider isVec4
and transposeA
to decide whether we can set this.elementsPerThread[0|1] = 1
. For simplicity, I removed them. It won't impact the benchmark models since there aren't such kind of shapes in existed models. But if we custom a shape and test. I think it may have some difference. I would like to keep the current change and have a unified tuning on small M and N (<16). How do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A second thought, I still can keep the original change. Please take the latest change. Thanks.
globalCol + innerCol); | ||
} | ||
} | ||
|
||
kStart = kStart + TileInner; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If splitK is true, Is the statement needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, it's not needed. But it keeps the code consistency and compatibility and the overhead is small.
@@ -170,10 +319,12 @@ export function makeMatMulPackedSource( | |||
|
|||
let globalRow = i32(globalId.y) * RowPerThread; | |||
let globalCol = i32(globalId.x) * ColPerThread; | |||
let batch = i32(globalId.z); | |||
let batch = ${splitK ? '0' : 'i32(globalId.z)'}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we merge it into here, would it be not convenient in the future if we want to extend the real batch is greater than 1 for splitK is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's the same complexity since splitK
can be used to distinguish the different path.
`; | ||
const atomicAddSnippet = (component: number) => { | ||
return ` | ||
for (var i = 0; i < ${component}; i = i + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep your previous code currently, and see if we could optimize it in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This PR merges MatMulPackedVec4Program to MatMulPackedProgram and
refactors MatMulSplitKProgram.
To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.
This change is