
webgpu: Merge MatMulPackedProgram and MatMulPackedVec4Program #6688

Merged 4 commits into tensorflow:master on Aug 12, 2022

Conversation

qjia7 (Contributor) commented Jul 28, 2022:

This PR merges MatMulPackedVec4Program to MatMulPackedProgram and
refactors MatMulSplitKProgram.

@qjia7 qjia7 requested review from gyagp and xhcao July 28, 2022 06:43
`;
const atomicAddSnippet = (component: number) => {
return `
for (var i = 0; i < ${component}; i = i + 1)
Contributor:
Could we remove the for-loop here when component is 4, and let the atomic operations on each component run in parallel?

Contributor Author:
atomic only supports u32 or i32. What do you mean by letting the atomic operations on each component run in parallel?
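For context, the pattern under discussion is the standard workaround for this limitation: since atomics only operate on 32-bit integers, a floating-point atomic add is emulated with a compare-and-swap loop over the value's bit pattern. Below is a minimal TypeScript sketch of that same CAS-loop pattern (not tfjs code; all names here are illustrative), using `Atomics.compareExchange` on an `Int32Array` and typed-array views to "bitcast" between f32 and i32, mirroring what the WGSL snippet does with `atomicCompareExchangeWeak` and `bitcast`:

```typescript
// Emulate a floating-point atomic add on an i32 buffer, since atomics
// only support integer types. Illustrative sketch, not the tfjs kernel.
const buf = new SharedArrayBuffer(4);
const asI32 = new Int32Array(buf);
const scratch = new DataView(new ArrayBuffer(4));

// Reinterpret an f32's bits as an i32 (WGSL: bitcast<i32>(f32)).
function bitcastF32toI32(x: number): number {
  scratch.setFloat32(0, x, true);
  return scratch.getInt32(0, true);
}
// Reinterpret an i32's bits as an f32 (WGSL: bitcast<f32>(i32)).
function bitcastI32toF32(x: number): number {
  scratch.setInt32(0, x, true);
  return scratch.getFloat32(0, true);
}

function atomicAddF32(index: number, value: number): void {
  let old = Atomics.load(asI32, index);
  for (;;) {
    // Compute the new bit pattern from the last observed value.
    const newBits = bitcastF32toI32(bitcastI32toF32(old) + value);
    // compareExchange returns the value actually seen at `index`.
    const seen = Atomics.compareExchange(asI32, index, old, newBits);
    if (seen === old) break; // exchange succeeded
    old = seen;              // another thread won; retry with fresh value
  }
}

atomicAddF32(0, 1.5);
atomicAddF32(0, 2.25);
console.log(bitcastI32toF32(Atomics.load(asI32, 0))); // 3.75
```

The loop retries until no other thread has modified the slot between the load and the exchange, which is exactly why the WGSL version must re-read `old_value` on each failed iteration.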

Contributor Author:
I tried the snippet below. However, it is effectively dead-locked: all vec4 cases time out. So it seems unworkable to let the atomic operations run in parallel, if my code is correct.

    const atomicAddVec4Snippet = () => {
      return `
      var oldValue0 = atomicLoad(&(result[flatIndex]));
      var oldValue1 = atomicLoad(&(result[flatIndex + 1]));
      var oldValue2 = atomicLoad(&(result[flatIndex + 2]));
      var oldValue3 = atomicLoad(&(result[flatIndex + 3]));

      var exchanged0 = false;
      var exchanged1 = false;
      var exchanged2 = false;
      var exchanged3 = false;
      for (; !exchanged0 || !exchanged1 || !exchanged2 || !exchanged3;) {
        if (!exchanged0)
        {
          let newValue0 = bitcast<i32>(bitcast<f32>(oldValue0) + value[0]);
          let res0 = atomicCompareExchangeWeak(&(result[flatIndex]), oldValue0, newValue0);
          oldValue0 = res0.old_value;
          exchanged0 = res0.exchanged;
        }
        if (!exchanged1)
        {
          let newValue1 = bitcast<i32>(bitcast<f32>(oldValue1) + value[1]);
          let res1 = atomicCompareExchangeWeak(&(result[flatIndex + 1]), oldValue1, newValue1);
          oldValue1 = res1.old_value;
          exchanged1 = res1.exchanged;
        }
        if (!exchanged2)
        {
          let newValue2 = bitcast<i32>(bitcast<f32>(oldValue2) + value[2]);
          let res2 = atomicCompareExchangeWeak(&(result[flatIndex + 2]), oldValue2, newValue2);
          oldValue2 = res2.old_value;
          exchanged2 = res2.exchanged;
        }
        if (!exchanged3)
        {
          let newValue3 = bitcast<i32>(bitcast<f32>(oldValue3) + value[3]);
          let res3 = atomicCompareExchangeWeak(&(result[flatIndex + 3]), oldValue3, newValue3);
          oldValue3 = res3.old_value;
          exchanged3 = res3.exchanged;
        }
      }
      `;
    };

Contributor:
Let's keep your previous code for now, and see if we can optimize it in the future.

this.outputShape[2] % 4 === 0;
this.shaderKey =
`matMulSplitK_${transposeA}_${transposeB}_${batchAEqualOne}_${
batchBEqualOne}_${this.elementsPerThread}_${this.isVec4}`;
Contributor:
I did not find where the this.elementsPerThread value [4, 4, 32] would be changed. If we never change it, does it need to be added into the shaderKey? Also, you removed this.elementsPerThread[0|1] = 1; is the performance still good for small output shapes?

Contributor Author:
Sorry, I missed this one. I removed this.elementsPerThread[0|1] = 1 because we also need to consider isVec4 and transposeA to decide whether we can set it. For simplicity, I removed it. It won't impact the benchmark models, since no such shapes appear in existing models; but if we test a custom shape, there may be some difference. I would like to keep the current change and do a unified tuning for small M and N (< 16) later. What do you think?

qjia7 (Contributor Author) commented Aug 11, 2022:
On second thought, I can still keep the original change. Please take the latest change. Thanks.

globalCol + innerCol);
}
}

kStart = kStart + TileInner;
Contributor:
If splitK is true, is this statement needed?

Contributor Author:
For now, it's not needed, but it keeps the code consistent and compatible, and the overhead is small.

@@ -170,10 +319,12 @@ export function makeMatMulPackedSource(

  let globalRow = i32(globalId.y) * RowPerThread;
  let globalCol = i32(globalId.x) * ColPerThread;
- let batch = i32(globalId.z);
+ let batch = ${splitK ? '0' : 'i32(globalId.z)'};
Contributor:
If we merge it in here, would it be inconvenient in the future if we want to extend to a real batch greater than 1 when splitK is true?

Contributor Author:
I think it's the same complexity, since splitK can be used to distinguish the different paths.
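For readers unfamiliar with the split-K technique this PR refactors: the K (reduction) dimension of C = A×B is divided into slices, each slice computes a partial product, and the partial results are accumulated into the shared output, which is why the kernel needs the atomic adds discussed above and why `batch` can be fixed to 0 today. A CPU-side TypeScript sketch of the idea (a simulation under assumed names, not the WGSL kernel):

```typescript
// Simulate split-K matmul: each K-slice plays the role of one dispatch
// along z; its partial sums accumulate into the shared output C, which
// the real shader does with atomic adds.
function matmulSplitK(
    A: number[][], B: number[][], splitK: number): number[][] {
  const M = A.length, K = A[0].length, N = B[0].length;
  const C: number[][] =
      Array.from({length: M}, () => new Array(N).fill(0));
  const chunk = Math.ceil(K / splitK);
  for (let s = 0; s < splitK; s++) {
    const kStart = s * chunk;
    const kEnd = Math.min(kStart + chunk, K);
    for (let m = 0; m < M; m++) {
      for (let n = 0; n < N; n++) {
        for (let k = kStart; k < kEnd; k++) {
          C[m][n] += A[m][k] * B[k][n]; // atomicAdd in the real shader
        }
      }
    }
  }
  return C;
}

const C = matmulSplitK(
    [[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]], 2);
console.log(C); // [[58, 64], [139, 154]]
```

Because the slices merely reorder a sum, the result matches an unsplit matmul; extending to a real batch > 1 would add a batch offset where the sketch indexes C.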

gyagp left a comment:
LGTM

@gyagp gyagp merged commit a1e5aea into tensorflow:master Aug 12, 2022
@qjia7 qjia7 deleted the refactor_splitK branch August 15, 2022 02:32