
webgpu: Merge MatMulPackedProgram and MatMulPackedVec4Program #6688

Merged 4 commits into tensorflow:master on Aug 12, 2022

Conversation

qjia7 (Contributor) commented Jul 28, 2022:

This PR merges MatMulPackedVec4Program to MatMulPackedProgram and
refactors MatMulSplitKProgram.

@qjia7 qjia7 requested review from gyagp and xhcao July 28, 2022 06:43
`;
const atomicAddSnippet = (component: number) => {
return `
for (var i = 0; i < ${component}; i = i + 1)
Contributor:
Could we remove the for-loop here when component is 4, and let the atomic operations on each component run in parallel?

Contributor Author:
atomic only supports u32 or i32. What do you mean by letting the atomic operations on each component run in parallel?
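For context, the pattern under discussion is the standard workaround for this limitation: since atomics only operate on 32-bit integers, a floating-point atomic add is emulated with a compare-and-swap loop over the value's bit pattern. Below is a minimal TypeScript sketch of that same CAS-loop pattern (not tfjs code; all names here are illustrative), using `Atomics.compareExchange` on an `Int32Array` and typed-array views to "bitcast" between f32 and i32, mirroring what the WGSL snippet does with `atomicCompareExchangeWeak` and `bitcast`:

```typescript
// Emulate a floating-point atomic add on an i32 buffer, since atomics
// only support integer types. Illustrative sketch, not the tfjs kernel.
const buf = new SharedArrayBuffer(4);
const asI32 = new Int32Array(buf);
const scratch = new DataView(new ArrayBuffer(4));

// Reinterpret an f32's bits as an i32 (WGSL: bitcast<i32>(f32)).
function bitcastF32toI32(x: number): number {
  scratch.setFloat32(0, x, true);
  return scratch.getInt32(0, true);
}
// Reinterpret an i32's bits as an f32 (WGSL: bitcast<f32>(i32)).
function bitcastI32toF32(x: number): number {
  scratch.setInt32(0, x, true);
  return scratch.getFloat32(0, true);
}

function atomicAddF32(index: number, value: number): void {
  let old = Atomics.load(asI32, index);
  for (;;) {
    // Compute the new bit pattern from the last observed value.
    const newBits = bitcastF32toI32(bitcastI32toF32(old) + value);
    // compareExchange returns the value actually seen at `index`.
    const seen = Atomics.compareExchange(asI32, index, old, newBits);
    if (seen === old) break; // exchange succeeded
    old = seen;              // another thread won; retry with fresh value
  }
}

atomicAddF32(0, 1.5);
atomicAddF32(0, 2.25);
console.log(bitcastI32toF32(Atomics.load(asI32, 0))); // 3.75
```

The loop retries until no other thread has modified the slot between the load and the exchange, which is exactly why the WGSL version must re-read `old_value` on each failed iteration.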

Contributor Author:
I tried the snippet below. However, it is effectively dead-locked: all vec4 cases time out. So it seems unworkable to let the atomic operations run in parallel, if my code is correct.

    const atomicAddVec4Snippet = () => {
      return `
      var oldValue0 = atomicLoad(&(result[flatIndex]));
      var oldValue1 = atomicLoad(&(result[flatIndex + 1]));
      var oldValue2 = atomicLoad(&(result[flatIndex + 2]));
      var oldValue3 = atomicLoad(&(result[flatIndex + 3]));

      var exchanged0 = false;
      var exchanged1 = false;
      var exchanged2 = false;
      var exchanged3 = false;
      for (; !exchanged0 || !exchanged1 || !exchanged2 || !exchanged3;) {
        if (!exchanged0)
        {
          let newValue0 = bitcast<i32>(bitcast<f32>(oldValue0) + value[0]);
          let res0 = atomicCompareExchangeWeak(&(result[flatIndex]), oldValue0, newValue0);
          oldValue0 = res0.old_value;
          exchanged0 = res0.exchanged;
        }
        if (!exchanged1)
        {
          let newValue1 = bitcast<i32>(bitcast<f32>(oldValue1) + value[1]);
          let res1 = atomicCompareExchangeWeak(&(result[flatIndex + 1]), oldValue1, newValue1);
          oldValue1 = res1.old_value;
          exchanged1 = res1.exchanged;
        }
        if (!exchanged2)
        {
          let newValue2 = bitcast<i32>(bitcast<f32>(oldValue2) + value[2]);
          let res2 = atomicCompareExchangeWeak(&(result[flatIndex + 2]), oldValue2, newValue2);
          oldValue2 = res2.old_value;
          exchanged2 = res2.exchanged;
        }
        if (!exchanged3)
        {
          let newValue3 = bitcast<i32>(bitcast<f32>(oldValue3) + value[3]);
          let res3 = atomicCompareExchangeWeak(&(result[flatIndex + 3]), oldValue3, newValue3);
          oldValue3 = res3.old_value;
          exchanged3 = res3.exchanged;
        }
      }
      `;
    };

Contributor:
Let's keep your previous code for now, and see if we can optimize it in the future.

this.outputShape[2] % 4 === 0;
this.shaderKey =
`matMulSplitK_${transposeA}_${transposeB}_${batchAEqualOne}_${
batchBEqualOne}_${this.elementsPerThread}_${this.isVec4}`;
Contributor:
I did not find where the this.elementsPerThread value [4, 4, 32] would be changed. If we never change it, does it need to be added into the shaderKey? Also, you removed this.elementsPerThread[0|1] = 1; is the performance still good for small output shapes?

Contributor Author:
Sorry, I missed this one. I removed this.elementsPerThread[0|1] = 1 because we also need to consider isVec4 and transposeA to decide whether we can set it. For simplicity, I removed it. It won't impact the benchmark models, since no such shapes appear in existing models; but if we test a custom shape, there may be some difference. I would like to keep the current change and do a unified tuning for small M and N (< 16) later. What do you think?

qjia7 (Contributor Author) commented Aug 11, 2022:
On second thought, I can still keep the original change. Please take the latest change. Thanks.

globalCol + innerCol);
}
}

kStart = kStart + TileInner;
Contributor:
If splitK is true, is this statement needed?

Contributor Author:
For now, it's not needed, but it keeps the code consistent and compatible, and the overhead is small.

@@ -170,10 +319,12 @@ export function makeMatMulPackedSource(

  let globalRow = i32(globalId.y) * RowPerThread;
  let globalCol = i32(globalId.x) * ColPerThread;
- let batch = i32(globalId.z);
+ let batch = ${splitK ? '0' : 'i32(globalId.z)'};
Contributor:
If we merge it in here, would it be inconvenient in the future if we want to extend to a real batch greater than 1 when splitK is true?

Contributor Author:
I think it's the same complexity, since splitK can be used to distinguish the different paths.
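For readers unfamiliar with the split-K technique this PR refactors: the K (reduction) dimension of C = A×B is divided into slices, each slice computes a partial product, and the partial results are accumulated into the shared output, which is why the kernel needs the atomic adds discussed above and why `batch` can be fixed to 0 today. A CPU-side TypeScript sketch of the idea (a simulation under assumed names, not the WGSL kernel):

```typescript
// Simulate split-K matmul: each K-slice plays the role of one dispatch
// along z; its partial sums accumulate into the shared output C, which
// the real shader does with atomic adds.
function matmulSplitK(
    A: number[][], B: number[][], splitK: number): number[][] {
  const M = A.length, K = A[0].length, N = B[0].length;
  const C: number[][] =
      Array.from({length: M}, () => new Array(N).fill(0));
  const chunk = Math.ceil(K / splitK);
  for (let s = 0; s < splitK; s++) {
    const kStart = s * chunk;
    const kEnd = Math.min(kStart + chunk, K);
    for (let m = 0; m < M; m++) {
      for (let n = 0; n < N; n++) {
        for (let k = kStart; k < kEnd; k++) {
          C[m][n] += A[m][k] * B[k][n]; // atomicAdd in the real shader
        }
      }
    }
  }
  return C;
}

const C = matmulSplitK(
    [[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]], 2);
console.log(C); // [[58, 64], [139, 154]]
```

Because the slices merely reorder a sum, the result matches an unsplit matmul; extending to a real batch > 1 would add a batch offset where the sketch indexes C.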

gyagp left a comment:
LGTM

@gyagp gyagp merged commit a1e5aea into tensorflow:master Aug 12, 2022
@qjia7 qjia7 deleted the refactor_splitK branch August 15, 2022 02:32