
[WebGL] Implement packed NHWC Conv2d #6639

Merged · 15 commits · Jul 22, 2022

Conversation

Collaborator

@Linchenn Linchenn commented Jul 20, 2022

Background:
Currently, WebGL's conv2d uses the inefficient im2col algorithm, which needs to run two WebGL programs and performs many duplicate reads, so this PR replaces the im2col algorithm for some cases. As with depthwise conv2d, the code is complex, so this implementation will not be used by default until it is simplified.

This PR:

  1. Implements packed NHWC conv2d, following the packed NHWC depthwise implementation.
  2. Adds the 'WEBGL_EXP_CONV' flag (false by default). conv2d uses this implementation only when the flag is set to true; see the usage sketch below.
  3. Adds a nightly test for this implementation.

For reviewers, please review this commit to see the diff from the depthwise implementation.
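As a usage sketch (the shapes below are hypothetical; this assumes the standard tfjs environment-flag API), the experimental path can be turned on before running a model:

```ts
import * as tf from '@tensorflow/tfjs';

async function main() {
  // Opt in to the experimental packed NHWC conv2d; the flag defaults to false.
  tf.env().set('WEBGL_EXP_CONV', true);
  await tf.setBackend('webgl');

  // A non-pointwise conv2d, the case this PR targets (hypothetical shapes).
  const x = tf.randomNormal([1, 192, 192, 8]) as tf.Tensor4D;
  const w = tf.randomNormal([3, 3, 8, 8]) as tf.Tensor4D;
  const y = tf.conv2d(x, w, 1 /* strides */, 'same' /* pad */);
  y.print();
}

main();
```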


@Linchenn Linchenn requested a review from pyu10055 July 21, 2022 18:38
Collaborator

@pyu10055 pyu10055 left a comment


Thanks! Looking forward to seeing the perf improvement brought by this change.

Reviewed 1 of 1 files at r1, 4 of 4 files at r2, 2 of 2 files at r3, 2 of 2 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 2 of 2 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: 0 of 1 approvals obtained (waiting on @Linchenn)


tfjs-backend-webgl/src/conv_packed_gpu.ts line 3 at r1 (raw file):

/**
 * @license
 * Copyright 2018 Google LLC. All Rights Reserved.

2022


tfjs-backend-webgl/src/conv_packed_gpu.ts line 333 at r2 (raw file):

           mainLoop += `
               wTexel = getW(r, ${colIndex + 1}, d1, d2);
               dotProd += vec4(xC${colIndex + 1}.x, xC${colIndex + 1}.x, xC${colIndex + 1}.z, xC${colIndex + 1}.z) * vec4(wTexel.xy, wTexel.xy);

You can simplify this as xC${colIndex}.xxzz.
Please apply this to other similar places as well.
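A minimal sketch of the suggested swizzle, in the same template-string style as conv_packed_gpu.ts (colIndex and mainLoop are stand-ins for the surrounding loop state; the GLSL names come from the quoted snippet):

```ts
// Stand-ins for the shader-generation state in conv_packed_gpu.ts.
let mainLoop = '';
const colIndex = 0;

// Before: each component of the vec4 spelled out explicitly.
mainLoop += `
    wTexel = getW(r, ${colIndex + 1}, d1, d2);
    dotProd += vec4(xC${colIndex + 1}.x, xC${colIndex + 1}.x, xC${colIndex + 1}.z, xC${colIndex + 1}.z) *
        vec4(wTexel.xy, wTexel.xy);`;

// After: the GLSL swizzle .xxzz builds the same (x, x, z, z) vector directly.
mainLoop += `
    wTexel = getW(r, ${colIndex + 1}, d1, d2);
    dotProd += xC${colIndex + 1}.xxzz * vec4(wTexel.xy, wTexel.xy);`;

console.log(mainLoop);
```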


tfjs-backend-webgl/src/flags_webgl.ts line 242 at r9 (raw file):

/** Whether we will use the experimental conv op. */
ENV.registerFlag('WEBGL_EXP_CONV', () => false);

please add the test for this flag
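A minimal sketch of such a test, assuming the Jasmine-style pattern already used in flags_webgl_test.ts (the describe/it strings here are hypothetical):

```ts
import {env} from '@tensorflow/tfjs-core';

describe('WEBGL_EXP_CONV', () => {
  // Reset the environment so flag values do not leak between tests.
  beforeEach(() => env().reset());
  afterAll(() => env().reset());

  it('defaults to false', () => {
    expect(env().getBool('WEBGL_EXP_CONV')).toBe(false);
  });

  it('can be enabled', () => {
    env().set('WEBGL_EXP_CONV', true);
    expect(env().getBool('WEBGL_EXP_CONV')).toBe(true);
  });
});
```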

Collaborator Author

@Linchenn Linchenn left a comment


Thanks for the suggestions!

Reviewable status: 0 of 1 approvals obtained (waiting on @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu.ts line 3 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

2022

Done.


tfjs-backend-webgl/src/conv_packed_gpu.ts line 333 at r2 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

You can simplify this as xC${colIndex}.xxzz.
Please apply this to other similar places as well.

Done.


tfjs-backend-webgl/src/flags_webgl.ts line 242 at r9 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

please add the test for this flag

Done.

Collaborator Author

@Linchenn Linchenn left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu.ts line 72 at r10 (raw file):

     mainLoop += `
     for (int r = 0; r < ${filterHeight}; r++) {
      for (int d1 = 0; d1 < ${convInfo.inChannels}; d1 += 2) {

Through the following experiments on Chrome on a Mac, I find that the inputChannel loop should be the inner loop, rather than the filterHeight loop:

Case 1: input [1, 192, 192, 56] with filter [3, 3, 56, 64]
inputChannel loop as inner loop: 29.408638880000012 ms
inputChannel loop as outer loop: 29.418943359999933 ms

Case 2: input [1, 192, 192, 8] with filter [3, 3, 8, 8]
inputChannel loop as inner loop: 0.991897599999999 ms
inputChannel loop as outer loop: 1.0030207999999998 ms

Case 3: input [1, 768, 768, 8] with filter [3, 3, 8, 8]
inputChannel loop as inner loop: 14.807828800000006 ms
inputChannel loop as outer loop: 20.644488320000043 ms

Conclusion: Even though the two orderings usually perform about the same, making the inputChannel loop the inner loop is at least as good in every case and clearly better in Case 3, both theoretically and practically. This deserves more experiments on more devices and parameter settings, so I opened issue #6661 to follow up (a sketch of the two orderings follows).
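For reference, a minimal sketch of the two orderings being compared, written in the same template-string style as the quoted snippet (filterHeight and convInfo values here are hypothetical stand-ins, and the loop bodies are elided):

```ts
// Hypothetical stand-ins for the real values in conv_packed_gpu.ts.
const filterHeight = 3;
const convInfo = {inChannels: 8};

// Ordering A (chosen here): inputChannel (d1) as the inner loop.
const channelInner = `
  for (int r = 0; r < ${filterHeight}; r++) {
    for (int d1 = 0; d1 < ${convInfo.inChannels}; d1 += 2) {
      // ... read xC*/wTexel and accumulate dotProd ...
    }
  }`;

// Ordering B (measured for comparison): inputChannel (d1) as the outer loop.
const channelOuter = `
  for (int d1 = 0; d1 < ${convInfo.inChannels}; d1 += 2) {
    for (int r = 0; r < ${filterHeight}; r++) {
      // ... read xC*/wTexel and accumulate dotProd ...
    }
  }`;

console.log(channelInner, channelOuter);
```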

@Linchenn Linchenn requested a review from pyu10055 July 21, 2022 22:01
Collaborator

@pyu10055 pyu10055 left a comment


Reviewed 2 of 2 files at r10, all commit messages.
Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @Linchenn and @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu.ts line 72 at r10 (raw file):

Previously, Linchenn wrote…

[loop-order experiment results quoted above]

Thank you for the quick experiment.

@Linchenn Linchenn requested review from lina128 and qjia7 July 21, 2022 22:35
Contributor

@qjia7 qjia7 left a comment


LGTM, Thanks.

@@ -41,6 +42,18 @@ export function conv2d(
convInfo.strideHeight === 1 && convInfo.strideWidth === 1 &&
(convInfo.padInfo.type === 'SAME' || convInfo.padInfo.type === 'VALID')) {
out = conv2dByMatMul({x, filter, convInfo, backend});
} else if (convInfo.strideWidth <= 2 && $dataFormat === 'channelsLast'
Contributor

Please note that we may need stricter conditions for taking this path, for example convInfo.dilationWidth === 1 and convInfo.filterWidth > 3; for that kind of conv2d, Conv2DPackedProgram may get more benefit. Also try the impact of large and small convInfo.inChannels. We need a better idea of the threshold for taking the Conv2DPackedProgram path in the future.
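As an illustration only (not the PR's actual dispatch code), a sketch of what a tightened condition along these lines could look like; the field names follow the quoted diff, and the extra checks are the examples from this comment, to be tuned by benchmarking:

```ts
// Hypothetical helper: decide whether to take the packed NHWC conv2d path.
function shouldUsePackedConv2D(
    convInfo: {strideWidth: number; dilationWidth: number; filterWidth: number},
    dataFormat: 'channelsLast'|'channelsFirst',
    expConvFlagEnabled: boolean): boolean {
  return expConvFlagEnabled && dataFormat === 'channelsLast' &&
      convInfo.strideWidth <= 2 &&
      convInfo.dilationWidth === 1 &&  // extra condition suggested here
      convInfo.filterWidth > 3;        // extra condition suggested here
}
```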

Collaborator Author

Thank you, Jiajia! I am not clear on why we need stricter conditions here.

As we know, Conv2DPackedProgram is more efficient than the following two branches (conv2dWithIm2Row and Conv2DProgram), because it minimizes the number of 'read' instructions (such as getX, getW) in the shaders.

Currently, Conv2DPackedProgram only implements the NHWC format. However, the NCHW format works better for larger filters, so we may implement it in the future. Do you mean adding stricter conditions so that an NCHW Conv2DPackedProgram handles those cases?

(Since we are discussing future work, I merged this PR.)

Contributor

Personally, I don't think Conv2DPackedProgram is always better than conv2dWithIm2Row, even with the condition that convInfo.strideWidth <= 2. But currently, almost all the models in the benchmarks satisfy this condition, so you can easily test this new algorithm there. I tested your PR locally with the resnet50 model in the benchmarks on Intel CFL, and it becomes worse when WEBGL_EXP_CONV is true. So I think the conditions should be changed, at least on Intel CFL. I use resnet50 because it has many non-pointwise conv2d ops, which better show the performance difference. That's why I suspect we need stricter conditions. Have you run some benchmarks and seen clear advantages for Conv2DPackedProgram compared with conv2dWithIm2Row?

Collaborator Author

Thank you, Jiajia, for benchmarking this!

From my original perspective, Conv2DPackedProgram aims to optimize all non-pointwise conv2d ops relative to conv2dWithIm2Row (from your comments, this view needs more experiments to support it). Models whose inference time is dominated by non-pointwise conv2d should perform better when WEBGL_EXP_CONV is set to true.

Conv2DPackedProgram performs great for one of our internal models whose inference time is dominated by non-pointwise conv2d (the model speeds up from 170ms to 50ms). A comprehensive benchmarking pass is a great idea; I will do it and share the results later.

I just tested bodypix with the ResNet50 base on my Mac, and performance is similar with WEBGL_EXP_CONV true and false. (The aggregated Conv2D time drops from 73.09ms to 52.11ms when the flag is true; it's weird that the final inference time stays about the same, so I will look into this.)
[screenshot omitted]
Since I do not have an Intel CFL machine, I am curious how much worse it is on Intel CFL?

Contributor

@Linchenn Thanks for your data. I rechecked it on Intel CFL, TGL, and an NV RTX3090 on Windows. All of them show obvious regressions on bodypix-ResNet50-image-0.5. Since Conv2DPackedProgram is behind a flag, it's OK as it is for now, but once we plan to enable it by default, we really need to be careful about the conditions for using it.

Take Intel TGL as an example:

  • WEBGL_EXP_CONV as true

[screenshots: webgl_exp_conv_true, webgl_exp_conv_true_avg_time, webgl_exp_env_true_individual]

  • WEBGL_EXP_CONV as false

[screenshots: webgl_exp_conv_false, webgl_exp_conv_false_avg_time, webgl_exp_conv_false_individual]

Collaborator Author

Thank you, Jiajia, for the detailed checks on various devices! I will verify this before removing the flag and share the results with you later.

Collaborator

@lina128 lina128 left a comment


Thank you so much Lin!

Reviewable status: :shipit: complete! 3 of 1 approvals obtained (waiting on @Linchenn and @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu.ts line 72 at r10 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

Thank you for the quick experiment.

Great insight, thank you!
