
[WebGL] Implement packed NHWC Conv2d #6639

Merged · 15 commits · Jul 22, 2022

Conversation

Collaborator

@Linchenn Linchenn commented Jul 20, 2022

Background:
Currently, WebGL's conv2d uses the inefficient im2col algorithm, which needs to run two WebGL programs and performs many duplicate reads, so this PR replaces the im2col algorithm for some cases. As with depthwise conv2d, the code is complex, so this implementation will not be used by default until it is simplified.

This PR:

  1. Implements packed NHWC conv2d, following the packed NHWC depthwise implementation.
  2. Adds the 'WEBGL_EXP_CONV' flag (false by default). conv2d uses this implementation only when the flag is set to true; see the usage sketch below.
  3. Adds a nightly test for this implementation.

For reviewers, please review this commit to see the diff from the depthwise implementation.
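As a usage sketch (the shapes below are hypothetical; this assumes the standard tfjs environment-flag API), the experimental path can be turned on before running a model:

```ts
import * as tf from '@tensorflow/tfjs';

async function main() {
  // Opt in to the experimental packed NHWC conv2d; the flag defaults to false.
  tf.env().set('WEBGL_EXP_CONV', true);
  await tf.setBackend('webgl');

  // A non-pointwise conv2d, the case this PR targets (hypothetical shapes).
  const x = tf.randomNormal([1, 192, 192, 8]) as tf.Tensor4D;
  const w = tf.randomNormal([3, 3, 8, 8]) as tf.Tensor4D;
  const y = tf.conv2d(x, w, 1 /* strides */, 'same' /* pad */);
  y.print();
}

main();
```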


@Linchenn Linchenn requested a review from pyu10055 July 21, 2022 18:38
Collaborator

@pyu10055 pyu10055 left a comment


Thanks! Looking forward to seeing the perf improvement brought by this change.

Reviewed 1 of 1 files at r1, 4 of 4 files at r2, 2 of 2 files at r3, 2 of 2 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 2 of 2 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: 0 of 1 approvals obtained (waiting on @Linchenn)


tfjs-backend-webgl/src/conv_packed_gpu.ts line 3 at r1 (raw file):

/**
 * @license
 * Copyright 2018 Google LLC. All Rights Reserved.

2022


tfjs-backend-webgl/src/conv_packed_gpu.ts line 333 at r2 (raw file):

           mainLoop += `
               wTexel = getW(r, ${colIndex + 1}, d1, d2);
               dotProd += vec4(xC${colIndex + 1}.x, xC${colIndex + 1}.x, xC${colIndex + 1}.z, xC${colIndex + 1}.z) * vec4(wTexel.xy, wTexel.xy);

You can simplify this as xC${colIndex}.xxzz.
Please apply this to other similar places as well.
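A minimal sketch of the suggested swizzle, in the same template-string style as conv_packed_gpu.ts (colIndex and mainLoop are stand-ins for the surrounding loop state; the GLSL names come from the quoted snippet):

```ts
// Stand-ins for the shader-generation state in conv_packed_gpu.ts.
let mainLoop = '';
const colIndex = 0;

// Before: each component of the vec4 spelled out explicitly.
mainLoop += `
    wTexel = getW(r, ${colIndex + 1}, d1, d2);
    dotProd += vec4(xC${colIndex + 1}.x, xC${colIndex + 1}.x, xC${colIndex + 1}.z, xC${colIndex + 1}.z) *
        vec4(wTexel.xy, wTexel.xy);`;

// After: the GLSL swizzle .xxzz builds the same (x, x, z, z) vector directly.
mainLoop += `
    wTexel = getW(r, ${colIndex + 1}, d1, d2);
    dotProd += xC${colIndex + 1}.xxzz * vec4(wTexel.xy, wTexel.xy);`;

console.log(mainLoop);
```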


tfjs-backend-webgl/src/flags_webgl.ts line 242 at r9 (raw file):

/** Whether we will use the experimental conv op. */
ENV.registerFlag('WEBGL_EXP_CONV', () => false);

please add the test for this flag
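A minimal sketch of such a test, assuming the Jasmine-style pattern already used in flags_webgl_test.ts (the describe/it strings here are hypothetical):

```ts
import {env} from '@tensorflow/tfjs-core';

describe('WEBGL_EXP_CONV', () => {
  // Reset the environment so flag values do not leak between tests.
  beforeEach(() => env().reset());
  afterAll(() => env().reset());

  it('defaults to false', () => {
    expect(env().getBool('WEBGL_EXP_CONV')).toBe(false);
  });

  it('can be enabled', () => {
    env().set('WEBGL_EXP_CONV', true);
    expect(env().getBool('WEBGL_EXP_CONV')).toBe(true);
  });
});
```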

Collaborator Author

@Linchenn Linchenn left a comment


Thanks for the suggestions!

Reviewable status: 0 of 1 approvals obtained (waiting on @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu.ts line 3 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

2022

Done.


tfjs-backend-webgl/src/conv_packed_gpu.ts line 333 at r2 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

You can simplify this as xC${colIndex}.xxzz.
Please apply this to other similar places as well.

Done.


tfjs-backend-webgl/src/flags_webgl.ts line 242 at r9 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

please add the test for this flag

Done.

Collaborator Author

@Linchenn Linchenn left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu.ts line 72 at r10 (raw file):

     mainLoop += `
     for (int r = 0; r < ${filterHeight}; r++) {
      for (int d1 = 0; d1 < ${convInfo.inChannels}; d1 += 2) {

Through the following experiments on Chrome on a Mac, I find that the inputChannel loop should be the inner loop, rather than the filterHeight loop:

Case 1: input [1, 192, 192, 56] with filter [3, 3, 56, 64]
inputChannel loop as inner loop: 29.408638880000012 ms
inputChannel loop as outer loop: 29.418943359999933 ms

Case 2: input [1, 192, 192, 8] with filter [3, 3, 8, 8]
inputChannel loop as inner loop: 0.991897599999999 ms
inputChannel loop as outer loop: 1.0030207999999998 ms

Case 3: input [1, 768, 768, 8] with filter [3, 3, 8, 8]
inputChannel loop as inner loop: 14.807828800000006 ms
inputChannel loop as outer loop: 20.644488320000043 ms

Conclusion: Even though the two orderings usually perform about the same, making the inputChannel loop the inner loop is at least as good in every case and clearly better in Case 3, both theoretically and practically. This deserves more experiments on more devices and parameter settings, so I opened issue #6661 to follow up (a sketch of the two orderings follows).
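For reference, a minimal sketch of the two orderings being compared, written in the same template-string style as the quoted snippet (filterHeight and convInfo values here are hypothetical stand-ins, and the loop bodies are elided):

```ts
// Hypothetical stand-ins for the real values in conv_packed_gpu.ts.
const filterHeight = 3;
const convInfo = {inChannels: 8};

// Ordering A (chosen here): inputChannel (d1) as the inner loop.
const channelInner = `
  for (int r = 0; r < ${filterHeight}; r++) {
    for (int d1 = 0; d1 < ${convInfo.inChannels}; d1 += 2) {
      // ... read xC*/wTexel and accumulate dotProd ...
    }
  }`;

// Ordering B (measured for comparison): inputChannel (d1) as the outer loop.
const channelOuter = `
  for (int d1 = 0; d1 < ${convInfo.inChannels}; d1 += 2) {
    for (int r = 0; r < ${filterHeight}; r++) {
      // ... read xC*/wTexel and accumulate dotProd ...
    }
  }`;

console.log(channelInner, channelOuter);
```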

@Linchenn Linchenn requested a review from pyu10055 July 21, 2022 22:01
Collaborator

@pyu10055 pyu10055 left a comment


Reviewed 2 of 2 files at r10, all commit messages.
Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @Linchenn and @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu.ts line 72 at r10 (raw file):

Previously, Linchenn wrote…

[loop-order experiment results quoted above]

Thank you for the quick experiment.

@Linchenn Linchenn requested review from lina128 and qjia7 July 21, 2022 22:35
Contributor

@qjia7 qjia7 left a comment


LGTM, Thanks.

@@ -41,6 +42,18 @@ export function conv2d(
convInfo.strideHeight === 1 && convInfo.strideWidth === 1 &&
(convInfo.padInfo.type === 'SAME' || convInfo.padInfo.type === 'VALID')) {
out = conv2dByMatMul({x, filter, convInfo, backend});
} else if (convInfo.strideWidth <= 2 && $dataFormat === 'channelsLast'
Contributor

Please note that we may need stricter conditions for taking this path, for example convInfo.dilationWidth === 1 and convInfo.filterWidth > 3; for that kind of conv2d, Conv2DPackedProgram may get more benefit. Also try the impact of large and small convInfo.inChannels. We need a better idea of the threshold for taking the Conv2DPackedProgram path in the future.
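As an illustration only (not the PR's actual dispatch code), a sketch of what a tightened condition along these lines could look like; the field names follow the quoted diff, and the extra checks are the examples from this comment, to be tuned by benchmarking:

```ts
// Hypothetical helper: decide whether to take the packed NHWC conv2d path.
function shouldUsePackedConv2D(
    convInfo: {strideWidth: number; dilationWidth: number; filterWidth: number},
    dataFormat: 'channelsLast'|'channelsFirst',
    expConvFlagEnabled: boolean): boolean {
  return expConvFlagEnabled && dataFormat === 'channelsLast' &&
      convInfo.strideWidth <= 2 &&
      convInfo.dilationWidth === 1 &&  // extra condition suggested here
      convInfo.filterWidth > 3;        // extra condition suggested here
}
```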

Collaborator Author

Thank you, Jiajia! I am not clear on why we need stricter conditions here.

As we know, Conv2DPackedProgram is more efficient than the following two branches (conv2dWithIm2Row and Conv2DProgram), because it minimizes the number of 'read' instructions (such as getX, getW) in the shaders.

Currently, Conv2DPackedProgram only implements the NHWC format. However, the NCHW format works better for larger filters, so we may implement it in the future. Do you mean adding stricter conditions so that an NCHW Conv2DPackedProgram handles those cases?

(Since we are discussing future work, I merged this PR.)

Contributor

Personally, I don't think Conv2DPackedProgram is always better than conv2dWithIm2Row, even with the condition that convInfo.strideWidth <= 2. But currently, almost all the models in the benchmarks satisfy this condition, so you can easily test this new algorithm there. I tested your PR locally with the resnet50 model in the benchmarks on Intel CFL, and it becomes worse when WEBGL_EXP_CONV is true. So I think the conditions should be changed, at least on Intel CFL. I use resnet50 because it has many non-pointwise conv2d ops, which better show the performance difference. That's why I suspect we need stricter conditions. Have you run some benchmarks and seen clear advantages for Conv2DPackedProgram compared with conv2dWithIm2Row?

Collaborator Author

Thank you, Jiajia, for benchmarking this!

From my original perspective, Conv2DPackedProgram aims to optimize all non-pointwise conv2d ops relative to conv2dWithIm2Row (from your comments, this view needs more experiments to support it). Models whose inference time is dominated by non-pointwise conv2d should perform better when WEBGL_EXP_CONV is set to true.

Conv2DPackedProgram performs great for one of our internal models whose inference time is dominated by non-pointwise conv2d (the model speeds up from 170ms to 50ms). A comprehensive benchmarking pass is a great idea; I will do it and share the results later.

I just tested bodypix with the ResNet50 base on my Mac, and performance is similar with WEBGL_EXP_CONV true and false. (The aggregated Conv2D time drops from 73.09ms to 52.11ms when the flag is true; it's weird that the final inference time stays about the same, so I will look into this.)
[screenshot omitted]
Since I do not have an Intel CFL machine, I am curious how much worse it is on Intel CFL?

Contributor

@Linchenn Thanks for your data. I rechecked it on Intel CFL, TGL, and an NV RTX3090 on Windows. All of them show obvious regressions on bodypix-ResNet50-image-0.5. Since Conv2DPackedProgram is behind a flag, it's OK as it is for now, but once we plan to enable it by default, we really need to be careful about the conditions for using it.

Take Intel TGL as an example:

  • WEBGL_EXP_CONV as true

[screenshots: webgl_exp_conv_true, webgl_exp_conv_true_avg_time, webgl_exp_env_true_individual]

  • WEBGL_EXP_CONV as false

[screenshots: webgl_exp_conv_false, webgl_exp_conv_false_avg_time, webgl_exp_conv_false_individual]

Collaborator Author

Thank you, Jiajia, for the detailed checks on various devices! I will verify this before removing the flag and share the results with you later.

Collaborator

@lina128 lina128 left a comment


Thank you so much Lin!

Reviewable status: :shipit: complete! 3 of 1 approvals obtained (waiting on @Linchenn and @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu.ts line 72 at r10 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

Thank you for the quick experiment.

Great insight, thank you!
