
[microTVM] Add Cortex-M DSP schedules for optimal conv2d layouts #12969

Merged: 18 commits merged into apache:main on Oct 11, 2022

Conversation

@guberti (Member) commented Oct 3, 2022

This pull request adds fast microTVM DSP schedules for the optimal (for Cortex-M) conv2d and depthwise_conv2d layouts - NHWC/OHWI and NCHW/OIHW, respectively. Because these layouts let us use the special SMLAD DSP instruction without rearranging data, 25% of the instructions in the inner loop can be removed (in the int16 case).

Additionally, this change allows both the conv2d and depthwise_conv2d fast DSP schedules to use the same underlying intrinsic, a variation of a tensordot operator. This makes the code for these schedules much more compact. This PR also:

  • Adds unit tests for the new schedules
  • Does not affect the old schedules, which are still used when they apply
    • The cases in which these new fast schedules apply are strictly different
  • Adds support for int8, int16, and int32 input data types in the new schedules
  • Adds a change_constant_shape utility to tvm.topi

I've also written a comment below delving into why these layouts are optimal.
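
For context, the tensordot operator mentioned above is conceptually just a dot product over the flattened slice of input data and kernel values that produce one output element. A rough scalar reference of the idea (illustrative only, not the code this PR generates):

```
#include <stddef.h>
#include <stdint.h>

// Scalar reference for tensordot: the dot product of two equally sized,
// contiguous slices of the (flattened) data and kernel tensors.
static inline int32_t tensordot_ref(const int16_t *data, const int16_t *kernel,
                                    size_t length) {
  int32_t sum = 0;
  for (size_t i = 0; i < length; i++) {
    sum += (int32_t)data[i] * (int32_t)kernel[i];
  }
  return sum;
}
```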

@guberti (Member, Author) commented Oct 3, 2022

In #12856, we discussed how NHWC was a bad format to use for depthwise_conv2d in microTVM (and likewise NCHW a bad format for regular conv2d). From this, one might ask the question:

Given a conv2d on Cortex-M4 with n groups, what are the optimal data and kernel layouts?

When choosing these layouts, we really want data that will be multiply-accumulated to be next to each other in memory. The primary reason is that doing this lets us use the __SMLAD instruction with minimal overhead, which does two multiply-accumulates with one instruction. The secondary reason, however, is to let us use *ptr++ when reading both the input data and kernel as much as possible, as *ptr++ is one instruction (when pipelined) on Cortex-M.
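
For reference, here is a scalar C model of what __SMLAD computes when each 32-bit operand holds two packed int16 values (illustrative only, not the CMSIS intrinsic itself; the real instruction also sets the Q flag on accumulator overflow, which is omitted here):

```
#include <stdint.h>

// Scalar model of __SMLAD(x, y, acc): multiply the packed int16 halves of
// x and y pairwise and add both products to the accumulator.
static inline int32_t smlad_model(uint32_t x, uint32_t y, int32_t acc) {
  int16_t x_lo = (int16_t)(x & 0xFFFFu), x_hi = (int16_t)(x >> 16);
  int16_t y_lo = (int16_t)(y & 0xFFFFu), y_hi = (int16_t)(y >> 16);
  return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}
```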

For depthwise convolutions, channels do not interact with each other at all, so they have no reason to be near each other in memory. This applies to both the input data and the kernel. Hence, NCHW is the optimal data layout and OIHW the optimal kernel layout. By similar reasoning, NHWC and OHWI are optimal for regular Conv2D operators. But we can generalize further - for a generalized Conv2D with n groups and c channels, the optimal layouts are NCHWxc/OIHWxi, where x = c / n. As a sanity check, a depthwise convolution has n = c, so x = 1 and the data layout reduces to NCHW; a regular convolution has n = 1, so x = c and the layout reduces to NHWC.

Now, assume we are performing a generalized Conv2D with n groups and c channels, using data layout NCHWxc and kernel layout OIHWxi. For the int16 case, to convolve one entire row (width * channels / groups individual parameters), all we need to do is copy/paste this code width * channels / (2 * groups) times (the 2 coming from the fact that two int16 values fit into one int32).

// tensor and kernel point at word-packed int16 data, so each load fetches
// two adjacent int16 values in a single 32-bit word.
uint32_t tensor_batch = *tensor++;
uint32_t kernel_batch = *kernel++;
// __SMLAD multiply-accumulates both int16 halves in one instruction.
sum = __SMLAD(tensor_batch, kernel_batch, sum);

This code does not depend on the number of groups, which allows us to use the same tensorize function for both regular and depthwise convolutions!
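
To make that concrete, here is a minimal sketch of what the fully unrolled intrinsic could look like for a row of eight int16 values, so width * channels / (2 * groups) = 4 (the function name and signature are illustrative, and __SMLAD is assumed to come from the CMSIS-Core DSP intrinsics on Cortex-M4/M7):

```
#include <stdint.h>

// Illustrative only: accumulate one row of eight int16 values (four packed
// 32-bit words) of the tensor against the corresponding kernel values.
static inline uint32_t tensordot_row8_int16(const uint32_t *tensor,
                                            const uint32_t *kernel,
                                            uint32_t sum) {
  // The same three-line pattern as above, repeated four times.
  sum = __SMLAD(*tensor++, *kernel++, sum);
  sum = __SMLAD(*tensor++, *kernel++, sum);
  sum = __SMLAD(*tensor++, *kernel++, sum);
  sum = __SMLAD(*tensor++, *kernel++, sum);
  return sum;
}
```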

@guberti (Member, Author) commented Oct 5, 2022

Additionally, while these schedules are finished and ready to use when appropriate, there are still three changes I must make before they contribute to performance improvements on common tiny models (e.g. MobileNetV1 or ResNet).

  1. Add support for non-word-aligned kernels. For the int32 input dtype, the current schedules have no limitations. However, for the int16 and int8 input dtypes, we require that the kernels fit evenly into words. For example, our depthwise schedule would not currently work with a 3x3 kernel and the int16 input dtype, as the kernel would take up 4.5 words in memory (see the sketch after this list).
    • Arm Cortex-M does not support fast unaligned memory accesses, so the fix here is kinda gnarly. The int16 case is important enough that I really ought to add this feature, but the fix for int8 is less useful and requires far more casework. Hence, I'll probably only fix this for int16.
  2. Use the out_layout attribute to allow changing layouts. Many common models (like MobileNetV1) alternate between regular and depthwise convolutions to improve performance. Since we prefer the NHWC data layout for regular convolutions and the NCHW data layout for depthwise convolutions, it would be really nice for our regular convolution schedule to be able to take data in NHWC and output NCHW when appropriate. This should actually be pretty straightforward.
  3. Add legalization code for Cortex-M to change the layouts when appropriate.
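
One possible way to handle the int16 misalignment described in item 1 (a sketch under assumed simplifications, not necessarily the planned fix) is to process the word-aligned pairs with __SMLAD and finish the odd trailing halfword with a plain scalar multiply-accumulate. The real fix is messier, since the next kernel or row then starts mid-word.

```
#include <stdint.h>

// Illustrative only: dot product of nine contiguous int16 values (4.5 words),
// e.g. a 3x3 int16 kernel, against identically laid out tensor data.
// Assumes both pointers start word-aligned; __SMLAD comes from the
// CMSIS-Core DSP intrinsics on Cortex-M4/M7.
static inline uint32_t dot9_int16(const int16_t *tensor, const int16_t *kernel,
                                  uint32_t sum) {
  const uint32_t *t = (const uint32_t *)tensor;
  const uint32_t *k = (const uint32_t *)kernel;
  for (int i = 0; i < 4; i++) {
    sum = __SMLAD(*t++, *k++, sum);  // eight of the nine values, two per call
  }
  // Odd ninth value: a plain scalar multiply-accumulate, which the compiler
  // can lower to a single SMLABB or MLA instruction.
  return sum + (uint32_t)((int32_t)tensor[8] * (int32_t)kernel[8]);
}
```

A production version would also need to deal with the strict-aliasing cast and with subsequent kernels or rows that begin at a halfword boundary, which is where the casework comes from.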

I plan to address 1 + 2 in one follow-up PR, and 3 in another. At some point, I'll also write a non-DSP version of the tensordot schedule so our performance on Cortex-M0 doesn't suck.

@guberti guberti marked this pull request as ready for review October 5, 2022 09:55
@areusch (Contributor) left a comment


thanks @guberti and cc @Mousius @ekalda @leandron @mehrdadh @mkatanbaf @alanmacd for more reviews!

@guberti requested reviews from areusch, AndrewZhaoLuo, and ekalda on October 7, 2022
@guberti (Member, Author) commented Oct 7, 2022

@areusch @AndrewZhaoLuo @ekalda would you mind re-reviewing?

@ekalda (Contributor) left a comment


Thanks @guberti, this looks good to me. Clearly a lot of thought has gone into this work, and there is an abundance of clear documentation. Thanks also for all the explanation around selecting the instructions and where this work is going.

Just out of interest, have you guys looked at optimising schedules for any M-class cores with MVE (Helium)? You should get some good throughput from the 128 bit vectors.

@areusch (Contributor) commented Oct 10, 2022

Hey @ekalda, this is definitely something we should look at. We haven't done that yet, though.

@guberti (Member, Author) commented Oct 10, 2022

Thanks @ekalda! As Andrew said, I've only looked closely at the M4/M7, and these are what I had in mind when writing this schedule.

MVE is super cool though, and it would be straightforward to extend the tensordot kernel to work with it. Definitely worth doing once we're seeing performance improvements from these schedules.

@areusch areusch merged commit fcbcd15 into apache:main Oct 11, 2022
echuraev added a commit to echuraev/tvm that referenced this pull request Oct 17, 2022
Tuning doesn't work after apache#12969.
It reports the following error:

```
ImportError: cannot import name 'get_const_float' from partially initialized module 'tvm.topi.utils' (most likely due to a circular import)
```

In this commit I moved the relay import into a function that is used in a test, which fixes the circular import.
masahi pushed a commit that referenced this pull request Oct 17, 2022
* [HotFix] Fix python import (same commit message as above)

* Fix lint
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 10, 2022
* [HotFix] Fix python import (same commit message as above)

* Fix lint
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
[microTVM] Add Cortex-M DSP schedules for optimal conv2d layouts (apache#12969)

* Rewrite conv2D to tensorize with tensordot

* Functional conv2D tensordot implementation

* Add stupid hack to work around TVM bug

* Unit testing for conv2d schedule

* Connect new implementations to Arm strategy

* Separate into new tensordot conv2d schedule

* Separate testing infrastructure

* Prototype depthwise implementation

* Unit testing for depthwise_conv2d

* Linting and documentation

* Enforce SIMD alignment in strategy

* Prevent black from butchering our formatting

* Address code review comments

* Fix alignment strategy bug

* Fix linting

* Remove unconventional offset behavior

* Replace math.prod function to support Python 3.7

* Fix CI tests
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* [HotFix] Fix python import (same commit message as above)

* Fix lint