
[microTVM] Add Cortex-M DSP schedules for optimal conv2d layouts #12969

Merged: 18 commits merged into apache:main on Oct 11, 2022

Conversation

@guberti (Member) commented Oct 3, 2022

This pull request adds fast microTVM DSP schedules for the optimal (for Cortex-M) conv2d and depthwise_conv2d layouts - NHWC/OHWI and NCHW/OIHW, respectively. Because these layouts let us use the special SMLAD DSP instruction without rearranging data, 25% of the instructions in the inner loop can be removed (in the int16 case).

Additionally, this change allows both the conv2d and depthwise_conv2d fast DSP schedules to use the same underlying intrinsic, a variation of a tensordot operator. This makes the code for these schedules much more compact. This PR also:

  • Adds unit tests for the new schedules
  • Does not affect the old schedules, which are still used when they apply
    • The cases in which these new fast schedules apply are strictly different
  • Adds support for int8, int16, and int32 input data types in the new schedules
  • Adds a change_constant_shape utility to tvm.topi

I've also written a comment below delving into why these layouts are optimal.
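
For context, the tensordot operator mentioned above is conceptually just a dot product over the flattened slice of input data and kernel values that produce one output element. A rough scalar reference of the idea (illustrative only, not the code this PR generates):

```
#include <stddef.h>
#include <stdint.h>

// Scalar reference for tensordot: the dot product of two equally sized,
// contiguous slices of the (flattened) data and kernel tensors.
static inline int32_t tensordot_ref(const int16_t *data, const int16_t *kernel,
                                    size_t length) {
  int32_t sum = 0;
  for (size_t i = 0; i < length; i++) {
    sum += (int32_t)data[i] * (int32_t)kernel[i];
  }
  return sum;
}
```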

@guberti (Member, Author) commented Oct 3, 2022

In #12856, we discussed how NHWC was a bad format to use for depthwise_conv2d in microTVM (and likewise NCHW a bad format for regular conv2d). From this, one might ask the question:

Given a conv2d on Cortex-M4 with n groups, what are the optimal data and kernel layouts?

When choosing these layouts, we really want data that will be multiply-accumulated to be next to each other in memory. The primary reason is that doing this lets us use the __SMLAD instruction with minimal overhead, which does two multiply-accumulates with one instruction. The secondary reason, however, is to let us use *ptr++ when reading both the input data and kernel as much as possible, as *ptr++ is one instruction (when pipelined) on Cortex-M.
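
For reference, here is a scalar C model of what __SMLAD computes when each 32-bit operand holds two packed int16 values (illustrative only, not the CMSIS intrinsic itself; the real instruction also sets the Q flag on accumulator overflow, which is omitted here):

```
#include <stdint.h>

// Scalar model of __SMLAD(x, y, acc): multiply the packed int16 halves of
// x and y pairwise and add both products to the accumulator.
static inline int32_t smlad_model(uint32_t x, uint32_t y, int32_t acc) {
  int16_t x_lo = (int16_t)(x & 0xFFFFu), x_hi = (int16_t)(x >> 16);
  int16_t y_lo = (int16_t)(y & 0xFFFFu), y_hi = (int16_t)(y >> 16);
  return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}
```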

For depthwise convolutions, channels do not interact with each other at all, so they have no reason to be near each other in memory. This applies to both the input data and the kernel. Hence, NCHW is the optimal data layout and OIHW the optimal kernel layout. By similar reasoning, NHWC and OHWI are optimal for regular Conv2D operators. But we can generalize further - for a generalized Conv2D with n groups and c channels, the optimal layouts are NCHWxc/OIHWxi, where x = c / n. As a sanity check, a depthwise convolution has n = c, so x = 1 and the data layout reduces to NCHW; a regular convolution has n = 1, so x = c and the layout reduces to NHWC.

Now, assume we are performing a generalized Conv2D with n groups and c channels, using data layout NCHWxc and kernel layout OIHWxi. For the int16 case, to convolve one entire row (width * channels / groups individual parameters), all we need to do is copy/paste this code width * channels / (2 * groups) times (the 2 coming from the fact that two int16 values fit into one int32).

// tensor and kernel point at word-packed int16 data, so each load fetches
// two adjacent int16 values in a single 32-bit word.
uint32_t tensor_batch = *tensor++;
uint32_t kernel_batch = *kernel++;
// __SMLAD multiply-accumulates both int16 halves in one instruction.
sum = __SMLAD(tensor_batch, kernel_batch, sum);

This code does not depend on the number of groups, which allows us to use the same tensorize function for both regular and depthwise convolutions!
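
To make that concrete, here is a minimal sketch of what the fully unrolled intrinsic could look like for a row of eight int16 values, so width * channels / (2 * groups) = 4 (the function name and signature are illustrative, and __SMLAD is assumed to come from the CMSIS-Core DSP intrinsics on Cortex-M4/M7):

```
#include <stdint.h>

// Illustrative only: accumulate one row of eight int16 values (four packed
// 32-bit words) of the tensor against the corresponding kernel values.
static inline uint32_t tensordot_row8_int16(const uint32_t *tensor,
                                            const uint32_t *kernel,
                                            uint32_t sum) {
  // The same three-line pattern as above, repeated four times.
  sum = __SMLAD(*tensor++, *kernel++, sum);
  sum = __SMLAD(*tensor++, *kernel++, sum);
  sum = __SMLAD(*tensor++, *kernel++, sum);
  sum = __SMLAD(*tensor++, *kernel++, sum);
  return sum;
}
```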

@guberti (Member, Author) commented Oct 5, 2022

Additionally, while these schedules are finished and ready to use when appropriate, there are still three changes I must make before they contribute to performance improvements on common tiny models (e.g. MobileNetV1 or ResNet).

  1. Add support for non-word-aligned kernels. For the int32 input dtype, the current schedules have no limitations. However, for the int16 and int8 input dtypes, we require that the kernels fit evenly into words. For example, our depthwise schedule would not currently work with a 3x3 kernel and the int16 input dtype, as the kernel would take up 4.5 words in memory (see the sketch after this list).
    • Arm Cortex-M does not support fast unaligned memory accesses, so the fix here is kinda gnarly. The int16 case is important enough that I really ought to add this feature, but the fix for int8 is less useful and requires far more casework. Hence, I'll probably only fix this for int16.
  2. Use the out_layout attribute to allow changing layouts. Many common models (like MobileNetV1) alternate between regular and depthwise convolutions to improve performance. Since we prefer the NHWC data layout for regular convolutions and the NCHW data layout for depthwise convolutions, it would be really nice for our regular convolution schedule to be able to take data in NHWC and output NCHW when appropriate. This should actually be pretty straightforward.
  3. Add legalization code for Cortex-M to change the layouts when appropriate.
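
One possible way to handle the int16 misalignment described in item 1 (a sketch under assumed simplifications, not necessarily the planned fix) is to process the word-aligned pairs with __SMLAD and finish the odd trailing halfword with a plain scalar multiply-accumulate. The real fix is messier, since the next kernel or row then starts mid-word.

```
#include <stdint.h>

// Illustrative only: dot product of nine contiguous int16 values (4.5 words),
// e.g. a 3x3 int16 kernel, against identically laid out tensor data.
// Assumes both pointers start word-aligned; __SMLAD comes from the
// CMSIS-Core DSP intrinsics on Cortex-M4/M7.
static inline uint32_t dot9_int16(const int16_t *tensor, const int16_t *kernel,
                                  uint32_t sum) {
  const uint32_t *t = (const uint32_t *)tensor;
  const uint32_t *k = (const uint32_t *)kernel;
  for (int i = 0; i < 4; i++) {
    sum = __SMLAD(*t++, *k++, sum);  // eight of the nine values, two per call
  }
  // Odd ninth value: a plain scalar multiply-accumulate, which the compiler
  // can lower to a single SMLABB or MLA instruction.
  return sum + (uint32_t)((int32_t)tensor[8] * (int32_t)kernel[8]);
}
```

A production version would also need to deal with the strict-aliasing cast and with subsequent kernels or rows that begin at a halfword boundary, which is where the casework comes from.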

I plan to address 1 + 2 in one follow-up PR, and 3 in another. At some point, I'll also write a non-DSP version of the tensordot schedule so our performance on Cortex-M0 doesn't suck.

@guberti guberti marked this pull request as ready for review October 5, 2022 09:55
@areusch (Contributor) left a comment


thanks @guberti and cc @Mousius @ekalda @leandron @mehrdadh @mkatanbaf @alanmacd for more reviews!

@guberti requested reviews from areusch, AndrewZhaoLuo, and ekalda on October 7, 2022
@guberti (Member, Author) commented Oct 7, 2022

@areusch @AndrewZhaoLuo @ekalda would you mind re-reviewing?

@ekalda (Contributor) left a comment


Thanks @guberti, this looks good to me. Clearly a lot of thought has gone into this work, and there is an abundance of clear documentation. Thanks also for all the explanation around selecting the instructions and where this work is going.

Just out of interest, have you guys looked at optimising schedules for any M-class cores with MVE (Helium)? You should get some good throughput from the 128 bit vectors.

@areusch (Contributor) commented Oct 10, 2022

Hey @ekalda, this is definitely something we should look at. We haven't done that yet, though.

@guberti (Member, Author) commented Oct 10, 2022

Thanks @ekalda! As Andrew said, I've only looked closely at the M4/M7, and these are what I had in mind when writing this schedule.

MVE is super cool though, and it would be straightforward to extend the tensordot kernel to work with it. Definitely worth doing once we're seeing performance improvements from these schedules.

@areusch areusch merged commit fcbcd15 into apache:main Oct 11, 2022
echuraev added a commit to echuraev/tvm that referenced this pull request Oct 17, 2022
Tuning doesn't work after apache#12969.
It reports the following error:

```
ImportError: cannot import name 'get_const_float' from partially initialized module 'tvm.topi.utils' (most likely due to a circular import)
```

In this commit I moved the relay import into a function that is used in a test, which fixes the circular import.
masahi pushed a commit that referenced this pull request Oct 17, 2022
* [HotFix] Fix python import (same commit message as above)

* Fix lint
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 10, 2022
* [HotFix] Fix python import (same commit message as above)

* Fix lint
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
[microTVM] Add Cortex-M DSP schedules for optimal conv2d layouts (apache#12969)

* Rewrite conv2D to tensorize with tensordot

* Functional conv2D tensordot implementation

* Add stupid hack to work around TVM bug

* Unit testing for conv2d schedule

* Connect new implementations to Arm strategy

* Separate into new tensordot conv2d schedule

* Separate testing infrastructure

* Prototype depthwise implementation

* Unit testing for depthwise_conv2d

* Linting and documentation

* Enforce SIMD alignment in strategy

* Prevent black from butchering our formatting

* Address code review comments

* Fix alignment strategy bug

* Fix linting

* Remove unconventional offset behavior

* Replace math.prod function to support Python 3.7

* Fix CI tests
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* [HotFix] Fix python import (same commit message as above)

* Fix lint