[microTVM] Add Cortex-M DSP schedules for optimal conv2d layouts #12969
Conversation
In #12856, we discussed how these layouts were chosen.

When choosing these layouts, we really want data that will be multiply-accumulated to be next to each other in memory. The primary reason is that doing this lets us use the `SMLAD` instruction: two adjacent `int16` values can be fetched with a single word load, and `SMLAD` then performs both multiplies and the accumulation in one instruction.

For depthwise convolutions, channels do not interact with each other at all, so they have no reason to be near each other in memory. This applies to both the input data and the kernel. Hence, `NCHW/OIHW` is the natural layout pair for depthwise convolutions.

Now, assume we are performing a generalized Conv2D with an arbitrary number of groups.
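The inner loop then reduces to a tensordot (a dot product) over two contiguous spans of memory. Here is a hedged sketch of what that loop might look like, assuming a CMSIS core header is included (which provides the `__SMLAD` intrinsic) and that the buffers are word-aligned with even length; this is illustrative, not the code these schedules actually generate:

```c
#include <stdint.h>
// Assumes a CMSIS core header (e.g. core_cm4.h) is included, which provides
// the __SMLAD intrinsic on Cortex-M cores with DSP extensions.

// Illustrative sketch: dot product over `length` contiguous int16 values.
// Each 32-bit load fetches two adjacent int16 operands, and __SMLAD then
// performs both multiplies plus the accumulation in a single instruction.
// Assumes `length` is even and both pointers are word-aligned.
static inline int32_t tensordot(const int16_t *data, const int16_t *kernel,
                                int length) {
    int32_t sum = 0;
    const int32_t *d32 = (const int32_t *)data;
    const int32_t *k32 = (const int32_t *)kernel;
    for (int i = 0; i < length / 2; i++) {
        sum = __SMLAD(d32[i], k32[i], sum);  // two MACs per instruction
    }
    return sum;
}
```

Because the operands are already adjacent in memory, no packing or rearranging is needed before each `__SMLAD`, which is where the inner-loop instruction savings come from.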
This code does not depend on the number of groups, which allows us to use the same tensorize function for both regular and depthwise convolutions!
Additionally, while these schedules are finished and ready to use when appropriate, there are still three changes I must make before they contribute to performance improvements on common tiny models (e.g. MobileNetV1 or ResNet).
I plan to address the first two in one follow-up PR, and the third in another. At some point, I'll also write a non-DSP version of the `tensordot` intrinsic.
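For reference, a non-DSP variant would presumably reduce to a plain scalar multiply-accumulate loop, something like this (a sketch of the idea, not code from this PR):

```c
#include <stdint.h>

// Hedged sketch of a non-DSP fallback: a plain scalar MAC loop with no
// alignment or even-length requirements, at roughly half the throughput
// of the SMLAD version for int16 data.
static inline int32_t tensordot_no_dsp(const int16_t *data,
                                       const int16_t *kernel, int length) {
    int32_t sum = 0;
    for (int i = 0; i < length; i++) {
        sum += (int32_t)data[i] * (int32_t)kernel[i];
    }
    return sum;
}
```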
@areusch @AndrewZhaoLuo @ekalda would you mind re-reviewing?
Thanks @guberti, looks good to me! Clearly a lot of thought has gone into this work, and there is an abundance of clear documentation. Thanks for all the explanation around selecting the instructions and where this work is going.
Just out of interest, have you guys looked at optimising schedules for any M-class cores with MVE (Helium)? You should get some good throughput from the 128-bit vectors.
Hey @ekalda, this is definitely something we should look at. We haven't done that yet, though.
Thanks @ekalda! As Andrew said, I've only looked closely at the M4/M7, and these are what I had in mind when writing this schedule. MVE is super cool though, and it would be straightforward to extend the `tensordot` intrinsic to take advantage of it.
* [HotFix] Fix python import

  Tuning doesn't work after apache#12969. It reports the following error:

  ```
  ImportError: cannot import name 'get_const_float' from partially initialized module 'tvm.topi.utils' (most likely due to a circular import)
  ```

  In this commit, I moved `import relay` into a function used in a test, which fixes the circular import.

* Fix lint
[microTVM] Add Cortex-M DSP schedules for optimal conv2d layouts (apache#12969)

* Rewrite conv2D to tensorize with tensordot
* Functional conv2D tensordot implementation
* Add stupid hack to work around TVM bug
* Unit testing for conv2d schedule
* Connect new implementations to Arm strategy
* Separate into new tensordot conv2d schedule
* Separate testing infrastructure
* Prototype depthwise implementation
* Unit testing for depthwise_conv2d
* Linting and documentation
* Enforce SIMD alignment in strategy
* Prevent black from butchering our formatting
* Address code review comments
* Fix alignment strategy bug
* Fix linting
* Remove unconventional offset behavior
* Replace math.prod function to support Python 3.7
* Fix CI tests
This pull request adds fast microTVM DSP schedules for the optimal (for Cortex-M) `conv2d` and `depthwise_conv2d` layouts: `NHWC/OHWI` and `NCHW/OIHW`, respectively. By letting us use the special `SMLAD` DSP instruction without rearranging data, 25% of the instructions in the inner loop can be removed (for the `int16` case).

Additionally, this change allows both the `conv2d` and `depthwise_conv2d` fast DSP schedules to use the same underlying intrinsic, a variation of a tensordot operator. This makes the code for these schedules much more compact. This PR also:

* Supports `int8`, `int16`, and `int32` input data types in the new schedules
* Adds a `change_constant_shape` utility to `tvm.topi`

I've also written a comment in this thread delving into why these layouts are optimal.
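To make the "without rearranging data" point concrete, here is a hedged sketch of how one `conv2d` output element could be computed for `NHWC` data and `OHWI` kernels, reusing the `tensordot` sketch from the comment earlier in this thread. All names here are illustrative (this is not the code the schedules emit), and it assumes batch size 1, stride 1, no padding, and that `kernel_w * in_ch` is even and word-aligned:

```c
// Illustrative only: compute the output at (y, x) for one output channel.
// In NHWC, each input row (kernel_w pixels x in_ch channels) is contiguous;
// in OHWI, each kernel row for a given output channel is contiguous. So the
// DSP tensordot can be applied row by row with no data rearrangement.
int32_t conv2d_point(const int16_t *data,    // NHWC input
                     const int16_t *kernel,  // OHWI weights
                     int y, int x, int out_ch,
                     int width, int in_ch, int kernel_h, int kernel_w) {
    int32_t sum = 0;
    const int16_t *k = kernel + out_ch * kernel_h * kernel_w * in_ch;
    for (int ky = 0; ky < kernel_h; ky++) {
        const int16_t *row = data + ((y + ky) * width + x) * in_ch;
        sum += tensordot(row, k + ky * kernel_w * in_ch, kernel_w * in_ch);
    }
    return sum;
}
```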