Sub-optimal performance on ARM Mali GPUs #128
Using Collective Knowledge-based tuning of CLBlast, we've been able to achieve 12-13 GFLOPS on several Midgard-based devices (Chromebook-2, Odroid-XU3, Firefly-RK3399). This is about half of what ARM's hand-optimised code achieves on the same devices, so better compiler support should push the performance of autotuned CLBlast code up. The SGEMM performance improvements have resulted in 3-4x improvements in Caffe performance; see, for example, our analysis for the Firefly-RK3399 board. On the quad-core Mali-T860 @ 800 MHz, Caffe performance improves by 3.5-4.2x for AlexNet, GoogleNet and SqueezeNet, reaching 3.6, 1.3 and 3.9 fps, respectively. (Using OpenBLAS on the dual-core Cortex-A72 @ 1800 MHz, the performance reaches 4.6, 1.8 and 7.5 fps, respectively.)
I learnt that Qualcomm's OpenCL compiler doesn't support unrolling. Hence I have tried manual unrolling of the loop in question (a before/after sketch is shown below).
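(A hypothetical sketch of this kind of fixed-count manual unrolling, for illustration only; `a`, `b` and `base` are placeholder names and this is not the actual kernel from this test.)

```c
// Hypothetical sketch of manual unrolling with a fixed trip count.
// Before: a short loop with a compile-time-constant bound.
float sum = 0.0f;
for (int k = 0; k < 4; ++k) {
  sum += a[base + k] * b[base + k];
}

// After: the same computation unrolled by hand. Note that no thread-private
// array is being promoted to registers here; only the loop is removed.
float sum_unrolled = 0.0f;
sum_unrolled += a[base + 0] * b[base + 0];
sum_unrolled += a[base + 1] * b[base + 1];
sum_unrolled += a[base + 2] * b[base + 2];
sum_unrolled += a[base + 3] * b[base + 3];
```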
However, there seems to be no improvement in performance. Is this the right way to go about it?
Yes, that would indeed be unrolling, but you did it for a fixed value. The original performance issue on ARM, as discussed above, is related to unrolling in order to promote an array to registers, e.g. going from:
to:
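(A minimal sketch of the two forms, for illustration only; `in` and `offset` are placeholder names rather than actual CLBlast code.)

```c
// From: a small thread-private array filled in a short loop; the compiler
// is expected to unroll the loop and keep 'values' in registers.
float values[4];
for (int i = 0; i < 4; ++i) {
  values[i] = in[offset + i];
}

// To: the loop unrolled by hand and the array replaced by scalar registers.
const float values0 = in[offset + 0];
const float values1 = in[offset + 1];
const float values2 = in[offset + 2];
const float values3 = in[offset + 3];
```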
This kind of case does not seem to be present in your snippet.
It's been a while, but I finally started working on a kernel pre-processor to 1) unroll loops and 2) apply array-to-register promotion. Work is ongoing in a separate branch.
It seems that in most cases there are significant gains; however, this is not always the case. I also ran with the new pre-processor for m=n=k=1024, for which I attained 10 GFLOPS as the best result. @psyhtest and others: could you also have a try with this branch? @sivagnanamn, perhaps you can also have a go with it on your Qualcomm Adreno system?
@CNugteren Thank you. I did a brief test with M=32, N=50176, K=144 on a Qualcomm Adreno 330.
With this new branch, my overall inference time was reduced by ~50-60 ms on the Adreno 330.
Hmmm, so that gain is really minimal, unfortunately :( That means we'll have to investigate Adreno further. Let's hope the results on ARM Mali are more encouraging.
Mali T760 - tuner results using the new pre-processor branch:
@CNugteren The Mali results are far better than with the previous release of CLBlast. Could you please share your thoughts about Qualcomm? Is improvement possible on Qualcomm-based GPUs?
Hi Cedric,
@sivagnanamn: Many thanks for testing and re-running the tuner on Mali T760. I will throw the old results out of the database and replace them with the new ones. As for the Qualcomm issue, apparently that's unrelated; I've opened a separate issue for it: #228. @fvella Thanks for testing on those 2 devices. What are the expected performance numbers in GFLOPS, e.g. for the ARM reference implementation? If you re-run the tuner, perhaps you'll get even more gains. I propose to remove all old Mali tuning results from the database. If you could re-run all the tuners for the T860, that would be great! I can run them myself for the T628.
FYI, this branch is now merged into master with new tuning results for Mali T760 and T628. |
Hi Cedric,
The ARM Compute Library should take ~75.00 ms. I will check it later.
Good to hear we are now close to ARM's reference implementation. It would be great if you could share all the .JSON tuner results for the T860.
Just to confirm: on Mali T628 the results are much better than before, now achieving ~11 GFLOPS in single-precision and double that in half-precision. I'm closing this issue now, because the pre-processor works around the shortcomings in ARM's OpenCL compiler. Further tuning results are always welcome; they can be shared in #1. Thanks all for testing!
Performance of CLBlast is suboptimal on ARM Mali GPUs. This is mainly because the way the OpenCL kernels are currently written isn't handled well by ARM's OpenCL compiler.
To allow parametrised code without having to generate OpenCL kernel strings and without having to write hundreds of lines, CLBlast makes heavy use of small unrollable loops over small thread-private arrays. Here is an example of setting the amount of work per thread (e.g. for register tiling in the GEMM kernels):
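(A minimal sketch of the pattern, for illustration only; the kernel name and the exact use of the `WPT` parameter are placeholders rather than actual CLBlast source.)

```c
// Illustrative sketch only, not the actual CLBlast kernel source.
// WPT ("work per thread") stands in for a CLBlast tuning parameter and is
// normally passed at kernel compile time, e.g. via "-DWPT=4".
#ifndef WPT
  #define WPT 4
#endif

__kernel void scale_copy(const float alpha,
                         __global const float* x,
                         __global float* y) {
  const int id = get_global_id(0);

  // Small thread-private array sized by the tuning parameter, filled and
  // drained in short loops. This keeps the source compact and parametrised,
  // but relies on the compiler to unroll the loops and to keep 'acc' in
  // registers rather than spilling it to memory.
  float acc[WPT];
  for (int w = 0; w < WPT; ++w) {
    acc[w] = alpha * x[id*WPT + w];
  }
  for (int w = 0; w < WPT; ++w) {
    y[id*WPT + w] = acc[w];
  }
}
```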
As far as I can see, the ARM Mali compiler doesn't promote these small arrays to register values and thus generates code with loads and stores. However, the other compilers tested handle this code as if it were the following:
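(Again a sketch rather than the actual code: the same hypothetical kernel with the loops unrolled and the private array replaced by individual scalar registers, shown for WPT == 4.)

```c
// Illustrative sketch: what a compiler that performs unrolling plus
// array-to-register promotion effectively generates for WPT == 4.
__kernel void scale_copy_unrolled(const float alpha,
                                  __global const float* x,
                                  __global float* y) {
  const int id = get_global_id(0);

  // One scalar register per element instead of 'float acc[4]'.
  const float acc0 = alpha * x[id*4 + 0];
  const float acc1 = alpha * x[id*4 + 1];
  const float acc2 = alpha * x[id*4 + 2];
  const float acc3 = alpha * x[id*4 + 3];

  y[id*4 + 0] = acc0;
  y[id*4 + 1] = acc1;
  y[id*4 + 2] = acc2;
  y[id*4 + 3] = acc3;
}
```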
In the case of ARM Mali, the manually unrolled version yields significantly better performance (e.g. a factor of 2 for a GEMM kernel) compared to the version with a loop and an array. For the other tested compilers, performance is equal.
So, why not do manual unrolling everywhere? It would increase the (source) size of the kernels significantly and make them less readable (just think of nesting such constructs). It would also only be able to handle a limited number of cases. For example, if WPT can be any power of 2, we'll get:
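(A sketch of where that leads, again illustrative and continuing the hypothetical kernel above: one hand-written variant per supported value of WPT, selected with the preprocessor.)

```c
// Illustrative sketch: manual unrolling for every supported power-of-two
// value of WPT means one hand-written variant per value.
#if WPT == 1
  const float acc0 = alpha * x[id];
  y[id] = acc0;
#elif WPT == 2
  const float acc0 = alpha * x[id*2 + 0];
  const float acc1 = alpha * x[id*2 + 1];
  y[id*2 + 0] = acc0;
  y[id*2 + 1] = acc1;
#elif WPT == 4
  const float acc0 = alpha * x[id*4 + 0];
  const float acc1 = alpha * x[id*4 + 1];
  const float acc2 = alpha * x[id*4 + 2];
  const float acc3 = alpha * x[id*4 + 3];
  y[id*4 + 0] = acc0;
  y[id*4 + 1] = acc1;
  y[id*4 + 2] = acc2;
  y[id*4 + 3] = acc3;
#elif WPT == 8
  // ... and so on for 8, 16, 32: the kernel source keeps growing, and
  // nesting such constructs (e.g. two unrolled dimensions in GEMM) quickly
  // becomes unreadable.
#endif
```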
I hope the issue is clear. I see two solutions:
Has anyone seen the same issue with another OpenCL compiler? Any thoughts perhaps on how to handle this issue nicely in CLBlast?