Sub-optimal performance on ARM Mali GPUs #128

Closed
CNugteren opened this issue Dec 4, 2016 · 14 comments

@CNugteren (Owner) commented Dec 4, 2016

Performance of CLBlast is suboptimal on ARM Mali GPUs. This is mainly because the way the OpenCL kernels are currently written isn't handled well by ARM's OpenCL compiler.

To allow parametrised code in CLBlast without having to generate OpenCL kernel strings and without having to write hundreds of lines of code, CLBlast makes heavy use of small unrollable loops over small thread-private arrays. Here is an example of setting the amount of work per thread (e.g. for register tiling in the GEMM kernels):

#define WPT 4 // value tuned specifically per device
__kernel void example() {
   float values[WPT];
   #pragma unroll
   for (int i = 0; i < WPT; ++i) {
      values[i] = 0.0;
   }
   // rest of the kernel
}

As far as I can see, the ARM Mali compiler doesn't promote these small arrays to registers and thus generates code with loads and stores. However, the other compilers tested handle this code as if it were the following:

__kernel void example() {
   float value0, value1, value2, value3;
   value0 = 0.0;
   value1 = 0.0;
   value2 = 0.0;
   value3 = 0.0;
   // rest of the kernel
}

In the case of ARM Mali, the manually unrolled version yields significantly better performance (e.g. a factor of 2 for a GEMM kernel) compared to the version with a loop and an array. For the other tested compilers, performance is equal.

So, why not do manual unrolling everywhere? This would increase the (source) size of the kernels significantly and make them less readable (just think of nesting such constructs). It would also only be able to handle a limited number of cases. For example, if WPT in our example can be any power of 2, we get:

#define WPT 4 // value tuned specifically per device
__kernel void example() {
   #if WPT == 1
      float value0;
      value0 = 0.0;
   #endif
   #if WPT == 2
      float value0, value1;
      value0 = 0.0;
      value1 = 0.0;
   #endif
   #if WPT == 4
      float value0, value1, value2, value3;
      value0 = 0.0;
      value1 = 0.0;
      value2 = 0.0;
      value3 = 0.0;
   #endif
   #if WPT == 8
      (...)
   #endif
   // how far do we go? 8? 16? 32?
   // rest of the kernel
}

I hope the issue is clear. I see two solutions:

  1. Raise this issue with ARM and hope that they'll be able to implement such optimisations in the compiler. It might be good to do this anyway.
  2. Include a simple 'pre-processor' in CLBlast that performs the "array + loop" -> "registers + unroll" transformation automatically. I believe this doesn't have to be a complicated piece of software, since it can be made to work specifically with CLBlast (e.g. assuming certain variable names for arrays); a rough sketch of the idea follows below.
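
To make the second option concrete, below is a minimal sketch of the array-to-register promotion step, written in C++ since CLBlast's host code is C++. This is purely illustrative and not CLBlast's actual implementation: the function name PromoteArrayToRegisters is hypothetical, and it assumes the loop has already been unrolled so that every array access uses an integer literal index.

// Toy sketch of the "array + loop" -> "registers + unroll" promotion step
// (hypothetical, not CLBlast code). Assumes tuning parameters such as WPT
// have already been substituted as integer literals and the loops have
// already been unrolled, so every access reads e.g. "values[2]".
#include <regex>
#include <string>

std::string PromoteArrayToRegisters(std::string source,
                                    const std::string& name, const int size) {
   // 1) Rewrite the declaration "float values[4]" into scalar declarations
   //    "float values_0, values_1, values_2, values_3"
   std::string scalars;
   for (int i = 0; i < size; ++i) {
      scalars += (i == 0 ? "" : ", ") + name + "_" + std::to_string(i);
   }
   source = std::regex_replace(
       source,
       std::regex("float\\s+" + name + "\\s*\\[\\s*" + std::to_string(size) + "\\s*\\]"),
       "float " + scalars);
   // 2) Rewrite every literal-index access such as "values[2]" into "values_2"
   source = std::regex_replace(source,
                               std::regex(name + "\\s*\\[\\s*(\\d+)\\s*\\]"),
                               name + "_$1");
   return source;
}

A real version would also need to unroll the loops first and guard against name clashes, but the string-rewriting core can stay this simple precisely because it only has to understand CLBlast's own kernels.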

Has anyone seen the same issue with another OpenCL compiler? Any thoughts perhaps on how to handle this issue nicely in CLBlast?

@psyhtest (Contributor) commented May 31, 2017

Using Collective Knowledge-based tuning of CLBlast, we've been able to achieve 12-13 GFLOPS on several Midgard-based devices (Chromebook 2, Odroid-XU3, Firefly-RK3399). This is about half of what ARM's hand-optimised code achieves on the same devices, so better compiler support should push the performance of autotuned CLBlast code up.

The SGEMM performance improvements have resulted in 3-4x improvements to the performance of Caffe; see, for example, our analysis for the Firefly-RK3399 board. On the quad-core Mali-T860 @ 800 MHz, the performance of Caffe improves by 3.5-4.2x for AlexNet, GoogleNet and SqueezeNet, reaching 3.6, 1.3 and 3.9 fps, respectively. (Using OpenBLAS on the dual-core Cortex-A72 @ 1800 MHz, the performance reaches 4.6, 1.8 and 7.5 fps, respectively.)

@amoghaUdupa commented Jun 14, 2017

I learnt that Qualcomm's OpenCL compiler doesn't support unrolling, so I tried manually unrolling the gemm_direct kernel's loops.
Before unrolling:

#pragma unroll
  for (int mia=0; mia<MWAD; ++mia) {
    #pragma unroll
    for (int kia=0; kia<KWAD; ++kia) {

      // Computes the indices for the global memory
      int mg = mia + la0*MWAD;
      int kg = kia + la1*KWAD;
      int idm = (a_transpose) ? mg + kwg : mg + GetGroupID0()*WGD;
      int idk = (a_transpose) ? kg + GetGroupID0()*WGD : kg + kwg;

      // Loads the data from global memory into the local memory
      real result = agms[idk*a_ld + idm + a_offset];
      if (a_conjugate) { COMPLEX_CONJUGATE(result); }
      alm[kg*(WGD + PADA) + mg] = result;
    }
  }

After unrolling:

const int mg0 = 0 + la0*MWAD;
const int kg0 = 0 + la1*KWAD;
const int idm0 = (a_transpose) ? mg0 + kwg : mg0 + GetGroupID0()*WGD;
const int idk0 = (a_transpose) ? kg0 + GetGroupID0()*WGD : kg0 + kwg;
const real result0 = agms[idk0*a_ld + idm0 + a_offset];
if (a_conjugate) { COMPLEX_CONJUGATE(result0); }
alm[kg0*(WGD + PADA) + mg0] = result0;

const int mg1 = 0 + la0*MWAD;
const int kg1 = 1 + la1*KWAD;
const int idm1 = (a_transpose) ? mg1 + kwg : mg1 + GetGroupID0()*WGD;
const int idk1 = (a_transpose) ? kg1 + GetGroupID0()*WGD : kg1 + kwg;
const real result1 = agms[idk1*a_ld + idm1 + a_offset];
if (a_conjugate) { COMPLEX_CONJUGATE(result1); }
alm[kg1*(WGD + PADA) + mg1] = result1;

//few more

However, there seems to be no improvement in performance. Is this the right way to go about it?

@CNugteren (Owner, Author) commented:

Yes, that is indeed unrolling, but you did it for a fixed value of KWAD/MWAD (understandably so), so it is not a solution to the general issue. The fact that you see no performance improvement might also be normal: unrolling does not necessarily improve performance. In some cases it might, in others it might not.

The original performance issue on ARM discussed above is related to unrolling in order to promote an array to registers, e.g. going from:

   float values[WPT];
   #pragma unroll
   for (int i = 0; i < WPT; ++i) {
      values[i] = 0.0;
   }

to:

   float value0, value1, value2, value3;
   value0 = 0.0;
   value1 = 0.0;
   value2 = 0.0;
   value3 = 0.0;

This kind of case does not seem to be present in your snippet.

@CNugteren (Owner, Author) commented:

It's been a while, but I finally started working on a kernel pre-processor to 1) unroll loops and 2) apply array-to-register promotion. Work is ongoing in the kernel_preprocessor branch. I'll report back when everything is implemented and the first performance results are available.
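
To make those two steps concrete, step 1 (unrolling a loop with a literal trip count) could in principle be as simple as the following toy sketch; the helper name UnrollLoop is hypothetical and this is not the code in the kernel_preprocessor branch, which has to handle many more cases.

// Toy sketch of step 1: unroll "for (int i = 0; i < N; ++i) { body }" with a
// literal trip count by pasting N copies of the body, substituting the loop
// variable. Hypothetical helper, not the actual branch code.
#include <regex>
#include <string>

std::string UnrollLoop(const std::string& loop_var, const int trip_count,
                       const std::string& body) {
   std::string result;
   for (int i = 0; i < trip_count; ++i) {
      // Replace whole-word occurrences of the loop variable with the literal
      result += std::regex_replace(body, std::regex("\\b" + loop_var + "\\b"),
                                   std::to_string(i));
   }
   return result;
}

Feeding the result of step 1 into an array-to-register rewrite like the sketch earlier in this thread then yields exactly the manually unrolled form shown at the top of this issue.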

@CNugteren (Owner, Author) commented Dec 10, 2017

The kernel_preprocessor implementation is now finished and enabled by default for ARM Mali and Qualcomm Adreno GPUs. I've done a first test running the tuner on a Mali T628 MP6; here are the results of tuning GEMM for m=n=k=512:

It seems that in most cases there are significant gains; however, this is not always the case. I also ran the new pre-processor for m=n=k=1024, for which I attained 10 GFLOPS as the best result.

@psyhtest and others: could you also give the kernel_preprocessor branch a try if you have time? You can play with line 86 in src/utilities/compile.cpp to disable/enable the new pre-processor.

@sivagnanamn Perhaps you can also have a go with this branch on your Qualcomm Adreno system?

@sivagnanamn (Contributor) commented:

@CNugteren Thank you.

I did a brief test with M=32, N=50176, K=144 on a Qualcomm Adreno 330:

############### With kernel pre-processor
"best_kernel": "Xgemm",
  "best_time": "65.87",
  "best_parameters": "KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 PRECISION=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=1",
############### Without kernel pre-processor ==> Result from old tuner json file
"best_kernel": "Xgemm",
  "best_time": "72.1",
  "best_parameters": "KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 PRECISION=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=4",

With this new branch, my overall inference time was reduced by ~50-60 ms on the Adreno 330.

@CNugteren (Owner, Author) commented:

Hmmm, so that gain is really minimal unfortunately :( That means we'll have to investigate Adreno further; let's hope the results on ARM Mali are more encouraging.

@sivagnanamn (Contributor) commented:

Mali T760 - tuner results using the kernel_preprocessor branch:
CLBlast_Tuner_JSONs_kernel_preprocessor.zip

###### Without kernel pre-processor
  "kernel_family": "xgemm_1",
  "precision": "32",
  "best_kernel": "Xgemm",
  "best_time": "383.12",
  "best_parameters": "KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4"
######## With kernel pre-processor
  "kernel_family": "xgemm_1",
  "precision": "32",
  "best_kernel": "Xgemm",
  "best_time": "191.41",
  "best_parameters": "KWG=32 KWI=2 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 PRECISION=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2"

@CNugteren The Mali results are far better than with the previous release of CLBlast.

Could you please share your thoughts about Qualcomm: is there room for improvement on Qualcomm-based GPUs?

@fvella commented Dec 13, 2017

Hi Cedric,
we ran the client with the new pre-processor on both the Odroid (Mali-T628) and the Firefly (Mali-T860) and got improvements of 20% and 15% respectively for m=n=k=1024.
I am also running the tuner; results coming soon.
In the meantime, using the default configuration:

  • Odroid: 13.5 GFLOPS
  • Firefly: 14.5 GFLOPS

@CNugteren (Owner, Author) commented:

@sivagnanamn: Many thanks for testing and re-running the tuner on the Mali T760. I will throw the old results out of the database and replace them with the new ones. As for the Qualcomm issue, apparently that's unrelated; I've opened a separate issue for that: #228.

@fvella Thanks for testing on those two devices. What are the expected performance numbers in GFLOPS, e.g. for the ARM reference implementation? If you re-run the tuner, perhaps you'll get even more gains. I propose to remove all old Mali tuning results from the database. If you could re-run all the tuners for the T860, that would be great! I can run them myself for the T628.

@CNugteren (Owner, Author) commented:

FYI, this branch is now merged into master with new tuning results for Mali T760 and T628.

@fvella commented Dec 18, 2017

Hi Cedric,
with our tuner we got 20 GFLOPS on the T860!
Below is the configuration:

  "device_compute_units": "4", 
  "device_core_clock": "200", 
  "device_type": "GPU", 
  "device_vendor": "ARM", 
  "kernel": "xgemm", 
  "post_processed": "yes", 
  "precision": "32", 
  "statistics": {
    "best_configuration": {
      "GFLOPS": 20.232748075636668, 
      "kernel": "Xgemm", 
      "parameters": {
        "KWG": 16, 
        "KWI": 1, 
        "MDIMA": 8, 
        "MDIMC": 8, 
        "MWG": 32, 
        "NDIMB": 16, 
        "NDIMC": 16, 
        "NWG": 64, 
        "PRECISION": 32, 
        "SA": 1, 
        "SB": 1, 
        "STRM": 1, 
        "STRN": 0, 
        "VWM": 4, 
        "VWN": 4
      }, 
      "time": 106.139
    }, 
    "default_configuration": {}, 
    "default_family": {}
  }
}

The Arm Compute Library should take ~75.00 ms. I will check later.

@CNugteren (Owner, Author) commented:

Good to hear we are now close to ARM's reference implementation. It would be great if you could share all the JSON tuner results for the T860 (make alltuners); then I can integrate those into CLBlast.

@CNugteren (Owner, Author) commented:

Just to confirm: on the Mali T628 the results are much better than before, now achieving ~11 GFLOPS in single-precision and double that in half-precision:
hgemm_plot_tight.pdf

I'm closing this issue now, because the pre-processor works around the shortcomings of ARM's OpenCL compiler. Further tuning results are always welcome; they can be shared in #1. Thanks all for testing!
