Sub-optimal performance on ARM Mali GPUs #128

Closed
CNugteren opened this issue Dec 4, 2016 · 14 comments

@CNugteren (Owner) commented Dec 4, 2016

Performance of CLBlast is suboptimal on ARM Mali GPUs. This is mainly because the way the OpenCL kernels are currently written isn't handled well by ARM's OpenCL compiler.

To allow parametrised code in CLBlast without having to generate OpenCL kernel strings and without having to write hundreds of lines of code, CLBlast makes heavy use of small unrollable loops over small thread-private arrays. Here is an example of setting the amount of work per thread (e.g. for register tiling in the GEMM kernels):

#define WPT 4 // value tuned specifically per device
__kernel void example() {
   float values[WPT];
   #pragma unroll
   for (int i = 0; i < WPT; ++i) {
      values[i] = 0.0;
   }
   // rest of the kernel
}

As far as I can see, the ARM Mali compiler doesn't promote these small arrays to registers and thus generates code with loads and stores. However, the other compilers tested handle this code as if it were the following:

__kernel void example() {
   float value0, value1, value2, value3;
   value0 = 0.0;
   value1 = 0.0;
   value2 = 0.0;
   value3 = 0.0;
   // rest of the kernel
}

In the case of ARM Mali, the manually unrolled version yields significantly better performance (e.g. a factor of 2 for a GEMM kernel) compared to the version with a loop and an array. For the other tested compilers, performance is equal.

So, why not do manual unrolling everywhere? This would increase the (source) size of the kernels significantly and make them less readable (just think of nesting such constructs). It would also only be able to handle a limited number of cases. For example, if WPT in our example can be any power of 2, we get:

#define WPT 4 // value tuned specifically per device
__kernel void example() {
   #if WPT == 1
      float value0;
      value0 = 0.0;
   #endif
   #if WPT == 2
      float value0, value1;
      value0 = 0.0;
      value1 = 0.0;
   #endif
   #if WPT == 4
      float value0, value1, value2, value3;
      value0 = 0.0;
      value1 = 0.0;
      value2 = 0.0;
      value3 = 0.0;
   #endif
   #if WPT == 8
      (...)
   #endif
   // how far do we go? 8? 16? 32?
   // rest of the kernel
}

I hope the issue is clear. I see two solutions:

  1. Raise this issue with ARM and hope that they'll be able to implement such optimisations in the compiler. It might be good to do this anyway.
  2. Include a simple 'pre-processor' in CLBlast that performs the "array + loop" -> "registers + unroll" transformation automatically. I believe this doesn't have to be a complicated piece of software, since it can be made to work specifically with CLBlast (e.g. assuming certain variable names for arrays); a rough sketch of the idea follows below.
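
To make the second option concrete, below is a minimal sketch of the array-to-register promotion step, written in C++ since CLBlast's host code is C++. This is purely illustrative and not CLBlast's actual implementation: the function name PromoteArrayToRegisters is hypothetical, and it assumes the loop has already been unrolled so that every array access uses an integer literal index.

// Toy sketch of the "array + loop" -> "registers + unroll" promotion step
// (hypothetical, not CLBlast code). Assumes tuning parameters such as WPT
// have already been substituted as integer literals and the loops have
// already been unrolled, so every access reads e.g. "values[2]".
#include <regex>
#include <string>

std::string PromoteArrayToRegisters(std::string source,
                                    const std::string& name, const int size) {
   // 1) Rewrite the declaration "float values[4]" into scalar declarations
   //    "float values_0, values_1, values_2, values_3"
   std::string scalars;
   for (int i = 0; i < size; ++i) {
      scalars += (i == 0 ? "" : ", ") + name + "_" + std::to_string(i);
   }
   source = std::regex_replace(
       source,
       std::regex("float\\s+" + name + "\\s*\\[\\s*" + std::to_string(size) + "\\s*\\]"),
       "float " + scalars);
   // 2) Rewrite every literal-index access such as "values[2]" into "values_2"
   source = std::regex_replace(source,
                               std::regex(name + "\\s*\\[\\s*(\\d+)\\s*\\]"),
                               name + "_$1");
   return source;
}

A real version would also need to unroll the loops first and guard against name clashes, but the string-rewriting core can stay this simple precisely because it only has to understand CLBlast's own kernels.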

Has anyone seen the same issue with another OpenCL compiler? Any thoughts perhaps on how to handle this issue nicely in CLBlast?

@psyhtest (Contributor) commented May 31, 2017

Using Collective Knowledge-based tuning of CLBlast, we've been able to achieve 12-13 GFLOPS on several Midgard-based devices (Chromebook 2, Odroid-XU3, Firefly-RK3399). This is about half of what ARM's hand-optimised code achieves on the same devices, so better compiler support should push the performance of autotuned CLBlast code up.

The SGEMM performance improvements have resulted in 3-4x improvements to the performance of Caffe; see, for example, our analysis for the Firefly-RK3399 board. On the quad-core Mali-T860 @ 800 MHz, the performance of Caffe improves by 3.5-4.2x for AlexNet, GoogleNet and SqueezeNet, reaching 3.6, 1.3 and 3.9 fps, respectively. (Using OpenBLAS on the dual-core Cortex-A72 @ 1800 MHz, the performance reaches 4.6, 1.8 and 7.5 fps, respectively.)

@amoghaUdupa commented Jun 14, 2017

I learnt that Qualcomm's OpenCL compiler doesn't support unrolling, so I tried manually unrolling the gemm_direct kernel's loops.
Before unrolling:

#pragma unroll
  for (int mia=0; mia<MWAD; ++mia) {
    #pragma unroll
    for (int kia=0; kia<KWAD; ++kia) {

      // Computes the indices for the global memory
      int mg = mia + la0*MWAD;
      int kg = kia + la1*KWAD;
      int idm = (a_transpose) ? mg + kwg : mg + GetGroupID0()*WGD;
      int idk = (a_transpose) ? kg + GetGroupID0()*WGD : kg + kwg;

      // Loads the data from global memory into the local memory
      real result = agms[idk*a_ld + idm + a_offset];
      if (a_conjugate) { COMPLEX_CONJUGATE(result); }
      alm[kg*(WGD + PADA) + mg] = result;
    }
  }

After unrolling:

const int mg0 = 0 + la0*MWAD;
const int kg0 = 0 + la1*KWAD;
const int idm0 = (a_transpose) ? mg0 + kwg : mg0 + GetGroupID0()*WGD;
const int idk0 = (a_transpose) ? kg0 + GetGroupID0()*WGD : kg0 + kwg;
const real result0 = agms[idk0*a_ld + idm0 + a_offset];
if (a_conjugate) { COMPLEX_CONJUGATE(result0); }
alm[kg0*(WGD + PADA) + mg0] = result0;

const int mg1 = 0 + la0*MWAD;
const int kg1 = 1 + la1*KWAD;
const int idm1 = (a_transpose) ? mg1 + kwg : mg1 + GetGroupID0()*WGD;
const int idk1 = (a_transpose) ? kg1 + GetGroupID0()*WGD : kg1 + kwg;
const real result1 = agms[idk1*a_ld + idm1 + a_offset];
if (a_conjugate) { COMPLEX_CONJUGATE(result1); }
alm[kg1*(WGD + PADA) + mg1] = result1;

//few more

However, there seems to be no improvement in performance. Is this the right way to go about it?

@CNugteren (Owner, Author) commented:

Yes, that is indeed unrolling, but you did it for a fixed value of KWAD/MWAD (understandably so), so it is not a solution to the general issue. The fact that you see no performance improvement might also be normal: unrolling does not necessarily improve performance. In some cases it might, in others it might not.

The original performance issue on ARM discussed above is related to unrolling in order to promote an array to registers, e.g. going from:

   float values[WPT];
   #pragma unroll
   for (int i = 0; i < WPT; ++i) {
      values[i] = 0.0;
   }

to:

   float value0, value1, value2, value3;
   value0 = 0.0;
   value1 = 0.0;
   value2 = 0.0;
   value3 = 0.0;

This kind of case does not seem to be present in your snippet.

@CNugteren (Owner, Author) commented:

It's been a while, but I finally started working on a kernel pre-processor to 1) unroll loops and 2) apply array-to-register promotion. Work is ongoing in the kernel_preprocessor branch. I'll report back when everything is implemented and the first performance results are available.
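
To make those two steps concrete, step 1 (unrolling a loop with a literal trip count) could in principle be as simple as the following toy sketch; the helper name UnrollLoop is hypothetical and this is not the code in the kernel_preprocessor branch, which has to handle many more cases.

// Toy sketch of step 1: unroll "for (int i = 0; i < N; ++i) { body }" with a
// literal trip count by pasting N copies of the body, substituting the loop
// variable. Hypothetical helper, not the actual branch code.
#include <regex>
#include <string>

std::string UnrollLoop(const std::string& loop_var, const int trip_count,
                       const std::string& body) {
   std::string result;
   for (int i = 0; i < trip_count; ++i) {
      // Replace whole-word occurrences of the loop variable with the literal
      result += std::regex_replace(body, std::regex("\\b" + loop_var + "\\b"),
                                   std::to_string(i));
   }
   return result;
}

Feeding the result of step 1 into an array-to-register rewrite like the sketch earlier in this thread then yields exactly the manually unrolled form shown at the top of this issue.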

@CNugteren (Owner, Author) commented Dec 10, 2017

The kernel_preprocessor implementation is now finished and enabled by default for ARM Mali and Qualcomm Adreno GPUs. I've done a first test running the tuner on a Mali T628 MP6; here are the results of tuning GEMM for m=n=k=512:

It seems that in most cases there are significant gains; however, this is not always the case. I also ran the new pre-processor for m=n=k=1024, for which I attained 10 GFLOPS as the best result.

@psyhtest and others: could you also give the kernel_preprocessor branch a try if you have time? You can play with line 86 in src/utilities/compile.cpp to disable/enable the new pre-processor.

@sivagnanamn Perhaps you can also have a go with this branch on your Qualcomm Adreno system?

@sivagnanamn (Contributor) commented:

@CNugteren Thank you.

I did a brief test with M=32, N=50176, K=144 on a Qualcomm Adreno 330:

############### With kernel pre-processor
"best_kernel": "Xgemm",
  "best_time": "65.87",
  "best_parameters": "KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 PRECISION=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=1",
############### Without kernel pre-processor ==> Result from old tuner json file
"best_kernel": "Xgemm",
  "best_time": "72.1",
  "best_parameters": "KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 PRECISION=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=4",

With this new branch, my overall inference time was reduced by ~50-60 ms on the Adreno 330.

@CNugteren (Owner, Author) commented:

Hmmm, so that gain is really minimal unfortunately :( That means we'll have to investigate Adreno further; let's hope the results on ARM Mali are more encouraging.

@sivagnanamn (Contributor) commented:

Mali T760 - tuner results using the kernel_preprocessor branch:
CLBlast_Tuner_JSONs_kernel_preprocessor.zip

###### Without kernel pre-processor
  "kernel_family": "xgemm_1",
  "precision": "32",
  "best_kernel": "Xgemm",
  "best_time": "383.12",
  "best_parameters": "KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4"
######## With kernel pre-processor
  "kernel_family": "xgemm_1",
  "precision": "32",
  "best_kernel": "Xgemm",
  "best_time": "191.41",
  "best_parameters": "KWG=32 KWI=2 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 PRECISION=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2"

@CNugteren The Mali results are far better than with the previous release of CLBlast.

Could you please share your thoughts about Qualcomm: is there room for improvement on Qualcomm-based GPUs?

@fvella commented Dec 13, 2017

Hi Cedric,
we ran the client with the new pre-processor on both the Odroid (Mali-T628) and the Firefly (Mali-T860) and got improvements of 20% and 15% respectively for m=n=k=1024.
I am also running the tuner; results coming soon.
In the meantime, using the default configuration:

  • Odroid: 13.5 GFLOPS
  • Firefly: 14.5 GFLOPS

@CNugteren (Owner, Author) commented:

@sivagnanamn: Many thanks for testing and re-running the tuner on the Mali T760. I will throw the old results out of the database and replace them with the new ones. As for the Qualcomm issue, apparently that's unrelated; I've opened a separate issue for that: #228.

@fvella Thanks for testing on those two devices. What are the expected performance numbers in GFLOPS, e.g. for the ARM reference implementation? If you re-run the tuner, perhaps you'll get even more gains. I propose to remove all old Mali tuning results from the database. If you could re-run all the tuners for the T860, that would be great! I can run them myself for the T628.

@CNugteren (Owner, Author) commented:

FYI, this branch is now merged into master with new tuning results for Mali T760 and T628.

@fvella commented Dec 18, 2017

Hi Cedric,
with our tuner we got 20 GFLOPS on the T860!
Below is the configuration:

  "device_compute_units": "4", 
  "device_core_clock": "200", 
  "device_type": "GPU", 
  "device_vendor": "ARM", 
  "kernel": "xgemm", 
  "post_processed": "yes", 
  "precision": "32", 
  "statistics": {
    "best_configuration": {
      "GFLOPS": 20.232748075636668, 
      "kernel": "Xgemm", 
      "parameters": {
        "KWG": 16, 
        "KWI": 1, 
        "MDIMA": 8, 
        "MDIMC": 8, 
        "MWG": 32, 
        "NDIMB": 16, 
        "NDIMC": 16, 
        "NWG": 64, 
        "PRECISION": 32, 
        "SA": 1, 
        "SB": 1, 
        "STRM": 1, 
        "STRN": 0, 
        "VWM": 4, 
        "VWN": 4
      }, 
      "time": 106.139
    }, 
    "default_configuration": {}, 
    "default_family": {}
  }
}

The Arm Compute Library should take ~75.00 ms. I will check later.

@CNugteren (Owner, Author) commented:

Good to hear we are now close to ARM's reference implementation. It would be great if you could share all the JSON tuner results for the T860 (make alltuners); then I can integrate those into CLBlast.

@CNugteren (Owner, Author) commented:

Just to confirm: on the Mali T628 the results are much better than before, now achieving ~11 GFLOPS in single-precision and double that in half-precision:
hgemm_plot_tight.pdf

I'm closing this issue now, because the pre-processor works around the shortcomings of ARM's OpenCL compiler. Further tuning results are always welcome; they can be shared in #1. Thanks all for testing!
