Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dgemm bug for certain matrix dimensions where Sgemm works OK #51

Closed
blueberry opened this issue May 2, 2016 · 18 comments
Closed

Dgemm bug for certain matrix dimensions where Sgemm works OK #51

blueberry opened this issue May 2, 2016 · 18 comments

Comments

@blueberry
Copy link

blueberry commented May 2, 2016

My matrix tests revealed a very, very subtle issue with Dgemm:
Most of my tests work, but this one passes for floats and falls for doubles:
A: 2x3 column matrix ((1 2) (3 4) (5 6))
B: 3x2 column ((1 2 3) (4 5 6))
C: 2x2 column ((1 2) (3 4))

xgemm 2.0 * AB + 3.0 * C should update C to: ((47 62) (107 140))
and Sgemm does!
but Dgemm gives ((47 62) (9 12))

I would have sent you a test case in code, but I am doing this in Clojure, and the code is not more complex than the example I've given here.

I guess that the problem is that you do some pre-transformations and paddings for some matrices, so in some cases, such as this one, some operations are done in "wrong" places in memory.

Or maybe I am not using the API well, but I doubt that, given that all other tests work well, and, what especially puzzles me, that Sgemm works all right in the same example!

@CNugteren
Copy link
Owner

Curious! SGEMM and DGEMM follow the same code path, the only differ through a template parameter in C++ and some defines in the OpenCL kernel.

Have you run the included CLBlast tests? Do they pass for GEMM?

@CNugteren
Copy link
Owner

CNugteren commented May 2, 2016

Are you sure you copy back the full matrix C? Perhaps you are seeing old results: the double-precision version requires twice as wide CPU-GPU copies.

I have just modified the sgemm.c example and it produces the results which you expect. In other words: there is nothing wrong on my system.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <clblast_c.h>
int main(void) {

  // OpenCL platform/device settings
  const size_t platform_id = 0;
  const size_t device_id = 0;

  // Example GEMM arguments
  const size_t m = 2;
  const size_t n = 2;
  const size_t k = 3;
  const double alpha = 2.0f;
  const double beta = 3.0f;
  const size_t a_ld = k;
  const size_t b_ld = n;
  const size_t c_ld = n;

  // Initializes the OpenCL platform
  cl_uint num_platforms;
  clGetPlatformIDs(0, NULL, &num_platforms);
  cl_platform_id* platforms = (cl_platform_id*)malloc(num_platforms*sizeof(cl_platform_id));
  clGetPlatformIDs(num_platforms, platforms, NULL);
  cl_platform_id platform = platforms[platform_id];

  // Initializes the OpenCL device
  cl_uint num_devices;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
  cl_device_id* devices = (cl_device_id*)malloc(num_devices*sizeof(cl_device_id));
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);
  cl_device_id device = devices[device_id];

  // Creates the OpenCL context, queue, and an event
  cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);
  cl_event event = NULL;

  // Populate host matrices with some example data
  double* host_a = (double*)malloc(sizeof(double)*m*k);
  double* host_b = (double*)malloc(sizeof(double)*n*k);
  double* host_c = (double*)malloc(sizeof(double)*m*n);
  int t;
  t = 1; for (size_t i=0; i<k; ++i) { for (size_t j=0; j<m; ++j) { host_a[i*m + j] = t; printf("%d ", t); ++t; } printf("\n"); }
  t = 1; for (size_t i=0; i<n; ++i) { for (size_t j=0; j<k; ++j) { host_b[i*k + j] = t; printf("%d ", t); ++t; } printf("\n"); }
  t = 1; for (size_t i=0; i<m; ++i) { for (size_t j=0; j<n; ++j) { host_c[i*n + j] = t; printf("%d ", t); ++t; } printf("\n"); }

  // Copy the matrices to the device
  cl_mem device_a = clCreateBuffer(context, CL_MEM_READ_WRITE, m*k*sizeof(double), NULL, NULL);
  cl_mem device_b = clCreateBuffer(context, CL_MEM_READ_WRITE, n*k*sizeof(double), NULL, NULL);
  cl_mem device_c = clCreateBuffer(context, CL_MEM_READ_WRITE, m*n*sizeof(double), NULL, NULL);
  clEnqueueWriteBuffer(queue, device_a, CL_TRUE, 0, m*k*sizeof(double), host_a, 0, NULL, NULL);
  clEnqueueWriteBuffer(queue, device_b, CL_TRUE, 0, n*k*sizeof(double), host_b, 0, NULL, NULL);
  clEnqueueWriteBuffer(queue, device_c, CL_TRUE, 0, m*n*sizeof(double), host_c, 0, NULL, NULL);

  // Call the DGEMM routine.
  StatusCode status = CLBlastDgemm(kRowMajor, kNo, kNo,
                                   m, n, k,
                                   alpha,
                                   device_a, 0, a_ld,
                                   device_b, 0, b_ld,
                                   beta,
                                   device_c, 0, c_ld,
                                   &queue, &event);

  // Wait for completion
  clWaitForEvents(1, &event);

  // Example completed. See "clblast_c.h" for status codes (0 -> success).
  clEnqueueReadBuffer(queue, device_c, CL_TRUE, 0, n*m*sizeof(double), host_c, 0, NULL, NULL);
  printf("Completed SGEMM with status %d\n", status);
  for (size_t i=0; i<m*n; ++i) { printf("%.3lf\n", host_c[i]); }

  // Clean-up
  free(platforms);
  free(devices);
  free(host_a);
  free(host_b);
  free(host_c);
  clReleaseMemObject(device_a);
  clReleaseMemObject(device_b);
  clReleaseMemObject(device_c);
  clReleaseCommandQueue(queue);
  clReleaseContext(context);
  return 0;
}

@blueberry
Copy link
Author

blueberry commented May 2, 2016

I am pretty sure that there is no copying issue, since I use it through the library that takes care of that, which works for most examples, and also with the custom opencl blas engine I already had. Of course, I cannot be completely sure, so I'll start from the example you've provided.

How do I compile the code from the samples? Is there some hook in (c)make, or these are standalone that I should compile with vanilla gcc? And, can you please share what would be the parameters to that call? I have already installed libclblast, and it is visible globally on my system.

I didn't run the tests, since I tried to avoid a pretty messy clblas build and installation. Maybe it is not THAT complicated, but it looks like it would be from the docs. That's the last resort :)

@CNugteren
Copy link
Owner

CNugteren commented May 2, 2016

If you want to use the CMake infrastructure, you can just set -DSAMPLES=ON and then replace the samples/sgemm.c file with the above pasted code. Or you can compile it with something like:

g++ -std=c++11 sample.c -I/path/to/clblast/include -L/path/to/any/libs/not/in/default/path -lOpenCL -lclblast -o sample

About the tests: they now also work without clBLAS, they can be linked against a normal CPU BLAS installation as well.

@blueberry
Copy link
Author

Yes, I can confirm that it works well on my machine, too, when called from C. Thanks.

@blueberry
Copy link
Author

blueberry commented May 3, 2016

@CNugteren Cross-post from JOCLBlast. The problem also appears in CLBlast

Your example works... on card 0 (AMD R9 270X), but on card 1 (AMD R9 290X) the problem is the same as when I called it from Clojure. It is not due to the defective card, because my third card (also 290X) has the same error, and my custom dgemm kernels (OpenCL 2.0) compute the result right.

I tried Cedric's example, and it also has the same error on cards 1 and 2, while computes well on card 0 (which is the cheaper one)!

So, the problem is either some subtle bug in CLBlast or some problem in AMD's Hawaii (but I doubt since the kernels that I wrote following Cedric's blog work well).

Here is the output of your example:

CL_DEVICE_NAME: Hawaii
A:
 11.0  12.0  13.0  14.0  15.0 
 21.0  22.0  23.0  24.0  25.0 
 31.0  32.0  33.0  34.0  35.0 
 41.0  42.0  43.0  44.0  45.0 
B:
 11.0  12.0  13.0 
 21.0  22.0  23.0 
 31.0  32.0  33.0 
 41.0  42.0  43.0 
 51.0  52.0  53.0 
C:
 11.0  12.0  13.0 
 21.0  22.0  23.0 
 31.0  32.0  33.0 
 41.0  42.0  43.0 
Result of C = 10.0 * A * B + 20.0 * C:
21370.0 240.0 260.0 
37070.0 440.0 460.0 
620.0 640.0 660.0 
820.0 840.0 860.0 

You'll notice that it computes beta * C, but leaves out alpha * AB

Cedric's example from C:

./clblast_sample_sgemm
1 2 
3 4 
5 6 
1 2 3 
4 5 6 
1 2 
3 4 
Completed SGEMM with status 0
47.000
6.000
107.000
12.000

@CNugteren
Copy link
Owner

CNugteren commented May 3, 2016

Thanks for all the info! Unfortunately I can't reproduce it myself, so you'll have to help me debug it. A couple of possible sources for the bug:

  • There is a subtle (e.g. synchronisation barrier) bug in the GEMM kernel (or one of its supporting kernels). Perhaps this bug only manifests with certain devices.
  • There is a compiler bug, causing it to generate incorrect code for your Hawaii device.
  • There is a bug only in a particular configuration of the GEMM kernel (or one of its supporting kernels). Because of the tuning, all devices use different parameters. Perhaps one of those combinations is not correct and should not be used (although the tuner should report this in theory).

To get some more insight, I would ask you to run the GEMM tests included with CLBlast. They will show in which cases it works and in which it doesn't (e.g. requiring a pre-processing transpose matrix kernel or not). Afterwards, you can play around with the values in the CLBlast database, for example by changing include/internal/database/xgemm.h (line 144). You'll see that Hawaii and Pitcairn use different values, as do the 32-bit and 64-bit versions. Perhaps you can copy the Pitcairn and/or the 32-bit values in and what helps and what doesn't.

@blueberry
Copy link
Author

A couple days ago I have already reported problems with Hawaii tuning, but I didn't get any reports on which combinations fail. I only got segmentation faults. I'll try editing the database and will report the results.

@blueberry
Copy link
Author

blueberry commented May 3, 2016

Replacing the tuning parameters for Dgemm in xgemm.h solved the problem:
I tried to use both sgemm Hawaii and dgemm Pitcairn settings for dgemm Hawaii, and both worked, and with the same speed. Probably because these have very similar architecture, private and local memory limits.
Can you please update xgemm.h in the CLBlast official database by hand?

As for the performance, you can see some measurements for sgemm (they are excellent and poor at the same time :) in #53, and this is not related to this problem.

@CNugteren
Copy link
Owner

I could update the parameters, but that doesn't satisfy me. If the tuner produced parameters that lead to incorrect results, then something is wrong somewhere.

Unfortunately I don't have access to such a device myself, so could you perhaps run the CLBlast GEMM tests with the old parameters and see which of the tests fail? Some should pass at least: the tuner didn't notice any issues.

And then I am also interested to see whether just changing the SA and SB parameters to 0 already fixes it? To solve the issue I need to know what parameters are illegal for these devices.

@blueberry
Copy link
Author

blueberry commented May 4, 2016

OK. The tests build with the newest development branch. I'll send the test log with the current settings this evening, and will try to experiment with SA and SB, and to tune the library again and test that tomorrow evening.
EDIT: I have uploaded all suggested tests.

@blueberry
Copy link
Author

blueberry commented May 4, 2016

Hawaii and Pitcairn with the current development database:
tests.zip
EDIT: Hawaii Dgemm is not from development but the one that I copied from Sgemm.

@blueberry
Copy link
Author

In the previous results, Hawaii Dgemm parameters were not the official ones, but the ones that I copied from Sgemm.

Your hunch about parameters was all right: changing SA and SB to 0 gives correct results in the test case that originally failed for me.

Here are CLBlast test results with official Hawaii parameters with changed SA=0 and SB=0 (hawaii0.test)
And, with official Hawaii that was broken for Dgemm (hawaii1.test)

hawaii-develop-tests.zip

@CNugteren
Copy link
Owner

I cannot see any SGEMM or DGEMM failures in any of the tests you uploaded. Or am I missing something?

I do see some failures in other routines: that's not right! Have you noticed them as well? I now re-ran the tests on the AMD R9 M370X GPU, and it shows failures as well. I'll investigate them.

@CNugteren
Copy link
Owner

CNugteren commented May 8, 2016

I've fixed some tests that failed previously on my machine, I believe they should fix some of the errors you saw as well. Some failures fail on AMD GPUs because of specific GEMM parameters, e.g. SYRK and SYR2K. This seems to be related to your issue and remains still unsolved.

@CNugteren
Copy link
Owner

I've fixed a bug related to events not properly set in the GEMM routine in some cases, which had as result that clWaitForEvents wouldn't actually wait for completion of all of the GEMM kernels, leading to potentially incorrect results (716d7c). Could you perhaps try again and see if your bug is still present with the newest version of the development branch?

@blueberry
Copy link
Author

I tested with the development branch and now it works OK with my basic tests, with official dgemm Hawaii settings.

I couldn't compile the CLBlast tests, though. I got Error 2.
Also, tuning still fails for xgemm, so I didn't retune it, but the results I get for sgemm 8192 are now stellar (less than 300 ms).

Thank you very much for fixing this so quickly! :)

@CNugteren
Copy link
Owner

CNugteren commented May 16, 2016

Good to hear that correctness is fixed again! I'll close this issue.

Could you perhaps open a separate issue and post the compilation error you get when compiling the tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants