Power-of-2 kernels *much* slower than random dimensional kernels #53

Closed
blueberry opened this issue May 3, 2016 · 18 comments
@blueberry

blueberry commented May 3, 2016

I have seen this with multiple builds of CLBlast over the past week or so. For reference, my hand-written sgemm for 8192x8192 runs in 600 ms. As you can see, on an R9 290X with 4 GB of RAM, the 8000x8000 kernel runs as quickly as expected, while the 8192x8192 kernel is surprisingly slow. It is not due to particular database options, since with the defaults it runs a couple of times slower still.

(with-default
    (with-engine clblast-single *command-queue*
      (facts
       "Matrix-matrix multiplication. Matrices of 8192x8192 (268 MB) are usually
demanding enough."
       (let [cnt 8000]
         (with-release [host-a (sge cnt cnt (range (* cnt cnt)))
                        host-b (sge cnt cnt (range (* cnt cnt)))
                        host-c (sge cnt cnt (range (* cnt cnt)))
                        gpu-a (transfer! host-a (clge cnt cnt))
                        gpu-b (transfer! host-b (clge cnt cnt))
                        gpu-c (transfer! host-c (clge cnt cnt))]

           (println "CPU:")
           (time (mm! 3 host-a host-b 2 host-c)) => host-c
           (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish! *command-queue*))
           (println "GPU:")
           (time (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish! *command-queue*))) => truthy)))))
CPU:
"Elapsed time: 16537.271501 msecs"
GPU:
"Elapsed time: 280.819371 msecs"
true
u.n.e.g.tutorial-opencl-test>
(with-default
    (with-engine clblast-single *command-queue*
      (facts
       "Matrix-matrix multiplication. Matrices of 8192x8192 (268 MB) are usually
demanding enough."
       (let [cnt 8192]
         (with-release [host-a (sge cnt cnt (range (* cnt cnt)))
                        host-b (sge cnt cnt (range (* cnt cnt)))
                        host-c (sge cnt cnt (range (* cnt cnt)))
                        gpu-a (transfer! host-a (clge cnt cnt))
                        gpu-b (transfer! host-b (clge cnt cnt))
                        gpu-c (transfer! host-c (clge cnt cnt))]

           (println "CPU:")
           (time (mm! 3 host-a host-b 2 host-c)) => host-c
           (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish! *command-queue*))
           (println "GPU:")
           (time (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish! *command-queue*))) => truthy)))))
CPU:
"Elapsed time: 17800.345095 msecs"
GPU:
"Elapsed time: 1453.541557 msecs"
true
@blueberry
Author

blueberry commented May 3, 2016

A few more dimensions:

For reference, the 290X has 44 compute units while the 270X has 20 CUs; the architecture and memory specifications are otherwise the same.

  1. R9 290X 4 GB GDDR5 (sampled with dozens of measurements)
    8000: 276 ms
    8192: 1460 ms
    4000: 39 ms
    4192: 50 ms
    2000: 5.80 ms
    2048: 5.15 ms
  2. R9 270X 4 GB GDDR5
    8000: 626 ms
    8192: CL_MAP_FAILURE
    8100: 13,357 ms
    4000: 573 ms
    4096: 623 ms
    4200: 95 ms
    2000: 15 ms
    2048: 15 ms

@blueberry
Author

blueberry commented May 3, 2016

BTW, that 8000 result would place CLBlast on the R9 290X at 3.7 TFLOPS. That seems rather fast, even in comparison with cuBLAS, considering this is an affordable card.

@blueberry
Author

blueberry commented May 3, 2016

Another update: it is definitely related to the tuning parameters, at least to a degree. If I copy the Tahiti parameters to Hawaii, the speeds are consistent, although not particularly fast:
8000: 632 ms
8192: 577 ms
EDIT: not at all, this measured MY engine, unrelated to CLBlast

@blueberry
Author

EDIT: The original measurements are still here. When I experimented with various parameters, I measured my own engine, that's why the results were always the same :)

@blueberry
Author

blueberry commented May 3, 2016

So, the real measurements for 8000/8192/10000, in milliseconds:
With the Tahiti parameters copied: 400 / 3100 / 800
With the Pitcairn parameters copied: 580 / 7600 / 1150

After playing the human tuner a bit more, I discovered that the absolute speeds vary, but the proportions are always the same: 2^N kernels and some sizes near them are very slow, while the rest are very fast.
So the original Hawaii tuning parameters were the fastest after all; the trouble is with a few slow regions.

For example, an 11,111 x 11,111 matrix computes in 742 ms.
13,000 x 13,000 throws CL_MAP_FAILURE.
8000 x 8000 is at 280 ms.
In the neighbourhood of 8192, the slow zone starts at 8129 and ends at 8192; 8193 is at 308 ms!
This looks to me like a case of calling the wrong kernel (note that the slow region is 64 wide; perhaps in those cases there are some if/then/else branches that do more harm than good). I thought 8192 would be THE case for showing off speed...

@blueberry
Author

A quick note: sgemv does not have this problem.

@CNugteren
Owner

Thanks for all the info. There seems to be an issue with GEMM on AMD GPUs, perhaps also related to the DGEMM bug you found for specific tuning parameters. The curious thing is that the tuner automatically verifies the results and should discard configurations that produce errors, so it seems strange that some configurations do lead to errors.

Also, about your performance problems: are you sure you measure properly, i.e. waiting for the event to complete? If you compile the CLBlast performance tests (-DTESTS=ON) you'll be able to run the measurements from a single included binary, or even run the provided R script. Have you looked at the results for other AMD hardware included in doc/performance? My own M370X also suffers at larger sizes.

By the way, the tuner tunes for 1024x1024x1024 by default, so it could very well be the case that certain larger sizes are slower. But it is strange that power-of-2 sizes are slower than non-power-of-2 sizes. I'll investigate this in more detail later.

@blueberry
Author

Yes, I am waiting for the whole queue to complete with (finish!).

I couldn't compile with -DTESTS=ON, since CLBlast does not seem to like the cblas headers provided by my ATLAS 3.10.2 installation (I have set CBLAS_ROOT, and even CBLAS_LIBRARIES and CBLAS_INCLUDE_DIRS). I include the beginning of the wall of errors that I get below. I also couldn't install the clBLAS package provided in Arch's AUR. Is there a way to compile just the performance tests and skip the dependencies (I guess not, judging by an earlier issue)?

Now, looking at the performance report for the M370X, I can see a similar drop for large powers of 2, so at least I know that it is not due to a mistake I made in building or using the library.
What still puzzles me is that only the power-of-2 neighbourhoods are affected; for other large matrices the speed is superb!

So, as long as you can reproduce and investigate the problem, I am OK with living with this issue for some time :) My concern was that this was something that appears randomly and could go unsolved...

Scanning dependencies of target clblast_test_xgemm
[100%] Building CXX object CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o
[100%] Linking CXX executable clblast_test_xgemm
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXscal(unsigned long, std::complex<float>, std::vector<std::complex<float>, std::allocator<std::complex<float> > >&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x470): undefined reference to `cblas_cscal(int, void const*, void*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXscal(unsigned long, std::complex<double>, std::vector<std::complex<double>, std::allocator<std::complex<double> > >&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x500): undefined reference to `cblas_zscal(int, void const*, void*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXaxpy(unsigned long, std::complex<float>, std::vector<std::complex<float>, std::allocator<std::complex<float> > > const&, unsigned long, unsigned long, std::vector<std::complex<float>, std::allocator<std::complex<float> > >&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x674): undefined reference to `cblas_caxpy(int, void const*, void const*, int, void*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXaxpy(unsigned long, std::complex<double>, std::vector<std::complex<double>, std::allocator<std::complex<double> > > const&, unsigned long, unsigned long, std::vector<std::complex<double>, std::allocator<std::complex<double> > >&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x730): undefined reference to `cblas_zaxpy(int, void const*, void const*, int, void*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXdot(unsigned long, std::vector<float, std::allocator<float> >&, unsigned long, std::vector<float, std::allocator<float> > const&, unsigned long, unsigned long, std::vector<float, std::allocator<float> > const&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x78c): undefined reference to `cblas_sdot(int, float const*, int, float const*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXdot(unsigned long, std::vector<double, std::allocator<double> >&, unsigned long, std::vector<double, std::allocator<double> > const&, unsigned long, unsigned long, std::vector<double, std::allocator<double> > const&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x7cc): undefined reference to `cblas_ddot(int, double const*, int, double const*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXnrm2(unsigned long, std::vector<float, std::allocator<float> >&, unsigned long, std::vector<float, std::allocator<float> > const&, unsigned long, unsigned long)':

@blueberry
Author

And another thing: can you reproduce fast computations for non-power-of-2 large matrices on M370X?

@CNugteren
Owner

Perhaps a separate issue, but I've just made a commit to development to fix linking with ATLAS. The problem was that I was trying to link against libatlas.so, whereas this should have been libcblas.so (also provided by ATLAS). I've tested on my own Linux system and it works.

@CNugteren
Owner

CNugteren commented May 4, 2016

I also did some experiments on Tahiti (HD7970) with SGEMM, and I observe the same. There is a drop in performance for larger power-of-2 kernels:

                               | <--    CLBlast  --> | <--   clBLAS    --> |
        m;        n;        k;      ms_1; GFLOPS_1;        ms_2; GFLOPS_2;  
      512;      512;      512;      1.94;    138.1;        1.01;    266.5;  
       1K;       1K;       1K;      2.39;    898.6;        1.67;   1284.6;  
     1536;     1536;     1536;      3.73;   1945.0;        4.55;   1593.9;  
       2K;       2K;       2K;      8.30;   2069.2;       10.23;   1679.8;  
     2560;     2560;     2560;     17.50;   1917.0;       19.32;   1736.7;  
       3K;       3K;       3K;     39.02;   1486.1;       33.05;   1754.6;  
     3584;     3584;     3584;     61.00;   1509.4;       52.07;   1768.2;  
       4K;       4K;       4K;    154.20;    891.3;       77.64;   1770.2;  
     4608;     4608;     4608;    148.50;   1317.8;      111.78;   1750.7;  
       5K;       5K;       5K;    216.57;   1239.5;      155.67;   1724.4;  
     5632;     5632;     5632;    289.15;   1235.6;      217.77;   1640.7;  
       6K;       6K;       6K;    481.91;    962.5;      283.38;   1636.9;  
     6656;     6656;     6656;    505.95;   1165.6;      365.78;   1612.3;  
       7K;       7K;       7K;    681.96;   1080.1;      482.15;   1527.7;  
     7680;     7680;     7680;    840.51;   1077.9;      596.21;   1519.5;  
       8K;       8K;       8K;   1728.65;    636.1;      744.92;   1476.0;  
     8704;     8704;     8704;   1314.90;   1003.0;      930.35;   1417.6;  
       9K;       9K;       9K;   1637.18;    956.2;     1145.12;   1367.1;  
     9728;     9728;     9728;   1923.55;    957.2;     1410.96;   1304.9;  
      10K;      10K;      10K;   2811.10;    763.9;     1647.62;   1303.4;  

And there is also much more stable performance for non-power-of-2 kernels:

                               | <--    CLBlast  --> | <--   clBLAS    --> |
        m;        n;        k;      ms_1; GFLOPS_1;        ms_2; GFLOPS_2;  
      500;      500;      500;      2.12;    117.7;        0.93;    268.5;  
     1000;     1000;     1000;      3.75;    533.6;        2.94;    679.6;  
     1500;     1500;     1500;      4.89;   1379.1;        4.56;   1479.1;  
     2000;     2000;     2000;      9.90;   1616.1;        9.78;   1635.4;  
     2500;     2500;     2500;     22.26;   1404.0;       19.33;   1617.0;  
     3000;     3000;     3000;     40.93;   1319.3;       31.65;   1706.3;  
     3500;     3500;     3500;     62.50;   1371.9;       49.56;   1730.4;  
     4000;     4000;     4000;     91.16;   1404.1;       73.26;   1747.1;  
     4500;     4500;     4500;    136.27;   1337.4;      105.83;   1722.2;  
     5000;     5000;     5000;    182.58;   1369.3;      148.86;   1679.5;  
     5500;     5500;     5500;    228.96;   1453.3;      196.24;   1695.6;  
     6000;     6000;     6000;    283.70;   1522.7;      266.69;   1619.9;  

But I am also suspecting incorrect behaviour for larger matrices of specific sizes. For example, lowering the value of K to 1024 with M = N = 4096 yields the following:

                               | <--    CLBlast  --> | <--   clBLAS    --> |
        m;        n;        k;      ms_1; GFLOPS_1;        ms_2; GFLOPS_2;  
       4K;       4K;       1K;     38.44;   3575.8;       19.89;   6911.5;  

This is beyond the specs of the device. So either I am computing the GFLOPS number wrongly or both CLBlast and clBLAS cause a crash and faulty results.

Thus, there is definitely an issue on AMD hardware. Some conclusions so far:

  • There are no faulty kernels for small sizes: the unit tests included in CLBlast work fine for me on a Tahiti GPU. Could you also run them on your device?
  • It seems that the kernels on their own work fine, at least for the settings used while tuning (1024 x 1024 x 1024). Could you see what happens if you tune for larger matrices, in particular larger values of K? Let's say 4K by 4K by 4K? Do you still see this performance drop?
  • CLBlast doesn't have a special power-of-2 kernel. Actually, there are some pre-processing kernels (copy, pad, pad_transpose, transpose) that make sure all matrices are a power of 2: the main GEMM kernel only handles powers of 2. Perhaps something is faulty in these pre-processing kernels or in the CPU code related to them? I'll investigate this.

@CNugteren
Owner

This is beyond the specs of the device. So either I am computing the GFLOPS number wrongly or both CLBlast and clBLAS cause a crash and faulty results.

This was actually a small mistake in computing the GFLOPS number and has been fixed in the development branch.

Now back to the issue at hand.

I was able to reproduce the behaviour on the R9 M370X GPU: it also shows poor performance for 4K by 4K by 4K, but performance picks up again for slightly larger sizes. So it is an issue occurring across quite a range of AMD hardware.

I also double-checked: the performance drop is in the main GEMM kernel, not in one of the supporting kernels which I mentioned in the above post. And this main GEMM kernel is exactly the same for powers of 2 as for other values; it might not even be re-compiled if it is already in the cache. So the only thing that changes is the matrix size. Perhaps something unfortunate is going on where all memory accesses go through a single memory controller for large powers of 2? I couldn't pinpoint the cause in the code, though.

I see two possible solutions:

  • In case of a large power-of-2 GEMM, the padding is extended slightly so that the matrices act as non-powers-of-2. This is simple to implement, but we'll have to verify that future AMD devices still exhibit this behaviour.
  • Tuning for 1K by 1K by 1K is not representative of larger sizes: we should tune for larger matrices instead. Problem: this is a lot slower, since some configurations are extremely slow. It doesn't require any code changes, but will require re-tuning on the AMD devices. And what if it significantly degrades performance for smaller matrices?

@blueberry
Author

blueberry commented May 8, 2016

Since you can reproduce the bug, whichever solution you think is most appropriate is good enough for me.
I also think that, if there are no bugs in the kernels, the most obvious source of the problem is that the memory alignment puts too much stress on one channel. As far as I know, all GCN devices have similar characteristics, so any solution that works on the three cards we have should work on anything of the same or similar architecture. If/when we find a better or more universal solution, we can switch to it.

AMD recommends using staggered offsets http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472340

I can also try to tune to larger sizes and see how that goes.
I guess there is a switch when I start the tuner to choose the matrix size?

@CNugteren
Owner

Yeah, that's what I meant, I guess. This used to be called 'partition camping' on the first CUDA-capable GPUs, but it hasn't been a problem since then. Apparently it still is on the AMD architecture (which I am not too familiar with). I'll read the docs and try to understand the problem fully, because apparently some configurations of the GEMM kernel don't have this issue.

The tuner binaries have command-line arguments indeed. When you run them you'll see what the options are, but I believe something like ./clblast_tuner_xgemm -n 4096 -m 4096 -k 4096 will do the trick.

@CNugteren
Owner

CNugteren commented May 15, 2016

I have just added support for staggered offsets as you (and AMD) suggested. The changes are in commit 9065b34. I still see minor performance drops, but not as bad as before. So I guess that solves this issue. Can you verify on your machine perhaps? Ideally we should also re-run the tuners for the AMD devices...

I also cannot reproduce the performance numbers from the included graphs on AMD hardware for SGEMM, so I'll need to investigate performance further; it is apparently not at peak for AMD yet. Update: the latest version of development contains updated tuning results for my test device (Radeon M370X) and an updated performance graph (in doc/performance). It shows performance similar to clBLAS and no performance drop at 4K or 8K matrix dimensions.

I also still see occasional failures and incorrect results for specific tuning parameters, so there is still a bug somewhere. But we can continue that investigation in #51.

@blueberry
Author

I have just tried it, and now there is no performance drop on my card either! I'd say this is solved! Thank you :)

Maybe it'd be a good time for 0.7.1, or even 0.8.0, since this is a very important fix for AMD cards, and @gpu could perhaps then release JOCLBlast 0.7.1?

@gpu
Contributor

gpu commented May 16, 2016

If CLBlast is updated to a new version, I'll also update JOCLBlast. If this is not considered "worth a new version", I could still integrate it, maybe as "0.7.0b" ...

@CNugteren
Owner

Good to hear it is fixed. It makes sense to release 0.7.1 indeed, especially considering that the other correctness bug you reported (#51) is now fixed. I'll go ahead and fix the Visual Studio DLL issue first (#59) and afterwards make a 0.7.1 release.
