Power-of-2 kernels *much* slower than random dimensional kernels #53
A few more details: for reference, the 290X has 44 compute units, while the 270X has 20 CUs; the architecture and memory specifications are the same.
BTW, that 8000 result would place CLBlast on the R9 290X at 3.7 TFLOPS. Seems rather fast, even in comparison with cuBLAS, since this is an affordable card.
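(For reference, the TFLOPS figure follows from the standard 2*M*N*K flop count for GEMM divided by the runtime. A minimal sketch; the 0.28 s runtime below is an assumed example value consistent with the quoted 3.7 TFLOPS, not a measurement from this thread.)

```c
/* Sketch: the standard GEMM throughput figure is 2*M*N*K flops / runtime.
   The 0.28 s runtime is an assumed example value, not a number from this
   thread; it is merely consistent with the quoted ~3.7 TFLOPS. */
#include <stdio.h>

int main(void) {
    const double m = 8000.0, n = 8000.0, k = 8000.0;
    const double seconds = 0.28;                      /* assumed runtime */
    const double flops = 2.0 * m * n * k;             /* one multiply + one add */
    printf("%.2f TFLOPS\n", flops / seconds / 1e12);  /* prints ~3.66 */
    return 0;
}
```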
Another update: it is definitely related to the tuning parameters, at least to a certain degree. If I copy the Tahiti parameters to Hawaii, the speeds make sense, although they are not particularly fast:
EDIT: The original measurements are still here. When I experimented with various parameters, I was actually measuring my own engine, which is why the results were always the same :)
So, the real measurements for 8000/8192/10000, in milliseconds: after playing the human tuner a bit more, I discovered that the speed varies, but the proportions are always the same: 2^N kernels and some sizes near them are very slow, while the rest are super fast. For example, an 11111 x 11111 matrix computes in 742 ms.
And a quick note: SGEMV does not have this problem.
Thanks for all the info. There seems to be an issue with GEMM on AMD GPUs, perhaps also related to the DGEMM bug you found for specific tuning parameters. The curious thing is that the tuner automatically verifies the results and should not include a configuration if it produced errors, so it seems strange that some configurations do lead to errors.

Also, about your performance problems: are you sure you are measuring properly, i.e. waiting for the event to complete? If you compile the CLBlast performance tests (…) you can measure this yourself.

By the way, the tuner tunes for 1024x1024x1024 by default, so it could very well be the case that certain larger sizes are slower. But it is strange that power-of-2 is slower than non-power-of-2. I'll investigate this in more detail later.
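(For what it's worth, here is a minimal host-side sketch of what "waiting for the event to complete" looks like with the raw OpenCL API. It assumes the queue was created with CL_QUEUE_PROFILING_ENABLE; the GEMM call itself is left out, only the measurement pattern is shown.)

```c
/* Minimal sketch of timing enqueued OpenCL work correctly: block until the
   work has finished before reading the clock, then use the event's device
   timestamps. Assumes the queue was created with CL_QUEUE_PROFILING_ENABLE. */
#include <CL/cl.h>
#include <stdio.h>

void report_kernel_time(cl_command_queue queue, cl_event event) {
    clFinish(queue);                       /* or clWaitForEvents(1, &event) */

    cl_ulong start = 0, end = 0;           /* nanoseconds, device clock */
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
}
```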
Yes, I am waiting for the whole queue to complete with (finish!). I couldn't compile with -DTESTS=ON, since CLBlast does not seem to like the cblas headers provided by my ATLAS 3.10.2 installation (I have set CBLAS_ROOT, and even CBLAS_LIBRARIES and CBLAS_INCLUDE_DIRS). I include the beginning of the wall of errors that I get below. I also couldn't install the clBLAS package provided in Arch's AUR. Is there a way to compile just the performance tests and skip the dependencies (I guess not, judging by an earlier issue)?

Now, looking at the performance report for the M370X, I can see a similar drop for large powers of 2, so at least I know that it is not due to a mistake I made in building or using the library. So, as long as you can reproduce and investigate the problem, I am OK with living with this issue for some time :) My concern was that this was something that appears randomly and could go unsolved...

Scanning dependencies of target clblast_test_xgemm
[100%] Building CXX object CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o
[100%] Linking CXX executable clblast_test_xgemm
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXscal(unsigned long, std::complex<float>, std::vector<std::complex<float>, std::allocator<std::complex<float> > >&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x470): undefined reference to `cblas_cscal(int, void const*, void*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXscal(unsigned long, std::complex<double>, std::vector<std::complex<double>, std::allocator<std::complex<double> > >&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x500): undefined reference to `cblas_zscal(int, void const*, void*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXaxpy(unsigned long, std::complex<float>, std::vector<std::complex<float>, std::allocator<std::complex<float> > > const&, unsigned long, unsigned long, std::vector<std::complex<float>, std::allocator<std::complex<float> > >&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x674): undefined reference to `cblas_caxpy(int, void const*, void const*, int, void*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXaxpy(unsigned long, std::complex<double>, std::vector<std::complex<double>, std::allocator<std::complex<double> > > const&, unsigned long, unsigned long, std::vector<std::complex<double>, std::allocator<std::complex<double> > >&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x730): undefined reference to `cblas_zaxpy(int, void const*, void const*, int, void*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXdot(unsigned long, std::vector<float, std::allocator<float> >&, unsigned long, std::vector<float, std::allocator<float> > const&, unsigned long, unsigned long, std::vector<float, std::allocator<float> > const&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x78c): undefined reference to `cblas_sdot(int, float const*, int, float const*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXdot(unsigned long, std::vector<double, std::allocator<double> >&, unsigned long, std::vector<double, std::allocator<double> > const&, unsigned long, unsigned long, std::vector<double, std::allocator<double> > const&, unsigned long, unsigned long)':
xgemm.cc:(.text+0x7cc): undefined reference to `cblas_ddot(int, double const*, int, double const*, int)'
CMakeFiles/clblast_test_xgemm.dir/test/correctness/routines/level3/xgemm.cc.o: In function `clblast::cblasXnrm2(unsigned long, std::vector<float, std::allocator<float> >&, unsigned long, std::vector<float, std::allocator<float> > const&, unsigned long, unsigned long)':
And another thing: can you reproduce fast computations for non-power-of-2 large matrices on the M370X?
Perhaps a separate issue, but I've just made a commit to …
I also did some experiments on Tahiti (HD7970) with SGEMM, and I observe the same. There is a drop in performance for larger power-of-2 kernels:
And there is also much more stable performance for non-power-of-2 kernels:
But I am also suspecting incorrect behaviour for larger matrices of specific sizes. For example, lowering the value of K to 1024 with M = N = 4096 yields the following:
This is beyond the specs of the device. So either I am computing the GFLOPS number wrongly, or both CLBlast and clBLAS crash and produce faulty results. Thus, there is definitely an issue on AMD hardware. Some conclusions so far:
This was actually a small mistake in computing the GFLOPS number and has been fixed in the ….

Now back to the issue at hand. I was able to reproduce the behaviour on the R9 M370X GPU: it also shows poor performance for 4K by 4K by 4K, but performance picks up again for slightly larger sizes. So it is an issue occurring across quite a range of AMD hardware.

I also double-checked: the performance drop is in the main GEMM kernel, not in one of the supporting kernels which I mentioned in the post above. And this main GEMM kernel is exactly the same for powers of 2 as for other values - it might not even be re-compiled if it is already in the cache. So the only things that change are the matrix sizes. Perhaps some unfortunate behaviour is going on where all memory accesses go through a single memory controller for large powers of 2? I couldn't pinpoint the cause of this in the code, though.

I see two possible solutions:
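(A small aside to make the single-memory-controller hypothesis above concrete: if the channel servicing an access were chosen by a 256-byte interleave across 8 channels, a hypothetical layout loosely based on the AMD optimization guide, then a power-of-2 row stride puts the start of every row on the same channel, while a stride like 8000 spreads the rows around. Real hardware differs; the numbers below are only illustrative.)

```c
/* Sketch: which memory channel the first element of each matrix row maps to,
   for a power-of-2 stride (8192 floats) versus a non-power-of-2 one (8000).
   The 256-byte interleave across 8 channels is a hypothetical layout loosely
   based on the AMD APP optimization guide; real hardware may differ. */
#include <stdio.h>

static int channel_of(size_t byte_address) {
    return (int)((byte_address / 256) % 8);   /* assumed channel selection */
}

int main(void) {
    printf("row   lda=8192   lda=8000\n");
    for (size_t row = 0; row < 8; ++row) {
        size_t addr_pow2 = row * 8192 * sizeof(float);   /* always channel 0 */
        size_t addr_8000 = row * 8000 * sizeof(float);   /* cycles over channels */
        printf("%3zu   %8d   %8d\n", row,
               channel_of(addr_pow2), channel_of(addr_8000));
    }
    return 0;
}
```

With the power-of-2 stride every row starts on the same channel, which is exactly the kind of imbalance the staggered-offsets trick mentioned below is meant to break up.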
Since you can reproduce the bug, whichever solution you think is most appropriate is good enough for me. AMD recommends using staggered offsets: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472340 I can also try to tune for larger sizes and see how that goes.
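(To make the staggered-offsets idea concrete, here is a hypothetical OpenCL fragment of the work-group remapping; it is not CLBlast's actual kernel and all names are illustrative. Each work-group rotates its tile column index by its tile row index, so groups launched together do not all start on the same memory channel even with a power-of-2 stride.)

```c
// Hypothetical OpenCL sketch of staggered offsets (not CLBlast's actual
// kernel; names are illustrative). Instead of using get_group_id(0)
// directly, each work-group rotates its tile column by its tile row, so
// groups that start together do not all hit the same memory channel.
__kernel void staggered_tile_map(const int num_groups_x,
                                 __global int2* tile_map) {
  const int raw_x   = (int)get_group_id(0);
  const int group_y = (int)get_group_id(1);
  const int group_x = (raw_x + group_y) % num_groups_x;  // staggered column

  // A real GEMM kernel would now load and process the A/B/C tiles that
  // belong to (group_x, group_y); here we just record the remapping.
  if (get_local_id(0) == 0 && get_local_id(1) == 0) {
    tile_map[group_y * num_groups_x + raw_x] = (int2)(group_x, group_y);
  }
}
```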
Yeah, that's what I meant, I guess. This used to be called 'partition camping' on the first CUDA-capable GPUs, but it hasn't been a problem since then. Apparently it is for the AMD architecture (with which I am not too familiar). I'll read the docs and try to understand the problem fully, because apparently some configurations of the GEMM kernel don't have this issue. The tuner binaries do have command-line arguments. When you run them you'll see what the options are, but I believe something like …
I have just added support for staggered offsets as you (and AMD) suggested. The changes are in commit 9065b34. I still see minor performance drops, but not as bad as before. So I guess that solves this issue. Can you verify on your machine perhaps? Ideally we should also re-run the tuners for the AMD devices...
I also still see occasional failures and incorrect results for specific tuning parameters, so there is still a bug somewhere. But we can continue that investigation in #51.
I have just tried it, and now there is no performance drop on my card either! I'd say that this is solved! Thank you :) Maybe it'd be a good time for 0.7.1, or even 0.8.0, since this is a very important fix for AMD cards, and @gpu could perhaps then release JOCLBlast 0.7.1?
If CLBlast is updated to a new version, I'll also update JOCLBlast. If this is not considered "worth a new version", I could still integrate it, maybe as "0.7.0b" ... |
I got this from multiple builds of CLBlast during the past week or so. For reference, my hand-written sgemm for 8192x8192 runs in 600 ms. As you can see, on an R9 290X with 4 GB of RAM, the 8000x8000 kernel runs expectedly fast, while the 8192x8192 kernel is surprisingly slow. It is not due to particular database options, since with the defaults it runs a couple of times slower still.