Why clblast::Gemv API is slower than clblast_tune_xgemv when m = 2048 n=16384 precision=fp16. #560
What happens if you leave out the first call to
OK, not there yet, but at least we went from 15 to 7 ms :-) Are you compiling CLBlast from source yourself? If so, could you add
Yeah, it's a good step.
I re-ran the test code and got the details below.
It seems that the Gemv API does not select the regular GEMV kernel, so it is not running the same kernel as clblast_tuner_xgemv. How should I call the API to ensure that the most optimized kernel is used?
Indeed, this shows that it is using the
BTW, this also shows in the first line the tuning parameters used, and indeed I think it is now either a matter of tuning the
In fact, my goal is very simple: to do a matrix-vector multiplication of [2048, 16384] * [16384] and get performance consistent with what the tuner reports.
OK, thanks for testing. I took a bit more time to investigate. First of all, I would suggest you revert the
So if you set those parameters, and run normally (with the transpose option on like you did above) you should see that it runs the
From your debug output (
This corresponds with your measurements:
The only thing I can't explain are your slow measurements after you added the
But I would ignore that for now: given that you modified the code, something else might be wrong. So in conclusion, do all of the following and you should get good results:
Thank you for your answer
Good to hear that you now get the speed as advertised. Regarding choosing the kernel: that is not possible in CLBlast. The general Xgemv kernel should in theory be slower than the 'fast' variants, but there might be edge cases of course, such as your example with half-precision. Furthermore, in this case the speed difference between the general Xgemv kernel and the XgemvFast kernel is minimal: both are likely limited by the memory bandwidth of your device, not by its computational power. So no, there won't be an API to choose. If you really think this is an issue, we should instead look into making the XgemvFastRot kernel as fast as the Xgemv kernel itself, if possible, but I don't have the time for that myself. If you really do want to use the general Xgemv kernel, you can set an offset into the buffer, e.g. an offset of 1. You'll need one extra byte of memory but that should be fine.
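To put the memory-bandwidth remark in numbers, here is a rough back-of-the-envelope sketch. It is not part of CLBlast: `MatrixBytes` and `EffectiveBandwidthGBs` are hypothetical helper names, and the estimate assumes the matrix is read from memory exactly once and that the traffic for the two vectors is negligible. It can be fed the 2048 x 16384 fp16 problem size from this issue and the 15 ms and 7 ms timings mentioned earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by an m x n matrix with the given element size.
// For GEMV this is (approximately) the data that must be read once;
// the x and y vectors are ignored because they are tiny in comparison.
inline double MatrixBytes(std::size_t m, std::size_t n,
                          std::size_t bytes_per_element) {
  return static_cast<double>(m) * static_cast<double>(n) *
         static_cast<double>(bytes_per_element);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that moved
// `bytes` bytes in `milliseconds` ms.
inline double EffectiveBandwidthGBs(double bytes, double milliseconds) {
  assert(milliseconds > 0.0);  // a zero or negative time is a measurement error
  const double seconds = milliseconds * 1.0e-3;
  return bytes / seconds / 1.0e9;
}
```

For example, `EffectiveBandwidthGBs(MatrixBytes(2048, 16384, 2), 7.0)` gives the bandwidth implied by a 7 ms run on fp16 data (2 bytes per element). Comparing that figure with the peak memory bandwidth of the device indicates how much headroom a different kernel could realistically offer.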
Hi,
I recently used CLBlast to speed up computation on Android devices, but ran into a problem: the performance of the Gemv API does not reach the performance reported by the tuner program, and the difference is very large. Please help me check whether there is a problem with my API usage, or whether I have missed something.
clblast_tuner_xgemv performance:
API performance: