Why is the clblast::Gemv API slower than clblast_tuner_xgemv when m=2048, n=16384, precision=fp16? #560

Closed
liangzelang opened this issue Oct 10, 2024 · 10 comments


liangzelang commented Oct 10, 2024

Hi,
I recently used CLBlast to accelerate computation on Android devices, but ran into a problem. The Gemv API does not reach the performance reported by the tuner program, and the gap is very large. Could you help me check whether there is a problem with my API usage, or whether I have missed something?

clblast_tuner_xgemv performance:
[screenshot: img_v3_02fh_5ef424a0-8555-4082-91ff-d45b9e07d06g]

API performance:
[screenshot: img_v3_02fh_d85c4997-dea7-4ade-a4aa-889ac74d933g]


// API test
#include <CL/cl.h>
#include <clblast.h>
#include <clblast_half.h>
#include <vector>
#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>

static void printDeviceInfo(cl_device_id device) {
    char buffer[1024];
    // device name
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(buffer), buffer, nullptr);
    std::cout << "Device Name: " << buffer << std::endl;

    // device vendor
    clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(buffer), buffer, nullptr);
    std::cout << "Device Vendor: " << buffer << std::endl;

    // device version
    clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(buffer), buffer, nullptr);
    std::cout << "Device Version: " << buffer << std::endl;

    // driver version
    clGetDeviceInfo(device, CL_DRIVER_VERSION, sizeof(buffer), buffer, nullptr);
    std::cout << "Driver Version: " << buffer << std::endl;

    // OpenCL C version
    clGetDeviceInfo(device, CL_DEVICE_OPENCL_C_VERSION, sizeof(buffer), buffer, nullptr);
    std::cout << "OpenCL C Version: " << buffer << std::endl;
}

int main() {

    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_command_queue queue;

    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    printDeviceInfo(device);
    context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    queue = clCreateCommandQueue(context, device, 0, nullptr);

    const size_t m = 2048;
    const size_t n = 16384;

    std::vector<__fp16> host_a(m * n, 1.0f);
    std::vector<__fp16> host_x(n, 1.0f);
    std::vector<__fp16> host_y(m, 0.0f);

    cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, host_a.size() * sizeof(__fp16), host_a.data(), nullptr);
    cl_mem x_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, host_x.size() * sizeof(__fp16), host_x.data(), nullptr);
    cl_mem y_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY, host_y.size() * sizeof(__fp16), nullptr, nullptr);
    clFinish(queue); // Ensure memory operations are complete

    using Parameters = std::unordered_map<std::string, size_t>;

    // params
    Parameters tuned_params = {
        {"WGS1", 64},
        {"WPT1", 1}
    };

    // use tuned params
    clblast::OverrideParameters(device, "Xgemv", clblast::Precision::kHalf, tuned_params);

    // Performance measurement
    double totalTime = 0.0;
    for (int i = 0; i < 10; i++) {
        auto start = std::chrono::steady_clock::now();
        
        auto status = clblast::Gemv<half>(clblast::Layout::kRowMajor, clblast::Transpose::kNo, m, n, 
                        1.0f, 
                        a_buffer, 0, n, 
                        x_buffer, 0, 1, 
                        0.0f, 
                        y_buffer, 0, 1, 
                        &queue, nullptr);

        clFinish(queue);
        totalTime += std::chrono::duration<double,std::milli>(std::chrono::steady_clock::now() - start).count();
    }

    double averageTime = totalTime / 10.0;

    // time
    std::cout << "GEMV execution time: " << averageTime << " ms" << std::endl;

    // release 
    clReleaseMemObject(a_buffer);
    clReleaseMemObject(x_buffer);
    clReleaseMemObject(y_buffer);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    return 0;
}

@CNugteren
Owner

What happens if you leave out the first call to clblast::Gemv for the measurement? Or better perhaps, just print out the individual times of each run. The first call will have to compile the kernel as well, so it will definitely be slower.
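
A minimal sketch of the measurement suggested above: run the workload once (or more) untimed so that one-off costs such as kernel compilation are excluded, then record each timed run individually. `TimeRuns` is a hypothetical helper written for illustration and is not part of CLBlast; the `work` callable is assumed to wrap the `clblast::Gemv` call followed by `clFinish(queue)`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not a CLBlast API): calls `work` `warmup` times
// without timing, then `runs` times with timing, and returns the per-run
// wall-clock times in milliseconds.
std::vector<double> TimeRuns(const std::function<void()>& work,
                             std::size_t warmup, std::size_t runs) {
    assert(static_cast<bool>(work));  // the callable must not be empty
    for (std::size_t i = 0; i < warmup; ++i) {
        work();  // untimed: absorbs one-off costs such as kernel compilation
    }
    std::vector<double> times_ms;
    times_ms.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        times_ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return times_ms;
}
```

Printing every entry of the returned vector, rather than only the average, makes it easy to see whether a single slow run is skewing the result.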

@liangzelang
Author


Thanks for your reply. I changed the test code to print the time of each run, and the performance still shows a gap compared with clblast_tuner_xgemv.
[screenshot]

    // Performance measurement
    double totalTime = 0.0;
    for (int i = 0; i < 10; i++) {
        auto start = std::chrono::steady_clock::now();
        
        auto status = clblast::Gemv<half>(clblast::Layout::kRowMajor, clblast::Transpose::kNo, m, n, 
                        1.0f, 
                        a_buffer, 0, n, 
                        x_buffer, 0, 1, 
                        0.0f, 
                        y_buffer, 0, 1, 
                        &queue, nullptr);

        clFinish(queue);
        auto elapsed_time = std::chrono::duration<double,std::milli>(std::chrono::steady_clock::now() - start).count();
        std::cout << "No. "<< i << " GEMV execution time: " << elapsed_time << " ms" << std::endl;
        totalTime += elapsed_time;
    }

    double averageTime = totalTime / 10.0;

@CNugteren
Owner

OK, not there yet, but at least we went from 15 to 7ms :-)

Are you compiling CLBlast from source yourself? If so, could you add -DVERBOSE=ON (see https://github.com/CNugteren/CLBlast/blob/master/CMakeLists.txt#L77) to the CMake command-line? That should give more information at run-time which could help debugging what's going on (but it will slow things down, so disable it in the future for speed-measurements).

@liangzelang
Author

Yes, that's a good step.
I added the compile option '-DVERBOSE=ON' to the CMake command line and re-ran clblast_tuner_xgemv; the detailed output is below.

./clblast_tuner_xgemv -m 2048 -n 16384 -precision 16 -warmup
* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -precision 16 (half)
    -m 2048 [=default]
    -n 16384
    -alpha 2.000000 [=default]
    -beta 2.000000 [=default]
    -fraction 1.00 [=default]
    -runs 4 [=default]
    -max_l2_norm 0.00 [=default]

* Found 12 configuration(s)
* Parameters explored: WGS1 WPT1

|   ID | total |     param |      local      |      global     |       compiles |         time |   GB/s |            status |
x------x-------x-----------x-----------------x-----------------x----------------x--------------x--------x-------------------x
|  ref |     - |         - |      64       1 |    2048       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 85.60 ms
             OK |      2.68 ms |      - |      reference OK |
x------x-------x-----------x-----------------x-----------------x----------------x--------------x--------x-------------------x
|    1 |    12 |   32    1 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 58.92 ms
   OK     59 ms |      2.79 ms |   24.1 |     results match |
|    2 |    12 |   32    2 |      32       1 |    1024       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 80.33 ms
   OK     80 ms |      4.09 ms |   16.4 |     results match |
|    3 |    12 |   32    4 |      32       1 |     512       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 93.19 ms
   OK     93 ms |      5.42 ms |   12.4 |     results match |
|    4 |    12 |   64    1 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 76.97 ms
   OK     77 ms |      2.52 ms |   26.6 |     results match |
|    5 |    12 |   64    2 |      64       1 |    1024       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 79.71 ms
   OK     80 ms |      4.08 ms |   16.4 |     results match |
|    6 |    12 |   64    4 |      64       1 |     512       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 84.82 ms
   OK     85 ms |      6.28 ms |   10.7 |     results match |
|    7 |    12 |  128    1 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 83.18 ms
   OK     83 ms |      3.19 ms |   21.0 |     results match |
|    8 |    12 |  128    2 |     128       1 |    1024       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 101.09 ms
   OK    101 ms |      4.05 ms |   16.6 |     results match |
|    9 |    12 |  128    4 |     128       1 |     512       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 97.02 ms
   OK     97 ms |      6.68 ms |   10.1 |     results match |
|   10 |    12 |  256    1 |     256       1 |    2048       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 83.57 ms
   OK     84 ms |      3.25 ms |   20.7 |     results match |
|   11 |    12 |  256    2 |     256       1 |    1024       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 105.01 ms
   OK    105 ms |      5.99 ms |   11.2 |     results match |
|   12 |    12 |  256    4 |     256       1 |     512       1 |[DEBUG] Compiling routine 'Xgemv-16 (half)'
[DEBUG] Completed compilation in 107.88 ms
   OK    108 ms |     10.99 ms |    6.1 |     results match |
x------x-------x-----------x-----------------x-----------------x----------------x--------------x--------x-------------------x


* Got average result of 4.95 ms: 13.6 GB/s
* Found best result 2.52 ms: 26.6 GB/s
* Best parameters: PRECISION=16 WGS1=64 WPT1=1

* Writing a total of 12 results to 'clblast_xgemv_16.json'
* Completed tuning process
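
To compare the API timings with the GB/s column above, the effective bandwidth can be estimated from the run time. The sketch below assumes a simple traffic model (read A and x once, write y once); this is an illustrative assumption, not necessarily the tuner's exact formula, and the function names are hypothetical. With m = 2048, n = 16384 and 2-byte elements, this model gives about 26.6 GB/s for 2.52 ms, consistent with the table, and roughly 9.6 GB/s for the approximately 7 ms mentioned earlier for the API call.

```cpp
#include <cassert>
#include <cstddef>

// Assumed traffic model: read A (m*n elements) and x (n elements),
// write y (m elements). Illustrative only.
double GemvBytes(std::size_t m, std::size_t n, std::size_t bytes_per_element) {
    return static_cast<double>(m * n + n + m) *
           static_cast<double>(bytes_per_element);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a run of `time_ms`.
double EffectiveGBps(std::size_t m, std::size_t n,
                     std::size_t bytes_per_element, double time_ms) {
    assert(time_ms > 0.0);
    const double seconds = time_ms * 1.0e-3;
    return GemvBytes(m, n, bytes_per_element) / seconds / 1.0e9;
}
```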

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -precision 16 (half)
    -m 2048 [=default]
    -n 16384
    -alpha 2.000000 [=default]
    -beta 2.000000 [=default]
    -fraction 1.00 [=default]
    -runs 4 [=default]
    -max_l2_norm 0.00 [=default]

* Found 30 configuration(s)
* Parameters explored: WGS2 WPT2 VW2

|   ID | total |          param |      local      |      global     |       compiles |         time |   GB/s |            status |
x------x-------x----------------x-----------------x-----------------x----------------x--------------x--------x-------------------x
|  ref |     - |              - |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 68.97 ms
             OK |      2.85 ms |      - |      reference OK |
x------x-------x----------------x-----------------x-----------------x----------------x--------------x--------x-------------------x
|    1 |    30 |   16    1    1 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 72.94 ms
   OK     73 ms |      3.62 ms |   18.6 |     results match |
|    2 |    30 |   16    2    1 |      16       1 |    1024       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 71.29 ms
   OK     71 ms |      3.40 ms |   19.8 |     results match |
|    3 |    30 |   16    2    2 |      16       1 |    1024       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 60.66 ms
   OK     61 ms |      7.16 ms |    9.4 |     results match |
|    4 |    30 |   16    4    1 |      16       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 75.73 ms
   OK     76 ms |      4.08 ms |   16.4 |     results match |
|    5 |    30 |   16    4    2 |      16       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 76.76 ms
   OK     77 ms |      9.19 ms |    7.3 |     results match |
|    6 |    30 |   16    4    4 |      16       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 88.05 ms
   OK     88 ms |      2.98 ms |   22.6 |     results match |
|    7 |    30 |   32    1    1 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 64.88 ms
   OK     65 ms |      3.08 ms |   21.8 |     results match |
|    8 |    30 |   32    2    1 |      32       1 |    1024       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 67.07 ms
   OK     67 ms |      2.82 ms |   23.8 |     results match |
|    9 |    30 |   32    2    2 |      32       1 |    1024       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 60.72 ms
   OK     61 ms |      7.22 ms |    9.3 |     results match |
|   10 |    30 |   32    4    1 |      32       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 72.63 ms
   OK     73 ms |      9.96 ms |    6.7 |     results match |
|   11 |    30 |   32    4    2 |      32       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 87.35 ms
   OK     87 ms |     10.48 ms |    6.4 |     results match |
|   12 |    30 |   32    4    4 |      32       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 119.52 ms
   OK    120 ms |      3.11 ms |   21.6 |     results match |
|   13 |    30 |   64    1    1 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 86.57 ms
   OK     87 ms |      3.04 ms |   22.1 |     results match |
|   14 |    30 |   64    2    1 |      64       1 |    1024       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 55.22 ms
   OK     55 ms |      8.50 ms |    7.9 |     results match |
|   15 |    30 |   64    2    2 |      64       1 |    1024       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 83.43 ms
   OK     84 ms |      8.46 ms |    7.9 |     results match |
|   16 |    30 |   64    4    1 |      64       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 81.57 ms
   OK     82 ms |     11.35 ms |    5.9 |     results match |
|   17 |    30 |   64    4    2 |      64       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 93.39 ms
   OK     94 ms |     11.65 ms |    5.8 |     results match |
|   18 |    30 |   64    4    4 |      64       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 98.83 ms
   OK     99 ms |      9.08 ms |    7.4 |     results match |
|   19 |    30 |  128    1    1 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 70.95 ms
   OK     71 ms |      9.23 ms |    7.3 |     results match |
|   20 |    30 |  128    2    1 |     128       1 |    1024       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 110.43 ms
   OK    111 ms |     10.94 ms |    6.1 |     results match |
|   21 |    30 |  128    2    2 |     128       1 |    1024       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 109.04 ms
   OK    109 ms |      9.23 ms |    7.3 |     results match |
|   22 |    30 |  128    4    1 |     128       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 78.63 ms
   OK     79 ms |     14.24 ms |    4.7 |     results match |
|   23 |    30 |  128    4    2 |     128       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 101.45 ms
   OK    102 ms |     12.29 ms |    5.5 |     results match |
|   24 |    30 |  128    4    4 |     128       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 98.24 ms
   OK     98 ms |     10.84 ms |    6.2 |     results match |
|   25 |    30 |  256    1    1 |     256       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 110.24 ms
   OK    110 ms |      9.08 ms |    7.4 |     results match |
|   26 |    30 |  256    2    1 |     256       1 |    1024       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 91.98 ms
   OK     92 ms |     10.23 ms |    6.6 |     results match |
|   27 |    30 |  256    2    2 |     256       1 |    1024       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 85.84 ms
   OK     86 ms |      9.30 ms |    7.2 |     results match |
|   28 |    30 |  256    4    1 |     256       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 96.91 ms
   OK     97 ms |     14.51 ms |    4.6 |     results match |
|   29 |    30 |  256    4    2 |     256       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 108.61 ms
   OK    109 ms |     12.54 ms |    5.4 |     results match |
|   30 |    30 |  256    4    4 |     256       1 |     512       1 |[DEBUG] Compiling routine 'XgemvFast-16 (half)'
[DEBUG] Completed compilation in 148.64 ms
   OK    149 ms |     11.01 ms |    6.1 |     results match |
x------x-------x----------------x-----------------x-----------------x----------------x--------------x--------x-------------------x


* Got average result of 8.42 ms: 8.0 GB/s
* Found best result 2.82 ms: 23.8 GB/s
* Best parameters: PRECISION=16 VW2=1 WGS2=32 WPT2=2

* Writing a total of 30 results to 'clblast_xgemv_fast_16.json'
* Completed tuning process
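
The "Best parameters" lines above can be turned into the same kind of map that the test program passes to `clblast::OverrideParameters`. The sketch below is a hypothetical helper (`ParseBestParameters` is not a CLBlast function) and assumes the whitespace-separated `KEY=VALUE` format shown in the tuner output. The `PRECISION` entry is skipped because the test program passes the precision as a separate argument.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>

using Parameters = std::unordered_map<std::string, std::size_t>;

// Hypothetical helper: parses whitespace-separated "KEY=VALUE" tokens.
// Tokens without '=' and the PRECISION entry are skipped.
// std::stoul throws if a value is not a valid unsigned integer.
Parameters ParseBestParameters(const std::string& line) {
    Parameters params;
    std::istringstream stream(line);
    std::string token;
    while (stream >> token) {
        const auto pos = token.find('=');
        if (pos == std::string::npos) continue;  // not a KEY=VALUE token
        const std::string key = token.substr(0, pos);
        if (key == "PRECISION") continue;        // passed separately
        params[key] = std::stoul(token.substr(pos + 1));
    }
    return params;
}
```

For example, `ParseBestParameters("PRECISION=16 WGS1=64 WPT1=1")` yields a map with `WGS1` set to 64 and `WPT1` set to 1, matching the values hard-coded in the test program.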

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -precision 16 (half)
    -m 2048 [=default]
    -n 16384
    -alpha 2.000000 [=default]
    -beta 2.000000 [=default]
    -fraction 1.00 [=default]
    -runs 4 [=default]
    -max_l2_norm 0.00 [=default]

* Found 68 configuration(s)
* Parameters explored: WGS3 WPT3 VW3

|   ID | total |          param |      local      |      global     |       compiles |         time |   GB/s |            status |
x------x-------x----------------x-----------------x-----------------x----------------x--------------x--------x-------------------x
|  ref |     - |              - |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 68.83 ms
             OK |     29.79 ms |      - |      reference OK |
x------x-------x----------------x-----------------x-----------------x----------------x--------------x--------x-------------------x
|    1 |    68 |   16    1    1 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 113.89 ms
   OK    114 ms |     16.46 ms |    4.1 |     results match |
|    2 |    68 |   16    2    1 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 122.35 ms
   OK    123 ms |      9.97 ms |    6.7 |     results match |
|    3 |    68 |   16    2    2 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 116.31 ms
   OK    117 ms |      9.92 ms |    6.8 |     results match |
|    4 |    68 |   16    4    1 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 128.62 ms
   OK    129 ms |      7.90 ms |    8.5 |     results match |
|    5 |    68 |   16    4    2 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 96.89 ms
   OK     97 ms |      7.91 ms |    8.5 |     results match |
|    6 |    68 |   16    4    4 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 96.44 ms
   OK     97 ms |      6.44 ms |   10.4 |     results match |
|    7 |    68 |   16    8    1 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 90.97 ms
   OK     91 ms |      7.28 ms |    9.2 |     results match |
|    8 |    68 |   16    8    2 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 93.21 ms
   OK     93 ms |      7.27 ms |    9.2 |     results match |
|    9 |    68 |   16    8    4 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 103.82 ms
   OK    104 ms |      5.68 ms |   11.8 |     results match |
|   10 |    68 |   16    8    8 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 88.75 ms
   OK     89 ms |      5.00 ms |   13.4 |     results match |
|   11 |    68 |   16   16    1 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 99.82 ms
   OK    100 ms |      6.45 ms |   10.4 |     results match |
|   12 |    68 |   16   16    2 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 91.08 ms
   OK     91 ms |      8.62 ms |    7.8 |     results match |
|   13 |    68 |   16   16    4 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 92.61 ms
   OK     93 ms |      5.82 ms |   11.5 |     results match |
|   14 |    68 |   16   16    8 |      16       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 97.42 ms
   OK     98 ms |      4.30 ms |   15.6 |     results match |
|   15 |    68 |   32    1    1 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 91.90 ms
   OK     92 ms |     19.14 ms |    3.5 |     results match |
|   16 |    68 |   32    2    1 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 114.62 ms
   OK    115 ms |     11.24 ms |    6.0 |     results match |
|   17 |    68 |   32    2    2 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 108.41 ms
   OK    109 ms |     10.69 ms |    6.3 |     results match |
|   18 |    68 |   32    4    1 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 116.84 ms
   OK    117 ms |      8.97 ms |    7.5 |     results match |
|   19 |    68 |   32    4    2 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 87.37 ms
   OK     88 ms |      8.19 ms |    8.2 |     results match |
|   20 |    68 |   32    4    4 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 94.95 ms
   OK     95 ms |      7.08 ms |    9.5 |     results match |
|   21 |    68 |   32    8    1 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 102.34 ms
   OK    102 ms |      9.85 ms |    6.8 |     results match |
|   22 |    68 |   32    8    2 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 105.42 ms
   OK    106 ms |      7.68 ms |    8.7 |     results match |
|   23 |    68 |   32    8    4 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 86.68 ms
   OK     87 ms |      6.01 ms |   11.2 |     results match |
|   24 |    68 |   32    8    8 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 89.31 ms
   OK     89 ms |      5.13 ms |   13.1 |     results match |
|   25 |    68 |   32   16    1 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 87.72 ms
   OK     88 ms |      8.77 ms |    7.7 |     results match |
|   26 |    68 |   32   16    2 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 81.46 ms
   OK     82 ms |      9.18 ms |    7.3 |     results match |
|   27 |    68 |   32   16    4 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 87.66 ms
   OK     88 ms |      6.31 ms |   10.6 |     results match |
|   28 |    68 |   32   16    8 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 94.89 ms
   OK     95 ms |      4.66 ms |   14.4 |     results match |
|   29 |    68 |   32   32    1 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 110.25 ms
   OK    110 ms |     17.14 ms |    3.9 |     results match |
|   30 |    68 |   32   32    2 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 122.06 ms
   OK    122 ms |      9.31 ms |    7.2 |     results match |
|   31 |    68 |   32   32    4 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 113.45 ms
   OK    114 ms |      8.40 ms |    8.0 |     results match |
|   32 |    68 |   32   32    8 |      32       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 124.16 ms
   OK    124 ms |      6.11 ms |   11.0 |     results match |
|   33 |    68 |   64    1    1 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 89.04 ms
   OK     89 ms |     28.32 ms |    2.4 |     results match |
|   34 |    68 |   64    2    1 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 114.67 ms
   OK    115 ms |     17.68 ms |    3.8 |     results match |
|   35 |    68 |   64    2    2 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 112.37 ms
   OK    113 ms |     15.83 ms |    4.2 |     results match |
|   36 |    68 |   64    4    1 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 115.35 ms
   OK    116 ms |     10.09 ms |    6.7 |     results match |
|   37 |    68 |   64    4    2 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 112.57 ms
   OK    113 ms |     10.81 ms |    6.2 |     results match |
|   38 |    68 |   64    4    4 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 110.11 ms
   OK    110 ms |      9.28 ms |    7.2 |     results match |
|   39 |    68 |   64    8    1 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 91.05 ms
   OK     91 ms |      9.72 ms |    6.9 |     results match |
|   40 |    68 |   64    8    2 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 104.39 ms
   OK    105 ms |      8.28 ms |    8.1 |     results match |
|   41 |    68 |   64    8    4 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 89.69 ms
   OK     90 ms |      7.12 ms |    9.4 |     results match |
|   42 |    68 |   64    8    8 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 100.22 ms
   OK    100 ms |      6.27 ms |   10.7 |     results match |
|   43 |    68 |   64   16    1 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 92.27 ms
   OK     92 ms |      8.61 ms |    7.8 |     results match |
|   44 |    68 |   64   16    2 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 82.20 ms
   OK     82 ms |      9.70 ms |    6.9 |     results match |
|   45 |    68 |   64   16    4 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 105.39 ms
   OK    105 ms |      6.79 ms |    9.9 |     results match |
|   46 |    68 |   64   16    8 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 91.83 ms
   OK     92 ms |      5.56 ms |   12.1 |     results match |
|   47 |    68 |   64   32    1 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 102.74 ms
   OK    103 ms |     12.17 ms |    5.5 |     results match |
|   48 |    68 |   64   32    2 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 120.79 ms
   OK    121 ms |      9.47 ms |    7.1 |     results match |
|   49 |    68 |   64   32    4 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 117.69 ms
   OK    118 ms |      9.20 ms |    7.3 |     results match |
|   50 |    68 |   64   32    8 |      64       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 118.99 ms
   OK    119 ms |      6.09 ms |   11.0 |     results match |
|   51 |    68 |  128    1    1 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 94.05 ms
   OK     94 ms |     38.67 ms |    1.7 |     results match |
|   52 |    68 |  128    2    1 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 110.02 ms
   OK    110 ms |     26.56 ms |    2.5 |     results match |
|   53 |    68 |  128    2    2 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 114.94 ms
   OK    115 ms |     22.73 ms |    3.0 |     results match |
|   54 |    68 |  128    4    1 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 113.98 ms
   OK    114 ms |     13.37 ms |    5.0 |     results match |
|   55 |    68 |  128    4    2 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 120.72 ms
   OK    121 ms |     15.21 ms |    4.4 |     results match |
|   56 |    68 |  128    4    4 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 111.60 ms
   OK    112 ms |     15.29 ms |    4.4 |     results match |
|   57 |    68 |  128    8    1 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 115.06 ms
   OK    115 ms |     13.03 ms |    5.2 |     results match |
|   58 |    68 |  128    8    2 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 123.61 ms
   OK    124 ms |     10.56 ms |    6.4 |     results match |
|   59 |    68 |  128    8    4 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 129.78 ms
   OK    130 ms |      9.25 ms |    7.3 |     results match |
|   60 |    68 |  128    8    8 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 90.46 ms
   OK     91 ms |      9.36 ms |    7.2 |     results match |
|   61 |    68 |  128   16    1 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 95.79 ms
   OK     96 ms |     12.46 ms |    5.4 |     results match |
|   62 |    68 |  128   16    2 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 127.50 ms
   OK    128 ms |     12.92 ms |    5.2 |     results match |
|   63 |    68 |  128   16    4 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 120.74 ms
   OK    121 ms |      8.59 ms |    7.8 |     results match |
|   64 |    68 |  128   16    8 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 90.59 ms
   OK     91 ms |      7.12 ms |    9.4 |     results match |
|   65 |    68 |  128   32    1 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 110.72 ms
   OK    111 ms |     15.31 ms |    4.4 |     results match |
|   66 |    68 |  128   32    2 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 128.48 ms
   OK    129 ms |     12.87 ms |    5.2 |     results match |
|   67 |    68 |  128   32    4 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 124.53 ms
   OK    125 ms |     11.55 ms |    5.8 |     results match |
|   68 |    68 |  128   32    8 |     128       1 |    2048       1 |[DEBUG] Compiling routine 'XgemvFastRot-16 (half)'
[DEBUG] Completed compilation in 125.20 ms
   OK    125 ms |      7.56 ms |    8.9 |     results match |
x------x-------x----------------x-----------------x-----------------x----------------x--------------x--------x-------------------x


* Got average result of 10.68 ms: 6.3 GB/s
* Found best result 4.30 ms: 15.6 GB/s
* Best parameters: PRECISION=16 VW3=8 WGS3=16 WPT3=16

* Writing a total of 68 results to 'clblast_xgemv_fast_rot_16.json'
* Completed tuning process

After re-running the test code, I get the details below.

Device Name: QUALCOMM Adreno(TM)
Device Vendor: QUALCOMM
Device Version: OpenCL 3.0 Adreno(TM) 740
Driver Version: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.41.03.44
OpenCL C Version: OpenCL C 3.0 Adreno(TM) 740
[DEBUG] Searching database for kernel 'Xgemv'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'QUALCOMM' and type 'GPU'
[DEBUG] Found devices of architecture type 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found parameters for device type 'QUALCOMM Adreno(TM)'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'default' and type 'default'
[DEBUG] Found devices of architecture type 'default'
[DEBUG] Found parameters for device type 'default'
[DEBUG] Searching database for kernel 'XgemvFast'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'QUALCOMM' and type 'GPU'
[DEBUG] Found devices of architecture type 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found parameters for device type 'QUALCOMM Adreno(TM)'
[DEBUG] Searching database for kernel 'XgemvFastRot'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'QUALCOMM' and type 'GPU'
[DEBUG] Found devices of architecture type 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found parameters for device type 'QUALCOMM Adreno(TM)'
[DEBUG] Searching database for kernel 'TrsvRoutine'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'default' and type 'default'
[DEBUG] Found devices of architecture type 'default'
[DEBUG] Found parameters for device type 'default'
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Compiling routine 'GEMV-16 (half)'
[DEBUG] Completed compilation in 66.92 ms
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 15.28 ms
No. 0 GEMV execution time: 89.0274 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 7.40 ms
No. 1 GEMV execution time: 7.44886 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 7.49 ms
No. 2 GEMV execution time: 7.5188 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 7.19 ms
No. 3 GEMV execution time: 7.26458 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 7.44 ms
No. 4 GEMV execution time: 7.5925 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 7.72 ms
No. 5 GEMV execution time: 7.83318 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 7.66 ms
No. 6 GEMV execution time: 7.78521 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 7.74 ms
No. 7 GEMV execution time: 7.93667 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 7.68 ms
No. 8 GEMV execution time: 7.98313 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 7.48 ms
No. 9 GEMV execution time: 7.74526 ms
GEMV execution time: 15.8136 ms

It seems that the GEMV API does not select the regular Xgemv kernel, so it is not running the same kernel that clblast_tuner_xgemv tuned. How should I call the API to ensure that the most optimized kernel is used?

@CNugteren
Owner

Indeed, this shows that it is using the XgemvFastRot kernel:

[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 7.48 ms

BTW, this also shows in the first line the tuning parameters used, and indeed GEMV_Xgemv_64_1 corresponds to what you've set, so that part is fine.

I think it is now either a matter of tuning the XgemvFastRot kernel, or changing your input such that it uses the other kernel. For the latter, you should change one of the first two arguments, which you've now set to clblast::Layout::kRowMajor and clblast::Transpose::kNo. If you use either (not both!) clblast::Layout::kColMajor or clblast::Transpose::kYes it should use the kernel you wanted to use.
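The selection rule described here can be modeled roughly as follows. This is a sketch built only from what is said in this thread (layout and transpose decide between the two fast variants; a non-zero offset or awkward sizes fall back to the general kernel). The enums, the `Tile` struct, and the exact divisibility checks are local stand-ins and assumptions, not CLBlast's actual types or source.

```cpp
#include <cstddef>

namespace gemv_sketch {

// Local stand-ins for illustration; these are not the CLBlast types.
enum class Layout { kRowMajor, kColMajor };
enum class Transpose { kNo, kYes };
enum class Kernel { kGeneral, kFast, kFastRot };

// Hypothetical pair standing in for a kernel's tuned WGS/WPT values.
struct Tile {
    std::size_t wgs;
    std::size_t wpt;
};

Kernel SelectKernel(Layout layout, Transpose trans, std::size_t m,
                    std::size_t n, std::size_t a_offset, Tile fast,
                    Tile fast_rot) {
    // "Rotated" input: row-major without transpose, or column-major with it.
    const bool rotated =
        (layout == Layout::kRowMajor) == (trans == Transpose::kNo);
    const Tile& t = rotated ? fast_rot : fast;
    if (t.wgs == 0 || t.wpt == 0) {
        return Kernel::kGeneral;
    }
    // Assumed alignment rule: zero offset and sizes divisible by the tile.
    const bool aligned = (a_offset == 0) && (m % (t.wgs * t.wpt) == 0) &&
                         (n % t.wgs == 0);
    if (!aligned) {
        return Kernel::kGeneral;
    }
    return rotated ? Kernel::kFastRot : Kernel::kFast;
}

}  // namespace gemv_sketch
```

Under this model, row-major without transpose maps to XgemvFastRot, and flipping exactly one of the two arguments maps to XgemvFast, which matches the kernels named in the debug logs in this thread.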

@liangzelang
Author

Thanks, I ran some experiments based on your suggestions:

  1. I changed the transpose option in the test code as shown below, but the execution log shows that it runs the XgemvFast kernel, not the Xgemv kernel:
// test code
    const size_t m = 16384;
    const size_t n = 2048;

    std::vector<__fp16> host_a(16384 * 2048, 1.0f);
    std::vector<__fp16> host_x(16384, 2.0f);
    std::vector<__fp16> host_y(2048, 0.0f);

    cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, host_a.size() * sizeof(__fp16), host_a.data(), nullptr);
    cl_mem x_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, host_x.size() * sizeof(__fp16), host_x.data(), nullptr);
    cl_mem y_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY, host_y.size() * sizeof(__fp16), nullptr, nullptr);
    clFinish(queue); // Ensure memory operations are complete

    using Parameters = std::unordered_map<std::string, size_t>;

    // params
    Parameters tuned_params = {
        {"WGS1", 64},
        {"WPT1", 1}
    };

    // use tuned params
    clblast::OverrideParameters(device, "Xgemv", clblast::Precision::kHalf, tuned_params);

    // Performance measurement
    double totalTime = 0.0;
    for (int i = 0; i < 10; i++) {
        auto start = std::chrono::steady_clock::now();
        
        auto status = clblast::Gemv<half>(clblast::Layout::kRowMajor, clblast::Transpose::kYes, m, n, 
                        1.0f,
                        a_buffer, 0, n,
                        x_buffer, 0, 1,
                        0.0f,
                        y_buffer, 0, 1,
                        &queue, nullptr);

        if (status != clblast::StatusCode::kSuccess) {
            printf("[TEST] gemv Error: %d.", static_cast<int>(status));
        }
        clFinish(queue);
        auto elapsed_time = std::chrono::duration<double,std::milli>(std::chrono::steady_clock::now() - start).count();
        std::cout << "No. "<< i << " GEMV execution time: " << elapsed_time << " ms" << std::endl;
        totalTime += elapsed_time;
    }

LOG:

Device Name: QUALCOMM Adreno(TM)
Device Vendor: QUALCOMM
Device Version: OpenCL 3.0 Adreno(TM) 740
Driver Version: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.41.03.44
OpenCL C Version: OpenCL C 3.0 Adreno(TM) 740
[DEBUG] Searching database for kernel 'Xgemv'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'QUALCOMM' and type 'GPU'
[DEBUG] Found devices of architecture type 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found parameters for device type 'QUALCOMM Adreno(TM)'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'default' and type 'default'
[DEBUG] Found devices of architecture type 'default'
[DEBUG] Found parameters for device type 'default'
[DEBUG] Searching database for kernel 'XgemvFast'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'QUALCOMM' and type 'GPU'
[DEBUG] Found devices of architecture type 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found parameters for device type 'QUALCOMM Adreno(TM)'
[DEBUG] Searching database for kernel 'XgemvFastRot'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'QUALCOMM' and type 'GPU'
[DEBUG] Found devices of architecture type 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found parameters for device type 'QUALCOMM Adreno(TM)'
[DEBUG] Searching database for kernel 'TrsvRoutine'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'default' and type 'default'
[DEBUG] Found devices of architecture type 'default'
[DEBUG] Found parameters for device type 'default'
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Compiling routine 'GEMV-16 (half)'
[DEBUG] Completed compilation in 74.25 ms
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 14.74 ms
No. 0 GEMV execution time: 95.5716 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.14 ms
No. 1 GEMV execution time: 7.17828 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.56 ms
No. 2 GEMV execution time: 7.61557 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.39 ms
No. 3 GEMV execution time: 7.47651 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.47 ms
No. 4 GEMV execution time: 7.5688 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 10.16 ms
No. 5 GEMV execution time: 10.2999 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.41 ms
No. 6 GEMV execution time: 7.54901 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.48 ms
No. 7 GEMV execution time: 7.61453 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.90 ms
No. 8 GEMV execution time: 8.11635 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 8.29 ms
No. 9 GEMV execution time: 8.89568 ms
GEMV execution time: 16.7886 ms

  2. Changing kRowMajor to kColMajor also does not select the Xgemv kernel.

LOG:


[DEBUG] Completed compilation in 74.21 ms
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 15.32 ms
No. 0 GEMV execution time: 96.132 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.32 ms
No. 1 GEMV execution time: 7.35323 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.36 ms
No. 2 GEMV execution time: 7.41568 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.23 ms
No. 3 GEMV execution time: 7.32276 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.42 ms
No. 4 GEMV execution time: 7.51208 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.19 ms
No. 5 GEMV execution time: 7.32781 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.79 ms
No. 6 GEMV execution time: 7.92859 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 8.37 ms
No. 7 GEMV execution time: 8.50667 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.50 ms
No. 8 GEMV execution time: 7.69479 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.49 ms
No. 9 GEMV execution time: 7.80151 ms
GEMV execution time: 16.4995 ms

  3. I modified the CLBlast source code directly to force it not to select XgemvFast or XgemvFastRot, but the performance is also worse than clblast_tuner_xgemv.
    image
[DEBUG] Searching database for kernel 'TrsvRoutine'
[DEBUG] Device type 'GPU'; vendor 'QUALCOMM'
[DEBUG] Device name 'QUALCOMM Adreno(TM)'; architecture 'OpenCL C 3.0 Adreno(TM) 740'
[DEBUG] Found architectures of vendor 'default' and type 'default'
[DEBUG] Found devices of architecture type 'default'
[DEBUG] Found parameters for device type 'default'
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Compiling routine 'GEMV-16 (half)'
[DEBUG] Completed compilation in 74.15 ms
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 36.36 ms
No. 0 GEMV execution time: 117.046 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 27.49 ms
No. 1 GEMV execution time: 27.5867 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 30.11 ms
No. 2 GEMV execution time: 30.7563 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 25.28 ms
No. 3 GEMV execution time: 26.005 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 27.33 ms
No. 4 GEMV execution time: 27.9904 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 27.94 ms
No. 5 GEMV execution time: 28.9358 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 25.14 ms
No. 6 GEMV execution time: 25.399 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 25.99 ms
No. 7 GEMV execution time: 26.2475 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 26.91 ms
No. 8 GEMV execution time: 27.7486 ms
[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 26.72 ms
No. 9 GEMV execution time: 27.7269 ms
GEMV execution time: 36.5442 ms

@liangzelang
Author

In fact, my goal is very simple: do a matrix-vector multiplication of [2048, 16384] * [16384] and get performance consistent with what the tuner reports.

@CNugteren
Owner

OK, thanks for testing. I took a bit more time to investigate.

First of all, I would suggest you revert the false && things, because by design the 'fast' kernel should be faster than the more general kernel. From your tuning experiments it doesn't seem to matter that much, but I would propose to set these tuning parameters and use the fast kernel normally:
(from your logs:)

* Found best result 2.82 ms: 23.8 GB/s
* Best parameters: PRECISION=16 VW2=1 WGS2=32 WPT2=2

* Writing a total of 30 results to 'clblast_xgemv_fast_16.json'

So if you set those parameters, and run normally (with the transpose option on like you did above) you should see that it runs the XgemvFast kernel and hopefully it will run in under 3 ms. If you really want the more general non-fast kernel you'll need to do something non-standard to your input, such as a value of m or n that is not divisible by 32 or 64 or so, or an offset to the start of the array. But I would not recommend this if you don't need it.

From your debug output (GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24) you can see that it uses the parameters 2, 32, 2 for the XgemvFast kernel. From your tuning logs (I removed some debug info) it looks like that kernel should run in around 7ms:

|    9 |    30 |   32    2    2 |      32       1 |    1024       1 |   OK     61 ms |      7.22 ms |    9.3 |     results match |

This corresponds with your measurements:

[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'XgemvFast'
[DEBUG] Completed kernel in 7.23 ms
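To read such a parameter line programmatically, it can be split on underscores, attaching each numeric token to the kernel name before it. The sketch below assumes that format based only on the lines shown in this thread; `ParseParameterLine` is a hypothetical helper, not a CLBlast function. Judging by the tuner output in this thread, the values appear in the order VW, WGS, WPT (or WGS, WPT for Xgemv).

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using KernelValues = std::map<std::string, std::vector<std::size_t>>;

// Splits e.g. "GEMV_Xgemv_64_1_XgemvFast_2_32_2" into
// {"Xgemv": [64, 1], "XgemvFast": [2, 32, 2]}.
// Names without any numeric values (such as the leading routine name)
// are skipped.
KernelValues ParseParameterLine(const std::string& line) {
    KernelValues result;
    std::istringstream stream(line);
    std::string token;
    std::string current;
    while (std::getline(stream, token, '_')) {
        if (token.empty()) {
            continue;
        }
        const bool numeric =
            std::all_of(token.begin(), token.end(), [](unsigned char c) {
                return std::isdigit(c) != 0;
            });
        if (!numeric) {
            current = token;  // start collecting values for a new name
        } else if (!current.empty()) {
            result[current].push_back(std::stoul(token));
        }
    }
    return result;
}
```

Applied to the line above, the entry for XgemvFast holds 2, 32, 2, which is how the mismatch with the tuned values becomes visible.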

The only thing I can't explain are your slow measurements after you added the false &&:

[DEBUG] GEMV_Xgemv_64_1_XgemvFast_2_32_2_XgemvFastRot_2_16_8_TrsvRoutine_24
[DEBUG] Running kernel 'Xgemv'
[DEBUG] Completed kernel in 26.72 ms

But I would ignore that for now: given that you modified the code, something else might be wrong.

So in conclusion do all of the following and you should get good results:

  • Undo your false && changes.
  • Set the VW2=1 WGS2=32 WPT2=2 parameters for the XgemvFast kernel
  • Run with transposed input or with a different kernel layout.

@liangzelang
Author

liangzelang commented Oct 11, 2024

Thank you for your answer

  1. According to your suggestion, I transposed the A matrix in advance and configured clblast::Transpose::kYes, so that the API can select the XgemvFast kernel and use the corresponding tuned parameter. The performance is basically the same as clblast_tuned_xgemv.

    const size_t m = 16384;
    const size_t n = 2048;

    std::vector<__fp16> host_a(m * n, 1.0f);
    std::vector<__fp16> host_x(16384, 1.0f);
    std::vector<__fp16> host_y(2048, 0.0f);

    cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, host_a.size() * sizeof(__fp16), host_a.data(), nullptr);
    cl_mem x_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, host_x.size() * sizeof(__fp16), host_x.data(), nullptr);
    cl_mem y_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY, host_y.size() * sizeof(__fp16), nullptr, nullptr);
    clFinish(queue); // Ensure memory operations are complete

    using Parameters = std::unordered_map<std::string, size_t>;

    // params
    Parameters tuned_params = {
        {"WGS1", 64},
        {"WPT1", 1},
        {"VW2", 1},
        {"WGS2", 32},
        {"WPT2",1}
    };

    // use tuned params
    clblast::OverrideParameters(device, "XgemvFast", clblast::Precision::kHalf, tuned_params);

    // Performance measurement
    double totalTime = 0.0;
    for (int i = 0; i < 10; i++) {
        auto start = std::chrono::steady_clock::now();
        
        auto status = clblast::Gemv<half>(clblast::Layout::kRowMajor, clblast::Transpose::kYes, m, n, 
                        1.0f, 
                        a_buffer, 0, n, 
                        x_buffer, 0, 1, 
                        0.0f, 
                        y_buffer, 0, 1, 
                        &queue, nullptr);
         // check
        if (status != clblast::StatusCode::kSuccess) {
            std::cerr << "[TEST] Gemv error: " << static_cast<int>(status) << std::endl;
        }

        clFinish(queue);
        auto elapsed_time = std::chrono::duration<double,std::milli>(std::chrono::steady_clock::now() - start).count();
        std::cout << "No. "<< i << " GEMV execution time: " << elapsed_time << " ms" << std::endl;
        totalTime += elapsed_time;
    }

    double averageTime = totalTime / 10.0;

    // time
    std::cout << "GEMV execution time: " << averageTime << " ms" << std::endl;

image

  2. But I have a question: during tuning, the Xgemv kernel actually performs about the same as, or even better than, XgemvFast. Even though I can now select XgemvFast by transposing the matrix in advance, the transpose itself costs extra time. So as a user, how can I control which kernel an API call uses, given that the previously recommended methods can't actually select Xgemv?
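For reference, the up-front host-side transpose mentioned in item 1 could look roughly like the sketch below. It is a plain out-of-place transpose of a row-major matrix; the name `TransposeRowMajor` and the template parameter are assumptions for illustration, and this is not necessarily how the transpose was done in the test above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Out-of-place transpose of a row-major rows-by-cols matrix.
// The result is a row-major cols-by-rows matrix.
template <typename T>
std::vector<T> TransposeRowMajor(const std::vector<T>& a, std::size_t rows,
                                 std::size_t cols) {
    if (a.size() != rows * cols) {
        throw std::invalid_argument("matrix size does not match rows * cols");
    }
    std::vector<T> out(a.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            out[c * rows + r] = a[r * cols + c];
        }
    }
    return out;
}
```

If the matrix is constant across many GEMV calls, this cost is paid once; if it changes every call, the transpose time has to be counted against any kernel speed-up.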

@CNugteren
Owner

Good to hear that you now get the speed as advertised.

Regarding choosing the kernel: that is not possible in CLBlast. The general Xgemv kernel should in theory be slower than the 'fast' variants, but there might be edge-cases of course, such as your example with half-precision. Furthermore, in this case the speed difference between the general Xgemv kernel and the XgemvFast kernel is minimal: both are likely limited by the memory bandwidth of your device, not by its computational power. So no, there won't be an API to choose. If you really think this is an issue we should instead look into making the XgemvFastRot kernel as fast as the Xgemv kernel itself, if possible, but I don't have the time for that myself. If you do really want to use the general Xgemv kernel you can set an offset into the buffer, e.g. an offset of 1. You'll need one extra byte of memory but that should be fine.
