
Rewrite OpenCL kernel for performance optimizations #18

Merged 1 commit into nanocurrency:master on Aug 31, 2020
Conversation

@jserv (Contributor) commented Jul 28, 2020

The OpenCL kernel has been rewritten for the following improvements:

  1. Completely remove unnecessary intermediate buffers;
  2. Fully vectorize Blake2b;
  3. Schedule registers in Blake2b;
  4. Load all sigma constants in a single instruction and use macros for constant evaluation;
  5. Assume messages never exceed 17 exabytes (so the upper 64 bits of the Blake2b length counter stay zero) and optimize accordingly;
  6. Implement an AMD fast path for rotr64 (see the sketch after this list);
  7. Specify __constant for both optimization and error checking.

It is known to boost performance on several NVIDIA and AMD GPUs.
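For item 6, the AMD fast path builds rotr64 on amd_bitalign from the cl_amd_media_ops extension, which computes (((ulong)a << 32) | b) >> (s & 31), so a 64-bit rotate becomes two 32-bit extracts. The sketch below illustrates the idea and is not necessarily the exact code merged in src/work.cl:

#ifdef cl_amd_media_ops
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
static inline ulong rotr64(ulong x, int shift)
{
    uint2 v = as_uint2(x); /* v.s0 = low 32 bits, v.s1 = high 32 bits */
    uint2 r;
    if (shift < 32) {
        r.s0 = amd_bitalign(v.s1, v.s0, shift);
        r.s1 = amd_bitalign(v.s0, v.s1, shift);
    } else { /* rotating by 32 or more first swaps the halves */
        r.s0 = amd_bitalign(v.s0, v.s1, shift - 32);
        r.s1 = amd_bitalign(v.s1, v.s0, shift - 32);
    }
    return as_ulong(r);
}
#else
static inline ulong rotr64(ulong x, int shift)
{
    /* Generic fallback, valid for shift in 1..63. */
    return (x >> shift) | (x << (64 - shift));
}
#endif

Blake2b only rotates by 32, 24, 16, and 63, so the shift is a compile-time constant and the branch folds away.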

@guilhermelawless (Contributor) commented:

Thanks for your patch. Could you provide some details about your test setup? I'm not seeing any statistically significant improvements on my end with a Vega 64.

@guilhermelawless guilhermelawless self-requested a review July 30, 2020 15:21
@jserv (Contributor, author) commented Jul 30, 2020

> Thanks for your patch. Could you provide some details about your test setup? I'm not seeing any statistically significant improvements on my end with a Vega 64.

I lacked a reasonable benchmarking tool for PoW, so I used nanopow for my experiments while improving the OpenCL kernel. Can you suggest a more comprehensive and usable benchmark suite?

@guilhermelawless (Contributor) commented:

I've just been using this simple script: https://github.com/guilhermelawless/blake2b-pow-bench. If you target a high enough difficulty (at least Nano's current base difficulty), it shouldn't need multiple processes, but it doesn't hurt to have 5 or so.

@jserv (Contributor, author) commented Jul 31, 2020

> I've just been using this simple script: https://github.com/guilhermelawless/blake2b-pow-bench. If you target a high enough difficulty (at least Nano's current base difficulty), it shouldn't need multiple processes, but it doesn't hurt to have 5 or so.

I did try blake2b-pow-bench, but found that the results varied a lot. For example, the command I used was benchmark.sh 1 100 localhost:7000, which took 350 to 420 seconds in my environment. I am not sure it makes sense for the measurement to include the worker round-trip and the repeated creation of curl processes.

@jserv jserv changed the title OpenCL improvements OpenCL kernel improvements Aug 4, 2020
@jserv (Contributor, author) commented Aug 4, 2020

@guilhermelawless, the proposed change to the OpenCL kernel has been tested by @inkeliz; see inkeliz/nanopow#2 for details.
GPUs tested so far:

  • AMD Radeon RX 5700XT
  • NVIDIA TITAN Xp

It is almost 80% faster on the RX 5700 XT.

@PlasmaPower (Contributor) commented:

FYI, I've made #21, which includes your latest optimizations from inkeliz/nanopow#2 plus a couple of my own to remove the Blake2b state entirely.

@besoeasy commented Aug 4, 2020

Tested on an RX 5700 XT 8 GB (around 75% faster).

@jserv jserv changed the title OpenCL kernel improvements Rewrite OpenCL kernel for performance optimizations Aug 14, 2020
@PlasmaPower (Contributor) commented:

What do you mean by "Reduce the batch size"? Are you referring to the local work size? It doesn't seem to be modified in this PR.

Also, nice find on the AMD rotr; how much does that improve performance?

@jserv (Contributor, author) commented Aug 14, 2020

> Also, nice find on the AMD rotr; how much does that improve performance?

Check this: inkeliz/nanopow#4

@jserv (Contributor, author) commented Aug 14, 2020

> What do you mean by "Reduce the batch size"? Are you referring to the local work size? It doesn't seem to be modified in this PR.

Thanks for pointing out the out-of-date change, which was meant to stay in my internal commits. I have revised the commit messages.

@PlasmaPower (Contributor) commented:

I'd be tempted to split rotr64 up into one function per rotate amount (so rotr16, rotr24, rotr32, and rotr63), especially for the 32-bit rotr, which really just returns the uints you're already extracting in the AMD version in the opposite order. A sketch of that split follows.
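A minimal sketch of what that split might look like (assumed shape, not code from this PR or #21):

static inline ulong rotr16(ulong x) { return (x >> 16) | (x << 48); }
static inline ulong rotr24(ulong x) { return (x >> 24) | (x << 40); }
static inline ulong rotr32(ulong x)
{
    uint2 v = as_uint2(x);
    return as_ulong((uint2)(v.s1, v.s0)); /* just swap the 32-bit halves */
}
static inline ulong rotr63(ulong x) { return (x >> 63) | (x << 1); }

With the rotate amount fixed per function, each one can compile down to the cheapest instruction sequence for that amount on the target GPU.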

@zhyatt zhyatt requested a review from SergiySW August 17, 2020 15:58
@SergiySW left a review:


Tested the changes with AMD & Nvidia cards; it looks like a very substantial boost. Depending on GPU architecture, it can be as much as 5-9 times faster than the reference (comparisons via the nano-node CLI --debug_opencl).
It would be good to have a similar PR against the nano-node repository as well.

@PlasmaPower (Contributor) commented:

I believe the type-signature change from uchar to ulong for attempt and result breaks this repo's usage; at least on my machine, ocl complains about a type mismatch. That said, I inherited the uchar args from the stock nano-node kernel, and I think ulongs are the better option. We just need to update the Rust code a bit to match.

@SergiySW SergiySW self-requested a review August 17, 2020 16:38
@SergiySW commented:

^ Confirmed for an AMD card with nano-work-server.

@guilhermelawless (Contributor) commented:

@jserv, could you apply this patch so we have AMD compatibility again? It reverts to reinterpreting the kernel arguments inside the kernel code itself. We can do a follow-up PR to change the args to ulong with the appropriate Rust changes, since those aren't trivial.

diff --git a/src/work.cl b/src/work.cl
index 14fd247..36ed43d 100644
--- a/src/work.cl
+++ b/src/work.cl
@@ -100,12 +100,12 @@ static inline ulong blake2b(ulong const nonce, __constant ulong *h)
 #undef G2v_split
 #undef ROUND

-__kernel void nano_work(__constant ulong *attempt,
-                        __global ulong *result_a,
+__kernel void nano_work(__constant uchar *attempt,
+                        __global uchar *result_a,
                         __constant uchar *item_a,
                         const ulong difficulty)
 {
-    const ulong attempt_l = *attempt + get_global_id(0);
+    const ulong attempt_l = *((__constant ulong *)attempt) + get_global_id(0);
     if (blake2b(attempt_l, item_a) >= difficulty)
-        *result_a = attempt_l;
+        *((__global ulong *)result_a) = attempt_l;
 }

Thanks!

@jserv (Contributor, author) commented Aug 25, 2020

> @jserv, could you apply this patch so we have AMD compatibility again? It reverts to reinterpreting the kernel arguments inside the kernel code itself. We can do a follow-up PR to change the args to ulong with the appropriate Rust changes, since those aren't trivial.

Done. I have rebased and force-pushed.

@guilhermelawless (Contributor) commented:

@jserv the following changes were necessary to be able to run on AMD:

diff --git a/src/work.cl b/src/work.cl
index b6a7b6c..e23a987 100644
--- a/src/work.cl
+++ b/src/work.cl
@@ -65,14 +65,14 @@ static inline ulong rotr64(ulong x, int shift)
                   vv[13 / 2].s1, vv[14 / 2].s0);                               \
     } while (0)

-static inline ulong blake2b(ulong const nonce, ulong4 const hash)
+static inline ulong blake2b(ulong const nonce, ulong * const hash)
 {
     ulong2 vv[8] = {
         {nano_xor_iv0, iv1}, {iv2, iv3},          {iv4, iv5},
         {iv6, iv7},          {iv0, iv1},          {iv2, iv3},
         {nano_xor_iv4, iv5}, {nano_xor_iv6, iv7},
     };
-    ulong *h = &hash;
+    ulong *h = hash;

     ROUND(nonce, h[0], h[1], h[2], h[3], 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
     ROUND(0, 0, h[3], 0, 0, 0, 0, 0, h[0], 0, nonce, h[1], 0, 0, 0, h[2]);
@@ -100,6 +100,6 @@ __kernel void nano_work(__constant uchar *attempt,
                         const ulong difficulty)
 {
     const ulong attempt_l = *((__constant ulong *) attempt) + get_global_id(0);
-    if (blake2b(attempt_l, vload4(0, item_a)) >= difficulty)
+    if (blake2b(attempt_l, item_a) >= difficulty)
         *((__global ulong *) result_a) = attempt_l;
 }

blake2b() will implicitly convert the argument to ulong *, which is fine for now.
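If that implicit conversion ever becomes a problem, the original ulong4-based signature could also be kept by loading through an explicitly cast, address-space-qualified pointer. A minimal sketch (hypothetical helper name, not part of the merged patch), assuming the item buffer is 8-byte aligned:

static inline ulong4 load_item(__constant uchar *item)
{
    /* Reinterpret the 32-byte item as four ulongs in one vector load. */
    return vload4(0, (__constant ulong *) item);
}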

@jserv (Contributor, author) commented Aug 25, 2020

Thanks, @guilhermelawless, for revising. I have minimized the changes.

@PlasmaPower (Contributor) commented:

How about what I have in #21? That works on both AMD and Nvidia, and I haven't seen any further performance improvements since then.

The single commit in this PR carries the following message:

The OpenCL kernel has been rewritten for the following improvements:
1. Completely remove unnecessary intermediate buffers;
2. Fully vectorize Blake2b;
3. Schedule registers in Blake2b;
4. Load all sigma constants in a single instruction and use macros
   for constant evaluation;
5. Assume messages never exceed 17 exabytes and apply the corresponding optimizations;
6. Implement AMD fastpath for rotr64;
7. Specify __constant, for both optimization and error checking;

It is known to boost performance on several NVIDIA and AMD GPUs.

Co-authored-by: Lee Bousfield <[email protected]>
@guilhermelawless guilhermelawless self-requested a review August 25, 2020 15:30
@guilhermelawless (Contributor) commented Aug 25, 2020

Seems to be working, thanks! We'll be merging this and making a release soon. @PlasmaPower, would you like to make the required changes to have everything as ulong before that goes out? Otherwise I will take a look, but it will take a while longer.

@PlasmaPower (Contributor) commented:

@guilhermelawless, if I do end up making that change, I can do it post-release in a separate PR, since it shouldn't affect speed or anything, just code quality.

@guilhermelawless merged commit e83d345 into nanocurrency:master on Aug 31, 2020
guilhermelawless added a commit to guilhermelawless/nano-node that referenced this pull request Aug 31, 2020
Credit and thanks go to @jserv and @PlasmaPower for the contribution.

Originally pushed to nanocurrency/nano-work-server#18, this kernel was rewritten with the following improvements:
1. Completely remove unnecessary intermediate buffers;
2. Fully vectorize Blake2b;
3. Schedule registers in Blake2b;
4. Load all sigma constants in a single instruction and use macros for constant evaluation;
5. Assume messages never exceed 17 exabytes and apply the corresponding optimizations;
6. Implement AMD fastpath for rotr64;
7. Specify __constant for both optimization and error checking;

Co-authored-by: Jim Huang <[email protected]>
Co-authored-by: Lee Bousfield <[email protected]>
Signed-off-by: Guilherme Lawless <[email protected]>
Signed-off-by: Sergey Kroshnin <[email protected]>
guilhermelawless added a commit to nanocurrency/nano-node that referenced this pull request Sep 2, 2020