
Rewrite OpenCL kernel for performance optimizations #18

Merged 1 commit into nanocurrency:master on Aug 31, 2020
Conversation

@jserv (Contributor) commented Jul 28, 2020

The OpenCL kernel has been rewritten for the following improvements:

  1. Completely remove unnecessary intermediate buffers;
  2. Fully vectorize Blake2b;
  3. Schedule registers in Blake2b;
  4. Load all sigma constants in a single instruction and use macros for constant evaluation;
  5. Assume messages never exceed 17 exabytes (so the upper 64 bits of the Blake2b length counter stay zero) and optimize accordingly;
  6. Implement an AMD fast path for rotr64 (see the sketch after this list);
  7. Specify __constant for both optimization and error checking.

It is known to boost performance on several NVIDIA and AMD GPUs.
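For item 6, the AMD fast path builds rotr64 on amd_bitalign from the cl_amd_media_ops extension, which computes (((ulong)a << 32) | b) >> (s & 31), so a 64-bit rotate becomes two 32-bit extracts. The sketch below illustrates the idea and is not necessarily the exact code merged in src/work.cl:

#ifdef cl_amd_media_ops
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
static inline ulong rotr64(ulong x, int shift)
{
    uint2 v = as_uint2(x); /* v.s0 = low 32 bits, v.s1 = high 32 bits */
    uint2 r;
    if (shift < 32) {
        r.s0 = amd_bitalign(v.s1, v.s0, shift);
        r.s1 = amd_bitalign(v.s0, v.s1, shift);
    } else { /* rotating by 32 or more first swaps the halves */
        r.s0 = amd_bitalign(v.s0, v.s1, shift - 32);
        r.s1 = amd_bitalign(v.s1, v.s0, shift - 32);
    }
    return as_ulong(r);
}
#else
static inline ulong rotr64(ulong x, int shift)
{
    /* Generic fallback, valid for shift in 1..63. */
    return (x >> shift) | (x << (64 - shift));
}
#endif

Blake2b only rotates by 32, 24, 16, and 63, so the shift is a compile-time constant and the branch folds away.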

@guilhermelawless (Contributor) commented:

Thanks for your patch. Could you provide some details about your test setup? I'm not seeing any statistically significant improvements on my end with a Vega 64.

@guilhermelawless guilhermelawless self-requested a review July 30, 2020 15:21
@jserv (Contributor, author) commented Jul 30, 2020

> Thanks for your patch. Could you provide some details about your test setup? I'm not seeing any statistically significant improvements on my end with a Vega 64.

I lacked a reasonable benchmarking tool for PoW, so I used nanopow for my experiments while improving the OpenCL kernel. Can you suggest a more comprehensive and usable benchmark suite?

@guilhermelawless (Contributor) commented:

I've just been using this simple script: https://github.com/guilhermelawless/blake2b-pow-bench. If you target a high enough difficulty (at least Nano's current base difficulty), it shouldn't need multiple processes, but it doesn't hurt to have 5 or so.

@jserv (Contributor, author) commented Jul 31, 2020

> I've just been using this simple script: https://github.com/guilhermelawless/blake2b-pow-bench. If you target a high enough difficulty (at least Nano's current base difficulty), it shouldn't need multiple processes, but it doesn't hurt to have 5 or so.

I did try blake2b-pow-bench, but found that the results varied a lot. For example, the command I used was benchmark.sh 1 100 localhost:7000, which took 350 to 420 seconds in my environment. I am not sure it makes sense for the measurement to include the worker round-trip and the repeated creation of curl processes.

@jserv jserv changed the title OpenCL improvements OpenCL kernel improvements Aug 4, 2020
@jserv (Contributor, author) commented Aug 4, 2020

@guilhermelawless, the proposed change to the OpenCL kernel has been tested by @inkeliz; see inkeliz/nanopow#2 for details.
GPUs tested so far:

  • AMD Radeon RX 5700XT
  • NVIDIA TITAN Xp

It is almost 80% faster on the RX 5700 XT.

@PlasmaPower (Contributor) commented:

FYI, I've made #21, which includes your latest optimizations from inkeliz/nanopow#2 plus a couple of my own to remove the Blake2b state entirely.

@besoeasy commented Aug 4, 2020

Tested on an RX 5700 XT 8 GB (around 75% faster).

@jserv jserv changed the title OpenCL kernel improvements Rewrite OpenCL kernel for performance optimizations Aug 14, 2020
@PlasmaPower (Contributor) commented:

What do you mean by "Reduce the batch size"? Are you referring to the local work size? It doesn't seem to be modified in this PR.

Also, nice find on the AMD rotr; how much does that improve performance?

@jserv (Contributor, author) commented Aug 14, 2020

> Also, nice find on the AMD rotr; how much does that improve performance?

Check this: inkeliz/nanopow#4

@jserv (Contributor, author) commented Aug 14, 2020

> What do you mean by "Reduce the batch size"? Are you referring to the local work size? It doesn't seem to be modified in this PR.

Thanks for pointing out the out-of-date change, which was meant to stay in my internal commits. I have revised the commit messages.

@PlasmaPower (Contributor) commented:

I'd be tempted to split rotr64 up into one function per rotate amount (so rotr16, rotr24, rotr32, and rotr63), especially for the 32-bit rotr, which really just returns the uints you're already extracting in the AMD version in the opposite order. A sketch of that split follows.
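A minimal sketch of what that split might look like (assumed shape, not code from this PR or #21):

static inline ulong rotr16(ulong x) { return (x >> 16) | (x << 48); }
static inline ulong rotr24(ulong x) { return (x >> 24) | (x << 40); }
static inline ulong rotr32(ulong x)
{
    uint2 v = as_uint2(x);
    return as_ulong((uint2)(v.s1, v.s0)); /* just swap the 32-bit halves */
}
static inline ulong rotr63(ulong x) { return (x >> 63) | (x << 1); }

With the rotate amount fixed per function, each one can compile down to the cheapest instruction sequence for that amount on the target GPU.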

@zhyatt zhyatt requested a review from SergiySW August 17, 2020 15:58
@SergiySW left a review:


Tested the changes with AMD & Nvidia cards; it looks like a very substantial boost. Depending on GPU architecture, it can be as much as 5-9 times faster than the reference (comparisons via the nano-node CLI --debug_opencl).
It would be good to have a similar PR against the nano-node repository as well.

@PlasmaPower (Contributor) commented:

I believe the type-signature change from uchar to ulong for attempt and result breaks this repo's usage; at least on my machine, ocl complains about a type mismatch. That said, I inherited the uchar args from the stock nano-node kernel, and I think ulongs are the better option. We just need to update the Rust code a bit to match.

@SergiySW SergiySW self-requested a review August 17, 2020 16:38
@SergiySW commented:

^ Confirmed for an AMD card with nano-work-server.

@guilhermelawless (Contributor) commented:

@jserv, could you apply this patch so we have AMD compatibility again? It reverts to reinterpreting the kernel arguments inside the kernel code itself. We can do a follow-up PR to change the args to ulong with the appropriate Rust changes, since those aren't trivial.

diff --git a/src/work.cl b/src/work.cl
index 14fd247..36ed43d 100644
--- a/src/work.cl
+++ b/src/work.cl
@@ -100,12 +100,12 @@ static inline ulong blake2b(ulong const nonce, __constant ulong *h)
 #undef G2v_split
 #undef ROUND

-__kernel void nano_work(__constant ulong *attempt,
-                        __global ulong *result_a,
+__kernel void nano_work(__constant uchar *attempt,
+                        __global uchar *result_a,
                         __constant uchar *item_a,
                         const ulong difficulty)
 {
-    const ulong attempt_l = *attempt + get_global_id(0);
+    const ulong attempt_l = *((__constant ulong *)attempt) + get_global_id(0);
     if (blake2b(attempt_l, item_a) >= difficulty)
-        *result_a = attempt_l;
+        *((__global ulong *)result_a) = attempt_l;
 }

Thanks!

@jserv (Contributor, author) commented Aug 25, 2020

> @jserv, could you apply this patch so we have AMD compatibility again? It reverts to reinterpreting the kernel arguments inside the kernel code itself. We can do a follow-up PR to change the args to ulong with the appropriate Rust changes, since those aren't trivial.

Done. I have rebased and force-pushed.

@guilhermelawless (Contributor) commented:

@jserv the following changes were necessary to be able to run on AMD:

diff --git a/src/work.cl b/src/work.cl
index b6a7b6c..e23a987 100644
--- a/src/work.cl
+++ b/src/work.cl
@@ -65,14 +65,14 @@ static inline ulong rotr64(ulong x, int shift)
                   vv[13 / 2].s1, vv[14 / 2].s0);                               \
     } while (0)

-static inline ulong blake2b(ulong const nonce, ulong4 const hash)
+static inline ulong blake2b(ulong const nonce, ulong * const hash)
 {
     ulong2 vv[8] = {
         {nano_xor_iv0, iv1}, {iv2, iv3},          {iv4, iv5},
         {iv6, iv7},          {iv0, iv1},          {iv2, iv3},
         {nano_xor_iv4, iv5}, {nano_xor_iv6, iv7},
     };
-    ulong *h = &hash;
+    ulong *h = hash;

     ROUND(nonce, h[0], h[1], h[2], h[3], 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
     ROUND(0, 0, h[3], 0, 0, 0, 0, 0, h[0], 0, nonce, h[1], 0, 0, 0, h[2]);
@@ -100,6 +100,6 @@ __kernel void nano_work(__constant uchar *attempt,
                         const ulong difficulty)
 {
     const ulong attempt_l = *((__constant ulong *) attempt) + get_global_id(0);
-    if (blake2b(attempt_l, vload4(0, item_a)) >= difficulty)
+    if (blake2b(attempt_l, item_a) >= difficulty)
         *((__global ulong *) result_a) = attempt_l;
 }

blake2b() will implicitly convert the argument to ulong *, which is fine for now.
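If that implicit conversion ever becomes a problem, the original ulong4-based signature could also be kept by loading through an explicitly cast, address-space-qualified pointer. A minimal sketch (hypothetical helper name, not part of the merged patch), assuming the item buffer is 8-byte aligned:

static inline ulong4 load_item(__constant uchar *item)
{
    /* Reinterpret the 32-byte item as four ulongs in one vector load. */
    return vload4(0, (__constant ulong *) item);
}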

@jserv (Contributor, author) commented Aug 25, 2020

Thanks, @guilhermelawless, for revising. I have minimized the changes.

@PlasmaPower (Contributor) commented:

How about what I have in #21? That works on both AMD and Nvidia, and I haven't seen any further performance improvements since then.

The single commit in this PR carries the following message:

The OpenCL kernel has been rewritten for the following improvements:
1. Completely remove unnecessary intermediate buffers;
2. Fully vectorize Blake2b;
3. Schedule registers in Blake2b;
4. Load all sigma constants in a single instruction and use macros
   for constant evaluation;
5. Assume messages never exceed 17 exabytes and apply the corresponding optimizations;
6. Implement AMD fastpath for rotr64;
7. Specify __constant, for both optimization and error checking;

It is known to boost performance on several NVIDIA and AMD GPUs.

Co-authored-by: Lee Bousfield <[email protected]>
@guilhermelawless guilhermelawless self-requested a review August 25, 2020 15:30
@guilhermelawless (Contributor) commented Aug 25, 2020

Seems to be working, thanks! We'll be merging this and making a release soon. @PlasmaPower, would you like to make the required changes to have everything as ulong before that goes out? Otherwise I will take a look, but it will take a while longer.

@PlasmaPower (Contributor) commented:

@guilhermelawless, if I do end up making that change, I can do it post-release in a separate PR, since it shouldn't affect speed or anything, just code quality.

@guilhermelawless merged commit e83d345 into nanocurrency:master on Aug 31, 2020
guilhermelawless added a commit to guilhermelawless/nano-node that referenced this pull request Aug 31, 2020
Credit and thanks go to @jserv and @PlasmaPower for the contribution.

Originally pushed to nanocurrency/nano-work-server#18, this kernel was rewritten with the following improvements:
1. Completely remove unnecessary intermediate buffers;
2. Fully vectorize Blake2b;
3. Schedule registers in Blake2b;
4. Load all sigma constants in a single instruction and use macros for constant evaluation;
5. Assume messages never exceed 17 exabytes and apply the corresponding optimizations;
6. Implement AMD fastpath for rotr64;
7. Specify __constant for both optimization and error checking;

Co-authored-by: Jim Huang <[email protected]>
Co-authored-by: Lee Bousfield <[email protected]>
Signed-off-by: Guilherme Lawless <[email protected]>
Signed-off-by: Sergey Kroshnin <[email protected]>
guilhermelawless added a commit to nanocurrency/nano-node that referenced this pull request Sep 2, 2020