Rewrite OpenCL kernel for performance optimizations #18
Conversation
Thanks for your patch. Could you provide some details about your test setup? I'm not seeing any statistically significant improvements on my end with a Vega 64.
I lacked a reasonable benchmarking tool for PoW; I used nanopow for my experiments while improving the OpenCL kernel. Can you suggest a more comprehensive and usable benchmark suite?
I've just been using this simple script: https://github.com/guilhermelawless/blake2b-pow-bench . If you target a high enough difficulty (at least Nano's base difficulty at the moment), it shouldn't need multiple processes, but it doesn't hurt to have 5 or so.
I did try blake2b-pow-bench but found the results varied a lot. For example, the command I used was
@guilhermelawless, The proposed change against OpenCL kernel has been tested by @inkeliz. See inkeliz/nanopow#2 for details.
It is almost 80% faster on an RX 5700 XT.
FYI I've made #21 which includes your latest optimizations in inkeliz/nanopow#2 plus a couple of my own to remove the blake2b state entirely.
Tested on an RX 5700 XT 8 GB (around 75% faster).
What do you mean by "Reduce the batch size"? Are you referring to the local work size? It doesn't seem to be modified in this PR. Also, nice find on the AMD rotr; how much does that improve performance?
Check this: inkeliz/nanopow#4 |
Thanks for pointing out the out-of-date change, which was meant to stay in my internal commits. I have just revised the commit messages.
I'd be tempted to split up the rotr into a function per rotate amount (so rotr16, rotr24, rotr32, and rotr63). Especially for the 32 bit rotr which is really just returning the uints you're already extracting in the AMD version in the opposite order. |
Tested the changes with AMD & Nvidia cards; it seems to be a very efficient boost. Depending on GPU architecture it can be even 5-9 times faster than the reference (comparisons against the nano-node CLI --debug_opencl).
It would be good to have similar PR to nano-node repository as well.
I believe the type signature change from uchar to ulong for attempt and result breaks this as used in this repo. At least on my machine ocl complains about a type mismatch. That said, I inherited the uchar args from the stock nano-node kernel, and I think ulongs are the better option. We just need to update the Rust code a bit to fix it.
^ Confirmed for an AMD card with nano-work-server
@jserv could you apply this patch so we can have compatibility with AMD again? This reverts to reinterpreting the kernel arguments within the kernel code itself. We can do a follow-up PR to change the args into
Thanks!
DONE. I have rebased and force-pushed. |
@jserv the following changes were necessary to be able to run on AMD:
Thanks to @guilhermelawless for revising. I minimized the changes.
How about what I have in #21? That works for both AMD and Nvidia, and I haven't seen any performance improvements since then.
The OpenCL kernel has been rewritten for the following improvements:
1. Completely remove unnecessary intermediate buffers;
2. Fully vectorize Blake2b;
3. Schedule registers in Blake2b;
4. Load all sigma constants in a single instruction and use macros for constant evaluation;
5. Assume messages never exceed 17 exabytes and optimize accordingly;
6. Implement an AMD fastpath for rotr64;
7. Specify __constant, for both optimization and error checking.
It is known to boost performance on several NVIDIA and AMD GPUs.
Co-authored-by: Lee Bousfield <[email protected]>
Seems to be working, thanks! We'll be merging this and making a release soon, @PlasmaPower would you like to do the required changes to have everything |
@guilhermelawless if I do end up making that change, I can do it post-release in a separate PR, since it shouldn't affect speed or anything just code quality. |
Credit and thanks go to @jserv and @PlasmaPower for the contribution. Originally pushed to nanocurrency/nano-work-server#18, this kernel was rewritten with the following improvements:
1. Completely remove unnecessary intermediate buffers;
2. Fully vectorize Blake2b;
3. Schedule registers in Blake2b;
4. Load all sigma constants in a single instruction and use macros for constant evaluation;
5. Assume messages never exceed 17 exabytes and optimize accordingly;
6. Implement an AMD fastpath for rotr64;
7. Specify __constant for both optimization and error checking.
Co-authored-by: Jim Huang <[email protected]>
Co-authored-by: Lee Bousfield <[email protected]>
Signed-off-by: Guilherme Lawless <[email protected]>
Signed-off-by: Sergey Kroshnin <[email protected]>