OpenCL improvements #1

Merged
merged 1 commit into from
Aug 1, 2020
Contributor

jserv commented Jul 31, 2020

This patch tweaks the OpenCL kernel in the following respects:

  1. Reduce unnecessary memory accesses;
  2. Remove unreachable code;
  3. Specialize character-wise set;
  4. Add loop unrolling hints;
  5. Assume messages do not exceed 17 exabytes and apply the optimizations this allows.

It yields about a 15% speedup on an NVIDIA TITAN Xp.
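Item 5 can be illustrated with a small sketch. It assumes the kernel hashes with BLAKE2b, whose byte counter is 128 bits wide and stored as two 64-bit words, and that the size bound in item 5 refers to keeping the total length below 2^64 bytes so the high word never changes. The type names `counter128` and `counter64` are hypothetical and are not taken from the patch; the sketch is written in Go for readability, not in OpenCL C.

```go
package main

import "fmt"

// counter128 is a 128-bit byte counter split into two 64-bit words,
// in the style of BLAKE2b (lo = low word, hi = high word).
type counter128 struct {
	lo, hi uint64
}

// add advances the counter by n bytes and propagates the carry.
func (c *counter128) add(n uint64) {
	c.lo += n
	if c.lo < n { // unsigned wrap-around means a carry into the high word
		c.hi++
	}
}

// counter64 is the specialised form. It is only valid while the total
// message length stays below 2^64 bytes, so the high word is always zero
// and both the carry check and the high word can be dropped.
type counter64 struct {
	lo uint64
}

// add advances the counter by n bytes with no carry handling.
func (c *counter64) add(n uint64) {
	c.lo += n
}

func main() {
	var full counter128
	var fast counter64
	for i := 0; i < 3; i++ {
		full.add(128) // one 128-byte block per step
		fast.add(128)
	}
	// For short messages both counters agree and the high word stays zero.
	fmt.Println(full.lo == fast.lo, full.hi == 0)
}
```

Under these assumptions the saving is one comparison and one word of state per block, which is small on its own but sits on the hot path of every hash attempt.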

Contributor Author

jserv commented Jul 31, 2020

Test environment:

  • Intel Xeon E5-2650 v4 @ 2.20GHz
  • NVIDIA TITAN Xp

original OpenCL kernel:

go test -tags cl
ok  	github.com/inkeliz/nanopow	1.872s

new OpenCL kernel:

go test -tags cl
ok  	github.com/inkeliz/nanopow	1.515s
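To relate these two wall-clock figures to the speedup quoted in the description, here is a minimal sketch of the arithmetic. The helper names `reductionPercent` and `speedupFactor` are hypothetical. The times reported by `go test` cover the whole test run, so the result is only a rough indication of the kernel's own speedup.

```go
package main

import "fmt"

// reductionPercent returns how much less time the new run took,
// as a percentage of the old run's time.
func reductionPercent(oldSec, newSec float64) float64 {
	return (oldSec - newSec) / oldSec * 100
}

// speedupFactor returns the ratio of old time to new time.
func speedupFactor(oldSec, newSec float64) float64 {
	return oldSec / newSec
}

func main() {
	// Wall-clock times reported by `go test -tags cl` above, in seconds.
	oldSec, newSec := 1.872, 1.515

	fmt.Printf("time reduction: %.1f%%\n", reductionPercent(oldSec, newSec))
	fmt.Printf("speedup factor: %.2fx\n", speedupFactor(oldSec, newSec))
}
```

The two measures differ: a reduction in run time of p percent corresponds to a speedup factor of 1/(1 - p/100), so it helps to state which one a quoted percentage refers to.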

Owner

inkeliz commented Aug 1, 2020

I ran some tests to compare the new kernel against the old one.

| Method | 1 | 2 | 3 | 4 | 5 | AVG | DIFF AVG |
|---|---|---|---|---|---|---|---|
| Old | 3.750095s | 3.7518617s | 3.8026438s | 3.8125715s | 3.7470841s | 3.77285122s | |
| New | 3.5891672s | 3.6811363s | 3.5833844s | 3.5990486s | 3.6328716s | 3.61712162s | -155.7296ms |
| Old | 679.7822ms | 661.0906ms | 694.4251ms | 688.5022ms | 675.3416ms | 679.82834ms | |
| New | 661.81ms | 683.2097ms | 687.1964ms | 685.6328ms | 675.2828ms | 678.62634ms | -1.202ms |
| Old | 852.6171ms | 874.1722ms | 870.2746ms | 868.3854ms | 854.5647ms | 864.0028ms | |
| New | 793.5056ms | 797.9352ms | 796.9573ms | 810.9946ms | 809.6622ms | 801.81098ms | -62.19182ms |
| Old | 1.5827132s | 1.5719757s | 1.5685329s | 1.5682206s | 1.5640621s | 1.5711009s | |
| New | 1.4616328s | 1.4678935s | 1.4640432s | 1.4699088s | 1.4798121s | 1.46865808s | -102.44282ms |
| Old | 991.2934ms | 988.3621ms | 988.3479ms | 990.0114ms | 996.3482ms | 990.8726ms | |
| New | 925.7825ms | 927.2461ms | 924.6406ms | 925.4016ms | 930.2903ms | 926.67222ms | -64.20038ms |
| Old | 1.6506272s | 1.6486656s | 1.6452237s | 1.6407873s | 1.6407119s | 1.64520314s | |
| New | 1.5455168s | 1.5445504s | 1.5351531s | 1.5346411s | 1.5691196s | 1.5457962s | -99.40694ms |
| Old | 1.0519654s | 1.0464868s | 1.0527362s | 1.0494802s | 1.0576293s | 1.05165958s | |
| New | 1.0775959s | 1.0484951s | 1.041163s | 1.0528584s | 1.0782591s | 1.0596743s | +8.01472ms |
| Old | 554.1943ms | 546.9022ms | 553.3509ms | 548.3345ms | 547.8762ms | 550.13162ms | |
| New | 511.8196ms | 514.2215ms | 524.3791ms | 519.5652ms | 521.5288ms | 518.30284ms | -31.82878ms |
| Old | 3.2346438s | 3.2263787s | 3.2378015s | 3.2312395s | 3.2161679s | 3.22924628s | |
| New | 3.0250108s | 3.0311579s | 3.0208667s | 3.0197474s | 3.0588887s | 3.0311343s | -198.11198ms |
| Old | 4.4505093s | 4.4264818s | 4.4464126s | 4.4521405s | 4.4411519s | 4.44333922s | |
| New | 4.1567282s | 4.1498609s | 4.1479333s | 4.1556903s | 4.1721991s | 4.15648236s | -286.85686ms |
| Old | 1.8628471s | 1.8601246s | 1.8771903s | 1.8664901s | 1.870448s | 1.86742002s | |
| New | 1.7487176s | 1.7448945s | 1.7521662s | 1.7497653s | 1.7518566s | 1.74948004s | -117.93998ms |
| Old | 1.3224505s | 1.3190605s | 1.3184977s | 1.3240782s | 1.3220376s | 1.3212249s | |
| New | 1.2311718s | 1.2261551s | 1.2272802s | 1.2277292s | 1.2365423s | 1.22977572s | -91.44918ms |
| Old | 9.7026788s | 9.7195007s | 9.7487819s | 9.7586041s | 9.691135s | 9.7241401s | |
| New | 9.0772711s | 9.088506s | 9.0921451s | 9.0751652s | 9.0828598s | 9.08318944s | -640.95066ms |
| Old | 105.9825ms | 105.9823ms | 107.4155ms | 118.6828ms | 109.3666ms | 109.48594ms | |
| New | 99.6019ms | 100.5798ms | 100.1262ms | 99.1483ms | 102.0811ms | 100.30746ms | -9.17848ms |
| Old | 716.8839ms | 729.2025ms | 718.3843ms | 962.0821ms | 709.6402ms | 767.2386ms | |
| New | 721.7745ms | 712.4591ms | 715.8993ms | 721.7627ms | 704.1815ms | 715.21542ms | -52.02318ms |
| Old | 683.2373ms | 679.2493ms | 679.2507ms | 699.8307ms | 672.9333ms | 682.90026ms | |
| New | 628.411ms | 642.2091ms | 640.1953ms | 633.8161ms | 626.4564ms | 634.21758ms | -48.68268ms |
| Old | 159.2372ms | 158.1943ms | 158.193ms | 159.25ms | 158.7205ms | 158.719ms | |
| New | 147.4526ms | 147.4495ms | 148.4283ms | 148.9531ms | 147.5185ms | 147.9604ms | -10.7586ms |
| Old | 832.1129ms | 838.0431ms | 837.9707ms | 855.5811ms | 835.097ms | 839.76096ms | |
| New | 787.4685ms | 790.1214ms | 789.2874ms | 784.3313ms | 781.3869ms | 786.5191ms | -53.24186ms |
| Old | 2.2339463s | 2.2379453s | 2.3319162s | 2.4193271s | 2.256169s | 2.29586078s | |
| New | 2.272717s | 2.2634248s | 2.255138s | 2.2448571s | 2.2688344s | 2.26099426s | -34.86652ms |
| Old | 623.6024ms | 614.3534ms | 627.9626ms | 688.0205ms | 615.2757ms | 633.84292ms | |
| New | 614.4284ms | 630.9495ms | 616.3047ms | 620.8077ms | 614.8711ms | 619.47228ms | -14.37064ms |

I'm using an RX 5700 XT with the current Nano network difficulty setting (8x). In almost every case the new OpenCL kernel is faster than the old one, with few exceptions (one in the list above).

I ran custom tests multiple times with each OpenCL implementation; the times are listed above. The new kernel saves up to 0.64 seconds, and it is slower by 0.008 seconds in one case. In most cases the differences are not significant, but the results are consistent, saving a few milliseconds.

Across a total of 20 PoWs, the new implementation consistently saves around 2 seconds. I also generated 200 PoWs, which took 272.680s with the old kernel and 259.198s with the new one, so that saves ~13 seconds. I'll try to run more tests tonight. 👍
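For anyone who wants to recompute the AVG and DIFF AVG columns, here is a minimal sketch of the arithmetic, using the first Old/New pair and the 200-PoW totals quoted above. The helper name `mean` is hypothetical, and durations are held as plain seconds in `float64` for brevity.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs; it returns 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// First Old/New pair from the table, five runs each, in seconds.
	oldRuns := []float64{3.750095, 3.7518617, 3.8026438, 3.8125715, 3.7470841}
	newRuns := []float64{3.5891672, 3.6811363, 3.5833844, 3.5990486, 3.6328716}

	oldAvg, newAvg := mean(oldRuns), mean(newRuns)
	// A negative difference means the new kernel was faster on average.
	fmt.Printf("old avg %.4fs, new avg %.4fs, diff %+.1fms\n",
		oldAvg, newAvg, (newAvg-oldAvg)*1000)

	// Totals reported for the 200-PoW run, in seconds.
	oldTotal, newTotal := 272.680, 259.198
	saved := oldTotal - newTotal
	fmt.Printf("200 PoWs: saved %.3fs (%.1f%% of the old total)\n",
		saved, saved/oldTotal*100)
}
```

Because each PoW takes a different amount of work, the per-pair differences vary widely in absolute terms; comparing the relative saving per pair is one way to put them on the same scale.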

Thank you for the PR. (:
