OpenCL improvements #1

jserv · 2020-07-31T07:17:27Z

This patch tweaks the OpenCL kernel in the following respects:

1. Reduce unnecessary memory accesses;
2. Remove unreachable code;
3. Specialize character-wise set;
4. Add loop unrolling hints;
5. Assume messages do not exceed 17 exabytes and apply the corresponding optimizations.

It is known to bring about a 15% speedup on an NVIDIA TITAN Xp.
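As a rough illustration of the message-size assumption, here is a minimal Go sketch (not the kernel code from this patch). It assumes the size limit refers to narrowing a two-word byte counter, as used by BLAKE2b-style hashes, down to a single 64-bit word; the PR text does not state this explicitly. The names `counter128`, `addFull`, and `addLow` are hypothetical.

```go
package main

import "fmt"

// counter128 is a hypothetical two-word byte counter (low word, high word).
type counter128 struct{ lo, hi uint64 }

// addFull advances the counter and propagates a carry into the high word.
func (c *counter128) addFull(n uint64) {
	c.lo += n
	if c.lo < n { // unsigned wrap-around means the low word overflowed
		c.hi++
	}
}

// addLow advances only the low word. It matches addFull only while the
// total byte count stays below 2^64, which is the assumption being sketched.
func addLow(lo, n uint64) uint64 { return lo + n }

func main() {
	var full counter128
	var low uint64
	for i := 0; i < 4; i++ {
		full.addFull(128)
		low = addLow(low, 128)
	}
	// For small inputs the high word is never touched, so both agree.
	fmt.Println(full.lo == low, full.hi == 0)
}
```

Under that assumption the carry check and the high word become dead work for any realistic input, which is the kind of branch and memory access the list above aims to remove.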


jserv · 2020-07-31T07:21:55Z

Test environment:

Intel Xeon E5-2650 v4 @ 2.20GHz
NVIDIA TITAN Xp

original OpenCL kernel:

go test -tags cl
ok  	github.com/inkeliz/nanopow	1.872s

new OpenCL kernel:

go test -tags cl
ok  	github.com/inkeliz/nanopow	1.515s

inkeliz · 2020-08-01T17:48:46Z

I ran some tests to compare the new kernel against the old one.

Kernel	Run 1	Run 2	Run 3	Run 4	Run 5	AVG	DIFF AVG
Old	3.750095s	3.7518617s	3.8026438s	3.8125715s	3.7470841s	3.77285122s
New	3.5891672s	3.6811363s	3.5833844s	3.5990486s	3.6328716s	3.61712162s	-155.7296ms
Old	679.7822ms	661.0906ms	694.4251ms	688.5022ms	675.3416ms	679.82834ms
New	661.81ms	683.2097ms	687.1964ms	685.6328ms	675.2828ms	678.62634ms	-1.202ms
Old	852.6171ms	874.1722ms	870.2746ms	868.3854ms	854.5647ms	864.0028ms
New	793.5056ms	797.9352ms	796.9573ms	810.9946ms	809.6622ms	801.81098ms	-62.19182ms
Old	1.5827132s	1.5719757s	1.5685329s	1.5682206s	1.5640621s	1.5711009s
New	1.4616328s	1.4678935s	1.4640432s	1.4699088s	1.4798121s	1.46865808s	-102.44282ms
Old	991.2934ms	988.3621ms	988.3479ms	990.0114ms	996.3482ms	990.8726ms
New	925.7825ms	927.2461ms	924.6406ms	925.4016ms	930.2903ms	926.67222ms	-64.20038ms
Old	1.6506272s	1.6486656s	1.6452237s	1.6407873s	1.6407119s	1.64520314s
New	1.5455168s	1.5445504s	1.5351531s	1.5346411s	1.5691196s	1.5457962s	-99.40694ms
Old	1.0519654s	1.0464868s	1.0527362s	1.0494802s	1.0576293s	1.05165958s
New	1.0775959s	1.0484951s	1.041163s	1.0528584s	1.0782591s	1.0596743s	+8.01472ms
Old	554.1943ms	546.9022ms	553.3509ms	548.3345ms	547.8762ms	550.13162ms
New	511.8196ms	514.2215ms	524.3791ms	519.5652ms	521.5288ms	518.30284ms	-31.82878ms
Old	3.2346438s	3.2263787s	3.2378015s	3.2312395s	3.2161679s	3.22924628s
New	3.0250108s	3.0311579s	3.0208667s	3.0197474s	3.0588887s	3.0311343s	-198.11198ms
Old	4.4505093s	4.4264818s	4.4464126s	4.4521405s	4.4411519s	4.44333922s
New	4.1567282s	4.1498609s	4.1479333s	4.1556903s	4.1721991s	4.15648236s	-286.85686ms
Old	1.8628471s	1.8601246s	1.8771903s	1.8664901s	1.870448s	1.86742002s
New	1.7487176s	1.7448945s	1.7521662s	1.7497653s	1.7518566s	1.74948004s	-117.93998ms
Old	1.3224505s	1.3190605s	1.3184977s	1.3240782s	1.3220376s	1.3212249s
New	1.2311718s	1.2261551s	1.2272802s	1.2277292s	1.2365423s	1.22977572s	-91.44918ms
Old	9.7026788s	9.7195007s	9.7487819s	9.7586041s	9.691135s	9.7241401s
New	9.0772711s	9.088506s	9.0921451s	9.0751652s	9.0828598s	9.08318944s	-640.95066ms
Old	105.9825ms	105.9823ms	107.4155ms	118.6828ms	109.3666ms	109.48594ms
New	99.6019ms	100.5798ms	100.1262ms	99.1483ms	102.0811ms	100.30746ms	-9.17848ms
Old	716.8839ms	729.2025ms	718.3843ms	962.0821ms	709.6402ms	767.2386ms
New	721.7745ms	712.4591ms	715.8993ms	721.7627ms	704.1815ms	715.21542ms	-52.02318ms
Old	683.2373ms	679.2493ms	679.2507ms	699.8307ms	672.9333ms	682.90026ms
New	628.411ms	642.2091ms	640.1953ms	633.8161ms	626.4564ms	634.21758ms	-48.68268ms
Old	159.2372ms	158.1943ms	158.193ms	159.25ms	158.7205ms	158.719ms
New	147.4526ms	147.4495ms	148.4283ms	148.9531ms	147.5185ms	147.9604ms	-10.7586ms
Old	832.1129ms	838.0431ms	837.9707ms	855.5811ms	835.097ms	839.76096ms
New	787.4685ms	790.1214ms	789.2874ms	784.3313ms	781.3869ms	786.5191ms	-53.24186ms
Old	2.2339463s	2.2379453s	2.3319162s	2.4193271s	2.256169s	2.29586078s
New	2.272717s	2.2634248s	2.255138s	2.2448571s	2.2688344s	2.26099426s	-34.86652ms
Old	623.6024ms	614.3534ms	627.9626ms	688.0205ms	615.2757ms	633.84292ms
New	614.4284ms	630.9495ms	616.3047ms	620.8077ms	614.8711ms	619.47228ms	-14.37064ms

I'm using an RX 5700 XT with the current Nano network difficulty setting (8x). In almost every case the new OpenCL kernel is faster than the old one, with few exceptions (one in the list above).

I ran custom tests multiple times with each OpenCL implementation. The times are listed above. The new kernel saves up to 0.64 seconds, and it is slower by about 0.008 seconds in one case. In most cases the differences are not significant, but the results are consistent, saving a few milliseconds.

Over a total of 20 PoWs, the new implementation consistently saves around 2 seconds. I also generated 200 PoWs, which took 272.680s with the old kernel and 259.198s with the new one, saving ~13 seconds. I'll try to run more tests tonight. 👍

Thank you for the PR. (:

jserv mentioned this pull request Jul 31, 2020

Rewrite OpenCL kernel for performance optimizations nanocurrency/nano-work-server#18

Merged

inkeliz merged commit cd443d0 into inkeliz:master Aug 1, 2020

inkeliz mentioned this pull request Aug 4, 2020

Further OpenCL kernel improvements #2

Merged
