[Cluster] Param tweak. #1931

vondele · 2019-01-05T12:14:54Z

Small tweak of parameters, yielding some Elo.

The cluster branch can now be considered to be in good shape. In local testing, it ran stably for >30k games. Performance benefits from an MPI implementation that is able to make asynchronous progress. The code should be run with 1 MPI rank per node, with threading used within each node.

Performance against master has now been measured. Master was given 1 node with 32 cores/threads in standard SMP; the cluster branch was given N=2..20 of those nodes, running the corresponding number of MPI processes, each with 32 threads. Time control was 10s+0.1s, Hash 8MB/core, the book 8moves_v3.pgn, and 400 games per match.

Score of cluster-2mpix32t vs master-32t: 96 - 27 - 277  [0.586] 400
Elo difference: 60.54 +/- 18.49

Score of cluster-3mpix32t vs master-32t: 101 - 18 - 281  [0.604] 400
Elo difference: 73.16 +/- 17.94

Score of cluster-4mpix32t vs master-32t: 126 - 18 - 256  [0.635] 400
Elo difference: 96.19 +/- 19.68

Score of cluster-5mpix32t vs master-32t: 110 - 5 - 285  [0.631] 400
Elo difference: 93.39 +/- 17.09

Score of cluster-6mpix32t vs master-32t: 117 - 9 - 274  [0.635] 400
Elo difference: 96.19 +/- 18.06

Score of cluster-7mpix32t vs master-32t: 142 - 10 - 248  [0.665] 400
Elo difference: 119.11 +/- 19.89

Score of cluster-8mpix32t vs master-32t: 125 - 14 - 261  [0.639] 400
Elo difference: 99.01 +/- 19.18

Score of cluster-9mpix32t vs master-32t: 137 - 7 - 256  [0.662] 400
Elo difference: 117.16 +/- 19.20

Score of cluster-10mpix32t vs master-32t: 145 - 8 - 247  [0.671] 400
Elo difference: 124.01 +/- 19.86

Score of cluster-16mpix32t vs master-32t: 153 - 6 - 241  [0.684] 400
Elo difference: 133.95 +/- 20.17

Score of cluster-20mpix32t vs master-32t: 134 - 8 - 258  [0.657] 400
Elo difference: 113.29 +/- 19.11
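The Elo differences above are consistent with the usual logistic Elo model applied to the score fraction in brackets. The following is a minimal sketch of that conversion, assuming the standard model; the function names `scoreFraction` and `eloFromScore` are hypothetical and not taken from the branch or from the tool that produced the output, and the error bars are not reproduced here.

```cpp
#include <cassert>
#include <cmath>

// Score fraction of a match: a win counts 1, a draw 1/2, a loss 0.
// (Hypothetical helper name, for illustration only.)
double scoreFraction(int wins, int losses, int draws) {
    const int games = wins + losses + draws;
    assert(games > 0);
    return (wins + 0.5 * draws) / games;
}

// Logistic Elo model: score = 1 / (1 + 10^(-elo / 400)), solved for elo.
// Only defined for 0 < score < 1.
double eloFromScore(double score) {
    assert(score > 0.0 && score < 1.0);
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

For the 2-node match, `scoreFraction(96, 27, 277)` gives 0.58625, and `eloFromScore` maps that to about 60.5 Elo, in line with the reported 60.54.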

As the cluster parallelism is essentially lazyMPI, the nodes-per-second rate scales perfectly, which has been verified up to large node counts. Unfortunately, that is not necessarily indicative of playing strength. In the following 2min search from startPos, we reach about 4.8 Gnps on 128 nodes.

info depth 38 seldepth 51 multipv 1 score cp 53 nodes 576165794092 nps 4801341606 hashfull 1000 tbhits 0 time 120001 pv e2e4 c7c5 g1f3 d7d6 f1b5 c8d7 b5d7 d8d7 c2c4 b8c6 b1c3 g8f6 d2d4 d7g4 d4d5 c6d4 f3d4 g4d1 e1d1 c5d4 c3b5 a8c8 b2b3 a7a6 b5d4 f6e4 d1e2 g7g6 c1e3 f8g7 a1c1 e4c5 f2f3 f7f5 h1d1 e8g8 d4c2 c5d7 a2a4 a6a5 e3d4 f5f4 d4f2 f8f7 h2h3 d7c5
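The `nps` field in the line above can be cross-checked against the `nodes` and `time` fields. This is a small sketch, assuming `time` is in milliseconds as in the UCI protocol and that the rate is obtained by truncating integer division; the helper names `nodesPerSecond` and `perUnit` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Average search speed in nodes per second from the "nodes" and "time"
// (milliseconds) fields of an info line. Integer division truncates.
// Note: nodes * 1000 must fit in 64 bits, which holds for the values here.
std::uint64_t nodesPerSecond(std::uint64_t nodes, std::uint64_t timeMs) {
    assert(timeMs > 0);
    return nodes * 1000 / timeMs;
}

// Average speed contributed by each of `units` cluster nodes (or threads).
std::uint64_t perUnit(std::uint64_t nps, std::uint64_t units) {
    assert(units > 0);
    return nps / units;
}
```

With the values from the info line, `nodesPerSecond(576165794092, 120001)` reproduces the reported 4801341606, and `perUnit(4801341606, 128)` gives roughly 37.5 Mnps per node for the 128-node run.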


vondele · 2019-01-05T12:17:13Z

Some of the data above:

Ipmanchess · 2019-01-05T13:20:40Z

Is this right if I sum it up like this:

128 CPUs x 32 threads = 4096 threads used by the cluster,
which gives, from the start position after 2 min: 4,801,341,606 nodes/sec.

Do you still have this log? I would like to put it on my website.

Thanks,
Ipman.

noobpwnftw · 2019-01-05T13:20:56Z

So are we getting about half the scaling (~10 Elo per doubling from 256 to 512 threads) compared to local cores?

vondele · 2019-01-05T14:19:59Z

@Ipmanchess some raw data here https://github.com/vondele/Stockfish/tree/clusterData/clusterData

Ipmanchess · 2019-01-05T14:25:10Z

@vondele Thanks. I was looking more for a bench log like the last line of your first message, or like the ones on my website
(scroll all the way down): http://www.ipmanchess.yolasite.com/amd---intel-chess-bench.php

Okay, got it! Many thanks!

Ipmanchess · 2019-01-05T14:48:29Z

@vondele I thought this run was done on noobpwnftw's systems, but it was not.

So can I ask for some info about this cluster: which CPUs were used, and at what clock speed?
It was done with Stockfish 040119 popcnt.
I have the nodes/sec.
And which name can I use for it in my bench list?

Thanks,
Ipman.

snicolet · 2019-01-06T14:51:02Z

Merged via 8c4338a, congrats and thanks for the progress graph :-)

SkipPhase removal was tested by protonspring (official-stockfish#1835).

STC (+1 offset) (3 threads)
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 28428 W: 6278 L: 6170 D: 15980
http://tests.stockfishchess.org/tests/view/5bfe01c20ebc5902bceda021

STC (+1 offset) (8 threads)
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 26002 W: 5082 L: 4970 D: 15950
http://tests.stockfishchess.org/tests/view/5bfe132c0ebc5902bceda12f

SkipSize for threads 1-20 can be captured exactly by

skipSize = int(std::log(idx + 1) / std::log(1.92));

where logarithmic growth seems natural (with a base similar to the branching factor). The formula extends to larger thread counts.
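To make the quoted formula concrete, here is a minimal sketch that wraps it in a function. It assumes `idx` is a zero-based thread index with 0 being the main thread; the wrapper itself and the `assert` are additions for illustration, and only the expression inside the `return` comes from the commit message.

```cpp
#include <cassert>
#include <cmath>

// Skip size as a function of thread index, using the expression quoted in
// the commit message. Treating idx as a zero-based thread index is an
// assumption of this sketch.
int skipSize(int idx) {
    assert(idx >= 0);
    return int(std::log(idx + 1) / std::log(1.92));
}
```

Under that assumption, thread indices 1 to 7 give 1, 1, 2, 2, 2, 2, 3, so the value grows by one roughly each time the index grows by a factor of 1.92. Index 0 yields 0, so a caller would presumably apply the formula to helper threads only.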

snicolet merged commit 8c4338a into official-stockfish:cluster Jan 6, 2019
