[Cluster] Param tweak. #1931
Small tweak of parameters, yielding some Elo.

The cluster branch can now be considered to be in good shape. In local testing, it runs stable for >30k games. Performance benefits from an MPI implementation that is able to make asynchronous progress. The code should be run with 1 MPI rank per node, and threaded on the node.

Performance against master has now been measured. Master has been given 1 node with 32 cores/threads in standard SMP, the cluster branch has been given N=2..20 of those nodes, running the corresponding number of MPI processes, each with 32 threads. Time control has been 10s+0.1s, Hash 8MB/core, the book 8moves_v3.pgn, the number of games 400.

```
Score of cluster-2mpix32t vs master-32t: 96 - 27 - 277 [0.586] 400
Elo difference: 60.54 +/- 18.49
Score of cluster-3mpix32t vs master-32t: 101 - 18 - 281 [0.604] 400
Elo difference: 73.16 +/- 17.94
Score of cluster-4mpix32t vs master-32t: 126 - 18 - 256 [0.635] 400
Elo difference: 96.19 +/- 19.68
Score of cluster-5mpix32t vs master-32t: 110 - 5 - 285 [0.631] 400
Elo difference: 93.39 +/- 17.09
Score of cluster-6mpix32t vs master-32t: 117 - 9 - 274 [0.635] 400
Elo difference: 96.19 +/- 18.06
Score of cluster-7mpix32t vs master-32t: 142 - 10 - 248 [0.665] 400
Elo difference: 119.11 +/- 19.89
Score of cluster-8mpix32t vs master-32t: 125 - 14 - 261 [0.639] 400
Elo difference: 99.01 +/- 19.18
Score of cluster-9mpix32t vs master-32t: 137 - 7 - 256 [0.662] 400
Elo difference: 117.16 +/- 19.20
Score of cluster-10mpix32t vs master-32t: 145 - 8 - 247 [0.671] 400
Elo difference: 124.01 +/- 19.86
Score of cluster-16mpix32t vs master-32t: 153 - 6 - 241 [0.684] 400
Elo difference: 133.95 +/- 20.17
Score of cluster-20mpix32t vs master-32t: 134 - 8 - 258 [0.657] 400
Elo difference: 113.29 +/- 19.11
```

As the cluster parallelism is essentially lazyMPI, the nodes per second has been verified to scale perfectly to large node counts. Unfortunately, that is not necessarily indicative of playing strength.
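For readers who want to check the match results listed in this description, here is a minimal sketch of how a win/loss/draw count maps to the bracketed score and the Elo difference. It assumes the standard logistic Elo model; the function names `score_fraction` and `elo_difference` are made up for illustration and are not part of Stockfish or of the match tool that produced the numbers. The `+/-` error margins are not reproduced here.

```cpp
#include <cassert>
#include <cmath>

// Score fraction of the first engine: a win counts 1, a draw 1/2, a loss 0.
// (Hypothetical helper, not taken from the PR.)
double score_fraction(int wins, int losses, int draws) {
    const int games = wins + losses + draws;
    assert(games > 0);
    return (wins + 0.5 * draws) / games;
}

// Elo difference under the logistic model (an assumption about how the
// reported numbers were derived); only defined for 0 < score < 1.
double elo_difference(double score) {
    assert(score > 0.0 && score < 1.0);
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

For the 2-node result (96 - 27 - 277), `score_fraction` gives 0.58625 and `elo_difference` gives roughly 60.54, in line with the reported line.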
In the following 2min search from startPos, we reach about 4.8Gnps (128 nodes).

```
info depth 38 seldepth 51 multipv 1 score cp 53 nodes 576165794092 nps 4801341606 hashfull 1000 tbhits 0 time 120001 pv e2e4 c7c5 g1f3 d7d6 f1b5 c8d7 b5d7 d8d7 c2c4 b8c6 b1c3 g8f6 d2d4 d7g4 d4d5 c6d4 f3d4 g4d1 e1d1 c5d4 c3b5 a8c8 b2b3 a7a6 b5d4 f6e4 d1e2 g7g6 c1e3 f8g7 a1c1 e4c5 f2f3 f7f5 h1d1 e8g8 d4c2 c5d7 a2a4 a6a5 e3d4 f5f4 d4f2 f8f7 h2h3 d7c5
```
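As a sanity check on the search output above, the `nps` field can be recomputed from the `nodes` and `time` fields (time is in milliseconds in UCI output). The sketch below is illustrative only; `nodes_per_second` is a hypothetical helper and the integer rounding is an assumption, not taken from the Stockfish source.

```cpp
#include <cassert>
#include <cstdint>

// Nodes per second from a UCI "info" line: nodes searched divided by the
// elapsed time in milliseconds, scaled to seconds. Integer division
// truncates the result (assumed rounding behaviour).
std::uint64_t nodes_per_second(std::uint64_t nodes, std::uint64_t time_ms) {
    assert(time_ms > 0);
    return nodes * 1000 / time_ms;
}
```

With `nodes = 576165794092` and `time_ms = 120001` this yields 4801341606, i.e. about 4.8 Gnps, or roughly 37.5 Mnps for each of the 128 nodes.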
Is this right when I sum it up like this: 128 CPUs x 32 threads = 4096 threads for the cluster? Do you still have this log? I would like to put it on my website. Thanks,
So are we getting about half the scaling (~10 Elo per doubling from 256 to 512 threads) compared to local cores?
@vondele Thanks. I was looking more for the bench log, like the last line in your first message, for my website. Okay, got it! Many thanks!
@vondele I thought this run was done on noobpwnftw's systems, but it was not. So can I ask for some info about this cluster: which CPUs were used, and at what clock speed? Thanks,
Merged via 8c4338a, congrats and thanks for the progress graph :-)
SkipPhase removal was tested by protonspring (official-stockfish#1835).

STC (+1 offset) (3 threads)
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 28428 W: 6278 L: 6170 D: 15980
http://tests.stockfishchess.org/tests/view/5bfe01c20ebc5902bceda021

STC (+1 offset) (8 threads)
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 26002 W: 5082 L: 4970 D: 15950
http://tests.stockfishchess.org/tests/view/5bfe132c0ebc5902bceda12f

SkipSize for threads 1-20 can be captured exactly by

skipSize = int(std::log(idx + 1) / std::log(1.92));

where logarithmic growth seems natural (with a base similar to the branching factor). The formula extends to larger thread counts.
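To make the logarithmic growth concrete, the formula from the commit message can be wrapped in a small function and evaluated for a few thread indices. This is only a sketch: the wrapper name `skip_size` is hypothetical, and it assumes `idx` is the 1-based index of a helper thread, which the message suggests ("threads 1-20") but does not state outright.

```cpp
#include <cassert>
#include <cmath>

// skipSize as given in the commit message, wrapped in a hypothetical helper.
// Assumption: idx is a 1-based helper-thread index.
int skip_size(int idx) {
    assert(idx >= 1);
    return int(std::log(idx + 1) / std::log(1.92));
}
```

Under that assumption the value steps up slowly: 1 for threads 1-2, 2 for threads 3-6, 3 for threads 7-12, and 4 for threads 13-20, with each plateau two threads longer than the previous one over this range.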