Fishtest FAQ
Even if you don't know how to program yet, you can help the Stockfish developers improve the chess engine by connecting your computer to the fishtest network. Your computer will then play chess games in the background to help the developers test ideas and improvements.
Instructions on how to connect your computer to the fishtest network are given here:
- Running the worker: overview
- Running the worker on Windows
- Running the worker on Linux
- Running the worker on Macintosh
- Running the worker in the Amazon AWS EC2 cloud
The result of each game played by your computer while connected to fishtest is immediately sent to the fishtest database and used to update the statistics of the current test. So if you have more important work to do, you can take your computer off fishtest at any time with peace of mind, without wasting the games you have already finished.
The following questions are more technical and aimed at potential Stockfish developers:
You should first check whether the test has already been run. You can browse the tests history by following the corresponding link on the left of the fishtest main view.
Most tests should use the two-stage approach, starting with stage 1, and if that passes, using the reschedule button to create the stage 2 test.
- Stage 1 - TC 10+0.1, stop rule SPRT - fast button
- Stage 2 - TC 60+0.6, stop rule SPRT - slow button
SPRT stands for sequential probability ratio test. In SPRT, we have a null hypothesis that the two engines are equal in strength, and an alternative hypothesis that one of the engines is stronger. SPRT tests these hypotheses with the smallest expected number of games; that is, we don't fix the number of games to be played in advance. The parameters of the test control the Type I and Type II errors. Essentially, we run games sequentially and, after each game, update the value of a log-likelihood ratio. The test terminates when this value falls below a lower threshold or rises above an upper threshold. The thresholds are calculated from the two error parameters given to the test (please read the paragraph "Testing methodology" on the page Creating my first test for details).
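To make the idea concrete, here is a simplified sketch of such a test. This is not fishtest's actual implementation (which has evolved and nowadays works on a pentanomial model over game pairs); it shows the classic trinomial SPRT, using the BayesElo model for win/draw/loss probabilities, with illustrative parameter values (elo0, elo1, draw_elo, alpha, beta are all assumptions chosen for the example):

```python
import math

def bayeselo_probs(elo, draw_elo):
    """Win/draw/loss probabilities under the BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((-elo + draw_elo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo + draw_elo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def sprt_llr(wins, draws, losses, elo0, elo1, draw_elo=240.0):
    """Log-likelihood ratio of H1 (elo = elo1) vs H0 (elo = elo0)."""
    w0, d0, l0 = bayeselo_probs(elo0, draw_elo)
    w1, d1, l1 = bayeselo_probs(elo1, draw_elo)
    return (wins * math.log(w1 / w0)
            + draws * math.log(d1 / d0)
            + losses * math.log(l1 / l0))

# Thresholds derived from the Type I (alpha) and Type II (beta) errors.
alpha = beta = 0.05
lower = math.log(beta / (1 - alpha))   # fall below: accept H0, test fails
upper = math.log((1 - beta) / alpha)   # rise above: accept H1, test passes

# A run with a clear surplus of wins drives the LLR upward.
llr = sprt_llr(1000, 2000, 900, elo0=0, elo1=5)
print(llr > 0)  # prints True
```

After each game the LLR is recomputed from the running win/draw/loss totals, and the match simply continues while `lower < llr < upper`.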
You can use the NumGames stop rule with 20,000 games at TC 10+0.1 and schedule a few tests around the direction you want to tune in. If you find a tuning that looks good, you can then schedule a two-stage SPRT test.
Generally, four or five tries is the limit. It's a good balance between exploring the change and not giving lucky tries too much of a chance to pass.
No. For various reasons, please base your tests off current SF master.
A union is the bundling of patches that failed SPRT but scored positively or nearly so. Sometimes retesting the union as a whole passes SPRT. Because of the nature of the approach, and because each individual patch has already failed, a union has some constraints:
- Maximum 2 patches per union
- Each patch shall be trivial, like a parameter tweak. Patches that add/remove a concept/idea/feature shall pass individually.
If your branch name is passed_pawn, you can enter passed_pawn^, passed_pawn^^, ... in the branch field of the test submission page at https://tests.stockfishchess.org/tests/run .
Note for patch authors: when testing patches with more than 8 threads, it is necessary to disable "thread binding" in thread.cpp. Not doing so would have a negative effect on contributors' Windows machines with multiple NUMA nodes (more than one physical CPU) and more than 8 cores, due to the parallelization of our fishtest test scripts. This would bias the statistical value of the test.
The lines to comment out in thread.cpp are the following:
if (Options["Threads"] > 8)
    WinProcGroup::bindThisThread(idx);
See for instance https://github.com/WOnder93/Stockfish/commit/97c95b7cf63ff9211544195f7621091ffbcbb459
Stockfish is simply not supposed to be fed incorrect FENs, or FENs describing illegal positions. Full validation code is complex to write, and there is no established mechanism to communicate such an error back to the GUI.
The GUI, on the other hand, must carefully check FENs. If you find a GUI through which you can crash Stockfish or any other engine, then by all means report it to that GUI's developers.
Yes, there is a branch with an MPI cluster implementation of Stockfish, allowing Stockfish to run on clusters of compute nodes connected by a high-speed network. See https://github.com/official-stockfish/Stockfish/pull/1571 for some discussion of the initial implementation and https://github.com/official-stockfish/Stockfish/pull/1931 for some early performance results.
Feedback on this branch is welcome! Here are some git commands for people interested in testing this MPI/cluster idea:
- If you don't have the cluster branch on your local git repository yet, you can download the latest state of the official-stockfish/cluster branch (see also https://github.com/official-stockfish/Stockfish/tree/cluster), then switch to it with the following commands:
git fetch official cluster:cluster
git checkout -f cluster
- After switching to the cluster branch as above, see the README.md for detailed instructions on how to compile and run the branch. TL;DR:
make clean
make -j ARCH=x86-64-modern COMPILER=mpic++ build
mpirun -np 4 ./stockfish bench
First, note that regression tests are not actually run to detect regressions. SF quality control is very stringent, and regressive patches are very unlikely to make it into master. Rather, regression tests are run to get an idea of SF's progress over time, which is impressive. See
https://github.com/glinscott/fishtest/wiki/Regression-Tests
But still... what if the Elo outcome of a regression test is disappointingly low? Usually there is little reason to worry.
- First of all: wait until the test is finished. Drawing conclusions from an unfinished test is statistically meaningless.
- Look at the error bars. The previous test may have been a lucky run, and the current one perhaps an unlucky one. Note that the error bar is for the Elo relative to the fixed release (base). Differences between two such Elo estimates have nearly double the statistical error (2-3 Elo).
- SFdev vs SF11: NNUE vs classical evaluation is very sensitive to the hardware mix present at the time of testing. If a fleet of AVX512 workers is present/absent, the Elo will be larger/smaller.
- Error bars are designed to be right 95% of the time. So, conversely, 1 in 20 tests will be an outlier.
- Regression tests are run with a book based on human games (8moves_v3.pgn), whereas for testing patches a book is used that maximizes Elo resolution (noob_3moves.epd). So the results of regression tests are compressed, currently by a factor of around 0.65, relative to what would be obtained with the testing book.
- Elo estimates of single patches (SPRT runs) typically come with large error bars. Take this into account when adding up Elo estimates. Furthermore, Elo estimates of passing patches are biased: the SPRT Elo estimates are only unbiased if one takes all patches into account, both passed and non-passed. As a result, the Elo gain measured by a regression test will typically be less than the sum of the estimated Elo gains of the individual patches since the previous regression test.
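The error-bar arithmetic behind these points can be sketched with a small calculation. The numbers below are illustrative, not taken from any real regression run; the key fact is that errors of independent estimates add in quadrature:

```python
import math

# 95% error bars of two independent regression tests against the same base
# (illustrative values, not from a real run).
err_prev = 1.8   # Elo
err_curr = 1.8   # Elo

# The error of the *difference* between the two Elo estimates adds in
# quadrature, so it is roughly 1.4x a single error bar.
err_diff = math.sqrt(err_prev ** 2 + err_curr ** 2)

# Summing the SPRT Elo estimates of n individual patches likewise
# accumulates error: with a per-patch error bar of err_patch, the sum
# carries an error of sqrt(n) * err_patch.
n, err_patch = 10, 3.0
err_sum = math.sqrt(n) * err_patch

print(round(err_diff, 2))  # prints 2.55
print(round(err_sum, 2))   # prints 9.49
```

So even before accounting for the selection bias of passing patches, comparing two regression runs or summing ten patch estimates involves considerably more uncertainty than a single error bar suggests.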
If the Travis-CI results do not show up in pull requests, the maintainers can try points 1 and 3 suggested by user "javeme" in this comment (revoking access and authorizing it again): https://travis-ci.community/t/github-pr-is-being-built-but-result-is-not-shown/7025/2 .