
Update architecture to "SFNNv4". Update network to nn-6877cd24400e.nnue. #3927

Closed

Conversation

Sopel97
Member

@Sopel97 Sopel97 commented Feb 10, 2022

This is the squashed version of https://tests.stockfishchess.org/tests/view/62041e68d71106ed12a40e85 plus a fix for the NEON implementation.

Architecture:

The currently used naming scheme was already getting messy, and with these changes it would be pretty much unreadable, so I allowed myself to change it to something more symbolic and easier to distinguish. I name this architecture "SFNNv4". The name may change if there are good proposals before this PR is merged.

The diagram of the "SFNNv4" architecture:

https://user-images.githubusercontent.com/8037982/153455685-cbe3a038-e158-4481-844d-9d5fccf5c33a.png

The most important architectural changes are the following:

  • The 1024x2 [activated] neurons are pairwise, elementwise multiplied (not quite pairwise due to implementation details, see diagram), which introduces a non-linearity with benefits similar to the previously tested sigmoid activation (quantmoid4) while being slightly faster.
  • The following layer therefore has half as many inputs, which we compensate for by doubling the number of outputs. Reducing the number of outputs might still be beneficial (we had it as low as 8 before). The layer is now 1024->16.
  • The 16 outputs are split into 15 and 1. The 1-wide output is added to the network output (after some necessary scaling due to quantization differences). The 15-wide part is activated and follows the usual path through a set of linear layers. The additional 1-wide output is at least neutral, has shown a slightly positive trend in training compared to networks without it (all 16 outputs through the usual path), and may allow an additional stage of lazy evaluation to be introduced in the future.

Additionally, the inference code was rewritten and no longer uses a recursive implementation. This was necessitated by the splitting of the 16-wide intermediate result into two, which would have been impossible with the old implementation without ugly hacks. This is hopefully for the better overall.
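
To make the data flow described above concrete, here is a minimal float sketch in NumPy. It is not the Stockfish code: the weights are random and hypothetical, quantization and the skip-output scaling are omitted, and the exact pairing used in the engine (the "not quite pairwise" detail) is simplified.

```python
import numpy as np

rng = np.random.default_rng(0)

def clipped(v):                        # clipped ReLU, the activation used throughout
    return np.clip(v, 0.0, 1.0)

# Accumulators from the feature transformer: 1024 neurons per perspective.
acc_us, acc_them = rng.random(1024), rng.random(1024)

def pairwise_mul(acc):
    a = clipped(acc)
    return a[:512] * a[512:]           # 1024 activated neurons -> 512 products

x = np.concatenate([pairwise_mul(acc_us), pairwise_mul(acc_them)])  # 1024 inputs

# Hypothetical random weights, shaped as described: 1024->16, then 15->32, then 32->1.
W0, b0 = 0.02 * rng.standard_normal((16, 1024)), np.zeros(16)
W1, b1 = 0.10 * rng.standard_normal((32, 15)),   np.zeros(32)
W2, b2 = 0.10 * rng.standard_normal((1, 32)),    np.zeros(1)

h0 = W0 @ x + b0                         # 16 outputs
skip, main = h0[15], clipped(h0[:15])    # split into 1 + 15 (index choice is illustrative)
h1 = clipped(W1 @ main + b1)             # 15 -> 32, activated
out = (W2 @ h1 + b2)[0] + skip           # 32 -> 1, plus the 1-wide skip output
print(out)
```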

Training procedure:

The net was created by doing 2 sessions of training.

First session:

The first session was training a network from scratch (random initialization). The exact trainer used was slightly different (older) from the one used in the second session, but it should not have a measurable effect. The purpose of this session is to establish a strong network base for the second session. Small deviations in strength do not harm the learnability in the second session.

The training was done using the following command:

python3 train.py \
    /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \
    /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \
    --gpus "$3," \
    --threads 4 \
    --num-workers 4 \
    --batch-size 16384 \
    --progress_bar_refresh_rate 20 \
    --random-fen-skipping 3 \
    --features=HalfKAv2_hm^ \
    --lambda=1.0 \
    --gamma=0.992 \
    --lr=8.75e-4 \
    --max_epochs=400 \
    --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2

Every 20th net was saved and its playing strength measured against some baseline at 25k nodes per move with pure NNUE evaluation (modified binary). The exact setup is not important as long as it's consistent. The purpose is to sift good candidates from bad ones.

The dataset can be found at https://drive.google.com/file/d/1UQdZN_LWQ265spwTBwDKo0t1WjSJKvWY/view

Second session:

The second training session was done starting from the best network (as determined by strength testing) from the first session. It is important that it is resumed from a .pt model and NOT a .ckpt model. The conversion can be performed directly using serialize.py.

The LR schedule was modified to use gamma=0.995 instead of gamma=0.992 and LR=4.375e-4 instead of LR=8.75e-4 to flatten the LR curve and allow for longer training. The training then ran for 800 epochs instead of 400 (though it is possibly mostly noise after around epoch 600).
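
To illustrate the effect, here is a small sketch assuming gamma is applied as a plain per-epoch multiplicative decay (lr_e = lr0 * gamma**e); the exact schedule in the trainer may differ in details.

```python
# Compare the two LR schedules under the assumed per-epoch exponential decay.
def lr_at(lr0, gamma, epoch):
    return lr0 * gamma ** epoch

for epoch in (0, 200, 400, 600, 800):
    first  = lr_at(8.75e-4,  0.992, epoch)   # first-session schedule
    second = lr_at(4.375e-4, 0.995, epoch)   # second-session schedule
    print(f"epoch {epoch:3d}: first {first:.2e}  second {second:.2e}")
# The second schedule starts lower but decays more slowly, so it remains
# useful over 800 epochs instead of 400.
```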

The training was done using the following command:

python3 train.py \
        /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \
        /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \
        --gpus "$3," \
        --threads 4 \
        --num-workers 4 \
        --batch-size 16384 \
        --progress_bar_refresh_rate 20 \
        --random-fen-skipping 3 \
        --features=HalfKAv2_hm^ \
        --lambda=1.0 \
        --gamma=0.995 \
        --lr=4.375e-4 \
        --max_epochs=800 \
        --resume-from-model /data/sopel/nnue/nnue-pytorch-training/data/exp295/nn-epoch399.pt \
        --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$run_id

In particular, note that we now use lambda=1.0 instead of lambda=0.8 (previous nets), because tests show that the WDL-skipping introduced by vondele performs better with lambda=1.0. Nets were saved every 20th epoch. In total 16 runs were made with these settings, and the best nets were chosen according to playing strength at 25k nodes per move with pure NNUE evaluation; these are the 4 nets that have been put on fishtest.

The dataset can be found either at ftp://ftp.chessdb.cn/pub/sopel/data_sf/T60T70wIsRightFarseerT60T74T75T76.binpack in its entirety (download might be painfully slow because it is hosted in China) or can be assembled in the following way:

  1. Get the https://github.com/official-stockfish/Stockfish/blob/5640ad48ae5881223b868362c1cbeb042947f7b4/script/interleave_binpacks.py script.
  2. Download T60T70wIsRightFarseer.binpack from https://drive.google.com/file/d/1_sQoWBl31WAxNXma2v45004CIVltytP8/view
  3. Download farseerT74.binpack from http://trainingdata.farseer.org/T74-May13-End.7z
  4. Download farseerT75.binpack from http://trainingdata.farseer.org/T75-June3rd-End.7z
  5. Download farseerT76.binpack from http://trainingdata.farseer.org/T76-Nov10th-End.7z
  6. Run python3 interleave_binpacks.py T60T70wIsRightFarseer.binpack farseerT74.binpack farseerT75.binpack farseerT76.binpack T60T70wIsRightFarseerT60T74T75T76.binpack

Tests:

STC: https://tests.stockfishchess.org/tests/view/6203fb85d71106ed12a407b7

LLR: 2.94 (-2.94,2.94) <0.00,2.50>
Total: 16952 W: 4775 L: 4521 D: 7656
Ptnml(0-2): 133, 1818, 4318, 2076, 131 

LTC: https://tests.stockfishchess.org/tests/view/62041e68d71106ed12a40e85

LLR: 2.94 (-2.94,2.94) <0.50,3.00>
Total: 14944 W: 4138 L: 3907 D: 6899
Ptnml(0-2): 21, 1499, 4202, 1728, 22 

Bench: 4919707

@vondele
Member

vondele commented Feb 10, 2022

nice work, thanks for the docs.

I'm fine with using the name SF15a, that's a reasonable scheme to add new archs.

I see that there is an empty file added nnue/architectures/halfka_256x2-32-32.h, I assume that's not intentional?

@ppigazzini
Contributor

ppigazzini commented Feb 10, 2022

We have a speedup (I don't know if it is due to the new net arch); perhaps we should update the fishtest reference Nps after merging this PR.

speedup benches sf_11 vs sf_15a

  • arch=bmi2 (Dual Xeon workstation)
sf_base =  1566536 +/- 6407
sf_test =  1308257 +/- 5827
diff    =  -258279 +/- 4440
speedup = -0.164873
  • arch=bmi2 (Xeon server)
sf_base =  1214010 +/- 5258
sf_test =  1082431 +/- 5401
diff    =  -131578 +/- 5131
speedup = -0.108384
  • arch=modern (core i7 3770k)
sf_base =  1946273 +/- 22064
sf_test =  1518906 +/- 11325
diff    =  -427366 +/- 11006
speedup = -0.219582
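
For reference, the speedup figures appear to be the relative difference diff / sf_base; a quick check under that assumed definition:

```python
# Check that speedup = diff / sf_base reproduces the reported bmi2 workstation figure.
sf_base, sf_test = 1566536, 1308257
diff = sf_test - sf_base      # -258279, as reported
print(diff / sf_base)         # -0.16487..., matching speedup = -0.164873
```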

Last fishtest reference Nps update
official-stockfish/fishtest@c5e0de5

  • arch=bmi2 (Dual Xeon workstation)
base =    1729936 +/- 9448
test =    1381212 +/- 8223
diff =    -348723 +/- 6159
speedup = -0.201582
  • arch=bmi2 (Xeon server)
base =    1377734 +/- 14202
test =    1151238 +/- 9600
diff =    -226496 +/- 20606
speedup = -0.164398
  • arch=modern (core i7 3770k)
base =    1967470 +/- 12952
test =    1435152 +/- 5597
diff =    -532318 +/- 8311
speedup = -0.270560

@vondele
Copy link
Member

vondele commented Feb 10, 2022

Changing the reference number for Nps leads to jumps in the regression tests. It is probably better to do that when we start with a new reference.

Update network to nn-6877cd24400e.nnue.

Bench: 4919707
@Sopel97
Member Author

Sopel97 commented Feb 10, 2022

@vondele Thanks for doing a proper review. Fixed inconsistent naming (also in some other places), removed the empty files. Export works correctly.

@Sopel97 Sopel97 changed the title Update architecture to "SF15a". Update network to nn-6877cd24400e.nnue. Update architecture to "SFNNUEv15a". Update network to nn-6877cd24400e.nnue. Feb 10, 2022
@Sopel97 Sopel97 changed the title Update architecture to "SFNNUEv15a". Update network to nn-6877cd24400e.nnue. Update architecture to "SFNNv4". Update network to nn-6877cd24400e.nnue. Feb 10, 2022
@vondele vondele closed this in cb9c259 Feb 10, 2022
@ppigazzini
Contributor

The new net arch gives a very tiny speedup vs the previous master:

  • arch=bmi2 (Dual Xeon workstation)
sf_base =  1274703 +/- 3505
sf_test =  1283857 +/- 3982
diff    =     9154 +/- 1421
speedup = 0.007181
  • arch=modern (core i7 3770k)
sf_base =  1419143 +/- 5147
sf_test =  1430115 +/- 6309
diff    =    10971 +/- 1903
speedup = 0.007731

@mstembera
Contributor

I have a question regarding both the new and previous architectures... Most normal NNs (with the exception of encoder/decoder nets for lossy compression/decompression) are composed of layers such that the input dimension of any layer is >= the output dimension of that same layer. This makes intuitive sense, since you can think of the layers as projecting the data from a high-dimensional space to a lower-dimensional space as it propagates. Having layers that go the other way isn't really done, because doing so would require information that is already lost. I am wondering, then, why we have a layer that goes from 15 to 32 (previously from 8 to 32)?

@Sopel97
Member Author

Sopel97 commented Feb 10, 2022

It can still provide more information because the layer is followed by an activation function, in our case clamping to [0, 1]. Consider a trivial, illustrative example where in a 1->4 layer we have 4 identical weights x and a bias vector of [0, -1, -2, -3]. For example, for x=2 and input=0.7 we would then have activated_output=[1.0, 0.4, 0.0, 0.0].
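
For concreteness, the toy numbers above can be checked directly (an illustrative snippet, not engine code):

```python
import numpy as np

x, inp = 2.0, 0.7
biases = np.array([0.0, -1.0, -2.0, -3.0])
pre = x * inp + biases                     # [ 1.4,  0.4, -0.6, -1.6]
activated_output = np.clip(pre, 0.0, 1.0)  # clamp to [0, 1]
print(activated_output)                    # [1.  0.4 0.  0. ]
```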

uwuplant pushed a commit to uwuplant/Stockfish that referenced this pull request Feb 10, 2022
@mstembera
Contributor

Hmm this is interesting to think about. I don't think going from 1 to 4 creates additional dimensions in the data. It just embeds the 1 dimensional data into 4 dimensions. I visualize a 1 dimensional string bent and twisted in a 4 dimensional space but the string itself is still 1 dimensional. The activation function being non linear just means the mapping is non linear.
I would like to confirm this by making the 15->32 layer a 15->15 layer but until I learn how to produce new nets I'm all talk :)

@NKONSTANTAKIS

Correct me if I'm wrong, but I think this embedding of the data provides extra room for computation in the whole procedure. This way the 1-dimensional data, expanded into 4D and rearranged back into 1D, just isn't the same anymore. One might think that with this procedure we lose accuracy, but our initial raw 1D output was never supposed to be perfect, so we lose... inaccuracy! In that regard I consider the extra step to improve the cohesion and harmony of the values, quite reminiscent of a mini internal SPSA.

@NightlyKing
Contributor

Hmm this is interesting to think about. I don't think going from 1 to 4 creates additional dimensions in the data. It just embeds the 1 dimensional data into 4 dimensions. I visualize a 1 dimensional string bent and twisted in a 4 dimensional space but the string itself is still 1 dimensional. The activation function being non linear just means the mapping is non linear. I would like to confirm this by making the 15->32 layer a 15->15 layer but until I learn how to produce new nets I'm all talk :)

If you have any trouble figuring out the trainer, feel free to stop by the Discord and ask for help. For such things it is really great because help can be offered 'live' without cluttering GitHub issues. (The documentation regarding training is in need of an update right now.)

@vondele
Member

vondele commented Feb 11, 2022

A wide, shallow network can still approximate general functions; it might be that this works well here.
There is some info here: https://stats.stackexchange.com/questions/222883/why-are-neural-networks-becoming-deeper-but-not-wider

uwuplant pushed a commit to uwuplant/Stockfish that referenced this pull request Feb 11, 2022
@mstembera
Contributor

@vondele Thanks for the link, but my comment isn't about wide or deep networks. I'm just pointing out that an architecture that pinches a layer down to a narrow number of dimensions only to widen again in a subsequent layer is likely to be suboptimal, because you can't recover the dimensions once you get rid of them. If I learn how to make nets I will try something like 1024->16->16->1 or 1024->32->32->1 instead of the current 1024->16->32->1.
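
A small numerical illustration of the point under discussion, using hypothetical random weights rather than the engine's: the pre-activation outputs of a 15->32 linear layer always lie in an at-most-15-dimensional affine subspace; it is only the clipped-ReLU activation that bends that set, which is the effect described earlier in the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 15))          # hypothetical 15 -> 32 weights
b = rng.standard_normal(32)
X = rng.random((15, 1000))                 # 1000 random 15-wide inputs

pre  = W @ X + b[:, None]                  # 32-wide pre-activations
post = np.clip(pre, 0.0, 1.0)              # clipped ReLU, as in the engine

# Centered pre-activations can never span more than 15 directions ...
print(np.linalg.matrix_rank(pre  - pre.mean(axis=1, keepdims=True)))   # <= 15
# ... while the clipped outputs typically span many more (up to 32).
print(np.linalg.matrix_rank(post - post.mean(axis=1, keepdims=True)))
```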

ppigazzini added a commit to ppigazzini/fishtest that referenced this pull request Apr 19, 2022
Fishtest with Stockfish 11 had 1.6MNps as reference Nps and 0.7MNps as
threshold for the slow worker.
Set the new reference Nps according to the average 17% slowdown.
Set the new threshold for slow worker according to the 21% slowdown.

- arch=bmi2 (Dual Xeon workstation)
```
sf_base =  1517458 +/- 9427
sf_test =  1259909 +/- 10794
diff    =  -257549 +/- 9477
speedup = -0.169724
```
- arch=modern (core i7 3770k)
```
sf_base =  1864275 +/- 20969
sf_test =  1466315 +/- 7262
diff    =  -397959 +/- 14643
speedup = -0.213466
```

The speedups are nearly the same measured after the switch to the new net arch
official-stockfish/Stockfish#3927

- arch=bmi2 (Dual Xeon workstation)
```
sf_base =  1566536 +/- 6407
sf_test =  1308257 +/- 5827
diff    =  -258279 +/- 4440
speedup = -0.164873
```
- arch=modern (core i7 3770k)
```
sf_base =  1946273 +/- 22064
sf_test =  1518906 +/- 11325
diff    =  -427366 +/- 11006
speedup = -0.219582
```
ppigazzini added a commit to official-stockfish/fishtest that referenced this pull request Apr 19, 2022
dav1312 pushed a commit to dav1312/Stockfish that referenced this pull request Oct 21, 2022
mstembera referenced this pull request in linrock/Stockfish Sep 15, 2023