
Update architecture to "SFNNv4". Update network to nn-6877cd24400e.nnue. #3927

Closed

Conversation

Sopel97
Member

@Sopel97 Sopel97 commented Feb 10, 2022

This is the squashed version of https://tests.stockfishchess.org/tests/view/62041e68d71106ed12a40e85 plus a fix for the NEON implementation.

Architecture:

The currently used naming scheme was already getting messy, and with these changes it would be pretty much unreadable, so I allowed myself to change it to something more symbolic and easier to distinguish. I name this architecture "SFNNv4". The name may change if there are good proposals before this PR is merged.

The diagram of the "SFNNv4" architecture:

https://user-images.githubusercontent.com/8037982/153455685-cbe3a038-e158-4481-844d-9d5fccf5c33a.png

The most important architectural changes are the following:

  • The 1024x2 [activated] neurons are pairwise, elementwise multiplied (not quite pairwise due to implementation details, see diagram), which introduces a non-linearity with benefits similar to the previously tested sigmoid activation (quantmoid4) while being slightly faster.
  • The following layer therefore has half as many inputs, which we compensate for by doubling the number of outputs. Reducing the number of outputs might still be beneficial (we had it as low as 8 before). The layer is now 1024->16.
  • The 16 outputs are split into 15 and 1. The 1-wide output is added to the network output (after some necessary scaling due to quantization differences). The 15-wide part is activated and follows the usual path through a set of linear layers. The additional 1-wide output is at least neutral, has shown a slightly positive trend in training compared to networks without it (all 16 outputs through the usual path), and may allow an additional stage of lazy evaluation to be introduced in the future.

Additionally, the inference code was rewritten and no longer uses a recursive implementation. This was necessitated by the splitting of the 16-wide intermediate result into two, which would have been impossible with the old implementation without ugly hacks. This is hopefully for the better overall.
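
To make the data flow described above concrete, here is a minimal float sketch in NumPy. It is not the Stockfish code: the weights are random and hypothetical, quantization and the skip-output scaling are omitted, and the exact pairing used in the engine (the "not quite pairwise" detail) is simplified.

```python
import numpy as np

rng = np.random.default_rng(0)

def clipped(v):                        # clipped ReLU, the activation used throughout
    return np.clip(v, 0.0, 1.0)

# Accumulators from the feature transformer: 1024 neurons per perspective.
acc_us, acc_them = rng.random(1024), rng.random(1024)

def pairwise_mul(acc):
    a = clipped(acc)
    return a[:512] * a[512:]           # 1024 activated neurons -> 512 products

x = np.concatenate([pairwise_mul(acc_us), pairwise_mul(acc_them)])  # 1024 inputs

# Hypothetical random weights, shaped as described: 1024->16, then 15->32, then 32->1.
W0, b0 = 0.02 * rng.standard_normal((16, 1024)), np.zeros(16)
W1, b1 = 0.10 * rng.standard_normal((32, 15)),   np.zeros(32)
W2, b2 = 0.10 * rng.standard_normal((1, 32)),    np.zeros(1)

h0 = W0 @ x + b0                         # 16 outputs
skip, main = h0[15], clipped(h0[:15])    # split into 1 + 15 (index choice is illustrative)
h1 = clipped(W1 @ main + b1)             # 15 -> 32, activated
out = (W2 @ h1 + b2)[0] + skip           # 32 -> 1, plus the 1-wide skip output
print(out)
```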

Training procedure:

The net was created by doing 2 sessions of training.

First session:

The first session was training a network from scratch (random initialization). The exact trainer used was slightly different (older) from the one used in the second session, but it should not have a measurable effect. The purpose of this session is to establish a strong network base for the second session. Small deviations in strength do not harm the learnability in the second session.

The training was done using the following command:

python3 train.py \
    /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \
    /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \
    --gpus "$3," \
    --threads 4 \
    --num-workers 4 \
    --batch-size 16384 \
    --progress_bar_refresh_rate 20 \
    --random-fen-skipping 3 \
    --features=HalfKAv2_hm^ \
    --lambda=1.0 \
    --gamma=0.992 \
    --lr=8.75e-4 \
    --max_epochs=400 \
    --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2

Every 20th net was saved and its playing strength measured against some baseline at 25k nodes per move with pure NNUE evaluation (modified binary). The exact setup is not important as long as it's consistent. The purpose is to sift good candidates from bad ones.

The dataset can be found at https://drive.google.com/file/d/1UQdZN_LWQ265spwTBwDKo0t1WjSJKvWY/view

Second session:

The second training session was done starting from the best network (as determined by strength testing) from the first session. It is important that it is resumed from a .pt model and NOT a .ckpt model. The conversion can be performed directly using serialize.py.

The LR schedule was modified to use gamma=0.995 instead of gamma=0.992 and LR=4.375e-4 instead of LR=8.75e-4 to flatten the LR curve and allow for longer training. The training then ran for 800 epochs instead of 400 (though it is possibly mostly noise after around epoch 600).
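
To illustrate the effect, here is a small sketch assuming gamma is applied as a plain per-epoch multiplicative decay (lr_e = lr0 * gamma**e); the exact schedule in the trainer may differ in details.

```python
# Compare the two LR schedules under the assumed per-epoch exponential decay.
def lr_at(lr0, gamma, epoch):
    return lr0 * gamma ** epoch

for epoch in (0, 200, 400, 600, 800):
    first  = lr_at(8.75e-4,  0.992, epoch)   # first-session schedule
    second = lr_at(4.375e-4, 0.995, epoch)   # second-session schedule
    print(f"epoch {epoch:3d}: first {first:.2e}  second {second:.2e}")
# The second schedule starts lower but decays more slowly, so it remains
# useful over 800 epochs instead of 400.
```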

The training was done using the following command:

python3 train.py \
        /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \
        /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \
        --gpus "$3," \
        --threads 4 \
        --num-workers 4 \
        --batch-size 16384 \
        --progress_bar_refresh_rate 20 \
        --random-fen-skipping 3 \
        --features=HalfKAv2_hm^ \
        --lambda=1.0 \
        --gamma=0.995 \
        --lr=4.375e-4 \
        --max_epochs=800 \
        --resume-from-model /data/sopel/nnue/nnue-pytorch-training/data/exp295/nn-epoch399.pt \
        --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$run_id

In particular, note that we now use lambda=1.0 instead of lambda=0.8 (previous nets), because tests show that the WDL-skipping introduced by vondele performs better with lambda=1.0. Nets were saved every 20th epoch. In total 16 runs were made with these settings, and the best nets were chosen according to playing strength at 25k nodes per move with pure NNUE evaluation; these are the 4 nets that have been put on fishtest.

The dataset can be found either at ftp://ftp.chessdb.cn/pub/sopel/data_sf/T60T70wIsRightFarseerT60T74T75T76.binpack in its entirety (download might be painfully slow because it is hosted in China) or can be assembled in the following way:

  1. Get the https://github.com/official-stockfish/Stockfish/blob/5640ad48ae5881223b868362c1cbeb042947f7b4/script/interleave_binpacks.py script.
  2. Download T60T70wIsRightFarseer.binpack from https://drive.google.com/file/d/1_sQoWBl31WAxNXma2v45004CIVltytP8/view
  3. Download farseerT74.binpack from http://trainingdata.farseer.org/T74-May13-End.7z
  4. Download farseerT75.binpack from http://trainingdata.farseer.org/T75-June3rd-End.7z
  5. Download farseerT76.binpack from http://trainingdata.farseer.org/T76-Nov10th-End.7z
  6. Run python3 interleave_binpacks.py T60T70wIsRightFarseer.binpack farseerT74.binpack farseerT75.binpack farseerT76.binpack T60T70wIsRightFarseerT60T74T75T76.binpack

Tests:

STC: https://tests.stockfishchess.org/tests/view/6203fb85d71106ed12a407b7

LLR: 2.94 (-2.94,2.94) <0.00,2.50>
Total: 16952 W: 4775 L: 4521 D: 7656
Ptnml(0-2): 133, 1818, 4318, 2076, 131 

LTC: https://tests.stockfishchess.org/tests/view/62041e68d71106ed12a40e85

LLR: 2.94 (-2.94,2.94) <0.50,3.00>
Total: 14944 W: 4138 L: 3907 D: 6899
Ptnml(0-2): 21, 1499, 4202, 1728, 22 

Bench: 4919707

@vondele
Member

vondele commented Feb 10, 2022

nice work, thanks for the docs.

I'm fine with using the name SF15a, that's a reasonable scheme to add new archs.

I see that there is an empty file added nnue/architectures/halfka_256x2-32-32.h, I assume that's not intentional?

@ppigazzini
Contributor

ppigazzini commented Feb 10, 2022

We have a speedup (I don't know if it is due to the new net arch); perhaps we should update the fishtest reference Nps after merging this PR.

speedup benches sf_11 vs sf_15a

  • arch=bmi2 (Dual Xeon workstation)
sf_base =  1566536 +/- 6407
sf_test =  1308257 +/- 5827
diff    =  -258279 +/- 4440
speedup = -0.164873
  • arch=bmi2 (Xeon server)
sf_base =  1214010 +/- 5258
sf_test =  1082431 +/- 5401
diff    =  -131578 +/- 5131
speedup = -0.108384
  • arch=modern (core i7 3770k)
sf_base =  1946273 +/- 22064
sf_test =  1518906 +/- 11325
diff    =  -427366 +/- 11006
speedup = -0.219582
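
For reference, the speedup figures appear to be the relative difference diff / sf_base; a quick check under that assumed definition:

```python
# Check that speedup = diff / sf_base reproduces the reported bmi2 workstation figure.
sf_base, sf_test = 1566536, 1308257
diff = sf_test - sf_base      # -258279, as reported
print(diff / sf_base)         # -0.16487..., matching speedup = -0.164873
```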

Last fishtest reference Nps update
official-stockfish/fishtest@c5e0de5

  • arch=bmi2 (Dual Xeon workstation)
base =    1729936 +/- 9448
test =    1381212 +/- 8223
diff =    -348723 +/- 6159
speedup = -0.201582
  • arch=bmi2 (Xeon server)
base =    1377734 +/- 14202
test =    1151238 +/- 9600
diff =    -226496 +/- 20606
speedup = -0.164398
  • arch=modern (core i7 3770k)
base =    1967470 +/- 12952
test =    1435152 +/- 5597
diff =    -532318 +/- 8311
speedup = -0.270560

@vondele
Copy link
Member

vondele commented Feb 10, 2022

Changing the reference number for Nps leads to jumps in the regression tests. It is probably better to do that when we start with a new reference.

Update network to nn-6877cd24400e.nnue.

Bench: 4919707
@Sopel97
Member Author

Sopel97 commented Feb 10, 2022

@vondele Thanks for doing a proper review. Fixed inconsistent naming (also in some other places), removed the empty files. Export works correctly.

@Sopel97 Sopel97 changed the title Update architecture to "SF15a". Update network to nn-6877cd24400e.nnue. Update architecture to "SFNNUEv15a". Update network to nn-6877cd24400e.nnue. Feb 10, 2022
@Sopel97 Sopel97 changed the title Update architecture to "SFNNUEv15a". Update network to nn-6877cd24400e.nnue. Update architecture to "SFNNv4". Update network to nn-6877cd24400e.nnue. Feb 10, 2022
@vondele vondele closed this in cb9c259 Feb 10, 2022
@ppigazzini
Contributor

The new net arch gives a very tiny speedup vs the previous master:

  • arch=bmi2 (Dual Xeon workstation)
sf_base =  1274703 +/- 3505
sf_test =  1283857 +/- 3982
diff    =     9154 +/- 1421
speedup = 0.007181
  • arch=modern (core i7 3770k)
sf_base =  1419143 +/- 5147
sf_test =  1430115 +/- 6309
diff    =    10971 +/- 1903
speedup = 0.007731

@mstembera
Contributor

I have a question regarding both the new and previous architectures... Most normal NNs (with the exception of encoder/decoder nets for lossy compression/decompression) are composed of layers such that the input dimension of any layer is >= the output dimension of that same layer. This makes intuitive sense, since you can think of the layers as projecting the data from a high-dimensional space to a lower-dimensional space as it propagates. Having layers that go the other way isn't really done, because doing so would require information that is already lost. I am wondering, then, why we have a layer that goes from 15 to 32 (previously from 8 to 32)?

@Sopel97
Member Author

Sopel97 commented Feb 10, 2022

It can still provide more information because the layer is followed by an activation function, in our case clamping to [0, 1]. Consider a trivial, illustrative example where in a 1->4 layer we have 4 identical weights x and a bias vector of [0, -1, -2, -3]. For example, for x=2 and input=0.7 we would then have activated_output=[1.0, 0.4, 0.0, 0.0].
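
For concreteness, the toy numbers above can be checked directly (an illustrative snippet, not engine code):

```python
import numpy as np

x, inp = 2.0, 0.7
biases = np.array([0.0, -1.0, -2.0, -3.0])
pre = x * inp + biases                     # [ 1.4,  0.4, -0.6, -1.6]
activated_output = np.clip(pre, 0.0, 1.0)  # clamp to [0, 1]
print(activated_output)                    # [1.  0.4 0.  0. ]
```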

uwuplant pushed a commit to uwuplant/Stockfish that referenced this pull request Feb 10, 2022
@mstembera
Contributor

Hmm this is interesting to think about. I don't think going from 1 to 4 creates additional dimensions in the data. It just embeds the 1 dimensional data into 4 dimensions. I visualize a 1 dimensional string bent and twisted in a 4 dimensional space but the string itself is still 1 dimensional. The activation function being non linear just means the mapping is non linear.
I would like to confirm this by making the 15->32 layer a 15->15 layer but until I learn how to produce new nets I'm all talk :)

@NKONSTANTAKIS

Correct me if I'm wrong, but I think this embedding of the data provides extra room for computation in the whole procedure. This way the 1-dimensional data, expanded into 4D and rearranged back into 1D, just isn't the same anymore. One might think that with this procedure we lose accuracy, but our initial raw 1D output was never supposed to be perfect, so we lose... inaccuracy! In that regard I consider the extra step to improve the cohesion and harmony of the values, quite reminiscent of a mini internal SPSA.

@NightlyKing
Contributor

Hmm this is interesting to think about. I don't think going from 1 to 4 creates additional dimensions in the data. It just embeds the 1 dimensional data into 4 dimensions. I visualize a 1 dimensional string bent and twisted in a 4 dimensional space but the string itself is still 1 dimensional. The activation function being non linear just means the mapping is non linear. I would like to confirm this by making the 15->32 layer a 15->15 layer but until I learn how to produce new nets I'm all talk :)

If you have any trouble figuring out the trainer, feel free to stop by the Discord and ask for help. For such things it is really great because help can be offered 'live' without cluttering GitHub issues. (The documentation regarding training is in need of an update right now.)

@vondele
Member

vondele commented Feb 11, 2022

A wide, shallow network can still approximate general functions; it might be that this works well here.
There is some info here: https://stats.stackexchange.com/questions/222883/why-are-neural-networks-becoming-deeper-but-not-wider

uwuplant pushed a commit to uwuplant/Stockfish that referenced this pull request Feb 11, 2022
@mstembera
Contributor

@vondele Thanks for the link, but my comment isn't about wide or deep networks. I'm just pointing out that an architecture that pinches a layer down to a narrow number of dimensions only to widen again in a subsequent layer is likely to be suboptimal, because you can't recover the dimensions once you get rid of them. If I learn how to make nets I will try something like 1024->16->16->1 or 1024->32->32->1 instead of the current 1024->16->32->1.
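
A small numerical illustration of the point under discussion, using hypothetical random weights rather than the engine's: the pre-activation outputs of a 15->32 linear layer always lie in an at-most-15-dimensional affine subspace; it is only the clipped-ReLU activation that bends that set, which is the effect described earlier in the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 15))          # hypothetical 15 -> 32 weights
b = rng.standard_normal(32)
X = rng.random((15, 1000))                 # 1000 random 15-wide inputs

pre  = W @ X + b[:, None]                  # 32-wide pre-activations
post = np.clip(pre, 0.0, 1.0)              # clipped ReLU, as in the engine

# Centered pre-activations can never span more than 15 directions ...
print(np.linalg.matrix_rank(pre  - pre.mean(axis=1, keepdims=True)))   # <= 15
# ... while the clipped outputs typically span many more (up to 32).
print(np.linalg.matrix_rank(post - post.mean(axis=1, keepdims=True)))
```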

ppigazzini added a commit to ppigazzini/fishtest that referenced this pull request Apr 19, 2022
Fishtest with Stockfish 11 had 1.6MNps as reference Nps and 0.7MNps as
threshold for the slow worker.
Set the new reference Nps according to the average 17% slowdown.
Set the new threshold for slow worker according to the 21% slowdown.

- arch=bmi2 (Dual Xeon workstation)
```
sf_base =  1517458 +/- 9427
sf_test =  1259909 +/- 10794
diff    =  -257549 +/- 9477
speedup = -0.169724
```
- arch=modern (core i7 3770k)
```
sf_base =  1864275 +/- 20969
sf_test =  1466315 +/- 7262
diff    =  -397959 +/- 14643
speedup = -0.213466
```

The speedups are nearly the same measured after the switch to the new net arch
official-stockfish/Stockfish#3927

- arch=bmi2 (Dual Xeon workstation)
```
sf_base =  1566536 +/- 6407
sf_test =  1308257 +/- 5827
diff    =  -258279 +/- 4440
speedup = -0.164873
```
- arch=modern (core i7 3770k)
```
sf_base =  1946273 +/- 22064
sf_test =  1518906 +/- 11325
diff    =  -427366 +/- 11006
speedup = -0.219582
```
ppigazzini added a commit to official-stockfish/fishtest that referenced this pull request Apr 19, 2022
dav1312 pushed a commit to dav1312/Stockfish that referenced this pull request Oct 21, 2022
mstembera referenced this pull request in linrock/Stockfish Sep 15, 2023