Use block sparse input for the first layer. #4612
Conversation
Congrats!
@AndrovT nice speedup. Locally I measure a 13% speedup. As mentioned on Discord, I think we need two things before we merge. @Sopel97, any chance you could review this patch?
Nice! Good to see this approach finally paying off. Do you know what the average density of non-zero values is with the current nets?
Have you checked if this actually matters? The finding of NNZs uses a non-branching implementation based on syzygy1/Cfish#204, which should not depend in any way on the order.
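For readers unfamiliar with that trick, here is a rough Python model of the branchless index gathering. The real code works on SIMD comparison masks and a precomputed table of 16-bit offsets, so the names and group handling below are only an illustrative approximation, not the actual implementation:

```python
import numpy as np

# Precomputed table: for each possible 8-bit mask, the positions of its set bits.
LOOKUP = [np.array([i for i in range(8) if (mask >> i) & 1], dtype=np.uint16)
          for mask in range(256)]

def find_nnz(x, group_size=4):
    """Return indices of input groups that contain a nonzero value, gathering
    them via table lookups instead of branching on individual elements."""
    nonzero = x.reshape(-1, group_size).any(axis=1)    # one flag per group
    out = []
    for base in range(0, nonzero.size, 8):             # 8 group flags -> one byte mask
        mask = 0
        for i, flag in enumerate(nonzero[base:base + 8]):
            mask |= int(flag) << i
        out.append(LOOKUP[mask] + base)                # set-bit positions, offset by base
    return np.concatenate(out) if out else np.empty(0, dtype=np.uint16)
```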
First, I strongly suggest testing this with the current master network, as the order of weights shouldn't matter.
Edit: I realize now that it does in fact matter, because inputs are processed in groups. In this case I'd like one of the following:
- a tool for automatic ordering of the weights
- reordering during initialization, with a very fast bench or something similar
- saving statistics during training and using them during serialization

I'm not sure which of these would be best, but without any of them it's going to be annoying moving forward.
Second, I cannot verify avx512/vnni512 right now, but the code looks correct. I'll add benches later.
All tests are performed with bench options "1024 1 18 default depth nnue". Processor: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Results of "bench 4096 1 30 default depth nnue":
Here are some numbers comparing the variants.
TL;DR: the patch gives a 13% speedup, SameNet gives 11%, and the net reordering accounts for about 2% (consistent with reports of this on Discord).
I have run extensive tests across 64-bit CPU architectures, and the results are correct. All tests were performed on Linux Debian 12 + GCC (plus a Mac).
Looks good to me now.
For reference, the reordering of the net was described here: https://discord.com/channels/435943710472011776/813919248455827515/1117464914341134428

Used https://github.com/AndrovT/Stockfish/tree/log-activations to find out the likelihood of activation of the neurons, and this Python script:

```python
import torch

def permute_l1(nnue, permutation):
    l1_size = nnue.layer_stacks.l1.in_features
    assert l1_size == len(permutation) * 2
    # Apply the same permutation to both halves of the L1 input.
    permutation.extend([x + l1_size // 2 for x in permutation])
    # Leave feature-transformer outputs beyond l1_size untouched.
    ft_permutation = permutation + list(range(l1_size, nnue.input.num_outputs))
    nnue.input.weight.data = nnue.input.weight.data[:, ft_permutation]
    nnue.input.bias.data = nnue.input.bias.data[ft_permutation]
    nnue.layer_stacks.l1.weight.data = nnue.layer_stacks.l1.weight.data[:, permutation]
```

Ideally this can be added in some way to the trainer. It might also be possible to improve the scheme with which the outputs are permuted (e.g. exploiting correlation between the outputs, not just their probability of being zero).
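For context, a minimal sketch of how such a permutation might be derived from the logged activation statistics and applied with `permute_l1` above. The `activation_counts.npy` file and the `load_model`/`save_model` calls are hypothetical placeholders, not part of the actual tooling:

```python
import numpy as np

# Hypothetical input: nonzero_counts[i] = how often feature-transformer output
# neuron i (one perspective half) was nonzero while running the bench.
nonzero_counts = np.load("activation_counts.npy")

# Order neurons by how likely they are to be nonzero, so that rarely active
# neurons cluster together and whole input groups are more often entirely zero.
permutation = [int(i) for i in np.argsort(-nonzero_counts)]

# nnue = load_model("nn-ea57bea57e32.nnue")   # hypothetical loader
# permute_l1(nnue, permutation)               # from the script above
# save_model(nnue, "nn-reordered.nnue")       # hypothetical saver
```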
Thanks, nice first contribution!
I wonder if this swings the balance toward even larger nets?
VizVez thinks the same. Great minds think alike!
Use block sparse input for the first fully connected layer on architectures with at least SSSE3. Depending on the CPU architecture, this yields a speedup of up to 10%, e.g.

```
Result of 100 runs of 'bench 16 1 13 default depth NNUE'

base (...ockfish-base) = 959345 +/- 7477
test (...ckfish-patch) = 1054340 +/- 9640
diff = +94995 +/- 3999

speedup = +0.0990
P(speedup > 0) = 1.0000

CPU: 8 x AMD Ryzen 7 5700U with Radeon Graphics
Hyperthreading: on
```

Passed STC:
https://tests.stockfishchess.org/tests/view/6485aa0965ffe077ca12409c
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 8864 W: 2479 L: 2223 D: 4162
Ptnml(0-2): 13, 829, 2504, 1061, 25

This commit includes a net with reordered weights, to increase the likelihood of block sparse inputs, but otherwise equivalent to the previous master net (nn-ea57bea57e32.nnue). Activation data collected with https://github.com/AndrovT/Stockfish/tree/log-activations, running `bench 16 1 13 varied_1000.epd depth NNUE` on this data. Net parameters permuted with https://gist.github.com/AndrovT/9e3fbaebb7082734dc84d27e02094cb3.

closes official-stockfish#4612

No functional change
Removes unused large input specialization for dense affine transform. It has been obsolete since official-stockfish#4612 was merged.

closes official-stockfish#4684

No functional change
Use block sparse input for the first fully connected layer on architectures with SSSE3. The net is the exact same as in #4611 except that the feature transform output is sorted by likelihood of that neuron being nonzero.
Activation data collected with https://github.com/AndrovT/Stockfish/tree/log-activations, running `bench 16 1 13 varied_1000.epd depth NNUE` on this data. Net parameters permuted with https://gist.github.com/AndrovT/9e3fbaebb7082734dc84d27e02094cb3.

Local x86-64-avx2 benchmark:
Passed STC:
https://tests.stockfishchess.org/tests/view/6485aa0965ffe077ca12409c
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 8864 W: 2479 L: 2223 D: 4162
Ptnml(0-2): 13, 829, 2504, 1061, 25
bench 2370027
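For intuition, here is a rough numpy model of the block-sparse idea described above, not the actual SIMD implementation: the first layer's input is scanned in small groups, and only the weight columns belonging to groups that contain a nonzero value are accumulated. The function name and the group size of 4 are assumptions for the sketch.

```python
import numpy as np

def sparse_affine(weights, biases, x, group_size=4):
    """Conceptual model of the block-sparse affine transform: y = W @ x + b,
    where weight columns whose input group is entirely zero are skipped."""
    out = biases.astype(np.int32).copy()
    groups = x.reshape(-1, group_size)
    for g in np.flatnonzero(groups.any(axis=1)):   # only groups with a nonzero input
        lo = g * group_size
        out += weights[:, lo:lo + group_size].astype(np.int32) @ \
               x[lo:lo + group_size].astype(np.int32)
    return out
```

The fewer groups that contain a nonzero value (which the net reordering encourages), the fewer weight columns need to be touched, which is where the measured speedup comes from.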