
Use block sparse input for the first layer. #4612

Closed
AndrovT wants to merge 6 commits

Conversation

@AndrovT (Contributor) commented Jun 11, 2023

Use block sparse input for the first fully connected layer on architectures with SSSE3. The net is exactly the same as in #4611 except that the feature transform output is sorted by the likelihood of each neuron being nonzero.

Activation data collected with https://github.com/AndrovT/Stockfish/tree/log-activations, running "bench 16 1 13 varied_1000.epd depth NNUE" on this data. Net parameters permuted with https://gist.github.com/AndrovT/9e3fbaebb7082734dc84d27e02094cb3.

Local x86-64-avx2 benchmark:

Result of 100 runs of 'bench 16 1 13 default depth NNUE'

base (...ockfish-base) =     959345  +/- 7477
test (...ckfish-patch) =    1054340  +/- 9640
diff                   =     +94995  +/- 3999

speedup        = +0.0990
P(speedup > 0) =  1.0000

CPU: 8 x AMD Ryzen 7 5700U with Radeon Graphics
Hyperthreading: on

Passed STC:
https://tests.stockfishchess.org/tests/view/6485aa0965ffe077ca12409c
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 8864 W: 2479 L: 2223 D: 4162
Ptnml(0-2): 13, 829, 2504, 1061, 25

bench 2370027
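
For context on what "block sparse input" means here: after the clipped ReLU most feature-transformer outputs are zero, so the first layer only needs to visit the weight columns whose corresponding chunk of the input contains a nonzero value. Below is a minimal Python/numpy sketch of that idea; shapes and chunk size are illustrative, and the actual implementation is the SIMD code in src/nnue/layers/affine_transform_sparse_input.h.

```
import numpy as np

def sparse_affine(weights, biases, x, chunk_size=4):
    """Compute biases + weights @ x, skipping weight columns whose input chunk is all zero."""
    out = biases.astype(np.int32).copy()
    for c in range(0, len(x), chunk_size):
        chunk = x[c:c + chunk_size]
        if not chunk.any():  # the whole block of inputs is zero -> skip these columns
            continue
        out += weights[:, c:c + chunk_size].astype(np.int32) @ chunk.astype(np.int32)
    return out
```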

@cj5716 (Contributor) commented Jun 11, 2023

Congrats!

@vondele (Member) commented Jun 12, 2023

@AndrovT nice speedup. Locally I measure a 13% speedup.

As mentioned on Discord, I think we need two things before we merge:
a) ensure the results are correct on architectures we don't test well in CI (e.g. vnni, avx512, neon, etc.)
b) have a link to the script used for permuting the network weights. Ideally that script would be integrated into the nnue-pytorch repository.

@Sopel97 any chance you could review this patch?

@Sopel97 (Member) commented Jun 12, 2023

Nice! Good to see this approach finally paying off. Do you know what the average density of non-zero values is with the current nets?

> feature transform output is sorted by likelihood of that neuron being nonzero

Have you checked if this actually matters? The finding of NNZs uses a non-branching implementation based on syzygy1/Cfish#204, which should not depend in any way on the order.
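
For readers following along: the branchless NNZ gathering can be pictured as a lookup table from an 8-bit "which chunks are nonzero" mask to the positions of its set bits. Below is a simplified Python sketch of the concept only; the real code builds the mask with SIMD compares/movemask and stores the tables as index vectors, so the details differ.

```
import numpy as np

# For every possible 8-bit mask, precompute the positions of its set bits.
LOOKUP = [[i for i in range(8) if (mask >> i) & 1] for mask in range(256)]

def nnz_chunk_indices(x, chunk_size=4):
    """Indices of chunk_size-wide input chunks containing at least one nonzero value."""
    num_chunks = len(x) // chunk_size
    nonzero = np.array([x[i * chunk_size:(i + 1) * chunk_size].any()
                        for i in range(num_chunks)])
    indices = []
    for base in range(0, num_chunks, 8):
        mask = 0
        for bit, flag in enumerate(nonzero[base:base + 8]):
            mask |= int(flag) << bit
        # Indices come from the precomputed table rather than per-element branching.
        indices.extend(base + i for i in LOOKUP[mask])
    return indices
```

The cost of this step depends only on how many chunks are nonzero, not on which ones, which is why the gathering itself is order-independent.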

@Sopel97 (Member) left a comment

First, I strongly suggest testing this with the current master network, as the order of weights shouldn't matter.
Edit: I realize now that it does in fact matter, because inputs are processed in groups. In this case I'd like one of the following:

  1. a tool for automatic ordering of weights
  2. reordering during initialization, with a very fast bench or something similar
  3. saving statistics during training and using them during serialization

I'm not sure which of these would be best, and without any of them it's going to be annoying moving forward.

Second, I cannot verify avx512/vnni512 right now, but the code looks correct. I'll add benches later.
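
To make the grouping point concrete: the input is consumed in 4-wide chunks, and a chunk has to be processed if any of its four neurons is nonzero, so clustering the frequently-active neurons into the same chunks reduces the number of chunks touched. A toy simulation with made-up activation probabilities (not measured Stockfish data):

```
import numpy as np

rng = np.random.default_rng(0)
n, chunk = 1024, 4

# Hypothetical per-neuron probabilities of being nonzero: a few hot, most cold.
p = np.where(rng.random(n) < 0.1, 0.9, 0.05)

def nonzero_chunk_fraction(prob, samples=1000):
    x = rng.random((samples, n)) < prob            # simulated activations
    chunks = x.reshape(samples, n // chunk, chunk)
    return chunks.any(axis=2).mean()               # fraction of chunks that must be processed

print("unsorted:", nonzero_chunk_fraction(p))
print("sorted:  ", nonzero_chunk_fraction(np.sort(p)[::-1]))
```

Sorting by activation probability groups the rarely-active neurons into chunks that are usually all zero, so fewer blocks need to be processed; this corresponds to the roughly 2% extra speedup for the reordered net measured further below.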

Review comments on src/nnue/layers/affine_transform_sparse_input.h (outdated) were marked resolved.
@MinetaS (Contributor) commented Jun 12, 2023

All tests were performed with bench options "1024 1 18 default depth nnue".
Binaries were built with PGO using Clang 15.0.7.

Processor: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz

  • x86-64-vnni512: 2.1% speedup
Result of 100 runs
==================
base (./stockfish-master) =    1449135  +/- 2915
test (./stockfish       ) =    1479613  +/- 1946
diff                      =     +30478  +/- 1998

speedup        = +0.0210
P(speedup > 0) =  1.0000
  • x86-64-avx512: 0.6% speedup
Result of 100 runs
==================
base (./stockfish-master) =    1446929  +/- 2227
test (./stockfish       ) =    1455462  +/- 2387
diff                      =      +8533  +/- 1514

speedup        = +0.0059
P(speedup > 0) =  1.0000
  • x86-64-vnni256: 7.2% speedup
Result of 100 runs
==================
base (./stockfish-master) =    1394096  +/- 1398
test (./stockfish       ) =    1495079  +/- 1684
diff                      =    +100984  +/- 1507

speedup        = +0.0724
P(speedup > 0) =  1.0000
  • x86-64-bmi2: 4.1% speedup
Result of 100 runs
==================
base (./stockfish-master) =    1424134  +/- 1850
test (./stockfish       ) =    1482731  +/- 2060
diff                      =     +58596  +/- 2264

speedup        = +0.0411
P(speedup > 0) =  1.0000
  • x86-64-sse41-popcnt: 6.4% speedup
Result of 100 runs
==================
base (./stockfish-master) =    1215782  +/- 1163
test (./stockfish       ) =    1293076  +/- 1204
diff                      =     +77294  +/- 1340

speedup        = +0.0636
P(speedup > 0) =  1.0000
  • x86-64-ssse3: 6% speedup
Result of 100 runs
==================
base (./stockfish-master) =    1190862  +/- 1218
test (./stockfish       ) =    1262138  +/- 1190
diff                      =     +71276  +/- 1263

speedup        = +0.0599
P(speedup > 0) =  1.0000

Results of "bench 4096 1 30 default depth nnue":

Architecture Signature
x86-64-vnni512 442205461
x86-64-avx512 442205461
x86-64-vnni256 442205461
x86-64-bmi2 442205461
x86-64-sse41-popcnt 442205461
x86-64-ssse3 442205461

@vondele (Member) commented Jun 12, 2023

Here are some numbers comparing

  • master
  • patch (this PR, including the reordered net)
  • SameNet (this PR but keeping the original net)

TL;DR: patch 13% speedup, SameNet 11% speedup, net reorder benefit 2% (consistent with reports of this on Discord).

on AMD Ryzen 9 3950X

Result of  10 runs
==================
base (./stockfish.master       ) =    1394168  +/- 12402
test (./stockfish.patch        ) =    1576029  +/- 10992
diff                             =    +181860  +/- 4862

speedup        = +0.1304
P(speedup > 0) =  1.0000
Result of  10 runs
==================
base (./stockfish.master       ) =    1381918  +/- 4253
test (./stockfish.SameNet      ) =    1529336  +/- 5972
diff                             =    +147418  +/- 4775

speedup        = +0.1067
P(speedup > 0) =  1.0000
Result of  10 runs
==================
base (./stockfish.patch        ) =    1563712  +/- 10769
test (./stockfish.SameNet      ) =    1536426  +/- 10142
diff                             =     -27286  +/- 3684

speedup        = -0.0174
P(speedup > 0) =  0.0000

@AndrovT requested a review from Sopel97 on June 12, 2023 at 17:57
@Technologov commented Jun 12, 2023

I have run extensive tests across 64-bit CPU architectures, and the results are correct. Here:
https://pastebin.com/kmPaw8mZ

All tests were performed on Linux Debian 12 with GCC (plus a Mac), using:
$ stockfish bench 16 1

@Sopel97 (Member) commented Jun 12, 2023

Looks good to me now.

@vondele (Member) commented Jun 12, 2023

For reference, the reordering of the net was described here https://discord.com/channels/435943710472011776/813919248455827515/1117464914341134428 :

Used https://github.com/AndrovT/Stockfish/tree/log-activations to find out the likelihood of activation of the neurons, together with this Python script:

import torch

def permute_l1(nnue, permutation):
    # permutation gives the desired order for the first half of the feature
    # transformer outputs; the same relative order is applied to the second half.
    l1_size = nnue.layer_stacks.l1.in_features
    assert l1_size == len(permutation) * 2

    permutation.extend([x + l1_size // 2 for x in permutation])
    # Any outputs beyond l1_size keep their original positions.
    ft_permutation = permutation + list(range(l1_size, nnue.input.num_outputs))

    # Reorder the feature transformer outputs and, consistently,
    # the columns of the first fully connected layer.
    nnue.input.weight.data = nnue.input.weight.data[:, ft_permutation]
    nnue.input.bias.data = nnue.input.bias.data[ft_permutation]
    nnue.layer_stacks.l1.weight.data = nnue.layer_stacks.l1.weight.data[:, permutation]

Ideally this can be added in some way to the trainer. It might also be possible to improve the scheme with which the outputs are permuted (e.g. exploiting correlation between the outputs, not just their probability of being zero).
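
A possible way to apply this script, assuming the per-neuron activation counts from the log-activations run were saved to a file; the file name, import path, and checkpoint name below are placeholders, not part of the original workflow.

```
import torch
from model import NNUE  # nnue-pytorch model class (assumed import path)

# activation_counts[i]: how often feature-transformer output i was nonzero
# during the logging run (hypothetical file derived from the log-activations data).
activation_counts = torch.load("activation_counts.pt")

# Order neurons by how often they are active; rarely-active neurons end up grouped
# together, so whole 4-wide blocks of the first-layer input are more often all zero.
permutation = torch.argsort(activation_counts, descending=True).tolist()

nnue = NNUE.load_from_checkpoint("last.ckpt")  # placeholder checkpoint
permute_l1(nnue, permutation)
# The permuted model can then be serialized to .nnue with the usual tooling.
```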

@vondele closed this in 38e6166 on Jun 12, 2023
@vondele (Member) commented Jun 12, 2023

Thanks, nice first contribution!

@mstembera (Contributor) commented:
I wonder if this swings the balance toward even larger nets?

@Technologov commented:
> I wonder if this swings the balance toward even larger nets?

VizVez thinks the same. Great minds think alike!

rn5f107s2 pushed a commit to rn5f107s2/Stockfish that referenced this pull request Jun 16, 2023
Use block sparse input for the first fully connected layer on architectures with at least SSSE3.

Depending on the CPU architecture, this yields a speedup of up to 10%, e.g.

```
Result of 100 runs of 'bench 16 1 13 default depth NNUE'

base (...ockfish-base) =     959345  +/- 7477
test (...ckfish-patch) =    1054340  +/- 9640
diff                   =     +94995  +/- 3999

speedup        = +0.0990
P(speedup > 0) =  1.0000

CPU: 8 x AMD Ryzen 7 5700U with Radeon Graphics
Hyperthreading: on
```

Passed STC:
https://tests.stockfishchess.org/tests/view/6485aa0965ffe077ca12409c
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 8864 W: 2479 L: 2223 D: 4162
Ptnml(0-2): 13, 829, 2504, 1061, 25

This commit includes a net with reordered weights, to increase the likelihood of block sparse inputs,
but otherwise equivalent to the previous master net (nn-ea57bea57e32.nnue).

Activation data collected with https://github.com/AndrovT/Stockfish/tree/log-activations, running "bench 16 1 13 varied_1000.epd depth NNUE" on this data. Net parameters permuted with https://gist.github.com/AndrovT/9e3fbaebb7082734dc84d27e02094cb3.

closes official-stockfish#4612

No functional change
vondele pushed a commit to vondele/Stockfish that referenced this pull request Jul 16, 2023
Removes unused large input specialization for dense affine transform. It has been obsolete since official-stockfish#4612 was merged.

closes official-stockfish#4684

No functional change
Joachim26 pushed a commit to Joachim26/StockfishNPS that referenced this pull request Jul 20, 2023
(same commit message as above)
linrock pushed a commit to linrock/Stockfish that referenced this pull request Aug 26, 2023
(same commit message as above)
Joachim26 pushed a commit to Joachim26/StockfishNPS that referenced this pull request Oct 4, 2023 (same commit message as the official-stockfish#4684 commit above; listed twice)