Use block sparse input for the first layer. #4612
Conversation
Congrats!
@AndrovT nice speedup. Locally I measure a 13% speedup. As mentioned on Discord, I think we need two things before we merge. @Sopel97, any chance you could review this patch?
Nice! Good to see this approach finally paying off. Do you know what the average density of non-zero values is with the current nets?
Have you checked if this actually matters? The finding of NNZs uses a non-branching implementation based on syzygy1/Cfish#204, which should not depend in any way on the order.
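For readers unfamiliar with that trick, here is a rough Python model of the branchless index gathering. The real code works on SIMD comparison masks and a precomputed table of 16-bit offsets, so the names and group handling below are only an illustrative approximation, not the actual implementation:

```python
import numpy as np

# Precomputed table: for each possible 8-bit mask, the positions of its set bits.
LOOKUP = [np.array([i for i in range(8) if (mask >> i) & 1], dtype=np.uint16)
          for mask in range(256)]

def find_nnz(x, group_size=4):
    """Return indices of input groups that contain a nonzero value, gathering
    them via table lookups instead of branching on individual elements."""
    nonzero = x.reshape(-1, group_size).any(axis=1)    # one flag per group
    out = []
    for base in range(0, nonzero.size, 8):             # 8 group flags -> one byte mask
        mask = 0
        for i, flag in enumerate(nonzero[base:base + 8]):
            mask |= int(flag) << i
        out.append(LOOKUP[mask] + base)                # set-bit positions, offset by base
    return np.concatenate(out) if out else np.empty(0, dtype=np.uint16)
```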
First, I strongly suggest testing this with the current master network, as the order of weights shouldn't matter.
Edit: I realize now that it does in fact matter, because inputs are processed in groups. In this case I'd like one of the following:
- a tool for automatic ordering of the weights
- reordering during initialization, with a very fast bench or something similar
- saving statistics during training and using them during serialization

I'm not sure which of these would be best, but without any of them it's going to be annoying moving forward.
Second, I cannot verify avx512/vnni512 right now, but the code looks correct. I'll add benches later.
All tests are performed with bench options "1024 1 18 default depth nnue". Processor: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Results of "bench 4096 1 30 default depth nnue":
Here are some numbers comparing the variants.
TL;DR: the patch gives a 13% speedup, SameNet gives 11%, and the net reordering accounts for about 2% (consistent with reports of this on Discord).
I have run extensive tests across 64-bit CPU architectures, and the results are correct. All tests were performed on Linux Debian 12 + GCC (plus a Mac).
Looks good to me now.
For reference, the reordering of the net was described here: https://discord.com/channels/435943710472011776/813919248455827515/1117464914341134428

Used https://github.com/AndrovT/Stockfish/tree/log-activations to find out the likelihood of activation of the neurons, and this Python script:

```python
import torch

def permute_l1(nnue, permutation):
    l1_size = nnue.layer_stacks.l1.in_features
    assert l1_size == len(permutation) * 2
    # Apply the same permutation to both halves of the L1 input.
    permutation.extend([x + l1_size // 2 for x in permutation])
    # Leave feature-transformer outputs beyond l1_size untouched.
    ft_permutation = permutation + list(range(l1_size, nnue.input.num_outputs))
    nnue.input.weight.data = nnue.input.weight.data[:, ft_permutation]
    nnue.input.bias.data = nnue.input.bias.data[ft_permutation]
    nnue.layer_stacks.l1.weight.data = nnue.layer_stacks.l1.weight.data[:, permutation]
```

Ideally this can be added in some way to the trainer. It might also be possible to improve the scheme with which the outputs are permuted (e.g. exploiting correlation between the outputs, not just their probability of being zero).
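For context, a minimal sketch of how such a permutation might be derived from the logged activation statistics and applied with `permute_l1` above. The `activation_counts.npy` file and the `load_model`/`save_model` calls are hypothetical placeholders, not part of the actual tooling:

```python
import numpy as np

# Hypothetical input: nonzero_counts[i] = how often feature-transformer output
# neuron i (one perspective half) was nonzero while running the bench.
nonzero_counts = np.load("activation_counts.npy")

# Order neurons by how likely they are to be nonzero, so that rarely active
# neurons cluster together and whole input groups are more often entirely zero.
permutation = [int(i) for i in np.argsort(-nonzero_counts)]

# nnue = load_model("nn-ea57bea57e32.nnue")   # hypothetical loader
# permute_l1(nnue, permutation)               # from the script above
# save_model(nnue, "nn-reordered.nnue")       # hypothetical saver
```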
Thanks, nice first contribution!
I wonder if this swings the balance toward even larger nets?
VizVez thinks the same. Great minds think alike!
Use block sparse input for the first fully connected layer on architectures with at least SSSE3. Depending on the CPU architecture, this yields a speedup of up to 10%, e.g.

```
Result of 100 runs of 'bench 16 1 13 default depth NNUE'

base (...ockfish-base) = 959345 +/- 7477
test (...ckfish-patch) = 1054340 +/- 9640
diff = +94995 +/- 3999

speedup = +0.0990
P(speedup > 0) = 1.0000

CPU: 8 x AMD Ryzen 7 5700U with Radeon Graphics
Hyperthreading: on
```

Passed STC:
https://tests.stockfishchess.org/tests/view/6485aa0965ffe077ca12409c
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 8864 W: 2479 L: 2223 D: 4162
Ptnml(0-2): 13, 829, 2504, 1061, 25

This commit includes a net with reordered weights, to increase the likelihood of block sparse inputs, but otherwise equivalent to the previous master net (nn-ea57bea57e32.nnue). Activation data collected with https://github.com/AndrovT/Stockfish/tree/log-activations, running `bench 16 1 13 varied_1000.epd depth NNUE` on this data. Net parameters permuted with https://gist.github.com/AndrovT/9e3fbaebb7082734dc84d27e02094cb3.

closes official-stockfish#4612

No functional change
Removes unused large input specialization for dense affine transform. It has been obsolete since official-stockfish#4612 was merged.

closes official-stockfish#4684

No functional change
Use block sparse input for the first fully connected layer on architectures with SSSE3. The net is the exact same as in #4611 except that the feature transform output is sorted by likelihood of that neuron being nonzero.
Activation data collected with https://github.com/AndrovT/Stockfish/tree/log-activations, running `bench 16 1 13 varied_1000.epd depth NNUE` on this data. Net parameters permuted with https://gist.github.com/AndrovT/9e3fbaebb7082734dc84d27e02094cb3.

Local x86-64-avx2 benchmark:
Passed STC:
https://tests.stockfishchess.org/tests/view/6485aa0965ffe077ca12409c
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 8864 W: 2479 L: 2223 D: 4162
Ptnml(0-2): 13, 829, 2504, 1061, 25
bench 2370027
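For intuition, here is a rough numpy model of the block-sparse idea described above, not the actual SIMD implementation: the first layer's input is scanned in small groups, and only the weight columns belonging to groups that contain a nonzero value are accumulated. The function name and the group size of 4 are assumptions for the sketch.

```python
import numpy as np

def sparse_affine(weights, biases, x, group_size=4):
    """Conceptual model of the block-sparse affine transform: y = W @ x + b,
    where weight columns whose input group is entirely zero are skipped."""
    out = biases.astype(np.int32).copy()
    groups = x.reshape(-1, group_size)
    for g in np.flatnonzero(groups.any(axis=1)):   # only groups with a nonzero input
        lo = g * group_size
        out += weights[:, lo:lo + group_size].astype(np.int32) @ \
               x[lo:lo + group_size].astype(np.int32)
    return out
```

The fewer groups that contain a nonzero value (which the net reordering encourages), the fewer weight columns need to be touched, which is where the measured speedup comes from.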