
Implemented LEB128 in feature transformer #4617

Closed

Conversation

MaximMolchanov
Contributor

Implemented LEB128 compression. The current net size goes from 70 MiB down to 39 MiB.

inline IntType read_leb_128(std::istream& stream) {
    static_assert(std::is_signed_v<IntType>, "Not implemented for unsigned types");
    IntType result = 0, shift = 0;
    while (true) {
Member

an infinite loop here is not acceptable; it's asking for exploits

Contributor Author

Rewritten, no infinite loop here.

    IntType result = 0, shift = 0;
    while (true) {
        std::uint8_t byte;
        stream.read(reinterpret_cast<char*>(&byte), sizeof(std::uint8_t));
Member

read error should be handled somehow

Contributor Author

Should it also be done in read_little_endian and write_little_endian?

Member

It's not necessary there. Here it's needed to prevent an infinite loop. Other mitigations could also suffice.

Member

also, note that shifting by more than the type width is technically undefined behavior, so we should avoid that

Contributor Author

Avoided shifting by more than the type width now.
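
For reference, a minimal sketch of a signed LEB128 reader that reflects the three review points above: the loop is bounded by the width of IntType, a failed read stops decoding, and no shift exceeds the type width. This is an illustration only (the name and details are hypothetical, not the exact code merged in this PR), and it assumes the usual <cstdint>, <istream> and <type_traits> includes.

template <typename IntType>
inline IntType read_leb_128_sketch(std::istream& stream) {
    static_assert(std::is_signed_v<IntType>, "Not implemented for unsigned types");

    IntType result = 0;
    std::size_t shift = 0;
    constexpr std::size_t max_shift = sizeof(IntType) * 8;

    while (shift < max_shift) {      // bounded: at most ceil(width / 7) iterations
        std::uint8_t byte;
        stream.read(reinterpret_cast<char*>(&byte), sizeof(byte));
        if (!stream)
            break;                   // read error: stop instead of looping forever

        result |= static_cast<IntType>(byte & 0x7f) << shift;
        shift += 7;

        if ((byte & 0x80) == 0) {    // no continuation bit: last byte of this value
            // sign-extend if the sign bit of the final 7-bit group is set
            if (shift < max_shift && (byte & 0x40))
                result |= -(IntType(1) << shift);
            break;
        }
    }
    return result;
}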

@Sopel97
Member

Sopel97 commented Jun 13, 2023

Additionally I'd like to see some startup benchmarks. I don't expect much impact, but I've had larger nets time out in cutechess-cli due to taking too long to load.

@cngstefan

Also, I would do a non-regression test at STC just to be sure that really nothing breaks.

@locutus2
Member

The above post is me! Sorry, wrong user.

@vondele
Member

vondele commented Jun 13, 2023

Will this still do the right thing for all byte orderings? Since it removed write/read_little_endian ...

@Sopel97
Member

Sopel97 commented Jun 13, 2023

It reads byte by byte, so it should be endianness-agnostic.

@MaximMolchanov
Contributor Author

It looks like it should do the right thing for big endian, but I don't know how to test it properly 🙃

@MaximMolchanov MaximMolchanov marked this pull request as draft June 13, 2023 22:45
@Sopel97
Member

Sopel97 commented Jun 14, 2023

I'm concerned about the startup time

$ time -- ./stockfish-master.exe eval > /dev/null
real 0.15
user 0.00
sys 0.01

$ time -- ./stockfish-patch.exe eval > /dev/null
real 0.98
user 0.01
sys 0.00

@Sopel97
Member

Sopel97 commented Jun 14, 2023

Serialization on the Python side takes around 30 seconds when implemented naively in Python. I wonder if we should perhaps support both uncompressed and compressed networks, both in Stockfish and in the trainer. We would need to somehow squeeze a flag into the nnue format - it might or might not be possible to keep backwards compatibility.

@MaximMolchanov
Contributor Author

@Sopel97 What about https://pypi.org/project/leb128/ ? Is it faster?

@MaximMolchanov
Contributor Author

I have the same startup time on my laptop:
master:

$ time stockfish eval > /dev/null

real    0m0,075s
user    0m0,037s
sys     0m0,038s

patch:

$ time stockfish eval > /dev/null

real    0m0,074s
user    0m0,039s
sys     0m0,036s

@Sopel97
Member

Sopel97 commented Jun 14, 2023

@Sopel97 What about https://pypi.org/project/leb128/ ? Is it faster?

terrible, I gave up waiting after 2 minutes.

the two approaches, for the record



  def write_leb_128(self, value):
    value = int(value)
    while True:
      byte = value & 0x7f
      value = value >> 7
      if ((value == 0 and (byte & 0x40) == 0) or (value == -1 and (byte & 0x40) != 0)):
        self.buf.extend(byte.to_bytes(1, 'little'))
        return
      byte = byte | 0x80
      self.buf.extend(byte.to_bytes(1, 'little'))

  def write_leb_128_array(self, arr):
    for v in arr.numpy():
      self.write_leb_128(v)

  def write_leb_128_array(self, arr):
    for v in arr.numpy():
      self.buf.extend(leb128.i.encode(v))

edit.

  def write_leb_128_array(self, arr):
    for v in arr.numpy():
      self.buf.extend(leb128.i.encode(int(v)))

finished in around 30s too. It lacks an array API, so it's useless for us, and it's implemented in pure Python, which is just bad.

@Sopel97
Member

Sopel97 commented Jun 14, 2023

I have the same startup time on my laptop: master:

$ time stockfish eval > /dev/null

real    0m0,075s
user    0m0,037s
sys     0m0,038s

patch:

$ time stockfish eval > /dev/null

real    0m0,074s
user    0m0,039s
sys     0m0,036s

are you sure you tested this properly? stockfish invoked like that would not run the local stockfish executable but a global one

@MaximMolchanov
Contributor Author

Oh, nice catch. Yes, the difference is quite big:

$ time ./stockfish eval > /dev/null

real    0m0,064s
user    0m0,028s
sys     0m0,036s

vs

$ time ./stockfish eval > /dev/null

real    0m0,495s
user    0m0,469s
sys     0m0,025s

@Torom
Contributor

Torom commented Jun 14, 2023

Startup time on a Raspberry Pi 4:

master:

$ time ./stockfish eval > /dev/null

real    0m0.339s
user    0m0.241s
sys     0m0.097s

patch:

$ time ./stockfish.nn-leb-128 eval > /dev/null

real    0m1.977s
user    0m1.884s
sys     0m0.090s

@XInTheDark
Contributor

Similar difference observed on a MacBook M1. Arguably it will not make a difference in actual gameplay, but it is still considerable enough?

master:

% time ./stockfish eval > /dev/null
./stockfish eval > /dev/null  0.07s user 0.02s system 93% cpu 0.088 total

patch:

% time ./stockfish eval > /dev/null
./stockfish eval > /dev/null  0.55s user 0.02s system 99% cpu 0.576 total

@MaximMolchanov
Contributor Author

Thanks to everyone who posted benchmarks; I'd appreciate it if you could run them one more time 😄

Here are my new results:

 time ./stockfish eval > /dev/null

real    0m0,121s
user    0m0,105s
sys     0m0,016s

@XInTheDark
Contributor

My new results:

 % time ./stockfish eval > /dev/null
./stockfish eval > /dev/null  0.13s user 0.02s system 95% cpu 0.155 total

Huge improvement over the original patch!

@Sopel97
Member

Sopel97 commented Jun 16, 2023

$ time -- ./stockfish-patch.exe eval > /dev/null
real 0.25
user 0.00
sys 0.01

Performance looks OK now. I'm still torn on whether we should support both compressed and uncompressed networks, at least for the time being, as there's no tooling for compression yet. We have a bit of leeway in the new format, so we should probably add a discernible flag.

@vondele
Member

vondele commented Jun 16, 2023

IMO performance is good now.

I would prefer at this phase not to have the choice between compressed and not.

If this is OK, we should go for it. However, I fully agree we need the tooling in the trainer to read/write this format. For the tooling we might need the capability to convert non-compressed to compressed, just so that we can do training from older nets, and convert e.g. nets that are running now.

So, from the SF point of view, this looks OK, but can't be merged before we have the tooling part done, IMO.

@Sopel97
Member

Sopel97 commented Jun 16, 2023

if we can agree on a discriminator in the file format I can work on the trainer side. I think we should be fine if we just add some long-ish magic string before each compressed layer. "COMPRESSED_LEB128" ?

@vondele
Member

vondele commented Jun 16, 2023

I think the magic string is fine, if we need it. On the other hand we don't have magic strings for the architecture version, we expect users to pick the right arch on the command line. Up to you.

@Sopel97
Member

Sopel97 commented Jun 17, 2023

Preliminary unoptimized serializer/deserializer. https://github.com/Sopel97/nnue-pytorch/tree/leb. Right now requires the leb128 package. It is slow, but I'll implement the encoding/decoding natively later.

example for compressed/uncompressed round-trips

python serialize.py --features=HalfKAv2_hm nn-fdc1d0fe6455.nnue a.nnue --ft_compression=leb128
python serialize.py --features=HalfKAv2_hm a.nnue b.nnue --ft_compression=leb128
python serialize.py --features=HalfKAv2_hm a.nnue c.nnue

I made some alterations to the format. Each saved compressed tensor has a small header that consists of the magic string "COMPRESSED_LEB" encoded in UTF-8, followed by a 4-byte little-endian integer giving the number of bytes the compressed data of this tensor occupies. This is the same as in this PR, with the addition of the magic string. The rationale for the magic string is that we can detect the encoding during network conversion, and can assert earlier when loading the network in the engine.
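
For illustration only (the function name and error handling here are hypothetical, not taken from this PR or the trainer), an engine-side loader could consume that per-tensor header roughly like this: read and verify the magic string, then assemble the 4-byte little-endian length byte by byte so it stays independent of the host byte order. Assumes <cstdint>, <cstring> and <istream>.

bool read_compressed_header(std::istream& stream, std::uint32_t& compressed_bytes) {
    constexpr char        Magic[]  = "COMPRESSED_LEB";
    constexpr std::size_t MagicLen = sizeof(Magic) - 1;   // the file stores no trailing '\0'

    char buf[MagicLen];
    stream.read(buf, MagicLen);
    if (!stream || std::memcmp(buf, Magic, MagicLen) != 0)
        return false;                                      // not a compressed tensor

    std::uint8_t b[4];
    stream.read(reinterpret_cast<char*>(&b[0]), 4);
    if (!stream)
        return false;
    compressed_bytes = std::uint32_t(b[0]) | (std::uint32_t(b[1]) << 8)
                     | (std::uint32_t(b[2]) << 16) | (std::uint32_t(b[3]) << 24);
    return true;
}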

@MaximMolchanov
Contributor Author

MaximMolchanov commented Jun 18, 2023

Preliminary unoptimized serializer/deserializer. https://github.com/Sopel97/nnue-pytorch/tree/leb. Right now requires the leb128 package. It is slow, but I'll implement the encoding/decoding natively later.

example for compressed/uncompressed round-trips

python serialize.py --features=HalfKAv2_hm nn-fdc1d0fe6455.nnue a.nnue --ft_compression=leb128
python serialize.py --features=HalfKAv2_hm a.nnue b.nnue --ft_compression=leb128
python serialize.py --features=HalfKAv2_hm a.nnue c.nnue

I made some alterations to the format. Each saved compressed tensor has a small header that consists of the magic string "COMPRESSED_LEB" encoded in UTF-8, followed by a 4-byte little-endian integer giving the number of bytes the compressed data of this tensor occupies. This is the same as in this PR, with the addition of the magic string. The rationale for the magic string is that we can detect the encoding during network conversion, and can assert earlier when loading the network in the engine.

Here is my implementation of read_leb_128_array; it should be about 2-3 times faster:

  def read_leb_128_array(self, dtype, shape):
    l = self.read_int32()
    d = self.f.read(l)
    if len(d) != l:
      raise Exception('Unexpected end of file when reading compressed data.')

    n = reduce(operator.mul, shape, 1)
    ints = np.zeros(n, dtype=dtype)
    k = 0
    for i in range(n):
      r = 0
      shift = 0
      while True:
        byte = d[k]
        k = k + 1
        r |= (byte & 0x7f) << shift
        shift += 7
        if (byte & 0x80) == 0:
          # last byte of this value: sign-extend if the sign bit (0x40) of the final group is set
          ints[i] = r if (byte & 0x40) == 0 else r | ~((1 << shift) - 1)
          break
    res = torch.FloatTensor(ints)
    res = res.reshape(shape)
    return res

EDIT:

I've managed to make it much faster (almost the same time as no compression) using numba; here is a patch (it can be copied and applied using the git apply command):

diff --git a/serialize.py b/serialize.py
index 8ce9ebe..3ed1162 100644
--- a/serialize.py
+++ b/serialize.py
@@ -11,7 +11,8 @@ import pytorch_lightning as pl
 from torch.utils.data import DataLoader
 from functools import reduce
 import operator
-import leb128
+import numpy as np
+from numba import njit
 
 def ascii_hist(name, x, bins=6):
   N,X = numpy.histogram(x, bins=bins)
@@ -25,6 +26,36 @@ def ascii_hist(name, x, bins=6):
     xi = '{0: <8.4g}'.format(xi).ljust(10)
     print('{0}| {1}'.format(xi,bar))
 
+@njit
+def encode_leb_128_array(arr):
+  res = []
+  for v in arr:
+    while True:
+      byte = v & 0x7f
+      v = v >> 7
+      if (v == 0 and byte & 0x40 == 0) or (v == -1 and byte & 0x40 != 0):
+        res.append(byte)
+        break
+      res.append(byte | 0x80)
+  return res
+
+@njit
+def decode_leb_128_array(d, n):
+  ints = np.zeros(n)
+  k = 0
+  for i in range(n):
+    r = 0
+    shift = 0
+    while True:
+      byte = d[k]
+      k = k + 1
+      r |= (byte & 0x7f) << shift
+      shift += 7
+      if (byte & 0x80) == 0:
+        ints[i] = r if (byte & 0x40) == 0 else r | ~((1 << shift) - 1)
+        break
+  return ints
+
 # hardcoded for now
 VERSION = 0x7AF32F20
 DEFAULT_DESCRIPTION = "Network trained with the https://github.com/glinscott/nnue-pytorch trainer."
@@ -79,9 +110,7 @@ class NNUEWriter():
     self.buf.extend(encoded_description)
 
   def write_leb_128_array(self, arr):
-    buf = bytearray()
-    for v in arr:
-      buf.extend(leb128.i.encode(int(v)))
+    buf = encode_leb_128_array(arr)
     self.int32(len(buf))
     self.buf.extend(buf)
 
@@ -195,9 +224,7 @@ class NNUEReader():
     if len(d) != l:
       raise Exception('Unexpected end of file when reading compressed data.')
 
-    inp = io.BytesIO(d)
-    ints = [leb128.i.decode_reader(inp)[0] for i in range(reduce(operator.mul, shape, 1))]
-    res = torch.FloatTensor(ints)
+    res = torch.FloatTensor(decode_leb_128_array(d, reduce(operator.mul, shape, 1)))
     res = res.reshape(shape)
     return res
 

@Sopel97
Member

Sopel97 commented Jun 19, 2023

Using numba indeed provides huge benefits, and I think it's reasonable to add it as a dependency, thanks. I made a finalized PR on the trainer side: official-stockfish/nnue-pytorch#251. Could you update this PR to take into account the format changes?

@MaximMolchanov
Contributor Author

Using NUMBA indeed provides huge benefits, and I think it's reasonable to add it as a dependency, thanks. I made a finalized PR on the trainer side. glinscott/nnue-pytorch#251. Could you update this PR to take into account the format changes?

Sure, added writing the magic string and 'eating' it while reading.

Also attaching the previously successful non-regression test: https://tests.stockfishchess.org/tests/view/6488f524f42a44347ed7b763

@MaximMolchanov MaximMolchanov marked this pull request as ready for review June 19, 2023 17:22
@vondele vondele added the to be merged Will be merged shortly label Jun 19, 2023
@vondele vondele closed this in a46087e Jun 19, 2023
@mstembera
Contributor

Maybe now the extra step of compressing the binaries on abrok from 43MB to 33MB is no longer worth it?

@vondele
Member

vondele commented Jun 20, 2023

Note that since today, we have the latest dev builds available as pre-releases on github
https://github.com/official-stockfish/Stockfish/releases

As the binaries need to be distributed with the license etc., having them in a zip or tar file is quite a good solution, IMO.

@mstembera
Contributor

Nice. Are these gcc or the faster clang?

@vondele
Member

vondele commented Jun 20, 2023

Right now gcc, but we should pin to a compiler version and can improve as needed.
What's important is that testing and deployment use the same compiler, which right now seems easier with gcc.

rn5f107s2 pushed a commit to rn5f107s2/Stockfish that referenced this pull request Jun 22, 2023
Implemented LEB128 (de)compression for the feature transformer.
Reduces embedded network size from 70 MiB to 39 MiB.

The new nn-78bacfcee510.nnue corresponds to the master net compressed.

closes official-stockfish#4617

No functional change
@mstembera
Contributor

@MaximMolchanov Sorry for the late question but I'm curious... I read the link https://en.wikipedia.org/wiki/LEB128 at the top of this PR but it doesn't really explain how LEB128 compares to other compression algos. That is, why is LEB128 a particularly good choice for us? Also, why do we use the signed version even though the unsigned pseudocode looks a bit simpler? Thanks!

@MaximMolchanov
Contributor Author

@mstembera

why is LEB128 a particularly good choice for us?

In short: LEB128 is useful when there are lots of small values relative to the type's range. So, technically, if we have int16 and almost all values are in the range [-64...+63], then LEB128 works well.
For us the main gain is in the feature_transformer's weights. In the current net (nn-c38c3d8d3920) we have 46137344 weights, and 42002644 of them are in the range [-64..+63]. That means for those 42002644 values we can use 1 byte of memory instead of two, saving about 42 MB (the same idea applies to the biases and psqtWeights, but their contribution is small compared with the weights).
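
As a quick sanity check of that size argument (a standalone sketch, not part of the PR), the number of bytes signed LEB128 needs for a given int16 value can be computed like this; every value in [-64, +63] fits in one byte, while values outside that range need two or more:

#include <cstdint>
#include <cstdio>

int leb128_encoded_size(std::int16_t v) {
    int bytes = 0;
    while (true) {
        std::uint8_t byte = v & 0x7f;
        v >>= 7;                       // arithmetic shift keeps the sign
        ++bytes;
        // done once the remaining value matches the sign bit of the last 7-bit group
        if ((v == 0 && !(byte & 0x40)) || (v == -1 && (byte & 0x40)))
            return bytes;
    }
}

int main() {
    std::printf("%d %d %d %d\n",
                leb128_encoded_size(50),     // 1 byte
                leb128_encoded_size(-64),    // 1 byte
                leb128_encoded_size(64),     // 2 bytes
                leb128_encoded_size(300));   // 2 bytes
}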

Also why do we use the signed version even though the unsigned pseudo code looks a bit simpler?

The feature_transformer weights are int16 (the biases and psqtWeights are also signed types), so we would have to change their type if we wanted to use compression for unsigned values. The first idea that comes to mind is to add +32768 to each value and use unsigned compression, but that would not save much memory, because the values from [-64...+63] would then no longer fit in a single byte (a weight of 0, for example, becomes 32768, which needs three 7-bit groups).

how LEB128 compares to other compression algos

I didn't do 'too deep' research; I just tried an algorithm that I already knew, checked that it reduces the net size by about 45%, and made a PR (by the way, I suggested this idea somewhere in the comments or on Discord a couple of years ago, when the net size was 20 MB and the reduced size would have been about 11-12 MB, but I didn't implement it then). Also, the algorithm is simple enough - we just add an implementation of reading ints, with no architectural changes (and even there we faced some problems - it was slow without buffers, and the Python implementation was not absolutely straightforward). Probably better algorithms exist, but remember that we also have to support the format in the Python trainer. So for me the ratio between code complexity and MBs saved is good enough.

@mstembera
Contributor

@MaximMolchanov Thank you for the nice explanation.

linrock pushed a commit to linrock/Stockfish that referenced this pull request Aug 26, 2023
Implemented LEB128 (de)compression for the feature transformer.
Reduces embedded network size from 70 MiB to 39 MiB.

The new nn-78bacfcee510.nnue corresponds to the master net compressed.

closes official-stockfish#4617

No functional change
linrock pushed a commit to linrock/Stockfish that referenced this pull request Aug 26, 2023
Implemented LEB128 (de)compression for the feature transformer.
Reduces embedded network size from 70 MiB to 39 MiB.

The new nn-78bacfcee510.nnue corresponds to the master net compressed.

closes official-stockfish#4617

No functional change