
Implemented LEB128 in feature transformer #4617

Closed

Conversation

MaximMolchanov
Contributor

Implemented LEB128 compression. The current net size goes from 70 MiB down to 39 MiB.

inline IntType read_leb_128(std::istream& stream) {
    static_assert(std::is_signed_v<IntType>, "Not implemented for unsigned types");
    IntType result = 0, shift = 0;
    while (true) {
Member

an infinite loop here is not acceptable; it's asking for exploits

Contributor Author

Rewritten, no infinite loop here.

    IntType result = 0, shift = 0;
    while (true) {
        std::uint8_t byte;
        stream.read(reinterpret_cast<char*>(&byte), sizeof(std::uint8_t));
Member

read error should be handled somehow

Contributor Author

Should it also be done in read_little_endian and write_little_endian?

Member

It's not necessary there. Here it's needed to prevent an infinite loop. Other mitigations could also suffice.

Member

also, note that shifting by more than the type width is technically undefined behavior, so we should avoid that

Contributor Author

Avoided shifting by more than the type width now.
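
For reference, a minimal sketch of a signed LEB128 reader that reflects the three review points above: the loop is bounded by the width of IntType, a failed read stops decoding, and no shift exceeds the type width. This is an illustration only (the name and details are hypothetical, not the exact code merged in this PR), and it assumes the usual <cstdint>, <istream> and <type_traits> includes.

template <typename IntType>
inline IntType read_leb_128_sketch(std::istream& stream) {
    static_assert(std::is_signed_v<IntType>, "Not implemented for unsigned types");

    IntType result = 0;
    std::size_t shift = 0;
    constexpr std::size_t max_shift = sizeof(IntType) * 8;

    while (shift < max_shift) {      // bounded: at most ceil(width / 7) iterations
        std::uint8_t byte;
        stream.read(reinterpret_cast<char*>(&byte), sizeof(byte));
        if (!stream)
            break;                   // read error: stop instead of looping forever

        result |= static_cast<IntType>(byte & 0x7f) << shift;
        shift += 7;

        if ((byte & 0x80) == 0) {    // no continuation bit: last byte of this value
            // sign-extend if the sign bit of the final 7-bit group is set
            if (shift < max_shift && (byte & 0x40))
                result |= -(IntType(1) << shift);
            break;
        }
    }
    return result;
}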

@Sopel97
Member

Sopel97 commented Jun 13, 2023

Additionally I'd like to see some startup benchmarks. I don't expect much impact, but I've had larger nets time out in cutechess-cli due to taking too long to load.

@cngstefan

Also, I would do a non-regression test at STC just to be sure that really nothing breaks.

@locutus2
Member

The above post is me! Sorry, wrong user.

@vondele
Member

vondele commented Jun 13, 2023

Will this still do the right thing for all byte orderings? Since it removed write/read_little_endian ...

@Sopel97
Member

Sopel97 commented Jun 13, 2023

It reads byte by byte, so it should be endianness-agnostic.

@MaximMolchanov
Contributor Author

It looks like it should do the right thing for big endian, but I don't know how to test it properly 🙃

@MaximMolchanov MaximMolchanov marked this pull request as draft June 13, 2023 22:45
@Sopel97
Member

Sopel97 commented Jun 14, 2023

I'm concerned about the startup time

$ time -- ./stockfish-master.exe eval > /dev/null
real 0.15
user 0.00
sys 0.01

$ time -- ./stockfish-patch.exe eval > /dev/null
real 0.98
user 0.01
sys 0.00

@Sopel97
Member

Sopel97 commented Jun 14, 2023

Serialization on the Python side takes around 30 seconds when implemented naively in Python. I wonder if we should perhaps support both uncompressed and compressed networks, both in Stockfish and in the trainer. We would need to somehow squeeze a flag into the nnue format - it might or might not be possible to keep backwards compatibility.

@MaximMolchanov
Contributor Author

@Sopel97 What about https://pypi.org/project/leb128/ ? Is it faster?

@MaximMolchanov
Contributor Author

I have the same startup time on my laptop:
master:

$ time stockfish eval > /dev/null

real    0m0,075s
user    0m0,037s
sys     0m0,038s

patch:

$ time stockfish eval > /dev/null

real    0m0,074s
user    0m0,039s
sys     0m0,036s

@Sopel97
Member

Sopel97 commented Jun 14, 2023

@Sopel97 What about https://pypi.org/project/leb128/ ? Is it faster?

terrible, I gave up waiting after 2 minutes.

the two approaches, for the record



  def write_leb_128(self, value):
    value = int(value)
    while True:
      byte = value & 0x7f
      value = value >> 7
      if ((value == 0 and (byte & 0x40) == 0) or (value == -1 and (byte & 0x40) != 0)):
        self.buf.extend(byte.to_bytes(1, 'little'))
        return
      byte = byte | 0x80
      self.buf.extend(byte.to_bytes(1, 'little'))

  def write_leb_128_array(self, arr):
    for v in arr.numpy():
      self.write_leb_128(v)

  def write_leb_128_array(self, arr):
    for v in arr.numpy():
      self.buf.extend(leb128.i.encode(v))

edit.

  def write_leb_128_array(self, arr):
    for v in arr.numpy():
      self.buf.extend(leb128.i.encode(int(v)))

finished in around 30s too. It lacks an array API, so it's useless for us, and it's implemented in pure Python, which is just bad.

@Sopel97
Member

Sopel97 commented Jun 14, 2023

I have the same startup time on my laptop: master:

$ time stockfish eval > /dev/null

real    0m0,075s
user    0m0,037s
sys     0m0,038s

patch:

$ time stockfish eval > /dev/null

real    0m0,074s
user    0m0,039s
sys     0m0,036s

are you sure you tested this properly? stockfish invoked like that would not run the local stockfish executable but a global one

@MaximMolchanov
Contributor Author

Oh, nice catch. Yes, the difference is quite big:

$ time ./stockfish eval > /dev/null

real    0m0,064s
user    0m0,028s
sys     0m0,036s

vs

$ time ./stockfish eval > /dev/null

real    0m0,495s
user    0m0,469s
sys     0m0,025s

@Torom
Contributor

Torom commented Jun 14, 2023

Startup time on a Raspberry Pi 4:

master:

$ time ./stockfish eval > /dev/null

real    0m0.339s
user    0m0.241s
sys     0m0.097s

patch:

$ time ./stockfish.nn-leb-128 eval > /dev/null

real    0m1.977s
user    0m1.884s
sys     0m0.090s

@XInTheDark
Contributor

Similar difference observed on a MacBook M1. Arguably it will not make a difference in actual gameplay, but it is still considerable enough?

master:

% time ./stockfish eval > /dev/null
./stockfish eval > /dev/null  0.07s user 0.02s system 93% cpu 0.088 total

patch:

% time ./stockfish eval > /dev/null
./stockfish eval > /dev/null  0.55s user 0.02s system 99% cpu 0.576 total

@MaximMolchanov
Contributor Author

Thanks to everyone who posted benchmarks; I'd appreciate it if you could run them one more time 😄

Here are my new results:

 time ./stockfish eval > /dev/null

real    0m0,121s
user    0m0,105s
sys     0m0,016s

@XInTheDark
Contributor

My new results:

 % time ./stockfish eval > /dev/null
./stockfish eval > /dev/null  0.13s user 0.02s system 95% cpu 0.155 total

Huge improvement over the original patch!

@Sopel97
Member

Sopel97 commented Jun 16, 2023

$ time -- ./stockfish-patch.exe eval > /dev/null
real 0.25
user 0.00
sys 0.01

Performance looks OK now. I'm still torn on whether we should support both compressed and uncompressed networks, at least for the time being, as there's no tooling for compression yet. We have a bit of leeway in the new format, so we should probably add a discernible flag.

@vondele
Member

vondele commented Jun 16, 2023

IMO performance is good now.

I would prefer at this phase not to have the choice between compressed and not.

If this is OK, we should go for it. However, I fully agree we need the tooling in the trainer to read/write this format. For the tooling we might need the capability to convert non-compressed to compressed, just so that we can do training from older nets, and convert e.g. nets that are running now.

So, from the SF point of view, this looks OK, but can't be merged before we have the tooling part done, IMO.

@Sopel97
Member

Sopel97 commented Jun 16, 2023

if we can agree on a discriminator in the file format I can work on the trainer side. I think we should be fine if we just add some long-ish magic string before each compressed layer. "COMPRESSED_LEB128" ?

@vondele
Member

vondele commented Jun 16, 2023

I think the magic string is fine, if we need it. On the other hand we don't have magic strings for the architecture version, we expect users to pick the right arch on the command line. Up to you.

@Sopel97
Member

Sopel97 commented Jun 17, 2023

Preliminary unoptimized serializer/deserializer. https://github.com/Sopel97/nnue-pytorch/tree/leb. Right now requires the leb128 package. It is slow, but I'll implement the encoding/decoding natively later.

example for compressed/uncompressed round-trips

python serialize.py --features=HalfKAv2_hm nn-fdc1d0fe6455.nnue a.nnue --ft_compression=leb128
python serialize.py --features=HalfKAv2_hm a.nnue b.nnue --ft_compression=leb128
python serialize.py --features=HalfKAv2_hm a.nnue c.nnue

I made some alterations to the format. Each saved compressed tensor has a small header that consists of the magic string "COMPRESSED_LEB" encoded in UTF-8, followed by a 4-byte little-endian integer giving the number of bytes the compressed data of this tensor occupies. This is the same as in this PR, with the addition of the magic string. The rationale for the magic string is that we can detect the encoding during network conversion, and can assert earlier when loading the network in the engine.
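
For illustration only (the function name and error handling here are hypothetical, not taken from this PR or the trainer), an engine-side loader could consume that per-tensor header roughly like this: read and verify the magic string, then assemble the 4-byte little-endian length byte by byte so it stays independent of the host byte order. Assumes <cstdint>, <cstring> and <istream>.

bool read_compressed_header(std::istream& stream, std::uint32_t& compressed_bytes) {
    constexpr char        Magic[]  = "COMPRESSED_LEB";
    constexpr std::size_t MagicLen = sizeof(Magic) - 1;   // the file stores no trailing '\0'

    char buf[MagicLen];
    stream.read(buf, MagicLen);
    if (!stream || std::memcmp(buf, Magic, MagicLen) != 0)
        return false;                                      // not a compressed tensor

    std::uint8_t b[4];
    stream.read(reinterpret_cast<char*>(&b[0]), 4);
    if (!stream)
        return false;
    compressed_bytes = std::uint32_t(b[0]) | (std::uint32_t(b[1]) << 8)
                     | (std::uint32_t(b[2]) << 16) | (std::uint32_t(b[3]) << 24);
    return true;
}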

@MaximMolchanov
Contributor Author

MaximMolchanov commented Jun 18, 2023

Preliminary unoptimized serializer/deserializer. https://github.com/Sopel97/nnue-pytorch/tree/leb. Right now requires the leb128 package. It is slow, but I'll implement the encoding/decoding natively later.

example for compressed/uncompressed round-trips

python serialize.py --features=HalfKAv2_hm nn-fdc1d0fe6455.nnue a.nnue --ft_compression=leb128
python serialize.py --features=HalfKAv2_hm a.nnue b.nnue --ft_compression=leb128
python serialize.py --features=HalfKAv2_hm a.nnue c.nnue

I made some alterations to the format. Each saved compressed tensor has a small header that consists of the magic string "COMPRESSED_LEB" encoded in UTF-8, followed by a 4-byte little-endian integer giving the number of bytes the compressed data of this tensor occupies. This is the same as in this PR, with the addition of the magic string. The rationale for the magic string is that we can detect the encoding during network conversion, and can assert earlier when loading the network in the engine.

Here is my implementation of read_leb_128_array; it should be about 2-3 times faster:

  def read_leb_128_array(self, dtype, shape):
    l = self.read_int32()
    d = self.f.read(l)
    if len(d) != l:
      raise Exception('Unexpected end of file when reading compressed data.')

    n = reduce(operator.mul, shape, 1)
    ints = np.zeros(n, dtype=dtype)
    k = 0
    for i in range(n):
      r = 0
      shift = 0
      while True:
        byte = d[k]
        k = k + 1
        r |= (byte & 0x7f) << shift
        shift += 7
        if (byte & 0x80) == 0:
          # last byte of this value: sign-extend if the sign bit (0x40) of the final group is set
          ints[i] = r if (byte & 0x40) == 0 else r | ~((1 << shift) - 1)
          break
    res = torch.FloatTensor(ints)
    res = res.reshape(shape)
    return res

EDIT:

I've managed to make it much faster (almost the same time as no compression) using numba; here is a patch (it can be copied and applied using the git apply command):

diff --git a/serialize.py b/serialize.py
index 8ce9ebe..3ed1162 100644
--- a/serialize.py
+++ b/serialize.py
@@ -11,7 +11,8 @@ import pytorch_lightning as pl
 from torch.utils.data import DataLoader
 from functools import reduce
 import operator
-import leb128
+import numpy as np
+from numba import njit
 
 def ascii_hist(name, x, bins=6):
   N,X = numpy.histogram(x, bins=bins)
@@ -25,6 +26,36 @@ def ascii_hist(name, x, bins=6):
     xi = '{0: <8.4g}'.format(xi).ljust(10)
     print('{0}| {1}'.format(xi,bar))
 
+@njit
+def encode_leb_128_array(arr):
+  res = []
+  for v in arr:
+    while True:
+      byte = v & 0x7f
+      v = v >> 7
+      if (v == 0 and byte & 0x40 == 0) or (v == -1 and byte & 0x40 != 0):
+        res.append(byte)
+        break
+      res.append(byte | 0x80)
+  return res
+
+@njit
+def decode_leb_128_array(d, n):
+  ints = np.zeros(n)
+  k = 0
+  for i in range(n):
+    r = 0
+    shift = 0
+    while True:
+      byte = d[k]
+      k = k + 1
+      r |= (byte & 0x7f) << shift
+      shift += 7
+      if (byte & 0x80) == 0:
+        ints[i] = r if (byte & 0x40) == 0 else r | ~((1 << shift) - 1)
+        break
+  return ints
+
 # hardcoded for now
 VERSION = 0x7AF32F20
 DEFAULT_DESCRIPTION = "Network trained with the https://github.com/glinscott/nnue-pytorch trainer."
@@ -79,9 +110,7 @@ class NNUEWriter():
     self.buf.extend(encoded_description)
 
   def write_leb_128_array(self, arr):
-    buf = bytearray()
-    for v in arr:
-      buf.extend(leb128.i.encode(int(v)))
+    buf = encode_leb_128_array(arr)
     self.int32(len(buf))
     self.buf.extend(buf)
 
@@ -195,9 +224,7 @@ class NNUEReader():
     if len(d) != l:
       raise Exception('Unexpected end of file when reading compressed data.')
 
-    inp = io.BytesIO(d)
-    ints = [leb128.i.decode_reader(inp)[0] for i in range(reduce(operator.mul, shape, 1))]
-    res = torch.FloatTensor(ints)
+    res = torch.FloatTensor(decode_leb_128_array(d, reduce(operator.mul, shape, 1)))
     res = res.reshape(shape)
     return res
 

@Sopel97
Member

Sopel97 commented Jun 19, 2023

Using numba indeed provides huge benefits, and I think it's reasonable to add it as a dependency, thanks. I made a finalized PR on the trainer side: official-stockfish/nnue-pytorch#251. Could you update this PR to take into account the format changes?

@MaximMolchanov
Contributor Author

Using NUMBA indeed provides huge benefits, and I think it's reasonable to add it as a dependency, thanks. I made a finalized PR on the trainer side. glinscott/nnue-pytorch#251. Could you update this PR to take into account the format changes?

Sure, added writing the magic string and 'eating' it while reading.

Also attaching the previously successful non-regression test: https://tests.stockfishchess.org/tests/view/6488f524f42a44347ed7b763

@MaximMolchanov MaximMolchanov marked this pull request as ready for review June 19, 2023 17:22
@vondele vondele added the to be merged Will be merged shortly label Jun 19, 2023
@vondele vondele closed this in a46087e Jun 19, 2023
@mstembera
Contributor

Maybe now the extra step of compressing the binaries on abrok from 43MB to 33MB is no longer worth it?

@vondele
Member

vondele commented Jun 20, 2023

Note that since today, we have the latest dev builds available as pre-releases on github
https://github.com/official-stockfish/Stockfish/releases

As the binaries need to be distributed with the license etc., having them in a zip or tar file is quite a good solution, IMO.

@mstembera
Contributor

Nice. Are these gcc or the faster clang?

@vondele
Member

vondele commented Jun 20, 2023

Right now gcc, but we should pin to a compiler version and can improve as needed.
What's important is that testing and deployment use the same compiler, which right now seems easier with gcc.

rn5f107s2 pushed a commit to rn5f107s2/Stockfish that referenced this pull request Jun 22, 2023
Implemented LEB128 (de)compression for the feature transformer.
Reduces embedded network size from 70 MiB to 39 MiB.

The new nn-78bacfcee510.nnue corresponds to the master net compressed.

closes official-stockfish#4617

No functional change
@mstembera
Contributor

@MaximMolchanov Sorry for the late question but I'm curious... I read the link https://en.wikipedia.org/wiki/LEB128 at the top of this PR but it doesn't really explain how LEB128 compares to other compression algos. That is, why is LEB128 a particularly good choice for us? Also, why do we use the signed version even though the unsigned pseudocode looks a bit simpler? Thanks!

@MaximMolchanov
Contributor Author

@mstembera

why is LEB128 a particularly good choice for us?

In short: LEB128 is useful when there are lots of small values relative to the type's range. So, technically, if we have int16 and almost all values are in the range [-64...+63], then LEB128 works well.
For us the main gain is in the feature_transformer's weights. In the current net (nn-c38c3d8d3920) we have 46137344 weights, and 42002644 of them are in the range [-64..+63]. That means for those 42002644 values we can use 1 byte of memory instead of two, saving about 42 MB (the same idea applies to the biases and psqtWeights, but their contribution is small compared with the weights).
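
As a quick sanity check of that size argument (a standalone sketch, not part of the PR), the number of bytes signed LEB128 needs for a given int16 value can be computed like this; every value in [-64, +63] fits in one byte, while values outside that range need two or more:

#include <cstdint>
#include <cstdio>

int leb128_encoded_size(std::int16_t v) {
    int bytes = 0;
    while (true) {
        std::uint8_t byte = v & 0x7f;
        v >>= 7;                       // arithmetic shift keeps the sign
        ++bytes;
        // done once the remaining value matches the sign bit of the last 7-bit group
        if ((v == 0 && !(byte & 0x40)) || (v == -1 && (byte & 0x40)))
            return bytes;
    }
}

int main() {
    std::printf("%d %d %d %d\n",
                leb128_encoded_size(50),     // 1 byte
                leb128_encoded_size(-64),    // 1 byte
                leb128_encoded_size(64),     // 2 bytes
                leb128_encoded_size(300));   // 2 bytes
}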

Also why do we use the signed version even though the unsigned pseudo code looks a bit simpler?

The feature_transformer weights are int16 (the biases and psqtWeights are also signed types), so we would have to change their type if we wanted to use compression for unsigned values. The first idea that comes to mind is to add +32768 to each value and use unsigned compression, but that would not save much memory, because the values from [-64...+63] would then no longer fit in a single byte (a weight of 0, for example, becomes 32768, which needs three 7-bit groups).

how LEB128 compares to other compression algos

I didn't do 'too deep' research; I just tried an algorithm that I already knew, checked that it reduces the net size by about 45%, and made a PR (by the way, I suggested this idea somewhere in the comments or on Discord a couple of years ago, when the net size was 20 MB and the reduced size would have been about 11-12 MB, but I didn't implement it then). Also, the algorithm is simple enough - we just add an implementation of reading ints, with no architectural changes (and even there we faced some problems - it was slow without buffers, and the Python implementation was not absolutely straightforward). Probably better algorithms exist, but remember that we also have to support the format in the Python trainer. So for me the ratio between code complexity and MBs saved is good enough.

@mstembera
Contributor

@MaximMolchanov Thank you for the nice explanation.

linrock pushed a commit to linrock/Stockfish that referenced this pull request Aug 26, 2023
Implemented LEB128 (de)compression for the feature transformer.
Reduces embedded network size from 70 MiB to 39 MiB.

The new nn-78bacfcee510.nnue corresponds to the master net compressed.

closes official-stockfish#4617

No functional change
linrock pushed a commit to linrock/Stockfish that referenced this pull request Aug 26, 2023
Implemented LEB128 (de)compression for the feature transformer.
Reduces embedded network size from 70 MiB to 39 MiB.

The new nn-78bacfcee510.nnue corresponds to the master net compressed.

closes official-stockfish#4617

No functional change