Implemented LEB128 in feature transformer #4617
Conversation
src/nnue/nnue_common.h (Outdated)
```cpp
inline IntType read_leb_128(std::istream& stream) {
    static_assert(std::is_signed_v<IntType>, "Not implemented for unsigned types");
    IntType result = 0, shift = 0;
    while (true) {
```
an infinite loop here is not acceptable, asking for exploits
Rewritten, no infinite loop here.
src/nnue/nnue_common.h (Outdated)
```cpp
    IntType result = 0, shift = 0;
    while (true) {
        std::uint8_t byte;
        stream.read(reinterpret_cast<char*>(&byte), sizeof(std::uint8_t));
```
read error should be handled somehow
Should it also be done in read_little_endian and write_little_endian?
It's not necessary there. Here it's needed to prevent an infinite loop. Other mitigations could also suffice.
Also, note that shifting by more than the type width is technically undefined behavior, so we should avoid that.
Avoided shifting by more than the type width now.
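For the record, here is a minimal sketch that combines the three points raised in this thread: a bounded loop, read-error handling, and no shifts past the type width. The structure is illustrative only, not the exact code merged in the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <type_traits>

template<typename IntType>
inline IntType read_leb_128(std::istream& stream) {
    static_assert(std::is_signed_v<IntType>, "Not implemented for unsigned types");

    // At most ceil(bits / 7) bytes encode any value of IntType,
    // so the loop is bounded and cannot spin forever.
    constexpr std::size_t MaxBytes = (sizeof(IntType) * 8 + 6) / 7;

    using UIntType = std::make_unsigned_t<IntType>;
    UIntType result = 0;
    std::size_t shift = 0;

    for (std::size_t i = 0; i < MaxBytes; ++i) {
        std::uint8_t byte;
        stream.read(reinterpret_cast<char*>(&byte), sizeof(byte));
        if (!stream)
            return 0;  // read error: fail safely instead of looping

        // Unsigned arithmetic keeps every shift well-defined.
        result |= static_cast<UIntType>(byte & 0x7f) << shift;
        shift += 7;

        if ((byte & 0x80) == 0) {
            // Last byte: sign-extend if negative and not already at full width.
            if (shift < sizeof(IntType) * 8 && (byte & 0x40))
                result |= static_cast<UIntType>(~UIntType(0)) << shift;
            break;
        }
    }
    // Assumes the usual two's-complement round-trip for the final cast.
    return static_cast<IntType>(result);
}
```

Assembling the value byte by byte like this is also what makes the decoder independent of the host's endianness.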
Additionally I'd like to see some startup benchmarks. I don't expect much impact, but I've had larger nets time out in cutechess-cli due to taking too long to load a net.
Also I would do a non-regression test at STC only, to be sure that really nothing breaks.
Above post is me! Sorry, wrong user.
Will this still do the right thing for all byte orderings? Since it removed write/read_little_endian ...
It reads byte by byte, so it should be endianness-agnostic.
It looks like it should do the right thing for big endian, but I don't know how to test it properly 🙃
I'm concerned about the startup time.
Serialization on the Python side takes around 30 seconds when implemented naively in Python. I wonder if we should perhaps support both uncompressed and compressed networks, both in Stockfish and the trainer. We would need to somehow squeeze a flag into the NNUE format - it might or might not be possible to keep backwards compatibility.
@Sopel97 What about https://pypi.org/project/leb128/? Is it faster?
I have the same startup time on my laptop:
patch:
Terrible, I gave up waiting after 2 minutes. The two approaches, for the record:
Edit: finished in around 30s too. It lacks an array API, so it is useless, and it is implemented in Python, which is just bad.
Are you sure you tested this properly?
Oh, nice catch. Yes, the difference is too big:
vs
Startup time on a Raspberry Pi 4: master:
patch:
Similar difference observed on a MacBook M1. Arguably it will not make a difference in actual gameplay, but it is still considerable enough? master:
patch:
Thanks to everyone who posted benchmarks; I would appreciate it if you could do it one more time 😄 Here are my new results:
My new results:
Huge improvement over the original patch!
Performance looks OK now. I'm still torn on whether we should support both compressed and uncompressed networks, at least for the time being, as there's no tooling for compression yet. We have a bit of leeway in the new format, so we should probably add a discernible flag.
IMO performance is good now. I would prefer, at this phase, not to have the choice between compressed and uncompressed. If this is OK, we should go for it. However, I fully agree we need the tooling in the trainer to read/write this format. For the tooling we might need the capability to convert non-compressed to compressed, just so that we can do training from older nets and convert e.g. nets that are running now. So, from the SF point of view, this looks OK, but it can't be merged before we have the tooling part done, IMO.
If we can agree on a discriminator in the file format, I can work on the trainer side. I think we should be fine if we just add some long-ish magic string before each compressed layer. "COMPRESSED_LEB128"?
I think the magic string is fine, if we need it. On the other hand, we don't have magic strings for the architecture version; we expect users to pick the right arch on the command line. Up to you.
Preliminary unoptimized serializer/deserializer: https://github.com/Sopel97/nnue-pytorch/tree/leb. Right now it requires the leb128 package. It is slow, but I'll implement the encoding/decoding natively later. Example of compressed/uncompressed round-trips:
I made some alterations to the format. Each saved compressed tensor has a small header that consists of the magic string "COMPRESSED_LEB" encoded in UTF-8, followed by a 4-byte little-endian integer giving the number of bytes the compressed data of this tensor occupies. This is the same as in this PR, with the addition of the magic string. The rationale for the magic string is that we're able to detect the encoding during network conversion, and we could assert earlier when loading the network in the engine.
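For illustration, a hedged sketch of how the engine side could read and validate such a per-tensor header. The function and constant names are assumptions, not the identifiers actually used in the PR or the trainer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <istream>

// Hypothetical names for illustration.
constexpr char LEB128Magic[] = "COMPRESSED_LEB";
constexpr std::size_t LEB128MagicSize = sizeof(LEB128Magic) - 1;  // exclude '\0'

// Reads and checks the per-tensor header; returns the number of
// compressed bytes that follow, or 0 on any failure.
inline std::uint32_t read_leb_128_header(std::istream& stream) {
    char magic[LEB128MagicSize];
    stream.read(magic, LEB128MagicSize);
    if (!stream || std::memcmp(magic, LEB128Magic, LEB128MagicSize) != 0)
        return 0;  // wrong encoding or truncated file

    // 4-byte little-endian length, assembled byte by byte so the
    // result does not depend on host endianness.
    std::uint8_t b[4];
    stream.read(reinterpret_cast<char*>(b), sizeof(b));
    if (!stream)
        return 0;
    return std::uint32_t(b[0])       | std::uint32_t(b[1]) << 8
         | std::uint32_t(b[2]) << 16 | std::uint32_t(b[3]) << 24;
}
```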
Here is my implementation of
EDIT: I've succeeded in making it much faster (almost the same time as no compression) using numba; here is a patch (can be copied and applied using
Using numba indeed provides huge benefits, and I think it's reasonable to add it as a dependency, thanks. I made a finalized PR on the trainer side: official-stockfish/nnue-pytorch#251. Could you update this PR to take into account the format changes?
Sure, added string writing and 'eating' while reading. Also attaching the previously successful non-regression test: https://tests.stockfishchess.org/tests/view/6488f524f42a44347ed7b763
Maybe now the extra step of compressing the binaries on abrok from 43 MB to 33 MB is no longer worth it?
Note that since today we have the latest dev builds available as pre-releases on GitHub. As the binaries need to be distributed with the license etc., having them in a zip or tar file is quite a good solution, IMO.
Nice. Are these gcc or the faster clang?
Right now gcc, but we should pin to a compiler version and can improve as needed.
Implemented LEB128 (de)compression for the feature transformer. Reduces embedded network size from 70 MiB to 39 MiB. The new nn-78bacfcee510.nnue corresponds to the master net compressed. closes official-stockfish#4617 No functional change
@MaximMolchanov Sorry for the late question, but I'm curious... I read the link https://en.wikipedia.org/wiki/LEB128 at the top of this PR, but it doesn't really explain how LEB128 compares to other compression algorithms. That is, why is LEB128 a particularly good choice for us? Also, why do we use the signed version even though the unsigned pseudocode looks a bit simpler? Thanks!
In short: LEB128 is useful when there are lots of small values but the maximum possible value is big. So, technically, if we have int16 and almost all values are in the range [-64...+63], then LEB128 is a good fit.
We have the int16 type for feature_transformer weights (and also signed types for biases and psqtWeights); we would have to change the type if we wanted to use compression for unsigned types. The first idea that comes to mind is to add +32768 to each value and use unsigned compression, but it would not reduce memory much, because then the values from [-64...+63] would take two bytes.
I didn't do very deep research; I just tried an algorithm that I knew before, checked that it reduces the net size by about 45%, and made a PR (by the way, I suggested this idea somewhere in comments or on Discord a couple of years ago, when the net size was 20 MB and the reduced size would have been about 11-12 MB, but I didn't implement it then). Also, the algorithm is simple enough: we just add an implementation of reading ints, without architectural changes (and even there we faced some problems - it was slow without buffers, and the Python implementation was not absolutely straightforward). Better algorithms probably exist, but remember that we also have to support the format in the Python trainer. So for me the ratio between code complication and MBs saved is good enough.
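To make the size argument concrete, here is a minimal sketch of the signed LEB128 write side (assumed names, not the trainer's actual implementation). A value in [-64...+63] fits entirely in the first 7-bit group (6 payload bits plus the sign bit), so most feature-transformer weights cost one byte instead of two.

```cpp
#include <cstdint>
#include <ostream>
#include <type_traits>

template<typename IntType>
void write_leb_128(std::ostream& stream, IntType value) {
    static_assert(std::is_signed_v<IntType>, "Not implemented for unsigned types");

    bool done = false;
    while (!done) {
        std::uint8_t byte = value & 0x7f;
        value >>= 7;  // arithmetic shift on signed types replicates the sign bit

        // Stop once the remaining bits are pure sign extension and bit 6
        // of the emitted byte already matches the value's sign.
        done = (value == 0 && (byte & 0x40) == 0)
            || (value == -1 && (byte & 0x40) != 0);
        if (!done)
            byte |= 0x80;  // set the continuation bit
        stream.write(reinterpret_cast<const char*>(&byte), 1);
    }
}

// write_leb_128<std::int16_t>(out, 63) emits 0x3f       (one byte)
// write_leb_128<std::int16_t>(out, 64) emits 0xc0 0x00  (two bytes)
```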
@MaximMolchanov Thank you for the nice explanation.
Implemented LEB128 compression. Current net size changed from 70 MiB to 39 MiB.