Benchmark branchless numeric parser #23

Open: 1 commit to be merged into base: main

jabolina
Member

For reference: netty/netty@4.1...franz1981:netty:4.1_branchless_varint

The current parser is InfinispanParser, and the new one is BranchlessParser. The new one needs a lot more code.
Running on an Intel i7-9850H.

Integer:

Benchmark                                  (width)  Mode  Cnt  Score   Error  Units
IntegerBenchmark.parseVarint32Branchless         1  avgt   15  2.831 ± 0.132  ns/op
IntegerBenchmark.parseVarint32Branchless        15  avgt   15  4.451 ± 0.284  ns/op
IntegerBenchmark.parseVarint32Branchless        24  avgt   15  4.569 ± 0.183  ns/op
IntegerBenchmark.parseVarint32Branchless        31  avgt   15  5.756 ± 0.116  ns/op
IntegerBenchmark.parserVarint32Infinispan        1  avgt   15  2.328 ± 0.060  ns/op
IntegerBenchmark.parserVarint32Infinispan       15  avgt   15  5.309 ± 0.180  ns/op
IntegerBenchmark.parserVarint32Infinispan       24  avgt   15  7.218 ± 0.258  ns/op
IntegerBenchmark.parserVarint32Infinispan       31  avgt   15  9.469 ± 0.608  ns/op

Long:

Benchmark                               (width)  Mode  Cnt   Score   Error  Units
LongBenchmark.parseVarint32Branchless         1  avgt   15   2.641 ± 0.105  ns/op
LongBenchmark.parseVarint32Branchless        24  avgt   15   4.941 ± 0.448  ns/op
LongBenchmark.parseVarint32Branchless        31  avgt   15   6.611 ± 0.141  ns/op
LongBenchmark.parseVarint32Branchless        48  avgt   15   6.670 ± 0.213  ns/op
LongBenchmark.parserVarint32Infinispan        1  avgt   15   2.267 ± 0.075  ns/op
LongBenchmark.parserVarint32Infinispan       24  avgt   15   5.903 ± 0.129  ns/op
LongBenchmark.parserVarint32Infinispan       31  avgt   15   7.897 ± 0.306  ns/op
LongBenchmark.parserVarint32Infinispan       48  avgt   15   9.179 ± 0.407  ns/op
LongBenchmark.parserVarint32Infinispan       63  avgt   15  12.485 ± 0.236  ns/op

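At width 1 both parsers perform similarly (Infinispan is even slightly ahead: 2.328 vs 2.831 ns/op for integers), so for small numbers this change may be unnecessary. For the bigger numbers, the branchless parser is clearly faster (for example, 5.756 vs 9.469 ns/op at width 31 for integers).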

@franz1981

franz1981 commented Jun 18, 2024

I would improve the benchmark so that it causes branch mispredictions, i.e. use a large enough, reproducible set of inputs with different varint sizes; see https://github.com/netty/netty/blob/151dfa083d28e995a18f7d2c73d4a7d3b7ab73b2/microbench/src/main/java/io/netty/handler/codec/protobuf/VarintDecodingBenchmark.java#L46 for reference.

Unless the point is that you always expect the data to have one specific size/length.
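
A minimal sketch of what such inputs could look like, assuming a plain unsigned 7-bit varint encoding and a fixed seed for reproducibility. The class and method names (`MixedVarintInputs`, `encodeVarint`, `generate`) are made up for illustration and are not taken from the linked Netty benchmark.

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;

// Hypothetical helper: produces encoded varints of mixed byte lengths.
final class MixedVarintInputs {

    // Encodes a non-negative int as an unsigned varint (7 payload bits per byte,
    // high bit set on every byte except the last).
    static byte[] encodeVarint(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Generates `count` encoded values. The encoded length (1..5 bytes) is picked
    // at random per element, so consecutive inputs rarely share the same length.
    static byte[][] generate(int count, long seed) {
        Random random = new Random(seed); // fixed seed => same inputs on every run
        byte[][] inputs = new byte[count][];
        for (int i = 0; i < count; i++) {
            int bytes = 1 + random.nextInt(5);
            // Smallest and largest non-negative values that need exactly `bytes` bytes.
            int low = bytes == 1 ? 0 : 1 << ((bytes - 1) * 7);
            int high = bytes == 5 ? Integer.MAX_VALUE : (1 << (bytes * 7)) - 1;
            int value = low + random.nextInt(high - low + 1);
            inputs[i] = encodeVarint(value);
        }
        return inputs;
    }
}
```

With a large enough `count`, the length of the next value is hard for the branch predictor to guess, which is the situation a branchless parser is meant to handle well.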

@jabolina
Member Author

Thanks, @franz1981. That seems better. I'll try updating the benchmark.

We have some updates planned for Hot Rod to reduce the client to a single connection and improve the batching/pipelining of commands. This change would make the buffer size vary between submissions. Internally, it should likely help for individual commands, although I'm not 100% sure about that. However, updating the benchmark would better reflect the actual usage.

@jabolina
Member Author

Some months passed and I finally applied the suggestions.

Results:

Benchmark                                     (elementType)  (inputDistribution)  (inputs)  Mode  Cnt   Score   Error  Units
NumericParserBenchmark.parseNumberBranchless            INT                SMALL         1  avgt   20   4.810 ± 0.223  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL         1  avgt   20   5.298 ± 0.539  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                SMALL       128  avgt   20   4.164 ± 0.195  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL       128  avgt   20   6.688 ± 2.269  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                SMALL    128000  avgt   20   9.111 ± 0.593  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL    128000  avgt   20  15.544 ± 1.132  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                LARGE         1  avgt   20   6.619 ± 0.289  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                LARGE         1  avgt   20  14.317 ± 1.525  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM         1  avgt   20   3.613 ± 0.129  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM         1  avgt   20   4.670 ± 0.533  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM       128  avgt   20   4.772 ± 1.000  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM       128  avgt   20   7.183 ± 1.055  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM    128000  avgt   20  11.379 ± 0.620  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM    128000  avgt   20  16.079 ± 0.971  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL         1  avgt   20   3.759 ± 0.267  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL         1  avgt   20   5.182 ± 0.792  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL       128  avgt   20   5.842 ± 0.149  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL       128  avgt   20   8.953 ± 1.271  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL    128000  avgt   20  16.172 ± 1.114  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL    128000  avgt   20  17.555 ± 1.504  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL         1  avgt   20   5.394 ± 0.255  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL         1  avgt   20   7.804 ± 0.524  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL       128  avgt   20   5.676 ± 0.396  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL       128  avgt   20   7.493 ± 1.073  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL    128000  avgt   20  12.252 ± 1.023  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL    128000  avgt   20  16.152 ± 0.754  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                LARGE         1  avgt   20   6.225 ± 0.232  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                LARGE         1  avgt   20  18.368 ± 1.396  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM         1  avgt   20   4.283 ± 0.129  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM         1  avgt   20   6.546 ± 0.623  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM       128  avgt   20   6.917 ± 0.386  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM       128  avgt   20  10.355 ± 1.676  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM    128000  avgt   20  14.167 ± 0.840  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM    128000  avgt   20  18.337 ± 1.120  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL         1  avgt   20   6.648 ± 0.169  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL         1  avgt   20  16.557 ± 2.431  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL       128  avgt   20   6.811 ± 0.096  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL       128  avgt   20  12.062 ± 1.354  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL    128000  avgt   20  14.776 ± 0.908  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL    128000  avgt   20  20.272 ± 1.042  ns/op

It seems to win a few ns on all of the tests. Not sure if there is a way we can add more optimizations 🤔

@franz1981

Eheh, probably not. And although it looks like just a few ns, if you look at the ratio (or the throughput) it is a HUGE improvement, no?

Well done @jabolina, I'm happy if someone uses it!
In short, it is better in every case, let's say.
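
To make the ratio argument concrete, here is a small sketch that turns two ns/op scores into a speedup factor and a throughput. The scores are copied from the LONG / LARGE / 1 rows of the results above; the class and method names are made up for illustration.

```java
// Hypothetical helper for reading JMH avgt scores as ratios.
final class SpeedupRatio {

    // How many times faster the candidate is than the baseline.
    static double speedup(double baselineNsPerOp, double candidateNsPerOp) {
        return baselineNsPerOp / candidateNsPerOp;
    }

    // Converts an average time per operation into operations per second.
    static double opsPerSecond(double nsPerOp) {
        return 1_000_000_000.0 / nsPerOp;
    }

    public static void main(String[] args) {
        double infinispan = 18.368; // parseNumberInfinispan, LONG, LARGE, 1 input
        double branchless = 6.225;  // parseNumberBranchless, LONG, LARGE, 1 input
        System.out.printf("speedup: %.2fx%n", speedup(infinispan, branchless));
        System.out.printf("throughput: %.1f vs %.1f Mops/s%n",
                opsPerSecond(branchless) / 1e6, opsPerSecond(infinispan) / 1e6);
    }
}
```

A 12 ns difference looks small in isolation, but here it means the branchless parser handles nearly three times as many values per second.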


// Now we isolate the bits in sequence. We check 14 bits at a time.
// The intervals are 0-14 bits, 16-30 (and shift 2), 32-46 (and shift 2 + 2), 48-62 (and shift 2 + 2 + 2).
return (continuation & 0x3FFF) |

You can use Long::compress if you are on a recent enough JDK (it is available since 19), with a mask that isolates the bits you need "to compress";
see https://docs.oracle.com/en/java/javase/20/docs/api/java.base/java/lang/Long.html#compress(long,long)
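
A sketch of what the compression step does, written so it also compiles on JDK 17. `compressBits` is a hypothetical, loop-based stand-in with the same result as `Long.compress(value, mask)`; it only illustrates the semantics and is neither branchless nor fast. The sketch assumes the varint bytes were loaded little-endian into a `long` and that any bytes after the terminating byte were already masked out.

```java
// Hypothetical illustration of gathering varint payload bits.
final class VarintCompaction {

    // Selects the 7 payload bits of every byte, skipping each continuation bit.
    static final long PAYLOAD_MASK = 0x7F7F7F7F7F7F7F7FL;

    // Gathers the bits of `value` selected by `mask` into the low-order bits
    // of the result, preserving their order.
    static long compressBits(long value, long mask) {
        long result = 0;
        int outBit = 0;
        while (mask != 0) {
            long lowest = mask & -mask;   // lowest set bit still in the mask
            if ((value & lowest) != 0) {
                result |= 1L << outBit;
            }
            outBit++;
            mask &= mask - 1;             // drop that bit from the mask
        }
        return result;
    }

    public static void main(String[] args) {
        // Varint bytes {0xAC, 0x02} loaded little-endian.
        long raw = 0x02ACL;
        System.out.println(compressBits(raw, PAYLOAD_MASK));
    }
}
```

On JDK 19 or later, `Long.compress(raw, PAYLOAD_MASK)` should give the same value in a single call, replacing the chain of mask-and-shift terms.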

Member Author

Thanks! We're on 17, but I'll add it as a comment to the ISPN code.

@jabolina
Member Author

jabolina commented Sep 3, 2024

While integrating it into ISPN, I noticed that some of my bit calculations were wrong. I've fixed them; it now works with ISPN and continues to perform better.

Benchmark                                     (elementType)  (inputDistribution)  (inputs)  Mode  Cnt   Score   Error  Units
NumericParserBenchmark.parseNumberBranchless            INT                SMALL         1  avgt   20   4.418 ± 0.177  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL         1  avgt   20   4.619 ± 0.163  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                SMALL       128  avgt   20   3.897 ± 0.042  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL       128  avgt   20   4.045 ± 0.057  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                SMALL    128000  avgt   20   8.441 ± 0.111  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL    128000  avgt   20   8.396 ± 0.241  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                LARGE         1  avgt   20   6.751 ± 0.199  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                LARGE         1  avgt   20  10.472 ± 1.770  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM         1  avgt   20   3.507 ± 0.046  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM         1  avgt   20   3.657 ± 0.334  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM       128  avgt   20   4.501 ± 0.117  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM       128  avgt   20   5.597 ± 0.399  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM    128000  avgt   20  10.232 ± 0.269  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM    128000  avgt   20  13.567 ± 1.036  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL         1  avgt   20   3.638 ± 0.067  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL         1  avgt   20   3.974 ± 0.267  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL       128  avgt   20   5.899 ± 0.260  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL       128  avgt   20   7.243 ± 0.854  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL    128000  avgt   20  14.735 ± 0.504  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL    128000  avgt   20  15.390 ± 1.167  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL         1  avgt   20   5.032 ± 0.063  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL         1  avgt   20   6.954 ± 0.612  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL       128  avgt   20   5.118 ± 0.033  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL       128  avgt   20   6.322 ± 0.467  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL    128000  avgt   20  11.584 ± 0.167  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL    128000  avgt   20  14.818 ± 0.995  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                LARGE         1  avgt   20   7.519 ± 0.242  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                LARGE         1  avgt   20  18.525 ± 2.720  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM         1  avgt   20   4.226 ± 0.076  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM         1  avgt   20   6.359 ± 0.639  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM       128  avgt   20   6.510 ± 0.296  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM       128  avgt   20   9.137 ± 0.793  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM    128000  avgt   20  13.374 ± 0.548  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM    128000  avgt   20  16.068 ± 1.016  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL         1  avgt   20   6.496 ± 0.061  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL         1  avgt   20  12.838 ± 1.337  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL       128  avgt   20   6.824 ± 0.151  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL       128  avgt   20   9.497 ± 0.915  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL    128000  avgt   20  13.800 ± 0.346  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL    128000  avgt   20  16.096 ± 0.666  ns/op

Larger values show the biggest improvements. Medium values improve by up to a few ns. Smaller values perform slightly better, by a few hundred picoseconds at most (INT SMALL with 128000 inputs is a tie within the error margin). That still means an improvement overall. I'll open the PR to ISPN right after some cleanup.

@jabolina
Member Author

jabolina commented Sep 3, 2024

@franz1981, you might notice that the method to read smaller values (24 bits) differs from the one used by Netty. On ISPN, we have some tests that slowly replay a buffer to check that we don't consume more bytes than we should. Maybe this is not an issue for Netty. The code below reproduces it: it simulates a buffer that has received only the first 3 bytes of a larger integer.

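    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public static void main(String[] args) {
        // Integer.MAX_VALUE as vint: {-43, -1, -1, -1, 7, 0};
        // Only the first 3 bytes have arrived; all of them have the continuation bit set.
        ByteBuf buf = Unpooled.buffer(6);
        buf.writeByte(-43);
        buf.writeByte(-1);
        buf.writeByte(-1);

        // Expected: no data read and no data consumed (run with -ea to enable assertions).
        assert 0 == readRawVarint24(buf.resetReaderIndex());
        assert 0 == buf.readerIndex();
    }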

    private static int readRawVarint24(ByteBuf buffer) {
        // From Netty.
        if (!buffer.isReadable()) {
            return 0;
        }
        buffer.markReaderIndex();

        byte tmp = buffer.readByte();
        if (tmp >= 0) {
            return tmp;
        }
        int result = tmp & 127;
        if (!buffer.isReadable()) {
            buffer.resetReaderIndex();
            return 0;
        }
        if ((tmp = buffer.readByte()) >= 0) {
            return result | tmp << 7;
        }
        result |= (tmp & 127) << 7;
        if (!buffer.isReadable()) {
            buffer.resetReaderIndex();
            return 0;
        }
        if ((tmp = buffer.readByte()) >= 0) {
            return result | tmp << 14;
        }
        return result | (tmp & 127) << 14;
    }

@franz1981

In Netty I avoided this (and the reader mark) by first checking that the buffer has enough readable bytes, and by reading bytes at an absolute offset, which does not move the reader index.
This way you can decide how far to advance the reader index based on the outcome of the branchless decoding (that is the "skipBytes" part in the Netty PR).
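
A rough sketch of that peek-then-skip pattern. To stay self-contained it uses `java.nio.ByteBuffer` with absolute `get(int)` instead of Netty's `ByteBuf`, and a plain loop instead of the branchless decoding. The class name and the `-1` "incomplete" sentinel are assumptions for illustration, not what the Netty PR does.

```java
import java.nio.ByteBuffer;

// Hypothetical reader that never consumes bytes of an incomplete varint.
final class PeekingVarintReader {

    // Decodes a varint of up to 3 bytes. Returns the value and advances the
    // position past it, or returns -1 and leaves the position untouched when
    // no terminating byte is found within the readable bytes (at most 3).
    static int readVarint24(ByteBuffer buffer) {
        int start = buffer.position();
        int available = Math.min(buffer.remaining(), 3);
        int result = 0;
        for (int i = 0; i < available; i++) {
            byte b = buffer.get(start + i);       // absolute read: position does not move
            result |= (b & 0x7F) << (7 * i);
            if (b >= 0) {                         // continuation bit clear: value complete
                buffer.position(start + i + 1);   // consume only the bytes that were used
                return result;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // First 3 bytes of a longer varint: nothing should be consumed.
        ByteBuffer partial = ByteBuffer.wrap(new byte[] {-43, -1, -1});
        System.out.println(readVarint24(partial) + " at position " + partial.position());
    }
}
```

Because the position only moves once a complete value has been seen, a replayed partial buffer can be retried later without any mark/reset bookkeeping.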
