[UNDERTOW-2210] Improved write ASCII path on ServletPrintWriter #1424

franz1981 · 2022-12-14T00:16:07Z

No description provided.

franz1981 · 2022-12-14T00:16:31Z

Still WIP, I'll run some micro and end 2 end benchmark soon with the final implementation.

franz1981 · 2022-12-14T02:59:01Z

@fl4via @stuartwdouglas
I've added a JMH benchmark (a bit unfair because it doesn't iterate through different strings), which results are (on my machine):

Benchmark                      (encoderType)  (inLength)  (outCapacity)  Mode  Cnt     Score    Error  Units
AsciiEncodingBenchmark.encode          batch           7           8196  avgt   20     0.025 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch          19           8196  avgt   20     0.044 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch         248           8196  avgt   20     0.255 ±  0.002  us/op
AsciiEncodingBenchmark.encode          batch       12392           8196  avgt   20    12.686 ±  1.038  us/op
AsciiEncodingBenchmark.encode          batch      493727           8196  avgt   20   488.882 ±  2.891  us/op
AsciiEncodingBenchmark.encode        noBatch           7           8196  avgt   20     0.025 ±  0.001  us/op
AsciiEncodingBenchmark.encode        noBatch          19           8196  avgt   20     0.042 ±  0.001  us/op
AsciiEncodingBenchmark.encode        noBatch         248           8196  avgt   20     0.409 ±  0.001  us/op
AsciiEncodingBenchmark.encode        noBatch       12392           8196  avgt   20    17.131 ±  0.133  us/op
AsciiEncodingBenchmark.encode        noBatch      493727           8196  avgt   20   706.596 ±  6.517  us/op
AsciiEncodingBenchmark.encode        vanilla           7           8196  avgt   20     0.026 ±  0.001  us/op
AsciiEncodingBenchmark.encode        vanilla          19           8196  avgt   20     0.052 ±  0.001  us/op
AsciiEncodingBenchmark.encode        vanilla         248           8196  avgt   20     0.546 ±  0.006  us/op
AsciiEncodingBenchmark.encode        vanilla       12392           8196  avgt   20    28.323 ±  0.216  us/op
AsciiEncodingBenchmark.encode        vanilla      493727           8196  avgt   20  1089.093 ± 26.762  us/op

I didn't yet spend time into profiling (with perf) to see if bound check elimination has been effective and I still need to run a full fat end 2 end test with SpecJ, but code-wise, should be a in a decent state for a review.

carterkozak · 2022-12-14T14:31:03Z

👋 Hi there, @franz1981 :-)

Out of curiosity, how do the benchmarks above compare to some variation of ByteBuffer.put(string.getBytes(StandardCharsets.US_ASCII))? (relevant blogpost from Claes Redestad: https://cl4es.github.io/2021/10/17/Faster-Charset-Encoding.html -- I realize the allocation will have broader impact, but I'm curious about the comparison point!)

franz1981 · 2022-12-14T14:49:46Z

For smaller strings is very likely will perform better then what I have done, but not fully sure; let me do it, I am curious too
And Hi sir @carterkozak 😁 I was sure I would have triggered you eheh

franz1981 · 2022-12-14T14:50:50Z

And I didn't tried with varhandle byte buffer view (that will save, likely the reverse bytes due to endianess)

franz1981 · 2022-12-14T15:45:18Z

Results from SpecJ are very good: as expected it deliver a reduction of overhead ~2X
ie from 2.91% to 1.43% of cpu samples presence for the overall test duration and, for the specific stack trace related JSP
from 51.82% to 31,19% that's still very good.
I'll double check the ASM of JMH to see if there's anything better thanks to var handle byte buffer view and a comparison with lang encoders as I've promised to @carterkozak

franz1981 · 2022-12-14T22:16:36Z

Interestingly, JDK 19 (note: prev results were using JDK 11) seems to disagree:

Benchmark                      (encoderType)  (inLength)  (outCapacity)  Mode  Cnt    Score    Error  Units
AsciiEncodingBenchmark.encode          batch           7           8196  avgt   20    0.020 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch          19           8196  avgt   20    0.041 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch         248           8196  avgt   20    0.253 ±  0.013  us/op
AsciiEncodingBenchmark.encode          batch       12392           8196  avgt   20   12.771 ±  0.574  us/op
AsciiEncodingBenchmark.encode          batch      493727           8196  avgt   20  499.805 ± 11.393  us/op
AsciiEncodingBenchmark.encode        noBatch           7           8196  avgt   20    0.019 ±  0.001  us/op
AsciiEncodingBenchmark.encode        noBatch          19           8196  avgt   20    0.025 ±  0.001  us/op
AsciiEncodingBenchmark.encode        noBatch         248           8196  avgt   20    0.161 ±  0.004  us/op
AsciiEncodingBenchmark.encode        noBatch       12392           8196  avgt   20    8.856 ±  0.095  us/op
AsciiEncodingBenchmark.encode        noBatch      493727           8196  avgt   20  329.699 ±  1.925  us/op
AsciiEncodingBenchmark.encode        vanilla           7           8196  avgt   20    0.021 ±  0.001  us/op
AsciiEncodingBenchmark.encode        vanilla          19           8196  avgt   20    0.035 ±  0.001  us/op
AsciiEncodingBenchmark.encode        vanilla         248           8196  avgt   20    0.303 ±  0.007  us/op
AsciiEncodingBenchmark.encode        vanilla       12392           8196  avgt   20   19.122 ±  0.039  us/op
AsciiEncodingBenchmark.encode        vanilla      493727           8196  avgt   20  722.313 ± 59.218  us/op

and report the first commit of this PR to be the winner by some margin.

Still on JDK 19 using a VarHandle to write LittleEndian long data into the buffer instead, makes batch to be near the winner:

Benchmark                      (encoderType)  (inLength)  (outCapacity)  Mode  Cnt    Score    Error  Units
AsciiEncodingBenchmark.encode          batch           7           8196  avgt   20    0.019 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch          19           8196  avgt   20    0.033 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch         248           8196  avgt   20    0.185 ±  0.017  us/op
AsciiEncodingBenchmark.encode          batch       12392           8196  avgt   20    9.273 ±  0.511  us/op
AsciiEncodingBenchmark.encode          batch      493727           8196  avgt   20  384.584 ± 10.149  us/op

For reference, VarHandle is very bad on the same encoder type with JDK 11:

Benchmark                      (encoderType)  (inLength)  (outCapacity)  Mode  Cnt    Score   Error  Units
AsciiEncodingBenchmark.encode          batch           7           8196  avgt   20    0.029 ± 0.001  us/op
AsciiEncodingBenchmark.encode          batch          19           8196  avgt   20    0.045 ± 0.001  us/op
AsciiEncodingBenchmark.encode          batch         248           8196  avgt   20    0.275 ± 0.015  us/op
AsciiEncodingBenchmark.encode          batch       12392           8196  avgt   20   14.635 ± 0.582  us/op
AsciiEncodingBenchmark.encode          batch      493727           8196  avgt   20  608.942 ± 6.466  us/op

performing even worse then the ByteBuffer version

franz1981 · 2022-12-17T16:58:14Z

I believe this can be made even faster by reading chars as long via varhandle/methodhandle, will experiment with it next week.
I was hoping it was using SIMD to read chars into long already but verified that was just spilling (advice was coming from Richard Startin)

servlet/src/main/java/io/undertow/servlet/spec/ServletPrintWriter.java

franz1981 · 2022-12-21T00:22:45Z

servlet/src/main/java/io/undertow/servlet/spec/ServletPrintWriter.java

+            if (((batch1 | batch2) & 0xff80_ff80_ff80_ff80L) != 0) {
+                return i << 3;
+            }
+            final long batch = (batch1 << 8) | batch2;


@theRealAph I've removed

final long maskedBatch1 = (batch1 & 0x007f_007f_007f_007fL) << 8; final long maskedBatch2 = batch2 & 0x007f_007f_007f_007fL; final long batch = maskedBatch1 | maskedBatch2;

because of the previous check re being in [0, 127]

franz1981 · 2022-12-28T13:39:32Z

@fl4via this is now ready for review; it delivers a great speedup (up to 2X) across different JDK versions (11->19)

servlet/src/main/java/io/undertow/servlet/spec/ServletPrintWriter.java

benchmarks/src/main/java/io/undertow/benchmarks/AsciiEncoders.java

benchmarks/src/main/java/io/undertow/benchmarks/AsciiEncodingBenchmark.java

ropalka · 2023-01-24T15:08:55Z

Thanks @franz1981 !

franz1981 · 2023-01-24T15:08:59Z

Thanks @ropalka !!!

franz1981 · 2023-01-25T16:10:15Z

Placing this comment here to remember: https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L52 shows that whatever single byte char is fine, meaning 0-255 too!
This translate in a wider encoding domain (Windows-1252 which is a superset of ISO 8859-1, also called ISO Latin-1) then what we've used in undertow (and not only, it seems).
TLDR we can still perform the super fast path of single-byte encoding till 255 included and not just 127 (and the check mask of this optimized method should change accordly!)

[UNDERTOW-2210] Improved write ASCII path on ServletPrintWriter

cc58de2

franz1981 added 2 commits December 14, 2022 02:52

batch encoding

1314c01

JMH playground

de1f1af

franz1981 marked this pull request as ready for review December 14, 2022 02:59

single class import on JMH bench

5ecbd9c

theRealAph reviewed Dec 20, 2022

View reviewed changes

servlet/src/main/java/io/undertow/servlet/spec/ServletPrintWriter.java Outdated Show resolved Hide resolved

Further improving performance by using less tmp registers

a4c5633

franz1981 commented Dec 21, 2022

View reviewed changes

servlet/src/main/java/io/undertow/servlet/spec/ServletPrintWriter.java Show resolved Hide resolved

Removing un-necessary cleanup mask

592b99d

franz1981 commented Dec 21, 2022

View reviewed changes

fixed small error on JMH bench

50a2482

theRealAph reviewed Dec 30, 2022

View reviewed changes

servlet/src/main/java/io/undertow/servlet/spec/ServletPrintWriter.java Show resolved Hide resolved

fl4via added enhancement Enhances existing behaviour or code under verification Currently being verified (running tests, reviewing) before posting a review to contributor labels Jan 10, 2023

Removed unused ByteOrder on benchs

8565c76

fl4via requested changes Jan 10, 2023

View reviewed changes

benchmarks/src/main/java/io/undertow/benchmarks/AsciiEncoders.java Show resolved Hide resolved

benchmarks/src/main/java/io/undertow/benchmarks/AsciiEncodingBenchmark.java Outdated Show resolved Hide resolved

header fixes

24f9fb4

fl4via approved these changes Jan 13, 2023

View reviewed changes

fl4via added waiting peer review PRs that edit core classes might require an extra review and removed waiting CI check Ready to be merged but waiting for CI check waiting PR update Awaiting PR update(s) from contributor before merging labels Jan 13, 2023

ropalka removed the waiting peer review PRs that edit core classes might require an extra review label Jan 24, 2023

ropalka approved these changes Jan 24, 2023

View reviewed changes

ropalka merged commit 5aaa607 into undertow-io:master Jan 24, 2023

fl4via removed the next release This PR will be merged before next release or has already been merged (for payload double check) label Feb 14, 2023

franz1981 mentioned this pull request Mar 20, 2023

Build custom buffered input stream that can reuse tmp read buffers across indexing ops smallrye/jandex#302

Closed

franz1981 mentioned this pull request Nov 13, 2023

Use directBuffer with 8m initial capacity quarkiverse/quarkus-cxf#1105

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[UNDERTOW-2210] Improved write ASCII path on ServletPrintWriter #1424

[UNDERTOW-2210] Improved write ASCII path on ServletPrintWriter #1424

franz1981 commented Dec 14, 2022

franz1981 commented Dec 14, 2022

franz1981 commented Dec 14, 2022

carterkozak commented Dec 14, 2022

franz1981 commented Dec 14, 2022

franz1981 commented Dec 14, 2022

franz1981 commented Dec 14, 2022

franz1981 commented Dec 14, 2022 •

edited

Loading

franz1981 commented Dec 17, 2022 •

edited

Loading

franz1981 Dec 21, 2022

franz1981 commented Dec 28, 2022

ropalka commented Jan 24, 2023

franz1981 commented Jan 24, 2023

franz1981 commented Jan 25, 2023 •

edited

Loading

[UNDERTOW-2210] Improved write ASCII path on ServletPrintWriter #1424

[UNDERTOW-2210] Improved write ASCII path on ServletPrintWriter #1424

Conversation

franz1981 commented Dec 14, 2022

franz1981 commented Dec 14, 2022

franz1981 commented Dec 14, 2022

carterkozak commented Dec 14, 2022

franz1981 commented Dec 14, 2022

franz1981 commented Dec 14, 2022

franz1981 commented Dec 14, 2022

franz1981 commented Dec 14, 2022 • edited Loading

franz1981 commented Dec 17, 2022 • edited Loading

franz1981 Dec 21, 2022

Choose a reason for hiding this comment

franz1981 commented Dec 28, 2022

ropalka commented Jan 24, 2023

franz1981 commented Jan 24, 2023

franz1981 commented Jan 25, 2023 • edited Loading

franz1981 commented Dec 14, 2022 •

edited

Loading

franz1981 commented Dec 17, 2022 •

edited

Loading

franz1981 commented Jan 25, 2023 •

edited

Loading