Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UNDERTOW-2210] Improved write ASCII path on ServletPrintWriter #1424

Merged
merged 9 commits into from
Jan 24, 2023

Conversation

franz1981
Copy link
Contributor

No description provided.

@franz1981
Copy link
Contributor Author

Still WIP, I'll run some micro and end 2 end benchmark soon with the final implementation.

@franz1981
Copy link
Contributor Author

@fl4via @stuartwdouglas
I've added a JMH benchmark (a bit unfair because it doesn't iterate through different strings), which results are (on my machine):

Benchmark                      (encoderType)  (inLength)  (outCapacity)  Mode  Cnt     Score    Error  Units
AsciiEncodingBenchmark.encode          batch           7           8196  avgt   20     0.025 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch          19           8196  avgt   20     0.044 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch         248           8196  avgt   20     0.255 ±  0.002  us/op
AsciiEncodingBenchmark.encode          batch       12392           8196  avgt   20    12.686 ±  1.038  us/op
AsciiEncodingBenchmark.encode          batch      493727           8196  avgt   20   488.882 ±  2.891  us/op
AsciiEncodingBenchmark.encode        noBatch           7           8196  avgt   20     0.025 ±  0.001  us/op
AsciiEncodingBenchmark.encode        noBatch          19           8196  avgt   20     0.042 ±  0.001  us/op
AsciiEncodingBenchmark.encode        noBatch         248           8196  avgt   20     0.409 ±  0.001  us/op
AsciiEncodingBenchmark.encode        noBatch       12392           8196  avgt   20    17.131 ±  0.133  us/op
AsciiEncodingBenchmark.encode        noBatch      493727           8196  avgt   20   706.596 ±  6.517  us/op
AsciiEncodingBenchmark.encode        vanilla           7           8196  avgt   20     0.026 ±  0.001  us/op
AsciiEncodingBenchmark.encode        vanilla          19           8196  avgt   20     0.052 ±  0.001  us/op
AsciiEncodingBenchmark.encode        vanilla         248           8196  avgt   20     0.546 ±  0.006  us/op
AsciiEncodingBenchmark.encode        vanilla       12392           8196  avgt   20    28.323 ±  0.216  us/op
AsciiEncodingBenchmark.encode        vanilla      493727           8196  avgt   20  1089.093 ± 26.762  us/op

I didn't yet spend time into profiling (with perf) to see if bound check elimination has been effective and I still need to run a full fat end 2 end test with SpecJ, but code-wise, should be a in a decent state for a review.

@franz1981 franz1981 marked this pull request as ready for review December 14, 2022 02:59
@carterkozak
Copy link
Contributor

👋 Hi there, @franz1981 :-)

Out of curiosity, how do the benchmarks above compare to some variation of ByteBuffer.put(string.getBytes(StandardCharsets.US_ASCII))? (relevant blogpost from Claes Redestad: https://cl4es.github.io/2021/10/17/Faster-Charset-Encoding.html -- I realize the allocation will have broader impact, but I'm curious about the comparison point!)

@franz1981
Copy link
Contributor Author

For smaller strings is very likely will perform better then what I have done, but not fully sure; let me do it, I am curious too
And Hi sir @carterkozak 😁 I was sure I would have triggered you eheh

@franz1981
Copy link
Contributor Author

And I didn't tried with varhandle byte buffer view (that will save, likely the reverse bytes due to endianess)

@franz1981
Copy link
Contributor Author

Results from SpecJ are very good: as expected it deliver a reduction of overhead ~2X
ie from 2.91% to 1.43% of cpu samples presence for the overall test duration and, for the specific stack trace related JSP
from 51.82% to 31,19% that's still very good.
I'll double check the ASM of JMH to see if there's anything better thanks to var handle byte buffer view and a comparison with lang encoders as I've promised to @carterkozak

@franz1981
Copy link
Contributor Author

franz1981 commented Dec 14, 2022

Interestingly, JDK 19 (note: prev results were using JDK 11) seems to disagree:

Benchmark                      (encoderType)  (inLength)  (outCapacity)  Mode  Cnt    Score    Error  Units
AsciiEncodingBenchmark.encode          batch           7           8196  avgt   20    0.020 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch          19           8196  avgt   20    0.041 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch         248           8196  avgt   20    0.253 ±  0.013  us/op
AsciiEncodingBenchmark.encode          batch       12392           8196  avgt   20   12.771 ±  0.574  us/op
AsciiEncodingBenchmark.encode          batch      493727           8196  avgt   20  499.805 ± 11.393  us/op
AsciiEncodingBenchmark.encode        noBatch           7           8196  avgt   20    0.019 ±  0.001  us/op
AsciiEncodingBenchmark.encode        noBatch          19           8196  avgt   20    0.025 ±  0.001  us/op
AsciiEncodingBenchmark.encode        noBatch         248           8196  avgt   20    0.161 ±  0.004  us/op
AsciiEncodingBenchmark.encode        noBatch       12392           8196  avgt   20    8.856 ±  0.095  us/op
AsciiEncodingBenchmark.encode        noBatch      493727           8196  avgt   20  329.699 ±  1.925  us/op
AsciiEncodingBenchmark.encode        vanilla           7           8196  avgt   20    0.021 ±  0.001  us/op
AsciiEncodingBenchmark.encode        vanilla          19           8196  avgt   20    0.035 ±  0.001  us/op
AsciiEncodingBenchmark.encode        vanilla         248           8196  avgt   20    0.303 ±  0.007  us/op
AsciiEncodingBenchmark.encode        vanilla       12392           8196  avgt   20   19.122 ±  0.039  us/op
AsciiEncodingBenchmark.encode        vanilla      493727           8196  avgt   20  722.313 ± 59.218  us/op

and report the first commit of this PR to be the winner by some margin.

Still on JDK 19 using a VarHandle to write LittleEndian long data into the buffer instead, makes batch to be near the winner:

Benchmark                      (encoderType)  (inLength)  (outCapacity)  Mode  Cnt    Score    Error  Units
AsciiEncodingBenchmark.encode          batch           7           8196  avgt   20    0.019 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch          19           8196  avgt   20    0.033 ±  0.001  us/op
AsciiEncodingBenchmark.encode          batch         248           8196  avgt   20    0.185 ±  0.017  us/op
AsciiEncodingBenchmark.encode          batch       12392           8196  avgt   20    9.273 ±  0.511  us/op
AsciiEncodingBenchmark.encode          batch      493727           8196  avgt   20  384.584 ± 10.149  us/op

For reference, VarHandle is very bad on the same encoder type with JDK 11:

Benchmark                      (encoderType)  (inLength)  (outCapacity)  Mode  Cnt    Score   Error  Units
AsciiEncodingBenchmark.encode          batch           7           8196  avgt   20    0.029 ± 0.001  us/op
AsciiEncodingBenchmark.encode          batch          19           8196  avgt   20    0.045 ± 0.001  us/op
AsciiEncodingBenchmark.encode          batch         248           8196  avgt   20    0.275 ± 0.015  us/op
AsciiEncodingBenchmark.encode          batch       12392           8196  avgt   20   14.635 ± 0.582  us/op
AsciiEncodingBenchmark.encode          batch      493727           8196  avgt   20  608.942 ± 6.466  us/op

performing even worse then the ByteBuffer version

@franz1981
Copy link
Contributor Author

franz1981 commented Dec 17, 2022

I believe this can be made even faster by reading chars as long via varhandle/methodhandle, will experiment with it next week.
I was hoping it was using SIMD to read chars into long already but verified that was just spilling (advice was coming from Richard Startin)

if (((batch1 | batch2) & 0xff80_ff80_ff80_ff80L) != 0) {
return i << 3;
}
final long batch = (batch1 << 8) | batch2;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theRealAph I've removed

            final long maskedBatch1 = (batch1 & 0x007f_007f_007f_007fL) << 8;
            final long maskedBatch2 = batch2 & 0x007f_007f_007f_007fL;
            final long batch = maskedBatch1 | maskedBatch2;

because of the previous check re being in [0, 127]

@franz1981
Copy link
Contributor Author

@fl4via this is now ready for review; it delivers a great speedup (up to 2X) across different JDK versions (11->19)

@fl4via fl4via added enhancement Enhances existing behaviour or code under verification Currently being verified (running tests, reviewing) before posting a review to contributor labels Jan 10, 2023
@fl4via fl4via added next release This PR will be merged before next release or has already been merged (for payload double check) waiting PR update Awaiting PR update(s) from contributor before merging waiting CI check Ready to be merged but waiting for CI check and removed under verification Currently being verified (running tests, reviewing) before posting a review to contributor labels Jan 10, 2023
@fl4via fl4via added waiting peer review PRs that edit core classes might require an extra review and removed waiting CI check Ready to be merged but waiting for CI check waiting PR update Awaiting PR update(s) from contributor before merging labels Jan 13, 2023
@ropalka ropalka removed the waiting peer review PRs that edit core classes might require an extra review label Jan 24, 2023
@ropalka ropalka merged commit 5aaa607 into undertow-io:master Jan 24, 2023
@ropalka
Copy link
Contributor

ropalka commented Jan 24, 2023

Thanks @franz1981 !

@franz1981
Copy link
Contributor Author

Thanks @ropalka !!!

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 25, 2023

Placing this comment here to remember: https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L52 shows that whatever single byte char is fine, meaning 0-255 too!
This translate in a wider encoding domain (Windows-1252 which is a superset of ISO 8859-1, also called ISO Latin-1) then what we've used in undertow (and not only, it seems).
TLDR we can still perform the super fast path of single-byte encoding till 255 included and not just 127 (and the check mask of this optimized method should change accordly!)

@fl4via fl4via removed the next release This PR will be merged before next release or has already been merged (for payload double check) label Feb 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances existing behaviour or code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants