Performance Benchmarking Result of Kafka with Client Encryption on Graviton is Worse Compared to Non-Graviton #110
Comments
Looking through the flamegraphs, the issue seems to be in bufferCrypt().
I stumbled across this article (https://aws.amazon.com/blogs/opensource/introducing-amazon-corretto-crypto-provider-accp/) and decided to give it a go. The results look much better than before: both Graviton and non-Graviton improve, but the improvement on Graviton is proportionally so much larger that its raw performance is now better than non-Graviton. Find attached the result. It seems to me that ACCP is already using native instructions (via OpenSSL), and that definitely brings a lot of advantages.
That’s an astute analysis.
To your question: I think 4 parallel AES blocks should suffice on Graviton2, but I will double-check with AWS internal folks and come back with a recommendation to make it more generic for future cores.
At this point, a NEON implementation should suffice; there is no need for SVE/SVE2 versions, as they won’t change the perf outcome.
On May 31, 2021, at 10:23 AM, Volker Simonis ***@***.***> wrote:
History
OpenJDK has intrinsic support on both x86 and aarch64 for the basic AES block encryption/decryption operations. It's implemented in the corresponding generate_aescrypt_encryptBlock()/generate_aescrypt_decryptBlock() stubs, which are used in LibraryCallKit::inline_aescrypt_Block() as substitutes for implEncryptBlock()/implDecryptBlock() in the class com.sun.crypto.provider.AESCrypt.
However, x86 has some additional optimizations/intrinsics for the AES "Counter" mode (both "AES/CTR" and "AES/GCM", i.e. "Galois/Counter Mode") which are missing on aarch64 and led to the observed performance degradation on aarch64. They were introduced by the following changes:
8143925: Enhancing CounterMode.crypt() for AES<https://bugs.openjdk.java.net/browse/JDK-8143925>
Add intrinsic for CounterMode.crypt() to leverage the parallel nature of AES in Counter(CTR) Mode.
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/cb31a76eecd1
http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/72f54de44772
From the issue summary:
The request is to leverage the parallel nature of AES in Counter (CTR) Mode. In a single threaded implementation, this can be achieved by issuing independent x86 AES-NI instructions.
Presently, there is an intrinsic for AESCrypt.implEncryptBlock(), which is called by CounterMode.crypt() method. However, the intrinsic works on one block at a time. The x86 AES-NI instructions have a latency of 6 or 7 clocks depending on the architecture. Since every AESENC instructions issued by this intrinsic is dependent on the earlier one, it does not take advantage of the CPU pipeline.
We can optimize the performance of CounterMode.crypt() method by 4x-6x by issuing independent instructions from up to 6 blocks in parallel.
The change intrinsifies com.sun.crypto.provider.CounterMode::implCrypt() if and only if CounterMode's embedded cipher is of type com.sun.crypto.provider.AESCrypt. The stub for x86_64 is implemented in generate_counterMode_AESCrypt_Parallel().
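The "parallel nature" of CTR mode that this intrinsic exploits can be illustrated in plain Java. The following is a minimal sketch, not the HotSpot stub: it assumes the standard `AES/CTR/NoPadding` and `AES/ECB/NoPadding` transformations, and the class and method names are hypothetical. Each keystream block depends only on the key and on (initial counter + block index), so the blocks can be computed in any order, which is what allows a stub to keep several independent AES rounds in flight.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.util.Arrays;

// Illustrative only: class and method names are hypothetical, not JDK internals.
final class CtrBlocksAreIndependent {
    static final int BLOCK = 16; // AES block size in bytes

    // Reference result from the JDK's own AES/CTR cipher.
    static byte[] ctrReference(byte[] key, byte[] iv, byte[] data) {
        try {
            Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
            ctr.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return ctr.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Counter block number i: the initial counter plus i, as a 128-bit big-endian integer.
    static byte[] counterBlock(byte[] iv, long i) {
        byte[] ctr = iv.clone();
        long carry = i;
        for (int p = BLOCK - 1; p >= 0 && carry != 0; p--) {
            long sum = (ctr[p] & 0xFF) + (carry & 0xFF);
            ctr[p] = (byte) sum;
            carry = (carry >>> 8) + (sum >>> 8);
        }
        return ctr;
    }

    // Manual CTR: encrypt each counter block on its own and XOR it with the data.
    static byte[] ctrManual(byte[] key, byte[] iv, byte[] data) {
        try {
            Cipher ecb = Cipher.getInstance("AES/ECB/NoPadding");
            ecb.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            byte[] out = new byte[data.length];
            int blocks = (data.length + BLOCK - 1) / BLOCK;
            // Walk the blocks in reverse on purpose: the order does not matter.
            for (int i = blocks - 1; i >= 0; i--) {
                byte[] keystream = ecb.doFinal(counterBlock(iv, i));
                int off = i * BLOCK;
                int len = Math.min(BLOCK, data.length - off);
                for (int j = 0; j < len; j++) {
                    out[off + j] = (byte) (data[off + j] ^ keystream[j]);
                }
            }
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] key = new byte[16];
        byte[] iv = new byte[16];
        byte[] data = new byte[100];
        Arrays.fill(key, (byte) 0x2A);
        Arrays.fill(iv, (byte) 0x01);
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        boolean same = Arrays.equals(ctrReference(key, iv, data), ctrManual(key, iv, data));
        System.out.println("manual CTR matches AES/CTR/NoPadding: " + same);
    }
}
```

The reverse loop is only there to make the independence visible; a real intrinsic would instead interleave the rounds of several blocks in straight-line code.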
8177784: Use CounterMode intrinsic for AES/GCM<https://bugs.openjdk.java.net/browse/JDK-8177784>
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/0c8f43317c1f
This is a platform independent change to extend 8143925, which initially only applied to AES/CTR, to also work for AES/GCM. Mentioned here only for completeness.
Later, 8143925 was further improved for AVX512 and the Vector AES instructions:
8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions<https://bugs.openjdk.java.net/browse/JDK-8233741>
https://hg.openjdk.java.net/jdk/jdk/rev/c6a789f495fe
From the issue summary:
As per the Intel Architecture Instruction Set Reference<https://software.intel.com/sites/default/files/managed/ad/01/253666-sdm-vol-2a.pdf>, p.156-159 Vector AES (VAES) Operations will be supported in future Intel ISA. I would like to contribute an optimization for AES-CTR algorithm using AVX512+VAES instructions. This optimization is for x86_64 architecture that have AVX512-VAES enabled. I ran jtreg test suite with the algorithm on Intel SDE<https://software.intel.com/en-us/articles/intel-software-development-emulator> to confirm that encoding and semantics are correctly implemented.
The new, vectorized intrinsic for CounterMode::implCrypt() on x86_64 is implemented in generate_counterMode_VectorAESCrypt().
ToDo
As Intel mentioned in their change for 8143925, their AES instructions have a latency of 6/7 clock cycles, so they process up to 6 blocks in parallel in 8143925 to completely fill the pipeline. Depending on the latency of the AES instructions on Graviton 2, we should implement a similar intrinsic for aarch64 as well. I've created 8267993: [aarch64] Implement intrinsic for CounterMode::implCrypt()<https://bugs.openjdk.java.net/browse/JDK-8267993> to track this in OpenJDK upstream.
Arm® Neoverse™ N1 Software Optimization Guide<https://documentation-service.arm.com/static/5f05e93dcafe527e86f61acd?token=>, p.58 mentions the following:
4.6 AES encryption/decryption
Neoverse N1 can issue two AESE/AESMC/AESD/AESIMC instruction every cycle (fully pipelined)
with an execution latency of two cycles. This means encryption or decryption for at least four data
chunks should be interleaved for maximum performance:
AESE data0, key0
AESMC data0, data0
AESE data1, key0
AESMC data1, data1
AESE data2, key0
AESMC data2, data2
AESE data3, key1
AESMC data3, data3
AESE data0, key0
AESMC data0, data0
...
Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when
they are adjacent in the program code and both instructions use the same destination register.
Does it make sense to process more than 4 blocks in parallel?
I also read about the ARM SVE2-AES extension<https://developer.arm.com/documentation/ddi0602/latest/SVE-Instructions/AESE--AES-single-round-encryption-> which can probably be used to implement something similar to 8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions<https://bugs.openjdk.java.net/browse/JDK-8233741>, but it looks like SVE2-AES will only become available in ARMv9.
hi, @h3nd24 |
Here is the PR of Interleave GCM. please note it's off by default, user needs to explicitly enable it by |
This patch has been backported to OpenJDK 11.0.14.
Here is the result of the corretto-11 nightly build on an r6g instance.
For dataSize/keyLength = (1024/128), we can see 3.5~5x more throughput.
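For readers who want a quick comparison on their own instances, here is a rough timing sketch for AES/GCM with 1024-byte messages and a 128-bit key. It is not the benchmark that produced the numbers in this thread and it is not a substitute for a proper JMH run; the class and method names are hypothetical, and the all-zero key is for timing only.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;

// Rough timing sketch (not JMH); all names here are hypothetical.
final class AesGcmThroughput {
    static Cipher newGcm() {
        try {
            return Cipher.getInstance("AES/GCM/NoPadding");
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Runs one GCM operation with a 96-bit nonce derived from a counter,
    // so that no (key, nonce) pair is ever reused for encryption.
    static byte[] crypt(Cipher gcm, int mode, byte[] key, long nonce, byte[] input) {
        try {
            byte[] iv = ByteBuffer.allocate(12).putLong(4, nonce).array();
            gcm.init(mode, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
            return gcm.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encrypts `iterations` messages of `dataSize` bytes and returns MiB per second.
    static double measureMBps(int dataSize, int keyBits, int iterations) {
        Cipher gcm = newGcm();
        byte[] key = new byte[keyBits / 8]; // all-zero key: acceptable for timing only
        byte[] plain = new byte[dataSize];
        long nonce = 0;
        // Warm up so the JIT can compile the hot path before timing starts.
        for (int i = 0; i < iterations; i++) {
            crypt(gcm, Cipher.ENCRYPT_MODE, key, nonce++, plain);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            crypt(gcm, Cipher.ENCRYPT_MODE, key, nonce++, plain);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (double) dataSize * iterations / (1024.0 * 1024.0) / seconds;
    }

    public static void main(String[] args) {
        System.out.printf("AES-128/GCM, 1024-byte messages: %.1f MiB/s%n",
                measureMBps(1024, 128, 200_000));
    }
}
```

Running it once with and once without `-XX:+UnlockDiagnosticVMOptions -XX:+UseAESCTRIntrinsics` on the same JDK build gives a coarse indication of whether the intrinsic is taking effect.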
This PR adds the options -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics to the jvm.config files to enable intrinsic support for AES CTR/GCM on ARM64. It considerably improves the performance of network communication with S3 on Graviton instances. The intrinsic is enabled by default in Java 18, which has already been released, so the risk is minimal. In recent Java (OpenJDK) versions the performance of AES CTR/GCM on AArch64 was significantly improved, and the improvement was backported to OpenJDK 11. Backport PR: openjdk/jdk11u-dev#410. Some additional explanation is here: aws/aws-graviton-getting-started#110 (comment). The original OpenJDK issue is JDK-8267993 (https://bugs.openjdk.java.net/browse/JDK-8267993). To use the backport, we need to enable it explicitly with -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics. It was not enabled by default in the backport as a conservative measure.
This PR adds the options -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics to the JVM's options to enable intrinsic support for AES CTR/GCM on ARM64. It considerably improves the performance of network communication with S3 on Graviton instances. The intrinsic is enabled by default in Java 18, which has already been released, so the risk is minimal. In recent Java (OpenJDK) versions the performance of AES CTR/GCM on AArch64 was significantly improved, and the improvement was backported to OpenJDK 11. Backport PR: openjdk/jdk11u-dev#410. Some additional explanation is here: aws/aws-graviton-getting-started#110 (comment). The original OpenJDK issue is JDK-8267993 (https://bugs.openjdk.java.net/browse/JDK-8267993). To use the backport, we need to enable it explicitly with -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics. It was not enabled by default in the backport as a conservative measure.
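To confirm from inside a running application whether the flag actually took effect, one option is to query it through the HotSpot diagnostic MXBean. This is a minimal sketch assuming a HotSpot-based JDK with the `jdk.management` module available; the class name and the fallback strings are hypothetical. On builds where the flag is diagnostic and has not been unlocked, the lookup may fail, which the sketch reports instead of throwing.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper; the class name and fallback strings are hypothetical.
final class AesCtrFlagCheck {
    // Returns the current value of a HotSpot flag, or a short explanation
    // when the flag cannot be read on this JVM.
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return "unknown (no HotSpot diagnostic MXBean)";
        }
        try {
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when the option does not exist or is not visible,
            // e.g. a diagnostic flag that has not been unlocked.
            return "unavailable (" + name + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println("UseAESIntrinsics    = " + flagValue("UseAESIntrinsics"));
        System.out.println("UseAESCTRIntrinsics = " + flagValue("UseAESCTRIntrinsics"));
    }
}
```

Starting the JVM with `-XX:+UnlockDiagnosticVMOptions -XX:+UseAESCTRIntrinsics` and running this check should print `true` for the second flag on a build that contains the backport and on hardware with AES support.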
Resolving, as the JDK has the necessary backports and the changes that make other projects use the flag have been committed.
Hi, we were running a fixed-load performance test of Kafka on Graviton (r6g.large) vs. non-Graviton (r5.large), and it seems that Kafka on Graviton does far worse than its non-Graviton counterpart (only around half the throughput). The setup:
Here are the flame graphs of the two runs, sampled at 1000 Hz for 1 minute under load, in the zip file. flame-G is for the Graviton node and flame-NG is for the non-Graviton node.
FlameGraphs.zip
It seems to me that on Graviton we spend much more time encoding the encrypted message. Is there a known issue and a workaround for this? For additional information, I ran the same benchmarking setup with client encryption turned off. In that case, Graviton performs better than its non-Graviton counterpart. Find attached the benchmarking result.
BenchmarkingResult.tar.gz
Thanks for your help.