Performance Benchmarking Result of Kafka with CLient Encryption in Graviton is Worse compared to Non-Graviton #110

h3nd24 · 2021-05-24T01:30:29Z

Hi, we were trying to do a fixed load performance test Kafka on Graviton (r6g.large) vs non-Graviton (r5.large) and it seems that Kafka on Graviton is doing way worse than it's Non-Graviton counterpart (only around half the throughput). The setup:

Client encryption is on
Created stressor boxes (one producer and one consumer) in the same data centre as the Kafka cluster
Using librdkafka to stress test
The rate is 7m messages/second at 100bytes each
Using AWS Corretto
Kafka 2.6.1
Generated the symbols as per https://www.slideshare.net/brendangregg/java-performance-analysis-on-linux-with-flame-graphs

Here are the flame graph of the two runs, sampled at 1000hz for 1 minute during load in the zip file. flame-G is for Graviton node and flame-NG is for non-Graviton node.
FlameGraphs.zip

Seems to me that in Graviton we spend much more time in encoding the encrypted message. Is there a known issue and workaround for this? For additional information, I did the same benchmarking setup except I turned off the client encryption. The result is that Graviton performs better than Non-Graviton counterpart. Find attached the benchmarking result
BenchmarkingResult.tar.gz

Thanks for your help.

sebpop · 2021-05-26T00:37:54Z

Looking through the flamegraphs, the issue seems to be in bufferCrypt()
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/javax/crypto/CipherSpi.java#L749
The function takes 15.3% of all samples on non-Graviton and 28.59% of samples on Graviton.
bufferCrypt() calls more native functions in non-Graviton: jbyte_disjoint_array, ghash_processBlocks, counterMode_AESCrypt
than on Graviton: aescrypt_encrypt
We need to dive in and see which of the native functions need to be ported to arm64.

h3nd24 · 2021-05-27T09:02:44Z

Stumbled across this article and decided to give it a go https://aws.amazon.com/blogs/opensource/introducing-amazon-corretto-crypto-provider-accp/ . The result looks much better than before, whereby there are improvements for both Graviton and Non-Graviton, but the improvement on Graviton is of much higher ratio that it's raw performance is better than Non-Graviton now. Find attached the result
BenchmarkingResultWithACCP.tar.gz

In a sense it seems to me ACCP is kinda doing some native instructions already (via openSSL), and that definitely brings a lot of advantages.

simonis · 2021-05-31T17:22:26Z

History

OpenJDK has intrinsic support on both, x86 and aarch64, for the basic AES block encryption/decryption operations. It's implemented in the corresponding generate_aescrypt_encryptBlock()/generate_aescrypt_decryptBlock() stubs which are used in LibraryCallKit::inline_aescrypt_Block() as substitures for implEncryptBlock()/implDecryptBlock() in the class com.sun.crypto.provider.AESCrypt.

However, x86 has some additional optimizations/intrinsics for the AES "Counter" mode (both "AES/CTR" and "AES/GCM", i.e. "Galois/Counter Mode") which are missing on aarch64 and led to the observed performance degradation on aarch64. They were introduced by the following changes:

8143925: Enhancing CounterMode.crypt() for AES
Add intrinsic for CounterMode.crypt() to leverage the parallel nature of AES in Counter(CTR) Mode.
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/cb31a76eecd1
http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/72f54de44772
From the issue summary:

The request is to leverage the parallel nature of AES in Counter (CTR) Mode. In a single threaded implementation, this can be achieved by issuing independent x86 AES-NI instructions.
Presently, there is an intrinsic for AESCrypt.implEncryptBlock(), which is called by CounterMode.crypt() method. However, the intrinsic works on one block at a time. The x86 AES-NI instructions have a latency of 6 or 7 clocks depending on the architecture. Since every AESENC instructions issued by this intrinsic is dependent on the earlier one, it does not take advantage of the CPU pipeline.
We can optimize the performance of CounterMode.crypt() method by 4x-6x by issuing independent instructions from up to 6 blocks in parallel.

The change intrinsifies com.sun.crypto.provider.CounterMode::implCrypt() if and only if CounterMode's embedded cipher is of type com.sun.crypto.provider.AESCrypt. The stub for x86_64 is implemented in generate_counterMode_AESCrypt_Parallel().

8177784: Use CounterMode intrinsic for AES/GCM
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/0c8f43317c1f

This is a platform independent change to extend 8143925, which initially only applied to AES/CTR, to also work for AES/GCM. Mentioned here only for completeness.

Later 8143925 was further improved for AVX512 and the Vector AES instructions

8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions
https://hg.openjdk.java.net/jdk/jdk/rev/c6a789f495fe
From the issue summary:

As per the Intel Architecture Instruction Set Reference, p.156-159 Vector AES (VAES) Operations will be supported in future Intel ISA. I would like to contribute an optimization for AES-CTR algorithm using AVX512+VAES instructions. This optimization is for x86_64 architecture that have AVX512-VAES enabled. I ran jtreg test suite with the algorithm on Intel SDE to confirm that encoding and semantics are correctly implemented.

The new, vectorized intrinsic for CounterMode::implCrypt() on x86_64 is implemented in generate_counterMode_VectorAESCrypt().

ToDo

As Intel mentioned in their change for 8143925, their AES instructions have a latency of 6/7 clock cycles, so they process up to 6 blocks in parallel in 8177784 to completely fill the pipeline. Depending on the latency of the AES instructions on Graviton 2, we should implement a similar intrinsic for aarch64 as well. I've created 8267993: [aarch64] Implement intrinsic for CounterMode::implCrypt() to track this in OpenJDK upstream.

Arm® NeoverseTM N1 Software Optimization Guide, p.58 mentions the following:

4.6 AES encryption/decryption
Neoverse N1 can issue two AESE/AESMC/AESD/AESIMC instruction every cycle (fully pipelined)
with an execution latency of two cycles. This means encryption or decryption for at least four data
chunks should be interleaved for maximum performance:
AESE  data0, key0
AESMC data0, data0
AESE  data1, key0
AESMC data1, data1
AESE  data2, key0
AESMC data2, data2
AESE  data3, key1
AESMC data3, data3
AESE  data0, key0
AESMC data0, data0
...
Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when
they are adjacent in the program code and both instructions use the same destination register.

Does it make sense to process more than 4 blocks in parallel?

I also read about the ARM SVE2-AES extension which can probably used to implement something similar to 8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions but it looks like SVE2-AES will only become available in ARMv9.

AWSNB · 2021-05-31T18:38:43Z

That’s an astute analysis To your question: I think 4 parallel AES should suffice on Graviton2 but I will double check with AWS internal folks and come back with recommendation to make it more generic for future cores At this point, NEON implementation should suffice and no need for SVE/SVE2 versions as they won’t change the perf outcome

…

Sent from my iPhone On May 31, 2021, at 10:23 AM, Volker Simonis ***@***.***> wrote: History OpenJDK has intrinsic support on both, x86 and aarch64, for the basic AES block encryption/decryption operations. It's implemented in the corresponding generate_aescrypt_encryptBlock()/generate_aescrypt_decryptBlock() stubs which are used in LibraryCallKit::inline_aescrypt_Block() as substitures for implEncryptBlock()/implDecryptBlock() in the class com.sun.crypto.provider.AESCrypt. However, x86 has some additional optimizations/intrinsics for the AES "Counter" mode (both "AES/CTR" and "AES/GCM", i.e. "Galois/Counter Mode") which are missing on aarch64 and led to the observed performance degradation on aarch64. They were introduced by the following changes: 8143925: Enhancing CounterMode.crypt() for AES<https://bugs.openjdk.java.net/browse/JDK-8143925> Add intrinsic for CounterMode.crypt() to leverage the parallel nature of AES in Counter(CTR) Mode. http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/cb31a76eecd1 http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/72f54de44772 From the issue summary: The request is to leverage the parallel nature of AES in Counter (CTR) Mode. In a single threaded implementation, this can be achieved by issuing independent x86 AES-NI instructions. Presently, there is an intrinsic for AESCrypt.implEncryptBlock(), which is called by CounterMode.crypt() method. However, the intrinsic works on one block at a time. The x86 AES-NI instructions have a latency of 6 or 7 clocks depending on the architecture. Since every AESENC instructions issued by this intrinsic is dependent on the earlier one, it does not take advantage of the CPU pipeline. We can optimize the performance of CounterMode.crypt() method by 4x-6x by issuing independent instructions from up to 6 blocks in parallel. The change intrinsifies com.sun.crypto.provider.CounterMode::implCrypt() if and only if CounterMode's embedded cipher is of type com.sun.crypto.provider.AESCrypt. The stub for x86_64 is implemented in generate_counterMode_AESCrypt_Parallel(). 8177784: Use CounterMode intrinsic for AES/GCM<https://bugs.openjdk.java.net/browse/JDK-8177784> http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/0c8f43317c1f This is a platform independent change to extend 8143925, which initially only applied to AES/CTR, to also work for AES/GCM. Mentioned here only for completeness. Later 8143925 was further improved for AVX512 and the Vector AES instructions 8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions<https://bugs.openjdk.java.net/browse/JDK-8233741> https://hg.openjdk.java.net/jdk/jdk/rev/c6a789f495fe From the issue summary: As per the Intel Architecture Instruction Set Reference<https://software.intel.com/sites/default/files/managed/ad/01/253666-sdm-vol-2a.pdf>, p.156-159 Vector AES (VAES) Operations will be supported in future Intel ISA. I would like to contribute an optimization for AES-CTR algorithm using AVX512+VAES instructions. This optimization is for x86_64 architecture that have AVX512-VAES enabled. I ran jtreg test suite with the algorithm on Intel SDE<https://software.intel.com/en-us/articles/intel-software-development-emulator> to confirm that encoding and semantics are correctly implemented. The new, vectorized intrinsic for CounterMode::implCrypt() on x86_64 is implemented in generate_counterMode_VectorAESCrypt(). ToDo As Intel mentioned in their change for 8143925, their AES instructions have a latency of 6/7 clock cycles, so they process up to 6 blocks in parallel in 8177784 to completely fill the pipeline. Depending on the latency of the AES instructions on Graviton 2, we should implement a similar intrinsic for aarch64 as well. I've created 8267993: [aarch64] Implement intrinsic for CounterMode::implCrypt()<https://bugs.openjdk.java.net/browse/JDK-8267993> to track this in OpenJDK upstream. Arm® NeoverseTM N1 Software Optimization Guide<https://documentation-service.arm.com/static/5f05e93dcafe527e86f61acd?token=>, p.58 mentions the following: 4.6 AES encryption/decryption Neoverse N1 can issue two AESE/AESMC/AESD/AESIMC instruction every cycle (fully pipelined) with an execution latency of two cycles. This means encryption or decryption for at least four data chunks should be interleaved for maximum performance: AESE data0, key0 AESMC data0, data0 AESE data1, key0 AESMC data1, data1 AESE data2, key0 AESMC data2, data2 AESE data3, key1 AESMC data3, data3 AESE data0, key0 AESMC data0, data0 ... Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when they are adjacent in the program code and both instructions use the same destination register. Does it make sense to process more than 4 blocks in parallel? I also read about the ARM SVE2-AES extension<https://developer.arm.com/documentation/ddi0602/latest/SVE-Instructions/AESE--AES-single-round-encryption-> which can probably used to implement something similar to 8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions<https://bugs.openjdk.java.net/browse/JDK-8233741> but it looks like SVE2-AES will only become available in ARMv9. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub<#110 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AFTRWCLWNMLJCVZGWMM4JRTTQPAV7ANCNFSM45MJG3FA>.

navyxliu · 2021-08-03T23:58:31Z

hi, @h3nd24
Are you using Corretto-11 or Corretto-8?

charlese-instaclustr · 2021-08-06T01:22:07Z

Are you using Corretto-11 or Corretto-8?
@navyxliu This was with Corretto-11

navyxliu · 2021-11-22T19:52:27Z

Here is the PR of Interleave GCM.
openjdk/jdk11u-dev#410

please note it's off by default, user needs to explicitly enable it by -XX:+UseAESCTRIntrinsics.

navyxliu · 2021-12-02T19:30:59Z

This patch has been backported to openjdk 11.0.14.
We expect to see 8x speedup of GCM encrypt/decrypt on Gravition2. Next move, we will evaluate the performance gain on kafka.

navyxliu · 2021-12-09T02:20:53Z

Here is the result of corretto-11 nightly build on r6g instance.

java -jar ./target/benchmarks.jar -jvm ../../amazon-corretto-11.0.14.1.0-linux-aarch64/bin/java -jvmArgs '-XX:+UnlockDiagnosticVMOptions -XX:+UseAESCTRIntrinsics' org.openjdk.bench.javax.crypto.small.AESGCMBench.*

for dataSize/keyLength = (1024/128), we can see 3.5~5x more thoughput.

Benchmark                     (dataSize)  (keyLength)  (provider)   Mode  Cnt       Score      Error  Units
AESGCMBench.decrypt                 1024          128              thrpt   40  727,936.577 ±  803.232  ops/s
AESGCMBench.decryptMultiPart        1024          128              thrpt   40  580,544.738 ± 2154.782  ops/s
AESGCMBench.encrypt                 1024          128              thrpt   40  929,792.800 ± 2149.501  ops/s
AESGCMBench.encryptMultiPart        1024          128              thrpt   40  893,095.915 ± 1069.300  ops/s

Benchmark                     (dataSize)  (keyLength)  (provider)   Mode  Cnt       Score      Error  Units
AESGCMBench.decrypt                 1024          128              thrpt   40  173,051.939 ±  701.488  ops/s
AESGCMBench.decryptMultiPart        1024          128              thrpt   40  162,495.531 ± 1512.809  ops/s
AESGCMBench.encrypt                 1024          128              thrpt   40  180,853.164 ± 1102.398  ops/s
AESGCMBench.encryptMultiPart        1024          128              thrpt   40  175,422.285 ± 2028.631  ops/s

navyxliu · 2021-12-09T17:22:31Z

the preview link, expire in 7 days

This PR adds options -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics to the jvm.config files to enable intrinsic support for AES CTR/GCM on ARM64. It improves performance of network communication with S3 a lot on graviton instances. It's enabled in Java 18 by default, which is already released. Therefore risk is minimal. In latest Java (OpenJDK) versions the performance of AES CTR/GCM for AARCH64 was significally improved and it was backported to OpenJDK 11. Backport PR: openjdk/jdk11u-dev#410. Some additional explanation is here: aws/aws-graviton-getting-started#110 (comment) The original OpenJDK issue is here: https://bugs.openjdk.java.net/browse/JDK-8267993 To use that backport we need to enable it explicitly by enabling -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics. It was not enabled by default in the backport because of conservative approach.

This PR adds options -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics to the JVM's options to enable intrinsic support for AES CTR/GCM on ARM64. It improves performance of network communication with S3 a lot on graviton instances. It's enabled in Java 18 by default, which is already released. Therefore risk is minimal. In latest Java (OpenJDK) versions the performance of AES CTR/GCM for AARCH64 was significally improved and it was backported to OpenJDK 11. Backport PR: openjdk/jdk11u-dev#410. Some additional explanation is here: aws/aws-graviton-getting-started#110 (comment) The original OpenJDK issue is here: https://bugs.openjdk.java.net/browse/JDK-8267993 To use that backport we need to enable it explicitly by enabling -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics. It was not enabled by default in the backport because of conservative approach.

This PR adds options -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics to the jvm.config files to enable intrinsic support for AES CTR/GCM on ARM64. It improves performance of network communication with S3 a lot on graviton instances. It's enabled in Java 18 by default, which is already released. Therefore risk is minimal. In latest Java (OpenJDK) versions the performance of AES CTR/GCM for AARCH64 was significally improved and it was backported to OpenJDK 11. Backport PR: openjdk/jdk11u-dev#410. Some additional explanation is here: aws/aws-graviton-getting-started#110 (comment) The original OpenJDK issue is here: https://bugs.openjdk.java.net/browse/JDK-8267993 To use that backport we need to enable it explicitly by enabling -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics. It was not enabled by default in the backport because of conservative approach.

This PR adds options -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics to the JVM's options to enable intrinsic support for AES CTR/GCM on ARM64. It improves performance of network communication with S3 a lot on graviton instances. It's enabled in Java 18 by default, which is already released. Therefore risk is minimal. In latest Java (OpenJDK) versions the performance of AES CTR/GCM for AARCH64 was significally improved and it was backported to OpenJDK 11. Backport PR: openjdk/jdk11u-dev#410. Some additional explanation is here: aws/aws-graviton-getting-started#110 (comment) The original OpenJDK issue is here: https://bugs.openjdk.java.net/browse/JDK-8267993 To use that backport we need to enable it explicitly by enabling -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics. It was not enabled by default in the backport because of conservative approach.

geoffreyblake · 2022-08-03T19:42:16Z

Resolving as the JDK has the necessary backports, and commits to other projects to use the flag have been committed.

geoffreyblake closed this as completed Aug 3, 2022

sebpop mentioned this issue Sep 23, 2022

[java] document AES/GCM performance improvement and flags #251

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance Benchmarking Result of Kafka with CLient Encryption in Graviton is Worse compared to Non-Graviton #110

Performance Benchmarking Result of Kafka with CLient Encryption in Graviton is Worse compared to Non-Graviton #110

h3nd24 commented May 24, 2021 •

edited

Loading

sebpop commented May 26, 2021

h3nd24 commented May 27, 2021

simonis commented May 31, 2021

AWSNB commented May 31, 2021 via email

navyxliu commented Aug 3, 2021

charlese-instaclustr commented Aug 6, 2021

navyxliu commented Nov 22, 2021

navyxliu commented Dec 2, 2021

navyxliu commented Dec 9, 2021

navyxliu commented Dec 9, 2021

geoffreyblake commented Aug 3, 2022

Performance Benchmarking Result of Kafka with CLient Encryption in Graviton is Worse compared to Non-Graviton #110

Performance Benchmarking Result of Kafka with CLient Encryption in Graviton is Worse compared to Non-Graviton #110

Comments

h3nd24 commented May 24, 2021 • edited Loading

sebpop commented May 26, 2021

h3nd24 commented May 27, 2021

simonis commented May 31, 2021

History

ToDo

AWSNB commented May 31, 2021 via email

navyxliu commented Aug 3, 2021

charlese-instaclustr commented Aug 6, 2021

navyxliu commented Nov 22, 2021

navyxliu commented Dec 2, 2021

navyxliu commented Dec 9, 2021

navyxliu commented Dec 9, 2021

geoffreyblake commented Aug 3, 2022

h3nd24 commented May 24, 2021 •

edited

Loading