Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Benchmarking Result of Kafka with CLient Encryption in Graviton is Worse compared to Non-Graviton #110

Closed
h3nd24 opened this issue May 24, 2021 · 11 comments

Comments

@h3nd24
Copy link

h3nd24 commented May 24, 2021

Hi, we were trying to do a fixed load performance test Kafka on Graviton (r6g.large) vs non-Graviton (r5.large) and it seems that Kafka on Graviton is doing way worse than it's Non-Graviton counterpart (only around half the throughput). The setup:

Here are the flame graph of the two runs, sampled at 1000hz for 1 minute during load in the zip file. flame-G is for Graviton node and flame-NG is for non-Graviton node.
FlameGraphs.zip

Seems to me that in Graviton we spend much more time in encoding the encrypted message. Is there a known issue and workaround for this? For additional information, I did the same benchmarking setup except I turned off the client encryption. The result is that Graviton performs better than Non-Graviton counterpart. Find attached the benchmarking result
BenchmarkingResult.tar.gz

Thanks for your help.

@sebpop
Copy link
Contributor

sebpop commented May 26, 2021

Looking through the flamegraphs, the issue seems to be in bufferCrypt()
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/javax/crypto/CipherSpi.java#L749
The function takes 15.3% of all samples on non-Graviton and 28.59% of samples on Graviton.
bufferCrypt() calls more native functions in non-Graviton: jbyte_disjoint_array, ghash_processBlocks, counterMode_AESCrypt
than on Graviton: aescrypt_encrypt
We need to dive in and see which of the native functions need to be ported to arm64.

@h3nd24
Copy link
Author

h3nd24 commented May 27, 2021

Stumbled across this article and decided to give it a go https://aws.amazon.com/blogs/opensource/introducing-amazon-corretto-crypto-provider-accp/ . The result looks much better than before, whereby there are improvements for both Graviton and Non-Graviton, but the improvement on Graviton is of much higher ratio that it's raw performance is better than Non-Graviton now. Find attached the result
BenchmarkingResultWithACCP.tar.gz

In a sense it seems to me ACCP is kinda doing some native instructions already (via openSSL), and that definitely brings a lot of advantages.

@simonis
Copy link
Contributor

simonis commented May 31, 2021

History

OpenJDK has intrinsic support on both, x86 and aarch64, for the basic AES block encryption/decryption operations. It's implemented in the corresponding generate_aescrypt_encryptBlock()/generate_aescrypt_decryptBlock() stubs which are used in LibraryCallKit::inline_aescrypt_Block() as substitures for implEncryptBlock()/implDecryptBlock() in the class com.sun.crypto.provider.AESCrypt.

However, x86 has some additional optimizations/intrinsics for the AES "Counter" mode (both "AES/CTR" and "AES/GCM", i.e. "Galois/Counter Mode") which are missing on aarch64 and led to the observed performance degradation on aarch64. They were introduced by the following changes:

8143925: Enhancing CounterMode.crypt() for AES
Add intrinsic for CounterMode.crypt() to leverage the parallel nature of AES in Counter(CTR) Mode.
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/cb31a76eecd1
http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/72f54de44772
From the issue summary:

The request is to leverage the parallel nature of AES in Counter (CTR) Mode. In a single threaded implementation, this can be achieved by issuing independent x86 AES-NI instructions.
Presently, there is an intrinsic for AESCrypt.implEncryptBlock(), which is called by CounterMode.crypt() method. However, the intrinsic works on one block at a time. The x86 AES-NI instructions have a latency of 6 or 7 clocks depending on the architecture. Since every AESENC instructions issued by this intrinsic is dependent on the earlier one, it does not take advantage of the CPU pipeline.
We can optimize the performance of CounterMode.crypt() method by 4x-6x by issuing independent instructions from up to 6 blocks in parallel.

The change intrinsifies com.sun.crypto.provider.CounterMode::implCrypt() if and only if CounterMode's embedded cipher is of type com.sun.crypto.provider.AESCrypt. The stub for x86_64 is implemented in generate_counterMode_AESCrypt_Parallel().

8177784: Use CounterMode intrinsic for AES/GCM
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/0c8f43317c1f

This is a platform independent change to extend 8143925, which initially only applied to AES/CTR, to also work for AES/GCM. Mentioned here only for completeness.

Later 8143925 was further improved for AVX512 and the Vector AES instructions

8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions
https://hg.openjdk.java.net/jdk/jdk/rev/c6a789f495fe
From the issue summary:

As per the Intel Architecture Instruction Set Reference, p.156-159 Vector AES (VAES) Operations will be supported in future Intel ISA. I would like to contribute an optimization for AES-CTR algorithm using AVX512+VAES instructions. This optimization is for x86_64 architecture that have AVX512-VAES enabled. I ran jtreg test suite with the algorithm on Intel SDE to confirm that encoding and semantics are correctly implemented.

The new, vectorized intrinsic for CounterMode::implCrypt() on x86_64 is implemented in generate_counterMode_VectorAESCrypt().

ToDo

As Intel mentioned in their change for 8143925, their AES instructions have a latency of 6/7 clock cycles, so they process up to 6 blocks in parallel in 8177784 to completely fill the pipeline. Depending on the latency of the AES instructions on Graviton 2, we should implement a similar intrinsic for aarch64 as well. I've created 8267993: [aarch64] Implement intrinsic for CounterMode::implCrypt() to track this in OpenJDK upstream.

Arm® NeoverseTM N1 Software Optimization Guide, p.58 mentions the following:

4.6 AES encryption/decryption
Neoverse N1 can issue two AESE/AESMC/AESD/AESIMC instruction every cycle (fully pipelined)
with an execution latency of two cycles. This means encryption or decryption for at least four data
chunks should be interleaved for maximum performance:

AESE  data0, key0
AESMC data0, data0
AESE  data1, key0
AESMC data1, data1
AESE  data2, key0
AESMC data2, data2
AESE  data3, key1
AESMC data3, data3
AESE  data0, key0
AESMC data0, data0
...

Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when
they are adjacent in the program code and both instructions use the same destination register.

Does it make sense to process more than 4 blocks in parallel?

I also read about the ARM SVE2-AES extension which can probably used to implement something similar to 8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions but it looks like SVE2-AES will only become available in ARMv9.

@AWSNB
Copy link
Contributor

AWSNB commented May 31, 2021 via email

@navyxliu
Copy link

navyxliu commented Aug 3, 2021

hi, @h3nd24
Are you using Corretto-11 or Corretto-8?

@charlese-instaclustr
Copy link

Are you using Corretto-11 or Corretto-8?
@navyxliu This was with Corretto-11

@navyxliu
Copy link

Here is the PR of Interleave GCM.
openjdk/jdk11u-dev#410

please note it's off by default, user needs to explicitly enable it by -XX:+UseAESCTRIntrinsics.

@navyxliu
Copy link

navyxliu commented Dec 2, 2021

This patch has been backported to openjdk 11.0.14.
We expect to see 8x speedup of GCM encrypt/decrypt on Gravition2. Next move, we will evaluate the performance gain on kafka.

@navyxliu
Copy link

navyxliu commented Dec 9, 2021

Here is the result of corretto-11 nightly build on r6g instance.

java -jar ./target/benchmarks.jar -jvm ../../amazon-corretto-11.0.14.1.0-linux-aarch64/bin/java -jvmArgs '-XX:+UnlockDiagnosticVMOptions -XX:+UseAESCTRIntrinsics' org.openjdk.bench.javax.crypto.small.AESGCMBench.*

for dataSize/keyLength = (1024/128), we can see 3.5~5x more thoughput.

Benchmark                     (dataSize)  (keyLength)  (provider)   Mode  Cnt       Score      Error  Units
AESGCMBench.decrypt                 1024          128              thrpt   40  727,936.577 ±  803.232  ops/s
AESGCMBench.decryptMultiPart        1024          128              thrpt   40  580,544.738 ± 2154.782  ops/s
AESGCMBench.encrypt                 1024          128              thrpt   40  929,792.800 ± 2149.501  ops/s
AESGCMBench.encryptMultiPart        1024          128              thrpt   40  893,095.915 ± 1069.300  ops/s

Benchmark                     (dataSize)  (keyLength)  (provider)   Mode  Cnt       Score      Error  Units
AESGCMBench.decrypt                 1024          128              thrpt   40  173,051.939 ±  701.488  ops/s
AESGCMBench.decryptMultiPart        1024          128              thrpt   40  162,495.531 ± 1512.809  ops/s
AESGCMBench.encrypt                 1024          128              thrpt   40  180,853.164 ± 1102.398  ops/s
AESGCMBench.encryptMultiPart        1024          128              thrpt   40  175,422.285 ± 2028.631  ops/s

@navyxliu
Copy link

navyxliu commented Dec 9, 2021

radek-kondziolka pushed a commit to radek-kondziolka/trino that referenced this issue May 10, 2022
This PR adds options -XX:+UnlockDiagnosticVMOptions and
-XX:+UseAESCTRIntrinsics to the jvm.config files to enable
intrinsic support for AES CTR/GCM on ARM64.

It improves performance of network communication with S3
a lot on graviton instances. It's enabled in Java 18 by default,
which is already released. Therefore risk is minimal.

In latest Java (OpenJDK) versions the performance of AES CTR/GCM
for AARCH64 was significally improved and it was backported to OpenJDK 11.
Backport PR: openjdk/jdk11u-dev#410.
Some additional explanation is here:
aws/aws-graviton-getting-started#110 (comment)
The original OpenJDK issue is here:
https://bugs.openjdk.java.net/browse/JDK-8267993
To use that backport we need to enable it explicitly by
enabling -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics.
It was not enabled by default in the backport because of
conservative approach.
radek-kondziolka pushed a commit to radek-kondziolka/charts that referenced this issue May 11, 2022
This PR adds options -XX:+UnlockDiagnosticVMOptions and
-XX:+UseAESCTRIntrinsics to the JVM's options to enable
intrinsic support for AES CTR/GCM on ARM64.

It improves performance of network communication with S3
a lot on graviton instances. It's enabled in Java 18 by default,
which is already released. Therefore risk is minimal.

In latest Java (OpenJDK) versions the performance of AES CTR/GCM
for AARCH64 was significally improved and it was backported to OpenJDK 11.
Backport PR: openjdk/jdk11u-dev#410.
Some additional explanation is here:
aws/aws-graviton-getting-started#110 (comment)
The original OpenJDK issue is here:
https://bugs.openjdk.java.net/browse/JDK-8267993
To use that backport we need to enable it explicitly by
enabling -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics.
It was not enabled by default in the backport because of
conservative approach.
sopel39 pushed a commit to trinodb/trino that referenced this issue May 12, 2022
This PR adds options -XX:+UnlockDiagnosticVMOptions and
-XX:+UseAESCTRIntrinsics to the jvm.config files to enable
intrinsic support for AES CTR/GCM on ARM64.

It improves performance of network communication with S3
a lot on graviton instances. It's enabled in Java 18 by default,
which is already released. Therefore risk is minimal.

In latest Java (OpenJDK) versions the performance of AES CTR/GCM
for AARCH64 was significally improved and it was backported to OpenJDK 11.
Backport PR: openjdk/jdk11u-dev#410.
Some additional explanation is here:
aws/aws-graviton-getting-started#110 (comment)
The original OpenJDK issue is here:
https://bugs.openjdk.java.net/browse/JDK-8267993
To use that backport we need to enable it explicitly by
enabling -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics.
It was not enabled by default in the backport because of
conservative approach.
hashhar pushed a commit to trinodb/charts that referenced this issue May 13, 2022
This PR adds options -XX:+UnlockDiagnosticVMOptions and
-XX:+UseAESCTRIntrinsics to the JVM's options to enable
intrinsic support for AES CTR/GCM on ARM64.

It improves performance of network communication with S3
a lot on graviton instances. It's enabled in Java 18 by default,
which is already released. Therefore risk is minimal.

In latest Java (OpenJDK) versions the performance of AES CTR/GCM
for AARCH64 was significally improved and it was backported to OpenJDK 11.
Backport PR: openjdk/jdk11u-dev#410.
Some additional explanation is here:
aws/aws-graviton-getting-started#110 (comment)
The original OpenJDK issue is here:
https://bugs.openjdk.java.net/browse/JDK-8267993
To use that backport we need to enable it explicitly by
enabling -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics.
It was not enabled by default in the backport because of
conservative approach.
@geoffreyblake
Copy link
Contributor

Resolving as the JDK has the necessary backports, and commits to other projects to use the flag have been committed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants