
[CI] RemoteClusterSecurityReloadCredentialsRestIT testFirstTimeSetupWithElasticsearchSettings failing #116883

Closed
elasticsearchmachine opened this issue Nov 16, 2024 · 9 comments · Fixed by #117157
Assignees
Labels
low-risk (An open issue or test failure that is a low risk to future releases) · :Security/Authorization (Roles, Privileges, DLS/FLS, RBAC/ABAC) · Team:Security (Meta label for security team) · >test-failure (Triaged test failures from CI)

Comments

@elasticsearchmachine (Collaborator) commented Nov 16, 2024

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:security:qa:multi-cluster:javaRestTest" --tests "org.elasticsearch.xpack.remotecluster.RemoteClusterSecurityReloadCredentialsRestIT.testFirstTimeSetupWithElasticsearchSettings" -Dtests.seed=540C601FBCEE4825 -Dtests.locale=hy-Armn-AM -Dtests.timezone=Asia/Aden -Druntime.java=17 -Dtests.fips.enabled=true

Applicable branches:
8.x

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.Exception: Test abandoned because suite timeout was reached.

Issue Reasons:

  • [8.x] 3 failures in test testFirstTimeSetupWithElasticsearchSettings (1.4% fail rate in 216 executions)
  • [8.x] 3 failures in step openjdk17_checkpart4_java-fips-matrix (37.5% fail rate in 8 executions)
  • [8.x] 3 failures in pipeline elasticsearch-periodic (37.5% fail rate in 8 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine added the :Security/Authorization, >test-failure, Team:Security, and needs:risk (Requires assignment of a risk label: low, medium, blocker) labels Nov 16, 2024
@elasticsearchmachine (Collaborator, Author) commented:

Pinging @elastic/es-security (Team:Security)

@slobodanadamovic (Contributor) commented Nov 20, 2024

Marking this (and other tests) as low-risk since it does not reproduce locally. Looking at the logs, it seems that the query cluster is attempting to establish a connection with the fulfilling cluster and gets a certificate issued by the Elastic Auto RemoteCluster CA, which is not trusted because the query cluster expects certificates issued by the Elastic Auto Transport CA. I don't immediately see how this can happen:

[2024-11-15T23:41:35,649][WARN ][o.e.c.s.DiagnosticTrustManager] [query-cluster-0] failed to establish trust with server at [<unknown host>]; the server provided a certificate with subject name [CN=remote_cluster], fingerprint [529de35e15666ffaa26afa50876a2a48119db03a], no keyUsage and no extendedKeyUsage; the certificate is valid between [2023-01-29T12:08:37Z] and [2032-08-29T12:08:37Z] (current time is [2024-11-15T21:41:35.634253849Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate has subject alternative names [DNS:localhost,DNS:localhost6.localdomain6,IP:127.0.0.1,IP:0:0:0:0:0:0:0:1,DNS:localhost4,DNS:localhost6,DNS:localhost.localdomain,DNS:localhost4.localdomain4]; the certificate is issued by [CN=Elastic Auto RemoteCluster CA] but the server did not provide a copy of the issuing certificate in the certificate chain; this ssl context ([xpack.security.transport.ssl (with trust configuration: PEM-trust{/dev/shm/bk/bk-agent-prod-gcp-1731705798559692147/elastic/elasticsearch-periodic/x-pack/plugin/security/qa/multi-cluster/build/testrun/javaRestTest/temp/query-cluster7670203185286997510/query-cluster-0/config/transport-ca.crt})]) is not configured to trust that issuer, it only trusts the issuer [CN=Elastic Auto Transport CA] with fingerprint [bbe49e3f986506008a70ab651b188c70df104812]
java.security.cert.CertificateException: No issuer certificate for certificate in certification path found.
	at org.bouncycastle.jsse.provider.ProvX509TrustManager.validateChain(ProvX509TrustManager.java:318) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.jsse.provider.ProvX509TrustManager.checkTrusted(ProvX509TrustManager.java:273) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.jsse.provider.ProvX509TrustManager.checkServerTrusted(ProvX509TrustManager.java:188) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.jsse.provider.ExportX509TrustManager_7.checkServerTrusted(ExportX509TrustManager_7.java:61) ~[bctls-fips-1.0.17.jar:?]
	at org.elasticsearch.common.ssl.DiagnosticTrustManager.checkServerTrusted(DiagnosticTrustManager.java:101) ~[?:?]
	at org.bouncycastle.jsse.provider.ImportX509TrustManager_7.checkServerTrusted(ImportX509TrustManager_7.java:62) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.jsse.provider.ProvSSLEngine.checkServerTrusted(ProvSSLEngine.java:150) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.jsse.provider.ProvTlsClient$1.notifyServerCertificate(ProvTlsClient.java:377) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.tls.TlsUtils.processServerCertificate(TlsUtils.java:4849) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.tls.TlsClientProtocol.handleServerCertificate(TlsClientProtocol.java:797) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.tls.TlsClientProtocol.receive13ServerCertificate(TlsClientProtocol.java:1596) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.tls.TlsClientProtocol.handle13HandshakeMessage(TlsClientProtocol.java:160) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.tls.TlsClientProtocol.handleHandshakeMessage(TlsClientProtocol.java:366) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.tls.TlsProtocol.processHandshakeQueue(TlsProtocol.java:715) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.tls.TlsProtocol.processRecord(TlsProtocol.java:591) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.tls.RecordStream.readFullRecord(RecordStream.java:209) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.tls.TlsProtocol.safeReadFullRecord(TlsProtocol.java:926) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.tls.TlsProtocol.offerInput(TlsProtocol.java:1368) ~[bctls-fips-1.0.17.jar:?]
	at org.bouncycastle.jsse.provider.ProvSSLEngine.unwrap(ProvSSLEngine.java:486) ~[bctls-fips-1.0.17.jar:?]
	at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:679) ~[?:?]
	at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:309) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1473) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1366) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1415) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]

@n1v0lg Assigning this one to you as you were the original author of RemoteClusterSecurityReloadCredentialsRestIT.

@slobodanadamovic added the low-risk label and removed the needs:risk label Nov 20, 2024
@n1v0lg (Contributor) commented Nov 20, 2024

Interestingly, these all fail in FIPS mode, and the reload call is likewise used in other suites that were failing in FIPS mode (e.g., we recently disabled FIPS mode for snapshot ITs via #116811).

@n1v0lg (Contributor) commented Nov 20, 2024

Hm, actually that message is benign -- the QC is not yet configured with a remote cluster credential and attempts to connect to the FC's port as though it were regular transport, not a remote cluster server, and therefore does not trust the cert (symptom 1 in our docs). Eventually the keystore is configured with the cross-cluster API key credential and those failure messages stop.

I get those same exact error messages on runs that are successful.

I think the suite simply doesn't have enough time to wrap up, reaches the suite timeout, and fails.
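
For reference, the credential setup this suite exercises eventually boils down to writing the cross-cluster API key into the QC keystore and then invoking the reload-secure-settings API. A minimal sketch of that reload call using the Elasticsearch low-level REST client; the host, port, and bare client setup are assumptions for illustration, not the suite's actual wiring:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ReloadSecureSettingsSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a local node; the real test targets the query cluster's HTTP port.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            // After the cross-cluster API key credential is written to the keystore,
            // this asks each node to re-read its secure settings without a restart.
            Response response = client.performRequest(new Request("POST", "/_nodes/reload_secure_settings"));
            System.out.println(response.getStatusLine());
        }
    }
}
```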

@n1v0lg (Contributor) commented Nov 20, 2024

Yikes, yeah -- just getting the FC and QC up and running takes ~4m (about 2m per cluster), and each test takes between 1.5m and 2m; with 4 tests total, we time out while completing the last one.

The main hog is the keystore setup -- over 30s to create the keystore and provide all the entries:

[2024-11-18T08:46:18,684][INFO ][process-output           ] [[elasticsearch-keystore-log-forwarder]] Created elasticsearch keystore in /dev/shm/bk/bk-agent-prod-gcp-1731907391568494223/elastic/elasticsearch-periodic/x-pack/plugin/security/qa/multi-cluster/build/testrun/javaRestTest/temp/fulfilling-cluster6546496433333095446/fulfilling-cluster-0/config/elasticsearch.keystore
[2024-11-18T08:46:18,781][ERROR][process-output           ] [[elasticsearch-keystore-log-forwarder]] Enter new password for the elasticsearch keystore (empty for no password): Enter same password again: 
[2024-11-18T08:46:28,874][ERROR][process-output           ] [[elasticsearch-keystore-log-forwarder]] Enter password for the elasticsearch keystore : Enter value for xpack.security.remote_cluster_server.ssl.secure_key_passphrase: 
[2024-11-18T08:46:41,040][ERROR][process-output           ] [[elasticsearch-keystore-log-forwarder]] Enter password for the elasticsearch keystore : Enter value for bootstrap.password: 
[2024-11-18T08:46:52,123][ERROR][process-output           ] [[elasticsearch-keystore-log-forwarder]] Enter password for the elasticsearch keystore : Enter value for xpack.security.transport.ssl.secure_key_passphrase: 

There is room for improvement here...

@n1v0lg (Contributor) commented Nov 20, 2024

I need to check the timestamps for a non-FIPS run, but I wonder if this is so heinously slow because we are using slower default algorithms in FIPS mode for, e.g., keystore encryption.

@slobodanadamovic (Contributor) commented Nov 20, 2024

Nice catch! Yeah, that sounds way too slow. I'm wondering if we could avoid running the same commands on every node and instead construct the keystore once and copy it to all nodes (potentially using KeyStoreWrapper)?
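
A minimal sketch of that build-once-copy-everywhere idea, assuming the keystore file is constructed a single time (via KeyStoreWrapper or the elasticsearch-keystore CLI) and then distributed; the class and method names here are hypothetical:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class KeystoreDistributionSketch {
    // Copy a pre-built elasticsearch.keystore into each node's config dir,
    // replacing N per-node CLI invocations with one build plus N file copies.
    static void distribute(Path builtKeystore, List<Path> nodeConfigDirs) throws Exception {
        for (Path configDir : nodeConfigDirs) {
            Files.copy(builtKeystore, configDir.resolve("elasticsearch.keystore"),
                StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```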

@n1v0lg (Contributor) commented Nov 20, 2024

@slobodanadamovic that'd be a nice improvement for larger clusters for sure -- the slow test run here, though, was for single-node clusters, so it wouldn't benefit much.

Delightfully, I've actually made this whole thing even slower in 8.14 by increasing the KDF iteration count from 10_000 to 210_000 in #107107 🤦
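
To get a feel for why the iteration count matters, here is a hedged micro-benchmark of PBKDF2 at the two counts mentioned; the algorithm, salt size, and key length are assumptions for illustration, not necessarily what the keystore uses, and a FIPS provider may be slower still:

```java
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class KdfCostSketch {
    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[64]; // zeroed salt is fine for timing purposes only
        for (int iterations : new int[] { 10_000, 210_000 }) {
            long start = System.nanoTime();
            // Derives one key; cost scales roughly linearly with the iteration count.
            SecretKeyFactory.getInstance("PBKDF2WithHmacSHA512")
                .generateSecret(new PBEKeySpec("password".toCharArray(), salt, iterations, 256));
            System.out.printf("%,d iterations took %d ms%n",
                iterations, (System.nanoTime() - start) / 1_000_000);
        }
    }
}
```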

@n1v0lg (Contributor) commented Nov 20, 2024

In the interim: #117157
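
The interim fix bumps the suite timeout rather than muting the suite. A sketch of what such a bump typically looks like with the randomized-testing annotation Elasticsearch's test framework uses; the class name and the 15-minute value are illustrative, not the actual change in #117157:

```java
import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
import org.apache.lucene.tests.util.TimeUnits;
import org.junit.Test;

// Raising the whole-suite limit keeps slow FIPS keystore setup from
// pushing the last test past the timeout while the root cause is addressed.
@TimeoutSuite(millis = 15 * TimeUnits.MINUTE)
public class SuiteTimeoutSketchIT {
    @Test
    public void testPlaceholder() {
        // ... cluster startup, keystore setup, and credential reload steps ...
    }
}
```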

@n1v0lg closed this as completed in 312f831 Nov 20, 2024
n1v0lg added a commit to n1v0lg/elasticsearch that referenced this issue Nov 20, 2024

Rather than muting the suite and losing signal, bump the suite timeout to account for very slow keystore operations.

We should follow this up with performance improvements around keystore setup in tests.

Closes: elastic#116883
(cherry picked from commit 312f831)

elasticsearchmachine pushed a commit that referenced this issue Nov 20, 2024

* Longer RCS suite timeout due to slow keystore (#117157)

Rather than muting the suite and losing signal, bump the suite timeout to account for very slow keystore operations.

We should follow this up with performance improvements around keystore setup in tests.

Closes: #116883

* Unmute

The same commit message was repeated verbatim across the remaining backports, each referencing this issue: five further n1v0lg commits to n1v0lg/elasticsearch (Nov 20, 2024), two more elasticsearchmachine pushes (Nov 20 and Nov 21, 2024), a rjernst commit to rjernst/elasticsearch (Nov 20, 2024), and an alexey-ivanov-es commit to alexey-ivanov-es/elasticsearch (Nov 28, 2024).