Vault tests are flaky #21095

famod · 2021-10-29T15:49:56Z

Description

Seen those fail a lot recently (mostly on Java 17?), e.g.:
#21078 (comment)

io.quarkus.vault.AgroalVaultITCase
io.quarkus.vault.AgroalVaultKv1ITCase
io.quarkus.vault.VaultKv1ITCase

Might just be caused by slow CI.

/cc @vsevel @stuartwdouglas

Implementation ideas

No response

quarkus-bot · 2021-10-29T15:49:58Z

/cc @vsevel

geoand · 2021-10-29T15:53:47Z

Let me try a quick bandaid

Hopefully fixes: quarkusio#21095

geoand · 2021-10-29T16:02:09Z

Let's see if #21097 does anything

Increase vault client timeouts

geoand · 2021-10-30T12:29:43Z

Reopening as #21105 failed.

It seems like the increase in timeout in #21097 got this moving more, but now we have this error:

2021-10-30T02:58:03.7090291Z Caused by: java.lang.RuntimeException: Unable to start Quarkus test resource io.quarkus.vault.test.VaultTestLifecycleManager@5f59376
2021-10-30T02:58:03.7092400Z 	at io.quarkus.test.common.TestResourceManager.lambda$start$1(TestResourceManager.java:130)
2021-10-30T02:58:03.7093748Z 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
2021-10-30T02:58:03.7136764Z 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
2021-10-30T02:58:03.7361736Z 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2021-10-30T02:58:03.7363435Z 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2021-10-30T02:58:03.7364488Z 	at java.base/java.lang.Thread.run(Thread.java:829)
2021-10-30T02:58:03.7365754Z Caused by: io.quarkus.vault.runtime.VaultIOException: io.vertx.core.VertxException: Connection was closed
2021-10-30T02:58:03.7367329Z 	at io.quarkus.vault.runtime.client.VertxVaultClient.exec(VertxVaultClient.java:202)
2021-10-30T02:58:03.7368985Z 	at io.quarkus.vault.runtime.client.VertxVaultClient.exec(VertxVaultClient.java:172)
2021-10-30T02:58:03.7370634Z 	at io.quarkus.vault.runtime.client.VertxVaultClient.post(VertxVaultClient.java:116)
2021-10-30T02:58:03.7372459Z 	at io.quarkus.vault.runtime.client.VertxVaultClient.post(VertxVaultClient.java:110)
2021-10-30T02:58:03.7374626Z 	at io.quarkus.vault.test.client.TestVaultClient.generateAppRoleSecretId(TestVaultClient.java:30)
2021-10-30T02:58:03.7376987Z 	at io.quarkus.vault.test.VaultTestExtension.initVault(VaultTestExtension.java:290)
2021-10-30T02:58:03.7419859Z 	at io.quarkus.vault.test.VaultTestExtension.start(VaultTestExtension.java:244)
2021-10-30T02:58:03.7422290Z 	at io.quarkus.vault.test.VaultTestLifecycleManager.start(VaultTestLifecycleManager.java:33)
2021-10-30T02:58:03.7424198Z 	at io.quarkus.test.common.TestResourceManager.lambda$start$1(TestResourceManager.java:125)

@vsevel any idea what's going on? Should we perhaps automatically retry when the connection was closed?

vsevel · 2021-10-30T13:44:14Z

any idea what's going on?

hi @geoand,

that is how I create the web client:

        WebClientOptions options = new WebClientOptions()
                .setConnectTimeout((int) vaultBootstrapConfig.connectTimeout.toMillis())
                .setIdleTimeout((int) vaultBootstrapConfig.readTimeout.getSeconds());

I am not convinced using the concept of read timeout does translate well into vertx's idle timeout. but when I did this, I did not find any readTimeout or soTimeout in the WebClientOptions as I was expecting.

I use the read timeout as well in:

    public Buffer get(String path, String token) {
        final HttpRequest<Buffer> request = builder(path, token).method(HttpMethod.GET);
        final HttpResponse<Buffer> response = request.send().await().atMost(getRequestTimeout());

...

    private Duration getRequestTimeout() {
        return vaultConfigHolder.getVaultBootstrapConfig().readTimeout;
    }

so the readTimeout config parameter is used both in setIdleTimeout() and atMost(). may be in the case above, the idleTimeout closed the connection before the atMost was able to throw a timeout exception. may be we could get the atMost to be smaller than the idleTimeout. Or we should have 2 proper params.

that said, we would have an issue anyway (whether it is a closed connection or a timeout), because for some reason there is an issue communicating with vault.

in our project we have been hit a lot with timeout exceptions communicating with vault. since we are using the vault config source, those exceptions would happen at startup time, and get the app to not start, get killed, and we would have to get it to restart. that is why I did #20343.

we tried to investigate this communication issue, looking at vault's audit logs. but all calls received by vault were processed in 5ms. we could not pinpoint something wrong in vault. rather than vault being slow, this gave the impression (without being able to prove it) that the quarkus client was actually not sending the http request, but yet waiting for a response, which would never come back, and finish up in a timeout.

again just an impression, but I have developed the feeling that I did not have this behavior when working with the original okhttp client that I was using at the beginning. may be a review of vault's MutinyVertxClientFactory and VertxVaultClient by somebody knowledgeable would be a good thing. may be I am using the vertx web client improperly.

if we want a bandaid, we can certainly do some retries on the type of exception above. note however that it is not as simple as retrying VaultTestExtension.initVault(), because vault's initialization for our tests has some actions that are likely to fail on the second try, if the first try passes. for instance I suspect that attempting to mount twice a secret engine with the same name will result in an error on the second try. so we would have to catch those exceptions for each action that are not idempotent.

if all errors happen in the init phase, we can fix them one by one, a bit painful but certainly doable. I would be surprised though it is is happening only there, and if we make this part more resilient, there is a chance that test may fail anyway on the calls that are made in the tests themselves (as opposed to in the initialization).

I have the feeling that we have an issue at the communication level. it would be ideal to get that solved instead. I realize how painful an unreliable CI can be however, so I can definitely help in making some of these interactions more resilient.

geoand · 2021-10-30T13:55:22Z

again just an impression, but I have developed the feeling that I did not have this behavior when working with the original okhttp client that I was using at the beginning. may be a review of vault's MutinyVertxClientFactory and VertxVaultClient by somebody knowledgeable would be a good thing. may be I am using the vertx web client improperly.

@cescoffier could you perhaps take a look when you have some time?

Hopefully fixes: quarkusio#21095 (cherry picked from commit 4e89bdf)

famod added area/housekeeping Issue type for generalized tasks not related to bugs or enhancements flaky-test labels Oct 29, 2021

quarkus-bot bot added the area/vault label Oct 29, 2021

geoand added a commit to geoand/quarkus that referenced this issue Oct 29, 2021

Increase vault client timeouts

4e89bdf

Hopefully fixes: quarkusio#21095

geoand mentioned this issue Oct 29, 2021

Increase vault client timeouts #21097

Merged

famod closed this as completed in #21097 Oct 29, 2021

famod added a commit that referenced this issue Oct 29, 2021

Merge pull request #21097 from geoand/#21095

d5bbe6f

Increase vault client timeouts

quarkus-bot bot added this to the 2.5 - main milestone Oct 29, 2021

gsmet modified the milestones: 2.5 - main, 2.4.1.Final Nov 2, 2021

gsmet pushed a commit to gsmet/quarkus that referenced this issue Nov 2, 2021

Increase vault client timeouts

e43d521

Hopefully fixes: quarkusio#21095 (cherry picked from commit 4e89bdf)

vsevel mentioned this issue Nov 2, 2021

Check for Vault API to be ready in Vault tests #21154

Merged

gsmet modified the milestones: 2.4.1.Final, 2.4.2.Final Nov 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vault tests are flaky #21095

Vault tests are flaky #21095

famod commented Oct 29, 2021

quarkus-bot bot commented Oct 29, 2021

geoand commented Oct 29, 2021

geoand commented Oct 29, 2021

geoand commented Oct 30, 2021

vsevel commented Oct 30, 2021 •

edited

Loading

geoand commented Oct 30, 2021

Vault tests are flaky #21095

Vault tests are flaky #21095

Comments

famod commented Oct 29, 2021

Description

Implementation ideas

quarkus-bot bot commented Oct 29, 2021

geoand commented Oct 29, 2021

geoand commented Oct 29, 2021

geoand commented Oct 30, 2021

vsevel commented Oct 30, 2021 • edited Loading

geoand commented Oct 30, 2021

vsevel commented Oct 30, 2021 •

edited

Loading