Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vault tests are flaky #21095

Closed
famod opened this issue Oct 29, 2021 · 6 comments · Fixed by #21097 or #21154
Closed

Vault tests are flaky #21095

famod opened this issue Oct 29, 2021 · 6 comments · Fixed by #21097 or #21154
Labels
area/housekeeping Issue type for generalized tasks not related to bugs or enhancements area/vault
Milestone

Comments

@famod
Copy link
Member

famod commented Oct 29, 2021

Description

Seen those fail a lot recently (mostly on Java 17?), e.g.:
#21078 (comment)

  • io.quarkus.vault.AgroalVaultITCase
  • io.quarkus.vault.AgroalVaultKv1ITCase
  • io.quarkus.vault.VaultKv1ITCase

Might just be caused by slow CI.

/cc @vsevel @stuartwdouglas

Implementation ideas

No response

@famod famod added area/housekeeping Issue type for generalized tasks not related to bugs or enhancements flaky-test labels Oct 29, 2021
@quarkus-bot
Copy link

quarkus-bot bot commented Oct 29, 2021

/cc @vsevel

@geoand
Copy link
Contributor

geoand commented Oct 29, 2021

Let me try a quick bandaid

geoand added a commit to geoand/quarkus that referenced this issue Oct 29, 2021
@geoand
Copy link
Contributor

geoand commented Oct 29, 2021

Let's see if #21097 does anything

famod added a commit that referenced this issue Oct 29, 2021
Increase vault client timeouts
@quarkus-bot quarkus-bot bot added this to the 2.5 - main milestone Oct 29, 2021
@geoand
Copy link
Contributor

geoand commented Oct 30, 2021

Reopening as #21105 failed.

It seems like the increase in timeout in #21097 got this moving more, but now we have this error:

2021-10-30T02:58:03.7090291Z Caused by: java.lang.RuntimeException: Unable to start Quarkus test resource io.quarkus.vault.test.VaultTestLifecycleManager@5f59376
2021-10-30T02:58:03.7092400Z 	at io.quarkus.test.common.TestResourceManager.lambda$start$1(TestResourceManager.java:130)
2021-10-30T02:58:03.7093748Z 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
2021-10-30T02:58:03.7136764Z 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
2021-10-30T02:58:03.7361736Z 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2021-10-30T02:58:03.7363435Z 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2021-10-30T02:58:03.7364488Z 	at java.base/java.lang.Thread.run(Thread.java:829)
2021-10-30T02:58:03.7365754Z Caused by: io.quarkus.vault.runtime.VaultIOException: io.vertx.core.VertxException: Connection was closed
2021-10-30T02:58:03.7367329Z 	at io.quarkus.vault.runtime.client.VertxVaultClient.exec(VertxVaultClient.java:202)
2021-10-30T02:58:03.7368985Z 	at io.quarkus.vault.runtime.client.VertxVaultClient.exec(VertxVaultClient.java:172)
2021-10-30T02:58:03.7370634Z 	at io.quarkus.vault.runtime.client.VertxVaultClient.post(VertxVaultClient.java:116)
2021-10-30T02:58:03.7372459Z 	at io.quarkus.vault.runtime.client.VertxVaultClient.post(VertxVaultClient.java:110)
2021-10-30T02:58:03.7374626Z 	at io.quarkus.vault.test.client.TestVaultClient.generateAppRoleSecretId(TestVaultClient.java:30)
2021-10-30T02:58:03.7376987Z 	at io.quarkus.vault.test.VaultTestExtension.initVault(VaultTestExtension.java:290)
2021-10-30T02:58:03.7419859Z 	at io.quarkus.vault.test.VaultTestExtension.start(VaultTestExtension.java:244)
2021-10-30T02:58:03.7422290Z 	at io.quarkus.vault.test.VaultTestLifecycleManager.start(VaultTestLifecycleManager.java:33)
2021-10-30T02:58:03.7424198Z 	at io.quarkus.test.common.TestResourceManager.lambda$start$1(TestResourceManager.java:125)

@vsevel any idea what's going on? Should we perhaps automatically retry when the connection was closed?

@vsevel
Copy link
Contributor

vsevel commented Oct 30, 2021

any idea what's going on?

hi @geoand,

that is how I create the web client:

        WebClientOptions options = new WebClientOptions()
                .setConnectTimeout((int) vaultBootstrapConfig.connectTimeout.toMillis())
                .setIdleTimeout((int) vaultBootstrapConfig.readTimeout.getSeconds());

I am not convinced using the concept of read timeout does translate well into vertx's idle timeout. but when I did this, I did not find any readTimeout or soTimeout in the WebClientOptions as I was expecting.

I use the read timeout as well in:

    public Buffer get(String path, String token) {
        final HttpRequest<Buffer> request = builder(path, token).method(HttpMethod.GET);
        final HttpResponse<Buffer> response = request.send().await().atMost(getRequestTimeout());

...

    private Duration getRequestTimeout() {
        return vaultConfigHolder.getVaultBootstrapConfig().readTimeout;
    }

so the readTimeout config parameter is used both in setIdleTimeout() and atMost(). may be in the case above, the idleTimeout closed the connection before the atMost was able to throw a timeout exception. may be we could get the atMost to be smaller than the idleTimeout. Or we should have 2 proper params.

that said, we would have an issue anyway (whether it is a closed connection or a timeout), because for some reason there is an issue communicating with vault.

in our project we have been hit a lot with timeout exceptions communicating with vault. since we are using the vault config source, those exceptions would happen at startup time, and get the app to not start, get killed, and we would have to get it to restart. that is why I did #20343.

we tried to investigate this communication issue, looking at vault's audit logs. but all calls received by vault were processed in 5ms. we could not pinpoint something wrong in vault. rather than vault being slow, this gave the impression (without being able to prove it) that the quarkus client was actually not sending the http request, but yet waiting for a response, which would never come back, and finish up in a timeout.

again just an impression, but I have developed the feeling that I did not have this behavior when working with the original okhttp client that I was using at the beginning. may be a review of vault's MutinyVertxClientFactory and VertxVaultClient by somebody knowledgeable would be a good thing. may be I am using the vertx web client improperly.

if we want a bandaid, we can certainly do some retries on the type of exception above. note however that it is not as simple as retrying VaultTestExtension.initVault(), because vault's initialization for our tests has some actions that are likely to fail on the second try, if the first try passes. for instance I suspect that attempting to mount twice a secret engine with the same name will result in an error on the second try. so we would have to catch those exceptions for each action that are not idempotent.

if all errors happen in the init phase, we can fix them one by one, a bit painful but certainly doable. I would be surprised though it is is happening only there, and if we make this part more resilient, there is a chance that test may fail anyway on the calls that are made in the tests themselves (as opposed to in the initialization).

I have the feeling that we have an issue at the communication level. it would be ideal to get that solved instead. I realize how painful an unreliable CI can be however, so I can definitely help in making some of these interactions more resilient.

@geoand
Copy link
Contributor

geoand commented Oct 30, 2021

again just an impression, but I have developed the feeling that I did not have this behavior when working with the original okhttp client that I was using at the beginning. may be a review of vault's MutinyVertxClientFactory and VertxVaultClient by somebody knowledgeable would be a good thing. may be I am using the vertx web client improperly.

@cescoffier could you perhaps take a look when you have some time?

@gsmet gsmet modified the milestones: 2.5 - main, 2.4.1.Final Nov 2, 2021
gsmet pushed a commit to gsmet/quarkus that referenced this issue Nov 2, 2021
Hopefully fixes: quarkusio#21095

(cherry picked from commit 4e89bdf)
@gsmet gsmet modified the milestones: 2.4.1.Final, 2.4.2.Final Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/housekeeping Issue type for generalized tasks not related to bugs or enhancements area/vault
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants