Increase the HTTPMultiBin server timeout on flaky test #3545
Conversation
Codecov Report
Attention:
Additional details and impacted files
@@ Coverage Diff @@
## master #3545 +/- ##
==========================================
- Coverage 72.97% 72.93% -0.04%
==========================================
Files 280 278 -2
Lines 20941 20944 +3
==========================================
- Hits 15281 15276 -5
- Misses 4690 4694 +4
- Partials 970 974 +4
I don't mind increasing the timeout of the server, but unfortunately, I bet it doesn't fix the flaky test that you've mentioned. Long story short, the cause of the issue is not in our code base but in the gRPC package, which gives no guarantee that the error is propagated in time. For instance, if we look closer at the logs of the CI job (in fact, they are direct logs from the gRPC package), we see that the expected message is in the CI output, but it wasn't propagated by grpc, and as a result the test fails with a context error since it simply can't wait any longer. I even opened an issue against grpc-go and planned to work on it, but that stalled since I don't have the capacity or a good understanding of gRPC's internals. On our side, the issue is #3551.
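To make the failure mode concrete, here is a minimal, hedged sketch (not the actual k6 test; the address and credentials are placeholders) of the waiting pattern involved: the client waits for the gRPC connection state to settle, and if grpc-go never surfaces the expected error, the wait simply runs into the context deadline.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The 2-second deadline mirrors the kind of timeout discussed in this PR.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// "localhost:10000" and the insecure credentials are placeholders; the
	// real test dials an HTTPMultiBin server with TLS settings that are
	// expected to produce a certificate error.
	conn, err := grpc.DialContext(ctx, "localhost:10000",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		fmt.Println("dial error:", err)
		return
	}
	defer conn.Close()

	conn.Connect() // kick off connecting; DialContext is non-blocking by default

	for state := conn.GetState(); state != connectivity.Ready; state = conn.GetState() {
		// If grpc-go never surfaces the expected error as a state change,
		// this returns false once ctx expires: the "context deadline
		// exceeded" failure mode described above.
		if !conn.WaitForStateChange(ctx, state) {
			fmt.Println("gave up waiting:", ctx.Err())
			return
		}
	}
	fmt.Println("connected")
}
```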
Aha! I think that unfortunately you're right (see https://github.com/grafana/k6/actions/runs/7543208013/job/20546713199), so I guess it's just a matter of probability 🤷🏻 Thanks for the context and explanations!

PS: Although I have to admit I'm curious about why it's flaky, or even more, why it's almost impossible to make it fail locally. If it's purely a bug in the code (it considers the certificate errors temporary and just waits for the context to be exceeded), shouldn't it be more deterministic?

PS2: Indeed, I feel like if we increased both timeouts enough (client & server), it would never fail, because it seems to keep waiting here and then follow this path, which ends up returning the connection error. That probably explains why increasing the server timeout at least seems to make the test less prone to fail. But yeah, I'm not super convinced that's what we really want, because increasing timeouts means the server keeps waiting in the aforementioned line until the context deadline is exceeded anyway (even if the test finishes successfully).
I guess it's possible that we have to instruct tests with something like ...
I think the predominant problem is that the connection error is stored in a structure and could be overwritten, and keeping in mind that internally gRPC manages things concurrently, we may not find what we expect there. This theory is supported by the fact that the gRPC team says we shouldn't rely on these connection errors; it's even considered an anti-pattern there. I believe they are looking at them from the regular user's perspective, but from our testing/load-testing perspective, having a chance to catch these connection errors is useful.
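As a toy illustration of the overwrite problem described above (this is not grpc-go code, just an assumed, simplified model), the last connection error lives in a single shared field, so which error a reader observes depends on goroutine scheduling:

```go
package main

import (
	"fmt"
	"sync"
)

// connState is a stand-in (not grpc-go's real type) for a connection that
// remembers only its most recent error.
type connState struct {
	mu      sync.Mutex
	lastErr error
}

func (c *connState) setErr(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastErr = err // each new failure overwrites the previous one
}

func (c *connState) err() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.lastErr
}

func main() {
	c := &connState{}
	var wg sync.WaitGroup
	for _, e := range []error{
		fmt.Errorf("x509: certificate error"), // the error the test wants to see
		fmt.Errorf("connection refused"),      // a later, less specific failure
	} {
		wg.Add(1)
		go func(e error) { defer wg.Done(); c.setErr(e) }(e)
	}
	wg.Wait()
	// Which error we observe is non-deterministic: the interesting TLS error
	// may already have been replaced by the time the test reads it.
	fmt.Println("observed:", c.err())
}
```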
Okay, I see now what you mean. So, ...
Yeah, I get that error generalization is tricky, and for instance the reason why ... However, it's not as easy as that, because although the returned error models a ...
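For what it's worth, here is a hedged sketch of the kind of error classification being discussed (my own assumption, not code from k6 or grpc-go): callers usually inspect a gRPC error through its status code rather than through the free-form connection error text.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// classify is a hypothetical helper: it shows the usual way of inspecting a
// gRPC error via its status code instead of its connection error message.
func classify(err error) string {
	s, ok := status.FromError(err)
	if !ok {
		return "not a gRPC status error"
	}
	switch s.Code() {
	case codes.Unavailable:
		return "transport failure (TLS/connection problems typically surface here)"
	case codes.DeadlineExceeded:
		return "context deadline expired before the call completed"
	default:
		return "gRPC status: " + s.Code().String()
	}
}

func main() {
	// A made-up error text, similar in spirit to what the flaky test asserts on.
	err := status.Error(codes.Unavailable, "connection error: x509: certificate is not valid")
	fmt.Println(classify(err))
}
```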
My issue is that this test case sounds a bit artificial to me. When I tried to investigate, I found that this is the kind of error we get when we try to use a valid certificate, but one issued for email signing. I also believe that in that case we are actually testing not k6's functionality but rather TLS or grpc, with the caveat that we are "proxying" the errors back to the client. Keeping in mind the comments from above, maybe it's better to remove that test case and make a comment in ...
What?
It increases the server timeout of the `HTTPMultiBin` server for a specific flaky test that fails from time to time (for instance here), from `2*time.Second` to `5*time.Second`.
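For illustration only, and assuming a plain `net/http/httptest`-backed helper (the real HTTPMultiBin helper may configure its server differently), the kind of change is roughly:

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	srv := httptest.NewUnstartedServer(handler)
	// Hypothetical equivalent of the change: give the test server more room
	// before it times out, 5s instead of the previous 2s.
	srv.Config.ReadTimeout = 5 * time.Second
	srv.Config.WriteTimeout = 5 * time.Second
	srv.StartTLS()
	defer srv.Close()
}
```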
Why?
After looking at the code involved and trying a few possible solutions, it seemed to me that the root cause was the server timing out from time to time, depending on the conditions the test is executed under (runner resources, etc.).
In fact, it's difficult to say whether it will never fail again, because it's not that easy to reproduce. To do so, I tried running the specific test multiple times (`-count=X`), with the race detector enabled (`-race`), with a limited amount of resources (`GOMAXPROCS=1`), and on multiple runners (windows, macos, etc.), and at the very least the change seems to greatly reduce the chances of failure (previously it always failed under the aforementioned conditions).

Checklist
I have run the linter locally (`make lint`) and all checks pass.
I have run tests locally (`make tests`) and all tests pass.

Related PR(s)/Issue(s)
Contributes to #2144