Flaky test TestKafkaStorage/GetTrace #2760
I'd like to give this one a shot
+1
If anyone is following this, I have been able to get a consistent reproduction of this behavior by running the described test in quick succession. I haven't yet found a specific root cause that would explain the behavior, but I hope to soon.
nice!
Thanks a lot for the details. This is above and beyond. Do we have the same setup where multiple tests run against a Kafka broker that starts only once? Or do we only run the test once? In the only-once case, the common problem in integration tests is not ensuring that the dependency is ready. I am not sure if sarama has an API to do something like a "ping" to the broker to validate the connection. Re point 3 - we're currently not specifying any value for RequiredAcks.
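For context, a rough sketch of what a broker readiness check plus an explicit RequiredAcks setting could look like with sarama. This is not Jaeger's actual test code; the broker address, attempt count, and function name are assumptions for illustration:

```go
package main

import (
	"fmt"
	"time"

	"github.com/Shopify/sarama"
)

// waitForKafka repeatedly tries to open a sarama client against the broker
// until it responds, acting as a rough "ping" before the integration test
// starts producing messages. Broker address and attempt count are
// hypothetical values, not the real test configuration.
func waitForKafka(brokerAddr string, attempts int) (sarama.Client, error) {
	config := sarama.NewConfig()
	// Assumed choice: wait for all in-sync replicas to ack each message, so a
	// successful send implies the broker has durably accepted it.
	config.Producer.RequiredAcks = sarama.WaitForAll
	config.Producer.Return.Successes = true

	var lastErr error
	for i := 0; i < attempts; i++ {
		client, err := sarama.NewClient([]string{brokerAddr}, config)
		if err == nil {
			return client, nil
		}
		lastErr = err
		time.Sleep(time.Second)
	}
	return nil, fmt.Errorf("kafka broker %s not ready after %d attempts: %w", brokerAddr, attempts, lastErr)
}
```

A test setup could call something like waitForKafka("localhost:9092", 30) before producing, instead of assuming the container is immediately ready.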
I can give the RequiredAcks suggestion a shot as well as look into the Sarama ping! As for the test execution itself, I believe this is the only one that is running, but I'm not completely sure.
It has been a while (longer than I had hoped), so I figure I should give an update on my findings. I am now quite confident that the flaky behavior mentioned in this issue is caused by a race between the setup time of the Kafka cluster and the amount of time the test waits for messages to become available to the consumer. While the solution does feel lackluster, I think increasing the wait time of the test from 3 seconds to 4 seconds should make this failure almost non-existent on the hardware currently used for testing. I am more than willing to make this change if the solution is accepted.

To break it down a bit more: the Kafka topic that messages are sent on isn't actually created until a producer attempts to send a message. This means there is some overhead on the Kafka side that, in very rare instances, pushes this test past the 3-second interval currently allowed for setup. Couple this with the startup time of the Kafka/Zookeeper container itself and the test has a decent chance of failing.

To back up my findings with a bit of anecdotal evidence (the best kind 😄): the consumer in this test polls for new data once per iteration, with a maximum iteration count of 30. Across many local runs, I have seen successful tests take anywhere from 3 iterations (near-instant completion) all the way up to around 24.

I would also like to note that I don't think the consistent local reproduction I mentioned in previous comments is related to this issue; it just happened to produce similar output. I believe that behavior is the result of a "Join Group" request from the Kafka consumer being delayed by actions on the Kafka cluster itself, triggered by a previously run Jaeger test not calling sarama's Close method and instead being disconnected by the Kafka cluster's own timeouts. If that is the case, a separate issue might need to be created.
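To make the timing concrete, here is a minimal sketch of the kind of bounded polling loop described above. The function name is a placeholder, not Jaeger's actual test helper; 30 iterations at an assumed 100ms interval gives the roughly 3-second window mentioned, and raising the interval (or the iteration count) stretches it to about 4 seconds:

```go
package main

import (
	"fmt"
	"time"
)

// pollForTrace is a hypothetical stand-in for the integration test's retry
// loop: it invokes getTrace once per iteration, up to maxPolls times, and
// sleeps pollInterval between attempts. It fails only after every attempt
// comes back empty.
func pollForTrace(getTrace func() (bool, error), maxPolls int, pollInterval time.Duration) error {
	for i := 0; i < maxPolls; i++ {
		found, err := getTrace()
		if err != nil {
			return err
		}
		if found {
			return nil
		}
		time.Sleep(pollInterval)
	}
	return fmt.Errorf("trace not found after %d polls", maxPolls)
}
```

With a loop like this, the race described above shows up whenever topic creation plus container startup eats most of the maxPolls * pollInterval budget before the first message is readable.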
I think we can try increasing the interval, especially as a temporary measure before a proper solution.
Let's hope #2873 will solve the flakiness
Seen here: https://github.com/jaegertracing/jaeger/pull/2759/checks?check_run_id=1813633773. The mentioned PR does not touch Kafka at all.