Flaky test TestKafkaStorage/GetTrace #2760

Closed
jpkrohling opened this issue Feb 2, 2021 · 10 comments
Labels
bug, good first issue, help wanted

Comments

@jpkrohling
Contributor

Seen here: https://github.com/jaegertracing/jaeger/pull/2759/checks?check_run_id=1813633773. The mentioned PR does not touch Kafka at all.

    integration.go:188: 
        	Error Trace:	integration.go:188
        	Error:      	Should be true
        	Test:       	TestKafkaStorage/GetTrace
    trace_compare.go:57: 
        	Error Trace:	trace_compare.go:57
        	            				integration.go:189
        	Error:      	Expected value not to be nil.
        	Test:       	TestKafkaStorage/GetTrace
--- FAIL: TestKafkaStorage (4.15s)
    --- FAIL: TestKafkaStorage/GetTrace (3.02s)
@LostLaser
Contributor

I'd like to give this one a shot

@yurishkuro
Member

+1

@yurishkuro added the help wanted and good first issue labels on Feb 3, 2021
@LostLaser
Contributor

If anyone is following this, I have been able to reproduce this behavior consistently by running the described test in quick succession. I haven't yet found a specific root cause that explains it, but I hope to soon.

@yurishkuro
Member

nice!

@LostLaser
Contributor

As mentioned a few days ago, I was able to reproduce the same error message mentioned in this issue by running this test case in quick succession under these conditions:

  1. The Docker container that hosts the Kafka/Zookeeper instance persists between runs (a deviation from the test suite, which stops the Docker process after each run).
  2. The test case is run, and then run again within roughly 15 seconds of the initial run. Subsequent runs within that 15-second window of the previous run produce the error.

Based on the attached Wireshark logs I have narrowed this issue down to a few possibilities:

  1. The connection after each run is not being properly terminated.
  2. The Kafka container used to test is misconfigured in a way that restricts connections.
  3. There is a bug in the Shopify/sarama package used to interface with Kafka; a potential root cause is that there is no assertion that the Kafka consumer group has been joined before messages are sent.

The most important thing to note about the Wireshark logs, other than the absence of fetch calls in the failing case, is when the join group response arrives in the successful run versus the failing run. In some instances the join group response arrived ~2-3 seconds after the test had already failed.

Wireshark logs (attached screenshots): failing run output (kafka-failure) and successful run output (kafka-success).

At this point, I'm pretty stuck on how to proceed with a solution here and welcome any suggestions.

Note: While the error output is the same, the environment required to reproduce the error is not. I was not able to reproduce the error when using the kafka-integration-test.sh script that the CI pipeline uses to execute the test case, which indicates that my test bench may be flawed and/or introducing the error.

@yurishkuro
Member

Thanks a lot for the details. This is above and beyond.

Do we have a setup where multiple tests run against a Kafka broker that starts only once, or do we only run the test once? In the only-once case, the common problem in integration tests is not ensuring that the dependency is ready. I am not sure if sarama has an API to do something like a "ping" to the broker to validate the connection.
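
A minimal readiness check along those lines, sketched with sarama's client API (the retry count, delay, and broker addresses are placeholders, not values from the test suite):

    // Sketch of a broker readiness check: sarama.NewClient only succeeds
    // once it can reach a broker and fetch metadata, so retrying it acts
    // as a rough "ping". Retry count and delay are illustrative.
    package kafkatest

    import (
        "time"

        "github.com/Shopify/sarama"
    )

    func waitForBroker(brokers []string, retries int, delay time.Duration) error {
        var lastErr error
        for i := 0; i < retries; i++ {
            client, err := sarama.NewClient(brokers, sarama.NewConfig())
            if err == nil {
                client.Close()
                return nil // broker reachable, metadata fetched
            }
            lastErr = err
            time.Sleep(delay)
        }
        return lastErr
    }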

Re point 3 - we're currently not specifying any value for kafka.producer.RequiredAcks. If we set it to WaitForLocal, the driver will wait for an ACK from the broker that the message was received, or return an error, which would fail the test at the writing stage rather than the reading stage. That can be used to confirm the "broker not ready" hypothesis.
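
For illustration, a sketch of what that producer configuration could look like with sarama (broker addresses, topic name, and function name are placeholders, not the test's actual values):

    // Sketch: require the leader's ack so "broker not ready" surfaces at
    // produce time instead of the read side. Names here are illustrative.
    package kafkatest

    import "github.com/Shopify/sarama"

    func produceWithAck(brokers []string, topic string) error {
        cfg := sarama.NewConfig()
        cfg.Producer.RequiredAcks = sarama.WaitForLocal // wait for the leader's local ack
        cfg.Producer.Return.Successes = true            // required by SyncProducer

        producer, err := sarama.NewSyncProducer(brokers, cfg)
        if err != nil {
            return err
        }
        defer producer.Close()

        _, _, err = producer.SendMessage(&sarama.ProducerMessage{
            Topic: topic,
            Value: sarama.StringEncoder("ping"),
        })
        return err // with WaitForLocal, a not-ready broker fails here, at write time
    }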

@LostLaser
Contributor

I can give the RequiredAcks suggestion a shot as well as look into the Sarama ping! As for the test execution itself, I believe this is the only one that is running, but I’m not completely sure.

@LostLaser
Contributor

It has been a while (longer than I had hoped), so I figure I should give an update on my findings. Currently, I am quite confident that the flaky behavior mentioned in this issue is caused by a race between the setup time of the Kafka cluster and the amount of time the mentioned test waits for messages to become available to the consumer. While the solution does feel lackluster, I think increasing the wait time of the test from 3 seconds to 4 seconds should make this failure almost non-existent on the hardware currently used to test. I am more than willing to make this change if the solution is accepted.

To break it down a bit more: I have noticed that the Kafka topic that messages are sent on isn't actually created until a producer attempts to send a message. This means there is some overhead on the Kafka side that, in very rare instances, pushes this test past the 3-second window currently allowed for such setup. Couple this with the startup time of the Kafka/Zookeeper container itself and the test has a decent chance of failing.
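
One way to take that auto-creation cost out of the race would be to create the topic explicitly before producing. This is not what the test does today, just a sketch assuming sarama's ClusterAdmin API (addresses, topic name, and protocol version are placeholders):

    // Sketch: pre-create the topic so the first produce doesn't pay the
    // auto-creation overhead. Returns an error if the topic already exists.
    package kafkatest

    import "github.com/Shopify/sarama"

    func ensureTopic(brokers []string, topic string) error {
        cfg := sarama.NewConfig()
        cfg.Version = sarama.V1_0_0_0 // admin requests need a new enough protocol version

        admin, err := sarama.NewClusterAdmin(brokers, cfg)
        if err != nil {
            return err
        }
        defer admin.Close()

        // Single partition, replication factor 1 - enough for an integration test.
        return admin.CreateTopic(topic, &sarama.TopicDetail{
            NumPartitions:     1,
            ReplicationFactor: 1,
        }, false)
    }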

To back up my findings with a bit of anecdotal evidence (the best kind 😄): the consumer in this test polls for new data once per iteration, with the max iteration count set to 30. Across my many local runs, I have seen successful runs complete in anywhere from 3 iterations (near-instant completion) up to around 24.
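
For reference, the poll-until-found pattern described above looks roughly like this (a sketch, not the actual code in integration.go; the function name, interval, and the assumption of a 100ms step are illustrative):

    // Sketch of the poll loop: 30 iterations at ~100ms is about the 3s window;
    // raising the iteration cap or the interval gives roughly the proposed 4s.
    package kafkatest

    import (
        "context"
        "time"

        "github.com/jaegertracing/jaeger/model"
        "github.com/jaegertracing/jaeger/storage/spanstore"
    )

    func waitForTrace(ctx context.Context, reader spanstore.Reader, id model.TraceID) (*model.Trace, error) {
        const iterations = 30
        const interval = 100 * time.Millisecond

        var lastErr error
        for i := 0; i < iterations; i++ {
            trace, err := reader.GetTrace(ctx, id)
            if err == nil && trace != nil && len(trace.Spans) > 0 {
                return trace, nil // consumer caught up within the window
            }
            lastErr = err
            time.Sleep(interval)
        }
        return nil, lastErr
    }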

I would also like to note that I don't think the consistent local reproduction I mentioned in previous comments is related to this issue; it just happened to produce similar output. I believe that behavior is the result of a "Join Group" request from the Kafka consumer being delayed by actions on the Kafka cluster itself, triggered by a previously run Jaeger test not calling sarama's Close method and instead being disconnected by the Kafka cluster's own timeouts. If that is the case, a new issue might need to be created.

@jpkrohling
Contributor Author

I think we can try increasing the interval, especially as a temporary measure before a proper solution.

@yurishkuro
Member

Let's hope #2873 will solve the flakiness
