Flaky test TestKafkaStorage/GetTrace #2760

Closed
jpkrohling opened this issue Feb 2, 2021 · 10 comments
Labels
bug, good first issue, help wanted

Comments

@jpkrohling
Contributor

Seen here: https://github.com/jaegertracing/jaeger/pull/2759/checks?check_run_id=1813633773. The mentioned PR does not touch Kafka at all.

    integration.go:188: 
        	Error Trace:	integration.go:188
        	Error:      	Should be true
        	Test:       	TestKafkaStorage/GetTrace
    trace_compare.go:57: 
        	Error Trace:	trace_compare.go:57
        	            				integration.go:189
        	Error:      	Expected value not to be nil.
        	Test:       	TestKafkaStorage/GetTrace
--- FAIL: TestKafkaStorage (4.15s)
    --- FAIL: TestKafkaStorage/GetTrace (3.02s)
@LostLaser
Contributor

I'd like to give this one a shot

@yurishkuro
Member

+1

@yurishkuro added the help wanted and good first issue labels on Feb 3, 2021
@LostLaser
Contributor

If anyone is following this, I have been able to reproduce this behavior consistently by running the described test in quick succession. I haven't yet found a specific root cause that explains it, but I hope to soon.

@yurishkuro
Member

nice!

@LostLaser
Contributor

As mentioned a few days ago, I was able to reproduce the same error message mentioned in this issue by running this test case in quick succession under these conditions:

  1. The Docker container that hosts the Kafka/Zookeeper instance persists between runs (a deviation from the test suite, which stops the Docker process after each run).
  2. The test case is run, and then run again within roughly 15 seconds of the initial run. Subsequent runs within that 15-second window of the previous run produce the error.

Based on the attached Wireshark logs I have narrowed this issue down to a few possibilities:

  1. The connection after each run is not being properly terminated.
  2. The Kafka container used to test is misconfigured in a way that restricts connections.
  3. There is a bug in the Shopify/sarama package used to interface with Kafka; a potential root cause is that there is no assertion that the Kafka consumer group has been joined before messages are sent.

The most important thing to note about the Wireshark logs, other than the absence of fetch calls in the failing case, is when the join group response arrives in the successful run versus the failing run. In some instances the join group response arrived ~2-3 seconds after the test had already failed.

Wireshark logs (attached screenshots): failing run output (kafka-failure) and successful run output (kafka-success).

At this point, I'm pretty stuck on how to proceed with a solution here and welcome any suggestions.

Note: While the error output is the same, the environment required to reproduce the error is not. I was not able to reproduce the error when using the kafka-integration-test.sh script that the CI pipeline uses to execute the test case, which indicates that my test bench may be flawed and/or introducing the error.

@yurishkuro
Member

Thanks a lot for the details. This is above and beyond.

Do we have a setup where multiple tests run against a Kafka broker that starts only once, or do we only run the test once? In the only-once case, the common problem in integration tests is not ensuring that the dependency is ready. I am not sure if sarama has an API to do something like a "ping" to the broker to validate the connection.
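
A minimal readiness check along those lines, sketched with sarama's client API (the retry count, delay, and broker addresses are placeholders, not values from the test suite):

    // Sketch of a broker readiness check: sarama.NewClient only succeeds
    // once it can reach a broker and fetch metadata, so retrying it acts
    // as a rough "ping". Retry count and delay are illustrative.
    package kafkatest

    import (
        "time"

        "github.com/Shopify/sarama"
    )

    func waitForBroker(brokers []string, retries int, delay time.Duration) error {
        var lastErr error
        for i := 0; i < retries; i++ {
            client, err := sarama.NewClient(brokers, sarama.NewConfig())
            if err == nil {
                client.Close()
                return nil // broker reachable, metadata fetched
            }
            lastErr = err
            time.Sleep(delay)
        }
        return lastErr
    }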

Re point 3 - we're currently not specifying any value for kafka.producer.RequiredAcks. If we set it to WaitForLocal, the driver will wait for an ACK from the broker that the message was received, or return an error, which would fail the test at the writing stage rather than the reading stage. That can be used to confirm the "broker not ready" hypothesis.
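
For illustration, a sketch of what that producer configuration could look like with sarama (broker addresses, topic name, and function name are placeholders, not the test's actual values):

    // Sketch: require the leader's ack so "broker not ready" surfaces at
    // produce time instead of the read side. Names here are illustrative.
    package kafkatest

    import "github.com/Shopify/sarama"

    func produceWithAck(brokers []string, topic string) error {
        cfg := sarama.NewConfig()
        cfg.Producer.RequiredAcks = sarama.WaitForLocal // wait for the leader's local ack
        cfg.Producer.Return.Successes = true            // required by SyncProducer

        producer, err := sarama.NewSyncProducer(brokers, cfg)
        if err != nil {
            return err
        }
        defer producer.Close()

        _, _, err = producer.SendMessage(&sarama.ProducerMessage{
            Topic: topic,
            Value: sarama.StringEncoder("ping"),
        })
        return err // with WaitForLocal, a not-ready broker fails here, at write time
    }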

@LostLaser
Contributor

I can give the RequiredAcks suggestion a shot as well as look into the Sarama ping! As for the test execution itself, I believe this is the only one that is running, but I’m not completely sure.

@LostLaser
Contributor

It has been a while (longer than I had hoped), so I figure I should give an update on my findings. Currently, I am quite confident that the flaky behavior mentioned in this issue is caused by a race between the setup time of the Kafka cluster and the amount of time the mentioned test waits for messages to become available to the consumer. While the solution does feel lackluster, I think increasing the wait time of the test from 3 seconds to 4 seconds should make this failure almost non-existent on the hardware currently used to test. I am more than willing to make this change if the solution is accepted.

To break it down a bit more: I have noticed that the Kafka topic that messages are sent on isn't actually created until a producer attempts to send a message. This means there is some overhead on the Kafka side that, in very rare instances, pushes this test past the 3-second window currently allowed for such setup. Couple this with the startup time of the Kafka/Zookeeper container itself and the test has a decent chance of failing.
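
One way to take that auto-creation cost out of the race would be to create the topic explicitly before producing. This is not what the test does today, just a sketch assuming sarama's ClusterAdmin API (addresses, topic name, and protocol version are placeholders):

    // Sketch: pre-create the topic so the first produce doesn't pay the
    // auto-creation overhead. Returns an error if the topic already exists.
    package kafkatest

    import "github.com/Shopify/sarama"

    func ensureTopic(brokers []string, topic string) error {
        cfg := sarama.NewConfig()
        cfg.Version = sarama.V1_0_0_0 // admin requests need a new enough protocol version

        admin, err := sarama.NewClusterAdmin(brokers, cfg)
        if err != nil {
            return err
        }
        defer admin.Close()

        // Single partition, replication factor 1 - enough for an integration test.
        return admin.CreateTopic(topic, &sarama.TopicDetail{
            NumPartitions:     1,
            ReplicationFactor: 1,
        }, false)
    }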

To back up my findings with a bit of anecdotal evidence (the best kind 😄): the consumer in this test polls for new data once per iteration, with the max iteration count set to 30. Across my many local runs, I have seen successful runs complete in anywhere from 3 iterations (near-instant completion) up to around 24.
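
For reference, the poll-until-found pattern described above looks roughly like this (a sketch, not the actual code in integration.go; the function name, interval, and the assumption of a 100ms step are illustrative):

    // Sketch of the poll loop: 30 iterations at ~100ms is about the 3s window;
    // raising the iteration cap or the interval gives roughly the proposed 4s.
    package kafkatest

    import (
        "context"
        "time"

        "github.com/jaegertracing/jaeger/model"
        "github.com/jaegertracing/jaeger/storage/spanstore"
    )

    func waitForTrace(ctx context.Context, reader spanstore.Reader, id model.TraceID) (*model.Trace, error) {
        const iterations = 30
        const interval = 100 * time.Millisecond

        var lastErr error
        for i := 0; i < iterations; i++ {
            trace, err := reader.GetTrace(ctx, id)
            if err == nil && trace != nil && len(trace.Spans) > 0 {
                return trace, nil // consumer caught up within the window
            }
            lastErr = err
            time.Sleep(interval)
        }
        return nil, lastErr
    }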

I would also like to note that I don't think the consistent local reproduction I mentioned in previous comments is related to this issue; it just happened to produce similar output. I believe that behavior is the result of a "Join Group" request from the Kafka consumer being delayed by actions on the Kafka cluster itself, triggered by a previously run Jaeger test not calling sarama's Close method and instead being disconnected by the Kafka cluster's own timeouts. If that is the case, a new issue might need to be created.

@jpkrohling
Contributor Author

I think we can try increasing the interval, especially as a temporary measure before a proper solution.

@yurishkuro
Member

Let's hope #2873 will solve the flakiness
