Producer doesn't handle `not_leader_for_partition` errors #396

shamilpd · 2020-01-21T17:06:56Z

If the connection to a partition leader is broken, KafkaEx handles it by triggering a metadata request and updating its cached metadata. However, the leader can also change while the broker connection is intact. For example, you can manually trigger a leader election or reassign partitions, or a network partition with ZooKeeper can make the controller think that a partition leader dropped out of the cluster.

If the metadata is not up to date in such cases, KafkaEx will attempt to produce to the old partition leader and will get a `not_leader_for_partition` error back. This error is not handled (https://github.com/kafkaex/kafka_ex/blob/master/lib/kafka_ex/server.ex#L481). KafkaEx should trigger a metadata update when this happens and try the produce request again. It wasn't easy, but I did manage to reproduce the bug locally.
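To make the proposed behavior concrete, here is a minimal sketch of the refresh-and-retry idea. It is not KafkaEx code: the module `ProduceRetrySketch`, the function `produce_with_retry/3`, and the two callbacks are hypothetical names used only for illustration, and the sketch assumes the produce call returns `{:error, :not_leader_for_partition}` when it hits a stale leader.

```elixir
defmodule ProduceRetrySketch do
  # Hypothetical helper, not part of KafkaEx.
  #
  # produce_fun          - zero-arity function that performs the produce request
  # refresh_metadata_fun - zero-arity function that refreshes cached metadata
  # retries              - how many times to retry after a stale-leader error
  def produce_with_retry(produce_fun, refresh_metadata_fun, retries \\ 1) do
    case produce_fun.() do
      # Assumed error shape: the cached leader is stale, so refresh and retry.
      {:error, :not_leader_for_partition} when retries > 0 ->
        refresh_metadata_fun.()
        produce_with_retry(produce_fun, refresh_metadata_fun, retries - 1)

      # Success, a different error, or retries exhausted: return as-is.
      other ->
        other
    end
  end
end
```

Bounding the number of retries keeps a produce call from looping forever if the leader election has not completed by the time the metadata is refreshed.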

With the default `metadata_update_interval` of 30s, produce requests could fail for up to 30s before the issue is fixed, which is not acceptable for apps that have high produce rates.

Once this problem is fixed, we would love a config option to disable the periodic metadata updates. If an app only produces messages, and does so at a relatively high rate, these metadata updates don't add any value. Any change in the metadata would be noticed by the first produce request and subsequently updated. What do you think?

I started working on a PR to handle `not_leader_for_partition`. I had a question about this code: https://github.com/kafkaex/kafka_ex/blob/master/lib/kafka_ex/server.ex#L430-L439
Why does it call `retrieve_metadata()` and then `update_metadata()` afterwards? It seems like the first call is redundant.


joshuawscott · 2020-02-06T13:29:30Z

@shamilpd My apologies, I thought I had already responded to this 😞

The fixes you're proposing seem reasonable.

As for the code you mention in server.ex, that code is a bit of a mess, so I'm frankly not surprised there's a redundant call. I don't see any reason for it, although I might be missing something. If the tests pass, then I would feel pretty comfortable with removing that.
