If the connection to a partition leader is broken, KafkaEx handles it by triggering a metadata request and updating its cached metadata. However, the leader can also change while the broker connection stays intact. For example, you can manually trigger a leader election, reassign partitions, or a network partition with ZooKeeper can make the controller think a partition leader dropped out of the cluster.
If the metadata is stale in such cases, KafkaEx will attempt to produce to the old partition leader and get a `not_leader_for_partition` error back. But this error is not handled (https://github.com/kafkaex/kafka_ex/blob/master/lib/kafka_ex/server.ex#L481). KafkaEx should trigger a metadata update when this happens and then retry the produce request. It wasn't easy, but I did manage to reproduce the bug locally.
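To sketch what I mean, the handling could look something like this (a hypothetical sketch only: `produce_with_retry/3` and `do_produce/2` are illustrative names, not KafkaEx's actual internals):

```elixir
# Hypothetical sketch of retrying a produce request after a
# :not_leader_for_partition error. Helper names are illustrative.
defp produce_with_retry(request, state, retries \\ 1) do
  case do_produce(request, state) do
    {:error, :not_leader_for_partition} when retries > 0 ->
      # Leadership changed while the connection stayed up: refresh the
      # cached metadata so the new leader is known, then retry once.
      new_state = update_metadata(state)
      produce_with_retry(request, new_state, retries - 1)

    other ->
      # Success, or a non-retriable error: pass the result through.
      other
  end
end
```

A single bounded retry after a forced metadata refresh should be enough, since the refreshed metadata will point at the new leader.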
With the default `metadata_update_interval` of 30s, produce requests could keep failing for up to 30s before the cached metadata is corrected, which is unacceptable for apps with high produce rates.
Once this problem is fixed, we would love a config option to disable the periodic metadata updates entirely. For an app that only produces messages, and at a relatively high rate, these metadata updates don't add any value: any change in the metadata would be noticed by the first failing produce request and refreshed at that point. What do you think?
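If such an option existed, usage might look like this (purely hypothetical: `:disabled` is not a value KafkaEx currently accepts, and today `metadata_update_interval` only takes a millisecond interval):

```elixir
# config/config.exs — hypothetical option: KafkaEx does not currently
# support disabling the periodic metadata refresh.
config :kafka_ex,
  # Rely solely on error-driven metadata refreshes (e.g. after a
  # :not_leader_for_partition error) instead of a 30s timer.
  metadata_update_interval: :disabled
```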
I started working on a PR to handle `not_leader_for_partition`. I had a question about this code: https://github.com/kafkaex/kafka_ex/blob/master/lib/kafka_ex/server.ex#L430-L439

Why does it call `retrieve_metadata()` and then `update_metadata()` afterward? The first call seems redundant.

@shamilpd My apologies, I thought I had already responded to this 😞

The fixes you're proposing seem reasonable.

As far as the code you mention in `server.ex`: that code is a bit of a mess, so I'm frankly not surprised there's a redundant call. I don't see any reason for it, although I might be missing something. If the tests pass, I would feel pretty comfortable removing that call.