Properly report retry counts when refreshing the lead broker #7
Conversation
Thanks for the explanation; I understand the change. Is this a change we'd have to permanently keep in our fork, meaning it's just how we want Sarama to work, or is this potentially a bug in Sarama itself whose fix we should try to push upstream as well?
The change itself makes sense to me for our use case. I left a question comment, but it shouldn't block this PR.
My expectation is that Sarama will consider this a bug and want to fix it, though possibly not in the exact same way I did. (There is also possible data loss in the existing code when a broker is down, but I don't know Sarama well enough to evaluate that part; it might just be handled elsewhere.) I'll be filing a bug with them shortly. Our hope is still to drop this fork entirely once our blockers are addressed in the current release, but this does add one more blocker to the list.

I've been testing this PR more, and I found a reproducible crash with the fix applied when a lead broker goes down and then comes back up, so evidently something isn't quite right yet. I will follow up here once I track it down.
Fixed (knock on wood). To address the problem, I adjusted this PR to combine the old and new approaches: I still track leader-update retries separately from message retries when determining the wait time, but I also report a missing leader as an immediate error for that message and continue to the next loop iteration, rather than blocking until I find one. This ensures progress while still reporting the relevant errors and applying the correct backoff.
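In rough terms, the combined approach looks like the sketch below. This is a simplified illustration only; every identifier here (`broker`, `refreshLeader`, `failMessage`, `produceLoop`) is invented for the example and does not match Sarama's internals.

```go
package main

import "time"

type message struct{ payload []byte }

type broker struct{}

func (b *broker) send(m *message) error { return nil }

// refreshLeader stands in for a metadata refresh that can fail while
// the lead broker is down.
func refreshLeader() (*broker, error) { return &broker{}, nil }

// failMessage reports an immediate error for one message so the loop
// can keep making progress.
func failMessage(m *message, err error) {}

// backoff maps a retry count to a wait time (exponential in real use).
func backoff(retries int) time.Duration {
	return 100 * time.Millisecond << uint(retries)
}

func produceLoop(input <-chan *message) {
	var leader *broker
	leaderRetries := 0 // tracked separately from per-message retries
	for msg := range input {
		if leader == nil {
			// The wait depends on how long the leader has been missing,
			// not on any individual message's retry count.
			time.Sleep(backoff(leaderRetries))
			leaderRetries++
			l, err := refreshLeader()
			if err != nil {
				// Report the missing leader as an immediate error for
				// this message and move on, rather than blocking here
				// until a leader appears.
				failMessage(msg, err)
				continue
			}
			leader = l
			leaderRetries = 0 // found a leader; reset the broker-level count
		}
		if err := leader.send(msg); err != nil {
			leader = nil // force a refresh (and backoff) on the next iteration
		}
	}
}

func main() {
	input := make(chan *message, 1)
	input <- &message{payload: []byte("hello")}
	close(input)
	produceLoop(input)
}
```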
Given the description of the fix, I wonder if it would make sense, from Sarama's point of view as a library, to configure the retry backoff for the leader separately from the backoff for messages. Having different configs would still allow us to configure our backoff handler to get this behavior. Messages can be 'rescheduled' in Sarama, e.g. sent to another partition after an error, but is there a good reason to actually have backoff at the message level? Normally I'd expect errors to be at the broker level (IO errors). On the other hand, errors at the broker level plus backoff can bring all publishing to a halt (depending on config).
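To make the proposal concrete, the split might look something like the following. This is purely hypothetical; no such separation exists in Sarama's config today, and the struct and field names are invented for illustration:

```go
package main

import "time"

// Hypothetical shape for the proposed split; nothing here exists in
// Sarama today.
type retryConfig struct {
	// Backoff between re-sends of messages the broker rejected.
	MessageBackoffFunc func(retries, maxRetries int) time.Duration
	// Separate backoff while (re)discovering the lead broker.
	LeaderBackoffFunc func(retries, maxRetries int) time.Duration
}

func main() {
	cfg := retryConfig{
		MessageBackoffFunc: func(retries, maxRetries int) time.Duration {
			return 100 * time.Millisecond // flat per-message delay
		},
		LeaderBackoffFunc: func(retries, maxRetries int) time.Duration {
			return 100 * time.Millisecond << uint(retries) // exponential
		},
	}
	_ = cfg
}
```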
Backoff is tracked at the message level because partial success is possible, so a single batch of messages can have different retry values: if we send 10 messages and some fail, and then we send 10 messages again and some of those fail, the next batch could contain some messages being retried for the second time, some for the first, and some that have never been tried. The retry level of a batch is then aggregated over the messages being sent in it. It would be possible to add distinct backoffs for the different stages of message transmission, but I'm not sure we have a use case for that right now. As far as I know, the Beats-side intention for the flag was just a generic exponential backoff for producer failures, regardless of whether they were caused by reported failures or a lost broker. I'm not sure how Sarama will feel about it, but my impression is that this was an oversight based on internal implementation details, not an intentionally distinct error state.
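A small sketch of what that aggregation could look like, assuming the batch's backoff is driven by the highest per-message retry level (the comment above doesn't specify the aggregation rule, so the `max` here is an assumption, and all names are invented):

```go
package main

import (
	"fmt"
	"time"
)

// Each message carries its own retry count, because a batch can mix
// messages on their first attempt with messages on their Nth retry.
type message struct {
	retries int
}

// batchBackoff aggregates per-message retry levels into one wait time
// for the whole batch. Taking the maximum is one plausible aggregation;
// the exact rule in Sarama may differ.
func batchBackoff(batch []*message, backoffFn func(retries, maxRetries int) time.Duration, maxRetries int) time.Duration {
	level := 0
	for _, m := range batch {
		if m.retries > level {
			level = m.retries
		}
	}
	return backoffFn(level, maxRetries)
}

func main() {
	// A batch mixing a fresh message, a first retry, and a second retry.
	batch := []*message{{retries: 0}, {retries: 1}, {retries: 2}}
	expo := func(retries, maxRetries int) time.Duration {
		return 100 * time.Millisecond << uint(retries)
	}
	fmt.Println(batchBackoff(batch, expo, 3)) // 400ms
}
```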
👍 Let's discuss it in an issue with Sarama. At least from the Beats POV, I don't think we care much about per-message backoff. Actually, per-message backoff can be a problem, since the memory queue waits for the output to make progress.
New implementation LGTM.
This is a fix for elastic/beats#19015, in which Sarama fails to apply exponential backoff when `Producer.Retry.BackoffFunc` is set to an exponential backoff function. Sarama tracks producer retries per-message and updates them when receiving a response from the broker. This means that when there is no broker, the retry count never gets incremented: retries of send attempts to a valid broker properly apply the exponential backoff, but retries while connecting to the broker itself always use the initial backoff time.
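For reference, the kind of configuration that exposes the bug looks like this (`Producer.Retry.BackoffFunc` is real Sarama API; the base delay and cap are illustrative, not the values Beats uses):

```go
package main

import (
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Retry.Max = 5
	// Exponential backoff, doubling per retry and capped at 10s.
	// Under the bug described above, `retries` never increments while
	// the lead broker is down, so this keeps returning the first step.
	cfg.Producer.Retry.BackoffFunc = func(retries, maxRetries int) time.Duration {
		backoff := 100 * time.Millisecond << uint(retries)
		if backoff > 10*time.Second {
			backoff = 10 * time.Second
		}
		return backoff
	}
	_ = cfg
}
```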
I have modified the wait so that it tracks retries on the broker itself separately from retries on its pending messages. Local tests as described in elastic/beats#19015 confirm that exponential backoff is now correctly applied when the broker is down.