
Properly report retry counts when refreshing the lead broker #7

Merged (2 commits into elastic:beats-vendored from backoff-fix, Jun 10, 2020)

Conversation

@faec (Collaborator) commented Jun 9, 2020

This is a fix for elastic/beats#19015, in which Sarama fails to apply exponential backoff when Producer.Retry.BackoffFunc is set to an exponential backoff function.

Sarama tracks producer retries per message and updates them when it receives a response from the broker. This means that when there is no broker, the retry count never gets incremented. Thus, retried send attempts to a valid broker properly apply the exponential backoff, but retries while connecting to the broker itself always use the initial backoff time.

I have modified the wait so that it tracks retries on the broker itself separately from retries on its pending messages. Local tests as described in elastic/beats#19015 confirm that exponential backoff is now correctly applied when the broker is down.
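
For concreteness, here is a minimal sketch (not the Beats code; the base delay and cap are made-up values) of the kind of configuration the linked issue is about: `Producer.Retry.BackoffFunc` computes the delay from its `retries` argument, so if that count never increments while the lead broker is unreachable, every wait comes out as the initial delay.

```go
package main

import (
	"fmt"
	"math"
	"time"

	"github.com/Shopify/sarama"
)

// newProducerConfig builds a Sarama config whose producer retries back off
// exponentially via Producer.Retry.BackoffFunc. The 100ms base and 10s cap
// are illustrative, not the Beats defaults.
func newProducerConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	cfg.Producer.Retry.Max = 3
	cfg.Producer.Retry.BackoffFunc = func(retries, maxRetries int) time.Duration {
		// The wait doubles with each retry; if the retries argument never
		// increments (the behavior described above while the lead broker
		// is down), this always returns the initial 100ms.
		d := time.Duration(math.Pow(2, float64(retries))) * 100 * time.Millisecond
		if d > 10*time.Second {
			d = 10 * time.Second
		}
		return d
	}
	return cfg
}

func main() {
	cfg := newProducerConfig()
	for r := 0; r <= cfg.Producer.Retry.Max; r++ {
		fmt.Println("retry", r, "->", cfg.Producer.Retry.BackoffFunc(r, cfg.Producer.Retry.Max))
	}
}
```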

@faec faec requested review from ycombinator and kvch June 9, 2020 19:30
@ycombinator commented Jun 9, 2020

Thanks for the explanation; I understand the change. Is this a change we'd have to keep in our fork permanently, meaning it's just how we want sarama to work, or is it potentially a bug in sarama itself whose fix we should try to push upstream as well?

@ycombinator left a comment

The change itself makes sense to me for our use case. I left a comment with a question, but it shouldn't block this PR.

@faec (Collaborator, Author) commented Jun 9, 2020

My expectation is that Sarama will consider this a bug and want to fix it, though possibly not in exactly the way I did. (There is also possible data loss in the existing code when a broker is down, but I don't know Sarama well enough to evaluate that part; it might just be handled elsewhere.) I'll be filing a bug with them shortly. Our hope is still to drop this fork entirely once our blockers are addressed in the current release, but this does add one more blocker to the list.

I've been testing this PR more and I found a reproducible crash with the fix applied when a lead broker goes down and then comes back up, so evidently something isn't quite right yet -- I will follow up here once I track it down.

@faec (Collaborator, Author) commented Jun 9, 2020

Fixed (knock on wood); the problem was that pp.parent.returnError returns an immediate fatal error for that message, and thus should only be called once per message (rather than in the error loop as in my first attempt).

I adjusted this PR to combine the old and new approaches: I still track leader-update retries separately from message retries for determining the wait time, but I also report a missing leader as an immediate error for that message and continue to the next loop iteration, rather than blocking until I find one. This ensures progress while still reporting relevant errors and applying the correct backoff.
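
To make the shape of that change concrete, here is a toy model of the approach rather than Sarama's actual partition-producer code; the type, field, and function names below are placeholders. The point is that leader-refresh failures drive their own retry counter, the error is reported once per message, and the loop moves on instead of blocking.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// partitionProducer is a toy stand-in for the idea described above, not
// Sarama's real type: leader-refresh retries get their own counter,
// separate from whatever retry counts the individual messages carry.
type partitionProducer struct {
	leaderRetries int
	maxRetries    int
	backoff       func(retries, maxRetries int) time.Duration
	refreshLeader func() error // placeholder for the real leader lookup
}

// dispatch mirrors the flow in the comment: if the leader can't be
// refreshed, wait according to the broker-level counter, report an error
// for this one message, and continue with the next message.
func (pp *partitionProducer) dispatch(messages []string) {
	for _, msg := range messages {
		if err := pp.refreshLeader(); err != nil {
			pp.leaderRetries++
			time.Sleep(pp.backoff(pp.leaderRetries, pp.maxRetries))
			fmt.Printf("error for %q: %v\n", msg, err) // stands in for returnError
			continue
		}
		pp.leaderRetries = 0
		fmt.Printf("sent %q\n", msg)
	}
}

func main() {
	pp := &partitionProducer{
		maxRetries: 3,
		backoff: func(retries, _ int) time.Duration {
			return time.Duration(retries) * 10 * time.Millisecond // grows with the counter
		},
		refreshLeader: func() error { return errors.New("no leader available") },
	}
	pp.dispatch([]string{"a", "b", "c"})
}
```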

@urso commented Jun 10, 2020

Given the description of the fix, I wonder if it would make sense, from sarama's point of view as a library, to configure the retry backoff for the leader separately from the backoff for messages. Having separate configs would still allow us to configure our backoff handler to get this behavior.

Messages can be 'rescheduled' in sarama, e.g. sent to another partition after an error. But is there a good reason to actually have backoff at the message level? Normally I'd expect errors to occur at the broker level (IO errors). On the other hand, errors at the broker level plus backoff can bring all publishing to a halt (depending on config, though).

@faec (Collaborator, Author) commented Jun 10, 2020

Backoff is tracked at the message level because partial success is possible, so a single batch of messages can contain different retry counts: if we send 10 messages and some fail, then send 10 again and some of those fail, the next batch could include messages being retried for the second time, some for the first time, and some that have never been tried. The retry level of a single batch is then aggregated over the messages being sent in it.

It would be possible to add distinct backoffs for the different stages of message transmission, but I'm not sure we have a use case for that right now. As far as I know the beats-side intention for the flag was just to have a generic exponential backoff for producer failures regardless of whether it was caused by reported failures or a lost broker. I'm not sure how Sarama will feel about it, but my impression is that this was just an oversight based on internal implementation details, not an intentionally distinct error state.
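
As a small illustration of the aggregation described above (a sketch only; taking the maximum is an assumption here, and the `message` type is not Sarama's `ProducerMessage`): each message carries its own retry count, and the batch's effective retry level is derived from the messages in it.

```go
package main

import "fmt"

// message is a toy stand-in for a produced record that carries its own
// retry count, as described in the comment above.
type message struct {
	payload string
	retries int
}

// batchRetryLevel derives a single retry level for a mixed batch. Using the
// maximum is an assumption for illustration; the point is only that
// messages in one batch can be on different retry attempts.
func batchRetryLevel(batch []message) int {
	level := 0
	for _, m := range batch {
		if m.retries > level {
			level = m.retries
		}
	}
	return level
}

func main() {
	batch := []message{{"a", 2}, {"b", 1}, {"c", 0}}
	fmt.Println("batch retry level:", batchRetryLevel(batch))
}
```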

@urso commented Jun 10, 2020

👍

Let's discuss it in an issue with Sarama. At least from the Beats POV, I don't think we care much about per-message backoff. Actually, per-message backoff can be a problem, since the memory queue waits for the output to make progress.

@ycombinator left a comment

New implementation LGTM.

@faec merged commit 1a2a3be into elastic:beats-vendored on Jun 10, 2020
@faec deleted the backoff-fix branch on June 11, 2020 16:23