
Properly report retry counts when refreshing the lead broker #7

Merged (2 commits into elastic:beats-vendored from backoff-fix, Jun 10, 2020)

Conversation

@faec (Collaborator) commented Jun 9, 2020

This is a fix for elastic/beats#19015, in which Sarama fails to apply exponential backoff when Producer.Retry.BackoffFunc is set to an exponential backoff function.

Sarama tracks producer retries per message and updates them when it receives a response from the broker. This means that when there is no broker, the retry count never gets incremented. Thus, retried send attempts to a valid broker properly apply the exponential backoff, but retries while connecting to the broker itself always use the initial backoff time.

I have modified the wait so that it tracks retries on the broker itself separately from retries on its pending messages. Local tests as described in elastic/beats#19015 confirm that exponential backoff is now correctly applied when the broker is down.
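
For concreteness, here is a minimal sketch (not the Beats code; the base delay and cap are made-up values) of the kind of configuration the linked issue is about: `Producer.Retry.BackoffFunc` computes the delay from its `retries` argument, so if that count never increments while the lead broker is unreachable, every wait comes out as the initial delay.

```go
package main

import (
	"fmt"
	"math"
	"time"

	"github.com/Shopify/sarama"
)

// newProducerConfig builds a Sarama config whose producer retries back off
// exponentially via Producer.Retry.BackoffFunc. The 100ms base and 10s cap
// are illustrative, not the Beats defaults.
func newProducerConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	cfg.Producer.Retry.Max = 3
	cfg.Producer.Retry.BackoffFunc = func(retries, maxRetries int) time.Duration {
		// The wait doubles with each retry; if the retries argument never
		// increments (the behavior described above while the lead broker
		// is down), this always returns the initial 100ms.
		d := time.Duration(math.Pow(2, float64(retries))) * 100 * time.Millisecond
		if d > 10*time.Second {
			d = 10 * time.Second
		}
		return d
	}
	return cfg
}

func main() {
	cfg := newProducerConfig()
	for r := 0; r <= cfg.Producer.Retry.Max; r++ {
		fmt.Println("retry", r, "->", cfg.Producer.Retry.BackoffFunc(r, cfg.Producer.Retry.Max))
	}
}
```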

@faec faec requested review from ycombinator and kvch June 9, 2020 19:30
@ycombinator commented Jun 9, 2020

Thanks for the explanation; I understand the change. Is this a change we'd have to keep in our fork permanently, meaning it's just how we want sarama to work, or is it potentially a bug in sarama itself whose fix we should try to push upstream as well?

@ycombinator left a comment

The change itself makes sense to me for our use case. I left a comment with a question, but it shouldn't block this PR.

@faec (Collaborator, Author) commented Jun 9, 2020

My expectation is that Sarama will consider this a bug and want to fix it, though possibly not in exactly the way I did. (There is also possible data loss in the existing code when a broker is down, but I don't know Sarama well enough to evaluate that part; it might just be handled elsewhere.) I'll be filing a bug with them shortly. Our hope is still to drop this fork entirely once our blockers are addressed in the current release, but this does add one more blocker to the list.

I've been testing this PR more and I found a reproducible crash with the fix applied when a lead broker goes down and then comes back up, so evidently something isn't quite right yet -- I will follow up here once I track it down.

@faec (Collaborator, Author) commented Jun 9, 2020

Fixed (knock on wood); the problem was that pp.parent.returnError returns an immediate fatal error for that message, and thus should only be called once per message (rather than in the error loop as in my first attempt).

I adjusted this PR to combine the old and new approaches: I still track leader-update retries separately from message retries for determining the wait time, but I also report a missing leader as an immediate error for that message and continue to the next loop iteration, rather than blocking until I find one. This ensures progress while still reporting relevant errors and applying the correct backoff.
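
To make the shape of that change concrete, here is a toy model of the approach rather than Sarama's actual partition-producer code; the type, field, and function names below are placeholders. The point is that leader-refresh failures drive their own retry counter, the error is reported once per message, and the loop moves on instead of blocking.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// partitionProducer is a toy stand-in for the idea described above, not
// Sarama's real type: leader-refresh retries get their own counter,
// separate from whatever retry counts the individual messages carry.
type partitionProducer struct {
	leaderRetries int
	maxRetries    int
	backoff       func(retries, maxRetries int) time.Duration
	refreshLeader func() error // placeholder for the real leader lookup
}

// dispatch mirrors the flow in the comment: if the leader can't be
// refreshed, wait according to the broker-level counter, report an error
// for this one message, and continue with the next message.
func (pp *partitionProducer) dispatch(messages []string) {
	for _, msg := range messages {
		if err := pp.refreshLeader(); err != nil {
			pp.leaderRetries++
			time.Sleep(pp.backoff(pp.leaderRetries, pp.maxRetries))
			fmt.Printf("error for %q: %v\n", msg, err) // stands in for returnError
			continue
		}
		pp.leaderRetries = 0
		fmt.Printf("sent %q\n", msg)
	}
}

func main() {
	pp := &partitionProducer{
		maxRetries: 3,
		backoff: func(retries, _ int) time.Duration {
			return time.Duration(retries) * 10 * time.Millisecond // grows with the counter
		},
		refreshLeader: func() error { return errors.New("no leader available") },
	}
	pp.dispatch([]string{"a", "b", "c"})
}
```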

@urso commented Jun 10, 2020

Given the description of the fix, I wonder if it would make sense, from sarama's point of view as a library, to configure the retry backoff for the leader separately from the backoff for messages. Having separate configs would still allow us to configure our backoff handler to get this behavior.

Messages can be 'rescheduled' in sarama, e.g. sent to another partition after an error. But is there a good reason to actually have backoff at the message level? Normally I'd expect errors to occur at the broker level (IO errors). On the other hand, errors at the broker level plus backoff can bring all publishing to a halt (depending on config, though).

@faec (Collaborator, Author) commented Jun 10, 2020

Backoff is tracked at the message level because partial success is possible, so a single batch of messages can contain different retry counts: if we send 10 messages and some fail, then send 10 again and some of those fail, the next batch could include messages being retried for the second time, some for the first time, and some that have never been tried. The retry level of a single batch is then aggregated over the messages being sent in it.

It would be possible to add distinct backoffs for the different stages of message transmission, but I'm not sure we have a use case for that right now. As far as I know the beats-side intention for the flag was just to have a generic exponential backoff for producer failures regardless of whether it was caused by reported failures or a lost broker. I'm not sure how Sarama will feel about it, but my impression is that this was just an oversight based on internal implementation details, not an intentionally distinct error state.
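
As a small illustration of the aggregation described above (a sketch only; taking the maximum is an assumption here, and the `message` type is not Sarama's `ProducerMessage`): each message carries its own retry count, and the batch's effective retry level is derived from the messages in it.

```go
package main

import "fmt"

// message is a toy stand-in for a produced record that carries its own
// retry count, as described in the comment above.
type message struct {
	payload string
	retries int
}

// batchRetryLevel derives a single retry level for a mixed batch. Using the
// maximum is an assumption for illustration; the point is only that
// messages in one batch can be on different retry attempts.
func batchRetryLevel(batch []message) int {
	level := 0
	for _, m := range batch {
		if m.retries > level {
			level = m.retries
		}
	}
	return level
}

func main() {
	batch := []message{{"a", 2}, {"b", 1}, {"c", 0}}
	fmt.Println("batch retry level:", batchRetryLevel(batch))
}
```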

@urso commented Jun 10, 2020

👍

Let's discuss it in an issue with Sarama. At least from the Beats POV, I don't think we care much about per-message backoff. Actually, per-message backoff can be a problem, since the memory queue waits for the output to make progress.

@ycombinator left a comment

New implementation LGTM.

@faec merged commit 1a2a3be into elastic:beats-vendored on Jun 10, 2020
@faec deleted the backoff-fix branch on June 11, 2020 16:23