
Retry sending messages only for retriable exceptions #29

Merged
merged 5 commits into from
Jul 20, 2020

Conversation

@praseodym (Contributor):

Fixes #27

failures << record
nil
rescue org.apache.kafka.common.errors.InterruptException => e
@kares (Contributor), Apr 13, 2020:

🔴 Change in behavior we need to understand: InterruptException was previously retried; now it isn't.
We should understand why the special exception handling was introduced as retriable, and keep it if that's still valid.

@praseodym (Author):

Good point! InterruptException gets thrown "if the thread is interrupted while blocked" according to the KafkaProducer#send() Javadoc. I don't think this should happen unless the producer is stopped or otherwise interrupted.

@praseodym (Author):

The current behaviour was introduced in logstash-plugins/logstash-output-kafka#151, which doesn't explain why InterruptException is included.

@kares (Contributor):

Thanks Mark. From that conversation it seems the handling just followed the exceptions that the producer's send declares it throws.
Kafka's InterruptException doesn't inherit from RetriableException, but I think we should keep the backwards-compatible behavior, at least unless we clearly understand that most potential interrupts are not recoverable.

@praseodym (Author):

I've explicitly added InterruptException back in the retry logic. I'm still not sure this is the best way to go (maybe rethrowing it as an InterruptedException would make more sense), but at least the behaviour is backwards compatible now.
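The resulting retry policy can be sketched as follows. This is a hypothetical illustration, not the plugin's actual code: the Ruby classes below are stand-ins for the Java exception classes (org.apache.kafka.common.errors.*) the plugin sees under JRuby.

```ruby
# Hypothetical sketch of the retry policy discussed above: retry only
# Kafka's retriable errors, plus InterruptException for backwards
# compatibility. Class names mirror org.apache.kafka.common.errors.*.
class KafkaError < StandardError; end
class RetriableException < KafkaError; end       # transient; safe to retry
class InterruptException < KafkaError; end       # kept retriable for compatibility
class RecordTooLargeException < KafkaError; end  # permanent; retrying is futile

def retriable?(exception)
  exception.is_a?(RetriableException) || exception.is_a?(InterruptException)
end

retriable?(RetriableException.new)      # => true
retriable?(InterruptException.new)      # => true
retriable?(RecordTooLargeException.new) # => false
```

With this predicate, a permanent failure such as RecordTooLargeException is logged and dropped instead of being retried forever.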

@praseodym (Author):

@kares Do you see any other blockers to getting this merged? Thanks!

@GiedriusS:

@kares any updates on this?

@adammike commented Jun 5, 2020:

This is a big problem for us, generating a ton of noise in our logs. Any ETA?

@kares (Contributor) commented Jun 8, 2020:

LGTM in general. Some concerns were raised on the original PR, logstash-plugins/logstash-output-kafka#194, so I am going to submit this for the team to review, for a second set of 👀.

P.S. We are going to need a version bump, probably at least minor, with a changelog entry.

@elasticsearch-bot elasticsearch-bot self-assigned this Jun 16, 2020
@robbavey (Contributor):

@praseodym Firstly, sincere apologies for the amount of time this and the previous PR in the old repo have been hanging around.

I'm ok with including this - it seems futile to keep on trying to send messages that will never be accepted by Kafka - but I think it makes sense to document the change in behavior, in a similar way to how we do in the Elasticsearch output, particularly with regard to messages that would generate RecordTooLargeException.

cc @jsvd, doc(@karenzone)

@karenzone (Contributor):

Thanks for the ping, @robbavey.

@praseodym, I'm happy to review the docs when you're ready. Please let me know if you have any questions, or I can be of assistance. Thanks!

@praseodym (Author):

@kares @robbavey @karenzone Thanks! I've rebased this PR and updated the docs + changelog in the most recent commit. Please take a look!

@karenzone (Contributor) left a comment:

Thank you for taking time to update the documentation with this improvement.
Docs build cleanly and LGTM!

@@ -323,6 +323,13 @@ Kafka down, etc).

A value less than zero is a configuration error.

This plugin will only retry exceptions that are a subclass of Kafka's RetriableException.
@kares (Contributor), Jun 25, 2020:

This is a good start; I would also mention that previous plugin versions (<= 10.4.0) kept retrying all errors indefinitely.

@praseodym (Author):

Aren’t the docs versioned, i.e. previous versions of Logstash docs will not include this new paragraph? Or do you think it’s better to be explicit and add the version number?

@kares (Contributor):

Yes, they are, but it's quite uncommon for someone to compare docs side by side.
As noted before, the ES plugin still mentions a considerable change in behaviour in its exception handling:
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#_retry_policy

@praseodym (Author):

I've updated the docs to also mention the old behaviour.

@kares (Contributor) left a comment:

we will also need a version bump at:

CHANGELOG.md
@GiedriusS:

Have been running this for about a week or so via my own fork, thank you! Works well and extremely needed.

Nil values were removed from the futures array before looping, causing
wrong indexes relative to the batch array.
To preserve existing behaviour.
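The nil-handling bug mentioned in the commit messages above can be illustrated with a small sketch (hypothetical arrays and values, not the plugin's actual code): each future must stay index-aligned with its record in the batch, so nils must not be removed before looping.

```ruby
# Hypothetical illustration: nil marks a record whose send failed up front.
batch   = [:a, :b, :c]
futures = ['future-a', nil, 'future-c']

# Buggy approach: compacting shifts later futures onto the wrong records.
misaligned = futures.compact.each_with_index.map { |f, i| [batch[i], f] }
# => [[:a, "future-a"], [:b, "future-c"]]   -- :b wrongly paired with :c's future

# Fixed approach: keep the nils so indexes line up with the batch.
aligned = futures.each_with_index.map { |f, i| [batch[i], f] }
# => [[:a, "future-a"], [:b, nil], [:c, "future-c"]]
```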
@praseodym praseodym force-pushed the retry-only-retriable branch 2 times, most recently from 983da6a to 04cba7a Compare July 12, 2020 13:35
@praseodym (Author):

> we will also need a version bump at:

Done!

@robbavey (Contributor) left a comment:

I'm good with the updated changes. Thank you again, @praseodym, for your contribution.

@karenzone - would you mind giving the updated docs another check?

@karenzone karenzone self-requested a review July 14, 2020 20:54
@karenzone (Contributor) left a comment:

Docs build cleanly and look great! Thanks for this contribution, @praseodym.

@karenzone karenzone merged commit ba70405 into logstash-plugins:master Jul 20, 2020
@praseodym praseodym deleted the retry-only-retriable branch July 20, 2020 20:13
GiedriusS added a commit to GiedriusS/beats that referenced this pull request Jul 21, 2020
Partially solve infinite loops that Filebeat goes into by explicitly
listing what Apache Kafka calls "retriable errors" and checking for them
when an error occurs. Then, as time goes on given that no new events
come in, the batch will be completely dropped if `flush.timeout` is set
to at least `10s`. This is needed because otherwise the breaker will be
constantly open.

Ideally, the breaker wouldn't open when such errors happen, but at the
very least we can be smarter on the Beats side by handling errors like
this.

Partially inspired by
logstash-plugins/logstash-integration-kafka#29.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Successfully merging this pull request may close these issues.

Do not retry sending messages that failed with a permanent exception
7 participants