Retry messages on shutdown #420

Merged: eapache merged 2 commits into master from retry-on-shutdown on Apr 27, 2015

Conversation

eapache (Contributor) commented on Apr 14, 2015

@Shopify/kafka fixes #419.

Rather than use the old complicated system of reference-counting flags to shut down cleanly, do the much simpler thing: keep a sync.WaitGroup counting the number of messages "in flight" (aka owned by the producer). When shutdown is requested, spawn a goroutine that waits for this counter to hit 0, then closes everything in one go.

We add messages to the in-flight set in the topicDispatcher (only new messages with retries==0, though). We remove messages from the in-flight set in returnError and returnSuccesses; even if the Producer.Return.* values are false, those methods are still guaranteed to see every message.

We also add/remove chaser messages in leaderDispatcher.
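
A minimal sketch of the scheme described above (simplified stand-in types and names, not the actual sarama code):

```go
package main

import "sync"

// Minimal stand-ins for the real sarama types; this only illustrates the
// in-flight accounting scheme described in the PR, not the real dispatchers.
type ProducerMessage struct {
	retries int
	flags   int
}

type asyncProducer struct {
	inFlight sync.WaitGroup
	input    chan *ProducerMessage
}

// topicDispatcher side: only brand-new messages (retries == 0) join the
// in-flight set; retried messages were already counted on first entry.
func (p *asyncProducer) trackIfNew(msg *ProducerMessage) {
	if msg.retries == 0 {
		p.inFlight.Add(1)
	}
}

// returnError / returnSuccesses side: every message eventually passes through
// one of these, even when Producer.Return.* is false, so the count is
// released here.
func (p *asyncProducer) untrack(msg *ProducerMessage) {
	p.inFlight.Done()
}

// On shutdown: a goroutine waits for the counter to hit zero, then closes
// everything in one go.
func (p *asyncProducer) shutdown() {
	go func() {
		p.inFlight.Wait()
		close(p.input)
	}()
}

func main() {}
```

As long as every Add(1) is matched by exactly one Done(), Wait() returns only once nothing is owned by the producer, which is exactly the shutdown condition this PR is after.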

This still needs tests.

I'm not sure what performance impact the waitgroup will have. An alternative might be an atomic counter, and have the shutdown goroutine just poll it every 10ms or something.

eapache force-pushed the retry-on-shutdown branch 2 times, most recently from 0d90bd3 to 1ae385a on April 16, 2015 at 19:22
eapache (Contributor, Author) commented on Apr 16, 2015

> I'm not sure what performance impact the waitgroup will have. An alternative might be an atomic counter, and have the shutdown goroutine just poll it every 10ms or something.

A recent benchmarking and profiling push says: maybe this has a tiny impact on performance, but it's still swamped out by stupid stuff like CRC calculations, so not a concern.
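
For reference, a rough sketch of that polling alternative (not what was merged, and per the benchmarking note above it wasn't needed; illustrative names only):

```go
package main

import (
	"sync/atomic"
	"time"
)

// Sketch of the alternative mentioned above: an atomic in-flight counter that
// the shutdown goroutine polls, instead of a sync.WaitGroup.
type producer struct {
	inFlight int64
	done     chan struct{}
}

func (p *producer) track()   { atomic.AddInt64(&p.inFlight, 1) }
func (p *producer) untrack() { atomic.AddInt64(&p.inFlight, -1) }

func (p *producer) shutdown() {
	go func() {
		// Poll every 10ms until nothing is in flight, then shut everything down.
		for atomic.LoadInt64(&p.inFlight) > 0 {
			time.Sleep(10 * time.Millisecond)
		}
		close(p.done)
	}()
}

func main() {}
```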

eapache force-pushed the retry-on-shutdown branch 4 times, most recently from 18c94ac to 91534c6 on April 17, 2015 at 19:18
eapache (Contributor, Author) commented on Apr 17, 2015

CI failing because of kisielk/errcheck#70

eapache (Contributor, Author) commented on Apr 17, 2015

CI fixed.

@Shopify/kafka this is ready for review.

eapache force-pushed the retry-on-shutdown branch 2 times, most recently from bd75081 to 644af59 on April 24, 2015 at 15:38
@@ -355,6 +347,7 @@ func (p *asyncProducer) leaderDispatcher(topic string, partition int32, input ch
 		// in fact this message is not even the current retry level, so buffer it for now (unless it's just a chaser)
 		if msg.flags&chaser == chaser {
 			retryState[msg.retries].expectChaser = false
+			p.inFlight.Done() // this chaser is now useless and will be garbage collected

Contributor:

s/is now useless/is now handled/ ?

@wvanbergen (Contributor):

I think the accounting is correct.

  • New (retries == 0) messages increment; errors and successes decrement.
  • New chaser messages increment, and are decremented when they are handled (see the sketch below).
  • Only shutdown messages are not accounted for, but there's only one and it doesn't propagate through the goroutines (see my comment above).

👍, this is a nice simplification and it's much easier to understand now.
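
A rough, self-contained sketch of the chaser half of that accounting (hypothetical helper names; the real leaderDispatcher logic is elided):

```go
package main

import "sync"

// Condensed, hypothetical view of the chaser accounting described in the list
// above; names loosely mirror the PR's diff, everything else is elided.
type ProducerMessage struct {
	retries int
	flags   int
}

const chaser = 1 // illustrative flag bit marking a chaser message

type asyncProducer struct {
	inFlight sync.WaitGroup
}

// The leaderDispatcher counts a chaser as in flight when it emits one for a
// retry level...
func (p *asyncProducer) emitChaser(level int) *ProducerMessage {
	p.inFlight.Add(1)
	return &ProducerMessage{retries: level, flags: chaser}
}

// ...and releases the count once that chaser comes back and is handled.
func (p *asyncProducer) handleChaser(msg *ProducerMessage, expectChaser []bool) {
	if msg.flags&chaser == chaser {
		expectChaser[msg.retries] = false
		p.inFlight.Done() // this chaser is now handled
	}
}

func main() {}
```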

eapache force-pushed the retry-on-shutdown branch from 644af59 to 54eb5af on April 27, 2015 at 13:55
eapache (Contributor, Author) commented on Apr 27, 2015

Your description of the accounting matches my understanding exactly. Once CI is 🍏 and I've had a quick pair of 👀 on the third commit, I think this is good to go.

 			continue
 		} else if msg.retries == 0 {
 			if shuttingDown {
 				p.returnError(msg, ErrShuttingDown)

Contributor:

This reduces the inflight counter, but it was never incremented for this message. We should probably move the p.inFlight.Add(1) up so it always gets executed.
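
In outline, the fix the review suggests (a sketch with simplified stand-in types, not the actual diff): the Add(1) moves ahead of the shutting-down branch, so the Done() inside returnError always has a matching increment.

```go
package main

import (
	"errors"
	"sync"
)

// Stand-in for sarama's ErrShuttingDown, for illustration only.
var errShuttingDown = errors.New("producer is shutting down")

type ProducerMessage struct{ retries int }

type asyncProducer struct {
	inFlight     sync.WaitGroup
	shuttingDown bool
}

func (p *asyncProducer) returnError(msg *ProducerMessage, err error) {
	// ... deliver or drop the error, depending on Producer.Return.Errors ...
	p.inFlight.Done() // always balanced by the Add(1) below
}

func (p *asyncProducer) dispatch(msg *ProducerMessage) {
	if msg.retries == 0 {
		p.inFlight.Add(1) // count the message before any early return
		if p.shuttingDown {
			p.returnError(msg, errShuttingDown)
			return
		}
	}
	// ... continue dispatching as usual ...
}

func main() {}
```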

eapache (Contributor, Author):

Oooh, really nice catch. I will fix and see if I can add or adjust a test for this case too.

eapache (Contributor, Author):

OK, this path is fixed and tested now.

eapache force-pushed the retry-on-shutdown branch from 2bb62c1 to 08ccf5e on April 27, 2015 at 14:27
@wvanbergen (Contributor):

I think this looks good. We should do some stress testing of this though.

eapache (Contributor, Author) commented on Apr 27, 2015

I am comfortable enough to push this to master now. I will do some stressing before releasing the next stable version.

eapache added a commit that referenced this pull request on Apr 27, 2015
eapache merged commit f948bc2 into master on Apr 27, 2015
eapache deleted the retry-on-shutdown branch on April 27, 2015 at 14:45

Linked issue: Producer does not retry messages when shutting down (#419)