messaging: duplicate message delivery #5796

Open
derekperkins opened this issue Feb 6, 2020 · 1 comment

Comments

@derekperkins
Member

We're seeing significant issues around duplicate message delivery for queues with backlogs. When a queue is running in near-realtime, we see 0 duplicates. When there is a backlog, the queue cache appears to behave differently, and we see 20-30% duplicate message deliveries. The queue is processing quickly, so we are not exceeding the queue server timeout (vt_ack_wait=300).

Here's a chart for the last 24 hours showing duplicate rates for a queue with significant backlog.

image

Over the same period, here is a queue that processed more messages with 0 duplicates until we stopped our queue consumer for 15 minutes to build up a backlog. Once it worked through that backlog, there were no more duplicates.

image

Possibly related to this, we see weird behavior when new consumers connect to the message manager. For context on the chart:

  • Ready to run: time_next <= NOW()
  • Waiting to run: time_next > NOW()
  • Failed: time_next = MaxInt64 (this is our internal usage)

You can see a large jump in the status of 1M messages that coincides with us increasing consumers. We see this every time consumers change. These metrics are collected via an out-of-band query, so the table itself must be changing, but I can't explain what is happening. We see similar behavior if we run a query to reschedule messages that are failed or already acked.

image

My gut feeling is that something is happening in the message cache, but I don't have any data yet to support that.

@sougou
Contributor

sougou commented Feb 22, 2020

Most likely, you're hitting the situation where sends are failing. If a send fails, then the message is retried almost immediately: https://github.com/vitessio/vitess/blob/master/go/vt/vttablet/tabletserver/messager/message_manager.go#L120-L126.

We could change this behavior to postpone a message even if the send failed, but that may cause unnecessary postponements.
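The trade-off between the two behaviors can be sketched roughly as follows. This is not the code in message_manager.go; `nextSendTime` and its parameters are hypothetical, and it assumes that a postponed message becomes eligible again after a fixed ack-wait interval.

```go
package main

import "fmt"

// nextSendTime returns the earliest time a message may be sent again
// after one send attempt. All names here are hypothetical.
//
// Assumption: a postponed message waits ackWait before it is eligible again.
// postponeOnFailure selects the alternative policy described above.
func nextSendTime(now, ackWait int64, sendOK, postponeOnFailure bool) int64 {
	if sendOK || postponeOnFailure {
		// The message is postponed and waits out the ack window.
		return now + ackWait
	}
	// Failed send under the current policy: eligible again almost immediately.
	return now
}

func main() {
	const now, ackWait = 100, 300
	fmt.Println("send ok:                  ", nextSendTime(now, ackWait, true, false))
	fmt.Println("send failed, retry now:   ", nextSendTime(now, ackWait, false, false))
	fmt.Println("send failed, postpone too:", nextSendTime(now, ackWait, false, true))
}
```

Under the immediate-retry policy, a send that is reported as failed but actually reached the consumer would show up as a duplicate delivery. Under the postpone policy, a send that truly failed waits a full interval before the next attempt, which is the unnecessary postponement mentioned above.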
