Kafka reader duplicates reads under certain circumstances #15

Open
macgyver603 opened this issue Nov 29, 2018 · 0 comments
@macgyver603
Contributor

refs #5

When running the Kafka reader with multiple workers on a topic populated with a fixed number of records, the job occasionally reads too many records. In this specific scenario, I ran a job with 3 workers reading from a topic with 300k records in 10k batches, and ended up with anywhere between 300k and 330k records in Elasticsearch after the workers finished reading from the topic.

I believe the extra reads come from the rebalancing that happens as the workers first start up. Since the workers do not all start at the same time, the first worker can connect to Kafka and fetch a batch of records before the other two have started. Once another worker joins, a rebalance is triggered, and the worker holding the first batch cannot commit its offsets after writing the records to ES. Since no offsets are committed, the workers re-read that data after the rebalance.

The logs did show that the workers were processing more than the 300000 records that were in the topic, and in one case a worker logged this error after marking its first slice as resolved:

"msg":"Kafka reader error after slice resolution { Error: Broker: Group rebalance in progress ...

Ultimately this stems from the workers committing offsets only after a slice is resolved: any problem committing the offsets surfaces only after the data has already been fully processed, so the processed records cannot be un-written and are read again.
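The sequence above can be sketched as a minimal simulation (plain Python, not the actual reader or Kafka client code; the topic/batch sizes and the single failed commit are assumptions matching the scenario described):

```python
# Simulates the race described above: worker 1 processes a batch and writes
# it to the sink, but its offset commit is rejected because a rebalance is
# in progress, so the same batch is re-read after the rebalance.

def run_simulation(topic_size=300_000, batch_size=10_000):
    committed_offset = 0   # last offset committed by the consumer group
    records_written = 0    # records delivered to the sink (e.g. ES)

    # Worker 1 starts alone, fetches one batch, and writes it to the sink.
    batch_end = committed_offset + batch_size
    records_written += batch_size

    # A second worker joins -> rebalance. The offset commit for that batch
    # fails ("Broker: Group rebalance in progress"), so the group's
    # committed offset is still 0.
    commit_succeeded = False
    if not commit_succeeded:
        batch_end = committed_offset  # rewind to the last committed offset

    # After the rebalance, the workers read the rest of the topic from the
    # committed offset, writing every record to the sink.
    records_written += topic_size - batch_end
    return records_written

total = run_simulation()
# The first batch was written twice, so the sink holds more records
# than the topic contains.
assert total == 300_000 + 10_000
```

Each batch whose commit is lost to a rebalance adds one duplicated batch to the sink, which matches the observed range of 300k to 330k records.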
