[Question] REBALANCE_IN_PROGRESS error while committing offsets #409

Closed
alistairking opened this issue Mar 23, 2023 · 10 comments · Fixed by #410
Comments

@alistairking
Contributor

Hey!

I'm almost certain this is me doing something wrong rather than a bug in the library, but I'm hoping you might have an idea of what it is :)

So, I have a cooperative-sticky consumer that uses BlockRebalanceOnPoll and DisableAutoCommit, and then during the poll loop I do something like:

	fs := k.PollRecords(ctx, maxFetch)
	defer k.AllowRebalance()
	// process records
	if err := k.CommitUncommittedOffsets(ctx); err != nil {
		if !errors.Is(err, context.Canceled) {
			log.Error().Err(err).Msg("failed to commit offsets")
		}
	}

When the group rebalances (e.g., when a new topic-partition is added), I see from the rebalance event logs that only one consumer instance has had a partition added (as I'd expect), but then I see a bunch of warnings emitted from this line in the client:

	g.cfg.logger.Log(LogLevelWarn, "unable to commit offset for topic partition",

along with my CommitUncommittedOffsets call returning kerr.RebalanceInProgress.

I'm not sure why the consumers are failing to commit their offsets, given that rebalances should be blocked during the poll loop. (I don't see any evidence of consumer instances being evicted from the group due to timeouts.)

Any insight you might have would be much appreciated.

@twmb
Owner

twmb commented Mar 23, 2023

This has bugged me forever, too.

Notes from spelunking the Kafka source / how it can interact with the client:

* heartbeat sees RebalanceInProgress
* heartbeat revokes
* revoke finishes with nothing
* you Poll; since this is cooperative, there are still partitions with data
* rebalancing is "blocked"
* heartbeat loop finishes
* consumer enters join group
* you commit
* commit fails with RebalanceInProgress since the broker rejects commits between join group and sync group

* In eager mode consuming, you shouldn't get the error because there can be nothing to commit / you should be stuck polling
* In cooperative consuming, the blocking actually only blocks your callbacks -- the rebalance can still happen _up until_ you would lose _or_ be assigned partitions

I think the fix internally is to block the join request within the client; I can work on this later today, I hope.
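
Until that client-side fix lands, a minimal sketch of a caller-side workaround, assuming the same *kgo.Client `k`, `maxFetch`, and zerolog-style `log` as in the snippet above (the `pollOnce` and `process` names are hypothetical, and this assumes the commit error surfaces kerr.RebalanceInProgress directly or wrapped so errors.Is can match it), is to treat that error as an expected, retriable outcome: the offsets stay uncommitted and are simply committed on a later poll iteration.

	import (
		"context"
		"errors"

		"github.com/rs/zerolog/log"
		"github.com/twmb/franz-go/pkg/kerr"
		"github.com/twmb/franz-go/pkg/kgo"
	)

	// pollOnce runs one iteration of the poll loop with rebalances blocked.
	func pollOnce(ctx context.Context, k *kgo.Client, maxFetch int, process func(*kgo.Record)) {
		fs := k.PollRecords(ctx, maxFetch)
		defer k.AllowRebalance()
		fs.EachRecord(process)
		if err := k.CommitUncommittedOffsets(ctx); err != nil {
			switch {
			case errors.Is(err, context.Canceled):
				// shutting down; nothing left to do
			case errors.Is(err, kerr.RebalanceInProgress):
				// the broker rejects commits between JoinGroup and SyncGroup;
				// the uncommitted offsets are retried on the next poll loop
				log.Warn().Msg("commit raced a rebalance; retrying next poll")
			default:
				log.Error().Err(err).Msg("failed to commit offsets")
			}
		}
	}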

@alistairking
Contributor Author

Huh, thanks. It definitely makes me feel better that I'm not the only one with this problem :)
Let me know if there is anything I can do to help.

twmb added a commit that referenced this issue Mar 23, 2023
See comments. This is also a different and arguably better approach to
the old issue #137.

Closes #409.
Closes #137.
@twmb
Owner

twmb commented Mar 23, 2023

If you could try this PR (#410) before I test it further myself and merge it, that'd be good.

@alistairking
Contributor Author

That was quick!
I don't think I'll have time to test this today, but I can take it for a spin in 2-3 days.

@alistairking
Contributor Author

OK, I tried this out and I think it's working. I deployed it to a subset of consumer instances and didn't see any of these errors from the members I had upgraded.

twmb added a commit that referenced this issue Mar 27, 2023
See comments. This is also a different and arguably better approach to
the old issue #137.

The implementation has to be a bit insane because we want to obey the
context while _also_ trying to grab a lock, which cannot be canceled.

Closes #409.
Closes #137.
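
For illustration only, here is a general Go sketch of the shape that commit message describes, not the library's actual implementation (`lockWithContext` is a hypothetical name): the mutex is acquired in a helper goroutine so the caller can still honor context cancellation, and the lock is released again if the caller has already given up.

	import (
		"context"
		"sync"
	)

	// lockWithContext tries to acquire mu while honoring ctx. On success it
	// returns the unlock function; on cancellation it arranges for the lock,
	// once acquired in the background, to be released immediately.
	func lockWithContext(ctx context.Context, mu *sync.Mutex) (unlock func(), err error) {
		locked := make(chan struct{})
		go func() {
			mu.Lock()
			close(locked)
		}()
		select {
		case <-locked:
			return mu.Unlock, nil
		case <-ctx.Done():
			go func() {
				<-locked    // the background goroutine still acquires the lock...
				mu.Unlock() // ...so release it to avoid holding it forever
			}()
			return nil, ctx.Err()
		}
	}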
@twmb
Owner

twmb commented Mar 31, 2023

I'm aiming to investigate one to three more things before releasing this, fwiw -- I don't think this is that high priority to release ASAP, but it will be a patch release.

@alistairking
Contributor Author

No problem. Do you think #410 is in good enough shape for me to use that in the meantime?

@twmb
Owner

twmb commented Mar 31, 2023

I loop-tested it for a long time; it should be.
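
In the meantime, one way to run the unreleased fix is to pin the module to the PR's head commit with standard Go tooling (the SHA below is a placeholder to be taken from PR #410):

	go get github.com/twmb/franz-go@<pr-410-head-commit-sha>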

twmb closed this as completed in #410 on Apr 1, 2023
twmb closed this as completed in ee70930 on Apr 1, 2023
@twmb
Owner

twmb commented Apr 1, 2023

This is now pushed in v1.13.2
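
Upgrading the module to the released version should be enough to pick up the fix (standard Go tooling; the module path is the library's published path):

	go get github.com/twmb/franz-go@v1.13.2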

@alistairking
Contributor Author

Thanks a lot!
