[Question] REBALANCE_IN_PROGRESS error while committing offsets #409

Closed
alistairking opened this issue Mar 23, 2023 · 10 comments · Fixed by #410
Comments

@alistairking
Contributor

Hey!

I'm almost certain this is me doing something wrong rather than a bug in the library, but I'm hoping you might have an idea of what it is :)

So, I have a cooperative-sticky consumer that uses BlockRebalanceOnPoll and DisableAutoCommit, and then during the poll loop I do something like:

	fs := k.PollRecords(ctx, maxFetch)
	defer k.AllowRebalance()
	// process records
	if err := k.CommitUncommittedOffsets(ctx); err != nil {
		if !errors.Is(err, context.Canceled) {
			log.Error().Err(err).Msg("failed to commit offsets")
		}
	}

When the group rebalances (e.g., when a new topic-partition is added), I see from the rebalance event logs that only one consumer instance has had a partition added (as I'd expect), but then I see a bunch of warnings emitted from this line in the client:

	g.cfg.logger.Log(LogLevelWarn, "unable to commit offset for topic partition",

along with my CommitUncommittedOffsets call returning kerr.RebalanceInProgress.

I'm not sure why the consumers are failing to commit their offsets, given that rebalances should be blocked during the poll loop. (I don't see any evidence of consumer instances being evicted from the group due to timeouts.)

Any insight you might have would be much appreciated.

@twmb
Owner

twmb commented Mar 23, 2023

This has bugged me forever, too.

Notes from spelunking the Kafka source / how it can interact with the client:

* heartbeat sees RebalanceInProgress
* heartbeat revokes
* revoke finishes with nothing
* you Poll; since this is cooperative, there are still partitions with data
* rebalancing is "blocked"
* heartbeat loop finishes
* consumer enters join group
* you commit
* commit fails with RebalanceInProgress since the broker rejects commits between join group and sync group

* In eager mode consuming, you shouldn't get the error because there can be nothing to commit / you should be stuck polling
* In cooperative consuming, the blocking actually only blocks your callbacks -- the rebalance can still happen _up until_ you would lose _or_ be assigned partitions

I think the fix internally is to block the join request within the client; I can work on this later today, I hope.
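
Until that client-side fix lands, a minimal sketch of a caller-side workaround, assuming the same *kgo.Client `k`, `maxFetch`, and zerolog-style `log` as in the snippet above (the `pollOnce` and `process` names are hypothetical, and this assumes the commit error surfaces kerr.RebalanceInProgress directly or wrapped so errors.Is can match it), is to treat that error as an expected, retriable outcome: the offsets stay uncommitted and are simply committed on a later poll iteration.

	import (
		"context"
		"errors"

		"github.com/rs/zerolog/log"
		"github.com/twmb/franz-go/pkg/kerr"
		"github.com/twmb/franz-go/pkg/kgo"
	)

	// pollOnce runs one iteration of the poll loop with rebalances blocked.
	func pollOnce(ctx context.Context, k *kgo.Client, maxFetch int, process func(*kgo.Record)) {
		fs := k.PollRecords(ctx, maxFetch)
		defer k.AllowRebalance()
		fs.EachRecord(process)
		if err := k.CommitUncommittedOffsets(ctx); err != nil {
			switch {
			case errors.Is(err, context.Canceled):
				// shutting down; nothing left to do
			case errors.Is(err, kerr.RebalanceInProgress):
				// the broker rejects commits between JoinGroup and SyncGroup;
				// the uncommitted offsets are retried on the next poll loop
				log.Warn().Msg("commit raced a rebalance; retrying next poll")
			default:
				log.Error().Err(err).Msg("failed to commit offsets")
			}
		}
	}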

@alistairking
Contributor Author

Huh, thanks. It definitely makes me feel better that I'm not the only one with this problem :)
Let me know if there is anything I can do to help.

twmb added a commit that referenced this issue Mar 23, 2023
See comments. This is also a different and arguably better approach to
the old issue #137.

Closes #409.
Closes #137.
@twmb
Owner

twmb commented Mar 23, 2023

If you could try this PR (#410) before I test it further myself and merge it, that'd be good.

@alistairking
Contributor Author

That was quick!
I don't think I'll have time to test this today, but I can take it for a spin in 2-3 days.

@alistairking
Contributor Author

OK, I tried this out and I think it's working. I deployed it to a subset of consumer instances and didn't see any of these errors from the members I had upgraded.

twmb added a commit that referenced this issue Mar 27, 2023
See comments. This is also a different and arguably better approach to
the old issue #137.

The implementation has to be a bit insane because we want to obey the
context while _also_ trying to grab a lock, which cannot be canceled.

Closes #409.
Closes #137.
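
For illustration only, here is a general Go sketch of the shape that commit message describes, not the library's actual implementation (`lockWithContext` is a hypothetical name): the mutex is acquired in a helper goroutine so the caller can still honor context cancellation, and the lock is released again if the caller has already given up.

	import (
		"context"
		"sync"
	)

	// lockWithContext tries to acquire mu while honoring ctx. On success it
	// returns the unlock function; on cancellation it arranges for the lock,
	// once acquired in the background, to be released immediately.
	func lockWithContext(ctx context.Context, mu *sync.Mutex) (unlock func(), err error) {
		locked := make(chan struct{})
		go func() {
			mu.Lock()
			close(locked)
		}()
		select {
		case <-locked:
			return mu.Unlock, nil
		case <-ctx.Done():
			go func() {
				<-locked    // the background goroutine still acquires the lock...
				mu.Unlock() // ...so release it to avoid holding it forever
			}()
			return nil, ctx.Err()
		}
	}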
@twmb
Owner

twmb commented Mar 31, 2023

I'm aiming to investigate one to three more things before releasing this, fwiw -- I don't think this is that high priority to release ASAP, but it will be a patch release.

@alistairking
Contributor Author

No problem. Do you think #410 is in good enough shape for me to use that in the meantime?

@twmb
Owner

twmb commented Mar 31, 2023

I loop-tested it for a long time; it should be.
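
In the meantime, one way to run the unreleased fix is to pin the module to the PR's head commit with standard Go tooling (the SHA below is a placeholder to be taken from PR #410):

	go get github.com/twmb/franz-go@<pr-410-head-commit-sha>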

twmb closed this as completed in #410 on Apr 1, 2023
twmb closed this as completed in ee70930 on Apr 1, 2023
@twmb
Owner

twmb commented Apr 1, 2023

This is now pushed in v1.13.2
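
Upgrading the module to the released version should be enough to pick up the fix (standard Go tooling; the module path is the library's published path):

	go get github.com/twmb/franz-go@v1.13.2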

@alistairking
Contributor Author

Thanks a lot!
