Fix MPT flakiness #15131

Closed
mfleming opened this issue Nov 24, 2023 · 10 comments · Fixed by twmb/franz-go#650
@mfleming (Contributor) commented Nov 24, 2023

I've observed that MPT sometimes fails when run with cdt, though usually not on the first run. Since MPT is the foundation for the partition density work, we really need a good signal from it.

Initial analysis suggests it's just an issue with stopping rp.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 887, in test_many_partitions
    self._test_many_partitions(compacted=False)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1038, in _test_many_partitions
    with repeater_traffic(context=self._ctx,
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_repeater_service.py", line 377, in repeater_traffic
    svc.stop()
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_repeater_service.py", line 205, in stop
    super().stop(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 310, in stop
    self.stop_node(node, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_repeater_service.py", line 214, in stop_node
    node.account.signal(self._pid, signal.SIGKILL, allow_fail=False)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/cluster/remoteaccount.py", line 418, in signal
    self.ssh(cmd, allow_fail=allow_fail)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/cluster/remoteaccount.py", line 41, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/cluster/remoteaccount.py", line 300, in ssh
    raise RemoteCommandError(self, cmd, exit_status, stderr.read())
ducktape.cluster.remoteaccount.RemoteCommandError: root@ip-172-31-32-195: Command 'kill -9 5973' returned non-zero exit status -1.
mfleming self-assigned this on Nov 29, 2023
@mfleming (Contributor Author)

Also seeing this failure:

[WARNING - 2023-11-29 11:36:30,376 - service_registry - free_all - lineno:83]: Error cleaning service <KgoRepeaterService-0-140580239755344: num_nodes: 2, nodes: ['ip-172-31-47-216', 'ip-172-31-33-90']>: 
[INFO  - 2023-11-29 11:36:30,376 - runner_client - log - lineno:317]: RunnerClient: rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions: FAIL: TimeoutError('Waiting for group repeat01 to be ready')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 890, in test_many_partitions
    self._test_many_partitions(compacted=False)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1086, in _test_many_partitions
    progress_check()
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1063, in progress_check
    repeater.await_group_ready()
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_repeater_service.py", line 274, in await_group_ready
    self.redpanda.wait_until(group_ready,
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1197, in wait_until
    wait_until(wrapped,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Waiting for group repeat01 to be ready

@mfleming (Contributor Author) commented Dec 4, 2023

This is the same "Waiting for group repeat01 to be ready" failure as above.

Some of the worker threads on the kgo-repeater client are missing (missing consumers):

[ERROR - 2023-12-03 23:40:42,253 - kgo_repeater_service - await_group_ready - lineno:310]:   ip-172-31-0-203: missing set()
[ERROR - 2023-12-03 23:40:42,253 - kgo_repeater_service - await_group_ready - lineno:310]:   ip-172-31-13-214: missing set()
[ERROR - 2023-12-03 23:40:42,253 - kgo_repeater_service - await_group_ready - lineno:310]:   ip-172-31-3-70: missing {0}

The number of missing workers varies from failure to failure, but for this particular failure we can see the following for worker 0:

time="2023-12-03T23:37:22Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 acked 90 on partition 724 offset 23"
time="2023-12-03T23:37:32Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:37:42Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:37:52Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:38:02Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:38:12Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:38:22Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:38:32Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:38:42Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:38:52Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:39:02Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:39:12Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:39:22Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:39:32Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:39:42Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:39:52Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:40:02Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:40:12Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:40:22Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:40:32Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:40:42Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."
time="2023-12-03T23:40:52Z" level=debug msg="Produce ip-172-31-3-70_14365_w_0 checking for token..."

@mfleming (Contributor Author) commented Dec 8, 2023

I enabled trace-level debugging in kgo-repeater, which gives us the franz-go logs. For the missing workers, those logs show reads/writes to the brokers until they suddenly stop for some reason, and the connections to the brokers then get reaped:

time="2023-12-06T16:27:20Z" name=ip-172-31-11-50_8366_w_0[DEBUG] metadata refresh has identical topic partition data; topic: scale_000000, partition: 11577, leader: 4, leader_epoch: 1                                                                         
time="2023-12-06T16:27:20Z" name=ip-172-31-11-50_8366_w_0[DEBUG] read Produce v7; broker: 3, bytes_read: 64, read_wait: 17.122µs, time_to_read: 2.35847ms, err: <nil>                                                                                           
time="2023-12-06T16:27:20Z" name=ip-172-31-11-50_8366_w_0[DEBUG] produced; broker: 3, to: scale_000000[11575{31=>32}]                                                                                                                                           
time="2023-12-06T16:27:20Z" name=ip-172-31-11-50_8366_w_0[DEBUG] metadata refresh has identical topic partition data; topic: scale_000000, partition: 11578, leader: 7, leader_epoch: 1                                                                         
time="2023-12-06T16:27:20Z" name=ip-172-31-11-50_8366_w_0[DEBUG] metadata refresh has identical topic partition data; topic: scale_000000, partition: 11579, leader: 6, leader_epoch: 1         
... (snipped) ...
time="2023-12-06T16:27:21Z" name=ip-172-31-11-50_8366_w_0[DEBUG] autocommitting; group: repeat01                                                                                                                                                                
time="2023-12-06T16:27:26Z" name=ip-172-31-11-50_8366_w_0[DEBUG] autocommitting; group: repeat01                                                                                                                                                                
time="2023-12-06T16:27:31Z" name=ip-172-31-11-50_8366_w_0[DEBUG] reaped connections; time_since_last_reap: 19.999056699s, reap_dur: 60.499µs, num_reaped: 3                                                                                                     
time="2023-12-06T16:27:31Z" name=ip-172-31-11-50_8366_w_0[DEBUG] reaped connections; time_since_last_reap: 19.999871577s, reap_dur: 112.704µs, num_reaped: 7                                                                                                    
time="2023-12-06T16:27:31Z" name=ip-172-31-11-50_8366_w_0[DEBUG] autocommitting; group: repeat01                                                                     

Note that the kgo client disconnects before the await_group_ready step in MPT.

On the cluster side, the number of pending RPC requests also starts increasing while we're in the await_group_ready phase, but this is pretty typical for this test:

[screenshot: pending RPC request count increasing during await_group_ready]

The only interesting thing I could find in the RP logs was this:

WARN  2023-12-06 16:27:48,167 [shard 1:main] raft - [group_id:5854, {kafka/scale_000000/5853}] consensus.cc:3230 - transfer leadership: timed out waiting on node {id: {7}, revision: {66}} recovery

@mfleming (Contributor Author) commented Dec 13, 2023

The culprit seems to be that the heartbeat failed, which is a fatal error for consumer groups: I increased the group-ready timeout from 2 minutes to 5 minutes and can still hit this failure (it's permanent once triggered). Grepping for "unassigning everything" in the logs leads to some clues (these messages are only printed for workers that are missing):

$ cat results/latest/ManyPartitionsTest/test_many_partitions/1/KgoRepeaterService-0-139804018803424/*/kgo-repeater.log | grep "unassigning everything"                                                                                                        
time="2023-12-12T21:22:16Z" name=ip-172-31-14-210_9031_w_0[INFO] assigning partitions; why: clearing assignment at end of group management session, how: unassigning everything, input:                                                                         
time="2023-12-12T21:22:16Z" name=ip-172-31-3-231_8921_w_48[INFO] assigning partitions; why: clearing assignment at end of group management session, how: unassigning everything, input:                                                                         
time="2023-12-12T21:22:18Z" name=ip-172-31-3-231_8921_w_47[INFO] assigning partitions; why: clearing assignment at end of group management session, how: unassigning everything, input:                                                                         
time="2023-12-12T21:22:18Z" name=ip-172-31-4-13_9879_w_39[INFO] assigning partitions; why: clearing assignment at end of group management session, how: unassigning everything, input:                                                                          
time="2023-12-12T21:22:18Z" name=ip-172-31-4-13_9879_w_95[INFO] assigning partitions; why: clearing assignment at end of group management session, how: unassigning everything, input:          

This is caused by the heartbeat failing:

time="2023-12-12T21:22:18Z" name=ip-172-31-4-13_9879_w_95[DEBUG] heartbeat complete; group: repeat01, err: context canceled                                                                                                                                     
time="2023-12-12T21:22:18Z" name=ip-172-31-4-13_9879_w_95[DEBUG] wrote FindCoordinator v3; broker: 8, bytes_written: 50, write_wait: 5.98µs, time_to_write: 9.447µs, err: <nil>                                                                                 
time="2023-12-12T21:22:18Z" name=ip-172-31-4-13_9879_w_95[DEBUG] read FindCoordinator v3; broker: 8, bytes_read: 41, read_wait: 4.383425ms, time_to_read: 6.92µs, err: <nil>                                                                                    
time="2023-12-12T21:22:18Z" name=ip-172-31-4-13_9879_w_95[INFO] heartbeat errored; group: repeat01, err: context canceled  

It's still unclear why the heartbeat failed. We were restarting node 9 at this point, and the Redpanda logs contain things like:

DEBUG 2023-12-12 21:22:15,495 [shard 2:main] r/heartbeat - heartbeat_manager.cc:403 - Received error when sending heartbeats to node 9 - rpc::errc::exponential_backoff

and also issues with node 1...

DEBUG 2023-12-12 21:22:54,827 [shard 0:main] r/heartbeat - heartbeat_manager.cc:535 - Heartbeat request for group 1605 timed out on the node 1

@twmb any ideas what could be going on here? Is heartbeat failure for a consumer group fatal? What happens when we disconnect from a node in the middle of a heartbeat request?

@twmb (Contributor) commented Dec 13, 2023

In taking a quickish look at the franz-go code, the only time the context used for a heartbeat request is canceled is when the consumer is leaving the group (or, when the client is closing -- often these are the same event).
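
To make that concrete, here is a minimal, hypothetical sketch (not franz-go's actual code) of the relationship being described: heartbeats run under a session context that is canceled only by a group leave or a client close, so a context.Canceled heartbeat is normally read as one of those two events.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// heartbeatLoop is a hypothetical stand-in for a group-management heartbeat
// loop: it only stops when sessionCtx is canceled, which in this sketch can
// happen solely via a group leave or a client close.
func heartbeatLoop(sessionCtx context.Context) error {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// a real client would issue a Heartbeat request here
		case <-sessionCtx.Done():
			return sessionCtx.Err() // interpreted as "we are leaving the group"
		}
	}
}

func main() {
	clientCtx, closeClient := context.WithCancel(context.Background())
	defer closeClient()

	// The session context is derived from the client context, so either a
	// deliberate leave or a client close cancels it.
	sessionCtx, leaveGroup := context.WithCancel(clientCtx)

	errs := make(chan error, 1)
	go func() { errs <- heartbeatLoop(sessionCtx) }()

	time.Sleep(300 * time.Millisecond)
	leaveGroup() // or closeClient(); both end the session

	fmt.Println("heartbeat ended with:", <-errs) // context.Canceled
}
```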

@mfleming (Contributor Author) commented Dec 13, 2023

> In taking a quickish look at the franz-go code, the only time the context used for a heartbeat request is canceled is when the consumer is leaving the group (or, when the client is closing -- often these are the same event).

I don't think we're leaving the group or closing the connection, because I would expect to see log messages around "leaving group" but there are none. I've attached the log file for a consumer that hit this issue.

kgo-repeater.log.tar.gz

Plus the consumer never attempts to rejoin the group.

Also, if groupConsumer.manage() returns, do we return an error to the caller? One of the sticking points I've had chasing this issue is that things just silently stop working without much announcement.

@mfleming (Contributor Author)

Also attaching the Go stacks from the log:
printstack.log.tar.gz

@piyushredpanda (Contributor)

@twmb could you kindly take a look at the logs, please? This is blocking our partition density work to increase the number of supported partitions, a key deliverable for Cloud tiers.

@twmb (Contributor) commented Dec 18, 2023

There's a bug in franz-go when two requests are loading the coordinator at the same time and the first request cancels its context. Here it happened with OffsetCommit canceling its context while finding a coordinator, while Heartbeat was also waiting on the same find-coordinator logic. Ping me in Slack if you want the extremely detailed writeup I was working on for this.
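
As a hedged illustration of that sequence (a stripped-down sketch, not franz-go's actual coordinator cache), the hazard is that the shared lookup runs under whichever request's context happened to start it, so canceling that request fails every other waiter with context.Canceled:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// coordLoader deduplicates concurrent "find coordinator" lookups: the first
// caller starts the load and later callers wait on the same result. Single
// use only, for brevity.
type coordLoader struct {
	mu   sync.Mutex
	done chan struct{} // closed when the in-flight load finishes
	err  error         // load result, written before done is closed
}

// load runs the lookup under ctx -- the hazard is that ctx belongs to
// whichever request happened to start the load (here, an OffsetCommit).
func (l *coordLoader) load(ctx context.Context) error {
	l.mu.Lock()
	if l.done == nil {
		done := make(chan struct{})
		l.done = done
		go func() {
			select {
			case <-time.After(500 * time.Millisecond): // pretend the FindCoordinator RPC succeeds
				l.err = nil
			case <-ctx.Done(): // the starting request's cancellation aborts the shared load
				l.err = ctx.Err()
			}
			close(done)
		}()
	}
	done := l.done
	l.mu.Unlock()
	<-done
	return l.err
}

func main() {
	var l coordLoader

	// An OffsetCommit kicks off the coordinator load under its own context...
	commitCtx, cancelCommit := context.WithCancel(context.Background())
	go l.load(commitCtx)

	// ...and a Heartbeat arrives and waits on the very same load.
	time.Sleep(50 * time.Millisecond)
	heartbeatErr := make(chan error, 1)
	go func() { heartbeatErr <- l.load(context.Background()) }()

	// A second OffsetCommit now cancels the first commit's context.
	time.Sleep(50 * time.Millisecond)
	cancelCommit()

	// The heartbeat fails with context.Canceled, which group management
	// treats as a group leave and quits for good.
	err := <-heartbeatErr
	fmt.Println("heartbeat sees:", err, "| looks like a leave:", errors.Is(err, context.Canceled))
}
```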

Also, this system is incredibly loaded. Unrelated to the franz-go bug, it took 5.5s to write a heartbeat request -- as in, something that should have happened immediately took >5.5s to be scheduled and executed, between two back-to-back bits of logic.

@twmb (Contributor) commented Dec 18, 2023

Around where this failure happened are a bunch of other failures that show system load -- EOF failures on connections (timeouts), etc. But these are all recoverable in the client.

twmb added a commit to twmb/franz-go that referenced this issue Dec 21, 2023
Some load testing in Redpanda showed a failure where consuming quit
unexpectedly and unrecoverably.

The sequence of events is:
* if OffsetCommit is issued just before Heartbeat
* and the group needs to be loaded so FindCoordinator is triggered,
* and OffsetCommit happens again, canceling the prior commit's context
Then,
* FindCoordinator would cancel
* Heartbeat, which is waiting on the same load, would fail with
  context.Canceled
* This error is seen as a group leave error
* The group management logic would quit entirely.

Now, the context used for FindCoordinator is the client context, which
is only closed on client close. This is also better anyway -- if two
requests are waiting for the same coordinator load, we don't want the
first request canceling to error the second request. If all requests
cancel and we have a stray FindCoordinator in flight, that's ok too,
because well, worst case we'll just eventually have a little bit of
extra data cached that is likely needed in the future anyway.

Closes redpanda-data/redpanda#15131
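
A hedged sketch of the shape of the fix described above (again, not the actual franz-go diff): tie the shared lookup to the client's lifetime context, so a caller canceling only abandons its own wait instead of aborting the load for everyone else.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// startLoad stands in for the shared FindCoordinator lookup; loadCtx controls
// how long the lookup itself may run.
func startLoad(loadCtx context.Context) <-chan error {
	done := make(chan error, 1)
	go func() {
		select {
		case <-time.After(200 * time.Millisecond): // pretend the RPC succeeds
			done <- nil
		case <-loadCtx.Done():
			done <- loadCtx.Err()
		}
	}()
	return done
}

func main() {
	clientCtx := context.Background() // only goes away when the client closes

	// The load is tied to the client, not to the request that triggered it.
	done := startLoad(clientCtx)

	// The triggering request (an OffsetCommit) is canceled mid-load...
	reqCtx, cancelReq := context.WithCancel(context.Background())
	cancelReq()
	fmt.Println("commit gave up early:", reqCtx.Err())

	// ...but the shared load is unaffected; a waiter that still cares (the
	// Heartbeat) gets a real result instead of context.Canceled. Worst case,
	// a stray lookup finishes and its result is cached for later use.
	fmt.Println("heartbeat waiter sees:", <-done)
}
```

The actual change in twmb/franz-go#650 applies this idea to the client's FindCoordinator path, as described in the commit message above.
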
peczenyj pushed a commit to peczenyj/franz-go that referenced this issue Dec 21, 2023
mfleming added a commit to mfleming/kgo-verifier that referenced this issue Dec 21, 2023
mfleming added a commit to redpanda-data/kgo-verifier that referenced this issue Dec 22, 2023