storage: endless spam from split queue retries #38183

Closed
jordanlewis opened this issue Jun 16, 2019 · 6 comments · Fixed by #38302
Labels
C-bug (Code not up to spec/doc; solution expected to change code/behavior) · no-issue-activity · T-kv (KV Team) · X-stale

Comments

@jordanlewis
Member

jordanlewis commented Jun 16, 2019

While running various operations on a local cluster, I've gotten endless, rapid-fire spam from the split queue that looks like this:

I190616 22:01:16.651467 303148 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.656905 303151 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56656] (zone config)
I190616 22:01:16.663793 303151 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.669182 303154 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56657] (zone config)
I190616 22:01:16.676599 303154 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.682135 301143 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56658] (zone config)
I190616 22:01:16.689591 301143 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.694930 302860 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56659] (zone config)
I190616 22:01:16.701739 302860 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.707172 302863 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56660] (zone config)
I190616 22:01:16.714354 302863 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.719807 302866 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56661] (zone config)
I190616 22:01:16.726593 302866 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.731990 301147 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56662] (zone config)
I190616 22:01:16.738716 301147 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.744122 303159 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56663] (zone config)
I190616 22:01:16.750865 303159 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.756206 303162 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56664] (zone config)
I190616 22:01:16.763056 303162 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.768365 301151 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56665] (zone config)
I190616 22:01:16.776631 301151 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.782099 303205 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56666] (zone config)
I190616 22:01:16.789057 303205 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.794444 299042 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56667] (zone config)
I190616 22:01:16.801219 299042 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.806649 302414 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56668] (zone config)
I190616 22:01:16.813199 302414 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying

It doesn't cease until I restart the cluster. @tbg, could you triage this? Let me know if you need help reproing; I haven't tried to make a small repro case yet because I figured the core team might quickly understand what's going on.

Jira issue: CRDB-10827

@jordanlewis jordanlewis added the C-bug label Jun 16, 2019
@tbg
Member

tbg commented Jun 16, 2019

Interesting, I haven't seen that (maybe I just didn't run CRDB enough recently). The error comes from here:

log.Infof(ctx, "split saw concurrent descriptor modification; maybe retrying")

and basically says that the descriptor changed while the split was in flight. This is a little weird: if it happens once, OK, but once the descriptor changes, the replica carrying out the split should see the updated descriptor very soon (within a few ms on a local cluster, or call it a few seconds; surely you waited that long).
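
A minimal sketch of the pattern being described (hypothetical Go, not the actual storage-package code; rangeDesc, adminSplit, and the generation field are illustrative stand-ins): the split commits only if the descriptor it read is still current, the queue retries on a mismatch, and a retry with the refreshed descriptor normally succeeds.

package main

import (
	"errors"
	"fmt"
)

// rangeDesc is an illustrative stand-in for a range descriptor; the
// generation field plays the role of the version the split checks.
type rangeDesc struct {
	generation int
	endKey     string
}

var errConcurrentModification = errors.New("descriptor changed concurrently")

// adminSplit stands in for the split transaction: it commits only if the
// stored descriptor still matches the one the caller read.
func adminSplit(stored *rangeDesc, read rangeDesc) error {
	if stored.generation != read.generation {
		return errConcurrentModification
	}
	stored.generation++ // commit the split
	return nil
}

func main() {
	stored := &rangeDesc{generation: 7, endKey: "/Max"}

	read := *stored
	stored.generation++ // someone else modifies the descriptor meanwhile

	if err := adminSplit(stored, read); err != nil {
		fmt.Println("split saw concurrent descriptor modification; maybe retrying:", err)
	}

	// On retry the queue re-reads the now-current descriptor, so the second
	// attempt succeeds; endless retries mean the expectation never converges
	// with what is actually stored.
	read = *stored
	if err := adminSplit(stored, read); err == nil {
		fmt.Println("retry with the refreshed descriptor succeeds")
	}
}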

This is on master, right? If you have a surefire repro I'd appreciate it.

@tbg
Member

tbg commented Jun 16, 2019

Actually, an alternative theory: could it be #38147? See

https://github.com/cockroachdb/cockroach/pull/38147/files#r294102511

@jordanlewis
Member Author

Yeah, on master. And I waited several minutes and it didn't cease.

craig bot pushed a commit that referenced this issue Jun 20, 2019
38302: storage: CPut bytes instead of a proto when updating RangeDescriptors r=tbg a=danhhz

In #38147, a new non-nullable field was added to the ReplicaDescriptor
proto that is serialized inside RangeDescriptors. RangeDescriptors are
updated using CPut to detect races. This means that if a RangeDescriptor
had been written by an old version of cockroach and we then attempt to
update it, the CPut will fail because the encoded version of it is
different (non-nullable proto2 fields are always included).

A similar issue was introduced in #38004 which made the StickyBit field
on RangeDescriptor non-nullable.

We could keep the fields as nullable, but this has allocation costs (and
is also more annoying to work with).

Worse, the proto spec is pretty explicit about not relying on
serialization of a given message to always produce exactly the same
bytes:

    From https://developers.google.com/protocol-buffers/docs/encoding#implications
    Do not assume the byte output of a serialized message is stable.

So instead, we've decided to stop CPut-ing protos and to change the
interface of CPut to take the expected value as bytes. To CPut a proto,
the encoded value will be read from kv, passed around with the decoded
struct, and used in the eventual CPut. This work is upcoming, but the
above two PRs are real breakages, so this commit fixes those first.

Neither of these PRs made the last alpha, so no release note is needed.

Touches #38308
Closes #38183

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
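
To make the failure mode concrete, here's a minimal, self-contained sketch (hypothetical Go, not code from this PR): cput emulates CPut's compare-and-set semantics with bytes.Equal, and hand-written byte slices stand in for proto encodings whose serialization changed across versions.

package main

import (
	"bytes"
	"fmt"
)

// cput emulates CPut's compare-and-set semantics: write newVal only if
// the currently stored bytes exactly match expected.
func cput(store map[string][]byte, key string, expected, newVal []byte) error {
	if !bytes.Equal(store[key], expected) {
		return fmt.Errorf("unexpected value: stored bytes differ from expected")
	}
	store[key] = newVal
	return nil
}

func main() {
	store := map[string][]byte{}

	// An old binary wrote the descriptor without the new field.
	oldEncoding := []byte{0x0a, 0x03, 'f', 'o', 'o'} // stand-in for old proto bytes
	store["range-desc"] = oldEncoding

	// A new binary decodes the descriptor and re-serializes it to build the
	// CPut expectation. The new non-nullable field is now always encoded, so
	// the bytes differ even though no field value changed.
	reEncoded := append(append([]byte{}, oldEncoding...), 0x10, 0x00) // stand-in for new proto bytes

	if err := cput(store, "range-desc", reEncoded, []byte("split!")); err != nil {
		// This fails on every attempt: the endless retry loop seen above.
		fmt.Println("CPut with re-encoded expectation:", err)
	}

	// The fix: expect the exact bytes read from KV rather than a fresh
	// serialization of the decoded struct.
	if err := cput(store, "range-desc", oldEncoding, []byte("split!")); err == nil {
		fmt.Println("CPut with the original bytes succeeds")
	}
}

The design choice in the PR follows directly from this: the only stable expectation is the exact bytes previously read from KV, so CPut's interface moves to bytes and callers thread the read encoding through to the eventual write.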
@craig craig bot closed this as completed in #38302 Jun 20, 2019
@darinpp
Contributor

darinpp commented Oct 24, 2021

I just experienced this issue with the latest master. It happened in a multi-tenant environment while load testing a single tenant, and it caused a range to grow to more than 700 MB. See range 1278 below:

  range_id |                start_pretty                 |                 end_pretty                  | replicas | voting_replicas | non_voting_replicas | learner_replicas | lease_holder | range_size
-----------+---------------------------------------------+---------------------------------------------+----------+-----------------+---------------------+------------------+--------------+-------------
      1278 | /Tenant/15                                  | /Tenant/15/Table/225/1/-6249445251639448483 | {1,5,6}  | {1,6,5}         | {}                  | {}               |            5 |  782029373
      2918 | /Tenant/15/Table/225/1/-6249445251639448483 | /Tenant/15/Table/225/1/-6028235662291439790 | {1,3,6}  | {1,6,3}         | {}                  | {}               |            6 |   50162048
      3026 | /Tenant/15/Table/225/1/-6028235662291439790 | /Tenant/15/Table/225/1/-5815943635577107112 | {1,3,6}  | {1,6,3}         | {}                  | {}               |            6 |   50388488
...

This was repeating in the leaseholder's log:

...
I211023 22:16:03.638226 23416732 kv/kvserver/split_queue.go:170 ⋮ [n5,split,s5,r1278/3:‹/Tenant/15{-/Table/…}›] 10100998  split saw concurrent descriptor modification; maybe retrying
I211023 22:16:03.886837 23416734 kv/kvserver/replica_command.go:395 ⋮ [n5,split,s5,r1278/3:‹/Tenant/15{-/Table/…}›] 10100999  initiating a split of this range at key ‹/Tenant/15/Table/225/1/-8498412310403633290› [r5044559] (‹746 MiB above threshold size 512 MiB›)‹›
I211023 22:16:03.887594 23416734 kv/kvserver/split_queue.go:170 ⋮ [n5,split,s5,r1278/3:‹/Tenant/15{-/Table/…}›] 10101000  split saw concurrent descriptor modification; maybe retrying
I211023 22:16:04.138557 23416736 kv/kvserver/replica_command.go:395 ⋮ [n5,split,s5,r1278/3:‹/Tenant/15{-/Table/…}›] 10101001  initiating a split of this range at key ‹/Tenant/15/Table/225/1/-8498412310403633290› [r5044560] (‹746 MiB above threshold size 512 MiB›)‹›
...

A rolling restart fixed the issue.

I don't see a recent update to the replica descriptor proto, so I'm not sure whether that is somehow related.
Would it help if we added the actual error to the log line, like this:

log.Infof(ctx, "split saw concurrent descriptor modification; maybe retrying: %+v", err)

or something else that would give a clue as to what is being changed back and forth?

@darinpp darinpp reopened this Oct 24, 2021
@tbg
Member

tbg commented Oct 25, 2021

@AlexTalks, this issue could (could!) be related to #71640. Here, there is a tight retry loop because the split queue is doing the processing and apparently isn't backing off; in our case, it's a manual split, so there is a backoff and no logging.
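
As a rough illustration (hypothetical Go, not the kvserver queue; trySplit and the backoff constants are made up), here is the difference between spinning on the error and backing off between attempts:

package main

import (
	"errors"
	"fmt"
	"time"
)

var errDescChanged = errors.New("descriptor changed concurrently")

// trySplit stands in for one split attempt; here it fails a few times to
// simulate the concurrent-descriptor-modification error.
func trySplit(attempt int) error {
	if attempt < 3 {
		return errDescChanged
	}
	return nil
}

// splitWithBackoff retries the split, sleeping with exponential backoff
// between attempts instead of immediately re-queueing.
func splitWithBackoff(maxAttempts int) error {
	backoff := 50 * time.Millisecond
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := trySplit(attempt)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errDescChanged) {
			return err // unexpected error: don't retry
		}
		fmt.Printf("split saw concurrent descriptor modification; retrying in %s\n", backoff)
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("split did not succeed after %d attempts", maxAttempts)
}

func main() {
	if err := splitWithBackoff(8); err != nil {
		fmt.Println(err)
	}
}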

@blathers-crl blathers-crl bot added the T-kv KV Team label May 18, 2022
@tbg tbg removed their assignment May 18, 2022

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 27, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024