storage: endless spam from split queue retries #38183

Closed
jordanlewis opened this issue Jun 16, 2019 · 6 comments · Fixed by #38302
Labels
C-bug (Code not up to spec/doc; solution expected to change code/behavior) · no-issue-activity · T-kv (KV Team) · X-stale

Comments

@jordanlewis
Member

jordanlewis commented Jun 16, 2019

While running various operations on a local cluster, I've gotten endless, rapid-fire spam from the split queue that looks like this:

I190616 22:01:16.651467 303148 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.656905 303151 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56656] (zone config)
I190616 22:01:16.663793 303151 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.669182 303154 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56657] (zone config)
I190616 22:01:16.676599 303154 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.682135 301143 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56658] (zone config)
I190616 22:01:16.689591 301143 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.694930 302860 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56659] (zone config)
I190616 22:01:16.701739 302860 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.707172 302863 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56660] (zone config)
I190616 22:01:16.714354 302863 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.719807 302866 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56661] (zone config)
I190616 22:01:16.726593 302866 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.731990 301147 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56662] (zone config)
I190616 22:01:16.738716 301147 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.744122 303159 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56663] (zone config)
I190616 22:01:16.750865 303159 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.756206 303162 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56664] (zone config)
I190616 22:01:16.763056 303162 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.768365 301151 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56665] (zone config)
I190616 22:01:16.776631 301151 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.782099 303205 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56666] (zone config)
I190616 22:01:16.789057 303205 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.794444 299042 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56667] (zone config)
I190616 22:01:16.801219 299042 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying
I190616 22:01:16.806649 302414 storage/replica_command.go:290  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] initiating a split of this range at key /Table/3871 [r56668] (zone config)
I190616 22:01:16.813199 302414 storage/split_queue.go:146  [n1,split,s1,r8613/1:/{Table/3870/2…-Max}] split saw concurrent descriptor modification; maybe retrying

It doesn't cease until I restart the cluster. @tbg, could you triage this? Let me know if you need help reproing; I haven't tried to make a small repro case yet because I figured the core team might quickly understand what's going on.

Jira issue: CRDB-10827

@jordanlewis jordanlewis added the C-bug label Jun 16, 2019
@tbg
Member

tbg commented Jun 16, 2019

Interesting, I haven't seen that (maybe I just didn't run CRDB enough recently). The error comes from here:

log.Infof(ctx, "split saw concurrent descriptor modification; maybe retrying")

and basically says that the descriptor changed while the split was in flight. This is a little weird: if it happens once, OK, but once the descriptor changes, the replica carrying out the split should see the updated descriptor very soon (within a few ms on a local cluster, or call it a few seconds; surely you waited that long).
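
A minimal sketch of the pattern being described (hypothetical Go, not the actual storage-package code; rangeDesc, adminSplit, and the generation field are illustrative stand-ins): the split commits only if the descriptor it read is still current, the queue retries on a mismatch, and a retry with the refreshed descriptor normally succeeds.

package main

import (
	"errors"
	"fmt"
)

// rangeDesc is an illustrative stand-in for a range descriptor; the
// generation field plays the role of the version the split checks.
type rangeDesc struct {
	generation int
	endKey     string
}

var errConcurrentModification = errors.New("descriptor changed concurrently")

// adminSplit stands in for the split transaction: it commits only if the
// stored descriptor still matches the one the caller read.
func adminSplit(stored *rangeDesc, read rangeDesc) error {
	if stored.generation != read.generation {
		return errConcurrentModification
	}
	stored.generation++ // commit the split
	return nil
}

func main() {
	stored := &rangeDesc{generation: 7, endKey: "/Max"}

	read := *stored
	stored.generation++ // someone else modifies the descriptor meanwhile

	if err := adminSplit(stored, read); err != nil {
		fmt.Println("split saw concurrent descriptor modification; maybe retrying:", err)
	}

	// On retry the queue re-reads the now-current descriptor, so the second
	// attempt succeeds; endless retries mean the expectation never converges
	// with what is actually stored.
	read = *stored
	if err := adminSplit(stored, read); err == nil {
		fmt.Println("retry with the refreshed descriptor succeeds")
	}
}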

This is on master, right? If you have a surefire repro I'd appreciate it.

@tbg
Member

tbg commented Jun 16, 2019

Actually, an alternative theory: could it be #38147? See

https://github.com/cockroachdb/cockroach/pull/38147/files#r294102511

@jordanlewis
Member Author

Yeah, on master. And I waited several minutes and it didn't cease.

craig bot pushed a commit that referenced this issue Jun 20, 2019
38302: storage: CPut bytes instead of a proto when updating RangeDescriptors r=tbg a=danhhz

In #38147, a new non-nullable field was added to the ReplicaDescriptor
proto that is serialized inside RangeDescriptors. RangeDescriptors are
updated using CPut to detect races. This means that if a RangeDescriptor
had been written by an old version of cockroach and we then attempt to
update it, the CPut will fail because the encoded version of it is
different (non-nullable proto2 fields are always included).

A similar issue was introduced in #38004 which made the StickyBit field
on RangeDescriptor non-nullable.

We could keep the fields as nullable, but this has allocation costs (and
is also more annoying to work with).

Worse, the proto spec is pretty explicit about not relying on
serialization of a given message to always produce exactly the same
bytes:

    From https://developers.google.com/protocol-buffers/docs/encoding#implications
    Do not assume the byte output of a serialized message is stable.

So instead, we've decided to stop CPut-ing protos and to change the
interface of CPut to take the expected value as bytes. To CPut a proto,
the encoded value will be read from kv, passed around with the decoded
struct, and used in the eventual CPut. This work is upcoming, but the
above two PRs are real breakages, so this commit fixes those first.

Neither of these PRs made the last alpha, so no release note is needed.

Touches #38308
Closes #38183

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
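
To make the failure mode concrete, here's a minimal, self-contained sketch (hypothetical Go, not code from this PR): cput emulates CPut's compare-and-set semantics with bytes.Equal, and hand-written byte slices stand in for proto encodings whose serialization changed across versions.

package main

import (
	"bytes"
	"fmt"
)

// cput emulates CPut's compare-and-set semantics: write newVal only if
// the currently stored bytes exactly match expected.
func cput(store map[string][]byte, key string, expected, newVal []byte) error {
	if !bytes.Equal(store[key], expected) {
		return fmt.Errorf("unexpected value: stored bytes differ from expected")
	}
	store[key] = newVal
	return nil
}

func main() {
	store := map[string][]byte{}

	// An old binary wrote the descriptor without the new field.
	oldEncoding := []byte{0x0a, 0x03, 'f', 'o', 'o'} // stand-in for old proto bytes
	store["range-desc"] = oldEncoding

	// A new binary decodes the descriptor and re-serializes it to build the
	// CPut expectation. The new non-nullable field is now always encoded, so
	// the bytes differ even though no field value changed.
	reEncoded := append(append([]byte{}, oldEncoding...), 0x10, 0x00) // stand-in for new proto bytes

	if err := cput(store, "range-desc", reEncoded, []byte("split!")); err != nil {
		// This fails on every attempt: the endless retry loop seen above.
		fmt.Println("CPut with re-encoded expectation:", err)
	}

	// The fix: expect the exact bytes read from KV rather than a fresh
	// serialization of the decoded struct.
	if err := cput(store, "range-desc", oldEncoding, []byte("split!")); err == nil {
		fmt.Println("CPut with the original bytes succeeds")
	}
}

The design choice in the PR follows directly from this: the only stable expectation is the exact bytes previously read from KV, so CPut's interface moves to bytes and callers thread the read encoding through to the eventual write.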
@craig craig bot closed this as completed in #38302 Jun 20, 2019
@darinpp
Contributor

darinpp commented Oct 24, 2021

I just experienced this issue with the latest master. It happened in a multi-tenant environment while load testing a single tenant, and it caused a range to grow to more than 700 MB. See range 1278 below:

  range_id |                start_pretty                 |                 end_pretty                  | replicas | voting_replicas | non_voting_replicas | learner_replicas | lease_holder | range_size
-----------+---------------------------------------------+---------------------------------------------+----------+-----------------+---------------------+------------------+--------------+-------------
      1278 | /Tenant/15                                  | /Tenant/15/Table/225/1/-6249445251639448483 | {1,5,6}  | {1,6,5}         | {}                  | {}               |            5 |  782029373
      2918 | /Tenant/15/Table/225/1/-6249445251639448483 | /Tenant/15/Table/225/1/-6028235662291439790 | {1,3,6}  | {1,6,3}         | {}                  | {}               |            6 |   50162048
      3026 | /Tenant/15/Table/225/1/-6028235662291439790 | /Tenant/15/Table/225/1/-5815943635577107112 | {1,3,6}  | {1,6,3}         | {}                  | {}               |            6 |   50388488
...

This was repeating in the leaseholder's log:

...
I211023 22:16:03.638226 23416732 kv/kvserver/split_queue.go:170 ⋮ [n5,split,s5,r1278/3:‹/Tenant/15{-/Table/…}›] 10100998  split saw concurrent descriptor modification; maybe retrying
I211023 22:16:03.886837 23416734 kv/kvserver/replica_command.go:395 ⋮ [n5,split,s5,r1278/3:‹/Tenant/15{-/Table/…}›] 10100999  initiating a split of this range at key ‹/Tenant/15/Table/225/1/-8498412310403633290› [r5044559] (‹746 MiB above threshold size 512 MiB›)‹›
I211023 22:16:03.887594 23416734 kv/kvserver/split_queue.go:170 ⋮ [n5,split,s5,r1278/3:‹/Tenant/15{-/Table/…}›] 10101000  split saw concurrent descriptor modification; maybe retrying
I211023 22:16:04.138557 23416736 kv/kvserver/replica_command.go:395 ⋮ [n5,split,s5,r1278/3:‹/Tenant/15{-/Table/…}›] 10101001  initiating a split of this range at key ‹/Tenant/15/Table/225/1/-8498412310403633290› [r5044560] (‹746 MiB above threshold size 512 MiB›)‹›
...

A rolling restart fixed the issue.

I don't see a recent update to the replica descriptor proto, so I'm not sure whether that is somehow related.
Would it help if we added the actual error to the log line, like this:

log.Infof(ctx, "split saw concurrent descriptor modification; maybe retrying: %+v", err)

or something else that would give a clue as to what is being changed back and forth?

@darinpp darinpp reopened this Oct 24, 2021
@tbg
Member

tbg commented Oct 25, 2021

@AlexTalks, this issue could (could!) be related to #71640. Here, there is a tight retry loop because the split queue is doing the processing and apparently isn't backing off; in our case, it's a manual split, so there is a backoff and no logging.
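
As a rough illustration (hypothetical Go, not the kvserver queue; trySplit and the backoff constants are made up), here is the difference between spinning on the error and backing off between attempts:

package main

import (
	"errors"
	"fmt"
	"time"
)

var errDescChanged = errors.New("descriptor changed concurrently")

// trySplit stands in for one split attempt; here it fails a few times to
// simulate the concurrent-descriptor-modification error.
func trySplit(attempt int) error {
	if attempt < 3 {
		return errDescChanged
	}
	return nil
}

// splitWithBackoff retries the split, sleeping with exponential backoff
// between attempts instead of immediately re-queueing.
func splitWithBackoff(maxAttempts int) error {
	backoff := 50 * time.Millisecond
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := trySplit(attempt)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errDescChanged) {
			return err // unexpected error: don't retry
		}
		fmt.Printf("split saw concurrent descriptor modification; retrying in %s\n", backoff)
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("split did not succeed after %d attempts", maxAttempts)
}

func main() {
	if err := splitWithBackoff(8); err != nil {
		fmt.Println(err)
	}
}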

@blathers-crl blathers-crl bot added the T-kv KV Team label May 18, 2022
@tbg tbg removed their assignment May 18, 2022

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 27, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024