storage: endless spam from split queue retries #38183
Interesting, I haven't seen that (maybe I just haven't run CRDB enough recently). The error comes from cockroach/pkg/storage/split_queue.go, line 146 in 90e8985, and basically says that the descriptor changed while the split was ongoing. This is a little weird. If it happens once, ok, but if the descriptor changes, the replica carrying out the split should see its descriptor updated very soon (at the very least after a few ms on a local cluster, or let's call it a few seconds; surely you waited that long). This is on master, right? If you have a surefire repro I'd appreciate it.
Actually, an alternative theory: could it be #38147? See https://github.com/cockroachdb/cockroach/pull/38147/files#r294102511
Yeah, on master. And I waited several minutes and it didn't cease.
38302: storage: CPut bytes instead of a proto when updating RangeDescriptors r=tbg a=danhhz

In #38147, a new non-nullable field was added to the ReplicaDescriptor proto that is serialized inside RangeDescriptors. RangeDescriptors are updated using CPut to detect races. This means that if a RangeDescriptor had been written by an old version of cockroach and we then attempt to update it, the CPut will fail because the encoded version of it is different (non-nullable proto2 fields are always included).

A similar issue was introduced in #38004, which made the StickyBit field on RangeDescriptor non-nullable.

We could keep the fields as nullable, but this has allocation costs (and is also more annoying to work with). Worse, the proto spec is pretty explicit about not relying on serialization of a given message to always produce exactly the same bytes:

From https://developers.google.com/protocol-buffers/docs/encoding#implications: "Do not assume the byte output of a serialized message is stable."

So instead, we've decided to stop CPut-ing protos and to change the interface of CPut to take the expected value as bytes. To CPut a proto, the encoded value will be read from kv, passed around with the decoded struct, and used in the eventual CPut. This work is upcoming, but the above two PRs are real breakages, so this commit fixes those first.

Neither of these PRs made the last alpha so no release note.

Touches #38308
Closes #38183

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
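To make the failure mode described above concrete, here is a minimal sketch in Go. It is not CockroachDB code: the `store` type, `CPutBytes`, `encodeV1`, and `encodeV2` are hypothetical names invented for illustration, and the two encoders only mimic an old and a new binary serializing the same logical descriptor, where the new one always emits an extra non-nullable field. It shows why an expectation rebuilt by re-encoding can miss, while an expectation made of the raw bytes read from the store matches.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// ErrConditionFailed is returned when the stored bytes differ from the
// expected bytes. Hypothetical name, for illustration only.
var ErrConditionFailed = errors.New("condition failed: unexpected value")

// store is a toy in-memory key-value map standing in for the real KV layer.
type store map[string][]byte

// CPutBytes writes newVal only if the bytes currently stored under key are
// byte-for-byte equal to expected.
func (s store) CPutBytes(key string, newVal, expected []byte) error {
	if !bytes.Equal(s[key], expected) {
		return ErrConditionFailed
	}
	s[key] = append([]byte(nil), newVal...)
	return nil
}

// encodeV1 mimics an old binary: only field 1 (the range ID) is written.
func encodeV1(rangeID byte) []byte { return []byte{0x08, rangeID} }

// encodeV2 mimics a new binary: field 2 is non-nullable, so it is always
// written, even when it holds the zero value.
func encodeV2(rangeID byte) []byte { return []byte{0x08, rangeID, 0x10, 0x00} }

func main() {
	// The descriptor was written by the old binary.
	s := store{"desc": encodeV1(7)}

	// Re-encoding the decoded descriptor with the new binary yields
	// different bytes, so the condition never matches.
	err := s.CPutBytes("desc", encodeV2(8), encodeV2(7))
	fmt.Println("re-encoded expectation:", err)

	// Passing along the raw bytes that were read makes the condition match.
	raw := s["desc"]
	err = s.CPutBytes("desc", encodeV2(8), raw)
	fmt.Println("raw expectation:", err)
}
```

In this sketch the first call fails every time it is retried, because the expectation is recomputed the same way on each attempt. That is consistent with the endless retries reported in this issue, under the assumption that the retried operation rebuilds its expected value by re-encoding.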
I just experienced this issue with the latest master. It happened in a multi-tenant environment while load testing a single tenant. It caused a range to grow to more than 700MB. See range 1278 below:
This was repeating in the log of the leaseholder
Rolling restart fixed the issue. I don't see a recent update in the replica descriptor proto so not sure if that is somehow related.
or something else that would give a clue as to what is being changed back and forth?
@AlexTalks this issue could (could!) be related to #71640. Here, there is a tight retry loop because the split queue is doing the processing and apparently isn't backing off. In our case, it's a manual split and so there is a backoff and no logging.
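For readers unfamiliar with the distinction drawn here, the following is a generic sketch of a retry loop with capped exponential backoff, in Go. It is not the split queue's implementation; `backoffDelays`, `retryWithBackoff`, `recorder`, and `errDescriptorChanged` are hypothetical names, and the delays are arbitrary. A tight loop corresponds to retrying with no delay between attempts.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDescriptorChanged is a hypothetical stand-in for the retryable error.
var errDescriptorChanged = errors.New("descriptor changed")

// backoffDelays returns the delay to wait before each retry, doubling from
// base and capped at max.
func backoffDelays(base, max time.Duration, retries int) []time.Duration {
	delays := make([]time.Duration, 0, retries)
	d := base
	for i := 0; i < retries; i++ {
		delays = append(delays, d)
		d *= 2
		if d > max {
			d = max
		}
	}
	return delays
}

// retryWithBackoff calls op until it succeeds or the delays are used up,
// calling sleep between attempts. It returns the number of attempts made
// and the last error.
func retryWithBackoff(op func() error, delays []time.Duration, sleep func(time.Duration)) (int, error) {
	for attempt := 0; ; attempt++ {
		err := op()
		if err == nil {
			return attempt + 1, nil
		}
		if attempt == len(delays) {
			return attempt + 1, err
		}
		sleep(delays[attempt])
	}
}

// recorder collects the requested delays instead of sleeping, which keeps
// the example fast and easy to inspect.
type recorder struct{ slept []time.Duration }

func (r *recorder) sleep(d time.Duration) { r.slept = append(r.slept, d) }

func main() {
	calls := 0
	op := func() error {
		calls++
		if calls < 3 {
			return errDescriptorChanged // fail the first two attempts
		}
		return nil
	}

	r := &recorder{}
	attempts, err := retryWithBackoff(op, backoffDelays(10*time.Millisecond, 40*time.Millisecond, 4), r.sleep)
	fmt.Println("attempts:", attempts, "err:", err, "waited:", r.slept)
}
```

Passing `time.Sleep` as the `sleep` argument gives real waiting. Backoff only limits how fast the retries and their log lines arrive; if the operation can never succeed, it still fails on every attempt.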
We have marked this issue as stale because it has been inactive for |
When doing various operations on a local cluster, I've gotten endless rapid-fire spam from the split queue that looks like this:
It doesn't cease until I restart the cluster. @tbg could you triage this? Let me know if you need help reproing - I haven't tried to make a small repro case yet because I thought the core team might understand what's going on quickly.
Jira issue: CRDB-10827