kvserver: investigate exceeding uncommitted entry size #100096
cc @cockroachdb/replication
Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Surprise plot twist - I added logging that would fire whenever we'd drop a proposal this way:
This smells like a buglet. I think we propose a big batch of entries (which goes in, but now there's negative uncommitted size left), and then - for some reason, likely accidentally - we hit the "unquiesce" path here: cockroach/pkg/kv/kvserver/replica_raft.go Lines 1955 to 1956 in c2460f1
and try to propose an empty entry, which gets rejected (because there is not a scrap of budget left). I wouldn't be surprised if we hit this path on each raft iteration. Possibly
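For context, here is a minimal sketch of the admission logic at play - a paraphrase of the uncommitted-size check in etcd-io/raft, not the exact code; the function name and plain-integer signature are made up for illustration:

```go
package raftbudget

// admitProposal mirrors (approximately) raft's uncommitted-size check: once
// the budget is exhausted, any proposal with a nonzero payload is dropped,
// which surfaces upstream as ErrProposalDropped. Zero-payload entries (such
// as the empty entry a leader proposes on election) are always admitted.
func admitProposal(uncommittedSize, payloadSize, maxUncommittedSize uint64) bool {
	if uncommittedSize > 0 && payloadSize > 0 && uncommittedSize+payloadSize > maxUncommittedSize {
		return false
	}
	return true
}
```

Note that when nothing is tracked yet, even an oversized proposal is admitted, which is how a single huge batch "goes in" and leaves the budget exhausted as described above.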
Looking a bit more, we can only hit this unquiesce path if quiesced¹, so it must be a little more subtle than that. How can the replica be quiesced if it refuses an empty entry (meaning it has uncommitted log entries)? It should fail this check: cockroach/pkg/kv/kvserver/replica_raft_quiesce.go Lines 289 to 294 in 76afb00
or this check: cockroach/pkg/kv/kvserver/replica_raft_quiesce.go Lines 318 to 323 in 76afb00
or this one: cockroach/pkg/kv/kvserver/replica_raft_quiesce.go Lines 385 to 391 in 76afb00
You get the idea. Given how easy this is to reproduce, probably best to slap it with a bit of printf and have another go.
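Roughly, the condition those checks express is something like the following simplified sketch (hypothetical helper and parameters, not the actual CRDB code):

```go
package quiesce

// canQuiesce is a simplified, hypothetical stand-in for the checks in
// replica_raft_quiesce.go: a replica must not quiesce while raft still has
// work in flight.
func canQuiesce(lastIndex, commitIndex, appliedIndex uint64, numPendingProposals int) bool {
	if numPendingProposals > 0 {
		return false // outstanding proposals haven't committed/applied yet
	}
	if appliedIndex != commitIndex {
		return false // committed entries not yet applied
	}
	if lastIndex != commitIndex {
		return false // appended but uncommitted entries remain in the log
	}
	return true
}
```

If a replica really had uncommitted log entries, the last condition should already keep it from quiescing, which is what makes the observed behavior surprising.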
The bytes printed after "wrote" were the append bytes only, which was confusing. Consolidate. Also, no need to print whether it's sync or non-blocking-sync again because we already printed that in the timing section. Found in cockroachdb#100096. Epic: none Release note: None
My current thinking focuses on this kind of logging, which we see frequently:
First of all, how can we have written 49 B when apply is 61 MiB (in a single entry, no less)? The 49 B is the append batch¹, which is ~empty here; this is a logging buglet that I'll send a PR for.

Second, I think that because applying this giant 61 MiB batch (this is not an outlier) takes ~1s, there is plenty of time for the tick loop to come around and decide to quiesce the range - this does not acquire

What follows is my best attempt at a timeline (but there never is a point in time at which the group could quiesce while having uncommitted bytes):
Perhaps there are additional interleavings between tick and raft handling that I'm not seeing. It is not great that interleavings are even possible here. It would be a sight easier if quiescence happened on the raft goroutine. Time for another run with more logging...
Hmm, the additional logging paints the picture that there might be something wrong with the uncommitted size tracking. First, we see the replica quiesce at index 33 (stack trace shows that this is, unsurprisingly, via a tick)
then, ~400ms later, it unquiesces (from the raft handling loop) and logs that it dropped the empty entry proposed as a result:
and then we append 30 MB worth of entries, which for some reason is not dropped:
and which then immediately gets applied:
The mystery is that we're refusing to append an empty entry but then, without applying anything, accept 30 MB of entries. When we apply entries, we step the

Needs more digging, probably via a raft fork that dumps extra info while we have uncommitted entry size tracked.
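For illustration, the kind of instrumentation a throwaway raft fork might add around the size accounting (the type, field, and method names here are invented, not etcd-io/raft's):

```go
package rafttrace

import "log"

// uncommittedTracker is a hypothetical stand-in for the uncommitted-size
// accounting one might instrument in a throwaway fork.
type uncommittedTracker struct {
	size uint64
}

func (t *uncommittedTracker) add(bytes uint64) {
	t.size += bytes
	log.Printf("uncommitted size +%d -> %d", bytes, t.size)
}

func (t *uncommittedTracker) reduce(bytes uint64) {
	if bytes > t.size {
		// This is the kind of anomaly the extra logging is meant to surface:
		// reducing by more than we think is outstanding.
		log.Printf("uncommitted size underflow: -%d with only %d tracked", bytes, t.size)
		bytes = t.size
	}
	t.size -= bytes
	log.Printf("uncommitted size -%d -> %d", bytes, t.size)
}
```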
Can this be due to etcd-io/raft#11? The uncommitted log size can be reset upon election. UPD: enabled logging and did not see "campaigning" messages around these dropped proposals and appends.
This doesn't look like a new 30 MB proposal being accepted. This 30 MB is probably what the uncommitted buffer contained, and the reason why it rejected the unquiescence proposal. What we see in this log message is that the 30 MB finally gets flushed to storage (via raft Ready handling).
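That matches the shape of a raft Ready loop - here is a simplified sketch against the upstream etcd-io/raft API (CRDB's real handler in replica_raft.go is far more involved). The relevant point: rd.Entries were accepted into raft's unstable log earlier, and charged against the uncommitted-size budget at that point; they are only persisted here.

```go
package readyloop

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// handleReady drains a RawNode's Ready structs: persist newly accepted
// entries, ship messages to peers, and apply whatever has reached quorum.
func handleReady(rn *raft.RawNode, ms *raft.MemoryStorage,
	send func([]raftpb.Message), apply func([]raftpb.Entry)) error {
	for rn.HasReady() {
		rd := rn.Ready()
		if err := ms.Append(rd.Entries); err != nil { // flush previously accepted entries
			return err
		}
		send(rd.Messages)          // distribute to followers / respond
		apply(rd.CommittedEntries) // apply entries that have reached quorum
		rn.Advance(rd)
	}
	return nil
}
```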
Pass the proposals corresponding to the `ents` slice into `proposeBatch`. Log into each proposal's context and also log a message whenever we're dropping proposals on the floor. See cockroachdb#100096. Epic: none Release note: None
100270: kvserver: touch up raft ready handling log r=erikgrinaker a=tbg The bytes printed after "wrote" were the append bytes only, this was confusing. Consolidate. Also, no need to print whether it's sync or non-blocking-sync again because we already printed that in the timing section. Found in #100096. Epic: none Release note: None Co-authored-by: Tobias Grieger <[email protected]>
I think I understand this better now. This test writes lots of large blobs, and leaves enough time between blobs to allow the range to quiesce. So we start with a quiesced replica, which now gets a ~50 MiB proposal. First, this goes into the proposal buffer, which queues an update check: cockroach/pkg/kv/kvserver/replica_proposal_buf.go Lines 335 to 346 in 1cd507a
This triggers raft processing. Peeking into cockroach/pkg/kv/kvserver/replica_raft.go Lines 1968 to 1992 in 1cd507a
So we first flush the proposal buffer, then unquiesce. But flushing a 50 MiB proposal from the buffer into the unstable raft log will consume the unstable log budget in raft. So by the time we unquiesce, there is no more space and the unquiesce's append gets rejected. This isn't a problem - after all, if there's unstable log, raft already needs to distribute something to the followers and is thus going to wake them up. I don't think there's anything new here in this cycle other than that we made the

I tried out this little patch, which proposes a "true" empty command (instead of a nonempty command containing a "nil" CRDB payload) - true empty commands are exempt from uncommitted log size tracking - like this:

```diff
@@ -89,11 +88,10 @@ func (r *Replica) maybeUnquiesceAndWakeLeaderLocked() bool {
 	r.store.unquiescedReplicas.Unlock()
 	r.maybeCampaignOnWakeLocked(ctx)
 	// Propose an empty command which will wake the leader.
-	data := raftlog.EncodeRaftCommand(raftlog.EntryEncodingStandardWithoutAC, makeIDKey(), nil)
-	_ = r.mu.internalRaftGroup.Propose(data)
+	_ = r.mu.internalRaftGroup.Propose(nil /* data */)
 	return true
 }
```

and voila, I'm (unsurprisingly) no longer seeing the messages. I'll send a PR for that patch. @aliher1911 I believe you mentioned also having seen this message "randomly". Do you have something I can look at? Maybe there are multiple things going on.
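Plugging illustrative numbers into the admission check shows why the patch works: only payload bytes are charged against the budget, so a nil proposal is always admitted. All values below are hypothetical, and `admitted` restates the check approximately so the example is self-contained.

```go
package main

import "fmt"

// admitted approximates raft's uncommitted-size check; see the sketch
// earlier in the thread.
func admitted(uncommitted, payload, max uint64) bool {
	return !(uncommitted > 0 && payload > 0 && uncommitted+payload > max)
}

func main() {
	const uncommitted = 50 << 20 // 50 MiB already sitting in the unstable log
	const max = 16 << 20         // illustrative budget; the real setting may differ

	fmt.Println(admitted(uncommitted, 0, max))  // true: the nil proposal has no payload
	fmt.Println(admitted(uncommitted, 32, max)) // false: even a small encoded "noop" is rejected
}
```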
We used to unquiesce via a "noop" (but non-nil) log entry, but it turns out that raft can be out of budget for nontrivial log entries when unquiescing. So, use a nil one, which is identical to what raft proposes when leadership changes. Closes cockroachdb#100096. Epic: none Release note: None
Chatted with Erik, removing GA-blocker since there's nothing new/bad here. Will check in with Oleg when he's back to see where else he saw this message.
100083: kvserver: record metrics for ErrProposalDropped r=pavelkalinnikov a=tbg

Touches #100096. Epic: none Release note: None

105093: sql: use datum alloc for crdb_internal stmt stats rows r=dt a=dt

Happened to observe a cluster running a customer test suite which included a query that inspected stmt stats often, causing the CRDB node to spend a considerable amount of CPU time in production of the stmt stats vtable, in particular allocating (and then GC'ing) individual datums, especially given how wide this table has become with the addition of storage stats. This change uses a datum allocator to produce those rows to reduce the number of separate allocations from the runtime. Release note: none. Epic: none.

105197: statusccl: skip flaky TenantStatusAPI tests r=zachlite a=zachlite

Informs #92382, #99770, #99559 Epic: none Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Zach Lite <[email protected]>
Describe the problem
See #99464. We were seeing lots of these messages:
That's not good! These proposals will have to be picked up by `refreshProposalsLocked`, which takes ~200ms. If something changed recently that causes us to hit this more frequently (as seems to be the case, anecdotally), we need to add at least rudimentary flow control here.
It is somewhat expected to hit this in general. To generate proposals and to put them into raft requires only disk reads, but getting them committed requires a durable write, including likely round-trips to followers (to obtain quorum). The former can be done at much higher throughput than the latter.
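As a back-of-the-envelope illustration of that asymmetry (the numbers below are made up; only the relationship matters): if proposals can be generated and appended faster than quorum commits drain them, the uncommitted log grows until the budget trips.

```go
package main

import "fmt"

func main() {
	// Hypothetical rates, for illustration only.
	const proposeMBps = 200.0 // generating proposals needs only reads + an in-memory append
	const commitMBps = 50.0   // committing needs durable writes and follower round-trips
	const budgetMB = 16.0     // uncommitted-size budget (illustrative)

	growthMBps := proposeMBps - commitMBps
	fmt.Printf("budget exhausted after ~%.2fs\n", budgetMB/growthMBps)
}
```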
To Reproduce
Running the test in #99464 reproduces this in 100% of all runs I did, averaging ~70 related log messages per run.
I also seem to hit this reliably with the unit test BenchmarkReplicaProposal (edit: not true, we're dropping proposals, but not because of uncommitted entry size - because of no leader).
Expected behavior
Some kind of flow control, similar to the quota pool but local. Ideally we would delay grabbing a latch for a write if we can tell that raft is not currently accepting more proposals. Maybe we want to avoid doing most of this buffering inside of RawNode anyway, and instead improve our proposal buffer to provide this backpressure signal. Unsure, needs more thought.
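A minimal sketch of what such a local gate could look like - the type and names are hypothetical (not an existing CRDB API), and byte-granularity accounting is omitted for brevity:

```go
package flowcontrol

import "context"

// proposalGate sketches the "quota pool, but local" idea: writers reserve a
// slot before proposing, and the slot is released once the corresponding
// entries have committed and applied. Acquiring before latching is what
// provides the backpressure signal.
type proposalGate struct {
	sem chan struct{} // one token per fixed-size slice of the uncommitted budget
}

func newProposalGate(slots int) *proposalGate {
	return &proposalGate{sem: make(chan struct{}, slots)}
}

// acquire blocks until budget is available or the context is canceled.
func (g *proposalGate) acquire(ctx context.Context) error {
	select {
	case g.sem <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// release returns the reservation once the proposal's entries have applied.
func (g *proposalGate) release() { <-g.sem }
```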
Additional data / screenshots
Adding better visibility into this issue in #100083.
Jira issue: CRDB-26247
Epic CRDB-39898