kvserver: v22.1.4: nil pointer in poisonInflightLatches #86547
cc @cockroachdb/replication
Can you have a look @tbg?
This panics here: cockroach/pkg/kv/kvserver/replica_raft.go Line 1384 in d10bde7
At this point we know that cockroach/pkg/kv/kvserver/replica_send.go Lines 1159 to 1163 in 0a2396a
I poked around a bit and I'm not aware that we should ever have a proposal with a nil @nvanbenschoten do you expect to see a nil Guard here?
@nvanbenschoten via DM
What's implicit here is that Nathan does not expect endCmds to be created and put into the map with a nil Guard; it must've been zeroed out later, probably as part of
Looked at this again a bit. Still haven't sussed it all out, but I think it's reasonable to assume that some proposals in the map can be finished (so their endCmds are cleared) without strict sync with the proposals map. It's cleared only from
A heads-up that this failure surfaced in a nightly run of backupccl's
We've seen a similar-looking failure again on TestDataDriven. We have artifacts this time if they are helpful: https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_StressBazel/8179790:id/tmp/_tmp/95e138e66d69292427dfb9528cf06d04/logTestDataDriven4271694788/backupccltest.log
I think it will be easy to "fix" the bug - just add a nil check - but I think there is something to be understood yet. I am still not seeing how we can end up with a proposal in I went through all of the callers to Most of them "obviously" do (for some of them it's clear that the proposal is never added to the map in the first place).
if err := t.applyOneBatch(ctx, iter); err != nil {
	// If the batch threw an error, reject all remaining commands in the
	// iterator to avoid leaking resources or leaving a proposer hanging.
	//
	// NOTE: forEachCmdIter closes iter.
	if rejectErr := forEachCmdIter(ctx, iter, func(cmd Command, ctx context.Context) error {
		return cmd.AckErrAndFinish(ctx, err)
	}); rejectErr != nil {
		return rejectErr
	}
	return err
}
The one error we expect to see from applyOneBatch (thus triggering this code path) is ErrRemoved. It's unlikely that we saw this one in this instance, since the system was basically defunct for over a minute and so replicaGC is not too likely, plus we don't see log messages that should have occurred prior to the crash. But I'll make a note that something is wrong here.
I also considered whether we might be signaling proposals that are in the map in the error path here:
cockroach/pkg/kv/kvserver/replica_application_result.go
Lines 111 to 120 in 54e3708
if pErr != nil {
	log.Warningf(ctx, "failed to repropose with new lease index: %s", pErr)
	cmd.response.Err = pErr
} else {
	// Unbind the entry's local proposal because we just succeeded
	// in reproposing it and we don't want to acknowledge the client
	// yet.
	cmd.proposal = nil
	return
}
but I had recently convinced myself [2] that a proposal that hits this path was necessarily removed from the map already, in retrieveLocalProposals [1].
refreshProposalsLocked
This one is more interesting. Even though that method holds the appropriate locks throughout and does remove commands from the map prior to finishing them (in the few cases in which it does do that), what it does do in the common case is hand proposals that are in the map back to the proposal buffer (without finishing them):
cockroach/pkg/kv/kvserver/replica_raft.go
Lines 1345 to 1347 in ac3e633
for _, p := range reproposals {
	log.Eventf(p.ctx, "re-submitting command %x to Raft: %s", p.idKey, reason)
	if err := r.mu.proposalBuf.ReinsertLocked(ctx, p); err != nil {
This is possibly problematic - we now have a (presumably still unfinished) proposal in both the map and the proposal buffer. Could the proposal now be applied, finished, and removed from the map, but then be reinserted [3] due to being present in the proposal buffer? I'm actually not convinced this can happen, because the proposal buffer is flushed right at the beginning of raft processing (which includes entry application), meaning that by the time we might be applying the proposal, the proposal buffer has already been emptied out [4] and won't re-insert into the map.
cockroach/pkg/kv/kvserver/replica_raft.go
Lines 718 to 719 in ac3e633
err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
	numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(ctx, raftGroup)
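The ordering argument above can be illustrated with a toy model. This is only a sketch under stated assumptions: `proposal`, `replica`, `flush`, `apply`, and `run` are hypothetical stand-ins, not the kvserver types, and the model ignores locking and Raft entirely. It shows that a proposal sitting in both the map and the buffer only ends up as a finished entry in the map if the buffer is flushed after application.

```go
package main

import "fmt"

// proposal is a hypothetical stand-in for a kvserver proposal.
type proposal struct {
	id       string
	finished bool
}

// replica is a toy model holding only the two structures discussed above.
type replica struct {
	proposals map[string]*proposal // in-flight proposals map
	buf       []*proposal          // proposal buffer awaiting a flush
}

// flush drains the buffer, (re)inserting each buffered proposal into the map.
func (r *replica) flush() {
	for _, p := range r.buf {
		r.proposals[p.id] = p
	}
	r.buf = nil
}

// apply finishes the proposal and removes it from the map.
func (r *replica) apply(id string) {
	if p, ok := r.proposals[id]; ok {
		p.finished = true
		delete(r.proposals, id)
	}
}

// finishedInMap counts proposals that are finished yet still in the map.
func (r *replica) finishedInMap() int {
	n := 0
	for _, p := range r.proposals {
		if p.finished {
			n++
		}
	}
	return n
}

// run models one proposal that is present in both the map and the buffer
// and returns the number of finished proposals left in the map.
func run(flushFirst bool) int {
	p := &proposal{id: "a"}
	r := &replica{
		proposals: map[string]*proposal{"a": p},
		buf:       []*proposal{p},
	}
	if flushFirst {
		r.flush()
		r.apply("a")
	} else {
		r.apply("a")
		r.flush()
	}
	return r.finishedInMap()
}

func main() {
	fmt.Println("flush before apply:", run(true))
	fmt.Println("apply before flush:", run(false))
}
```

In this model the flush-first order leaves the map clean, while the reverse order reinserts a finished proposal, which is the scenario the comment argues the real flush placement rules out.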
Next steps
The lifecycle of a proposal is pretty ad-hoc, but there is clearly an invariant that I am assuming should hold: there is never a finished proposal in the proposals map. We are not checking this invariant, but we should. We add to the map in a single place only [3] and should verify the invariant there. If we then still see the crash but not the assertion, we know the proposal was finished while remaining in the map, which still violates the invariant.
As outlined above, we know of at least one invariant violation (the snippet in the AckOutcomeAndFinish section) but maybe I missed another one. Actually, revisiting the log of the failing test above, maybe it is replicaGC-related after all, since this is a multi-region test and I see the replicaGCQueue repeatedly run into the circuit breaker. But I'm still unsure why there isn't any logging [5] then.
Footnotes
1. https://github.com/cockroachdb/cockroach/blob/a94858bff9a53450e0c76ff8ed8757fd3d18a264/pkg/kv/kvserver/replica_application_decoder.go#L100-L114
2. better docs on this stuff incoming in https://github.com/cockroachdb/cockroach/pull/94633
3. https://github.com/cockroachdb/cockroach/blob/a94858bff9a53450e0c76ff8ed8757fd3d18a264/pkg/kv/kvserver/replica_proposal_buf.go#L1183-L1191
4. https://github.com/cockroachdb/cockroach/blob/0703807d018fb1c0f5352e8d83315b463c90d1e9/pkg/kv/kvserver/replica_destroy.go#L191-L216
5. https://github.com/cockroachdb/cockroach/blob/35e0f4a6a13e6777e8d848100f4fa6311800c43f/pkg/kv/kvserver/store_remove_replica.go#L150
The conjecture in cockroachdb#86547 is that a finished proposal somehow makes its way into the proposal map, most likely by never being removed prior to being finished. This commit adds an assertion that we're never outright *inserting* a finished proposal, and better documents the places in which we're running a risk of violating the invariant. It also clarifies the handling of proposals in an apply batch when a replication change that removes the replica is encountered. I suspected that this could lead to a case in which proposals would be finished despite remaining in the proposals map. Upon inspection this turned out to be incorrect - the map (at least by code inspection) is empty at that point, so the invariant holds trivially. Unfortunately, that leaves me without an explanation for cockroachdb#86547, but the newly added invariants may prove helpful. Touches cockroachdb#86547.
@tbg, this unrelated test has been frequently failing due to this error. If you think the review process for #94825 will take some time, I may skip the test to avoid CI flakiness.
I still don't understand how we can get a finished endCmds here, but while I scratch my head we don't need to be collecting CI failures. Touches cockroachdb#86547. Epic: none Release note: None
95744: kvserver: don't NPE in poisonInflightLatches r=erikgrinaker a=tbg I still don't understand how we can get a finished endCmds here, but while I scratch my head we don't need to be collecting CI failures. Touches #86547. Closes #94209. Epic: none Release note: None Co-authored-by: Tobias Grieger <[email protected]>
94825: kvserver: prevent finished proposal from being present in proposals map r=nvanbenschoten a=tbg The conjecture in #86547 is that a finished proposal somehow makes its way into the proposal map, most likely by never being removed prior to being finished. This commit adds an assertion that we're never outright *inserting* a finished proposal, and better documents the places in which we're running a risk of violating the invariant. It also clarifies the handling of proposals in an apply batch when a replication change that removes the replica is encountered. I suspected that this could lead to a case in which proposals would be finished despite remaining in the proposals map. Upon inspection this turned out to be incorrect - the map (at least by code inspection) is empty at that point, so the invariant holds trivially. Unfortunately, that leaves me without an explanation for #86547, but the newly added invariants may prove helpful. Touches #86547. Epic: None Release note: None Co-authored-by: Tobias Grieger <[email protected]>
Unassigning since I'm no longer working on this and we've made the code path resilient to avoid the crash. It's likely that Epic CRDB-25287 will solve the underlying problem, which likely had to do with a finished proposal being re-inserted into the proposals map, which is something we know is possible at the moment. (This is not thought to cause double-application, but it can cause crashes in code that assumes it's not possible.)
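For readers who have not looked at the fix, the kind of resilience described here can be sketched as a nil check before poisoning each in-flight proposal's latches. This is not the actual patch: `guard`, `endCmds`, `proposal`, and `poisonInflight` are hypothetical stand-ins, and poisoning is reduced to setting a flag.

```go
package main

import "fmt"

// guard is a hypothetical stand-in for a latch guard that can be poisoned.
type guard struct{ poisoned bool }

// endCmds models the per-proposal cleanup state; g is nil once finished.
type endCmds struct{ g *guard }

// proposal is a hypothetical stand-in for a kvserver proposal.
type proposal struct{ ec endCmds }

// poisonInflight poisons the guard of every proposal in the map, skipping
// proposals whose guard has already been cleared instead of dereferencing nil.
func poisonInflight(proposals map[string]*proposal) (poisoned, skipped int) {
	for _, p := range proposals {
		if p.ec.g == nil {
			// Finished proposal still in the map: nothing to poison.
			skipped++
			continue
		}
		p.ec.g.poisoned = true
		poisoned++
	}
	return poisoned, skipped
}

func main() {
	proposals := map[string]*proposal{
		"live": {ec: endCmds{g: &guard{}}},
		"done": {}, // guard already cleared
	}
	poisoned, skipped := poisonInflight(proposals)
	fmt.Println("poisoned:", poisoned, "skipped:", skipped)
}
```

The check avoids the crash but does not explain how a finished proposal got into the map, which is why the thread treats it as a work-around rather than a fix.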
The reproposal path has been simplified quite a bit as a result of CRDB-25287 and the problem is more or less understood: The work-around is still in place above and below: cockroach/pkg/kv/kvserver/replica_raft.go Lines 1496 to 1512 in fe1c4a0
but armed with this comment we should feel comfortable going through another round of assertions (best using
This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.
Sentry link: https://sentry.io/organizations/cockroach-labs/issues/3521479942/?referrer=webhooks_plugin
Panic message:
Stacktrace (expand for inline code snippets):
GOROOT/src/runtime/panic.go#L1037-L1039 in runtime.gopanic
GOROOT/src/runtime/panic.go#L220-L222 in runtime.panicmem
GOROOT/src/runtime/signal_unix.go#L734-L736 in runtime.sigpanic
https://github.com/cockroachdb/cockroach/blob/3c6c8933f578a7fd140e24a603d6ec64c6b7a834/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L1383-L1385 in pkg/kv/kvserver.(*Replica).poisonInflightLatches
https://github.com/cockroachdb/cockroach/blob/3c6c8933f578a7fd140e24a603d6ec64c6b7a834/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go#L217-L219 in pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1
cockroach/pkg/util/stop/stopper.go
Lines 493 to 495 in 3c6c893
GOROOT/src/runtime/asm_amd64.s#L1580-L1582 in runtime.goexit
v22.1.4
Jira issue: CRDB-18809
Epic CRDB-39898