Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: finished proposal inserted into map #97605

Closed
tbg opened this issue Feb 24, 2023 · 4 comments
Closed

kvserver: finished proposal inserted into map #97605

tbg opened this issue Feb 24, 2023 · 4 comments
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@tbg
Copy link
Member

tbg commented Feb 24, 2023

Describe the problem

Seen in #96149 and also in TestSQLStatsRegions1

The assertion was introduced relatively recently2 so this is likely not a regression.

Jira issue: CRDB-24780

Epic CRDB-25287

Footnotes

  1. https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelExtendedCi/8811851?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

  2. 15b1c6a

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv-replication labels Feb 24, 2023
@tbg tbg self-assigned this Feb 24, 2023
@blathers-crl
Copy link

blathers-crl bot commented Feb 24, 2023

cc @cockroachdb/replication

tbg added a commit to tbg/cockroach that referenced this issue Feb 24, 2023
craig bot pushed a commit that referenced this issue Feb 24, 2023
96989: roachtest: improvements to `mixedversion` package r=herkolategan a=renatolabs

This commit introduces a few improvements to the `mixedversion` package, the recently introduced framework for mixed-version (and mixed-binary) roachtests.

Specifically, the following improvements are made:

* Removal of `DBNode()` function: having the planner pick a database that the individual steps will connect to is insufficient in many cases and could be misleading. The idea was that the user would be able to see, from the test plan itself, what node a certain step would be interacting with. However, the reality is that steps often need to run statements on multiple different nodes, or perhaps they need to pick one node specifically (e.g., the statement needs to run on a node in the old version). For that reason, the `DBNode()` function was dropped. Instead, steps have access to a random number generator that they can use to pick an arbitrary node themselves. The random number generators are unique to each user function, meaning each test run will see the same numbers being generated even if other steps are scheduled concurrently. The numbers observed by a user function will also be the same if the seed passed to `mixedversion.Test` is the same.

* Definition of a "test context" that is available to mixed-version tests. For now, the test context includes things like which version we are upgrading (or downgrading) to and from and which nodes are running which version. This allows tests to take actions based on, for example, the number of nodes upgraded. It also allows them to run certain operations on nodes that are known to be in a specific version.

* Introduction of a `helper` struct that is passed to user-functions. For now, the helper includes functions to connect to a specific node and get the current test context. The struct will help us provide common functionality to tests so that they don't have to duplicate code.

* Log cached binary and cluster versions before executing a step. This makes it easier to understand the state of the cluster when looking at the logs of one specific step.

* Internal improvement to the test runner: instead of assuming the first step of a mixed-version test plan will start the the cockroach nodes, we now check that that is the case, providing a clear error message if/when that assumption doesn't hold anymore (instead of a cryptic connection failure error).

Epic: CRDB-19321

Release note: None

97251: sql: add user_id column to system.database_role_settings table r=rafiss a=andyyang890

This patch adds a new `role_id` column to the `system.database_role_settings`
table, which corresponds to the existing `role_name` column. Migrations are
also added to alter and backfill the table in older clusters.

Part of #87079

Release note: None

97566: kvserver: assert uniqueness in registerProposalLocked r=pavelkalinnikov a=tbg

We routinely overwrite entries in the `r.mu.proposals` map. That is
"fine" (better if we didn't, but currently it is by design - it
happens in refreshProposalsLocked and during
tryReproposeWithNewLeaseIndex) but our overwrites should be no-ops, i.e.
reference the exact same `*ProposalData`.

This is now asserted.

One way this would trip is a CmdID collision.

Epic: none
Release note: None


97606: kvserver: disable assertion 'finished proposal inserted' r=pavelkalinnikov a=tbg

Touches #97605.

Epic: none
Release note: None


Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@tbg
Copy link
Member Author

tbg commented Feb 27, 2023

The proposal was disabled in #97606.

Some musings on this.

The ownership model of proposals (*ProposalData) is quite complex. Proposals live in the r.mu.proposals map, but "at any time" they could be up for application, in which case there'd be access to them below raft (and they'd get removed from the map). "Safe" access to a *ProposalData (once it's in the proposals map) requires holding raftMu; the map itself is under repl.mu so you end up holding both. At the same time, a proposal can refer to multiple inflight log entries (tryReproposeWithNewLeaseIndex). If we called tryReproposeWithNewLeaseIndex during application of an entry, we've temporarily taken it out of the map and it's now sitting in the proposal buffer anew.

On to the proposal buffer. We hold r.mu (and raftMu) when inserting into the buffer, then we release both mutexes. The proposal is sitting in a slice1 after that point. (It may also still be in the proposal map). It won't be touched until the buffer is flushed, though, at which point both mutexes are held again.

I'm not sure what went wrong in #96149, but there is a lot going on here.

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/4eb5451dcfbac79906eb48340f76597ca4d8ead2/pkg/kv/kvserver/replica_proposal_buf.go#L267

@tbg
Copy link
Member Author

tbg commented Feb 27, 2023

(*Replica).tick1 does not hold raftMu. So you can have the following interleaving. It's essentially caused by the fact that removing from the proposals map (to apply the command) doesn't also remove a reference that might be sitting in the proposal buffer.

  • proposal inserted into map & proposal buffer
  • proposal is flushed, replicated and just about to come up for application (but not yet!)
  • proposal is resubmitted to the proposal buffer via tick
  • application starts, command is accepted (proposal is finished)
  • but there is a copy sitting in the proposal buffer still!
  • when the proposal buffer next flushes (i.e. an instant later) the assertion fires.

The reason this is rare is that the tick fires only after a raft election timeout (i.e. seconds). You need that call to fall a hair before the proposal applies (and applies successfully). In the one reproduction I've looked at, we see

proposedAtTicks:462 createdAtTicks:431

which means the proposal spent 31 ticks = 6.2s inflight before hitting this assertion.

I looked at other ways of hitting this but this seems like the most reasonable one. In retrieveLocalProposals (which is the thing during cmd application that removes commands from the map before applying them) we need the rlock. So when the tick and this one race, and the tick comes first, we'll put the proposal that is just about to be finished into the propbuf and then once we remove the mutex it's immediately getting applied. Due to the fact that they both contend on the mu it's probably not so unlikely to get that interleaving.

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/a363143b81f5045deab1ab8073da6d6949984936/pkg/kv/kvserver/replica_raft.go#L1313-L1434

@tbg
Copy link
Member Author

tbg commented Feb 27, 2023

There are some more dragons lurking here. When flushed from the propbuf, the command may also be rejected:

if b.maybeRejectUnsafeProposalLocked(ctx, raftGroup, p) {

we'd then be doubly-finishing a proposal. This won't crash (it becomes a noop the second time around) but it smells that this is possible.

There are some replica circuit breaker related crashes that I had looked into (on a plane, so can't link it right now) that seem related. They can probably be explained by variants of the above - the crash was caused by trying to poison a request's latches, but finding that request already finished. I had not understood why a finished request could be in the proposals map - that's precisely why I added the assertion1 - and now it makes more sense. In fact, the circuit breaker crash should now be replaced by the assertion crash discussed in this issue.

Footnotes

  1. 15b1c6a

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

No branches or pull requests

1 participant