kvserver: finished proposal inserted into map #97605
cc @cockroachdb/replication
Touches cockroachdb#97605. Epic: none Release note: None
96989: roachtest: improvements to `mixedversion` package r=herkolategan a=renatolabs

This commit introduces a few improvements to the `mixedversion` package, the recently introduced framework for mixed-version (and mixed-binary) roachtests. Specifically, the following improvements are made:

* Removal of `DBNode()` function: having the planner pick a database that the individual steps will connect to is insufficient in many cases and could be misleading. The idea was that the user would be able to see, from the test plan itself, what node a certain step would be interacting with. However, the reality is that steps often need to run statements on multiple different nodes, or perhaps they need to pick one node specifically (e.g., the statement needs to run on a node in the old version). For that reason, the `DBNode()` function was dropped. Instead, steps have access to a random number generator that they can use to pick an arbitrary node themselves. The random number generators are unique to each user function, meaning each test run will see the same numbers being generated even if other steps are scheduled concurrently. The numbers observed by a user function will also be the same if the seed passed to `mixedversion.Test` is the same.
* Definition of a "test context" that is available to mixed-version tests. For now, the test context includes things like which version we are upgrading (or downgrading) to and from and which nodes are running which version. This allows tests to take actions based on, for example, the number of nodes upgraded. It also allows them to run certain operations on nodes that are known to be in a specific version.
* Introduction of a `helper` struct that is passed to user-functions. For now, the helper includes functions to connect to a specific node and get the current test context. The struct will help us provide common functionality to tests so that they don't have to duplicate code.
* Log cached binary and cluster versions before executing a step. This makes it easier to understand the state of the cluster when looking at the logs of one specific step.
* Internal improvement to the test runner: instead of assuming the first step of a mixed-version test plan will start the cockroach nodes, we now check that that is the case, providing a clear error message if/when that assumption doesn't hold anymore (instead of a cryptic connection failure error).

Epic: CRDB-19321
Release note: None

97251: sql: add user_id column to system.database_role_settings table r=rafiss a=andyyang890

This patch adds a new `role_id` column to the `system.database_role_settings` table, which corresponds to the existing `role_name` column. Migrations are also added to alter and backfill the table in older clusters.

Part of #87079
Release note: None

97566: kvserver: assert uniqueness in registerProposalLocked r=pavelkalinnikov a=tbg

We routinely overwrite entries in the `r.mu.proposals` map. That is "fine" (better if we didn't, but currently it is by design - it happens in refreshProposalsLocked and during tryReproposeWithNewLeaseIndex) but our overwrites should be no-ops, i.e. reference the exact same `*ProposalData`. This is now asserted. One way this would trip is a CmdID collision.

Epic: none
Release note: None

97606: kvserver: disable assertion 'finished proposal inserted' r=pavelkalinnikov a=tbg

Touches #97605.

Epic: none
Release note: None

Co-authored-by: Renato Costa
Co-authored-by: Andy Yang
Co-authored-by: Tobias Grieger
The assertion was disabled in #97606. Some musings on this.
The ownership model of proposals (
On to the proposal buffer. We hold
I'm not sure what went wrong in #96149, but there is a lot going on here.
The reason this is rare is that the
which means the proposal spent 31 ticks = 6.2s in flight before hitting this assertion. I looked at other ways of hitting this, but this seems like the most reasonable one. In
There are some more dragons lurking here. When flushed from the propbuf, the command may also be rejected:
we'd then be doubly-finishing a proposal. This won't crash (it becomes a noop the second time around), but it smells that this is even possible. There are some replica circuit breaker related crashes that I had looked into (on a plane, so can't link them right now) that seem related. They can probably be explained by variants of the above: the crash was caused by trying to poison a request's latches, but finding that request already finished. I had not understood why a finished request could be in the proposals map (that's precisely why I added the assertion), and now it makes more sense. In fact, the circuit breaker crash should now be replaced by the assertion crash discussed in this issue.
Describe the problem
Seen in #96149 and also in TestSQLStatsRegions [1].
The assertion was introduced relatively recently [2], so this is likely not a regression.
Jira issue: CRDB-24780
Epic CRDB-25287
Footnotes
[1] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelExtendedCi/8811851?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true
[2] 15b1c6a