-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfcs: introduce RFC for long running migrations #48843
Conversation
(Still have to address a few of Tobi's comments, included inline, and possibly massage it into the RFC template.) |
f399f57
to
1254879
Compare
07d41c4
to
592863f
Compare
I have yet to convert this into the standard template for RFCs, right now it's just the document structure I'd used to understand it all (which, now, I think I mostly do). I'm going to continue prototyping this work while the RFC is under review, and back fill in the details as I run into stuff. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @nvanbenschoten, and @tbg)
docs/RFCS/20200513_long_running_migrations.md, line 102 at r2 (raw file):
not been able to phase out old functionality. Consider the following examples: (i) There's a new FK representation (as of 19.2) and we want to make sure all the
You may want to add why this isn't using sqlmigrations. In a sense this is an unfortunate example because we wouldn't be able to use the new path, too (multi-tenancy). Really what's needed here is code (probably in sqlmigrations) that goes through the descriptors (at the KV level) and rewrites them.
In practice, since multi-tenancy starts only in the future, we'll never have to go through this migration there, and can use long-running migrations for it, though it will feel wrong.
docs/RFCS/20200513_long_running_migrations.md, line 113 at r2 (raw file):
we can guarantee that there are no old table descriptors left lying around. (ii) Jobs in 2.0 did not support resumability. Currently (as of 20.1), we
Ditto, unfortunately
docs/RFCS/20200513_long_running_migrations.md, line 139 at r2 (raw file):
This looks fragile, and is a maintenance overhead that we'd like to remove. (iii) The raft truncated state key was made unreplicated in 19.1. To phase in
the LeaseAppliedState is an earlier incarnation of the same pattern, might want to mention it here as well
docs/RFCS/20200513_long_running_migrations.md, line 209 at r2 (raw file):
We're thus no longer reliant on gossip to propagate version bumps. Each "version migration", as defined by the CRDB developer, is a version tag
I think we should consider s/hook/migration/ throughout. A hook is not a good intuitive description of what this method is.
docs/RFCS/20200513_long_running_migrations.md, line 250 at r2 (raw file):
[#hook-primitives](#hook-primitives) for more details), it's only after the successful completion of a given hook that the orchestrator is free to proceed with the next version. This version-tag + hook primitive can be used to string
It's sort of clear, but if there's no next version, then the upgrade is only marked as complete if the last version's migration runs to completion. That is, migrations are completed at least once for an upgrade to succeed.
docs/RFCS/20200513_long_running_migrations.md, line 261 at r2 (raw file):
orchestrator process kicked off will find the partially completed run and take over from there. To guard against multiple orchestrators running, we'll make use of `pkg/leasemanager` (though, everything should be idempotent anyway).
Yes, I think correctness follows from idempotency and we want (mostly) mutual exclusion for performance only.
docs/RFCS/20200513_long_running_migrations.md, line 337 at r2 (raw file):
decommissioned" state in the liveness record at the end of the decommissioning process. We can also use the Connect RPC to prevent nodes with mismatched binary versions from joining the cluster.
We also more generally use the liveness table (and it should be discussed whether it's the liveness table or an aux table, not sure we've weighed the pro/cons enough) as the authoritative registry of who's in the cluster, meaning it is populated synchronously at startup as opposed to now.
docs/RFCS/20200513_long_running_migrations.md, line 341 at r2 (raw file):
# Node Additions During Migrations When a new node joins the cluster, it will discover the active cluster version
Is it clear that the "active cluster version" is the version of the last initiated (not completed) migration?
docs/RFCS/20200513_long_running_migrations.md, line 342 at r2 (raw file):
When a new node joins the cluster, it will discover the active cluster version (from the already initialized node it connects to as part of the ConnectRPC),
Reminder to both of us to think through the Connect RPC more in the case in which there isn't quorum (for example restarting a fully down cluster). The Connect server must not rely on KV being available in the case of an initialized node joining (it can if the node joining has no state, since it can't possibly be needed for quorums yet).
Example that seems pretty bad:
Four nodes, n1-n3, initially at v1.
n1 gets terminated. n4 joins the cluster. n1 gets decomissioned in absentia.
n2-n4 upgrade to v2.
n2-n4 shut down.
Now:
a) n1 starts and has lots of critical-looking system replicas.
Now what? It can't possibly Connect()
. We can say fine, it tries to Connect() but has nobody to connect to, so it can't start. Which is fine because it's decommissioned.
But if instead:
b) n2 starts and has lots of critical-looking system replicas.
We want n2 to start without talking to anyone, but locally n2 looks exactly like n1.
One way out could be to require that in fact a quorum of nodes is started and then discover each other (say n2 and n3), but KV would not become available until both n2 and n3 start serving, at which point it's too late to gate any migrations.
It's awkward. We may have to compromise on the infallibility of the approach here.
docs/RFCS/20200513_long_running_migrations.md, line 351 at r2 (raw file):
this is our happy case. The new node will start off with all the relevant version gates enabled, and persist the same version persisted by all other nodes. (Still, it's uncomfortable that the orchestrator doesn't know about this
We could go the extra mile and make discovery of an ongoing migration part of the Connect RPC. You would basically get an obligation to register with the orchestrator or wait for the orchestrator lease to time out. But ultimately, why would the orchestrator need to know about the node? Maybe some migration wants to know "all" of the nodes in the step, including any that are just joining? Seems far-fetched, though possible, in which case the only solution is serializing joins and migrations (bad idea).
docs/RFCS/20200513_long_running_migrations.md, line 396 at r2 (raw file):
Before returning to the client, `AdminMigrate` will wait for the command to have been durably applied on all its followers. This is the mechanism that would let us fully migrate out _all_ the data in the system, and not have to
Discuss replicaGC - we need to run it on all nodes to make sure there aren't any old replicas around that never heard of the command (which would still have that old data and yes, they'll never get to really use it, but we want to be thorough here).
Discuss nodes going down (command has timeout - if it fails the migration restarts once nodes back up).
docs/RFCS/20200513_long_running_migrations.md, line 491 at r2 (raw file):
- ~~(Stretch) Clean up version usage to start using negative versions?~~ # Unresolved Questions
Also worth discussing whether changing the upgrade window (i.e. allowing upgrades over multiple versions as opposed to just the last one) would influence the design. Ideally it should "just work" by leaving MinSupportedVersion alone for longer, as it seems it would today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exciting stuff! This was an interesting read.
Reviewed 1 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @nvanbenschoten, and @tbg)
docs/RFCS/20200513_long_running_migrations.md, line 12 at r2 (raw file):
# Motivation CRDB gets older by the day, with each release bringing in changes that augment,
s/older/more mature/ :smiley:
docs/RFCS/20200513_long_running_migrations.md, line 72 at r2 (raw file):
These two frameworks existed side-by-side for a reason. Putting our multi-tenancy hats on:
Your hat wasn't already on?
docs/RFCS/20200513_long_running_migrations.md, line 149 at r2 (raw file):
deployment, coming back to life when running 19.2 or beyond, expecting to find the replicated truncated state key. This is getting to be a bit unwieldy.
It also causes bugs and makes us hesitant to refactor code! He's a recent example I remember: e3346c1, but there have probably been a dozen similar issues over the years.
docs/RFCS/20200513_long_running_migrations.md, line 158 at r2 (raw file):
# Proposal Assume `SET CLUSTER SETTING` is only allowable through the system tenant. We're
cc. @ajwerner, as we were just talking about this. There is a subset of SQL-level cluster settings that would be useful to allow secondary tenants to set. Andrew and I were discussing the idea of categorizing these cluster settings and exposing only the SQL-visible settings to tenants (e.g. defaults for session variables).
docs/RFCS/20200513_long_running_migrations.md, line 199 at r2 (raw file):
cluster health (remember: we need all the nodes in the system to be up and running during migrations). The migration will stall if nodes are down, but not decommissioned (which will be rectifiable by either decommissioning the node,
How are we going to surface this information to the administrator?
docs/RFCS/20200513_long_running_migrations.md, line 225 at r2 (raw file):
_ := VersionWithHook{ Tag: roachpb.Version{Major: 20, Minor: 1, Unstable: 7}, Hook: func (ctx context.Context, h *Helper) error {
nit: When we do end up implementing this, consider calling this IdempotentHook
or IdempotentMigration
so that it's impossible to miss that requirement when adding new migrations. Or even add a boolean field to VersionWithHook
called Idempotent
that is required to be true. Just some way to force every author of one of these to think all the way through this requirement and its implications on how they write their migration hook.
docs/RFCS/20200513_long_running_migrations.md, line 395 at r2 (raw file):
necessary to lock out relevant range processing, as needed. Before returning to the client, `AdminMigrate` will wait for the command to have been durably applied on all its followers. This is the mechanism that
Mention that this part already exists through use of the PerReplicaClient.WaitForApplication
RPC.
docs/RFCS/20200513_long_running_migrations.md, line 428 at r2 (raw file):
once finalized
Be a bit more explicit. "Once the v20.2 cluster version is finalized", right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)
docs/RFCS/20200513_long_running_migrations.md, line 342 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Reminder to both of us to think through the Connect RPC more in the case in which there isn't quorum (for example restarting a fully down cluster). The Connect server must not rely on KV being available in the case of an initialized node joining (it can if the node joining has no state, since it can't possibly be needed for quorums yet).
Example that seems pretty bad:
Four nodes, n1-n3, initially at v1.
n1 gets terminated. n4 joins the cluster. n1 gets decomissioned in absentia.
n2-n4 upgrade to v2.
n2-n4 shut down.Now:
a) n1 starts and has lots of critical-looking system replicas.
Now what? It can't possiblyConnect()
. We can say fine, it tries to Connect() but has nobody to connect to, so it can't start. Which is fine because it's decommissioned.
But if instead:
b) n2 starts and has lots of critical-looking system replicas.
We want n2 to start without talking to anyone, but locally n2 looks exactly like n1.
One way out could be to require that in fact a quorum of nodes is started and then discover each other (say n2 and n3), but KV would not become available until both n2 and n3 start serving, at which point it's too late to gate any migrations.It's awkward. We may have to compromise on the infallibility of the approach here.
Did you have any thoughts here @irfansharif? The problem here means that I don't see a way to use KV as a source of truth for the decommissioning bit. I think we would have to move to some invariant as "if the decommissioning bit is set, it is also written into the local storage of each participant node in the cluster", with the usual race-avoiding structure:
- get list of nodes
- contact them all
- get list again
- loop around if it changed
- write the decommissioning bit into kv (just for bookkeeping, the source of truth is each node's storage and is consulted during
Connect()
)
But now there's a UX problem if more than one node is down and the operator wants to decommission them. They can't do it one by one any more, they would have to specify all dead nodes at once. And of course you may not want to decommission them all in the first place.
Maybe this is fine, but I suspect not. It makes the process really difficult to script around. Though, again, perhaps it is fine? It is just the "permanently mark as dead" step that would need to know all of the nodes. We could make that last transition cluster-internal, i.e. not run by the user. This means you could continue to decommission nodes one by one, but they would be considered "decommissioned-won't-come-back" only when we've internally managed to get them into that state. The worst that could happen here is that an operator gets their upgrade refused because nodes are only marked as decommissioning, not as decommissioned-won't-come-back yet. This would then be easy to fix (the error message could be helpful). If we wanted to go that route, we would need an extra step in the state enum to avoid racing with the recommission
command (once we've told any node to refuse that NodeID, we can't recommission).
Curious (hopeful?) that you have better ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)
docs/RFCS/20200513_long_running_migrations.md, line 342 at r2 (raw file):
Hm, this is pretty hairy, I was partial to having this authoritative state to look at, and work off of. I was thinking about this over the weekend, and am being very hand-wavey here, but I'm wondering what do you think of the following stream of gibberish conciousness:
The example above with n1-n4 looks an awful lot like a raft membership problem. We started off with a raft group with members n1-n3. Over here we have n1, which is no longer going to be considered part of the raft group, and n4, which now is. If they were all restarted, we don't want n1 to be participate in any proposals/elections/etc. It should only be possible for n2-n4 to assume authority (or just n2-n3, if n4 hasn't been caught up yet). Once this raft group does have a leader, we know that a majority of the nodes in the system have the authoritative set of fully decommissioned nodes (or rather, the authoritative commissioning state of every node in the cluster).
In raft, a replica with a higher term will refuse to acknowledge vote requests from a replica with a lower term. Here, similarly, a node should refuse "votes" from nodes it considers to be fully decommissioned (so n1, despite being able to solicit votes from n2-n3, will never actually receive them). Similar to raft, when n2 collects votes for itself, it does not count votes from n1 (seeing as how n2 knows n1 is no longer part of this raft group).
Ignoring for a second what code for the following would look like, wouldn't it be nice if, on full cluster restart, each restarting node tries to form this raft group over the range of keys representing commissioning state? And once that was settled, going forward we relied on this set of keys as the source of truth? Say now that this set of keys are actually the node liveness keys, and replicas of the node liveness range already exist on at least a quorum of nodes in the cluster (n1 might have a replica too, on disk, but it won't do it much good seeing as how it can't participate in the raft group). As for the Connect RPC, it now has the pre-requisite that this raft group is properly initialized and has a functioning leader/leaseholder. All its reads and writes will need to go through the leaseholder for this liveness range after all.
"if the decommissioning bit is set, it is also written into the local storage of each participant node in the cluster"
This is kind of still in the spirit of what we'd be doing, except we only need to write to the local storage of a quorum of the liveness range's replicas, and then require them to be up and running for the rest of the cluster to consult. I think we're now simply repurposing the "KV copy" of this decommissioned bit as the "local storage" for replicas of this range.
One way out could be to require that in fact a quorum of nodes is started and then discover each other (say n2 and n3), but KV would not become available until both n2 and n3 start serving, at which point it's too late to gate any migrations.
I don't think I fully understand why it's too late to gate any migrations at that point. With the hand-waveyness above, we're doing things in two steps: establishing a leaseholder over the liveness range, and then using that leaseholder (and quorum) to power the rest of the Connect RPC. Does that not work?
Does any of this make sense? Is it at all workable? Is it already what you were describing in different words?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @irfansharif, @knz, and @tbg)
docs/RFCS/20200513_long_running_migrations.md, line 342 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Hm, this is pretty hairy, I was partial to having this authoritative state to look at, and work off of. I was thinking about this over the weekend, and am being very hand-wavey here, but I'm wondering what do you think of the following stream of gibberish conciousness:
The example above with n1-n4 looks an awful lot like a raft membership problem. We started off with a raft group with members n1-n3. Over here we have n1, which is no longer going to be considered part of the raft group, and n4, which now is. If they were all restarted, we don't want n1 to be participate in any proposals/elections/etc. It should only be possible for n2-n4 to assume authority (or just n2-n3, if n4 hasn't been caught up yet). Once this raft group does have a leader, we know that a majority of the nodes in the system have the authoritative set of fully decommissioned nodes (or rather, the authoritative commissioning state of every node in the cluster).
In raft, a replica with a higher term will refuse to acknowledge vote requests from a replica with a lower term. Here, similarly, a node should refuse "votes" from nodes it considers to be fully decommissioned (so n1, despite being able to solicit votes from n2-n3, will never actually receive them). Similar to raft, when n2 collects votes for itself, it does not count votes from n1 (seeing as how n2 knows n1 is no longer part of this raft group).
Ignoring for a second what code for the following would look like, wouldn't it be nice if, on full cluster restart, each restarting node tries to form this raft group over the range of keys representing commissioning state? And once that was settled, going forward we relied on this set of keys as the source of truth? Say now that this set of keys are actually the node liveness keys, and replicas of the node liveness range already exist on at least a quorum of nodes in the cluster (n1 might have a replica too, on disk, but it won't do it much good seeing as how it can't participate in the raft group). As for the Connect RPC, it now has the pre-requisite that this raft group is properly initialized and has a functioning leader/leaseholder. All its reads and writes will need to go through the leaseholder for this liveness range after all.
"if the decommissioning bit is set, it is also written into the local storage of each participant node in the cluster"
This is kind of still in the spirit of what we'd be doing, except we only need to write to the local storage of a quorum of the liveness range's replicas, and then require them to be up and running for the rest of the cluster to consult. I think we're now simply repurposing the "KV copy" of this decommissioned bit as the "local storage" for replicas of this range.
One way out could be to require that in fact a quorum of nodes is started and then discover each other (say n2 and n3), but KV would not become available until both n2 and n3 start serving, at which point it's too late to gate any migrations.
I don't think I fully understand why it's too late to gate any migrations at that point. With the hand-waveyness above, we're doing things in two steps: establishing a leaseholder over the liveness range, and then using that leaseholder (and quorum) to power the rest of the Connect RPC. Does that not work?
Does any of this make sense? Is it at all workable? Is it already what you were describing in different words?
The concern with starting up the whole stack and letting it talk to the rest of the cluster and then deciding whether that is legal is that it's hard to avoid undefined behavior. Ideally we get dead simple guarantees - you either make it into the cluster or not. The hope was that the Connect RPC could be an airtight barrier here.
You are right that this has aspects of a consensus problem. I think, and I think this is also the logical conclusion of what you're describing, that we should really have cluster membership management as a thin layer that can be instantiated without the thick stack that is CRDB. For example, if CRDB didn't use itself as a config store, this problem would be easy - we'd have an auxiliary system running (say etcd) and that's what the Connect RPC would reach out to - no problem (though of course other complexity crops up).
@andy-kimball has in the past prototyped such a layer (this came out of Acidlib), so we wouldn't be starting from scratch here. However, doing something like this is an insurmountable increase in scope, so we ought to put a lid on it.
I think to get to a workable solution, we need to rely on the operator a little more (though less than today). I'm thinking the following:
./cockroach node decommission
prefers decommissioning live targets. When there are no more replicas on the target, we instruct the target to write a killfile (we have such a mechanism already, seePreventedStartupFile
). That means that once the operator does shut down the node, it will "never" start again (mod the operator tampering with a file that clearly tells them to not do that).- if the target is down,
./cockroach node decommission
fails and instructs the operator to run with the--force
flag (which marks the node as never coming back) - here the operator guarantees that this is true (though we can broadcast a best-effort payload through the cluster that lets nodes refuse this node - this should in practice solve 99% of potential issues with nodes coming back) - the
decommissioning
field inNodeLiveness
becomes anenum { Active, Decommissioning, WaitGone, Gone }
(names tbd), enabling the above and below - on cluster upgrade attempt, we fail/block if we see a node in
Decommissioning
state - if we see
WaitGone
, we make sure it's beenTimeUntilStoreDead
until we initiate the bump (and we update toGone
at that point assuming node is still dead after the timeout)
Then we go into the actual upgrade, where we store the target cluster version together with an enum in a table:
type Upgrade struct {
Version roachpb.Version
State UpgradeState
}
Where UpgradeState
is either StateInProgress
or StateCompleted
, with the following meaning:
StateInProgress
: not all nodes were informed ofVersion
yet (though some may already have acked it, this could happen with coordinator failure)StateCompleted
: all nodes in the cluster have acked the end of the migration. (It is also guaranteed that nodes joining the cluster now will have to ack the end of the migation as well, i.e. it's a transitive property now).
Note that if a migration wants to have a hook that fires when all nodes have acked some work to be done, we can just use two consecutive versions to achieve this - we don't need a dedicated primitive.
As discussed elsewhere, the coordinator would read the list of nodes, try to get all of the acks, and then loop around to make sure it's got them all (avoiding join-races).
The reason the cluster version is easier to handle than the decommissioning bit is that we can afford to prevent updates when nodes are down. We have no such luxury for decommissioning. Otherwise, we could treat the "list of decommissioned nodes" like the cluster version, that is, the Connect
RPC would force the joiner to persist the whole list of decommissioned nodes locally before participating (let's ignore that this list can grow large). Then we could use the same technique above for the version above: we ack the decommission when all nodes have persisted it locally (and the list will transitively reach all new nodes joining the cluster, and is always available without KV, etc). We could still have situations where a bunch of decommissioned nodes reach out to each other (in which case they'll allow Connect
) so we would also need to run the Connect
checks at RPC handshake time (which I think we'll want to do regardless, and already do today for the cluster version).
Of course this is all a non-starter, because we don't want to force a user to decommission all down nodes at once, and because an ever-growing payload in the Connect RPC is tricky.
Back to the user-initiated decommissioning proposal above, I think there is possibly room to improve the UX. As written, user has to decommission the node, then shut the node down (or, if it's down, use --force
), then wait for TimeUntilStoreDead
to pass until the upgrade succeeds. It would be nice if, with an online node, to not have to wait out TimeUntilStoreDead
. I think the reasonable way to do this is that at the end of decommissioning an online node, the coordinator moves the node to WaitGone
. Then it issues a long-poll drain to the target node. When the node confirms that it has drained successfully, we send a remote termination with a deadline (i.e. we send an RPC saying "promise that you'll have terminated within 3s" and the node goes "ok" and then we can reduce TimeUntilStoreDead
to something >3s
or possibly just wait for the conn to drop).
Let me know if you think this is reasonable and in particular make sure there aren't any holes here (similar to my 4 node example above, where I think now everything will work).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @irfansharif)
docs/RFCS/20200513_long_running_migrations.md, line 38 at r2 (raw file):
# Background We currently have two frameworks in place for "migrations":
There are actually 3.
The 3rd one (which you're missing here) is the one we invented very early on before sqlmigrations and the cluster version setting werre designed.
It's the "MaybeUpgrade" logic on SQL table descriptors. This code is doing context-free protobuf upgrades every time an "old" descriptor is used in a "new" version.
Due to the other factors you explain elsewhere in this text, this code is currently overly complicated because it needs to deal with upgrades across multiple major versions.
Either this RFC should aim to get rid of this (and explain how); or it should spell it out as out of scope (and then spell out the problem to solve to the SQL schema team to take over).
docs/RFCS/20200513_long_running_migrations.md, line 88 at r2 (raw file):
of this RFC. (For this RFC, we'll leave "sqlmigrations" as is and instead focus on the
Remove the parentheses around this paragraph. It defines scope and is thus essential.
docs/RFCS/20200513_long_running_migrations.md, line 111 at r2 (raw file):
the old (19.1) representation, and as a result, we had to keep maintaining the "upgrade on read" code path in 20.1, and will continue having to do so until we can guarantee that there are no old table descriptors left lying around.
In the meantime, there's code already written (MaybeUpgradeForeignKeyRepresentation
) which does it using the 3rd upgrade method that I mention above. I think either this paragraph should be modified to refer to that properly, or you should pick a different motivating example.
This mostly follows the ideas in cockroachdb#32574, and serves as a crucial building block for cockroachdb#48843. Specifically this PR introduces a new Join RPC that new nodes can use, addressing already initialized nodes, to learn about the cluster ID and its node id. Previously joining nodes were responsible for allocating their own IDs and used to discover the cluster ID. By moving towards a more understandable flow of how nodes joins the cluster, we can build a few useful primitives on top of this: - we can prevent mismatched version nodes from joining the cluster - we can prevent decommissioned nodes from joining the cluster - we can add the liveness record for a given node as soon as it joins, which would simplify our liveness record handling code that is perennially concerned with missing liveness records The tiny bit of complexity in this PR comes from how we're able to migrate into this behavior from the old. To that end we retain the earlier gossip-based cluster ID discovery+node ID allocation for self behavior. Nodes with this patch will attempt to use this join RPC, if implemented on the addressed node, and fall back to using the previous behavior if not. It wasn't possible to use cluster versions for this migrations because this happens very early in the node start up process, and the version gating this change will not be active until much later in the crdb process lifetime. --- There are some leftover TODOs that I'm looking to address in this PR. They should be tiny, and be easy to retro-fit into what we have so far. Specifically I'm going to plumb the client address into the RPC so the server is able to generate backlinks (and solve the bidirectionality problem). I'm also going to try and add the liveness record for a joining node as part of the join rpc. Right now the tests verifying connectivity/bootstrap/join flags pass out of the box, but I'm going to try adding more randomized testing here to test full connectivity once I address these TODOs. Release note: None
This mostly follows the ideas in cockroachdb#32574, and serves as a crucial building block for cockroachdb#48843. Specifically this PR introduces a new Join RPC that new nodes can use, addressing already initialized nodes, to learn about the cluster ID and its node id. Previously joining nodes were responsible for allocating their own IDs and used to discover the cluster ID. By moving towards a more understandable flow of how nodes joins the cluster, we can build a few useful primitives on top of this: - we can prevent mismatched version nodes from joining the cluster - we can prevent decommissioned nodes from joining the cluster - we can add the liveness record for a given node as soon as it joins, which would simplify our liveness record handling code that is perennially concerned with missing liveness records The tiny bit of complexity in this PR comes from how we're able to migrate into this behavior from the old. To that end we retain the earlier gossip-based cluster ID discovery+node ID allocation for self behavior. Nodes with this patch will attempt to use this join RPC, if implemented on the addressed node, and fall back to using the previous behavior if not. It wasn't possible to use cluster versions for this migrations because this happens very early in the node start up process, and the version gating this change will not be active until much later in the crdb process lifetime. --- There are some leftover TODOs that I'm looking to address in this PR. They should be tiny, and be easy to retro-fit into what we have so far. Specifically I'm going to plumb the client address into the RPC so the server is able to generate backlinks (and solve the bidirectionality problem). I'm also going to try and add the liveness record for a joining node as part of the join rpc. Right now the tests verifying connectivity/bootstrap/join flags pass out of the box, but I'm going to try adding more randomized testing here to test full connectivity once I address these TODOs. Release note: None
This PR onboards the first real long-running migration using the infrastructure we've been building up within pkg/migration. It adds in the final missing pieces described in our original RFC (cockroachdb#48843). These components were originally prototyped in cockroachdb#56107. The migration in question (which happens to be a below-Raft one, first attempted in cockroachdb#42822) now lets us do the following: i. Use the RangeAppliedState on all ranges ii. Use the unreplicated TruncatedState on all ranges The missing pieces we introduce along side this migration are: a. The `Migrate` KV request. This forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific version, moving out of any legacy modes they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system. b. `IterateRangeDescriptors`. This provides a handle on every range descriptor in the system, which callers can then use to send out arbitrary KV requests to in order to run arbitrary KV-level migrations. These requests will typically just be the `Migrate` request, with added code next to the `Migrate` command implementation to do the specific KV-level things intended for the specified version. c. The `system.migrations` table. We use it to store metadata about ongoing migrations for external visibility/introspection. The schema (listed below) is designed with an eye towards scriptability. We want operators to be able programmatically use this table to control their upgrade processes, if needed. CREATE TABLE public.migrations ( version STRING NOT NULL, status STRING NOT NULL, description STRING NOT NULL, start TIMESTAMP NOT NULL DEFAULT now():::TIMESTAMP, completed TIMESTAMP NULL, progress STRING NULL, CONSTRAINT "primary" PRIMARY KEY (version ASC), FAMILY "primary" (version, status, description, start, completed, progress) ) Release note(general change): Cluster version upgrades, as initiated by `SET CLUSTER SETTING version = <major>-<minor>`, now perform internal maintenance duties that will delay how long it takes for the command to complete. The delay is proportional to the amount of data currently stored in the cluster. The cluster will also experience a small amount of additional load during this period while the upgrade is being finalized. Release note(general change): We introduce a new `system.migrations` table for introspection into crdb internal data migrations. These migrations are the "maintenance duties" mentioned above. The table surfaces the currently ongoing migrations, the previously completed migrations, and in the case of failure, the errors from the last failed attempt.
This PR onboards the first real long-running migration using the infrastructure we've been building up within pkg/migration. It adds in the final missing pieces described in our original RFC (cockroachdb#48843). These components were originally prototyped in cockroachdb#56107. The migration in question (which happens to be a below-Raft one, first attempted in cockroachdb#42822) now lets us do the following: i. Use the RangeAppliedState on all ranges ii. Use the unreplicated TruncatedState on all ranges The missing pieces we introduce along side this migration are: a. The `Migrate` KV request. This forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific version, moving out of any legacy modes they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system. b. `IterateRangeDescriptors`. This provides a handle on every range descriptor in the system, which callers can then use to send out arbitrary KV requests to in order to run arbitrary KV-level migrations. These requests will typically just be the `Migrate` request, with added code next to the `Migrate` command implementation to do the specific KV-level things intended for the specified version. c. The `system.migrations` table. We use it to store metadata about ongoing migrations for external visibility/introspection. The schema (listed below) is designed with an eye towards scriptability. We want operators to be able programmatically use this table to control their upgrade processes, if needed. CREATE TABLE public.migrations ( version STRING NOT NULL, status STRING NOT NULL, description STRING NOT NULL, start TIMESTAMP NOT NULL DEFAULT now():::TIMESTAMP, completed TIMESTAMP NULL, progress STRING NULL, CONSTRAINT "primary" PRIMARY KEY (version ASC), FAMILY "primary" (version, status, description, start, completed, progress) ) Release note (general change): Cluster version upgrades, as initiated by `SET CLUSTER SETTING version = <major>-<minor>`, now perform internal maintenance duties that will delay how long it takes for the command to complete. The delay is proportional to the amount of data currently stored in the cluster. The cluster will also experience a small amount of additional load during this period while the upgrade is being finalized. Release note (general change): We introduce a new `system.migrations` table for introspection into crdb internal data migrations. These migrations are the "maintenance duties" mentioned above. The table surfaces the currently ongoing migrations, the previously completed migrations, and in the case of failure, the errors from the last failed attempt.
It's not currently wired up to anything. We'll use it in future PRs to send out `Migrate` requests to the entire keyspace. This was originally prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for the motivation here. Release note: None
It's not currently wired up to anything. We'll use it in future PRs to send out `Migrate` requests to the entire keyspace. This was originally prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for the motivation here. Release note: None
It's not currently wired up to anything. We'll use it in future PRs to send out `Migrate` requests to the entire keyspace. This was originally prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for the motivation here. Release note: None
It's not currently wired up to anything. We'll use it in future PRs to send out `Migrate` requests to the entire keyspace. This was originally prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for the motivation here. Release note: None
It's not currently wired up to anything. We'll use it in future PRs to send out `Migrate` requests to the entire keyspace. This was originally prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for the motivation here. Release note: None
It's not currently wired up to anything. We'll use it in future PRs to send out `Migrate` requests to the entire keyspace. This was originally prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for the motivation here. Release note: None
938bbcd
to
e7bf5cd
Compare
Release note: None
e7bf5cd
to
3a6596e
Compare
With #58088, I'm going to go ahead and mark this as completed. I've (slightly) refreshed the text to link into the work that went into fleshing it out, and am proposing to merge the RFC as-is. |
Thanks! bors r+ |
Build succeeded: |
Release note: None