-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
migration/startupmigrations: remove all startupmigrations #73813
Comments
See #73754 for an interaction with |
One challenge here is that we use the startupmigrations as an easy way to write the data using SQL. To get rid of these startupmigrations, we'd need to have a convenient way to write such data into the boostrapped state. One approach would be to write the data out in some static process to generate the kvs and check them in. |
I'm inclined to unhook this from CRDB-18596 because of #84881. |
i'm friendly to the idea but let's then find a new+better home for it. |
Have any ideas? I went diving into JIRA initiatives. Maybe https://cockroachlabs.atlassian.net/browse/CRDB-17124, but it's already done? |
@ajstorm could you assist us here? |
At least for some known painful cases, #84881 doesn't seem to solve the problem. For example on my laptop, I don't see a better home than CRDB-18596 considering the above. |
That's because the startup migrations run on cluster creation, however, CRDB-18596 does not care about cluster creation time. I have a different fix for demo coming soon (later this week) where we don't turn on the injected latency until later, that will fix the slow |
This patch introduces "permanent" upgrades - a type of upgrade that is tied to a particular cluster version (just like the existing upgrades) but that runs regardless of the version at which the cluster was bootstrapped (in contrast with the existing upgrades that do are not run when they're associated with a cluster version <= the bootstrap version). These upgrades are called "permanent" because they cannot be deleted from the codebase at a later point. Existing upgrades are explicitly or implicitly baked into the bootstrap image of the binary that introduced them. For example, an upgrade that creates a system table is only run when upgrading an existing, older-version, cluster to the new version; it does not run for a cluster bootstrapped by the binary that introduced the upgrade because the respective system tables are also included in the bootstrap metadata. For some upcoming upgrades, though, including them in the bootstrap image is difficult. For example, creating a job record at bootstrap time is proving to be difficult (the system.jobs table has indexes, so you want to insert into it through SQL because figuring out the kv's for a row is tedious, etc). This is where these new permanent upgrades come in. The permanent upgrades share some characteristics with the `startupmigrations` that don't have the `includedInBootstrap` flag set. In contrast with `startupmigrations`, though, the new upgrades have the advantage that, being upgrades, they are tied to a cluster version v1 - they only run when the cluster version is being bumped to a version v2 >= v1 (or when the cluster is bootstrapped at such a v2). So, they do not run when a v1 node joins a cluster with a lower cluster version. This is important for some things - for example, one cannot create a job of a newly-introduced type in a startupmigration because the old-version nodes would freak out to discover it. At the same time, being upgrades, these permanent upgrades come with the caveat that a node running a new binary cannot rely on the upgrade to have run during the node startup sequence. The new-version code cannot even fully rely on a migration associated with version v1 to have run if v1.IsActive() returns true; if the cluster is bootstrapped at v1 or higher, v1.IsActive() will return true before the respective upgrade runs. However, all such upgrades run before user traffic starts being accepted. Touches cockroachdb#73813 Release note: None Epic: None
Once demo is fixed, I'm fine decoupling this from CRDB-18596. If we don't have a suitable epic to attach it to, I'd suggest that we create a new epic for this issue. |
This patch introduces "permanent" upgrades - a type of upgrade that is tied to a particular cluster version (just like the existing upgrades) but that runs regardless of the version at which the cluster was bootstrapped (in contrast with the existing upgrades that are not run when they're associated with a cluster version <= the bootstrap version). These upgrades are called "permanent" because they cannot be deleted from the codebase at a later point, in contrast with the others that are deleted once the version they're tied drops below BinaryMinSupportedVersion). Existing upgrades are explicitly or implicitly baked into the bootstrap image of the binary that introduced them. For example, an upgrade that creates a system table is only run when upgrading an existing, older-version, cluster to the new version; it does not run for a cluster bootstrapped by the binary that introduced the upgrade because the respective system tables are also included in the bootstrap metadata. For some upcoming upgrades, though, including them in the bootstrap image is difficult. For example, creating a job record at bootstrap time is proving to be difficult (the system.jobs table has indexes, so you want to insert into it through SQL because figuring out the kv's for a row is tedious, etc). This is where these new permanent upgrades come in. These permanent upgrades replace the `startupmigrations` that don't have the `includedInBootstrap` field set. All such startupmigrations have been copied over as upgrades. None of the current `startupmigrations` have `includedInBootstrap` set (except one but that's dummy one since the actual migration code has been deleted), so the startupmigrations package is now deleted. That's a good thing - we had one too many migrations frameworks. These permanent upgrades, though, do not have exactly the same semantics as the startupmigrations they replace. To the extent that there is a difference, the new semantics are considered more desirable: - startupmigrations run when a node that has the code for a particular migration startups up for the first time. In other words, the startupmigrations were not associated with a cluster version; they were associated with a binary version. Migrations can run while old-version nodes are still around. This means that one cannot add a migration that is a problem for old nodes - e.g. a migration creating a job of a type that the old version wouldn't recognize. - upgrades are tied to a cluster version - they only run when the cluster's active version moves past the upgrade's version. This stays the case for the new permanent migrations too, so a v2 node will not immediately run the permant migrations introduced since v1 when it joins a v1 cluster. Instead, the migrations will run when the cluster version is bumped. As such, the migrations can be backwards incompatible. startupmigrations do arguably have a property that can be desirable: when there are no backwards compatibility issues, the v2 node can rely on the effects of the startupmigrations it knows about regardless of the cluster version. In contrast, with upgrades, not only is a node unable to simply assume that a particular upgrade has run during startup, but, more than that, a node is not even able to look at a version gate during the startup sequence in order to determine whether a particular upgrade has run or not (because, in clusters that are bootstrapped at v2, the active cluster version starts as v2 even before the upgrades run). This is a fact of life for existing upgrades, and now becomes a fact of life for permanent upgrades too. However, by the time user SQL traffic is admitted on a node, the node can rely on version gates to correspond to migrations that have run. After thinking about it, this possible advantage of startupmigrations doesn't seem too useful and so it's not reason enough to keep the startupmigrations machinery around. Since the relevant startupmigrations have been moved over to upgrades, and the two libraries use different methods for not running the same migration twice, a 23.1 node that comes up in a 22.2 cluster will re-run the several permanent upgrades in question, even though they had already run as startupmigrations. This is OK since both startupmigrations and upgrades are idempotent. None of the current permanent upgrades are too expensive. Closes cockroachdb#73813 Release note: None Epic: None
This patch introduces "permanent" upgrades - a type of upgrade that is tied to a particular cluster version (just like the existing upgrades) but that runs regardless of the version at which the cluster was bootstrapped (in contrast with the existing upgrades that are not run when they're associated with a cluster version <= the bootstrap version). These upgrades are called "permanent" because they cannot be deleted from the codebase at a later point, in contrast with the others that are deleted once the version they're tied drops below BinaryMinSupportedVersion). Existing upgrades are explicitly or implicitly baked into the bootstrap image of the binary that introduced them. For example, an upgrade that creates a system table is only run when upgrading an existing, older-version, cluster to the new version; it does not run for a cluster bootstrapped by the binary that introduced the upgrade because the respective system tables are also included in the bootstrap metadata. For some upcoming upgrades, though, including them in the bootstrap image is difficult. For example, creating a job record at bootstrap time is proving to be difficult (the system.jobs table has indexes, so you want to insert into it through SQL because figuring out the kv's for a row is tedious, etc). This is where these new permanent upgrades come in. These permanent upgrades replace the `startupmigrations` that don't have the `includedInBootstrap` field set. All such startupmigrations have been copied over as upgrades. None of the current `startupmigrations` have `includedInBootstrap` set (except one but that's dummy one since the actual migration code has been deleted), so the startupmigrations package is now deleted. That's a good thing - we had one too many migrations frameworks. These permanent upgrades, though, do not have exactly the same semantics as the startupmigrations they replace. To the extent that there is a difference, the new semantics are considered more desirable: - startupmigrations run when a node that has the code for a particular migration startups up for the first time. In other words, the startupmigrations were not associated with a cluster version; they were associated with a binary version. Migrations can run while old-version nodes are still around. This means that one cannot add a migration that is a problem for old nodes - e.g. a migration creating a job of a type that the old version wouldn't recognize. - upgrades are tied to a cluster version - they only run when the cluster's active version moves past the upgrade's version. This stays the case for the new permanent migrations too, so a v2 node will not immediately run the permant migrations introduced since v1 when it joins a v1 cluster. Instead, the migrations will run when the cluster version is bumped. As such, the migrations can be backwards incompatible. startupmigrations do arguably have a property that can be desirable: when there are no backwards compatibility issues, the v2 node can rely on the effects of the startupmigrations it knows about regardless of the cluster version. In contrast, with upgrades, not only is a node unable to simply assume that a particular upgrade has run during startup, but, more than that, a node is not even able to look at a version gate during the startup sequence in order to determine whether a particular upgrade has run or not (because, in clusters that are bootstrapped at v2, the active cluster version starts as v2 even before the upgrades run). This is a fact of life for existing upgrades, and now becomes a fact of life for permanent upgrades too. However, by the time user SQL traffic is admitted on a node, the node can rely on version gates to correspond to migrations that have run. After thinking about it, this possible advantage of startupmigrations doesn't seem too useful and so it's not reason enough to keep the startupmigrations machinery around. Since the relevant startupmigrations have been moved over to upgrades, and the two libraries use different methods for not running the same migration twice, a 23.1 node that comes up in a 22.2 cluster will re-run the several permanent upgrades in question, even though they had already run as startupmigrations. This is OK since both startupmigrations and upgrades are idempotent. None of the current permanent upgrades are too expensive. Closes cockroachdb#73813 Release note: None Epic: None
This patch introduces "permanent" upgrades - a type of upgrade that is tied to a particular cluster version (just like the existing upgrades) but that runs regardless of the version at which the cluster was bootstrapped (in contrast with the existing upgrades that are not run when they're associated with a cluster version <= the bootstrap version). These upgrades are called "permanent" because they cannot be deleted from the codebase at a later point, in contrast with the others that are deleted once the version they're tied drops below BinaryMinSupportedVersion). Existing upgrades are explicitly or implicitly baked into the bootstrap image of the binary that introduced them. For example, an upgrade that creates a system table is only run when upgrading an existing, older-version, cluster to the new version; it does not run for a cluster bootstrapped by the binary that introduced the upgrade because the respective system tables are also included in the bootstrap metadata. For some upcoming upgrades, though, including them in the bootstrap image is difficult. For example, creating a job record at bootstrap time is proving to be difficult (the system.jobs table has indexes, so you want to insert into it through SQL because figuring out the kv's for a row is tedious, etc). This is where these new permanent upgrades come in. These permanent upgrades replace the `startupmigrations` that don't have the `includedInBootstrap` field set. All such startupmigrations have been copied over as upgrades. None of the current `startupmigrations` have `includedInBootstrap` set (except one but that's dummy one since the actual migration code has been deleted), so the startupmigrations package is now deleted. That's a good thing - we had one too many migrations frameworks. These permanent upgrades, though, do not have exactly the same semantics as the startupmigrations they replace. To the extent that there is a difference, the new semantics are considered more desirable: - startupmigrations run when a node that has the code for a particular migration startups up for the first time. In other words, the startupmigrations were not associated with a cluster version; they were associated with a binary version. Migrations can run while old-version nodes are still around. This means that one cannot add a migration that is a problem for old nodes - e.g. a migration creating a job of a type that the old version wouldn't recognize. - upgrades are tied to a cluster version - they only run when the cluster's active version moves past the upgrade's version. This stays the case for the new permanent migrations too, so a v2 node will not immediately run the permant migrations introduced since v1 when it joins a v1 cluster. Instead, the migrations will run when the cluster version is bumped. As such, the migrations can be backwards incompatible. startupmigrations do arguably have a property that can be desirable: when there are no backwards compatibility issues, the v2 node can rely on the effects of the startupmigrations it knows about regardless of the cluster version. In contrast, with upgrades, not only is a node unable to simply assume that a particular upgrade has run during startup, but, more than that, a node is not even able to look at a version gate during the startup sequence in order to determine whether a particular upgrade has run or not (because, in clusters that are bootstrapped at v2, the active cluster version starts as v2 even before the upgrades run). This is a fact of life for existing upgrades, and now becomes a fact of life for permanent upgrades too. However, by the time user SQL traffic is admitted on a node, the node can rely on version gates to correspond to migrations that have run. After thinking about it, this possible advantage of startupmigrations doesn't seem too useful and so it's not reason enough to keep the startupmigrations machinery around. Since the relevant startupmigrations have been moved over to upgrades, and the two libraries use different methods for not running the same migration twice, a 23.1 node that comes up in a 22.2 cluster will re-run the several permanent upgrades in question, even though they had already run as startupmigrations. This is OK since both startupmigrations and upgrades are idempotent. None of the current permanent upgrades are too expensive. Closes cockroachdb#73813 Release note: None Epic: None
91394: changefeedccl: roachtest refactor and initial-scan-only r=samiskin a=samiskin Epic: https://cockroachlabs.atlassian.net/browse/CRDB-19057 Changefeed roachtests were setup focused on running a workload for a specific duration and then quitting, making it difficult to run an `initial_scan_only` test that terminated upon Job success. We as a team have also noticed a greater need to test and observe changefeeds running in production against real sinks to catch issues we are unable to mock or observe from simple unit tests. This is currently a notable hassle as one has to set up each individual sink and run them, ensure the changefeed is pointing to the right URI, and then be able to monitor the metrics of this long running process. This change refactors the cdcBasicTest into distinct pieces that are then put together in a test. This allows for easier experimentation with live tests, allowing us to spin up a cluster and a workload, run one or more changefeeds on it, set up a poller to print out job details,have an accessible grafana URL to view metrics, and wait for some completion condition. Changing the specialized `runCDCKafkaAuth`, `runCDCBank`, and `runCDCSchemaRegistry` functions were left out of scope for this first big change. The main APIs involved in basic roachtests are now: - `newCDCTester`: This creates a tester struct to run the rest of the APIs and initializes the database - `tester.runTPCCWorkload(tpccArgs)`: Starts a TPCC workload from the last node in the cluster - `tester.runLedgerWorkload(ledgerArgs)`: Starts a Ledger workload from the last node in the cluster - `tester.runFeedLatencyVerifier(changefeedJob, latencyTargets)`: starts a routine that monitors the changefeed latency until the tester is `Close`'d - `tester.waitForWorkload`: waits for a workload started by `setupAndRunWorkload` to complete its duration - `tester.startCRDBChaos`: This starts a Chaos routine that periodically shuts nodes down and brings them back up - `tester.newChangefeed(feedArgs)`: starts a new changefeed on the cluster and returns `changefeedJob` object - `changefeedJob.waitForCompletion`: waits for a changefeed to complete (either success or failure) - `tester.startGrafana`: Sets up a grafana instance on the last node of the cluster and prints out a link to a grafana, this automatically runs unless `--skip-init` is provided. If `--debug` is not used, `StopGrafana` will be called on test teardown to publish prometheus metrics to the artifacts directory. An API that is going to be more useful for experimentation are: - `changefeedJob.runFeedPoller(ctx, stopper, onInfo)`: runs a given callback every second with the changefeed info Roachtests can be ran locally with the `--local` flag or on an existing cluster without destroying it afterwards with `--cluster="my-cluster" --debug` Ex: After adding a new test (lets say `"cdc/my-test"`) to the `registerCDC` function you can keep running ```bash ./dev build cockroach --cross # if changes made to crdb ./dev build roachtest # if changes made to the test ./bin/roachtest run cdc/my-test --cluster="my-cluster" --debug ``` as you try out different changes or options. If you want to try a set of steps against different versions of the app you could download those binaries and use the `--cockroach="path-to-binary"` flag to test against those instead. If you want to set up a large TPCC database on a cluster and reuse it for tests this can be done with roachtests's `--wipe` and `--skip-init` flags. Release note: None 91627: upgrade: introduce "permanent" upgrades r=andreimatei a=andreimatei This patch introduces "permanent" upgrades - a type of upgrade that is tied to a particular cluster version (just like the existing upgrades) but that runs regardless of the version at which the cluster was bootstrapped (in contrast with the existing upgrades that are not run when they're associated with a cluster version <= the bootstrap version). These upgrades are called "permanent" because they cannot be deleted from the codebase at a later point, in contrast with the others that are deleted once the version they're tied drops below BinaryMinSupportedVersion). Existing upgrades are explicitly or implicitly baked into the bootstrap image of the binary that introduced them. For example, an upgrade that creates a system table is only run when upgrading an existing, older-version, cluster to the new version; it does not run for a cluster bootstrapped by the binary that introduced the upgrade because the respective system tables are also included in the bootstrap metadata. For some upcoming upgrades, though, including them in the bootstrap image is difficult. For example, creating a job record at bootstrap time is proving to be difficult (the system.jobs table has indexes, so you want to insert into it through SQL because figuring out the kv's for a row is tedious, etc). This is where these new permanent upgrades come in. These permanent upgrades replace the `startupmigrations` that don't have the `includedInBootstrap` field set. All such startupmigrations have been copied over as upgrades. None of the current `startupmigrations` have `includedInBootstrap` set (except one but that's dummy one since the actual migration code has been deleted), so the startupmigrations package is now deleted. That's a good thing - we had one too many migrations frameworks. These permanent upgrades, though, do not have exactly the same semantics as the startupmigrations they replace. To the extent that there is a difference, the new semantics are considered more desirable: - startupmigrations run when a node that has the code for a particular migration startups up for the first time. In other words, the startupmigrations were not associated with a cluster version; they were associated with a binary version. Migrations can run while old-version nodes are still around. This means that one cannot add a migration that is a problem for old nodes - e.g. a migration creating a job of a type that the old version wouldn't recognize. - upgrades are tied to a cluster version - they only run when the cluster's active version moves past the upgrade's version. This stays the case for the new permanent migrations too, so a v2 node will not immediately run the permant migrations introduced since v1 when it joins a v1 cluster. Instead, the migrations will run when the cluster version is bumped. As such, the migrations can be backwards incompatible. startupmigrations do arguably have a property that can be desirable: when there are no backwards compatibility issues, the v2 node can rely on the effects of the startupmigrations it knows about regardless of the cluster version. In contrast, with upgrades, not only is a node unable to simply assume that a particular upgrade has run during startup, but, more than that, a node is not even able to look at a version gate during the startup sequence in order to determine whether a particular upgrade has run or not (because, in clusters that are bootstrapped at v2, the active cluster version starts as v2 even before the upgrades run). This is a fact of life for existing upgrades, and now becomes a fact of life for permanent upgrades too. However, by the time user SQL traffic is admitted on a node, the node can rely on version gates to correspond to migrations that have run. After thinking about it, this possible advantage of startupmigrations doesn't seem too useful and so it's not reason enough to keep the startupmigrations machinery around. Since the relevant startupmigrations have been moved over to upgrades, and the two libraries use different methods for not running the same migration twice, a 23.1 node that comes up in a 22.2 cluster will re-run the several permanent upgrades in question, even though they had already run as startupmigrations. This is OK since both startupmigrations and upgrades are idempotent. None of the current permanent upgrades are too expensive. Closes #73813 Release note: None Epic: None Co-authored-by: Shiranka Miskin <[email protected]> Co-authored-by: Andrei Matei <[email protected]>
Is your feature request related to a problem? Please describe.
Before long-running migrations, there were migrations which ran at startup. We haven't really leveraged these in a while (none since 20.2). There are, however, "permanent" migrations which set up some initial cluster state. We should get rid of that concept and all the complexity that comes with it.
Describe the solution you'd like
Move any permanent migrations into bootstrap. Some of the remaining seem to not really be permanent.
Remaining permanent migrations:
optIntoDiagnosticsReporting
populateVersionSetting
addRootUser
addAdminRole
addRootToAdminRole
initializeClusterSecret
disallowPublicUserOrRole
createDefaultDatabases
retireOldTsPurgeIntervalSettings
updateSystemLocationData
extendCreateRoleWithCreateLogin
Describe alternatives you've considered
We could not do this. An advantage to startup migrations is that they get to leverage the running system and, in doing so, may gain benefit from more code reuse.
Additional context
The pressing migration which motivated this was
createDefaultDatabases
.Jira issue: CRDB-11759
Epic CRDB-18596
The text was updated successfully, but these errors were encountered: