
multitenant: implement tenant upgrade interlock #94998

Merged
merged 1 commit into cockroachdb:master from the ajstorm-version-interlock branch on Mar 19, 2023

Conversation

ajstorm
Collaborator

@ajstorm ajstorm commented Jan 10, 2023

CRDB has an upgrade interlock between nodes to ensure that as a cluster is upgrading, the node that is driving the upgrade keeps all other nodes in sync (and prevents nodes with a down-level binary from joining the cluster). This interlock mechanism hasn't existed for tenant pods during a tenant upgrade until this commit.

The commit adds a similar interlock to the one used for nodes. When a tenant pod begins upgrading, it will first confirm that all other running tenant pods are on an acceptable binary version, and that the version for the attempted upgrade is less than the binary version of all tenant pods as well as greater than (or equal to) the minimum supported binary version. Then, it will begin to run migrations (upgrades). After each migration, it will push out the intermediate cluster version to all running tenant pods and revalidate that the upgrade can continue.
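As a rough illustration, that flow amounts to something like the sketch below. All names, hooks, and signatures here are hypothetical and only show the shape of the interlock; they are not the code in this PR.

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// interlockHooks bundles the operations the upgrade driver needs.
type interlockHooks struct {
	// validatePods checks that every running tenant pod can serve v: v must be
	// at most each pod's binary version and at least its minimum supported version.
	validatePods func(ctx context.Context, v roachpb.Version) error
	// runMigration runs the migration that brings the cluster to v.
	runMigration func(ctx context.Context, v roachpb.Version) error
	// pushVersion tells every running tenant pod to bump to v.
	pushVersion func(ctx context.Context, v roachpb.Version) error
}

// runTenantUpgrade validates up front, then after each migration pushes the
// intermediate version to all pods and revalidates, so a down-level pod
// cannot join mid-upgrade.
func runTenantUpgrade(ctx context.Context, steps []roachpb.Version, h interlockHooks) error {
	if len(steps) == 0 {
		return nil
	}
	target := steps[len(steps)-1]
	if err := h.validatePods(ctx, target); err != nil {
		return err
	}
	for _, v := range steps {
		if err := h.runMigration(ctx, v); err != nil {
			return err
		}
		if err := h.pushVersion(ctx, v); err != nil {
			return err
		}
		if err := h.validatePods(ctx, target); err != nil {
			return err
		}
	}
	return nil
}
```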

Epic: CRDB-18735

Release note: None

@ajstorm ajstorm requested review from knz, ajwerner, healthy-pod and a team January 10, 2023 14:50
@ajstorm ajstorm requested review from a team as code owners January 10, 2023 14:50
@ajstorm ajstorm requested a review from a team January 10, 2023 14:50
@ajstorm ajstorm requested a review from a team as a code owner January 10, 2023 14:50
@ajstorm ajstorm requested review from herkolategan and srosenberg and removed request for a team January 10, 2023 14:50
@cockroach-teamcity
Member

This change is Reviewable

@ajstorm ajstorm removed request for a team, herkolategan and srosenberg January 10, 2023 14:50
@ajstorm
Collaborator Author

ajstorm commented Jan 10, 2023

Still have to rebase, but wanted to get this on people's radars as it could take a bit of time to review.

I tried breaking it up into smaller commits, but there didn't seem to be an easy way to do so. The bulk of the changes, however, are in testing, so at first it may look like a larger change than it actually is. The interlock logic is relatively small and is mostly copied/massaged from the node interlock code (in pkg/upgrade/upgrademanager/manager.go).

@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch from 1978b99 to 286e55e Compare January 10, 2023 14:59
@ajstorm
Collaborator Author

ajstorm commented Jan 10, 2023

Nevermind - rebase was straightforward. This is ready for review now.

@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch 2 times, most recently from 9c72940 to d1264fb Compare January 16, 2023 22:11
Contributor

@ajwerner ajwerner left a comment

My concerns are around edge cases, but if we want to really rely on this, we should sort them out.

Reviewed 5 of 18 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @healthy-pod, and @knz)


pkg/server/tenant_migration.go line 62 at r2 (raw file):

			"binary's minimum supported version %s",
			targetCV, tenantVersion.BinaryMinSupportedVersion())
		log.Warningf(ctx, "%s", msg)

I think the better thing to do here would be to construct the error using our error constructors, which should do a good job with the redaction of the arguments, and then pass the error to the logging function, since the redaction will then be handled transitively. Same for the next stanza.
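A minimal sketch of the suggested pattern, assuming the cockroachdb/errors constructors and CRDB's util/log package; the function name, message text, and argument types are placeholders rather than the code under review:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/errors"
)

// warnTooOld shows the shape of the suggestion: build the error with an error
// constructor (so its arguments get redacted consistently), log the error
// value itself, and return it.
func warnTooOld(ctx context.Context, targetCV, minSupported roachpb.Version) error {
	// Pass the typed values directly (no .String() calls), so the redaction
	// machinery can see them.
	err := errors.Newf(
		"version %s is less than this binary's minimum supported version %s",
		targetCV, minSupported)
	// Logging the error value rather than a pre-formatted string lets
	// redaction be handled transitively.
	log.Warningf(ctx, "%v", err)
	return err
}
```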


pkg/server/tenant_migration.go line 69 at r2 (raw file):

		msg := fmt.Sprintf("sql pod %s is running a binary version %s which is "+
			"less than the attempted upgrade version %s",
			m.sqlServer.sqlIDContainer.SQLInstanceID().String(),

nit: don't call String here; it'll work against what I think you want to do regarding redaction.


pkg/upgrade/system_upgrade.go line 32 at r2 (raw file):

	// cluster. This is merely a convenience method and is not meant to be used
	// to infer cluster stability; for that, use UntilClusterStable.
	NumNodesOrTenantPods(ctx context.Context) (int, error)

I'm not up to date on the latest terminology. This is fine with me, but a bit of a mouthful. What about NumServers and ForEveryServer?

Code quote:

	// NumNodesOrTenantPods returns the number of nodes or tenant pods in the
	// cluster. This is merely a convenience method and is not meant to be used
	// to infer cluster stability; for that, use UntilClusterStable.
	NumNodesOrTenantPods(ctx context.Context) (int, error)

pkg/upgrade/upgradecluster/tenant_cluster.go line 132 at r2 (raw file):

	// Dialer allows for the construction of connections to other SQL pods.
	Dialer         NodeDialer
	InstanceReader *instancestorage.Reader

A delicate thing that I'm not so certain this PR deals with is the semantics of GetAllInstances. In particular, GetAllInstances uses a cache and can return stale results. What I think we want is to know the set of instances which might be live. I think this means we need a transactional mechanism to determine the set of instances. Even this isn't fundamentally safe.

0: podA starts up
1: podA begins migration protocol
2: podA determines it is the only pod, so it decides to bump the version gate
3: podB starts up and is using an old version (v1)
4: podB reads the cluster setting at v1
5: podA bumps the cluster version to v2
6: podA reads all the instances after the bump (assuming you've fixed the caching thing) and finds only itself
7: podB writes its instance entry and proceeds to serve traffic -- violating the invariant

I think that we need some way to synchronize the writing of the instance entry with the validation of the current version during startup.

One way we could deal with this would be to read the version from KV again after writing the instance row and make sure it's kosher. It's an extra RPC, but with the MR plans, it shouldn't be too painful.
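A sketch of that re-read variant, with the instance-write and version-read code passed in as callbacks; everything here is illustrative and not necessarily what the PR ended up doing:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/errors"
)

// registerInstanceAndValidate is a hypothetical startup-side check: after
// writing our instance row, re-read the cluster version from KV and bail out
// if this binary can no longer serve it.
func registerInstanceAndValidate(
	ctx context.Context,
	binaryVersion roachpb.Version,
	writeInstanceRow func(context.Context) error,
	readClusterVersion func(context.Context) (roachpb.Version, error),
) error {
	if err := writeInstanceRow(ctx); err != nil {
		return err
	}
	// Re-reading after the write closes the window in which a concurrent
	// upgrade could bump the version without having seen our instance row.
	storedVersion, err := readClusterVersion(ctx)
	if err != nil {
		return err
	}
	if binaryVersion.Less(storedVersion) {
		return errors.Newf("binary version %s cannot serve cluster version %s",
			binaryVersion, storedVersion)
	}
	return nil
}
```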


pkg/upgrade/upgradecluster/tenant_cluster.go line 190 at r2 (raw file):

			conn, err := t.Dialer.Dial(ctx, roachpb.NodeID(id), rpc.DefaultClass)
			if err != nil {
				if strings.Contains(err.Error(), "failed to connect") {

This is suspicious to me; is there no more structured way to get at this condition? Do you have a test which exercises this path so we can go and poke at the error structure?
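One structured alternative would be to check the gRPC status code instead of matching on the error string. Whether the dialer's errors actually carry a gRPC status here is an assumption that would need to be verified:

```go
package sketch

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isUnavailableErr is a hypothetical helper: rather than matching on the
// "failed to connect" substring, ask gRPC whether the error carries an
// Unavailable status code.
func isUnavailableErr(err error) bool {
	s, ok := status.FromError(err)
	return ok && s.Code() == codes.Unavailable
}
```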

@ajwerner
Contributor

I was speaking with @jeffswenson about a similar set of tenant upgrades, where synchronizing pods and versions was also relevant. He had the good idea that one way we can massively simplify this interlock is to read the version in the code that writes the instance row, and to read the instances in the transaction which writes the new version to the setting. It's much simpler.

@jeffswenson is going to be writing a helper to read the version with a *kv.Txn.

Concretely:

  1. Confirm that the version is new enough in the same transaction that writes the instance table.
    • This will require using the above mentioned helper.
    • If the version is newer than your binary version, die.
  2. Run the bumpClusterVersion as you would for the fence version, but have it keep track of the set of instances it communicated with.
  3. Write the fence version to the cluster setting (at least for the first fence version of a migration, there's some nuance with version skipping). In this transaction which writes that fence, read the set of instances which you bumped in the previous step. If the set of instances is the same, you're done. If it's not the same, go to step 2.
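A rough sketch of step 1, with the *kv.Txn-based version-reading helper mentioned above passed in as a callback (it did not exist yet at the time of this comment); all names here are illustrative, not the PR's actual code:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/errors"
)

// createInstanceRow sketches step 1: the version check and the instance-row
// write happen in the same transaction, so a pod can never register itself
// against a version its binary cannot serve.
func createInstanceRow(
	ctx context.Context,
	db *kv.DB,
	binaryVersion roachpb.Version,
	getClusterVersionWithTxn func(context.Context, *kv.Txn) (roachpb.Version, error),
	writeInstanceRow func(context.Context, *kv.Txn) error,
) error {
	return db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		v, err := getClusterVersionWithTxn(ctx, txn)
		if err != nil {
			return err
		}
		// If the stored cluster version is newer than this binary, die rather
		// than joining the cluster.
		if binaryVersion.Less(v) {
			return errors.Newf("binary version %s cannot join cluster at version %s",
				binaryVersion, v)
		}
		return writeInstanceRow(ctx, txn)
	})
}
```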

@ajstorm
Collaborator Author

ajstorm commented Jan 23, 2023

Clever idea. For this part:

In this transaction which writes that fence, read the set of instances which you bumped in the previous step. If the set of instances is the same, you're done. If it's not the same, go to step 2.

Instead of re-reading, could we not rely on read-set validation to do this for us? Revised protocol would look like this:

  1. Confirm that the version is new enough in the same transaction that writes the instance table.
    • This will require using the above mentioned helper.
    • If the version is newer than your binary version, die.
  2. Run the bumpClusterVersion as you would for the fence version~~, but have it keep track of the set of instances it communicated with~~.
  3. Re-read all instances of this tenant from the instances table. In the same transaction, write the fence version to the cluster setting (at least for the first fence version of a migration, there's some nuance with version skipping). ~~In this transaction which writes that fence, read the set of instances which you bumped in the previous step. If the set of instances is the same, you're done. If it's not the same,~~ Commit the transaction. If it succeeds, you're done. If you get a serializability error and the transaction aborts, go to step 2.

@ajwerner
Contributor

What if between step 2 and step 3, a new instance writes to the instances table?

@ajwerner
Contributor

To clarify the question, given the async nature of this exchange: if a new instance writes itself to the table, it will see the previous version and it will succeed (we haven't bumped the stored version to the fence version until step 3). Yes, in step 3 we'll see this new instance, but we won't know that it's new (right?) and so we'll just say that everything is fine. It was to address this that I suggested that we remember the set of instances which we had bumped.
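A sketch of the fence-write transaction with the "remember the bumped set" check described here; the helper callbacks, types, and exact comparison are illustrative only, not the PR's final code:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
)

// instanceID stands in for the real SQL instance ID type.
type instanceID int32

// writeFenceVersion sketches step 3: inside the transaction that writes the
// fence version, re-read the instances table and compare it against the set
// of instances bumped via RPC in step 2. If a new instance has appeared, the
// caller should go back to step 2.
func writeFenceVersion(
	ctx context.Context,
	db *kv.DB,
	bumped map[instanceID]struct{},
	readInstances func(context.Context, *kv.Txn) ([]instanceID, error),
	writeVersionSetting func(context.Context, *kv.Txn) error,
) (retry bool, err error) {
	err = db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		retry = false // reset in case the transaction restarts
		instances, err := readInstances(ctx, txn)
		if err != nil {
			return err
		}
		for _, id := range instances {
			if _, ok := bumped[id]; !ok {
				// Someone joined since step 2; don't write the fence yet.
				retry = true
				return nil
			}
		}
		return writeVersionSetting(ctx, txn)
	})
	return retry, err
}
```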

@ajstorm ajstorm requested a review from knz March 17, 2023 19:24
Collaborator Author

@ajstorm ajstorm left a comment

All comments addressed. RFAL.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I am asking to check that this loop indeed exits when the process that runs it is asked to shut down gracefully. Both standalone SQL pods and embedded secondary tenant SQL servers are subject to the same shutdown / quiescence logic, so if it works for one it will work for the other.

Right now, I do not have confidence that this loop won't keep running forever and prevent a graceful shutdown, because I don't see logic through which the ctx would get canceled in that case.

OK, testing here looks good. I did a modified version of what you suggested above:

  1. Modified the loop so that it just retried forever, printing "retrying forever" to the logs on every iteration.
  2. Modified TestTenantUpgrade so that it created a 3-node instead of a 1-node cluster.
  3. Added a goroutine in TestTenantUpgrade right before performing the upgrade which logs its presence, sleeps for 15 seconds, and then performs one of a few actions {stop the whole cluster, stop the node on which the tenant is running, stop the tenant (using a custom stopper), close the DB connection in which the upgrade is running}.

The findings are interesting. If we stop the cluster, the node, or the tenant, the loop exits. I think this covers the behaviour you were worried about, so I think the code is good as-is.

In the case where we just stop the connection that the upgrade is running in, we loop forever. I'm not sure whether this is good or bad, but it does confirm my earlier comment: without that custom stopper on SQL server creation, the SQL server instance won't get shut down.

Either way, I think as far as this PR is concerned, we don't have an issue here.
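For reference, the property being tested boils down to the retry loop selecting on ctx.Done(). A generic, self-contained sketch of that pattern (not the actual loop in tenant_cluster.go):

```go
package sketch

import (
	"context"
	"time"
)

// retryUntilDone keeps calling fn until it succeeds, backing off between
// attempts, and gives up as soon as ctx is canceled (e.g. because the server
// is shutting down and its stopper has quiesced the context).
func retryUntilDone(ctx context.Context, backoff time.Duration, fn func(context.Context) error) error {
	for {
		if err := fn(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Graceful shutdown: stop retrying and surface the cancellation.
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
}
```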


pkg/ccl/serverccl/tenant_migration_test.go line 239 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

OK, then you can use require.ErrorContains: https://github.com/stretchr/testify/blob/master/require/require.go#L328

Done.
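For reference, a minimal usage sketch of require.ErrorContains (the error value here is a stand-in for the dial error returned by the code under test):

```go
package sketch

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestDialError(t *testing.T) {
	// err stands in for the error returned by the code under test.
	err := errors.New("failed to connect to n2")
	// Fails the test if err is nil or its message lacks the substring.
	require.ErrorContains(t, err, "failed to connect")
}
```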

@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch from 8836298 to 3b1451c Compare March 17, 2023 19:47
Collaborator Author

@ajstorm ajstorm left a comment

Last force-push was just a rebase.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)

Contributor

@knz knz left a comment

Reviewed 7 of 20 files at r21, 10 of 12 files at r22, 2 of 2 files at r23, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @healthy-pod)


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

In the case where we just stop the connection that the upgrade is running in, we loop forever. I'm not sure if this is bad or good

I believe this is expected -- once the order is given to start upgrading, the cluster will continue to attempt to perform the upgrade until it can. This has always been the case. We could build a mechanism to interrupt that process if we find it important.

@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch from 3b1451c to d32fa23 Compare March 19, 2023 00:25
CRDB has an upgrade interlock between nodes to ensure that as a cluster is
upgrading, the node that is driving the upgrade keeps all other nodes in sync
(and prevents nodes with a down-level binary from joining the cluster). This
interlock mechanism hasn't existed for tenant pods during a tenant upgrade
until this commit.

The commit adds to SQL servers a similar interlock to the one used for nodes.
When a tenant pod begins upgrading, it will first confirm that all other
running tenant pods are on an acceptable binary version, and that the version
for the attempted upgrade is less than the binary version of all tenant pods
as well as greater than (or equal to) the minimum supported binary version.
Then, it will begin to run migrations (upgrades). After each migration, it
will push out the intermediate cluster version to all running tenant pods and
revalidate that the upgrade can continue.

It's worth noting that if a SQL server fails while we're in the process of
upgrading (or shortly before), the upgrade process will see the failed SQL
server instance, be unable to contact it via RPC, and fail the tenant upgrade.
The workaround for this problem is to wait until the SQL server is cleaned up
(by default, 10 minutes after it fails) and retry the tenant upgrade.

Epic: CRDB-18735

Release note: None
@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch from d32fa23 to 6577de4 Compare March 19, 2023 00:28
@ajstorm
Collaborator Author

ajstorm commented Mar 19, 2023

TFTR all!

bors r=knz,ajwerner

@knz
Contributor

knz commented Mar 19, 2023

I believe you meant

bors r=knz,ajwerner

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@healthy-pod
Contributor

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@healthy-pod
Contributor

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@healthy-pod
Contributor

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@ajstorm
Collaborator Author

ajstorm commented Mar 19, 2023

Should be good now that #98894 is in the queue.

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Already running a review

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@ajstorm
Collaborator Author

ajstorm commented Mar 19, 2023

Gonna give it one last try now that #98894 made it in.

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build succeeded:

@craig craig bot merged commit f07818f into cockroachdb:master Mar 19, 2023