
multitenant: implement tenant upgrade interlock #94998

Merged
merged 1 commit into cockroachdb:master from the ajstorm-version-interlock branch on Mar 19, 2023

Conversation

ajstorm
Collaborator

@ajstorm ajstorm commented Jan 10, 2023

CRDB has an upgrade interlock between nodes to ensure that as a cluster is upgrading, the node that is driving the upgrade keeps all other nodes in sync (and prevents nodes with a down-level binary from joining the cluster). This interlock mechanism hasn't existed for tenant pods during a tenant upgrade until this commit.

The commit adds a similar interlock to the one used for nodes. When a tenant pod begins upgrading, it will first confirm that all other running tenant pods are on an acceptable binary version, and that the version for the attempted upgrade is less than the binary version of all tenant pods as well as greater than (or equal to) the minimum supported binary version. Then, it will begin to run migrations (upgrades). After each migration, it will push out the intermediate cluster version to all running tenant pods and revalidate that the upgrade can continue.
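As a rough illustration, that flow amounts to something like the sketch below. All names, hooks, and signatures here are hypothetical and only show the shape of the interlock; they are not the code in this PR.

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// interlockHooks bundles the operations the upgrade driver needs.
type interlockHooks struct {
	// validatePods checks that every running tenant pod can serve v: v must be
	// at most each pod's binary version and at least its minimum supported version.
	validatePods func(ctx context.Context, v roachpb.Version) error
	// runMigration runs the migration that brings the cluster to v.
	runMigration func(ctx context.Context, v roachpb.Version) error
	// pushVersion tells every running tenant pod to bump to v.
	pushVersion func(ctx context.Context, v roachpb.Version) error
}

// runTenantUpgrade validates up front, then after each migration pushes the
// intermediate version to all pods and revalidates, so a down-level pod
// cannot join mid-upgrade.
func runTenantUpgrade(ctx context.Context, steps []roachpb.Version, h interlockHooks) error {
	if len(steps) == 0 {
		return nil
	}
	target := steps[len(steps)-1]
	if err := h.validatePods(ctx, target); err != nil {
		return err
	}
	for _, v := range steps {
		if err := h.runMigration(ctx, v); err != nil {
			return err
		}
		if err := h.pushVersion(ctx, v); err != nil {
			return err
		}
		if err := h.validatePods(ctx, target); err != nil {
			return err
		}
	}
	return nil
}
```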

Epic: CRDB-18735

Release note: None

@ajstorm ajstorm requested review from knz, ajwerner, healthy-pod and a team January 10, 2023 14:50
@ajstorm ajstorm requested review from a team as code owners January 10, 2023 14:50
@ajstorm ajstorm requested a review from a team January 10, 2023 14:50
@ajstorm ajstorm requested a review from a team as a code owner January 10, 2023 14:50
@ajstorm ajstorm requested review from herkolategan and srosenberg and removed request for a team January 10, 2023 14:50
@cockroach-teamcity
Member

This change is Reviewable

@ajstorm ajstorm removed request for a team, herkolategan and srosenberg January 10, 2023 14:50
@ajstorm
Collaborator Author

ajstorm commented Jan 10, 2023

Still have to rebase, but wanted to get this on people's radars as it could take a bit of time to review.

I tried breaking it up into smaller commits, but there didn't seem to be an easy way to do so. The bulk of the changes, however, are in testing, so at first it may look like a larger change than it actually is. The interlock logic is relatively small and is mostly copied/massaged from the node interlock code (in pkg/upgrade/upgrademanager/manager.go).

@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch from 1978b99 to 286e55e Compare January 10, 2023 14:59
@ajstorm
Collaborator Author

ajstorm commented Jan 10, 2023

Nevermind - rebase was straightforward. This is ready for review now.

@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch 2 times, most recently from 9c72940 to d1264fb Compare January 16, 2023 22:11
Contributor

@ajwerner ajwerner left a comment

My concerns are around edge cases, but if we want to really rely on this, we should sort them out.

Reviewed 5 of 18 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @healthy-pod, and @knz)


pkg/server/tenant_migration.go line 62 at r2 (raw file):

			"binary's minimum supported version %s",
			targetCV, tenantVersion.BinaryMinSupportedVersion())
		log.Warningf(ctx, "%s", msg)

I think the better thing to do here would be to construct the error using our error constructors, which should do a good job with the redaction of the arguments, and then pass the error to the logging function, since the redaction will then be handled transitively. Same for the next stanza.
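A minimal sketch of the suggested pattern, assuming the cockroachdb/errors constructors and CRDB's util/log package; the function name, message text, and argument types are placeholders rather than the code under review:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/errors"
)

// warnTooOld shows the shape of the suggestion: build the error with an error
// constructor (so its arguments get redacted consistently), log the error
// value itself, and return it.
func warnTooOld(ctx context.Context, targetCV, minSupported roachpb.Version) error {
	// Pass the typed values directly (no .String() calls), so the redaction
	// machinery can see them.
	err := errors.Newf(
		"version %s is less than this binary's minimum supported version %s",
		targetCV, minSupported)
	// Logging the error value rather than a pre-formatted string lets
	// redaction be handled transitively.
	log.Warningf(ctx, "%v", err)
	return err
}
```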


pkg/server/tenant_migration.go line 69 at r2 (raw file):

		msg := fmt.Sprintf("sql pod %s is running a binary version %s which is "+
			"less than the attempted upgrade version %s",
			m.sqlServer.sqlIDContainer.SQLInstanceID().String(),

nit: don't call String here; it'll work against what I think you want to do regarding redaction.


pkg/upgrade/system_upgrade.go line 32 at r2 (raw file):

	// cluster. This is merely a convenience method and is not meant to be used
	// to infer cluster stability; for that, use UntilClusterStable.
	NumNodesOrTenantPods(ctx context.Context) (int, error)

I'm not up to date on the latest terminology. This is fine with me, but a bit of a mouthful. What about NumServers and ForEveryServer?

Code quote:

	// NumNodesOrTenantPods returns the number of nodes or tenant pods in the
	// cluster. This is merely a convenience method and is not meant to be used
	// to infer cluster stability; for that, use UntilClusterStable.
	NumNodesOrTenantPods(ctx context.Context) (int, error)

pkg/upgrade/upgradecluster/tenant_cluster.go line 132 at r2 (raw file):

	// Dialer allows for the construction of connections to other SQL pods.
	Dialer         NodeDialer
	InstanceReader *instancestorage.Reader

A delicate thing that I'm not so certain this PR deals with is the semantics of GetAllInstances. In particular, GetAllInstances uses a cache and can return stale results. What I think we want is to know the set of instances which might be live. I think this means we need a transactional mechanism to determine the set of instances. Even this isn't fundamentally safe.

0: podA starts up
1: podA begins migration protocol
2: podA determines it is the only pod, so it decides to bump the version gate
3: podB starts up and is using an old version (v1)
4: podB reads the cluster setting at v1
5: podA bumps the cluster version to v2
6: podA reads all the instances after the bump (assuming you've fixed the caching thing) and finds only itself
7: podB writes its instance entry and proceeds to serve traffic -- violating the invariant

I think that we need some way to synchronize the writing of the instance entry with the validation of the current version during startup.

One way we could deal with this would be to read the version from KV again after writing the instance row and make sure it's kosher. It's an extra RPC, but with the MR plans, it shouldn't be too painful.
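A sketch of that re-read variant, with the instance-write and version-read code passed in as callbacks; everything here is illustrative and not necessarily what the PR ended up doing:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/errors"
)

// registerInstanceAndValidate is a hypothetical startup-side check: after
// writing our instance row, re-read the cluster version from KV and bail out
// if this binary can no longer serve it.
func registerInstanceAndValidate(
	ctx context.Context,
	binaryVersion roachpb.Version,
	writeInstanceRow func(context.Context) error,
	readClusterVersion func(context.Context) (roachpb.Version, error),
) error {
	if err := writeInstanceRow(ctx); err != nil {
		return err
	}
	// Re-reading after the write closes the window in which a concurrent
	// upgrade could bump the version without having seen our instance row.
	storedVersion, err := readClusterVersion(ctx)
	if err != nil {
		return err
	}
	if binaryVersion.Less(storedVersion) {
		return errors.Newf("binary version %s cannot serve cluster version %s",
			binaryVersion, storedVersion)
	}
	return nil
}
```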


pkg/upgrade/upgradecluster/tenant_cluster.go line 190 at r2 (raw file):

			conn, err := t.Dialer.Dial(ctx, roachpb.NodeID(id), rpc.DefaultClass)
			if err != nil {
				if strings.Contains(err.Error(), "failed to connect") {

This is suspicious to me; is there no more structured way to get at this condition? Do you have a test which exercises this path so we can go and poke at the error structure?
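One structured alternative would be to check the gRPC status code instead of matching on the error string. Whether the dialer's errors actually carry a gRPC status here is an assumption that would need to be verified:

```go
package sketch

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isUnavailableErr is a hypothetical helper: rather than matching on the
// "failed to connect" substring, ask gRPC whether the error carries an
// Unavailable status code.
func isUnavailableErr(err error) bool {
	s, ok := status.FromError(err)
	return ok && s.Code() == codes.Unavailable
}
```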

@ajwerner
Contributor

I was speaking with @jeffswenson about a similar set of tenant upgrades, where synchronizing pods and versions was also relevant. He had the good idea that one way we can massively simplify this interlock is to read the version in the code that writes the instance row, and to read the instances in the transaction which writes the new version to the setting. It's much simpler.

@jeffswenson is going to be writing a helper to read the version with a *kv.Txn.

Concretely:

  1. Confirm that the version is new enough in the same transaction that writes the instance table.
    • This will require using the above mentioned helper.
    • If the version is newer than your binary version, die.
  2. Run the bumpClusterVersion as you would for the fence version, but have it keep track of the set of instances it communicated with.
  3. Write the fence version to the cluster setting (at least for the first fence version of a migration, there's some nuance with version skipping). In this transaction which writes that fence, read the set of instances which you bumped in the previous step. If the set of instances is the same, you're done. If it's not the same, go to step 2.
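A rough sketch of step 1, with the *kv.Txn-based version-reading helper mentioned above passed in as a callback (it did not exist yet at the time of this comment); all names here are illustrative, not the PR's actual code:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/errors"
)

// createInstanceRow sketches step 1: the version check and the instance-row
// write happen in the same transaction, so a pod can never register itself
// against a version its binary cannot serve.
func createInstanceRow(
	ctx context.Context,
	db *kv.DB,
	binaryVersion roachpb.Version,
	getClusterVersionWithTxn func(context.Context, *kv.Txn) (roachpb.Version, error),
	writeInstanceRow func(context.Context, *kv.Txn) error,
) error {
	return db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		v, err := getClusterVersionWithTxn(ctx, txn)
		if err != nil {
			return err
		}
		// If the stored cluster version is newer than this binary, die rather
		// than joining the cluster.
		if binaryVersion.Less(v) {
			return errors.Newf("binary version %s cannot join cluster at version %s",
				binaryVersion, v)
		}
		return writeInstanceRow(ctx, txn)
	})
}
```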

@ajstorm
Collaborator Author

ajstorm commented Jan 23, 2023

Clever idea. For this part:

In this transaction which writes that fence, read the set of instances which you bumped in the previous step. If the set of instances is the same, you're done. If it's not the same, go to step 2.

Instead of re-reading, could we not rely on read-set validation to do this for us? Revised protocol would look like this:

  1. Confirm that the version is new enough in the same transaction that writes the instance table.
    • This will require using the above mentioned helper.
    • If the version is newer than your binary version, die.
  2. Run the bumpClusterVersion as you would for the fence version~~, but have it keep track of the set of instances it communicated with~~.
  3. Re-read all instances of this tenant from the instances table. In the same transaction, write the fence version to the cluster setting (at least for the first fence version of a migration, there's some nuance with version skipping). ~~In this transaction which writes that fence, read the set of instances which you bumped in the previous step. If the set of instances is the same, you're done. If it's not the same,~~ Commit the transaction. If it succeeds, you're done. If you get a serializability error and the transaction aborts, go to step 2.

@ajwerner
Contributor

What if between step 2 and step 3, a new instance writes to the instances table?

@ajwerner
Contributor

To clarify the question, given the async nature of this exchange: if a new instance writes itself to the table, it will see the previous version and it will succeed (we haven't bumped the stored version to the fence version until step 3). Yes, in step 3 we'll see this new instance, but we won't know that it's new (right?) and so we'll just say that everything is fine. It was to address this that I suggested that we remember the set of instances which we had bumped.
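A sketch of the fence-write transaction with the "remember the bumped set" check described here; the helper callbacks, types, and exact comparison are illustrative only, not the PR's final code:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
)

// instanceID stands in for the real SQL instance ID type.
type instanceID int32

// writeFenceVersion sketches step 3: inside the transaction that writes the
// fence version, re-read the instances table and compare it against the set
// of instances bumped via RPC in step 2. If a new instance has appeared, the
// caller should go back to step 2.
func writeFenceVersion(
	ctx context.Context,
	db *kv.DB,
	bumped map[instanceID]struct{},
	readInstances func(context.Context, *kv.Txn) ([]instanceID, error),
	writeVersionSetting func(context.Context, *kv.Txn) error,
) (retry bool, err error) {
	err = db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		retry = false // reset in case the transaction restarts
		instances, err := readInstances(ctx, txn)
		if err != nil {
			return err
		}
		for _, id := range instances {
			if _, ok := bumped[id]; !ok {
				// Someone joined since step 2; don't write the fence yet.
				retry = true
				return nil
			}
		}
		return writeVersionSetting(ctx, txn)
	})
	return retry, err
}
```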

@ajstorm ajstorm requested a review from knz March 17, 2023 19:24
Collaborator Author

@ajstorm ajstorm left a comment

All comments addressed. RFAL.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I am asking to check that this loop indeed exits when the process that runs it is asked to shut down gracefully. Both standalone SQL pods and embedded secondary tenant SQL servers are subject to the same shutdown / quiescence logic, so if it works for one it will work for the other.

Right now, I do not have confidence that this loop won't keep running forever and prevent a graceful shutdown, because I don't see logic through which the ctx would get canceled in that case.

OK, testing here looks good. I did a modified version of what you suggested above:

  1. Modified the loop so that it just retried forever, printing "retrying forever" to the logs on every iteration.
  2. Modified TestTenantUpgrade so that it created a 3-node instead of a 1-node cluster.
  3. Added a goroutine in TestTenantUpgrade right before performing the upgrade which logs its presence, sleeps for 15 seconds, and then performs one of a few actions {stop the whole cluster, stop the node on which the tenant is running, stop the tenant (using a custom stopper), close the DB connection in which the upgrade is running}.

The findings are interesting. If we stop the cluster, the node, or the tenant, the loop exits. I think this covers the behaviour you were worried about, so I think the code is good as-is.

In the case where we just stop the connection that the upgrade is running in, we loop forever. I'm not sure whether this is good or bad, but it does confirm my earlier comment: without that custom stopper on SQL server creation, the SQL server instance won't get shut down.

Either way, I think as far as this PR is concerned, we don't have an issue here.
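For reference, the property being tested boils down to the retry loop selecting on ctx.Done(). A generic, self-contained sketch of that pattern (not the actual loop in tenant_cluster.go):

```go
package sketch

import (
	"context"
	"time"
)

// retryUntilDone keeps calling fn until it succeeds, backing off between
// attempts, and gives up as soon as ctx is canceled (e.g. because the server
// is shutting down and its stopper has quiesced the context).
func retryUntilDone(ctx context.Context, backoff time.Duration, fn func(context.Context) error) error {
	for {
		if err := fn(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Graceful shutdown: stop retrying and surface the cancellation.
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
}
```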


pkg/ccl/serverccl/tenant_migration_test.go line 239 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

OK, then you can use require.ErrorContains: https://github.com/stretchr/testify/blob/master/require/require.go#L328

Done.
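For reference, a minimal usage sketch of require.ErrorContains (the error value here is a stand-in for the dial error returned by the code under test):

```go
package sketch

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestDialError(t *testing.T) {
	// err stands in for the error returned by the code under test.
	err := errors.New("failed to connect to n2")
	// Fails the test if err is nil or its message lacks the substring.
	require.ErrorContains(t, err, "failed to connect")
}
```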

@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch from 8836298 to 3b1451c Compare March 17, 2023 19:47
Collaborator Author

@ajstorm ajstorm left a comment

Last force-push was just a rebase.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)

Contributor

@knz knz left a comment

Reviewed 7 of 20 files at r21, 10 of 12 files at r22, 2 of 2 files at r23, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @healthy-pod)


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

In the case where we just stop the connection that the upgrade is running in, we loop forever. I'm not sure if this is bad or good

I believe this is expected -- once the order is given to start upgrading, the cluster will continue to attempt to perform the upgrade until it can. This has always been the case. We could build a mechanism to interrupt that process if we find it important.

@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch from 3b1451c to d32fa23 Compare March 19, 2023 00:25
CRDB has an upgrade interlock between nodes to ensure that as a cluster is
upgrading, the node that is driving the upgrade keeps all other nodes in sync
(and prevents nodes with a down-level binary from joining the cluster). This
interlock mechanism hasn't existed for tenant pods during a tenant upgrade
until this commit.

The commit adds to SQL servers a similar interlock to the one used for nodes.
When a tenant pod begins upgrading, it will first confirm that all other
running tenant pods are on an acceptable binary version, and that the version
for the attempted upgrade is less than the binary version of all tenant pods
as well as greater than (or equal to) the minimum supported binary version.
Then, it will begin to run migrations (upgrades). After each migration, it
will push out the intermediate cluster version to all running tenant pods and
revalidate that the upgrade can continue.

It's worth noting that if a SQL server fails while we're in the process of
upgrading (or shortly before), the upgrade process will see the failed SQL
server instance, be unable to contact it via RPC, and fail the tenant upgrade.
The workaround for this problem is to wait until the SQL server is cleaned up
(by default, 10 minutes after it fails) and retry the tenant upgrade.

Epic: CRDB-18735

Release note: None
@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch from d32fa23 to 6577de4 Compare March 19, 2023 00:28
@ajstorm
Collaborator Author

ajstorm commented Mar 19, 2023

TFTR all!

bors r=knz,ajwerner

@knz
Contributor

knz commented Mar 19, 2023

I believe you meant

bors r=knz,ajwerner

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@healthy-pod
Contributor

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@healthy-pod
Contributor

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@healthy-pod
Contributor

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@ajstorm
Collaborator Author

ajstorm commented Mar 19, 2023

Should be good now that #98894 is in the queue.

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Already running a review

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@ajstorm
Collaborator Author

ajstorm commented Mar 19, 2023

Gonna give it one last try now that #98894 made it in.

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build succeeded:

@craig craig bot merged commit f07818f into cockroachdb:master Mar 19, 2023