multitenant: implement tenant upgrade interlock #94998
Conversation
Still have to rebase, but wanted to get this on people's radars as it could take a bit of time to review. I tried breaking it up into smaller commits, but there didn't seem to be an easy way to do so. The bulk of the changes, however, is in testing, so at first it may look like a larger change than it actually is. The interlock logic is relatively small, and is mostly copied/massaged from the node interlock code.
Force-pushed from 1978b99 to 286e55e
Nevermind - rebase was straightforward. This is ready for review now.
Force-pushed from 9c72940 to d1264fb
My concerns are around edge cases, but if we want to really rely on this, we should sort them out.
Reviewed 5 of 18 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @healthy-pod, and @knz)
pkg/server/tenant_migration.go
line 62 at r2 (raw file):
"binary's minimum supported version %s", targetCV, tenantVersion.BinaryMinSupportedVersion()) log.Warningf(ctx, "%s", msg)
I think the better thing to do here would be to construct the error using our error constructors, which should do a good job with the redaction of the arguments, and then pass the error to the logging function, because the redaction will then be handled transitively. Same for the next stanza.
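For illustration, a minimal sketch of that shape, assuming the roachpb, cockroachdb/errors, and util/log packages already used in this file (the helper name and exact message are hypothetical, not the PR's code):

// validateTargetVersion is a hypothetical helper: build the error with an
// error constructor so its arguments go through redaction, then pass the
// error value itself to the logging call.
func validateTargetVersion(ctx context.Context, targetCV, minSupported roachpb.Version) error {
	if targetCV.Less(minSupported) {
		err := errors.Newf(
			"target cluster version %s less than binary's minimum supported version %s",
			targetCV, minSupported)
		log.Warningf(ctx, "%v", err)
		return err
	}
	return nil
}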
pkg/server/tenant_migration.go
line 69 at r2 (raw file):
msg := fmt.Sprintf("sql pod %s is running a binary version %s which is "+
	"less than the attempted upgrade version %s",
	m.sqlServer.sqlIDContainer.SQLInstanceID().String(),
nit: don't call String here, it'll work against what I think you want to do regarding redaction.
pkg/upgrade/system_upgrade.go
line 32 at r2 (raw file):
// cluster. This is merely a convenience method and is not meant to be used
// to infer cluster stability; for that, use UntilClusterStable.
NumNodesOrTenantPods(ctx context.Context) (int, error)
I'm not up to date on the latest terminology. This is fine with me but a bit of a mouthful. What about NumServers and ForEveryServer?
Code quote:
// NumNodesOrTenantPods returns the number of nodes or tenant pods in the
// cluster. This is merely a convenience method and is not meant to be used
// to infer cluster stability; for that, use UntilClusterStable.
NumNodesOrTenantPods(ctx context.Context) (int, error)
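For concreteness, a sketch of what that rename might look like (illustrative only; the method set and the MigrationClient-based signature are assumptions modeled on the node-level interface, not a settled proposal):

type Cluster interface {
	// NumServers returns the number of KV nodes or SQL servers in the
	// cluster. Convenience only; use UntilClusterStable to reason about
	// cluster stability.
	NumServers(ctx context.Context) (int, error)

	// ForEveryServer runs fn against every node or SQL server in the cluster.
	ForEveryServer(
		ctx context.Context, op string,
		fn func(context.Context, serverpb.MigrationClient) error,
	) error
}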
pkg/upgrade/upgradecluster/tenant_cluster.go
line 132 at r2 (raw file):
// Dialer allows for the construction of connections to other SQL pods.
Dialer         NodeDialer
InstanceReader *instancestorage.Reader
A delicate thing that I'm not so certain this PR deals with is the semantics of GetAllInstances. In particular, GetAllInstances uses a cache and can return stale results. What I think we want is to know the set of instances which might be live. I think this means we need a transactional mechanism to determine the set of instances. Even this isn't fundamentally safe:
0: podA starts up
1: podA begins migration protocol
2: podA determines it is the only pod, so it decides to bump the version gate
3: podB starts up and is using an old version (v1)
4: podB reads the cluster setting at v1
5: podA bumps the cluster version to v2
6: podA reads all the instances after the bump (assuming you've fixed the caching thing) and finds only itself
7: podB writes its instance entry and proceeds to serve traffic -- violating the invariant
I think that we need some way to synchronize the writing of the instance entry with the validation of the current version during startup.
One way we could deal with this would be to read the version from KV again after writing the instance row and making sure it's kosher. It's an extra RPC, but with the MR plans, it shouldn't be too painful.
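To make the suggested ordering concrete, a rough sketch (all names are hypothetical; the closures stand in for whatever instancestorage and settings plumbing the PR actually uses):

// startSQLInstance illustrates the proposed ordering: write the instance row
// first, then re-read the cluster version from KV and refuse to serve if the
// version has already moved past what this binary supports. The re-read
// closes the window in which podB registers itself while podA concludes that
// no such pod exists.
func startSQLInstance(
	ctx context.Context,
	writeInstanceRow func(context.Context) error,
	readClusterVersion func(context.Context) (roachpb.Version, error),
	binaryVersion roachpb.Version,
) error {
	if err := writeInstanceRow(ctx); err != nil {
		return err
	}
	v, err := readClusterVersion(ctx) // the extra RPC back to KV
	if err != nil {
		return err
	}
	if binaryVersion.Less(v) {
		return errors.Newf(
			"binary version %s cannot join cluster at version %s", binaryVersion, v)
	}
	return nil
}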
pkg/upgrade/upgradecluster/tenant_cluster.go
line 190 at r2 (raw file):
conn, err := t.Dialer.Dial(ctx, roachpb.NodeID(id), rpc.DefaultClass)
if err != nil {
	if strings.Contains(err.Error(), "failed to connect") {
This is suspicious to me; is there no more structured way to get at this condition? Do you have a test which exercises this path so we can go and poke at the error structure?
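If the dial error is (or wraps) a gRPC status error, something along these lines (using google.golang.org/grpc/status and codes) would be more structured than matching the message text; whether the dialer actually surfaces a status error is an assumption that the requested test would confirm:

// failedToConnect is a sketch of a more structured check. status.Code returns
// codes.Unknown when err carries no gRPC status, so this only matches genuine
// transport-level unavailability.
func failedToConnect(err error) bool {
	return status.Code(err) == codes.Unavailable
}

The call site above could then test failedToConnect(err) instead of inspecting err.Error().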
I was speaking with @jeffswenson about a similar set of tenant upgrades. Synchronizing pods and versions was relevant. He had the good idea that one way we can massively simplify this interlock is if we read the version in the code that writes the instance row and we read the instances in the transaction which writes the new version to the setting. It's much simpler. @jeffswenson is going to be writing a helper to read the version. Concretely:
Clever idea. For this part:
Instead of re-reading, could we not rely on read-set validation to do this for us? Revised protocol would look like this:
What if between step 2 and step 3, a new instance writes to the instances table?
To clarify the question given the async nature of this exchange: if a new instance writes itself to the table, it will see the previous version and it will succeed (we haven't bumped the stored version to the fence version until step 3). Yes, in step 3 we'll see this new instance, but we won't know that it's new (right?) and so we'll just say that everything is fine. It was to address this that I suggested that we remember the set of instances which we had bumped.
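A very rough sketch of the cross-read idea under discussion, expressed against *kv.DB transactions (names and closures are placeholders; the elided protocol steps quoted above are not reproduced here):

// registerInstance reads the stored version inside the same transaction that
// writes the instance row; the symmetric side would read the instances table
// inside the transaction that bumps the stored version. Each transaction
// observes what the other writes, so serializable isolation prevents both
// from committing on stale views.
func registerInstance(
	ctx context.Context, db *kv.DB,
	readVersion func(context.Context, *kv.Txn) (roachpb.Version, error),
	writeInstanceRow func(context.Context, *kv.Txn) error,
	binaryVersion roachpb.Version,
) error {
	return db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		v, err := readVersion(ctx, txn)
		if err != nil {
			return err
		}
		if binaryVersion.Less(v) {
			return errors.Newf(
				"binary version %s too old for cluster version %s", binaryVersion, v)
		}
		return writeInstanceRow(ctx, txn)
	})
}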
All comments addressed. RFAL.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)
pkg/upgrade/upgradecluster/tenant_cluster.go
line 276 at r12 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I am asking to check that this loop indeed exits when the process that runs it is asked to shut down gracefully. Both standalone SQL pods and embedded secondary tenant SQL servers are subject to the same shutdown / quiescence logic, so if it works for one it will work for the other.
Right now, I do not have confidence that this loop won't keep running forever and prevent a graceful shutdown, because I don't see logic through which the ctx would get canceled in that case.
OK, testing here seems to be OK. I did a modified version of what you suggested above:
- Modified the loop so that it just retried forever and every time through the loop printed "retrying forever" to the logs.
- Modified TestTenantUpgrade so that it created a 3-node instead of a 1-node cluster.
- Added a goroutine in TestTenantUpgrade right before performing the upgrade which logs its presence to the logs, sleeps for 15 seconds, and then performs one of a few actions: stop the whole cluster, stop the node on which the tenant is running, stop the tenant (using a custom stopper), or close the DB connection in which the upgrade is running.
The findings are interesting. If we stop the cluster, the node, or the tenant, the loop exits. I think this is the behaviour that you were worried about, so I think the code is good as-is.
In the case where we just stop the connection that the upgrade is running in, we loop forever. I'm not sure if this is bad or good, but it does indicate that my earlier comment is true - that without that custom stopper on SQL server creation, the SQL server instance won't get shut down.
Either way, I think as far as this PR is concerned, we don't have an issue here.
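For reference, the shape of loop being exercised here, reduced to a sketch (tryBumpClusterVersion and retryInterval are placeholders): as long as each retry waits on ctx.Done(), stopping the cluster, the node, or the tenant cancels the context via the stopper and the loop exits.

for {
	if err := tryBumpClusterVersion(ctx); err == nil {
		return nil
	}
	select {
	case <-ctx.Done():
		// Graceful shutdown: the stopper cancels ctx, so we stop retrying.
		return ctx.Err()
	case <-time.After(retryInterval):
		// Transient failure: try again.
	}
}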
pkg/ccl/serverccl/tenant_migration_test.go
line 239 at r12 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
ok then you can use require.ErrorContains: https://github.com/stretchr/testify/blob/master/require/require.go#L328
Done.
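For reference, the resulting assertion reads like this (the call under test and the substring are placeholders, not the test's actual code):

// require.ErrorContains fails the test unless err is non-nil and its message
// contains the given substring.
err := startTenantAtVersion(ctx, v) // hypothetical call under test
require.ErrorContains(t, err, "is less than the attempted upgrade version")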
Force-pushed from 8836298 to 3b1451c
Last forced push was just a rebase.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)
Reviewed 7 of 20 files at r21, 10 of 12 files at r22, 2 of 2 files at r23, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @healthy-pod)
pkg/upgrade/upgradecluster/tenant_cluster.go
line 276 at r12 (raw file):
In the case where we just stop the connection that the upgrade is running in, we loop forever. I'm not sure if this is bad or good
I believe this is expected -- once the order is given to start upgrading, the cluster will continue to attempt to perform the upgrade until it can. This has always been the case. We could build a mechanism to interrupt that process if we find it important.
Force-pushed from 3b1451c to d32fa23
CRDB has an upgrade interlock between nodes to ensure that as a cluster is upgrading, the node that is driving the upgrade keeps all other nodes in sync (and prevents nodes with a down-level binary from joining the cluster). This interlock mechanism hasn't existed for tenant pods during a tenant upgrade until this commit.

The commit adds to SQL servers a similar interlock to the one used for nodes. When a tenant pod begins upgrading, it will first confirm that all other running tenant pods are on an acceptable binary version and that the version for the attempted upgrade is less than the binary version of all tenant pods, as well as greater than (or equal to) the minimum supported binary version. Then, it will begin to run migrations (upgrades). After each migration, it will push out the intermediate cluster version to all running tenant pods and revalidate that the upgrade can continue.

It's worth noting that if a SQL server fails while we're in the process of upgrading (or shortly before), the upgrade process will see the failed SQL server instance, be unable to contact it via RPC, and fail the tenant upgrade. The workaround for this problem is to wait until the SQL server is cleaned up (by default, 10 minutes after it fails) and retry the tenant upgrade.

Epic: CRDB-18735
Release note: None
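To summarize the per-server check described above as code (a sketch only; the names are illustrative and the real validation lives in the tenant migration server):

// canUpgradeTo reports whether a SQL server can accept a proposed (possibly
// intermediate) cluster version: the proposal must not exceed the server's
// binary version and must be at least its minimum supported version.
func canUpgradeTo(proposed, binary, binaryMinSupported roachpb.Version) error {
	if binary.Less(proposed) {
		return errors.Newf(
			"sql server binary version %s is less than the attempted upgrade version %s",
			binary, proposed)
	}
	if proposed.Less(binaryMinSupported) {
		return errors.Newf(
			"attempted upgrade version %s is less than the binary's minimum supported version %s",
			proposed, binaryMinSupported)
	}
	return nil
}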
Force-pushed from d32fa23 to 6577de4
TFTR all! bors r=knz,ajwerner
I believe you meant bors r=knz,ajwerner
Build failed:
bors retry
Build failed:
bors retry
Build failed:
bors retry
Build failed:
Should be good now that #98894 is in the queue. bors retry
Already running a review
Build failed:
Gonna give it one last try now that #98894 made it in. bors retry
Build succeeded:
CRDB has an upgrade interlock between nodes to ensure that as a cluster is upgrading, the node that is driving the upgrade keeps all other nodes in sync (and prevents nodes with a down-level binary from joining the cluster). This interlock mechanism hasn't existed for tenant pods during a tenant upgrade until this commit.
The commit adds a similar interlock to the one used for nodes. When a tenant pod begins upgrading it will first confirm that all other running tenant pods are on an acceptable binary version and that the version for the attempted upgrade is less than the binary version of all tenant pods as well as greater than (or equal to) the minimum supported binary version. Then, it will begin to run migrations (upgrades). After each migration, it will push out the intermediate cluster version to all running tenant pods and revalidate that the upgrade can continue.
Epic: CRDB-18735
Release note: None