[vtctld / wrangler] Extract some reparent methods out to functions for shared use between wrangler and VtctldServer #7434
…ons, with tests
This is for later code sharing between the legacy and new vtctld APIs.
This commit also contains a bugfix in `FindValidEmergencyReparentCandidates` / `(*Wrangler).findValidReparentCandidates` that causes a non-deterministic panic, depending on what order the `replicationStatusMap` is first iterated over in. The panic happens as follows:
Assume you have 2 tablets in the status map. One tablet has a GTID-based relay log position, and the other has a non GTID-based relay log position.
If the GTID tablet comes first, then `isGtidBased` gets set to a pointer to true, and when we come to the second, non-GTID tablet, we fall through to the 3rd case in the switch.
If the non-GTID tablet's position is zero, then we return an error, and everything's fine. _However_, if it's non-zero _and_ non-GTID, then we successfully make it through the first loop, and later on, we try to `FindErrantGTIDs`, because `*isGtidBased` is true. This is where the panic happens, because we give it a mix of `Mysql56GTIDSet` and non-`Mysql56GTIDSet` positions.
The fix here is to iterate through the full status map, and not break out early. We track whether we've seen _any_ GTID-based or non-GTID based positions, and fail early accordingly.
Signed-off-by: Andrew Mason <[email protected]>
…code Signed-off-by: Andrew Mason <[email protected]>
These test failures in wrangler/testlib look legit, and related to my logic change in
and
…urn the first error Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
Okay I've changed the strategy for the bugfix to use two booleans, one to keep track if we've ever seen a GTID-based relay log, and one to keep track if we've ever seen a non GTID-based relay log. Separately, we put potential errors onto a Then, we fail if:
Okay, @deepthi this should be ready for review now.
// FindValidEmergencyReparentCandidates will find candidates for an emergency
// reparent, and, if successful, return a mapping of those tablet aliases (as
// raw strings) to their replication positions for later comparison.
func FindValidEmergencyReparentCandidates(
I'd like to call particular attention to this function in review. It's the only function whose behavior I deliberately changed, after I managed to write a test case that sometimes passed and sometimes panicked. It's detailed in the PR description, but the tl;dr is that I believe a mixed set of GTID-based and non GTID-based tablets was not always handled safely, and could result in panics from `FindErrantGTIDs`, depending on the order of iteration the first time we scan through `statusMap` to set `*isGtidBased` in the original implementation.
cc @deepthi
Just a couple of nits, otherwise LGTM
masterStatus, err = tmc.DemoteMaster(groupCtx, tabletInfo.Tablet)
if err != nil {
msg := "replica %v think it's master but we failed to demote it"
nit: think -> thinks
}
if status.RelayLogPosition.IsZero() {
// Potentially bail. If any other tablet hits the non-default
It took me a minute to understand this, it isn't clear what non-default means. Can you rephrase?
Yep, sorry, this was written at the time when I was using a switch case with a default here, and I forgot to go back and update the comment. Will fix.
for _, tablet := range tabletMap {
switch {
case primaryCell != "" && tablet.Alias.Cell != primaryCell:
This bothered me at first, but I think we looked at it once before and said that by default PRS won't choose a tablet in another cell, but you can always provide it a chosen tablet which will be respected. So I think this is fine.
Yeah, this is a preservation of functionality. Maybe later we can make a vtctld startup flag to turn this off (I'd prefer to be able to PRS cross-cell without having to pick the tablet myself), but definitely far beyond the scope of this change.
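The default cell restriction discussed in this thread can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: the `tablet` struct and the `eligibleInCell` function are hypothetical stand-ins, and only the `primaryCell` check mirrors the `case` shown in the diff above. It does not model an explicitly chosen tablet.

```go
package main

import (
	"fmt"
	"sort"
)

// tablet is a hypothetical, simplified stand-in for a tablet record.
type tablet struct {
	alias string
	cell  string
}

// eligibleInCell returns the aliases of tablets that pass the cell filter.
// An empty primaryCell disables the filter, as in the case shown in the diff.
func eligibleInCell(tablets map[string]tablet, primaryCell string) []string {
	var out []string
	for _, t := range tablets {
		switch {
		case primaryCell != "" && t.cell != primaryCell:
			// Skip tablets outside the primary's cell.
			continue
		}
		out = append(out, t.alias)
	}
	// Sort so the result does not depend on map iteration order.
	sort.Strings(out)
	return out
}

func main() {
	tablets := map[string]tablet{
		"zone1-100": {alias: "zone1-100", cell: "zone1"},
		"zone2-200": {alias: "zone2-200", cell: "zone2"},
	}
	fmt.Println(eligibleInCell(tablets, "zone1")) // prints [zone1-100]
	fmt.Println(eligibleInCell(tablets, ""))      // prints [zone1-100 zone2-200]
}
```

A startup flag like the one floated above would amount to passing an empty `primaryCell` in this sketch.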
cc @PrismaPhonic in case you wanna take a look.
I'm confused about what real-world use case would cause us to have mixed GTID set types.
We don't support that. The change in implementation here is to detect it correctly and return an error. Previously, there were certain cases where we wouldn't catch it, call
Signed-off-by: Andrew Mason <[email protected]>
[vtctld / wrangler] Extract some reparent methods out to functions for shared use between wrangler and VtctldServer Signed-off-by: Richard Bailey <[email protected]>
Description
This PR extracts some reparent code from methods on the `Wrangler` struct to top-level functions, and adds tests for them. I'm doing this in advance to allow some code sharing between the legacy and new vtctld grpc APIs, to make those diffs smaller and less risky.
Important: This PR also contains a bugfix in `FindValidEmergencyReparentCandidates` (formerly `(*Wrangler).findValidReparentCandidates`) that would cause a non-deterministic panic, depending on what order the `replicationStatusMap` is first iterated over in. The panic happens as follows:
Assume you have 2 tablets in the status map. One tablet has a GTID-based relay log position, and the other has a non GTID-based relay log position.
If the GTID tablet comes first, then `isGtidBased` gets set to a pointer to true, and when we come to the second, non-GTID tablet, we fall through to the 3rd case in the switch (`case status.RelayPosition.IsZero()`).
If the non-GTID tablet's position is zero, then we return an error, and everything's fine. However, if it's non-zero and non GTID-based, then we successfully make it through the first loop, and later on, we try to `FindErrantGTIDs`, because `*isGtidBased` is true. This is where the panic happens, because we give it a mix of `Mysql56GTIDSet` and non-`Mysql56GTIDSet` positions.
The fix here is to iterate through the full status map, and not break out early. We track whether we've seen _any_ GTID-based or non-GTID based positions, and fail early accordingly.
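The idea behind the fix can be sketched in isolation. This is a minimal illustration under stated assumptions, not the implementation in this PR: the `position` type, `allGTIDBased`, and `errMixedPositions` are hypothetical names, the sketch treats any mix of position kinds as an error, and it leaves out the handling of zero positions.

```go
package main

import (
	"errors"
	"fmt"
)

// position is a hypothetical stand-in for a tablet's relay log position.
type position struct {
	gtidBased bool
}

// errMixedPositions is a hypothetical error for a mixed set of positions.
var errMixedPositions = errors.New("mixed GTID-based and non GTID-based relay log positions")

// allGTIDBased scans the whole map before deciding anything, so the result
// does not depend on the order in which the map is iterated.
func allGTIDBased(statuses map[string]position) (bool, error) {
	var sawGTID, sawNonGTID bool
	for _, pos := range statuses {
		if pos.gtidBased {
			sawGTID = true
		} else {
			sawNonGTID = true
		}
	}
	if sawGTID && sawNonGTID {
		// Assumption: a mixed set is unsupported and is rejected here,
		// before any GTID set comparison could be attempted.
		return false, errMixedPositions
	}
	return sawGTID, nil
}

func main() {
	mixed := map[string]position{
		"zone1-100": {gtidBased: true},
		"zone1-101": {gtidBased: false},
	}
	_, err := allGTIDBased(mixed)
	fmt.Println(err)
}
```

Because both flags are only inspected after the loop finishes, the mixed case returns the same error whichever tablet the map yields first.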
Related Issue(s)
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: