Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[vtctld / wrangler] Extract some reparent methods out to functions for shared use between wrangler and VtctldServer #7434

Merged
merged 5 commits into from
Feb 4, 2021

Conversation

ajm188
Copy link
Contributor

@ajm188 ajm188 commented Feb 3, 2021

Description

This PR extracts some reparent code from methods on the Wrangler struct to top-level functions, and adds tests for them. I'm doing this in advance to allow some code sharing between the legacy and new vtctld grpc APIs, to make those diffs smaller and less risky.

Important: This PR also contains a bugfix in FindValidEmergencyReparentCandidates (formerly (*Wrangler).findValidReparentCandidates) that would cause a non-deterministic panic, depending on what order the
replicationStatusMap is first iterated over in. The panic happens as follows:

Assume you have 2 tablets in the status map. One tablet has a GTID-based relay log position, and the other has a non GTID-based relay log position.

If the GTID tablet comes first, then isGtidBased gets set to a pointer to true, and when we come to the second, non-GTID tablet, we fall through to the 3rd case in the switch (case status.RelayPosition.IsZero()).

If the non-GTID tablet's position is zero, then we return an error, and everything's fine. However, if it's non-zero and non GTID-based, then we successfully make it through the first loop, and later on, we try to FindErrantGTIDs, because *isGtidBased is true. This is where the panic happens, because we give it a mix of Mysql56GTIDSet and non-Mysql56GTIDSet positions.


The fix here is to iterate through the full status map, and not break out early. We track whether we've seen _any_ GTID-based or non-GTID based positions, and fail early accordingly.

Related Issue(s)

Checklist

  • Should this PR be backported? No
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

…ons, with tests

This is for later code sharing between the legacy and new vtctld APIs.

This commit also contains a bugfix in `FindValidEmergencyReparentCandidates` / `(*Wrangler).findValidReparentCandidates`
that causes a non-deterministic panic, depending on what order the
`replicationStatusMap` is first iterated over in. The panic happens as
follows:

Assume you have 2 tablets in the status map. One tablet has a GTID-based
relay log position, and the other has a non GTID-based relay log position.

If the GTID tablet comes first, then `isGtidBased` gets set to a pointer
to true, and when we come to the second, non-GTID tablet, we fall through
to the 3rd case in the switch.

If the non-GTID tablet's position is zero, then we return an error, and
everything's fine. _However_, if it's non-zero _and_ non-GTID, then we
successfully make it through the first loop, and later on, we try to
`FindErrantGTIDs`, because `*isGtidBased` is true. This is where the panic
happens, because we give it a mix of `Mysql56GTIDSet` and
non-`Mysql56GTIDSet` positions.

<hr>

The fix here is to iterate through the full status map, and not break
out early. We track whether we've seen _any_ GTID-based or non-GTID
based positions, and fail early accordingly.

Signed-off-by: Andrew Mason <[email protected]>
@ajm188 ajm188 requested a review from deepthi February 3, 2021 02:54
@ajm188
Copy link
Contributor Author

ajm188 commented Feb 3, 2021

These test failures in wrangler/testlib look legit, and related to my logic change in StopReplicationAndBuildStatusMaps; I'll look into them today:

--- FAIL: TestEmergencyReparentShard (0.03s)
Error:     vtctl_pipe.go:100: I0203 03:03:58.720256 replication.go:170] getting replication position from cell1-0000000001
Error:     vtctl_pipe.go:100: I0203 03:03:58.720358 replication.go:170] getting replication position from cell2-0000000003
Error:     vtctl_pipe.go:100: I0203 03:03:58.720879 replication.go:170] getting replication position from cell1-0000000002
Error:     vtctl_pipe.go:100: I0203 03:03:58.721351 replication.go:170] getting replication position from cell1-0000000000
Error:     vtctl_pipe.go:100: W0203 03:03:58.727724 replication.go:194] failed to get replication status from cell1-0000000000: rpc error: code = Unknown desc = TabletManager.StopReplicationAndGetStatus on cell1-0000000000 error: before status failed: no replication status: before status failed: no replication status
    emergency_reparent_shard_test.go:174: 
        	Error Trace:	emergency_reparent_shard_test.go:174
        	Error:      	Received unexpected error:
        	            	rpc error: code = Unknown desc = we got a filled-in relay log position, but it's not of type Mysql56GTIDSet, even thought we've determined we need to use GTID based assesment
        	Test:       	TestEmergencyReparentShard

and

--- FAIL: TestEmergencyReparentShardMasterElectNotBest (0.02s)
    emergency_reparent_shard_test.go:280: 
        	Error Trace:	emergency_reparent_shard_test.go:280
        	Error:      	"we got a filled-in relay log position, but it's not of type Mysql56GTIDSet, even thought we've determined we need to use GTID based assesment" does not contain "is not fully caught up"
        	Test:       	TestEmergencyReparentShardMasterElectNotBest

@ajm188
Copy link
Contributor Author

ajm188 commented Feb 3, 2021

Okay I've changed the strategy for the bugfix to use two booleans, one to keep track if we've ever seen a GTID-based relay log, and one to keep track if we've ever seen a non GTID-based relay log. Separately, we put potential errors onto a FirstErrorRecorder if we see any empty, non GTID-based relay log positions.

Then, we fail if:

  • we've seen any tablets with both GTID-based and non GTID-based relay logs or
  • we've seen a tablet with an empty relay log position as well as another tablet with GTID-based relay log positions
> go test -v -race ./go/vt/vtctl/reparentutil/... ./go/vt/wrangler/testlib/... 
ok      vitess.io/vitess/go/vt/vtctl/reparentutil       0.983s
ok      vitess.io/vitess/go/vt/wrangler/testlib 15.413s

Copy link
Contributor Author

@ajm188 ajm188 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, @deepthi this should be ready for review now.

// FindValidEmergencyReparentCandidates will find candidates for an emergency
// reparent, and, if successful, return a mapping of those tablet aliases (as
// raw strings) to their replication positions for later comparison.
func FindValidEmergencyReparentCandidates(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to call particular attention to this function in review. It's the only function I deliberately changed the behavior of, when I managed to write a test case that sometimes passed, and sometimes panicked. It's detailed in the PR description, but the tl;dr is that is I believe a mixed set of GTID-based and non GTID-based tablets were not always handled safely, and could result in panics from FindErrantGTIDs, depending on the order of iteration the first time we scan through statusMap to set *isGtidBased in the original implementation.

cc @deepthi

Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of nits, otherwise LGTM


masterStatus, err = tmc.DemoteMaster(groupCtx, tabletInfo.Tablet)
if err != nil {
msg := "replica %v think it's master but we failed to demote it"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: think -> thinks

}

if status.RelayLogPosition.IsZero() {
// Potentially bail. If any other tablet hits the non-default
Copy link
Member

@deepthi deepthi Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a minute to understand this, it isn't clear what non-default means. Can you rephrase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sorry, this was written at the time when I was using a switch case with a default here, and I forgot to go back and update the comment. Will fix.


for _, tablet := range tabletMap {
switch {
case primaryCell != "" && tablet.Alias.Cell != primaryCell:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bothered me at first, but I think we looked at it once before and said that by default PRS won't choose a tablet in another cell, but you can always provide it a chosen tablet which will be respected. So I think this is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a preservation of functionality. Maybe later we can make a vtctld startup flag to turn this off (I'd prefer to be able to PRS cross-cell without having to pick the tablet myself), but definitely far beyond the scope of this change.

@deepthi
Copy link
Member

deepthi commented Feb 4, 2021

cc @PrismaPhonic in case you wanna take a look.

@PrismaPhonic
Copy link
Contributor

I'm confused by what real world use case would cause us to have mixed gtidset types.

@ajm188
Copy link
Contributor Author

ajm188 commented Feb 4, 2021

I'm confused by what real world use case would cause us to have mixed gtidset types.

We don't support that. The change in implementation here is to detect it correctly and return an error. Previously, there were certain cases where we wouldn't catch it, call status.FindErrantGTIDs with the mixed types, which would cause a panic.

@deepthi deepthi merged commit b1e4df8 into vitessio:master Feb 4, 2021
@askdba askdba added this to the v10.0 milestone Feb 8, 2021
@ajm188 ajm188 deleted the am_reparentutil branch March 4, 2021 16:32
setassociative pushed a commit to tinyspeck/vitess that referenced this pull request Mar 8, 2021
[vtctld / wrangler] Extract some reparent methods out to functions for shared use between wrangler and VtctldServer

Signed-off-by: Richard Bailey <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants