Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SwitchTraffic: check vreplication lag before switching #9538

Merged

Conversation

rohit-nayak-ps
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps commented Jan 19, 2022

Description

This PR adds logic to measure the estimated lag between the last transaction at the source and the last transaction seen by the target. The target persists the transaction timestamp of each seen event in _vt.vreplication. The target may not receive events from the source in two cases:

  • where there is no activity at the source relevant to the target
  • when the source is disconnected from the target (due to a vstreamer error, network issue or source tablets being unavailable).

To differentiate between this the source starts sending "heartbeats" (special internal VEvents) every second if there are no actual events to be sent. As part of this PR we also start storing the last heartbeat time in _vt.vreplication.

We calculate the transaction lag based on the transaction timestamp and the last heartbeat for each stream and the maximum across streams is used as the workflow lag: https://github.com/vitessio/vitess/blob/dc74ebcc2d1c70f36030ca07ec857c009afd6954/go/vt/wrangler/vexec.go#L536

We now add a check before switching traffic (both read and write) to ensure:

  1. that the lag is within acceptable bounds (defined by the duration flag -max_transaction_lag_allowed, default 30s)
  2. that there is no stream that has an error
  3. that we are not still in the copy phase (this already exists)

Note that the Workflow Show command does already show a MaxVReplicationLag. This was implemented more to determine if there was an issue with the source not being available than measuring transaction lag. Since the name is already in-use, for backward compatibility, a new variable MaxVReplicationTransactionLag has been added for the purposes of this PR.

Other changes:

  • Minor refactoring of the test framework for ease of maintenance
  • Intial doc to describe the test framework and changes needed for this PR. We will update this in upcoming PRs.

New flag documented in website PR: vitessio/website#957

Related Issue(s)

#9525

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

@rohit-nayak-ps rohit-nayak-ps added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication release notes labels Jan 19, 2022
@rohit-nayak-ps rohit-nayak-ps changed the title SwitchTraffic: check (configuratble) vreplication lag before switching SwitchTraffic: check (configurable) vreplication lag before switching Jan 20, 2022
@aquarapid
Copy link
Contributor

One question/comment I would have is what the default for this timeout check should be. In the interests of not changing current behavior, an argument can be made for it to be infinite? Or at least equal to the switch wait timeout?

@mattlord
Copy link
Contributor

One question/comment I would have is what the default for this timeout check should be. In the interests of not changing current behavior, an argument can be made for it to be infinite? Or at least equal to the switch wait timeout?

Why would we NOT re-use the existing flag? That being: -timeout

If we check replication lag ahead of time and see that it's beyond that window, it's reasonable to assume it's not likely to catch up within that window.

See problems/issues with that?

@rohit-nayak-ps
Copy link
Contributor Author

Why would we NOT re-use the existing flag? That being: [-timeout]

I thought of using a different flag here because:

  • -timeout essentially specifies the maximum downtime that is acceptable while switching writes. This includes the overhead of coordinating between all the shards and across cells.
  • a new flag -max_lag_allowed would allow specifying a small lag, say 2 or 5 seconds, for both reads (minimize reading stale data) and writes

If we use a single flag for both we could end up not being able to specify a small lag because it will always timeout due to the overhead.

As Jacques suggested we could use the current wait timeout as default for the new flag, if we do decide to go with two separate flags.

@rohit-nayak-ps rohit-nayak-ps changed the title SwitchTraffic: check (configurable) vreplication lag before switching SwitchTraffic: check vreplication lag before switching Jan 25, 2022
@@ -403,18 +406,18 @@ type ReplicationStatus struct {
CopyState []copyState
}

func (wr *Wrangler) getReplicationStatusFromRow(ctx context.Context, row []sqltypes.Value, primary *topo.TabletInfo) (*ReplicationStatus, string, error) {
func (wr *Wrangler) getReplicationStatusFromRow(ctx context.Context, row sqltypes.RowNamedValues, primary *topo.TabletInfo) (*ReplicationStatus, string, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this function are mainly refactoring to use name references instead of index-based (and accessing the new time_heartbeat column)

@rohit-nayak-ps
Copy link
Contributor Author

One functionality that might "break" due to this PR is that some SwitchTraffics that succeeded in the past will temporarily error out since the lag is too high. This can happen, for example, if the user created a workflow and immediately tried to SwitchTraffic without letting the workflows catchup. Earlier it would have just taken longer for the call to return since it would mostly like manage to catchup before the wait timeout (def: 30s) (or just timeout if there was too much data).

Now it will return an error and user will need to monitor the lag and/or try again later. This is the correct way to use SwitchTraffic and one we want to encourage (and indeed the reason for this PR). So I don't believe we should keep a larger default but noting it here since it might give rise to a few extra support questions.

@rohit-nayak-ps rohit-nayak-ps requested review from mattlord and a team January 30, 2022 13:47
@rohit-nayak-ps rohit-nayak-ps marked this pull request as ready for review January 30, 2022 13:48
go/vt/vtctl/vtctl.go Outdated Show resolved Hide resolved
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a few nits and questions, but otherwise LGTM! I really appreciate the time you put into the tests, docs, and refactoring!

I'll approve so that you're free to merge after addressing any valid issues.

}

// AsBytes returns the named field as a byte array, or default value if nonexistent/error
func (r RowNamedValues) AsBytes(fieldName string, def []byte) []byte {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but I would call the parameter default or defaultValue. At first I missed the comment and thought we were passing that in as a pointer to be written to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, def is confusing. Since used in all other functions, so I went with this rather than just name it differently here or fix it everywhere. But I should have ...

duration -= waitDuration
}

if duration <= 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this to be correct, I think we should only do duration -= waitDuration in the else clause when we know we should loop again. Otherwise I think we could successfully switch in the last iteration of the loop and the duration may be <= 0 even though we were successful.

Copy link
Contributor Author

@rohit-nayak-ps rohit-nayak-ps Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if has a break from the for loop, so the duration will not decrement if we were successful.

go/test/endtoend/vreplication/cluster.go Show resolved Hide resolved
@@ -933,9 +936,35 @@ func verifyClusterHealth(t *testing.T, cluster *VitessCluster) {
iterateTablets(t, cluster, checkTabletHealth)
}

const acceptableLagSeconds = 5

func waitForLowLag(t *testing.T, keyspace, workflow string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible to share this code between the two end2end tests? If not, see my earlier comments about this code in sharded_buffer_test.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, classic anti-pattern :( Because they are in different packages and we don't yet have a utils package to reuse such code, I took the short-cut. Merging as is now to catch the release deadline, will address "soon".

go/vt/binlog/binlogplayer/binlog_player.go Show resolved Hide resolved
go/vt/wrangler/doc_test.md Show resolved Hide resolved
go/vt/wrangler/doc_test.md Outdated Show resolved Hide resolved
go/vt/wrangler/vexec.go Outdated Show resolved Hide resolved
go/vt/wrangler/vexec.go Outdated Show resolved Hide resolved
…r the updated query to set/get this value. Compute workflow transaction lag and expose in Workflow Show and update related tests

Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
…ed! e2e tests: wait for lag before switching writes. unit test: add expected queries. Show Frozen state in global workflow status for visibility.

Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants