SwitchTraffic: check vreplication lag before switching #9538

rohit-nayak-ps · 2022-01-19T21:09:47Z

Description

This PR adds logic to measure the estimated lag between the last transaction at the source and the last transaction seen by the target. The target persists the transaction timestamp of each seen event in _vt.vreplication. The target may not receive events from the source in two cases:

when there is no activity at the source relevant to the target
when the source is disconnected from the target (due to a vstreamer error, a network issue, or source tablets being unavailable).

To differentiate between these two cases, the source starts sending "heartbeats" (special internal VEvents) every second if there are no actual events to be sent. As part of this PR we also start storing the last heartbeat time in _vt.vreplication.

We calculate the transaction lag based on the transaction timestamp and the last heartbeat for each stream and the maximum across streams is used as the workflow lag: https://github.com/vitessio/vitess/blob/dc74ebcc2d1c70f36030ca07ec857c009afd6954/go/vt/wrangler/vexec.go#L536

We now add a check before switching traffic (both read and write) to ensure:

that the lag is within acceptable bounds (defined by the duration flag -max_transaction_lag_allowed, default 30s)
that there is no stream that has an error
that we are not still in the copy phase (this already exists)

Note that the Workflow Show command already shows a MaxVReplicationLag. This was implemented more to determine whether there was an issue with the source not being available than to measure transaction lag. Since the name is already in use, for backward compatibility a new variable, MaxVReplicationTransactionLag, has been added for the purposes of this PR.

Other changes:

Minor refactoring of the test framework for ease of maintenance
Initial doc to describe the test framework and changes needed for this PR. We will update this in upcoming PRs.

New flag documented in website PR: vitessio/website#957

Related Issue(s)

#9525

Checklist

Should this PR be backported?
Tests were added or are not required
Documentation was added or is not required

aquarapid · 2022-01-20T18:54:00Z

One question/comment I would have is what the default for this timeout check should be. In the interests of not changing current behavior, an argument can be made for it to be infinite? Or at least equal to the switch wait timeout?

mattlord · 2022-01-20T21:26:21Z

> One question/comment I would have is what the default for this timeout check should be. In the interests of not changing current behavior, an argument can be made for it to be infinite? Or at least equal to the switch wait timeout?

Why would we NOT re-use the existing flag? That being: -timeout

If we check replication lag ahead of time and see that it's beyond that window, it's reasonable to assume it's not likely to catch up within that window.

See problems/issues with that?

rohit-nayak-ps · 2022-01-21T16:16:51Z

> Why would we NOT re-use the existing flag? That being: -timeout

I thought of using a different flag here because:

-timeout essentially specifies the maximum downtime that is acceptable while switching writes. This includes the overhead of coordinating between all the shards and across cells.
a new flag -max_lag_allowed would allow specifying a small lag, say 2 or 5 seconds, for both reads (minimize reading stale data) and writes

If we use a single flag for both, we could end up unable to specify a small lag because the switch would always time out due to the overhead.

As Jacques suggested we could use the current wait timeout as default for the new flag, if we do decide to go with two separate flags.

rohit-nayak-ps · 2022-01-25T21:32:57Z

go/vt/wrangler/vexec.go

@@ -403,18 +406,18 @@ type ReplicationStatus struct {
 	CopyState []copyState
 }

-func (wr *Wrangler) getReplicationStatusFromRow(ctx context.Context, row []sqltypes.Value, primary *topo.TabletInfo) (*ReplicationStatus, string, error) {
+func (wr *Wrangler) getReplicationStatusFromRow(ctx context.Context, row sqltypes.RowNamedValues, primary *topo.TabletInfo) (*ReplicationStatus, string, error) {


The changes in this function are mainly a refactoring to use name-based references instead of index-based ones (and to access the new time_heartbeat column).

rohit-nayak-ps · 2022-01-30T13:47:04Z

One behavior that might "break" due to this PR is that some SwitchTraffic calls that succeeded in the past will now error out temporarily because the lag is too high. This can happen, for example, if the user creates a workflow and immediately tries to SwitchTraffic without letting the workflow catch up. Earlier the call would just have taken longer to return, since the workflow would most likely manage to catch up before the wait timeout (default: 30s), or it would time out if there was too much data.

Now it will return an error and the user will need to monitor the lag and/or try again later. This is the correct way to use SwitchTraffic and the one we want to encourage (and indeed the reason for this PR). So I don't believe we should keep a larger default, but I am noting it here since it might give rise to a few extra support questions.

go/vt/vtctl/vtctl.go

mattlord

I had a few nits and questions, but otherwise LGTM! I really appreciate the time you put into the tests, docs, and refactoring!

I'll approve so that you're free to merge after addressing any valid issues.

mattlord · 2022-02-01T00:40:34Z

go/sqltypes/named_result.go

+}
+
+// AsBytes returns the named field as a byte array, or default value if nonexistent/error
+func (r RowNamedValues) AsBytes(fieldName string, def []byte) []byte {


Nit, but I would call the parameter default or defaultValue. At first I missed the comment and thought we were passing that in as a pointer to be written to.

Agree, def is confusing. Since it is used in all the other functions, I went with it rather than naming it differently only here or fixing it everywhere. But I should have ...
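To make the naming point concrete, here is a small sketch of a name-keyed row accessor with a fallback value. It is a simplified stand-in, not the sqltypes.RowNamedValues code: namedRow, asBytes and asInt64 are hypothetical, and the column names in main are only illustrative. Note that default is a reserved word in Go and cannot be used as a parameter name, so defaultValue is the spelling that compiles.

```go
package main

import (
	"fmt"
	"strconv"
)

// namedRow is a simplified stand-in for a result row keyed by column name.
type namedRow map[string][]byte

// asBytes returns the named field, or defaultValue if the column is absent.
func (r namedRow) asBytes(fieldName string, defaultValue []byte) []byte {
	if v, ok := r[fieldName]; ok {
		return v
	}
	return defaultValue
}

// asInt64 returns the named field parsed as a base-10 int64, or
// defaultValue if the column is absent or does not parse.
func (r namedRow) asInt64(fieldName string, defaultValue int64) int64 {
	v, ok := r[fieldName]
	if !ok {
		return defaultValue
	}
	n, err := strconv.ParseInt(string(v), 10, 64)
	if err != nil {
		return defaultValue
	}
	return n
}

func main() {
	row := namedRow{"time_heartbeat": []byte("1643700000")}
	fmt.Println(row.asInt64("time_heartbeat", 0))
	fmt.Println(string(row.asBytes("message", []byte("none"))))
}
```

Looking fields up by name rather than by position also means that adding a column such as time_heartbeat to the query does not shift the indexes of the existing fields.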

go/test/endtoend/tabletgateway/buffer/reshard/sharded_buffer_test.go

mattlord · 2022-02-01T00:50:12Z

go/test/endtoend/tabletgateway/buffer/reshard/sharded_buffer_test.go

+		duration -= waitDuration
+	}
+
+	if duration <= 0 {


For this to be correct, I think we should only do duration -= waitDuration in the else clause when we know we should loop again. Otherwise I think we could successfully switch in the last iteration of the loop and the duration may be <= 0 even though we were successful.

The if has a break from the for loop, so the duration will not decrement if we were successful.

go/test/endtoend/vreplication/cluster.go

mattlord · 2022-02-01T01:00:21Z

go/test/endtoend/vreplication/vreplication_test.go

@@ -933,9 +936,35 @@ func verifyClusterHealth(t *testing.T, cluster *VitessCluster) {
 	iterateTablets(t, cluster, checkTabletHealth)
 }

+const acceptableLagSeconds = 5
+
+func waitForLowLag(t *testing.T, keyspace, workflow string) {


Possible to share this code between the two end2end tests? If not, see my earlier comments about this code in sharded_buffer_test.go

Yeah, classic anti-pattern :( Because they are in different packages and we don't yet have a utils package to reuse such code, I took the shortcut. Merging as is now to catch the release deadline, will address "soon".

go/vt/binlog/binlogplayer/binlog_player.go

go/vt/wrangler/doc_test.md

go/vt/wrangler/vexec.go

rohit-nayak-ps added the Type: Enhancement, Component: VReplication, and release notes labels Jan 19, 2022

rohit-nayak-ps changed the title ~~SwitchTraffic: check (configuratble) vreplication lag before switching~~ SwitchTraffic: check (configurable) vreplication lag before switching Jan 20, 2022

rohit-nayak-ps changed the title ~~SwitchTraffic: check (configurable) vreplication lag before switching~~ SwitchTraffic: check vreplication lag before switching Jan 25, 2022

rohit-nayak-ps commented Jan 25, 2022

rohit-nayak-ps added the Skip Upgrade Downgrade label Jan 26, 2022

rohit-nayak-ps force-pushed the rn-validate-lag-before-switch-traffic branch from 5e343a6 to 9bd68b0 Compare January 29, 2022 12:34

rohit-nayak-ps mentioned this pull request Jan 30, 2022

Document -max_replication_lag_allowed flag vitessio/website#957

Merged

rohit-nayak-ps requested review from mattlord and a team January 30, 2022 13:47

rohit-nayak-ps marked this pull request as ready for review January 30, 2022 13:48

rohit-nayak-ps requested review from ajm188, deepthi and doeg as code owners January 30, 2022 13:48

rohit-nayak-ps removed request for doeg and ajm188 January 30, 2022 13:48

deepthi reviewed Jan 31, 2022

mattlord approved these changes Feb 1, 2022

rohit-nayak-ps added 7 commits February 1, 2022 10:43

Check that workflow is ready for switching traffic, using configurable replication lag

c074437

Signed-off-by: Rohit Nayak <[email protected]>

Persist last heartbeat time to better estimate lag. Fix some tests for the updated query to set/get this value. Compute workflow transaction lag and expose in Workflow Show and update related tests

0ef85a3

Signed-off-by: Rohit Nayak <[email protected]>

Improve debugging fakesqldb issues. Fix lag computation. Test canSwitch functionality.

cce9c25

Signed-off-by: Rohit Nayak <[email protected]>

Fix streamInfoQuery

cea47c6

Signed-off-by: Rohit Nayak <[email protected]>

Fix wrangler tests

435e5f2

Signed-off-by: Rohit Nayak <[email protected]>

Fix tests. UnixNano() was used in places instead of Unix()

7bdf042

Signed-off-by: Rohit Nayak <[email protected]>

Fixed logic. Some cleanup

82b90ed

Signed-off-by: Rohit Nayak <[email protected]>

rohit-nayak-ps added 5 commits February 1, 2022 10:45

Don't look for lag if switching reads and writes have not been switched! e2e tests: wait for lag before switching writes. unit test: add expected queries. Show Frozen state in global workflow status for visibility.

61953a5

Signed-off-by: Rohit Nayak <[email protected]>

wait for acceptable lag in buffer tests

b5453d7

Signed-off-by: Rohit Nayak <[email protected]>

Add idempotency tests back

d89cba6

Signed-off-by: Rohit Nayak <[email protected]>

Update test doc

3d5b0d8

Signed-off-by: Rohit Nayak <[email protected]>

Address review comments

0bb4b29

Signed-off-by: Rohit Nayak <[email protected]>

rohit-nayak-ps force-pushed the rn-validate-lag-before-switch-traffic branch from 4e9ff2d to 0bb4b29 Compare February 1, 2022 10:03

rohit-nayak-ps merged commit 6443bb5 into vitessio:main Feb 1, 2022

rohit-nayak-ps deleted the rn-validate-lag-before-switch-traffic branch February 1, 2022 11:15

rohit-nayak-ps mentioned this pull request Feb 15, 2022

VReplication Workflows: Use WithDDL while updating time_heartbeat to be backwardly compatible with upgrades to existing cluster #9700

Merged

3 tasks

mattlord mentioned this pull request Feb 16, 2022

Fix missing time_heartbeat column error #9687

Closed

3 tasks

shlomi-noach mentioned this pull request Mar 23, 2022

Online DDL: identify VReplication retrying failure, terminate migration #9958

Closed

3 tasks

mattlord mentioned this pull request May 11, 2022

Support checking workflow lag before initiating SwitchTraffic (SwitchReads/SwitchWrites) #9525

Closed
