SwitchTraffic: check vreplication lag before switching #9538
Conversation
One question/comment I would have is what the default for this timeout check should be. In the interests of not changing current behavior, an argument can be made for it to be infinite? Or at least equal to the switch wait timeout?
Why would we NOT re-use the existing flag? That being: if we check replication lag ahead of time and see that it's beyond that window, it's reasonable to assume it's not likely to catch up within that window. See problems/issues with that?
I thought of using a different flag here because if we use a single flag for both, we could end up not being able to specify a small lag, since it would always time out due to the overhead. As Jacques suggested, we could use the current wait timeout as the default for the new flag if we do decide to go with two separate flags.
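A minimal sketch of that two-flag approach, with the new lag flag inheriting the existing wait timeout when left unset. The flag names and the zero-value sentinel are illustrative assumptions, not the actual vtctl flags:

```go
package main

import (
	"flag"
	"time"
)

// Illustrative flag names; the real vtctl flags may differ.
var (
	waitTimeout = flag.Duration("wait_timeout", 30*time.Second,
		"how long SwitchTraffic waits for the switch to complete")
	maxLagAllowed = flag.Duration("max_transaction_lag_allowed", 0,
		"maximum transaction lag allowed before switching (0 = inherit wait_timeout)")
)

// resolveMaxLag applies the suggested default after flag.Parse has run:
// if the lag flag was left unset, reuse the switch wait timeout.
func resolveMaxLag() time.Duration {
	if *maxLagAllowed == 0 {
		return *waitTimeout
	}
	return *maxLagAllowed
}

func main() {
	flag.Parse()
	_ = resolveMaxLag()
}
```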
```diff
@@ -403,18 +406,18 @@ type ReplicationStatus struct {
 	CopyState []copyState
 }

-func (wr *Wrangler) getReplicationStatusFromRow(ctx context.Context, row []sqltypes.Value, primary *topo.TabletInfo) (*ReplicationStatus, string, error) {
+func (wr *Wrangler) getReplicationStatusFromRow(ctx context.Context, row sqltypes.RowNamedValues, primary *topo.TabletInfo) (*ReplicationStatus, string, error) {
```
The changes in this function are mainly refactoring to use name references instead of index-based access (and to read the new time_heartbeat column).
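As an aside, a hedged sketch of the name-based access pattern, using the sqltypes.RowNamedValues helpers; the helper function and default values here are illustrative assumptions, not the exact wrangler code:

```go
package wranglerexample

import "vitess.io/vitess/go/sqltypes"

// readTimestamps is a hypothetical helper illustrating the pattern: read
// the per-stream timestamps by column name, with a zero default when the
// column is missing or fails to convert. Index-based access (e.g.
// sqltypes.ToInt64(row[5])) breaks whenever the SELECT column list changes.
func readTimestamps(row sqltypes.RowNamedValues) (txTimestamp, heartbeat int64) {
	txTimestamp = row.AsInt64("transaction_timestamp", 0)
	heartbeat = row.AsInt64("time_heartbeat", 0) // the new column read by this PR
	return txTimestamp, heartbeat
}
```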
One functionality that might "break" due to this PR is that some SwitchTraffic calls that succeeded in the past will temporarily error out because the lag is too high. This can happen, for example, if the user created a workflow and immediately tried to SwitchTraffic without letting the workflow catch up. Earlier it would just have taken longer for the call to return, since the stream would most likely manage to catch up before the wait timeout. Now it will return an error and the user will need to monitor the lag and/or try again later. This is the correct way to use SwitchTraffic and the one we want to encourage (and indeed the reason for this PR). So I don't believe we should keep a larger default, but I'm noting it here since it might give rise to a few extra support questions.
I had a few nits and questions, but otherwise LGTM! I really appreciate the time you put into the tests, docs, and refactoring!
I'll approve so that you're free to merge after addressing any valid issues.
```go
}

// AsBytes returns the named field as a byte array, or default value if nonexistent/error
func (r RowNamedValues) AsBytes(fieldName string, def []byte) []byte {
```
Nit, but I would call the parameter default or defaultValue. At first I missed the comment and thought we were passing that in as a pointer to be written to.
Agree, def is confusing. Since it's used in all the other functions, I went with it rather than naming it differently here or fixing it everywhere. But I should have ...
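For context, a plausible sketch of the method under discussion, consistent with the signature in the diff above (assuming RowNamedValues maps field names to sqltypes.Value, which offers a ToBytes conversion; the actual implementation may differ):

```go
// AsBytes returns the named field as a byte slice, or def if the field is
// missing or cannot be converted.
func (r RowNamedValues) AsBytes(fieldName string, def []byte) []byte {
	if v, ok := r[fieldName]; ok {
		if b, err := v.ToBytes(); err == nil {
			return b
		}
	}
	return def
}
```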
go/test/endtoend/tabletgateway/buffer/reshard/sharded_buffer_test.go
```go
	duration -= waitDuration
}

if duration <= 0 {
```
For this to be correct, I think we should only do duration -= waitDuration in the else clause, when we know we should loop again. Otherwise I think we could successfully switch in the last iteration of the loop and the duration may be <= 0 even though we were successful.
The if has a break from the for loop, so the duration will not decrement if we were successful.
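A standalone sketch of the loop shape this thread is describing: the success path breaks out before the decrement, so duration only shrinks when another iteration is coming and the post-loop error check cannot fire on success. Variable and helper names are illustrative, not the actual wrangler code:

```go
package wranglerexample

import (
	"fmt"
	"time"
)

// waitForLowLagOrTimeout is a hypothetical rendering of the loop under
// discussion; currentMaxLag stands in for the real lag query.
func waitForLowLagOrTimeout(duration, waitDuration, maxAllowedLag time.Duration,
	currentMaxLag func() time.Duration) error {
	for duration > 0 {
		if currentMaxLag() <= maxAllowedLag {
			// Success: we break before the decrement, so duration stays > 0
			// and the error check below cannot fire.
			break
		}
		time.Sleep(waitDuration)
		duration -= waitDuration
	}
	if duration <= 0 {
		return fmt.Errorf("vreplication lag is still above the allowed maximum")
	}
	return nil
}
```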
```go
@@ -933,9 +936,35 @@ func verifyClusterHealth(t *testing.T, cluster *VitessCluster) {
	iterateTablets(t, cluster, checkTabletHealth)
}

const acceptableLagSeconds = 5

func waitForLowLag(t *testing.T, keyspace, workflow string) {
```
Possible to share this code between the two end2end tests? If not, see my earlier comments about this code in sharded_buffer_test.go
Yeah, classic anti-pattern :( Because they are in different packages and we don't yet have a utils package to reuse such code, I took the short-cut. Merging as is now to catch the release deadline, will address "soon".
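A sketch of what a shared helper could look like once a common utils package exists. The package name and the fetchLagSeconds callback (a stand-in for parsing MaxVReplicationTransactionLag out of Workflow Show output) are hypothetical:

```go
package vrtestutils // hypothetical shared test-utils package

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

const acceptableLagSeconds = 5

// WaitForLowLag polls the workflow's transaction lag until it drops to an
// acceptable level, failing the test if it never does.
func WaitForLowLag(t *testing.T, keyspace, workflow string,
	fetchLagSeconds func(keyspace, workflow string) (int64, error)) {
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		lag, err := fetchLagSeconds(keyspace, workflow)
		require.NoError(t, err)
		if lag <= acceptableLagSeconds {
			return
		}
		time.Sleep(time.Second)
	}
	t.Fatalf("lag for %s.%s did not drop below %d seconds in time",
		keyspace, workflow, acceptableLagSeconds)
}
```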
…e replication lag Signed-off-by: Rohit Nayak <[email protected]>
…r the updated query to set/get this value. Compute workflow transaction lag and expose in Workflow Show and update related tests Signed-off-by: Rohit Nayak <[email protected]>
…ch functionality. Signed-off-by: Rohit Nayak <[email protected]>
…ed! e2e tests: wait for lag before switching writes. unit test: add expected queries. Show Frozen state in global workflow status for visibility. Signed-off-by: Rohit Nayak <[email protected]>
Description
This PR adds logic to measure the estimated lag between the last transaction at the source and the last transaction seen by the target. The target persists the transaction timestamp of each event it sees in _vt.vreplication. The target may not receive events from the source in two cases. To differentiate between these, the source sends "heartbeats" (special internal VEvents) every second when there are no actual events to send. As part of this PR we also start storing the last heartbeat time in _vt.vreplication.

We calculate the transaction lag from the transaction timestamp and the last heartbeat for each stream, and the maximum across streams is used as the workflow lag: https://github.com/vitessio/vitess/blob/dc74ebcc2d1c70f36030ca07ec857c009afd6954/go/vt/wrangler/vexec.go#L536
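A simplified sketch of that computation (type and function names are illustrative; the linked vexec.go has the real code): per stream, the lag is measured from the later of the last transaction timestamp and the last heartbeat, and the workflow lag is the maximum over its streams.

```go
package wranglerexample

// streamState holds the two per-stream timestamps persisted in
// _vt.vreplication (Unix seconds).
type streamState struct {
	TransactionTimestamp int64
	TimeHeartbeat        int64
}

// maxTransactionLag returns the workflow-level lag: for each stream take
// "now" minus the most recent of (last transaction, last heartbeat), then
// take the maximum across streams.
func maxTransactionLag(now int64, streams []streamState) int64 {
	var maxLag int64
	for _, s := range streams {
		lastSeen := s.TransactionTimestamp
		if s.TimeHeartbeat > lastSeen {
			lastSeen = s.TimeHeartbeat
		}
		if lag := now - lastSeen; lag > maxLag {
			maxLag = lag
		}
	}
	return maxLag
}
```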
We now add a check before switching traffic (both reads and writes) to ensure that the workflow's transaction lag is below a configurable threshold (flag -max_transaction_lag_allowed, default 30s).

Note that the Workflow Show command already shows a MaxVReplicationLag. That was implemented more to determine whether there was an issue with the source not being available than to measure transaction lag. Since the name is already in use, for backward compatibility, a new variable, MaxVReplicationTransactionLag, has been added for the purposes of this PR.

Other changes:
New flag documented in website PR: vitessio/website#957
Related Issue(s)
#9525