-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify replication lag selection logic #5251
Simplify replication lag selection logic #5251
Conversation
7eb7dac
to
015f1ef
Compare
@mpawliszyn @rafael @demmer @tirsen If you don't have objections, we can review this more carefully and get this submitted. |
@brirams was actually just looking at this part of the codebase. He can give good input on this. I think I'm also on board in making this simpler. One high level comment, if we decide to move forward, we should make this new approach an opt-in behavior via a flag, so it's easier to rollback if we run into regressions. |
I agree with the overall sentiment around simplifying this logic and I think this PR does just However, my concern is that it also changes the existing contract of |
This is definitely the behavior we want. In fact, at the two-hour replication lag mark, the vttablet has a default setting that would make it non-serving: https://github.com/vitessio/vitess/blob/master/go/vt/vttablet/tabletmanager/healthcheck.go#L219. @sayap: will you be able to introduce the flag to allow for fallback? |
761ff06
to
05441fb
Compare
I have added a flag to opt-out from the new simplified logic, for users who rely on the existing (legacy) algorithm. |
go/vt/discovery/replicationlag.go
Outdated
lowReplicationLag = flag.Duration("discovery_low_replication_lag", 30*time.Second, "the replication lag that is considered low enough to be healthy") | ||
highReplicationLagMinServing = flag.Duration("discovery_high_replication_lag_minimum_serving", 2*time.Hour, "the replication lag that is considered too high when selecting the minimum num vttablets for serving") | ||
minNumTablets = flag.Int("min_number_serving_vttablets", 2, "the minimum number of vttablets that will be continue to be used even with low replication lag") | ||
legacyReplicationLagAlgorithm = flag.Bool("legacy_replication_lag_algorithm", false, "use the legacy algorithm when selecting the vttablets for serving") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should initially default to true. Otherwise behavior would change for existent users without them opting in.
We can then go through a process of:
- Opt in to new behavior.
- New behavior becomes default.
- Deprecate legacy behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is now default to true.
To avoid surprise, the simplified logic is now purely based on the 3 config values, i.e. low lag / high lag / min tablets, without any outlier calculation or special case. Add flag -legacy_replication_lag_algorithm to toggle between the legacy algorithm and the simplified logic, default to true (i.e. the former) for now. Signed-off-by: Yap Sok Ann <[email protected]>
05441fb
to
da9bfed
Compare
@rafael This looks ready to go. Can you do a final once-over? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
To avoid surprise, the logic is now purely based on the 3 config values, i.e. low lag / high lag / min tablets, without any outlier calculation or special case.
This gets the logic to match with the unhappy description of "will serve traffic only if there are no fully healthy tablets", shown in the vttablet status page.