Tablet throttle: support "/throttle/check-self" available on all tablets #7319
Conversation
Ready for review! This is a dependency for #7324
@@ -308,6 +321,37 @@ func (throttler *Throttler) createThrottlerUser(ctx context.Context) (password s
	return password, nil
}

// readSelfMySQLThrottleMetric reads the mysql metric from thi very tablet's backend mysql.
- // readSelfMySQLThrottleMetric reads the mysql metric from thi very tablet's backend mysql.
+ // readSelfMySQLThrottleMetric reads the mysql metric from this very tablet's backend mysql.
@@ -71,7 +72,7 @@ var (
	sqlGrantThrottlerUser = []string{
		`GRANT SELECT ON _vt.heartbeat TO %s`,
	}
-	replicationLagQuery = `select unix_timestamp(now(6))-max(ts/1000000000) from _vt.heartbeat`
+	replicationLagQuery = `select unix_timestamp(now(6))-max(ts/1000000000) as replication_lag from _vt.heartbeat`
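For illustration, here is a minimal sketch of how a tablet could evaluate this query against its own backend MySQL. This is not the PR's actual `readSelfMySQLThrottleMetric`; the package, function name, and error handling below are assumptions made for the example.

```go
package throttlesketch

import (
	"context"
	"database/sql"
	"fmt"
)

// The query under discussion in this PR.
const replicationLagQuery = `select unix_timestamp(now(6))-max(ts/1000000000) as replication_lag from _vt.heartbeat`

// readSelfLagSeconds is a hypothetical helper: it runs the lag query on the
// tablet's own backend MySQL and returns the apparent lag in seconds.
func readSelfLagSeconds(ctx context.Context, db *sql.DB) (float64, error) {
	var lag sql.NullFloat64
	if err := db.QueryRowContext(ctx, replicationLagQuery).Scan(&lag); err != nil {
		return 0, err
	}
	if !lag.Valid {
		// max(ts) is NULL when _vt.heartbeat has no rows yet.
		return 0, fmt.Errorf("no heartbeat rows in _vt.heartbeat")
	}
	return lag.Float64, nil
}
```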
Is this query semantically guaranteed to work, or could there possibly be implementation-specific race conditions in a sluggish scenario?
At least for MySQL (https://dev.mysql.com/doc/refman/5.6/en/date-and-time-functions.html#function_now):
NOW() returns a constant time that indicates the time at which the statement began to execute.
What I don't know is whether a `SELECT`, outside of a transaction, with `autocommit=1`, will use a consistent snapshot for its read; and if so, whether that snapshot will be at the same time as, or after, the value of `NOW()`.
Suppose there is no consistent snapshot, or there is one but it comes from a timepoint later than `NOW()`. Then couldn't the query compute a timestamp, then wait a while due to sluggishness, then read `max(ts)` and see a value that is close enough to `NOW()`, and return 200 OK?
Whereas in the same scenario, if there were a consistent snapshot taken at `NOW()` and there was lag, you would definitely see it, and not return 200 OK.
I'm just wondering whether, to be guaranteed to detect sluggishness, this works as-is, or whether you need a different query such that the timestamp is guaranteed to be computed after `max(ts)`, e.g. something like:
select max(ts/1000000000) as ts_max_heartbeat from _vt.heartbeat;
select unix_timestamp(SYSDATE(6)) as ts_now;
Here there's the risk that you could get false-negatives since the timestamp is computed after the heartbeat. But if things aren't sluggish and there isn't replica lag, hopefully the delay should be so minimal as to be inconsequential.
I'm far from an expert here, so I don't know if this line of reasoning is valid or not.
Nice thinking! Experience with this type of query throughout the years (specifically at my previous employer, as this is adapted from https://github.com/github/freno/) shows that this is not an issue.
One nice thing about this mechanism is that the 1sec threshold is not an absolute restriction. It is OK if we pass the threshold (indeed, the logic pretty much ensures we will: e.g. we will be able to write to the server when we're at 0.99sec lag, which means our own write will quite possibly push lag beyond 1sec).
So, back to the scenario you depicted: let's say the replica is lagging by 2sec and has sluggishness of 1sec, which leads us to believe the lag is 0.99sec. This means we push forward, which in the near future pushes lag above 2sec, at which time the 1sec sluggishness is not sluggish enough to make us think the lag is < 1sec. So if the system remains at 1sec sluggishness, we are at worst lagging at 2sec.
Just to illustrate how crazy time evaluations are, you could use `SYSDATE()` and still evaluate the wrong thing if the time is evaluated first, then sluggishness begins, then evaluation of the table row takes place.
Having said all that, I don't have a strong opinion either way; in my experience this logic hasn't caused any harm.
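To make that feedback argument concrete, here is a tiny self-contained simulation (illustrative numbers and hypothetical names only, not Vitess code): a constant 1sec measurement error shifts the point at which throttling kicks in, but it cannot hide ever-growing lag.

```go
package main

import "fmt"

func main() {
	const threshold = 1.0    // throttle threshold, seconds
	const sluggishness = 1.0 // constant under-measurement of lag, seconds

	trueLag := 0.5
	for i := 0; i < 5; i++ {
		measured := trueLag - sluggishness // what the sluggish check appears to see
		if measured >= threshold {
			break // the throttler finally says "no"; writes stop pushing lag
		}
		trueLag += 0.5 // check passed, so our own writes push lag further
	}
	// trueLag settles near threshold+sluggishness (~2s), matching the
	// "at worst lagging at 2sec" reasoning in the comment above.
	fmt.Printf("worst-case true lag: %.1fs\n", trueLag)
}
```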
Thank you, I appreciate the explanation.
The case I'm still wondering about is the primary.
In your scenario you discussed a replica, where replication lag and sluggishness are both at play. Sluggishness can cause false positives, but if replication lag gets high enough, it won't matter.
On the primary, there is no replication lag to consider, only sluggishness. If the sluggishness of the `select` from `_vt.heartbeat` is equal to the sluggishness of the `insert into _vt.heartbeat`, wouldn't they cancel out, so it always appears that there is no sluggishness? I guess the question is: is that possible, or in a high-sluggishness scenario, would the sluggishness of an insert almost always be greater than the sluggishness of a select?
Anyway, I'm not concerned if you're not concerned.
> The case I'm still wondering about is the primary.

I see what you mean, and your point is valid. I still don't know whether sluggishness on the primary wouldn't affect the timestamp in the same way (that is, `CURDATE` being computed when MySQL is ready to insert, and then the `INSERT` stalling).
That is to say: I don't know, I'm speculating here; and because I haven't known it to be an issue (and I've worked on sluggish systems), I assume this is not a big risk.
But point taken, let me look into it. I don't think there's any risk in converting the `insert` to use `curdate()`, in which case: why not.
> But point taken, let me look into it. I don't think there's any risk in converting the insert to use curdate(), in which case: why not.

Do you mean `now()` or `systime()`?
Why would you switch the insert code (which, from what I can tell, currently gets the timestamp from Go)? Wouldn't that potentially make the heartbeat less suitable for detecting sluggishness, since there's more of a chance that the timestamp is computed close to when the insert commits, even in the face of sluggishness?
I'd guess that keeping what you have now (which has been used a lot in production without known problems, and which in the worst case is equivalent to today's behavior of having no check-self at all) is probably better than introducing something we're less sure of that might produce new false negatives.
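For reference, the pattern being discussed (timestamp computed in Go before the write is sent) might be sketched as below; the exact statement Vitess uses is not shown in this thread, so the SQL and function shape here are assumptions for illustration only.

```go
package throttlesketch

import (
	"context"
	"database/sql"
	"time"
)

// writeHeartbeat is a hypothetical sketch: the timestamp is taken in Go
// *before* the write is sent, so a write that stalls on a sluggish server
// still carries the earlier timestamp, and the lag query can observe the stall.
func writeHeartbeat(ctx context.Context, db *sql.DB) error {
	ts := time.Now().UnixNano()
	_, err := db.ExecContext(ctx, `update _vt.heartbeat set ts = ?`, ts)
	return err
}
```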
lgtm
some nits
@@ -156,19 +175,44 @@ func TestLag(t *testing.T) {
	time.Sleep(2 * time.Second)
Are these durations used in the tests (2 seconds, 10 seconds, 5 seconds) based on values configured elsewhere? Just wondering if these can become flaky if a configuration is changed elsewhere.
Excellent question. There are some hard coded intervals here:

vitess/go/vt/vttablet/tabletserver/throttle/throttler.go, lines 37 to 41 in 627f092:

	leaderCheckInterval         = 5 * time.Second
	mysqlCollectInterval        = 100 * time.Millisecond
	mysqlDormantCollectInterval = 5 * time.Second
	mysqlRefreshInterval        = 10 * time.Second
	mysqlAggregateInterval      = 100 * time.Millisecond
Basically:

- After a cluster starts, it takes ~5sec for the throttler to know about all replicas, collect data from those replicas, and aggregate that data, meaning the `/throttler/check` API can provide a reliable answer. Hence waiting for `10sec` (this could be reduced to e.g. `7s`, but I feel `10sec` is much safer against flakiness).
- User apps will cache throttler results for some `250ms` (e.g. see the vreplication PRs), so a `1sec` sleep ensures the cache is cleared.
- The default lag threshold is `1s`, hence the `2 * time.Second` sleep after `StopReplication()`, to make sure that when we next check for lag, there is a throttle-grade lag. Having said that, you are right that this is overridable, so I just added `"-throttle_threshold", "1s",` to this test's `VtTabletExtraArgs` to ensure the threshold is `1s`.
- The sleep for `5 * time.Second` after `StartReplication` is just a heuristic to allow replication to catch up, and does not depend on the throttler configuration. Again, catch-up is likely to happen in less than `1s`, but I feel `5s` is great against flakiness.
- The sleep for `10 * time.Second` after `ChangeTabletType` is because it takes that amount of time for the throttler to identify the new roster. It's hard coded in `mysqlRefreshInterval = 10 * time.Second`, and I've now actually upped the test to sleep for `12 * time.Second` to avoid flakiness.

I've moreover now made these numbers constants in the test. Now the waits are named, and it's clearer what each wait means.
new code:
const (
throttlerInitWait = 10 * time.Second
accumulateLagWait = 2 * time.Second
mysqlRefreshIntervalWait = 12 * time.Second
replicationCatchUpWait = 5 * time.Second
)
...
time.Sleep(mysqlRefreshIntervalWait)
Description
We add a new throttler endpoint, `/throttle/check-self`, which is available on all tablets and tests the throttle state of the specific tablet (or of its backend MySQL server).

Originally, we only considered throttling on Primary tablets: the idea of throttling was that we would write a lot of data onto the shard's primary, and throttle based on replication lag on the shard's replicas.
@rohit-nayak-ps suggested a new use case: VReplication can create a substantial load on the source by reading too intensively. E.g. if VReplication is configured to read from a Replica, it can cause so much IO overhead that the replica lags as a result.
Hence, we now introduce a lag-based throttle on all tablet types. VReplication will e.g. call `/throttle/check-self` on the replica it reads from, to validate that it is not putting too much read load on it.

I'm applying the same replication lag evaluation even on a Primary tablet. This is an interesting scenario: how do you measure the load you create on a Primary tablet? I see three possibilities, among them `Threads_running`: I've known this to be a good indicator for load on a primary; when it is loaded, transactions take longer to commit, and this causes a spike in concurrent queries.

So, for now, I'm using the latter (the same replication lag evaluation) as the indicator for self-throttling on primaries, same as on replicas.
Added `endtoend` tests to confirm the behavior; documentation will have to be updated as well.

I will next look into how to use this functionality in VReplication.
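To illustrate how a client such as VReplication might consume the new endpoint (the URL path comes from this PR; the host, port, and status handling below are assumptions for the sketch):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// "tablet-host:15100" is a placeholder for the tablet's HTTP address.
	resp, err := http.Get("http://tablet-host:15100/throttle/check-self")
	if err != nil {
		fmt.Println("check-self request failed:", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		fmt.Println("tablet reports lag under threshold; proceed")
	} else {
		fmt.Println("tablet reports throttle-grade lag; back off")
	}
}
```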
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: