Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tablet throttle: support "/throttle/check-self" available on all tablets #7319

Merged

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Jan 19, 2021

Description

We add a new throttler endpoint, /throttle/check-self, which is available on all tablets, and which tests the throttle state of the specific tablet (or of its backend MySQL server).

Originally, we only considered throttling on Primary tablets: the idea of throttling was that we would write a lot of data onto the shard's primary, and throttle based on replication lag on the shard's replicas.

@rohit-nayak-ps suggested a new use case: VReplication can create a substantial load on the source by reading too intensively. e.g. if Vreplication is configured to read from a Replica, it can cause so much IO overhead that the replica would lag as result.

Hence, we now introduce a lag-based throttle on all tablet types. VReplication will e.g. call /throttle/check-self on the replica it reads from, to validate it's not putting too much read load.

I'm applying the same replication lag evaluation even on a Primary tablet. This is an interesting scenario: how do you measure the load you create on a Primary tablet? I see three possibilities:

  • measure Threads_running. I've known this to be a good indicator for load on primary; when it is loaded, transactions take longer time to commit, and this causes a spike in concurrent queries.
  • measure History List Length (part of InnoDB's SHOW INNODB STATUS). A transactions that is open for a long time will cause this MVCC to keep more and more rows out of the main index tree, and that in turn causes more IO. The history list length number indicates how much of "stalled" versions we have. I'm used to measuring it in millions.
  • evaluate replication lag, yes, on the Primary. The thing is that we inject a heartbeat on the primary itself. That means we are able to compare the heartbeat timestamp on the primary, with current time, same as on replicas. The interesting part is that if the primary stalls on writes, then we will see an imaginary "lag" on the primary itself.

So, for now, I'm using the latter as the indicator for self-throttling on primaries, same as on replicas.

Added endtoend tests to confirm behavior, and documentation will have to be updated, as well.

I will next look into how to use this functionality in VReplication.

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required TODO

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build
  • VTAdmin

…ets that only checks the replication lag on the tablet's backend MySQL server

Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach
Copy link
Contributor Author

Ready for review! This is a dependency for #7324

@@ -308,6 +321,37 @@ func (throttler *Throttler) createThrottlerUser(ctx context.Context) (password s
return password, nil
}

// readSelfMySQLThrottleMetric reads the mysql metric from thi very tablet's backend mysql.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// readSelfMySQLThrottleMetric reads the mysql metric from thi very tablet's backend mysql.
// readSelfMySQLThrottleMetric reads the mysql metric from this very tablet's backend mysql.

@@ -71,7 +72,7 @@ var (
sqlGrantThrottlerUser = []string{
`GRANT SELECT ON _vt.heartbeat TO %s`,
}
replicationLagQuery = `select unix_timestamp(now(6))-max(ts/1000000000) from _vt.heartbeat`
replicationLagQuery = `select unix_timestamp(now(6))-max(ts/1000000000) as replication_lag from _vt.heartbeat`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this query semantically guaranteed to work, or could there possibly be implementation-specific race conditions in a sluggish scenario?

At least for MySQL (https://dev.mysql.com/doc/refman/5.6/en/date-and-time-functions.html#function_now):

NOW() returns a constant time that indicates the time at which the statement began to execute.

What I don't know is whether a SELECT, outside of a transaction, with autocommit=1, will use a consistent snapshot for its read; and if so, whether that snapshot will be at the same time, or after, the value of NOW().

Suppose there is no consistent snapshot, or there is but it comes from a timepoint later than NOW(). Then couldn't the query compute a timestamp, then wait a while due to sluggishness, then read max(ts) and see a value that is close enough to NOW(), and return 200 ok.

Whereas in the same scenario, if there was a consistent snapshot taken at NOW(), if there was lag, you would definitely see it, and not return 200 ok.

I'm just wondering if, to be guaranteed to detect sluggishness, whether this works, or if you need a different query such that the timestamp is guaranteed to be computed after max(ts). e.g. something like

select max(ts/1000000000) as ts_max_heartbeat from _vt.heartbeat;
select unix_timestamp(SYSDATE(6)) as ts_now;

Here there's the risk that you could get false-negatives since the timestamp is computed after the heartbeat. But if things aren't sluggish and there isn't replica lag, hopefully the delay should be so minimal as to be inconsequential.

I'm far from an expert here, so I don't know if this line of reasoning is valid or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thinking! Experience with this type of query throughout the years (and specifically in my previous employer, as this is adapted from https://github.com/github/freno/), shows that this is not an issue.

One nice thing about this mechanism is that the 1sec threshold is not an absolute restriction. It is OK if we pass the threshold (indeed, the logic pretty much ensures we will -- e.g. we will be able to write to the server when we're at 0.99sec lag, which means our own write will quite possibly push lag beyond 1sec).

So, back to the scenario you depicted, let's say replica is lagging by 2sec and has slugishness of 1sec, which leads us to believe the lag is 0.99sec. This means we push forward. This means in the near future lag will be pushed to be above 2sec, at which time the 1sec slugishness is not sluggish enough to make us think the lag is < 1sec. So if the system remains at 1sec slugishness, we're at worst case lagging at 2sec.

Just to illustrate how crazy time evaluations are, you could use SYSDATE() and still evaluate the wrong thing if the time is evaluated first, then slugishness begins, then evaluation of table row takes place.

Having said all that, I don't have a strong opinion either way; just my experience that this logic hasn't created any harm in my past experience.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I appreciate the explanation.

The case I'm still wondering about is the primary.

In your scenario you discussed a replica, where replication lag and sluggishness are both at play. Sluggishness can cause false positives, but if replication lag gets high enough, it won't matter.

On the primary, there is no replication lag to consider, only sluggishness. If the sluggishness of the select from _vt.heartbeat is equal to the sluggishness of insert into _vt.heartbeat, wouldn't they cancel out and always appear to never be any sluggishness? I guess the question is, is that possible, or in a high-sluggishness scenario, would sluggishness for an insert almost always be greater than the sluggishness of a select?

Anyway, I'm not concerned if you're not concerned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case I'm still wondering about is the primary.

I see what you mean, and your point is valid. I still don't know if slugishness on the primary won't affect the timestamp in the same way (that is, CURDATE is computed when MySQL is ready to insert, and then the INSERT stalls).

That is to say, I don't know, I'm speculating here; and because I haven't known it to be an issue (and I've worked on sluggish systems) I assume this is not a big risk.

But point taken, let me look into it, I don't think there's any risk in converting the insert to use curdate(), in which case -- why not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But point taken, let me look into it, I don't think there's any risk in converting the insert to use curdate(), in which case -- why not.

Do you mean now() or systime()?

Why would you switch the insert code (which, from what I can tell, currently gets the timestamp from go)? Wouldn't that potentially make the heartbeat less suitable for detecting sluggishness, since there's more of a chance that the timestamp might be computed close to when the insert commits, even in the face of sluggishness?

I'd guess that keeping what you have now - which has been used a lot in production already without known problems, and where in the worst case scenario it's equivalent to today's behavior which is to have no check-self at all - is probably better than introducing something we're less sure of that might introduce new false negatives.

Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
some nits

@@ -156,19 +175,44 @@ func TestLag(t *testing.T) {
time.Sleep(2 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these durations used in the tests (2 seconds, 10 seconds, 5 seconds) based on values configured elsewhere? Just wondering if these can become flaky if a configuration is changed elsewhere.

Copy link
Contributor Author

@shlomi-noach shlomi-noach Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent question. There are some hard coded intervals here:

leaderCheckInterval = 5 * time.Second
mysqlCollectInterval = 100 * time.Millisecond
mysqlDormantCollectInterval = 5 * time.Second
mysqlRefreshInterval = 10 * time.Second
mysqlAggregateInterval = 100 * time.Millisecond

Basically:

  • after a cluster starts, it takes ~5sec for throttler to know about all replicas and collect data from those replicas and aggregate that data -- meaning an API /throttler/check can provide a reliable answer. Hence waiting for 10sec (can reduce that to e.g.7s but I feel 10sec is much safer against flakyness.
  • user apps will cache throttler results for some 250ms (e.g. see in the vreplication PRs), so a 1sec sleep can ensure the cache is cleared
  • The default lag threshold is 1s, hence the 2 * time.Second sleep after StopReplication(), to make sure when we next check for lag, there is a throttle-grade lag. Having said that, you are right that this is overridable -- so I just added
			"-throttle_threshold", "1s",

to this test's VtTabletExtraArgs to ensure the threshold is 1s.

  • The sleep for 5 * time.Second after StartReplication is just heuristic to allow the replication to catch up, and does not depend on the throttler configuration. Again, catch up is likely to happen in less than 1s but I feel 5s is great against flakyness.
  • The sleep for 10 * time.Second after ChangeTabletType is because it will take that amount of time for the throttler to identify the new roster. It's hard coded in mysqlRefreshInterval = 10 * time.Second, and now I've actually uppsed the test to sleep for 12 * time.Second to avoid flakyness.

I've moreover now made these numbers as constants in the test. Now the waits are named, and its clearer what each wait means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new code:

const (
	throttlerInitWait        = 10 * time.Second
	accumulateLagWait        = 2 * time.Second
	mysqlRefreshIntervalWait = 12 * time.Second
	replicationCatchUpWait   = 5 * time.Second
)

...

		time.Sleep(mysqlRefreshIntervalWait)

go/vt/vttablet/tabletserver/throttle/throttler.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/throttle/throttler.go Outdated Show resolved Hide resolved
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach shlomi-noach merged commit 1514987 into vitessio:master Jan 26, 2021
@shlomi-noach shlomi-noach deleted the throttler-replica-throttle-lag branch January 26, 2021 07:34
@askdba askdba added this to the v9.0 milestone Jan 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants