VReplication: throttling info for both source and target; Online DDL propagates said info #10601
Signed-off-by: Shlomi Noach <[email protected]>
I forgot to mention one of the side effects of this PR, which is actually the root trigger for working on it: without this PR, a throttled VReplication workflow would be identified by the Online DDL executor as "stale" after 10 minutes of inactivity, and terminated. Now, with this PR, this does not happen because |
Looking into unit test failures, which seem to be related to the |
Obviously only failing in GitHub CI 😛 and not on local envs.
Added the two new columns in |
Also added:
|
Fixed all unit tests, new ones showing! Still working through them.
Phew, the tests were brutal but legit! Now good to go!
I like it! This will greatly improve the observability/debuggability of VReplication workflows (especially when we enable tablet throttling by default). Thank you for working on it!
I had a few questions and comments, but nothing major. I'll look out for your responses and can approve quickly if needed.
@rohit-nayak-ps if you have time, would you mind giving it a pass too? Thanks!
@@ -98,7 +100,7 @@ const (
     reverted_uuid,
     is_view
   ) VALUES (
-    %a, %a, %a, %a, %a, %a, %a, %a, %a, FROM_UNIXTIME(NOW()), %a, %a, %a, %a, %a, %a, %a, %a
+    %a, %a, %a, %a, %a, %a, %a, %a, %a, NOW(), %a, %a, %a, %a, %a, %a, %a, %a
OK, so here again it's a unix timestamp in seconds precision. I wonder why we can't use a timestamp field in the table -- then we have a contract for what the value represents and we know we'll get what we expect from from_unixtime() etc? That's a moot point though I think as we are already using bigint fields for timestamps in these tables and related code so uniformity takes precedence over stricter types.
So just to clarify, the column type in this case is a TIMESTAMP and there was a long standing bug where we erroneously INSERTed a FROM_UNIXTIME() value, which of course did not make any sense.
We are now looking at a 1.5 year old decision, to use TIMESTAMP values in _vt.schema_migrations table, long before it was even associated with _vt.vreplication, and long before I knew of the existence of _vt.vreplication. This is how the code "grew" and now it's too late to change _vt.schema_migrations from TIMESTAMP and into INT.
The fact that we copy values from _vt.vreplication, that happen to be INT types, and into _vt.schema_migrations with TIMESTAMP types, is unfortunate, but also does not tell the entire story. For long, schema_migrations acted with gh-ost and pt-osc, oblivious of vreplication.
lgtm
Looks great! Thank you ❤️ The only thing left is documentation — you've proven yourself to be very studious in that regard though so I'm not concerned. Please let me know when the docs PR is ready and I'll review it. 🙇
Description
This PR is primarily an enhancement to VReplication, adding visibility into throttling status of an active workflow.
In addition, Online DDL now utilizes said information and presents it as part of SHOW VITESS_MIGRATIONS.
Lastly, there's a deadlock solution to a particular throttling scenario.
Throttling visibility in VReplication
The following components in VReplication consult the throttler (way before this PR):
- vstreamer, on source tablet
- rowstreamer, on source tablet
- vplayer, on target tablet
- vcopier, on target tablet
As of this PR, all four further advertise their throttling state. The table _vt.vreplication now has two new columns:
- time_throttled (unix time, seconds)
- component_throttled (name of component)
These present the last known incident where a VReplication component was throttled. Example values might be:
For vplayer and vcopier, whenever they are throttled, they now ask vreplicator to update those values, e.g. via:
vreplicator rate-limits those writes, to at most one per second. It's fine if calls to updateTimeThrottled() are frequent. Some calls may be dropped. The database will only be affected at most once per second.
As for vstreamer and rowstreamer, they don't have immediate access to _vt.vreplication or to vreplicator, because they are on the source tablet. A new bool throttled proto field is added in both VEvent and VStreamRowsResponse. If vstreamer is throttled, it sends (rate limited) heartbeat events with Throttled: true. If rowstreamer is throttled, it sends (rate limited) VStreamRowsResponse responses where Throttled: true. vplayer and vcopier receive and identify those responses, respectively, and call updateTimeThrottled().
This means we can see in _vt.vreplication the last known throttling incident: when it happened, and which component was affected, one of vstreamer, rowstreamer, vplayer, vcopier.
This information does not include frequency of throttling, number of successful/rejected throttler checks, etc. We leave that to a future PR.
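The rate limiting and the propagation of the throttled flag described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: throttleRecorder, event, handleEvent, simulate and the write callback are hypothetical names, and the clock is injected only to make the sketch deterministic. Only updateTimeThrottled(), the Throttled flag and the once-per-second limit come from the description.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// throttleRecorder is a hypothetical stand-in for the part of vreplicator
// that persists time_throttled and component_throttled.
type throttleRecorder struct {
	mu        sync.Mutex
	lastWrite time.Time
	interval  time.Duration
	now       func() time.Time
	// write stands in for the UPDATE on _vt.vreplication.
	write func(unixSeconds int64, component string)
}

// updateTimeThrottled records a throttling incident. Calls arriving less
// than `interval` after the last accepted write are dropped.
func (r *throttleRecorder) updateTimeThrottled(component string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	now := r.now()
	if !r.lastWrite.IsZero() && now.Sub(r.lastWrite) < r.interval {
		return false // dropped: the database was updated too recently
	}
	r.lastWrite = now
	r.write(now.Unix(), component)
	return true
}

// event is a hypothetical, trimmed-down view of a message received from
// the source tablet (a VEvent or a VStreamRowsResponse).
type event struct {
	Throttled bool
}

// handleEvent is what a target-side component might do on receipt: if the
// source reported being throttled, record it on the source's behalf.
func handleEvent(r *throttleRecorder, ev event, sourceComponent string) {
	if ev.Throttled {
		r.updateTimeThrottled(sourceComponent)
	}
}

// simulate drives the recorder with a fake clock and returns the writes
// that actually reached the (pretend) database.
func simulate() []string {
	clock := time.Unix(1655000000, 0)
	var writes []string
	r := &throttleRecorder{
		interval: time.Second,
		now:      func() time.Time { return clock },
		write: func(ts int64, component string) {
			writes = append(writes, fmt.Sprintf("%d %s", ts, component))
		},
	}
	handleEvent(r, event{Throttled: true}, "vstreamer") // accepted
	clock = clock.Add(300 * time.Millisecond)
	r.updateTimeThrottled("vplayer") // dropped, only 0.3s since last write
	clock = clock.Add(800 * time.Millisecond)
	r.updateTimeThrottled("vplayer")                       // accepted, 1.1s since last write
	handleEvent(r, event{Throttled: false}, "rowstreamer") // not throttled, ignored
	return writes
}

func main() {
	fmt.Println(simulate())
}
```

Dropping calls rather than queueing them fits the stated goal: only the last known incident matters, so a skipped write loses at most one second of freshness.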
Online DDL throttling info
When reviewing running migrations, and for vitess migrations, the online DDL executor now reads the above two columns, and propagates them as two new columns in the _vt.schema_migrations table:
- last_throttled_timestamp (TIMESTAMP)
- component_throttled (textual)
These are only sampled once per minute, sometimes more frequently.
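Since time_throttled holds unix seconds while last_throttled_timestamp is a TIMESTAMP, propagating the value implies a conversion somewhere. A minimal sketch of that conversion in Go, assuming that a value of 0 means "never throttled"; throttledTimestamp is a hypothetical helper, and the PR may well perform the conversion in SQL instead:

```go
package main

import (
	"fmt"
	"time"
)

// throttledTimestamp converts a time_throttled value (unix seconds) into a
// "YYYY-MM-DD hh:mm:ss" string in UTC, a format accepted by TIMESTAMP columns.
// Assumption: 0 (or a negative value) means "never throttled"; in that case
// ok is false and no timestamp is produced.
func throttledTimestamp(unixSeconds int64) (ts string, ok bool) {
	if unixSeconds <= 0 {
		return "", false
	}
	return time.Unix(unixSeconds, 0).UTC().Format("2006-01-02 15:04:05"), true
}

func main() {
	for _, v := range []int64{0, 86400, 1655000000} {
		ts, ok := throttledTimestamp(v)
		fmt.Printf("%d -> %q (throttled before: %v)\n", v, ts, ok)
	}
}
```

Keeping the "never throttled" case explicit avoids writing the epoch into a column where an absent value is the more honest answer.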
A deadlock scenario, now solved
With the introduction of on demand heartbeats, we've introduced a throttling deadlock scenario. Consider the following scenario (Copy of comment from this PR):
And so, once per reviewRunningMigrations(), and assuming there are running migrations, we ensure to hit a throttler check. This will kick on-demand heartbeats, unlocking the deadlock.
rowstreamer heartbeats
rowstreamer now sends heartbeats every 10sec. This means even if it's blocked on reading the next batch of rows (blocked, slow, hangs, for whatever reason), it still reports heartbeats. These are intercepted by vcopier, which then updates _vt.vreplication.time_heartbeat.
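One way to keep reporting heartbeats while the next batch of rows is slow to arrive is to select over both the row source and a ticker. The following is a sketch of that pattern only, not the rowstreamer code: streamRows, simulate and the message strings are hypothetical, and the tick channel is injected so the example is deterministic. In a real setting the ticks could come from time.NewTicker(10 * time.Second).C, with the blocking read running in its own goroutine and feeding the batches channel.

```go
package main

import (
	"fmt"
	"time"
)

// streamRows forwards row batches as they become available. Whenever a tick
// arrives while no batch is ready, it sends a heartbeat instead, so the
// receiver keeps hearing from the stream even if reads are blocked.
func streamRows(batches <-chan string, ticks <-chan time.Time, send func(msg string)) {
	for {
		select {
		case batch, ok := <-batches:
			if !ok {
				return // no more rows
			}
			send("rows:" + batch)
		case <-ticks:
			send("heartbeat")
		}
	}
}

// simulate models a slow reader: two ticks elapse before the first batch
// is ready. Unbuffered channels make the ordering deterministic.
func simulate() []string {
	batches := make(chan string)
	ticks := make(chan time.Time)
	done := make(chan struct{})
	var sent []string

	go func() {
		defer close(done)
		streamRows(batches, ticks, func(msg string) { sent = append(sent, msg) })
	}()

	ticks <- time.Now() // reader still blocked: heartbeat
	ticks <- time.Now() // reader still blocked: heartbeat
	batches <- "batch-1"
	close(batches)
	<-done // wait for streamRows to return before reading sent
	return sent
}

func main() {
	fmt.Println(simulate())
}
```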
Related Issue(s)
#6926
#10198