Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the frequency at which heartbeats update the _vt.vreplication table configurable #7659

Merged
merged 4 commits into from
Mar 14, 2021

Conversation

rohit-nayak-ps
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps commented Mar 10, 2021

Description

Motivation

For an idle source shard, the source vstreamer sends a heartbeat. Currently that is once per second. On receiving the heartbeat the target vreplication module updates the time_updated column of the relevant row of _vt.vreplication. For some setups this is a problem, for example:

  • if there are too many streams the extra write qps or cpu load due to these updates are unacceptable
  • if there are too many streams and/or a large source field (lot of participating tables) which generates unacceptable increase in the binlog size
  • even for a single stream, if the server is of a lower configuration, then the resultant qps/binlog increase may be significant as a percentage of resources

See related issues for more information.

Implementation

This PR provides a tablet level configuration parameter -vreplication_heartbeat_update_interval which determines how often the time_updated column is updated if there are no real events on the source and the source. Default is 1 second. Keep this low if you expect high QPS and are monitoring this column to alert about potential outages. Keep this high if you experience any of the issues mentioned above.

Notes

This is an initial light-weight implementation to address immediate concerns of affected users.

This also doesn't address the problem of long-running workflows (like materialize or reverse replication) for high qps systems.

To address this, in the long term, we will be refactoring the vreplication table so that the static part (including the source proto) will be separated from the dynamic part (position, timestamps) , so as to minimize the size of binlog entries.

The VReplicationHeartbeat metric will continue to be updated at 1 second, so for systems using this metric for monitoring, you will continue to get 1-second granularity for alerts.

Signed-off-by: Rohit Nayak [email protected]

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

…ble a vttablet configuration

Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
Copy link
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep an upper bound on that interval. The maximum interval should not exceed one minute. The reason is that we must have some tracking. Specifically, OnlineDDL looks at the update timestamp to say "this is healthy/active".

@rohit-nayak-ps
Copy link
Contributor Author

Let's keep an upper bound on that interval. The maximum interval should not exceed one minute. The reason is that we must have some tracking. Specifically, OnlineDDL looks at the update timestamp to say "this is healthy/active".

Thanks that was a good point. I added an upper bound. I have enforced it in code at the moment. I couldn't see any validations of flags in vttablet: I would ideally like to fail the vttablet launch itself if a higher value is provided so that we are not silently fixing it.

@rohit-nayak-ps rohit-nayak-ps marked this pull request as ready for review March 11, 2021 15:05
@deepthi
Copy link
Member

deepthi commented Mar 11, 2021

Thanks that was a good point. I added an upper bound. I have enforced it in code at the moment. I couldn't see any validations of flags in vttablet: I would ideally like to fail the vttablet launch itself if a higher value is provided so that we are not silently fixing it.

Mandatory flags are checked in vttablet.go. For example:

	if *tabletPath == "" {
		log.Exit("-tablet-path required")
	}

@rohit-nayak-ps
Copy link
Contributor Author

Mandatory flags are checked in vttablet.go. For example:

This are for flags defined in that cmd itself. The ones in vreplicator.go are not in scope here or in tm_init.go. It seems like an overkill to validate this inside vreplication and fail a vreplication stream. Maybe I just add this to the flags documentation for now?

@deepthi
Copy link
Member

deepthi commented Mar 12, 2021

Mandatory flags are checked in vttablet.go. For example:

This are for flags defined in that cmd itself. The ones in vreplicator.go are not in scope here or in tm_init.go. It seems like an overkill to validate this inside vreplication and fail a vreplication stream. Maybe I just add this to the flags documentation for now?

Let us log a warning if the user provided value is being overridden by the limit - either when starting the VR engine, or when a stream is created. If it turns out that people are being affected and not realizing it we can revisit whether to fail vttablet startup.

@rohit-nayak-ps
Copy link
Contributor Author

Let us log a warning if the user provided value is being overridden by the limit - either when starting the VR engine, or when a stream is created. If it turns out that people are being affected and not realizing it we can revisit whether to fail vttablet startup.

added the log

Copy link
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@rohit-nayak-ps rohit-nayak-ps merged commit 6d5204b into vitessio:master Mar 14, 2021
@rohit-nayak-ps rohit-nayak-ps deleted the rn-vr-heartbeat-update branch March 14, 2021 14:41
@askdba askdba added this to the v10.0 milestone Mar 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants