-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vreplication: dynamic packet sizing #7933
Conversation
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
To note: right now the performance measures are fully client side. It measures the roundtrip time for each insert batch as seen for the client. This could easily be improved to make the server collaborate and report the actual insert timing, but I think there's value in this client-side performance metric, because it takes into account network throughput and latency, something that I believe is also meaningful when tuning the packet sizes. |
Signed-off-by: Vicent Marti <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magical PR as usual :)
I don't understand the statistical stuff but the changes to the vreplication code look good.
I suggest adding some per-stream vstreamer metrics to externalize the packet size (current, max, average). This will help provide more insights into how the algorithm is working in production setups both for users and (maybe) for us to tune it in specific circumstances. Examples here: go/vt/vttablet/tabletserver/vstreamer/engine.go:108
Signed-off-by: Vicent Marti <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Description
☠️ Warning: Spicy PR ahead ☠️
As discussed on last week's VReplication performance improvements, I noticed that table copying performance was basically being bottlenecked by MySQL bulk insert performance. After some testing on staging environments, I found out that the simplest way to improve MySQL insert performance was tweaking the packet size (
-vstream_packet_size
) on the vttablet source: depending on the layout and size of the tables being copied, careful tuning of this configuration option can make table copying up to 30% faster.This is good, because it means advanced Vitess users can configure their clusters to squeeze more performance out of VReplication, but it is also bad, because it means you need be an advanced user and be able to understand how to tweak the packet sizes based on your use case. So, in the spirit of PlanetScale, it occurred to me: can we make Vitess more magical so it always does the right thing?
And it turns out we can! This PR introduces a new configuration option
-vstream_dynamic_packet_size
, which turns on an extension to the VReplication controller that measures the throughput of the replication stream and uses statistical analysis to adjust its packet size for optimal performance. The idea is good, but the devil is in the details, of course: to ensure this real-time optimization is as effective as it can be, the dynamic packet sizer performs repeated experiments to sample the effects of packet size changes. It keeps track of the response times from previous stream calls and aggregates them into statistical samples by removing any outliers and performing a two-sample Welch T-Test to ensure changes in packet size are actually faster. Tthis is the same statistical analysis that I usually perform when analysing benchmark results, to ensure optimizations are meaningfully faster.To accomplish all this, I've had to vendor some of the maths logic from
golang.org/x/perf/benchstat
into Vitess; the meaningful math code is in aninternal
package so it cannot be used as a dependency directly. I believe this is fine because these kinds of statistic analysis functions are probably not being changed or improved anytime soon. Relatedly: I decided to use a Welch T-Test instead of the default U-test that thebenchstat
package uses because the later runs inO(n^2)
memory, while the T-Test uses constant memory. I don't think the accuracy difference between the tests is meaningful, or more accurately, Wikipedia thinks this is fine. Both tests should be statistically meaningful and the T-test doesn't allocate, sooo...Testing
I've started by verifying in the end-to-end performance harness that this change meaningfully improves table copying performance. It does for all testing I've been able to do locally -- it seems to consistently improve on the default 25000 packet size setting from Vitess.
Besides that, as you may guess, this is a complex feature to unit test. In order to give some sanity to the implementation, I've implemented a simulation as a unit test: basically, I've graphed the performance of several MySQL bulk insert behaviors and curve-fitted the response time graphs into several polynomials. The unit tests are using this curve fitting to simulate the response times of a MySQL server and to ensure that the packet size adjustments give a meaningful performance impact.
Related Issue(s)
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: