VStream server-side error during gh-ost online schema migration #7078
Comments
Maybe @rohit-nayak-ps has some idea of the error? Seems related to schema versioning. |
Also during a long-running
|
I tried reproducing this with both sharded and unsharded keyspaces with the table being continuously populated with data so that the gh-ost alter table is long-running. No luck. I also notice that there are some changes in the gh-ost based functionality. Not sure if it impacts the bug you are encountering but was wondering if it is possible for you to check if the same problem happens on master. The latest code changes the way you invoke gh-ost like:
I know you mentioned this only happens with huge tables, I was testing with ~100k. I will start populating one locally for testing this ... |
@rohit-nayak-ps thanks for replying. I managed to reproduce the following error (
|
Add @rgibaiev to follow this issue as well. |
I also managed to reproduce the following error (columns and values mismatch error) repeatedly with GA v8.0.0:
The steps are:
I'll test the master branch tomorrow, as you suggested @rohit-nayak-ps |
@keweishang, I was able to repro on master branch as well, so no need to test on it! For me too I was able to get it only while pointing to replica and not to master, but it might be a race. As you suspected, the schema is not getting reloaded correctly by vstreamer after the gh-ost operation completes. Sugu suggested gh-ost might be explicitly reloading schema on master (where gh-ost runs), so we don't see the error there. Will let you know once we have progress. |
@rohit-nayak-ps, it's great that you can reproduce the errors on your side now. For me, both the "unknown table" and the "column mismatch" errors also happened when pointing vstream to master as well. Sure. Keep me updated here and let me know if you need any further information. |
@rohit-nayak-ps happy to look into reloading schema after |
Quick update from discussing with @rohit-nayak-ps , we will seek a way to trigger |
Thanks for the update. @shlomi-noach So you meant |
Yes, assuming I understand correctly; specifically, we need to reload on the replica where vstream runs on. @rohit-nayak-ps has a workaround meanwhile, I'll update soon. |
The workarounds I had discussed (while we wait for an automatic schema load post-migration) are:
|
Sorry, but I am not able to repro this anymore. I have been testing for a while now using this setup:
Not sure how this is different from your setup. Since it is happening consistently for you @keweishang, it will be nice if you can repro using the same setup with any mods to recreate the bug, since it is then easier for any of us here to debug. I am running this on the current master (though I don't think we have new code that could have fixed this error). |
Hi @rohit-nayak-ps, sure, I'll try and use your setup to reproduce the error. Will keep you updated this week. |
Hi @rohit-nayak-ps, sorry for the delay. Based on the GA 8.0.0 docker image, I can repeatedly reproduce the errors. I've created a public repo with a README that has the steps to reproduce the errors: https://github.com/keweishang/schema_reload_error_test Let me know if you manage to reproduce the error with the above repo setup. Thanks. |
@keweishang , thanks for the great test repo. I was able to reproduce the "cannot determine table columns" issue, even with the latest code. The issue with the internal tables created by gh-ost has been resolved in #7159, so it doesn't appear now. The cause is:
The default is to not run the tracker, so #1 doesn't apply. When #2 is also not applicable, i.e. when we call VStream API only after the migration is complete, we are then dependent on #3, vttablet's automatic reload. #4 is impractical for production use. In our case the VStream API is called, with gtid set to "current", before the periodic reload, so the schema is not yet in sync. This results in the schema-mismatch error that is thrown. We discussed reloading the schema once Online DDL completes a migration. However, we need to resolve a couple of things before we can do that:
So this requires more thought and will not happen in the short-term. The recommended way, at this time, is to enable the tracker in vttablet using |
Also, the reason I was unable to consistently repro earlier was that the tablets always had vstreams running on them which were reloading the schema. So a fresh VStream API client always found the updated schema. |
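To make the race described above concrete, here is a toy model in Python. It is not Vitess code: `ToySchemaCache`, its methods, the `id` and `name` columns, and the error strings are all made up for illustration. Only the `bar_entry` table and its new `status` column come from this issue. The sketch assumes the tablet decodes row events against a cached column list that is refreshed only when something triggers a reload.

```python
from dataclasses import dataclass, field


@dataclass
class ToySchemaCache:
    """Toy stand-in for a tablet's cached view of table columns (hypothetical)."""
    live_schema: dict                           # what the database has right now
    cached: dict = field(default_factory=dict)  # what the streamer believes

    def reload(self):
        # Models a periodic reload, or a reload triggered by a running vstream.
        self.cached = {t: list(cols) for t, cols in self.live_schema.items()}

    def decode_row(self, table, values):
        # A row event carries values only; column names come from the cache.
        cols = self.cached.get(table)
        if cols is None:
            raise LookupError(f"unknown table {table}")
        if len(cols) != len(values):
            raise ValueError(
                f"columns and values mismatch: {len(cols)} columns, {len(values)} values"
            )
        return dict(zip(cols, values))


live = {"bar_entry": ["id", "name"]}   # "id" and "name" are assumed columns
cache = ToySchemaCache(live)
cache.reload()

# The migration completes: the live table gains "status", the cache is still stale.
live["bar_entry"] = ["id", "name", "status"]
row_after_migration = (1, "a", 0)

try:
    cache.decode_row("bar_entry", row_after_migration)
    stale_result = "ok"
except ValueError as exc:
    stale_result = str(exc)

# Once anything reloads the schema, the same row decodes fine.
cache.reload()
fresh_result = cache.decode_row("bar_entry", row_after_migration)
```

In this model a client that starts streaming between the migration and the next reload hits the mismatch, while a tablet that already has a stream triggering reloads never shows it, which matches the difficulty in reproducing described above.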
@rohit-nayak-ps thanks for the update. First of all, I really appreciate your explanation. Also good work in finding and fixing the issue with the internal tables (#7159). Thanks for letting me know that having vstream running on the tablets is essential in reloading the schema of the tablet. In my case, all VStream API had failed due to #7159. No tracker was enabled by Will enabling tracker with |
There is an overhead of an additional vstreamer which will download the binlogs and do the minimal parsing required. Since it only deals with DDLs it is less than a regular vstream. Whether it is perceptible depends on the server configuration and write QPS. This is precisely why we disable it by default. Originally it was enabled by default, but we had a few customers in production who were affected by it. (iirc) Those with lots of small servers + high QPS saw spikes in CPU usage when they migrated to that version. The solution is for the tracker to be light-weight. I have done a quick POC by paring down the vstreamer functionality to a minimum and got over 60% reduction in cpu usage. To productionise it would however need a lot of testing since vstreamer would now follow different code paths based on whether it is a "lite" or regular version and vstreamers are in the core of vreplication. So it is not too high on our priority list at this moment. I will create an issue for this soon and if we find more support for it we can take it up earlier! |
Closing this. As discussed above, the recommended way to get around this issue is to enable the tracker in vttablet |
Overview of the Issue
Our Debezium Vitess Connector (CDC) uses VStream gRPC to stream change events from a sharded (2 shards: -80 and 80-) keyspace called test_sharded_keyspace.
When running the following gh-ost online schema migration:
VStream gRPC throws a server-side error:
Reproduction Steps
Steps to reproduce this issue:
Deploy the following vschema:
Deploy the following schema:
Run VStream gRPC client to continuously stream from the sharded keyspace test_sharded_keyspace where the table resides.
The table has 30 million rows.
Run vtctlclient -server vtctld-host:15999 ApplySchema -sql "ALTER WITH 'gh-ost' TABLE bar_entry add column status int" test_sharded_keyspace to start gh-ost online schema migration.
Run vtctlclient -server vtctld-host:15999 OnlineDDL test_sharded_keyspace show recent to check gh-ost job status, which changes from queued to running to complete on each shard (-80 and 80-).
Run show create table bar_entry\G and see the new column status is present.
VStream gRPC client received the following server-side error:
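The status-check step above can be scripted when repeating the reproduction. This is a minimal sketch, assuming `vtctlclient` is on the PATH and that each shard's status appears as a plain word (`queued`, `running`, `complete`) somewhere in the command output. The exact output layout is not shown in this issue, so the helper only matches tokens. The function names and the `run` parameter are hypothetical.

```python
import subprocess
import time

VTCTLD = "vtctld-host:15999"        # server address taken from the steps above
KEYSPACE = "test_sharded_keyspace"  # keyspace taken from the steps above


def show_recent_cmd(server=VTCTLD, keyspace=KEYSPACE):
    """Build the argv for the status command quoted in the steps."""
    return ["vtctlclient", "-server", server, "OnlineDDL", keyspace, "show", "recent"]


def count_statuses(output):
    """Count status words in the output (assumed layout: whitespace or '|' separated)."""
    tokens = output.replace("|", " ").split()
    return {s: tokens.count(s) for s in ("queued", "running", "complete")}


def wait_until_complete(shards=2, interval=5.0, run=subprocess.run):
    """Poll until every shard reports 'complete' and none is queued or running.

    Caveat: 'show recent' may list older migrations too, so this can return
    early if earlier completed migrations are still listed.
    """
    while True:
        out = run(show_recent_cmd(), capture_output=True, text=True, check=True).stdout
        counts = count_statuses(out)
        if counts["complete"] >= shards and not (counts["queued"] or counts["running"]):
            return counts
        time.sleep(interval)
```

Calling `wait_until_complete()` before starting a fresh VStream client gives a repeatable point at which the migration is finished but a periodic schema reload may not have happened yet.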
Binary version