Fix struct order for schema updates when using upsert/delete mode #368
Thanks @jurgispods for taking the time to write the detailed example and explain the issue. I will take some time to go over the changes and get back to you in two weeks. Meanwhile, would you be able to add an integration test for this change, please?
@b-goyal Sure, I can add an integration test.
@b-goyal I just added an integration test that reproduces the issue (and the fix). I found out that it only shows up under certain circumstances, i.e. when a schema update happens after the intermediate table has been deleted. Otherwise, the schemas of the destination and intermediate tables are always in sync, as they are updated using the same logic. As far as I can see, the intermediate table is only deleted when the connector is stopped. So in order to replicate the error, I had to write an integration test that is quite involved:
In order to show that the connector indeed fails, I added a config for toggling my fix on or off. That might not be necessary in the final PR, as in reality, it should always be on. We could instead test that with a unit test and remove the added config.
Thanks for adding the integration test @jurgispods.
Hi @b-goyal, is there an update on this?
When using the connector in upsert/delete mode, it can fail under certain circumstances when the schema is updated in such a way that the intermediate table and the destination table have differently ordered nested struct fields.
Example scenario
Schema version 1
Assume the Kafka source topic has the following Avro schema (version 1):
The corresponding BigQuery destination table schema:
Schema version 2
Now, the source topic schema is updated to version 2:
The problem now is that the BigQuery schemas of the intermediate and destination tables will list their nested fields in different orders.
BigQuery schema of the intermediate table after creation:
Updated BigQuery destination table schema (note that the new field `maxAmount` is appended at the end):
The connector will subsequently fail during the periodic merge flush:
This can be easily seen by looking at the executed MERGE queries.
Comparison of executed MERGE queries
This query will fail due to different orders of nested fields:
In contrast, this query succeeds:
We can see that for upserts, the order of struct fields matters.
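To make the mismatch concrete, here is a minimal sketch of how one could detect nested structs whose field order differs between two schemas. It is not connector code. Schemas are modelled as plain lists of dicts in the style of BigQuery's JSON schema representation (`name`, `type`, `fields`), and apart from `maxAmount`, all field names (`id`, `limits`, `minAmount`) are made up for illustration.

```python
def struct_order_mismatches(left, right, path=""):
    """Return dotted paths of nested structs whose sub-field name lists differ.

    Only nested structs are compared, since the issue described here concerns
    the order of struct fields. Top-level column order is not checked.
    """
    right_by_name = {f["name"]: f for f in right}
    mismatches = []
    for field in left:
        other = right_by_name.get(field["name"])
        # Skip fields that are missing on the other side or are not structs.
        if other is None or "fields" not in field or "fields" not in other:
            continue
        child_path = f"{path}.{field['name']}" if path else field["name"]
        left_names = [f["name"] for f in field["fields"]]
        right_names = [f["name"] for f in other["fields"]]
        if left_names != right_names:
            mismatches.append(child_path)
        # Recurse to catch mismatches in deeper levels of nesting.
        mismatches.extend(
            struct_order_mismatches(field["fields"], other["fields"], child_path)
        )
    return mismatches


# Hypothetical schemas: the intermediate table follows the record's order,
# while the destination table had maxAmount appended at the end.
intermediate = [
    {"name": "id", "type": "STRING"},
    {"name": "limits", "type": "RECORD", "fields": [
        {"name": "maxAmount", "type": "FLOAT"},
        {"name": "minAmount", "type": "FLOAT"},
    ]},
]
destination = [
    {"name": "id", "type": "STRING"},
    {"name": "limits", "type": "RECORD", "fields": [
        {"name": "minAmount", "type": "FLOAT"},
        {"name": "maxAmount", "type": "FLOAT"},
    ]},
]

print(struct_order_mismatches(intermediate, destination))
```

A check like this could be useful in a test or as a diagnostic before the merge flush, but whether it belongs in the connector itself is a separate question.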
Proposed changes
In this PR, I have added the destination table schema to the list returned by `SchemaManager.getSchemasList` when it is called for an intermediate table in upsert/merge mode. That way, the intermediate table schema is forced to respect the order of nested fields in the destination table schema: schema updates are simply applied on top of it, ensuring the same field order in both tables when new fields are added.
Please let me know what you think of this approach.
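The idea of applying schema updates on top of the destination schema can be sketched roughly as follows. This is an illustration of the ordering principle only, not the actual `SchemaManager` logic. The function name `merge_preserving_order` and all field names other than `maxAmount` are hypothetical, and schemas are again modelled as plain lists of dicts.

```python
import copy


def merge_preserving_order(base, update):
    """Apply `update` on top of `base`, keeping the field order of `base`.

    Fields already present in `base` stay in place. Fields that only exist
    in `update` are appended at the end. Nested structs are merged recursively.
    """
    merged = copy.deepcopy(base)
    index = {f["name"]: i for i, f in enumerate(merged)}
    for field in update:
        name = field["name"]
        if name not in index:
            # New field: append it, as happens for the destination table.
            index[name] = len(merged)
            merged.append(copy.deepcopy(field))
        elif "fields" in field and "fields" in merged[index[name]]:
            existing = merged[index[name]]
            existing["fields"] = merge_preserving_order(
                existing["fields"], field["fields"]
            )
    return merged


# Hypothetical destination schema (version 1) and incoming record schema
# (version 2), in which maxAmount was added in the middle of the struct.
destination_v1 = [
    {"name": "limits", "type": "RECORD", "fields": [
        {"name": "minAmount", "type": "FLOAT"},
        {"name": "currency", "type": "STRING"},
    ]},
]
record_v2 = [
    {"name": "limits", "type": "RECORD", "fields": [
        {"name": "minAmount", "type": "FLOAT"},
        {"name": "maxAmount", "type": "FLOAT"},
        {"name": "currency", "type": "STRING"},
    ]},
]

intermediate_schema = merge_preserving_order(destination_v1, record_v2)
nested_names = [f["name"] for f in intermediate_schema[0]["fields"]]
print(nested_names)
```

Because the destination schema is used as the base, the newly created intermediate table ends up with `maxAmount` at the end of the struct, in the same position the destination table gets when the field is appended there.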