[HUDI-8219] add concurrent schema evolution conflict detection #12781
Open: Davis-Zhang-Onehouse wants to merge 11 commits into apache:master from Davis-Zhang-Onehouse:HUDI-8219-se
+2,919
−677
This PR adds test utilities for use by a subsequent unit-test PR. It also fixes an issue in HoodieTestTable where it might not pick the right instant serializer when configured with different table versions.
These are poorly written and are revised in the next commit to use HoodieTestTable, with more comprehensive coverage.
Why it is broken: The Hive sync schema contains the Hudi meta columns, but the test expects it not to contain them.
Why it passed before: The config HIVE_SYNC_OMIT_METADATA_FIELDS has always been "false", meaning Hive sync should include meta columns. Previously, however, the table schema resolver did not honor the flag: regardless of its value, no meta columns were included. The new table schema resolver, introduced with exhaustive test coverage, behaves correctly and honors the "includeMetaFields" flag. As a result, Hive sync now returns the correct schema, but the test was validating against a wrong expectation.
How it is fixed: HIVE_SYNC_OMIT_METADATA_FIELDS is set to true in the test, so the schema that Hive sync produces contains no meta fields and matches what the test validates.
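The relationship between the two flags can be illustrated with a small sketch. This is a simplified model, not Hudi's actual resolver API: the class `MetaFieldSketch` and both method names are hypothetical, and the meta column names are the standard Hudi meta columns as I understand them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of how an "omit metadata fields" sync config maps onto
// an "includeMetaFields" resolver flag. Not Hudi's real classes or signatures.
final class MetaFieldSketch {
  // Assumed list of Hudi meta columns, used here for illustration only.
  static final List<String> META_FIELDS = Arrays.asList(
      "_hoodie_commit_time",
      "_hoodie_commit_seqno",
      "_hoodie_record_key",
      "_hoodie_partition_path",
      "_hoodie_file_name");

  // A resolver that honors the flag: meta columns appear only when requested.
  static List<String> resolveFieldNames(List<String> dataFields, boolean includeMetaFields) {
    List<String> fields = new ArrayList<>();
    if (includeMetaFields) {
      fields.addAll(META_FIELDS);
    }
    fields.addAll(dataFields);
    return fields;
  }

  // The sync side: omitting metadata fields means NOT including them.
  static List<String> syncedFieldNames(List<String> dataFields, boolean omitMetadataFields) {
    return resolveFieldNames(dataFields, !omitMetadataFields);
  }
}
```

With `omitMetadataFields` set to false, the result contains the meta columns followed by the data columns; with it set to true, only the data columns remain, which is what the corrected test validates.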
This commit addresses 2 issues:
- Whenever the table schema resolver searches the timeline, it should use a reverse-ordered stream so that evaluation is lazy. Previously it always processed every instant and built a new timeline first. Since the resolver will be used in the commit code path, which is performance sensitive, this change is necessary.
- Added another fallback to preserve the previous behavior: in the rare case where only compaction is in the timeline, we parse whatever usable writer schema is available as the table schema.
Test: Added tests for the fallback behavior.
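The two points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `SketchInstant` and `LazySchemaLookup` are hypothetical stand-ins for Hudi's timeline and resolver classes, and the `inspected` counter exists only to make the laziness observable.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical stand-in for a timeline instant; not Hudi's HoodieInstant.
final class SketchInstant {
  final String time;
  final String action;        // e.g. "commit" or "compaction"
  final String writerSchema;  // null when the instant carries no usable schema

  SketchInstant(String time, String action, String writerSchema) {
    this.time = time;
    this.action = action;
    this.writerSchema = writerSchema;
  }
}

final class LazySchemaLookup {
  // Counts instants inspected by the primary search (illustration only).
  static final AtomicInteger inspected = new AtomicInteger();

  // Lazily yields instants from newest to oldest without copying the list.
  static Stream<SketchInstant> reversed(List<SketchInstant> timeline) {
    int n = timeline.size();
    return IntStream.range(0, n).mapToObj(i -> timeline.get(n - 1 - i));
  }

  static Optional<String> latestSchema(List<SketchInstant> timeline) {
    // Primary search: newest non-compaction instant that has a writer schema.
    // findFirst() short-circuits, so older instants are never visited.
    Optional<String> fromCommit = reversed(timeline)
        .peek(i -> inspected.incrementAndGet())
        .filter(i -> !"compaction".equals(i.action) && i.writerSchema != null)
        .map(i -> i.writerSchema)
        .findFirst();
    if (fromCommit.isPresent()) {
      return fromCommit;
    }
    // Fallback: only compaction is in the timeline, so take whatever usable
    // writer schema exists, again searching newest first.
    return reversed(timeline)
        .filter(i -> i.writerSchema != null)
        .map(i -> i.writerSchema)
        .findFirst();
  }
}
```

Because the stream is evaluated on demand, a long timeline whose newest commit carries a schema costs only one or two inspections, rather than a full pass that materializes a new timeline first.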
Change Logs
Please refer to RFC #12005
https://github.com/apache/hudi/blob/master/rfc/rfc-82/rfc-82.md
Impact
Concurrent schema evolution is now detected as a conflict instead of leaving the data in an inconsistent state.
The added functional tests took 24.6 seconds to run on my MacBook.
Risk level (write none, low, medium or high below)
none
Documentation Update
Please refer to RFC-82
Contributor's checklist