Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8219] add concurrent schema evolution conflict detection #12781

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

Davis-Zhang-Onehouse
Copy link
Contributor

@Davis-Zhang-Onehouse Davis-Zhang-Onehouse commented Feb 5, 2025

Change Logs

Please refer RFC #12005
https://github.com/apache/hudi/blob/master/rfc/rfc-82/rfc-82.md

Impact

Concurrent schema evolution will be protected instead of leaving the data in an inconsistent state
The functional tests added took 24.6 sec on my mac book to execute.

Risk level (write none, low medium or high below)

none

Documentation Update

Please refer RFC-82

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

The PR added some test utils to be used by subsequent unit test PR.
Also fixed issue with HoodieTestTable that it might not pick the right
instant serializer when configured with different table versions.
These are badly written and is revised to use HoodieTestTable with more
comprehensive coverage in the next commit
@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Feb 5, 2025
@Davis-Zhang-Onehouse Davis-Zhang-Onehouse force-pushed the HUDI-8219-se branch 2 times, most recently from 944e03f to cb46d26 Compare February 6, 2025 02:54
@Davis-Zhang-Onehouse Davis-Zhang-Onehouse changed the title Hudi 8219 se [HUDI-8219] add concurrent schema evolution conflict detection Feb 6, 2025
Davis-Zhang-Onehouse and others added 4 commits February 7, 2025 10:28
Why it is broken:
The hive sync schema contains hoodie meta columns. The test expect it to
not contain.

Why it passes before:
The config HIVE_SYNC_OMIT_METADATA_FIELDS has always been "false",
meaning hive sync should include meta columns, but previously the table
schema resolver does not honor the flag, regardless of what it is set no
meta column is included.

Now as the new table resolver is introduced with exhaustive test
coverage, it always behaves correctly and honor the "includeMetaFields"
flag. As a result, the hive sync now returns the correct result but we
are validating against something wrong.

How it is fixed:
I set HIVE_SYNC_OMIT_METADATA_FIELDS to true so the schema hive sync got
does not contain meta field and matches what we validate.
The Commit addresses 2 issues:
- Whenever table schema resolver poke into the timeline searching for
  something, it should always use reversed order stream for lazy
  evaluation. Previously it always process all the instant and build a
  new timeline first.
  Since we will use it in the commit code path which is performance
  sensitive, such change is necessary.
- Added another fallback behavior to make sure it has the same behavior
  as before - there can be cases only compaction is in the timeline,
  which should be super rare, in that case, we parse whatever usable
  writer schema as the table schema.

Test:
Added tests on the fallback behavior.
@hudi-bot
Copy link

hudi-bot commented Feb 7, 2025

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:XL PR with lines of changes > 1000
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants