[HUDI-5096] Upgrade jcommander to 1.78 #7068
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java
...udi-hive-sync/src/main/java/org/apache/hudi/hive/replication/HiveSyncGlobalCommitParams.java
...i-hive-sync/src/test/java/org/apache/hudi/hive/replication/TestHiveSyncGlobalCommitTool.java
- @Parameter(names = {"--auto-create-database"}, description = "Auto create hive database")
- public Boolean autoCreateDatabase;
+ @Parameter(names = {"--auto-create-database"}, description = "Auto create hive database", arity = 1)
+ public boolean autoCreateDatabase = true;
  @Parameter(names = {"--ignore-exceptions"}, description = "Ignore hive exceptions")
- public Boolean ignoreExceptions;
+ public boolean ignoreExceptions;
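As general Java background for the `Boolean` to `boolean` change in this diff (not a claim about where the reported NPE actually occurred), the sketch below shows that negating an unassigned boxed `Boolean` throws a `NullPointerException`, while a primitive `boolean` field starts out as `false`. The class and field names are hypothetical and are not taken from `HiveSyncConfig`.

```java
// General Java illustration; BoxedFlagDemo and Params are hypothetical names.
class BoxedFlagDemo {
    static class Params {
        Boolean boxedFlag;      // boxed: stays null until something assigns it
        boolean primitiveFlag;  // primitive: implicitly initialized to false
    }

    /** Returns true if negating the boxed field throws a NullPointerException. */
    static boolean negatingBoxedThrows(Params p) {
        try {
            boolean negated = !p.boxedFlag; // auto-unboxing a null Boolean throws
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Negating a primitive field can never throw. */
    static boolean negatePrimitive(Params p) {
        return !p.primitiveFlag;
    }

    public static void main(String[] args) {
        Params p = new Params();
        System.out.println("boxed negation throws: " + negatingBoxedThrows(p));
        System.out.println("primitive negation: " + negatePrimitive(p));
    }
}
```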
We can't just add a default of true to some variables and leave the others false. What if a config's default changes from false to true? We should point these to the properties' defaults.
@xushiyan Changing the default value is a problem.
To double-check with you, is it OK to do it like this?
`boolean useJdbc = Boolean.parseBoolean(HIVE_USE_JDBC.defaultValue());`
That means that for boolean params we must pass `--key true/false` instead of `--key` if we want to override the default value.
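A minimal sketch of the `--key true/false` usage proposed here, written as a toy parser rather than with JCommander. `ExplicitBooleanArgs`, `parseUseJdbc`, and the `USE_JDBC_DEFAULT` constant are hypothetical stand-ins; the constant plays the role of `HIVE_USE_JDBC.defaultValue()` and its value `"true"` is assumed for illustration.

```java
// Toy parser, not JCommander: the boolean option must be followed by "true" or "false".
class ExplicitBooleanArgs {
    // Hypothetical stand-in for a config property's string default,
    // playing the role of HIVE_USE_JDBC.defaultValue(); the value is assumed.
    static final String USE_JDBC_DEFAULT = "true";

    static boolean parseUseJdbc(String[] args) {
        // Start from the property default, as in the one-liner quoted above.
        boolean useJdbc = Boolean.parseBoolean(USE_JDBC_DEFAULT);
        for (int i = 0; i < args.length; i++) {
            if ("--use-jdbc".equals(args[i])) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("--use-jdbc requires true or false");
                }
                String value = args[++i];
                if (!value.equals("true") && !value.equals("false")) {
                    throw new IllegalArgumentException("--use-jdbc requires true or false, got: " + value);
                }
                // An explicit value always wins over the default.
                useJdbc = Boolean.parseBoolean(value);
            }
        }
        return useJdbc;
    }

    public static void main(String[] args) {
        System.out.println(parseUseJdbc(args));
    }
}
```

With this shape the option's meaning does not depend on its default, but a bare `--use-jdbc` becomes an error, which is the usage change under discussion.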
@xicm Thanks for making further changes. Making it work like `--key true/false` is a breaking change for users, so we have to avoid it. But this is an existing problem, correct? From the beginning, regardless of whether users set `--use-jdbc` or not, because `useJdbc` defaults to `true`, users won't be able to override it at all. Let's keep the behavior unchanged (not ideal, but it's not an urgent problem) and focus on fixing the reported issue.
@xushiyan I reverted the latest commit.
For the params that default to true (`useJdbc`, `autoCreateDatabase`, `syncAsSparkDataSourceTable`), I set `arity = 1` and the default value to true, so users can override them and the other UTs pass.
@xicm This leads to inconsistency: some boolean args are `arity=0` and some are `arity=1`, which will cause confusion and won't be a good usage pattern. Besides, this basically ties usage to default values: if an option's default changes from true to false, we would need to change its arity, and that would break users. Please avoid any usage change in this PR and only tackle the NPE problem. As I mentioned, this is an existing problem and should be addressed separately, with more consideration for user experience.
Thanks for your explanation. I updated the pr.
@xicm @xushiyan I am a bit late to the party, but just wanted to point out cbeust/jcommander#378.
This means you can still override parameters that default to true, unlike what you seem to believe. Worse, if they're set to true by default, adding them in the CLI tool will toggle them to false. Adding them multiple times will also change the effect (easy to verify now that a test was added).
Therefore, the behavior in Hudi has already changed since #4175. Before that, the Boolean JCommander parameters had default values, such as `public Boolean syncAsSparkDataSourceTable = true;`, which meant adding `--spark-datasource` actually set this to false. Now, on the other hand, since the default value comes from `HIVE_SYNC_AS_DATA_SOURCE_TABLE` and not through JCommander, adding `--spark-datasource` also sets this to true. So it is already quite confusing. With no defaults set in JCommander, as is currently the case, it may make more sense (at least it's consistent), but then to disable spark-datasource I think I'll actually have to add `--spark-datasource --spark-datasource`: if I don't add it, the default `HIVE_SYNC_AS_DATA_SOURCE_TABLE` value (=true) is used; if I add it once, JCommander also sets it to true; if I add it twice, JCommander flips it to false.
Based on this I'm personally in favor of enforcing `arity=1` for all booleans. It makes clear what will happen and will throw an error if someone does not specify an arity, so then it's at least clear that the behavior changed (which it isn't now).
But this probably deserves a separate issue?
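The toggling described in this comment can be written down as a small model. This is only a sketch of the behavior as the comment states it, under the assumption that each occurrence of an arity-0 flag negates the current value and that an unset value counts as false; it does not call JCommander, and `ToggleModel` and `resolve` are hypothetical names.

```java
// Toy model of the behavior described in the comment above; it does not use JCommander.
class ToggleModel {
    /**
     * @param parserDefault   initial value of the parser-side field, or null if none is set
     * @param propertyDefault fallback from the config property when the flag never appears
     * @param occurrences     how many times the arity-0 flag appears on the command line
     */
    static boolean resolve(Boolean parserDefault, boolean propertyDefault, int occurrences) {
        Boolean value = parserDefault;
        for (int i = 0; i < occurrences; i++) {
            // Assumption: each occurrence negates the current value; unset counts as false.
            value = (value == null) ? Boolean.TRUE : !value;
        }
        // If the parser never produced a value, fall back to the property default.
        return value != null ? value : propertyDefault;
    }

    public static void main(String[] args) {
        System.out.println("absent: " + resolve(null, true, 0));
        System.out.println("once:   " + resolve(null, true, 1));
        System.out.println("twice:  " + resolve(null, true, 2));
        System.out.println("field default true, once: " + resolve(Boolean.TRUE, true, 1));
    }
}
```

Under these assumptions, a property default of true can only be turned off by passing the flag twice, and a field that starts at true is turned off by passing it once, which matches the two cases in the comment.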
@hudi-bot run azure
@xicm I found that upgrading to 1.78 resolves the original NPE issue.
Experts tell the truth. 👍
- resolves security vulnerability
- resolves NPE issues with HiveSyncTool args parsing
Co-authored-by: Raymond Xu <[email protected]>
Modi Co-authored-by: shuai.xu Co-authored-by: 徐帅 Co-authored-by: YueZhang Co-authored-by: yuezhang Co-authored-by: 董可伦 Co-authored-by: Angel Conde Co-authored-by: Angel Conde Manjon Co-authored-by: FocusComputing Co-authored-by: xiaoxingstack Co-authored-by: eric9204 Co-authored-by: dongsj Co-authored-by: Alexey Kudinkin Co-authored-by: Manu Co-authored-by: Volodymyr Burenin Co-authored-by: wuwenchi Co-authored-by: 吴文池 Co-authored-by: luokey Co-authored-by: Danny Chan Co-authored-by: Sylwester Lachiewicz Co-authored-by: komao Co-authored-by: KnightChess Co-authored-by: voonhous Co-authored-by: 苏承祥 Co-authored-by: 5herhom Co-authored-by: Jon Vexler Co-authored-by: simonsssu Co-authored-by: y0908105023 Co-authored-by: yangshuo3 Co-authored-by: 冯健 Co-authored-by: jian.feng Co-authored-by: Paul Zhang Co-authored-by: Kyle Zhike Chen Co-authored-by: Yann Byron Co-authored-by: Shiyan Xu Co-authored-by: dohongdayi Co-authored-by: shaoxiong.zhan Co-authored-by: zhanshaoxiong Co-authored-by: Nicolas Paris Co-authored-by: sivabalan Co-authored-by: RexAn
Co-authored-by: ForwardXu Co-authored-by: wangxianghu Co-authored-by: wulei Co-authored-by: Xingjun Wang Co-authored-by: Prasanna Rajaperumal Co-authored-by: xingjunwang Co-authored-by: liujinhui Co-authored-by: ChanKyeong Won Co-authored-by: Zouxxyy Co-authored-by: Nicholas Jiang Co-authored-by: Forus Co-authored-by: TengHuo Co-authored-by: hj2016 Co-authored-by: huangjing02 Co-authored-by: jsbali Co-authored-by: Leon Tsao Co-authored-by: leon Co-authored-by: 申胜利 Co-authored-by: aiden.dong Co-authored-by: dujunling Co-authored-by: Pramod Biligiri Co-authored-by: Surya Prasanna Co-authored-by: Rajesh Mahindra Co-authored-by: rmahindra123 Co-authored-by: huberylee Co-authored-by: slfan1989 Co-authored-by: 吴祥平 Co-authored-by: wangzeyu Co-authored-by: vvsd Co-authored-by: Zhaojing Yu Co-authored-by: Bingeng Huang Co-authored-by: hbg Co-authored-by: that's cool Co-authored-by: liufangqi.chenfeng Co-authored-by: gavin Co-authored-by: Xixi Hua Co-authored-by: xxhua Co-authored-by: YangXiao Co-authored-by: chao chen
Co-authored-by: Zhangshunyu Co-authored-by: Long Zhao Co-authored-by: z00484332 Co-authored-by: 矛始 Co-authored-by: chenzhiming Co-authored-by: lvhu-goodluck Co-authored-by: harshal patil Co-authored-by: Vinish Reddy
* [HUDI-4282] Repair IOException in CHDFS when check block corrupted in HoodieLogFileReader (apache#6031) Co-authored-by: Y Ethan Guo <[email protected]> * [HUDI-4757] Create pyspark examples (apache#6672) * [HUDI-3959] Rename class name for spark rdd reader (apache#5409) Co-authored-by: Y Ethan Guo <[email protected]> * [HUDI-4828] Fix the extraction of record keys which may be cut out (apache#6650) Co-authored-by: yangshuo3 <[email protected]> Co-authored-by: Y Ethan Guo <[email protected]> * [HUDI-4873] Report number of messages to be processed via metrics (apache#6271) Co-authored-by: Volodymyr Burenin <[email protected]> Co-authored-by: Y Ethan Guo <[email protected]> * [HUDI-4870] Improve compaction config description (apache#6706) * [HUDI-3304] Support partial update payload (apache#4676) Co-authored-by: jian.feng <[email protected]> * [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo… (apache#6630) * [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in log file issue Co-authored-by: xiaoxingstack <[email protected]> * [HUDI-4485] Bump spring shell to 2.1.1 in CLI (apache#6489) Bumped spring shell to 2.1.1 and updated the default value for show fsview all `pathRegex` parameter. * [minor] following 3304, some code refactoring (apache#6713) * [HUDI-4832] Fix drop partition meta sync (apache#6662) * [HUDI-4810] Fix log4j imports to use bridge API (apache#6710) Co-authored-by: dongsj <[email protected]> * [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not work correct issue (apache#6717) Co-authored-by: xiaoxingstack <[email protected]> * [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool (apache#5920) - This pull request fix [SUPPORT] Hudi spark datasource error after migrate from 0.8 to 0.11 apache#5861* - The issue is caused by after changing the table to spark data source table, the table SerDeInfo is missing. 
* Co-authored-by: Sagar Sumit <[email protected]> * [MINOR] fix indent to make build pass (apache#6721) * [HUDI-3478] Implement CDC Write in Spark (apache#6697) * [HUDI-4326] Fix hive sync serde properties (apache#6722) * [HUDI-4875] Fix NoSuchTableException when dropping temporary view after applied HoodieSparkSessionExtension in Spark 3.2 (apache#6709) * [DOCS] Improve the quick start guide for Kafka Connect Sink (apache#6708) * [HUDI-4729] Fix file group pending compaction cannot be queried when query _ro table (apache#6516) File group in pending compaction can not be queried when query _ro table with spark. This commit fixes that. Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com> Co-authored-by: Sagar Sumit <[email protected]> * [HUDI-3983] Fix ClassNotFoundException when using hudi-spark-bundle to write table with hbase index (apache#6715) * [HUDI-4758] Add validations to java spark examples (apache#6615) * [HUDI-4792] Batch clean files to delete (apache#6580) This patch makes use of batch call to get fileGroup to delete during cleaning instead of 1 call per partition. This limit the number of call to the view and should fix the trouble with metadata table in context of lot of partitions. 
Fixes issue apache#6373 Co-authored-by: sivabalan <[email protected]> * [HUDI-4363] Support Clustering row writer to improve performance (apache#6046) * [HUDI-3478][HUDI-4887] Use Avro as the format of persisted cdc data (apache#6734) * [HUDI-4851] Fixing handling of `UTF8String` w/in `InSet` operator (apache#6739) Co-authored-by: Raymond Xu <[email protected]> * [HUDI-3901] Correct the description of hoodie.index.type (apache#6749) * [MINOR] Add .mvn directory to gitignore (apache#6746) Co-authored-by: Rahil Chertara <[email protected]> * add support for unraveling proto schemas * fix some compile issues * [HUDI-4901] Add avro.version to Flink profiles (apache#6757) * Add avro.version to Flink profiles Co-authored-by: Shawn Chang <[email protected]> * [HUDI-4559] Support hiveSync command based on Call Produce Command (apache#6322) * [HUDI-4883] Supporting delete savepoint for MOR (apache#6744) Users could delete unnecessary savepoints and unblock archival for MOR table. * [HUDI-4897] Refactor the merge handle in CDC mode (apache#6740) * [HUDI-3523] Introduce AddColumnSchemaPostProcessor to support add columns to the end of a schema (apache#5031) * Revert "[HUDI-3523] Introduce AddColumnSchemaPostProcessor to support add columns to the end of a schema (apache#5031)" (apache#6768) This reverts commit 092375f. 
* [HUDI-3523] Introduce AddPrimitiveColumnSchemaPostProcessor to support add new primitive column to the end of a schema (apache#6769) * [HUDI-4903] Fix TestHoodieLogFormat`s minor typo (apache#6762) * [MINOR] Drastically reducing concurrency level (to avoid CI flakiness) (apache#6754) * Update HoodieIndex.java Fix a typo * [HUDI-4906] Fix the local tests for hudi-flink (apache#6763) * [HUDI-4899] Fixing compatibility w/ Spark 3.2.2 (apache#6755) * [HUDI-4892] Fix hudi-spark3-bundle (apache#6735) * [MINOR] Fix a few typos in HoodieIndex (apache#6784) Co-authored-by: xingjunwang <[email protected]> * [HUDI-4412] Fix multi writer INSERT_OVERWRITE NPE bug (apache#6130) There are two minor issues fixed here: 1. When the insert_overwrite operation is performed, the clusteringPlan in the requestedReplaceMetadata will be null. Calling getFileIdsFromRequestedReplaceMetadata will cause NPE. 2. When insert_overwrite operation, inflightCommitMetadata!=null, getOperationType should be obtained from getHoodieInflightReplaceMetadata, the original code will have a null pointer. * [MINOR] retain avro's namespace (apache#6783) * [MINOR] Simple logging fix in LockManager (apache#6765) Co-authored-by: 苏承祥 <[email protected]> * [HUDI-4433] hudi-cli repair deduplicate not working with non-partitioned dataset (apache#6349) When using the repair deduplicate command with hudi-cli, there is no way to run it on the unpartitioned dataset, so modify the cli parameter. 
Co-authored-by: Xingjun Wang <[email protected]> * [RFC-51][HUDI-3478] Update RFC: CDC support (apache#6256) * [HUDI-4915] improve avro serializer/deserializer (apache#6788) * [HUDI-3478] Implement CDC Read in Spark (apache#6727) * naming and style updates * [HUDI-4830] Fix testNoGlobalConfFileConfigured when add hudi-defaults.conf in default dir (apache#6652) * make test data random, reuse code * [HUDI-4760] Fixing repeated trigger of data file creations w/ clustering (apache#6561) - Apparently in clustering, data file creations are triggered twice since we don't cache the write status and for doing some validation, we do isEmpty on JavaRDD which ended up retriggering the action. Fixing the double de-referencing in this patch. * [HUDI-4914] Managed memory weight should be set when sort clustering is enabled (apache#6792) * [HUDI-4910] Fix unknown variable or type "Cast" (apache#6778) * [HUDI-4918] Fix bugs about when trying to show the non -existing key from env, NullPointException occurs. (apache#6794) * [HUDI-4718] Add Kerberos kinit command support. (apache#6719) * add test for 2 different recursion depths, fix schema cache key * add unsigned long support * better handle other types * rebase on 4904 * get all tests working * fix oneof expected schema, update tests after rebase * [HUDI-4902] Set default partitioner for SIMPLE BUCKET index (apache#6759) * [MINOR] Update PR template with documentation update (apache#6748) * revert scala binary change * try a different method to avoid avro version * [HUDI-4904] Add support for unraveling proto schemas in ProtoClassBasedSchemaProvider (apache#6761) If a user provides a recursive proto schema, it will fail when we write to parquet. We need to allow the user to specify how many levels of recursion they want before truncating the remaining data. 
Main changes to existing code: ProtoClassBasedSchemaProvider tracks number of times a message descriptor is seen within a branch of the schema traversal once the number of times that descriptor is seen exceeds the user provided limit, set the field to preset record that will contain two fields: 1) the remaining data serialized as a proto byte array, 2) the descriptors full name for context about what is in that byte array Converting from a proto to an avro now accounts for this truncation of the input * delete unused file * [HUDI-4907] Prevent single commit multi instant issue (apache#6766) Co-authored-by: TengHuo <[email protected]> Co-authored-by: yuzhao.cyz <[email protected]> * [HUDI-4923] Fix flaky TestHoodieReadClient.testReadFilterExistAfterBulkInsertPrepped (apache#6801) Co-authored-by: Raymond Xu <[email protected]> * [HUDI-4848] Fixing repair deprecated partition tool (apache#6731) * [HUDI-4913] Fix HoodieSnapshotExporter for writing to a different S3 bucket or FS (apache#6785) * address PR feedback, update decimal precision * fix isNullable issue, check if class is Int64value * checkstyle fix * change wrapper descriptor set initialization * add in testing for unsigned long to BigInteger conversion * [HUDI-4453] Fix schema to include partition columns in bootstrap operation (apache#6676) Turn off the type inference of the partition column to be consistent with existing behavior. Add notes around partition column type inference. * [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data (apache#4015) Co-authored-by: huangjing02 <[email protected]> Co-authored-by: sivabalan <[email protected]> * [HUDI-4924] Auto-tune dedup parallelism (apache#6802) * [HUDI-4687] Avoid setAccessible which breaks strong encapsulation (apache#6657) Use JOL GraphLayout for estimating deep size. 
* [MINOR] fixing validate async operations to poll completed clean instances (apache#6814) * [HUDI-4734] Deltastreamer table config change validation (apache#6753) Co-authored-by: sivabalan <[email protected]> * [HUDI-4934] Revert batch clean files (apache#6813) * Revert "[HUDI-4792] Batch clean files to delete (apache#6580)" This reverts commit cbf9b83. * [HUDI-4722] Added locking metrics for Hudi (apache#6502) * [HUDI-4936] Fix `as.of.instant` not recognized as hoodie config (apache#5616) Co-authored-by: leon <[email protected]> Co-authored-by: Raymond Xu <[email protected]> * [HUDI-4861] Relaxing `MERGE INTO` constraints to permit limited casting operations w/in matched-on conditions (apache#6820) * [HUDI-4885] Adding org.apache.avro to hudi-hive-sync bundle (apache#6729) * [HUDI-4951] Fix incorrect use of Long.getLong() (apache#6828) * [MINOR] Use base path URI in ITTestDataStreamWrite (apache#6826) * [HUDI-4308] READ_OPTIMIZED read mode will temporary loss of data when compaction (apache#6664) Co-authored-by: Y Ethan Guo <[email protected]> * [HUDI-4237] Fixing empty partition-values being sync'd to HMS (apache#6821) Co-authored-by: dujunling <[email protected]> Co-authored-by: Raymond Xu <[email protected]> * [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand (apache#6355) Co-authored-by: jian.feng <[email protected]> * [HUDI-4850] Add incremental source from GCS to Hudi (apache#6665) Adds an incremental source from GCS based on a similar design as https://hudi.apache.org/blog/2021/08/23/s3-events-source * [HUDI-4957] Shade JOL in bundles to fix NoClassDefFoundError:GraphLayout (apache#6839) * [HUDI-4718] Add Kerberos kdestroy command support (apache#6810) * [HUDI-4916] Implement change log feed for Flink (apache#6840) * [HUDI-4769] Option read.streaming.skip_compaction skips delta commit (apache#6848) * [HUDI-4949] optimize cdc read to avoid the problem of reusing buffer underlying the Row (apache#6805) * [HUDI-4966] Add a partition 
extractor to handle partition values with slashes (apache#6851) * [MINOR] Fix testUpdateRejectForClustering (apache#6852) * [HUDI-4962] Move cloud dependencies to cloud modules (apache#6846) * [HOTFIX] Fix source release validate script (apache#6865) * [HUDI-4980] Calculate avg record size using commit only (apache#6864) Calculate average record size for Spark upsert partitioner based on commit instants only. Previously it's based on commit and replacecommit, of which the latter may be created by clustering which has inaccurately smaller average record sizes, which could result in OOM due to size underestimation. * shade protobuf dependency * Revert "[HUDI-4915] improve avro serializer/deserializer (apache#6788)" (apache#6809) This reverts commit 79b3e2b. * [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create (apache#6857) * Enhancing README for multi-writer tests (apache#6870) * [MINOR] Fix deploy script for flink 1.15 (apache#6872) * [HUDI-4992] Fixing invalid min/max record key stats in Parquet metadata (apache#6883) * Revert "shade protobuf dependency" This reverts commit f03f961. * [HUDI-4972] Fixes to make unit tests work on m1 mac (apache#6751) * [HUDI-2786] Docker demo on mac aarch64 (apache#6859) * [HUDI-4971] Fix shading kryo-shaded with reusing configs (apache#6873) * [HUDI-3900] [UBER] Support log compaction action for MOR tables (apache#5958) - Adding log compaction support to MOR table. subsequent log blocks can now be compacted into larger log blocks without needing to go for full compaction (by merging w/ base file). - New timeline action is introduced for the purpose. Co-authored-by: sivabalan <[email protected]> * Relocate apache http package (apache#6874) * [HUDI-4975] Fix datahub bundle dependency (apache#6896) * [HUDI-4999] Refactor FlinkOptions#allOptions and CatalogOptions#allOptions (apache#6901) * [MINOR] Update GitHub setting for merge button (apache#6922) Only allow squash and merge. 
Disable merge and rebase * [HUDI-4993] Make DataPlatform name and Dataset env configurable in DatahubSyncTool (apache#6885) * [MINOR] Fix name spelling for RunBootstrapProcedure * [HUDI-4754] Add compliance check in github actions (apache#6575) * [HUDI-4963] Extend InProcessLockProvider to support multiple table ingestion (apache#6847) Co-authored-by: rmahindra123 <[email protected]> * [HUDI-4994] Fix bug that prevents re-ingestion of soft-deleted Datahub entities (apache#6886) * Implement Create/Drop/Show/Refresh Secondary Index (apache#5933) * [MINOR] Moved readme from .github to the workflows folder (apache#6932) * [HUDI-4952] Fixing reading from metadata table when there are no inflight commits (apache#6836) * Fixing reading from metadata table when there are no inflight commits * Fixing reading from metadata if not fully built out * addressing minor comments * fixing sql conf and options interplay * addressing minor refactoring * [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer (apache#6003) Co-authored-by: yuezhang <[email protected]> Co-authored-by: yuezhang <[email protected]> Co-authored-by: Y Ethan Guo <[email protected]> * [HUDI-5006] Use the same wrapper for timestamp type metadata for parquet and log files (apache#6918) Before this patch, for timestamp type, we use LongWrapper for parquet and TimestampMicrosWrapper for avro log, they may keep different precision val here, for example, with timestamp(3), LongWrapper keeps the val as a millisecond long from EPOCH instant, while TimestampMicrosWrapper keeps the val as micro-seconds. For spark, it uses micro-seconds internally for timestamp type value, while flink uses the TimestampData internally, we better keeps the same precision for better compatibility here. 
* [HUDI-5016] Flink clustering does not reserve commit metadata (apache#6929) * [HUDI-3900] Fixing hdfs setup and tear down in tests to avoid flakiness (apache#6912) * [HUDI-5002] Remove deprecated API usage in SparkHoodieHBaseIndex#generateStatement (apache#6909) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-5010] Fix flink hive catalog external config not work (apache#6923) * fix flink catalog external config not work * [HUDI-4948] Improve CDC Write (apache#6818) * improve cdc write to support multiple log files * update: use map to store the cdc stats * [HUDI-5030] Fix TestPartialUpdateAvroPayload.testUseLatestRecordMetaValue(apache#6948) * [HUDI-5033] Fix Broken Link In MultipleSparkJobExecutionStrategy (apache#6951) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-5037] Upgrade org.apache.thrift:libthrift to 0.14.0 (apache#6941) * [MINOR] Fixing verbosity of docker set up (apache#6944) * [HUDI-5022] Make better error messages for pr compliance (apache#6934) * [HUDI-5003] Fix the type of InLineFileSystem`startOffset to long (apache#6916) * [HUDI-4855] Add missing table configs for bootstrap in Deltastreamer (apache#6694) * [MINOR] Handling null event time (apache#6876) * [MINOR] Update DOAP with 0.12.1 Release (apache#6988) * [MINOR] Increase maxParameters size in scalastyle (apache#6987) * [HUDI-3900] Closing resources in TestHoodieLogRecord (apache#6995) * [MINOR] Test case for hoodie.merge.allow.duplicate.on.inserts (apache#6949) * [HUDI-4982] Add validation job for spark bundles in GitHub Actions (apache#6954) * [HUDI-5041] Fix lock metric register confict error (apache#6968) Co-authored-by: hbg <[email protected]> * [HUDI-4998] Infer partition extractor class first from meta sync partition fields (apache#6899) * [HUDI-4781] Allow omit metadata fields for hive sync (apache#6471) Co-authored-by: Raymond Xu <[email protected]> * [HUDI-4997] Use jackson-v2 import instead of jackson-v1 (apache#6893) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-3900] Fixing 
tempDir usage in TestHoodieLogFormat (apache#6981) * [HUDI-4995] Relocate httpcomponents (apache#6906) * [MINOR] Update GitHub setting for branch protection (apache#7008) - require at least 1 approving review * [HUDI-4960] Upgrade jetty version for timeline server (apache#6844) Co-authored-by: rmahindra123 <[email protected]> Co-authored-by: Y Ethan Guo <[email protected]> * [HUDI-5046] Support all the hive sync options for flink sql (apache#6985) * [MINOR] fix cdc flake ut (apache#7016) * [MINOR] Remove redundant space in PR compliance check (apache#7022) * [HUDI-5063] Enabling run time stats to be serialized with commit metadata (apache#7006) * [HUDI-5070] Adding lock provider to testCleaner tests since async cleaning is invoked (apache#7023) * [HUDI-5070] Move flaky cleaner tests to separate class (apache#7034) * [HUDI-4971] Remove direct use of kryo from `SerDeUtils` (apache#7014) Co-authored-by: Alexey Kudinkin <[email protected]> * [HUDI-5081] Tests clean up in hudi-utilities (apache#7033) * [HUDI-5027] Replace hardcoded hbase config keys with constant variables (apache#6946) * [MINOR] add commit_action output in show_commits (apache#7012) Co-authored-by: 苏承祥 <[email protected]> * [HUDI-5061] bulk insert operation don't throw other exception except IOE Exception (apache#7001) Co-authored-by: liufangqi.chenfeng <[email protected]> * [MINOR] Skip loading last completed txn for single writer (apache#6660) Co-authored-by: sivabalan <[email protected]> * [HUDI-4281] Using hudi to build a large number of tables in spark on hive causes OOM (apache#5903) * [HUDI-5042] Fix clustering schedule problem in flink when enable schedule clustering and disable async clustering (apache#6976) Co-authored-by: hbg <[email protected]> * [HUDI-4753] more accurate record size estimation for log writing and spillable map (apache#6632) * [HUDI-4201] Cli tool to get warned about empty non-completed instants from timeline (apache#6867) * [HUDI-5038] Increase default num_instants to 
fetch for incremental source (apache#6955) * [HUDI-5049] Supports dropPartition for Flink catalog (apache#6991) * for both dfs and hms catalogs * [HUDI-4809] glue support drop partitions (apache#7007) Co-authored-by: xxhua <[email protected]> * [HUDI-5057] Fix msck repair hudi table (apache#6999) * [HUDI-4959] Fixing Avro's `Utf8` serialization in Kryo (apache#7024) * temp_view_support (apache#6990) Co-authored-by: 苏承祥 <[email protected]> * [HUDI-4982] Add Utilities and Utilities Slim + Spark Bundle testing to GH Actions (apache#7005) Co-authored-by: Raymond Xu <[email protected]> * [HUDI-5085]When a flink job has multiple sink tables, the index loading status is abnormal (apache#7051) * [HUDI-5089] Refactor HoodieCommitMetadata deserialization (apache#7055) * [HUDI-5058] Fix flink catalog read spark table error : primary key col can not be nullable (apache#7009) * [HUDI-5087] Fix incorrect merging sequence for Column Stats Record in `HoodieMetadataPayload` (apache#7053) * [HUDI-5087]Fix incorrect maxValue getting from metatable [HUDI-5087]Fix incorrect maxValue getting from metatable * Fixed `HoodieMetadataPayload` merging seq; Added test * Fixing handling of deletes; Added tests for handling deletes; * Added tests for combining partition files-list record Co-authored-by: Alexey Kudinkin <[email protected]> * [HUDI-4946] fix merge into with no preCombineField having dup row by only insert (apache#6824) * [HUDI-5072] Extract `ExecutionStrategy#transform` duplicate code (apache#7030) * [HUDI-3287] Remove hudi-spark dependencies from hudi-kafka-connect-bundle (apache#6079) * [HUDI-5000] Support schema evolution for Hive/presto (apache#6989) Co-authored-by: z00484332 <[email protected]> * [HUDI-4716] Avoid parquet-hadoop-bundle in hudi-hadoop-mr (apache#6930) * [HUDI-5035] Remove usage of deprecated HoodieTimer constructor (apache#6952) Co-authored-by: slfan1989 <louj1988@@> Co-authored-by: Y Ethan Guo <[email protected]> * [HUDI-5083]Fixed a bug when schema evolution 
(apache#7045) * [HUDI-5102] source operator(monitor and reader) support user uid (apache#7085) * Update HoodieTableSource.java Co-authored-by: chenzhiming <[email protected]> * [HUDI-5057] Fix msck repair external hudi table (apache#7084) * [MINOR] Fix typos in Spark client related classes (apache#7083) * [HUDI-4741] hotfix to avoid partial failover cause restored subtask timeout (apache#6796) Co-authored-by: jian.feng <[email protected]> * [MINOR] use default maven version since it already fix the warnings recently (apache#6863) Co-authored-by: jian.feng <[email protected]> * Revert "[HUDI-4741] hotfix to avoid partial failover cause restored subtask timeout (apache#6796)" (apache#7090) This reverts commit e222693. * [MINOR] Fix doc of org.apache.hudi.sink.meta.CkpMetadata#bootstrap (apache#7048) Co-authored-by: xiaoxingstack <[email protected]> * [HUDI-4799] improve analyzer exception tip when cannot resolve expression (apache#6625) * [HUDI-5096] Upgrade jcommander to 1.78 (apache#7068) - resolves security vulnerability - resolves NPE issues with HiveSyncTool args parsing Co-authored-by: Raymond Xu <[email protected]> * [HUDI-5105] Add Call show_commit_extra_metadata for spark sql (apache#7091) * [HUDI-5105] Add Call show_commit_extra_metadata for spark sql * [HUDI-5107] Fix hadoop config in DirectWriteMarkers, HoodieFlinkEngineContext and StreamerUtil are not consistent issue (apache#7094) Co-authored-by: xiaoxingstack <[email protected]> * [MINOR] Fix OverwriteWithLatestAvroPayload full class name (apache#7096) * [HUDI-5074] Warn if table for metastore sync has capitals in it (apache#7077) Co-authored-by: Jonathan Vexler <=> * [HUDI-5124] Fix HoodieInternalRowFileWriter#canWrite error return tag. 
(apache#7107) Co-authored-by: slfan1989 <louj1988@@> * [MINOR] update commons-codec:commons-codec 1.4 to 1.13 (apache#6959) * [HUDI-5148] Claim RFC-63 for Index on Function and Logical Partitioning (apache#7114) * [HUDI-5065] Call close on SparkRDDWriteClient in HoodieCleaner (apache#7101) Co-authored-by: Jonathan Vexler <=> * [HUDI-4624] Implement Closable for S3EventsSource (apache#7086) Co-authored-by: Jonathan Vexler <=> * [HUDI-5045] Adding support to configure index type with integ tests (apache#6982) Co-authored-by: Y Ethan Guo <[email protected]> * [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency (apache#5416) https://issues.apache.org/jira/browse/HUDI-3963 RFC design : apache#5567 Add Lock-Free executor to improve hoodie writing throughput and optimize execution efficiency. Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction. Existing BoundedInMemory is the default. Users can enable on a need basis. Co-authored-by: yuezhang <[email protected]> * [HUDI-5076] Fixing non serializable path used in engineContext with metadata table intialization (apache#7036) * [HUDI-5032] Add archive to cli (apache#7076) Adding archiving capability to cli. 
Change Logs
Upgrade jcommander from 1.72 to 1.78 to resolve an NPE thrown for Boolean parameters that have no default value.
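For context, a minimal sketch of how a boxed `Boolean` field without a default can lead to an NPE. It assumes the failure comes from auto-unboxing a `null` value when a flag is flipped; that is an assumption about the library's internals, not something stated in this PR. It does not use jcommander itself, and `Options`, `flipNaive`, `flipNullSafe` and `throwsNpe` are hypothetical names used only for illustration.

```java
public class BooleanFlagSketch {
    // Hypothetical option holder, not Hudi's actual config class.
    static class Options {
        Boolean boxedFlag;      // boxed, no default: starts as null
        boolean primitiveFlag;  // primitive: starts as false
    }

    // Flips the flag directly; auto-unboxing a null Boolean throws NullPointerException.
    public static boolean flipNaive(Boolean current) {
        return !current;
    }

    // Treats a missing (null) value as false before flipping.
    public static boolean flipNullSafe(Boolean current) {
        return current == null || !current;
    }

    // Reports whether the naive flip fails with an NPE for the given value.
    public static boolean throwsNpe(Boolean current) {
        try {
            flipNaive(current);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Options options = new Options();
        System.out.println("naive flip of null throws NPE: " + throwsNpe(options.boxedFlag));
        System.out.println("null-safe flip of null: " + flipNullSafe(options.boxedFlag));
        System.out.println("naive flip of primitive default: " + flipNaive(options.primitiveFlag));
    }
}
```

Either a null-safe flip or a primitive `boolean` field with an explicit default avoids the failure in this sketch; which default each option should carry is a separate question discussed in the review comments.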
Impact
Risk level (write none, low, medium or high below)
low
Documentation Update
NA
Contributor's checklist