
[HUDI-3478] Support CDC for Spark in Hudi #5885

Closed
wants to merge 7 commits

Conversation

YannByron (Contributor)

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@YannByron YannByron marked this pull request as draft June 16, 2022 11:59
@YannByron YannByron marked this pull request as ready for review June 17, 2022 07:14
@YannByron (Contributor Author)

@hudi-bot run azure

@YannByron (Contributor Author) commented Jun 20, 2022

@vinothchandar please review this. The case where cdc.supplemental.logging is false, and the SparkSQL syntax, may be supported in subsequent commits or in another PR.

@YannByron (Contributor Author)

Update:
Support CDC when cdc.supplemental.logging is disabled. In this case, the CDC block will only persist the op and record_key fields.
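For illustration, a minimal sketch of what such a reduced CDC payload could look like as an Avro record; the schema literal and class name are hypothetical, and only the op and record_key field names come from the change described above:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

// Hypothetical sketch: with cdc.supplemental.logging=false, the CDC block
// carries only the operation and the record key; before/after images would
// have to be reconstructed at read time instead of being read back directly.
public class MinimalCdcPayloadSketch {
  static final Schema MINIMAL_CDC_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"MinimalCdc\",\"fields\":["
          + "{\"name\":\"op\",\"type\":\"string\"},"
          + "{\"name\":\"record_key\",\"type\":\"string\"}]}");

  static GenericData.Record minimalCdcRecord(String op, String recordKey) {
    GenericData.Record rec = new GenericData.Record(MINIMAL_CDC_SCHEMA);
    rec.put("op", op);                // e.g. "i", "u" or "d"
    rec.put("record_key", recordKey); // key of the changed record
    return rec;
  }
}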

@vinothchandar (Member)

@YannByron made one pass to understand the file changes. Will start the detailed review next

@prasannarajaperumal (Contributor)

Hey @YannByron,

Thanks for this PR and a well-written RFC-51.
Overall I agree with the high-level direction. I will do the code review soon; I have a question before that.

Should we introduce a new concept (CDC) here on Hudi tables? I think this should be a sub-mode of Incremental Query.
For illustration, suppose we have something like the following modes for incremental query (change stream), sketched as an enum below:

  • LATEST_STATE_INSERT_DELETE_KEYS (entire row state for all inserted keys and empty delete keys?)
  • LATEST_STATE_ONLY_INSERT_KEYS
  • MIN_STATE_CHANGE_INSERT_DELETE_KEYS (only the columns that changed; consolidate multiple inserts/deletes, or remove data that was inserted and deleted within the time range)
  • ALL_STATE_CHANGES_INSERT_DELETE_KEYS (include every single change made to the key)
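
For illustration only, those modes as a hypothetical Java enum (the constant names mirror the list above; the enum is not part of this PR):

// Hypothetical sketch of the proposed incremental-query modes as an enum;
// the names come from the list above, the enum itself is not in this PR.
public enum IncrementalQueryMode {
  LATEST_STATE_INSERT_DELETE_KEYS,     // entire row state for inserts, empty delete keys
  LATEST_STATE_ONLY_INSERT_KEYS,       // entire row state, inserted keys only
  MIN_STATE_CHANGE_INSERT_DELETE_KEYS, // only changed columns, consolidated over the range
  ALL_STATE_CHANGES_INSERT_DELETE_KEYS // every single change made to the key
}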

I think read-schema changes for the CDC-style incremental queries could be a challenge.

The reasons I think of converging the incremental queries with RFC-51:

  • It removes the limitation of tracking deletes across compaction boundaries for incremental queries.
  • I think it just makes sense to track, by default for all Hudi tables, the data we track when "cdc.supplemental.logging=false". Having this data stored efficiently for point lookups will help with record merging as well, I suppose?

@YannByron What do you think? (cc @vinothchandar )

Cheers
Prasanna

@YannByron (Contributor Author) commented Jul 17, 2022

Hey @prasannarajaperumal, thank you very much for reviewing this.

CDC is not a new concept; it is a common database concept. So I think it's better to distinguish CDC from Incremental Query. Some reasons:

  • CDC is better known than incremental query; incremental query is defined only by Hudi.
  • Unlike Incremental Query and Snapshot Query, CDC has its own output format, in which every record has op, ts_ms, before, and after fields (illustrated below).
  • According to RFC-51, CDC has its own read and write logic. We have to persist some extra information for CDC when data is written to Hudi.
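
For illustration, one way such an output row could be modeled; only the four field names come from the bullet above, while the Java types and the JSON encoding of before/after are assumptions:

// Hypothetical model of a CDC output row; only the four field names are from
// the comment above, the types and the JSON pre-/post-images are assumptions.
public class CdcRowSketch {
  public String op;      // "i" (insert), "u" (update) or "d" (delete)
  public long tsMs;      // commit timestamp of the change, epoch millis
  public String before;  // pre-image of the row, e.g. JSON; null for inserts
  public String after;   // post-image of the row, e.g. JSON; null for deletes

  // Example: an update that changes price from 10 to 12 for id=1.
  public static CdcRowSketch exampleUpdate() {
    CdcRowSketch r = new CdcRowSketch();
    r.op = "u";
    r.tsMs = 1658304000000L;
    r.before = "{\"id\":1,\"price\":10}";
    r.after = "{\"id\":1,\"price\":12}";
    return r;
  }
}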

Looking forward to your reply.

@prasannarajaperumal (Contributor) commented Jul 18, 2022

@YannByron

I understand CDC is a database concept. My point was that incremental query is also just a form of CDC if you think about how it is used. Yes, the schema is different based on the mode of the incremental query. I believe we can unify the current CDC proposal and the incremental query feature to make it simple for users consuming change streams out of a Hudi table.

We can call this unified feature incremental query, CDC, or Change Streams (I am not hung up on the name).

@@ -102,6 +117,15 @@
protected Map<String, HoodieRecord<T>> keyToNewRecords;
protected Set<String> writtenRecordKeys;
protected HoodieFileWriter<IndexedRecord> fileWriter;
// a flag that indicates whether to write the change data out to a CDC log file.
protected boolean cdcEnabled = false;
Contributor:

Create a sub-class of HoodieAppendHandle, HoodieChangeTrackingAppendHandle, and move all the code related to persisting row-level change-tracking metadata to the subclass. I prefer naming all methods/parameters changeTracking instead of CDC: CDC is a feature; ChangeTracking is the action you do during the write.

Contributor Author:

I think you mean HoodieChangeTrackingMergeHandle?

Contributor Author:

Hi @prasannarajaperumal,
I tried creating the sub-classes HoodieChangeTrackingMergeHandle, HoodieChangeTrackingSortedMergeHandle, and HoodieChangeTrackingConcatHandle, and adding the logic to decide whether a HoodieChangeTrackingXXXHandle should be created at every place where HoodieMergeHandle and the other classes were created before. I think that may be less clear.

* Relative path of the cdc file that stores the CDC data.
*/
@Nullable
private String cdcPath;
Contributor:

ChangeTrackingStat

Contributor Author:

Maybe changeTrackingPath? After all, it is a file path, not a stat.

Member:

Do these new fields evolve well? I.e., are they backward compatible with existing write stats that lack these new fields?

Contributor Author:

Let me test the case of using an old Hudi version to query tables created by this branch.

if (cdcEnabled) {
if (indexedRecord.isPresent()) {
GenericRecord record = (GenericRecord) indexedRecord.get();
cdcData.add(cdcRecord(CDCOperationEnum.UPDATE, hoodieRecord.getRecordKey(), hoodieRecord.getPartitionPath(),
Contributor:

We will be holding the record data in memory until the handle is closed when supplemental logging is enabled. Any side effects to be cautious about?
We deflate the actual record once it's written to the file, and the bloom filter calculation happens after; would there be significant memory pressure if we still hold on to the data for CDC, and how do we handle this?

Contributor Author:

IMO, it's OK.
A base parquet file is about 128M in the most common cases. Even if all the records are updated, cdcData will take less than about 300M of memory. And if the workload is heavy, users can increase the memory of the workers.
But if we are worried about this, we can use hudi.common.util.collection.ExternalSpillableMap instead.

Member:

In general, with all the Java/JVM overhead, I think it'll be comfortably more than 300M. Can we use the spillable map here instead, in this PR?

Contributor Author:

OK, I will use hudi.common.util.collection.ExternalSpillableMap here instead.
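
A hedged sketch of that swap, modeled on existing usages of the spillable map in Hudi write handles; the constructor arguments and the HoodieAvroPayload wrapper are assumptions (the map requires Serializable keys and values, so a raw GenericRecord would not fit):

import java.io.IOException;
import org.apache.hudi.common.model.HoodieAvroPayload;
import org.apache.hudi.common.util.DefaultSizeEstimator;
import org.apache.hudi.common.util.collection.ExternalSpillableMap;

// Sketch: buffer CDC records in a spillable map keyed by record key, so that
// anything beyond the in-memory budget spills to local disk instead of
// pressuring the JVM heap. Constructor shape follows existing usages.
class CdcBufferSketch {
  static ExternalSpillableMap<String, HoodieAvroPayload> newCdcBuffer(
      long maxInMemoryBytes, String spillableMapBasePath) throws IOException {
    return new ExternalSpillableMap<>(
        maxInMemoryBytes,              // in-memory size budget before spilling
        spillableMapBasePath,          // local directory for the spill file
        new DefaultSizeEstimator<>(),  // key size estimator
        new DefaultSizeEstimator<>()); // value size estimator
  }
}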

public enum CDCFileTypeEnum {

CDC_LOG_FILE,
ADD_BASE_File,
Contributor:

s/ADD_BASE_File/ADD_BASE_FILE

* Parse HoodieWriteStat, determine which type the file is and what strategy should be used to parse CDC data.
* Then build a [[ChangeFileForSingleFileGroupAndCommit]] object.
*/
private def parseWriteStat(
Contributor:

Does it make sense to generalize this out of Spark and make the logic to identify the different CDC types and load them common to all clients?

Contributor Author:

Yes, let me move this to a common place.

@YannByron (Contributor Author)

@prasannarajaperumal Got it. In some respects, incremental query can be considered a kind of CDC: it just returns the data inserted/updated after a certain point, uses the normal format, and only cares about the new values rather than the old values.

@vinothchandar (Member) commented Jul 20, 2022

+1 on doing this as part of the incremental query. CDC is a database concept, but not a query per se; an Incremental Query actually executes SQL on the CDC stream, so they are actually different. CDC refers to the mechanism of capturing changes from a database log consistently, in order. That's all.

I'd rather not introduce a new query type and take on the ongoing overhead of explaining incremental vs CDC queries. There are already four query types: snapshot, read optimized, point in time, incremental.

@vinothchandar (Member) left a comment:

Can you think through the following scenarios and ensure things work as expected.

  • Across clustering operations
  • Across multi writer scenarios.

I am yet to review the MOR relation changes; the write handle changes themselves look good. TBH, it's cool that such major functionality can be implemented end-to-end with a small LOC count.

@danny0405 do you want to take a pass at this


@@ -399,9 +447,57 @@ protected void writeIncomingRecords() throws IOException {
}
}

protected GenericData.Record cdcRecord(CDCOperationEnum operation, String recordKey, String partitionPath,
Member:

RFC-46 is moving away from GenericRecord as the canonical data record. So we may want to move in that direction as well. We need to sequence the two efforts correctly.

@@ -114,6 +118,8 @@ protected HoodieWriteHandle(HoodieWriteConfig config, String instantTime, String
HoodieTable<T, I, K, O> hoodieTable, Option<Schema> overriddenSchema,
TaskContextSupplier taskContextSupplier) {
super(config, Option.of(instantTime), hoodieTable);
this.keyFiled = config.populateMetaFields() ? HoodieRecord.RECORD_KEY_METADATA_FIELD
Member:

typo: keyField

throws Exception {
String jsonStr = new String(bytes, StandardCharsets.UTF_8);
if (jsonStr.isEmpty()) {
return null;
Member:

Please avoid using null as a return value.
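
A minimal sketch of the suggested alternative, using Hudi's own Option type instead of null; the method and class names are illustrative:

import java.nio.charset.StandardCharsets;
import org.apache.hudi.common.util.Option;

// Sketch: signal "no data" with Option.empty() rather than null, so callers
// are forced to handle the empty case explicitly.
class JsonParseSketch {
  static Option<String> parseJsonString(byte[] bytes) {
    String jsonStr = new String(bytes, StandardCharsets.UTF_8);
    return jsonStr.isEmpty() ? Option.empty() : Option.of(jsonStr);
  }
}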

* Parse the bytes of a deltacommit, and get the base file and the log files belonging to
* the provided file group.
*/
public static Pair<String, List<String>> getFileSliceForFileGroupFromDeltaCommit(
Member:

Doesn't any of the existing code do this?

Contributor Author:

No; let me do this first. There is no simple, low-code way to do it. I think it deserves a new PR.


@@ -229,6 +230,10 @@ protected synchronized void scanInternal(Option<KeySpec> keySpecOpt) {
HoodieLogBlock logBlock = logFormatReaderWrapper.next();
final String instantTime = logBlock.getLogBlockHeader().get(INSTANT_TIME);
totalLogBlocks.incrementAndGet();
if (logBlock.getBlockType() == CDC_DATA_BLOCK) {
Member:

If the data block is rolled back, or the commit is rolled back, is the CDC block skipped correctly? Can we write some tests to cover these scenarios?

Contributor Author:

OK, let me add some UTs.

@vinothchandar (Member)

@YannByron Can we rework this by making CDC a special mode of inc query?

@YannByron (Contributor Author)

Can you think through the following scenarios and ensure things work as expected.

  • Across clustering operations
  • Across multi writer scenarios.

The TestCDCDataFrameSuite UT in this PR covers the scenario across clustering operations. But the multi-writer scenario is not covered, and I think it may not be necessary.

@YannByron (Contributor Author) commented Jul 20, 2022

@YannByron Can we rework this by making CDC a special mode of inc query?

Actually, I think the incremental query is a special mode of CDC: the incremental query keeps the normal format and returns less of the CDC info.
But that's fine. If we want to do this, we just modify the configs that users need to be aware of, like this:

The current way sets:

hoodie.datasource.query.type=cdc

The modified way sets:

hoodie.datasource.query.type=incremental
hoodie.datasource.incremental.output=cdc

Is this OK? cc @vinothchandar
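
A hedged sketch of what a read with the proposed knobs could look like; hoodie.datasource.incremental.output is only the name proposed in this comment, not a released option, and the begin-instant value and table path are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CdcIncrementalReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("cdc-read").master("local[*]").getOrCreate();
    // Incremental query whose output format is the CDC one proposed above.
    Dataset<Row> changes = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.incremental.output", "cdc") // proposed name
        .option("hoodie.datasource.read.begin.instanttime", "20220720000000")
        .load("/path/to/hudi_table");
    changes.show(false);
  }
}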

@danny0405 (Contributor) commented Jul 20, 2022

Can you think through the following scenarios and ensure things work as expected.

  • Across clustering operations
  • Across multi writer scenarios.

I am yet to review the MOR relation changes; the write handle changes themselves look good. TBH, it's cool that such major functionality can be implemented end-to-end with a small LOC count.

@danny0405 do you want to take a pass at this

Yeah, I need to do a detailed review of this part. We don't seem to have even reached consensus on the design, and I don't understand why the RFC doc was merged; I was confused at first.

@xushiyan (Member) commented Jul 20, 2022

we seem not even make consensus on the design and i don't understand why the RFC doc was merged, confused firstly.

@danny0405 I've explained in #5436 (comment). Also, we've messaged with @YannByron over WeChat about this, saying we should make a follow-up PR to update. It's not meant to ignore any unresolved questions. We'll make sure previous discussion points are linked and resolved in the updating PR.

@codope codope added priority:critical production down; pipelines stalled; Need help asap. and removed priority:blocker labels Jul 20, 2022
@YannByron YannByron force-pushed the master_cdc_code branch 2 times, most recently from fae8816 to 281d69e Compare July 23, 2022 05:46
@xushiyan xushiyan changed the title [RFC-51][HUDI-3478] Hudi CDC [HUDI-3478] Support CDC for Spark in Hudi Jul 24, 2022
@xushiyan xushiyan added pr:wip Work in Progress/PRs big-needle-movers and removed priority:critical production down; pipelines stalled; Need help asap. labels Jul 24, 2022
@xushiyan xushiyan removed the pr:wip Work in Progress/PRs label Jul 29, 2022
@YannByron (Contributor Author)

@hudi-bot run azure

@hudi-bot commented Aug 9, 2022

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@YannByron YannByron closed this Aug 17, 2022
@YannByron (Contributor Author)

Reopen: #6476
