Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sink to csv: output old value to CSV files #10174

Merged
merged 3 commits into from
Dec 12, 2023

Conversation

zhangjinpeng87
Copy link
Contributor

@zhangjinpeng87 zhangjinpeng87 commented Nov 29, 2023

What problem does this PR solve?

Issue Number: close #10167

What is changed and how it works?

When changefeed configuration set:

[sink.csv]
output-old-value = true

As #10167 described, sink to csv replace update with a pair of delete and insert, in this way to record the old value for UPDATE statement, so it is possible to generate undo DML for some data repairing cases.

Before this change, the CSV output statement of update schema.table set name="def" where id=1 and name="abc" is

U, table, schema, [commit-ts], 1, def

After this change, the CSV output looks like:

D, table, schema, [commit-ts], 1, abc
I, table, schema, [commit-ts], 1, def

Check List

Tests

  • Unit test

Questions

Will it cause performance regression or break compatibility?

After set [sink.csv] output-old-value = true configuration for the sink to storage changefeed, for each UPDATE statement there will be 2 rows of data in the output CSV file. The output CSV files size is larger than before if there are many UPDATE statements.

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

TiCDC sink to CSV files: support output old values for csv files, this is helpful for data repairing scenario.

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 29, 2023
@@ -919,6 +919,7 @@ type CSVConfig struct {
NullString string `json:"null"`
IncludeCommitTs bool `json:"include_commit_ts"`
BinaryEncodingMethod string `json:"binary_encoding_method"`
EnableOldValue bool `json:"enable_old_value"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enable_old_value has different meanings on different versions of ticdc and has been confusing for users, so a more explicit parameter name such as output_old_value is recommended here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

force-split-update is recommended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output-old-value is more obvious from the semantic perspective, because split-update is just one specific implementation case of output old value in the context of sink to csv files.

@3AceShowHand
Copy link
Contributor

There is no need to update the CSV implementation.

It's recommended to split the update event https://github.com/pingcap/tiflow/blob/master/cdc/model/sink.go#L833

Add one new variable forceSplitUpdate to the shouldSplitUpdateEvent method should solve this problem.

@zhangjinpeng87
Copy link
Contributor Author

zhangjinpeng87 commented Nov 29, 2023

There is no need to update the CSV implementation.

It's recommended to split the update event https://github.com/pingcap/tiflow/blob/master/cdc/model/sink.go#L833

Add one new variable forceSplitUpdate to the shouldSplitUpdateEvent method should solve this problem.

@3AceShowHand Thanks for your reminding. But

  1. I found that https://github.com/pingcap/tiflow/blob/master/cdc/model/sink.go#L833 this place is a quite early stage of a sink and handled most common/generic logic of a changefeed (pull events from buffer). And this stage TiCDC doesn't apply any changefeed configuration specific things (csv specific or other protocol specific configurations) to the sink during this stage. Just my opinion: in the future if we want to share most of steps for multiple changefeeds like sink to CSV & sink to Kafka for the same table. Applying the protocol and configuration specific things to the sink at the last stage make things simple and flexible in the future.
  2. The semantic output-old-value for csv is not equal with with force-split-update here as I mentioned in above conversation.

@zhangjinpeng87 zhangjinpeng87 changed the title sink to csv: update support record old value sink to csv: output old value to CSV files Nov 29, 2023
@3AceShowHand
Copy link
Contributor

output-old-value is more obvious from the semantic perspective, because split-update is just one specific implementation case of output old value in the context of sink to csv files.

IMHO, TiCDC always fetch old value from the TiKV, and also output old value if the output format allows to do so. In this aspect, It's ok to name it as output-old-value and only bound to the CSV format.

It's ok to support the output-old-value by modifying CSV encoder. IMHO, it would be easy to implement the output-old-value in the https://github.com/pingcap/tiflow/blob/master/cdc/model/sink.go#L833 by splitting the update into a delete and insert, this keeps the implementation of CSV format simple and clean.

@zhangjinpeng87
Copy link
Contributor Author

output-old-value is more obvious from the semantic perspective, because split-update is just one specific implementation case of output old value in the context of sink to csv files.

IMHO, TiCDC always fetch old value from the TiKV, and also output old value if the output format allows to do so. In this aspect, It's ok to name it as output-old-value and only bound to the CSV format.

It's ok to support the output-old-value by modifying CSV encoder. IMHO, it would be easy to implement the output-old-value in the https://github.com/pingcap/tiflow/blob/master/cdc/model/sink.go#L833 by splitting the update into a delete and insert, this keeps the implementation of CSV format simple and clean.

@3AceShowHand
IMO, we are not going to choose an implementation just because it is easy, more consideration is focus on where most other CSV changefeed configuration take affects, and split update is not equal to the semantic of output old value for a protocol. In this way, we can make the semantic more clear and easy to change/maintain in the future.

@3AceShowHand @CharlesCheung96 Thanks for your review, please let me know if you have other concerns about this PR.

@zhangjinpeng87 zhangjinpeng87 removed the first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. label Dec 2, 2023
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Dec 12, 2023
@ti-chi-bot ti-chi-bot bot added the lgtm label Dec 12, 2023
Copy link
Contributor

ti-chi-bot bot commented Dec 12, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, CharlesCheung96

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [3AceShowHand,CharlesCheung96]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Dec 12, 2023
Copy link
Contributor

ti-chi-bot bot commented Dec 12, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-12-12 06:21:59.515731266 +0000 UTC m=+337210.552958184: ☑️ agreed by 3AceShowHand.
  • 2023-12-12 15:42:34.116270661 +0000 UTC m=+370845.153497574: ☑️ agreed by CharlesCheung96.

@zhangjinpeng87
Copy link
Contributor Author

/test dm-integration-test

@ti-chi-bot ti-chi-bot bot merged commit a3f2fe2 into pingcap:master Dec 12, 2023
3 checks passed
@zhangjinpeng87 zhangjinpeng87 self-assigned this Dec 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[sink to cloud storage] output old values to CSV files
3 participants