[sink to cloud storage] output old values to CSV files #10167
Comments
Another option is using "U-" or "U+", instead of "I" and "D", to represent UPDATE. cc @Benjamin2037
I agree that recording the Update on two separate lines is necessary to capture the row's data before and after the update. I do have two concerns, however. The first is more immediate: the increased complexity it would introduce when using the changefeed to repair erroneous transactions. The second is a future concern about the implications if support for triggers were ever introduced to TiDB, given the mismatch between triggers on an Update versus a paired Delete and Insert.
Regarding the first concern, recording an Update operation as a pair of Delete and Insert operations makes processing more complex with relational tooling, particularly for scenarios where we are looking to undo erroneous operations. For example, if an application has been executing erroneous Updates to a table (say
With the current CSV protocol we could expect a series of changes to be recorded as follows (see https://docs.pingcap.com/tidb/stable/ticdc-csv#definition-of-the-data-format):
With the original proposal in this issue, the CSV file would look like:
If we separated out the pre and post update row data using
Now, with that CSV file loaded into a database table (e.g.
update employee inner join employee_cdc
on employee.id = employee_cdc.id -- join on origin tables unique/primary key
set employee.surname = employee_cdc.surname,
employee.firstname = employee_cdc.firstname,
employee.hiredate = employee_cdc.hiredate,
employee.location = employee_cdc.location
where employee_cdc._cdc_op = 'U-' -- the data prior to the erroneous update
and employee_cdc._cdc_Commit_ts >= '…1626'
and employee_cdc._cdc_Commit_ts <= '…1631'
;
In contrast, if they are recorded using Delete and Insert operations, separating out the non-paired inserts and deletes from the paired inserts and deletes is more complex. In this example it requires an additional join to identify the two rows relating to the update:
with employee_updates AS (select pre.*
from employee_changes pre inner join employee_changes post
on pre._cdc_Commit_ts = post._cdc_Commit_ts
and pre.Id = post.Id
and pre._cdc_schema = post._cdc_schema
and pre._cdc_table = post._cdc_table
where pre._cdc_op = 'D' and post._cdc_op = 'I')
update employee inner join employee_updates
on employee.Id = employee_updates.Id -- join on origin tables unique/primary key
set employee.LastName = employee_updates.LastName,
employee.FirstName = employee_updates.FirstName,
employee.HireDate = employee_updates.HireDate,
employee.OfficeLocation = employee_updates.OfficeLocation
where employee_updates._cdc_op = 'D' -- the data prior to the erroneous update
and employee_updates._cdc_Commit_ts >= '…1626'
and employee_updates._cdc_Commit_ts <= '…1631'
;
Looking at other solutions that offer streaming representation of data changes, the ability to preserve the Update operation in this fashion appears to be a common pattern. For example, Snowflake streams in their
Is your feature request related to a problem?
In the scenario of repairing upstream data (described in #10100), if the erroneous operation is an update, the undo DML cannot be generated from the sink-to-cloud-storage CSV output, because for update operations the CSV protocol records only the after-update value.
Describe the feature you'd like
Record the old value for update operations (delete operations already record the old value) when the oldvalue parameter is true:
sink-uri="s3://xxx/xxx?protocol=csv&oldvalue=true"
For example, given the following update operations:
The current CSV protocol generates a CSV file that looks like:
There is no way to generate undo DML for these updates because the old values are missing. My proposal is to add an option that replaces each update operation with a delete and an insert in the CSV protocol, so that the CSV file looks like:
The advantage of this proposal is we can record the old value for Insert(I), Update(U) and Delete(D) operations. The disadvantage of this proposal is the file size grows if there are many update operations.
When generating the undo DML, the following statements can be produced:
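As a hedged illustration of the delete-plus-insert proposal, the Python sketch below derives undo statements from such a file. It assumes each row is laid out as operation, table, schema, commit-ts, then column values (my understanding of the TiCDC CSV layout when commit-ts is included). The `hr.employee` table, its `id` and `surname` columns, the sample rows, and the helper names are all hypothetical. This is not TiCDC's implementation: values are quoted naively as strings, and NULLs and column types are ignored.

```python
import csv
import io


def sql_literal(value):
    """Quote a CSV field as a SQL string literal (naive: ignores types and NULLs)."""
    return "'" + value.replace("'", "''") + "'"


def undo_dml(csv_text, columns, key):
    """Build undo statements from rows assumed to be op,table,schema,commit-ts,cols..."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    statements = []
    # Undo the most recent change first, so walk the rows in reverse order.
    for op, table, schema, _commit_ts, *values in reversed(rows):
        record = dict(zip(columns, values))
        target = f"`{schema}`.`{table}`"
        if op == "I":
            # An insert is undone by deleting the row it created.
            statements.append(
                f"DELETE FROM {target} WHERE `{key}` = {sql_literal(record[key])};"
            )
        elif op == "D":
            # A delete is undone by re-inserting the recorded old row.
            cols = ", ".join(f"`{c}`" for c in columns)
            vals = ", ".join(sql_literal(record[c]) for c in columns)
            statements.append(f"INSERT INTO {target} ({cols}) VALUES ({vals});")
        else:
            raise ValueError(f"unexpected operation marker: {op!r}")
    return statements


# Hypothetical sample: one update recorded as a delete (old row) plus an insert (new row).
SAMPLE = (
    '"D","employee","hr",1001,"7","Smith"\n'
    '"I","employee","hr",1001,"7","Smyth"\n'
)

if __name__ == "__main__":
    for statement in undo_dml(SAMPLE, ["id", "surname"], "id"):
        print(statement)
```

Walking the rows in reverse means the new row image is deleted before the old row image is re-inserted, so a paired delete and insert undoes an update without first having to pair the rows.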
Describe alternatives you've considered
Using two update rows to record the old value and the new value. The generated CSV file for the above example would look like:
But the semantics are not clear when generating undo DML.
Teachability, Documentation, Adoption, Migration Strategy
todo