Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mysql]fix duplicate split request for newly added table #1156

Merged
merged 2 commits into from
Oct 19, 2022

Conversation

yujiaxinlong
Copy link

@yujiaxinlong yujiaxinlong commented May 7, 2022

related to #1149

I debugged for a while and find the reason of stucking at the second split(not 100% percent second , could be third or later but always stuck) is that it triggers two split requests on one subtask at the same time.

When adding a new table to an existed flinkcdc task , if the old table has finished snapshot phase and has been consuming binlog. It will read some binlog splits before finding there is new table during task restart.

Here when a binlog split finished in MysqlSourceReader#onSplitFinished(...), it actually activate next split twice.

  1. SuspendBinlogReaderAckEvent -> WakeupReaderEvent -> context.sendSplitRequest()
  2. directly context.sendSplitRequest().

Which lead to two snapshot splits handled by one subtask , one split fetcher and one same debezium BinaryLogClient at the same time.

This BinaryLogClient mostly reach an EOF before the high watermark of the second split here, so the binlogSplitReadTask for the second split will never trigger an binlog end watermark event, which eventually lead to snapshotreader hangs forever.

The WakeupReaderEvent for snapshotsplit seems not needed. Currently I just removed the context.sendSplitRequest() here and keep this event for future usage. And this works fine in my case.

@yujiaxinlong
Copy link
Author

@leonardBang @PatrickRen there should be something wrong with oracle connector or it's test case, past several unrelated PRs all failed on this.

@lzshlzsh
Copy link
Contributor

+1, the fix solved our online problem

@lzshlzsh
Copy link
Contributor

lzshlzsh commented May 19, 2022

Would you explain how to reproduce the bug more detailed ? I have tried tens times according to #1149 , but could not reproduce.

You say: "Which lead to two snapshot splits handled by one subtask , one split fetcher and one same debezium BinaryLogClient at the same time."

There is no problem that multiple snapshot splits handled by one subtask, which happens when decrease parallelism(e.g 4 to 1) in snapshot stage.

@yujiaxinlong
Copy link
Author

@lzshlzsh
I think multiple splits indeed can be handled by one subtask, but should not at the same time.

The problem here is that it starts to handle the second split before finishing first one. What I observe here is that the second split keeps waiting for it's watermark end event , which only happens when currentBinlogOffset.isAtOrAfter(binlogSplit.getEndingOffset()) see MySqlBinlogSplitReadTask#handleEvent(Event event). But it never get it.

I haven't read the code of BinaryLogClient very carefully, but I believe the reason is that BinaryLogClient doesn't read binlog endlessly. it read exactly amount of binlog that has been decided when it connects. so when the second split also use this client, it most likely reach an eof and jump out before reach the high watermark of second split.

I haven't reproduce this problem except in our online database, I think the core here is the table A I mentioned in #1149 must has finished snapshot reading and keeps having new DML sql.

@leonardBang leonardBang self-requested a review May 20, 2022 05:47
@yujiaxinlong
Copy link
Author

@leonardBang Hi, I'm wandering how is the review going, I checked code of BinaryLogClient recently and pretty sure the reason lead to this bug is what I described before. The BinaryLogClient for the first split won't read new log for the second.

@leonardBang leonardBang force-pushed the fix_duplicate_split_request branch from 3b33c2b to b61c748 Compare October 19, 2022 03:22
Copy link
Contributor

@leonardBang leonardBang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yujiaxinlong for the detail digging! I like your analysis, LGTM.
I also help rebase and improve the PR a little as the original PR has some conflicts.

@leonardBang leonardBang merged commit d343538 into apache:master Oct 19, 2022
@leonardBang leonardBang added this to the V2.3.0 milestone Oct 19, 2022
ChaomingZhangCN pushed a commit to ChaomingZhangCN/flink-cdc that referenced this pull request Jan 13, 2025
ChaomingZhangCN pushed a commit to ChaomingZhangCN/flink-cdc that referenced this pull request Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants