Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

worker(dm): add retry for watch when network is weak #4553

Closed
wants to merge 9 commits into from
Closed

worker(dm): add retry for watch when network is weak #4553

wants to merge 9 commits into from

Conversation

Ehco1996
Copy link
Contributor

@Ehco1996 Ehco1996 commented Feb 10, 2022

What problem does this PR solve?

Issue Number: close #4548

What is changed and how it works?

Check List

Tests

  • Integration test

Code changes

  • add retry for observeSourceBound when watch channel closed by weak network
  • restart keep alive after observeSourceBound exit
    Side effects

Related changes

Release note

`add retry for watch when network is weak`.

@Ehco1996 Ehco1996 added the area/dm Issues or PRs related to DM. label Feb 10, 2022
@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-linked-issue size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/needs-triage-completed and removed do-not-merge/needs-linked-issue labels Feb 10, 2022
@GMHDBJD GMHDBJD added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2022
@@ -269,7 +269,11 @@ func WatchSourceBound(ctx context.Context, cli *clientv3.Client, worker string,
case <-ctx.Done():
return
case resp, ok := <-ch:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason of this WatchChan being closed?

Copy link
Contributor Author

@Ehco1996 Ehco1996 Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when networkw is poor, all grpc stream is dead, etcd watch will return an closed channel

see here

and this issue

Copy link
Contributor

@lance6716 lance6716 Mar 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link you given is not the version and path of etcd Watch we use.

tiflow/go.mod

Line 83 in b301406

go.etcd.io/etcd v0.5.0-alpha.5.0.20210512015243-d19fbe541bf9

https://github.com/etcd-io/etcd/blob/d19fbe541bf9c81e2d69d71d1068bd40c04de200/clientv3/watch.go#L288

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

w.streams will only be nil after w.Close(), I don't think that's the root cause

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have time can you help locate the problem, I don't have any ideas here at the moment, I'll send you the log in private chat~

@Ehco1996
Copy link
Contributor Author

/run-dm-integration-tests

@Ehco1996 Ehco1996 changed the title scheduler(dm): add testcase to reproduce #4548 worker(dm): add retry for observeSourceBound and observeRelayConfig Feb 11, 2022
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 11, 2022
@Ehco1996
Copy link
Contributor Author

/run-dm-integration-tests

@codecov-commenter
Copy link

codecov-commenter commented Feb 11, 2022

Codecov Report

Merging #4553 (5d6be06) into master (9607554) will decrease coverage by 0.0565%.
The diff coverage is 60.6334%.

Flag Coverage Δ
cdc 60.2988% <69.4485%> (+0.3765%) ⬆️
dm 51.5863% <51.6030%> (-0.4426%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             master      #4553        +/-   ##
================================================
- Coverage   55.6402%   55.5836%   -0.0566%     
================================================
  Files           494        502         +8     
  Lines         61283      62432      +1149     
================================================
+ Hits          34098      34702       +604     
- Misses        23750      24270       +520     
- Partials       3435       3460        +25     

@Ehco1996 Ehco1996 added the status/ptal Could you please take a look? label Feb 11, 2022
dm/dm/worker/server.go Outdated Show resolved Hide resolved
@Ehco1996 Ehco1996 changed the title worker(dm): add retry for observeSourceBound and observeRelayConfig worker(dm): add retry for watch when network is weak Feb 15, 2022
@Ehco1996 Ehco1996 added this to the v5.3.1 milestone Feb 16, 2022
@Ehco1996
Copy link
Contributor Author

/run-dm-integration-tests

@ti-chi-bot ti-chi-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 16, 2022
@ti-chi-bot ti-chi-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 16, 2022
@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 16, 2022
@Ehco1996
Copy link
Contributor Author

/run-dm-integration-tests

@Ehco1996 Ehco1996 added the needs-cherry-pick-release-5.3 Should cherry pick this PR to release-5.3 branch. label Feb 16, 2022
@Ehco1996
Copy link
Contributor Author

/run-dm-integration-tests

@Ehco1996
Copy link
Contributor Author

/run-dm-integration-tests

@Ehco1996 Ehco1996 removed this from the v5.3.1 milestone Feb 17, 2022
@Ehco1996 Ehco1996 removed the needs-cherry-pick-release-5.3 Should cherry pick this PR to release-5.3 branch. label Feb 17, 2022
@ti-chi-bot ti-chi-bot added needs-cherry-pick-release-5.3 Should cherry pick this PR to release-5.3 branch. needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. and removed do-not-merge/needs-triage-completed labels Feb 24, 2022
@Ehco1996
Copy link
Contributor Author

ping @GMHDBJD @niubell 🧡

@Ehco1996 Ehco1996 removed the status/ptal Could you please take a look? label Mar 8, 2022
@Ehco1996
Copy link
Contributor Author

Ehco1996 commented Mar 8, 2022

from etcd

Canceled is used to indicate watch failure. If the watch failed and the stream was about to close, before the channel is closed, the channel sends a final response that has Canceled set to true with a non-nil Err()

seems grpc streams closed is not the root cause of #4548 , we upadted embed etcd in #4755, let's see if this can solve the problem 😵

@Ehco1996
Copy link
Contributor Author

Ehco1996 commented Mar 8, 2022

/hold

@ti-chi-bot ti-chi-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 8, 2022
@lance6716
Copy link
Contributor

from etcd

Canceled is used to indicate watch failure. If the watch failed and the stream was about to close, before the channel is closed, the channel sends a final response that has Canceled set to true with a non-nil Err()

seems grpc streams closed is not the root cause of #4548 , we upadted embed etcd in #4755, let's see if this can solve the problem 😵

I tested the behaviour of watch, after I turned off the WiFi of my laptop, watch will retry infinitely rather than terminate.

@Ehco1996 Ehco1996 added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 14, 2022
@Ehco1996
Copy link
Contributor Author

close for no updated now

@Ehco1996 Ehco1996 closed this Mar 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/dm Issues or PRs related to DM. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-cherry-pick-release-5.3 Should cherry pick this PR to release-5.3 branch. needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dm-worker hang when etcd watch ch closed
6 participants