Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cdc: merge two-phase scheduler #5955

Merged
merged 31 commits into from
Jun 24, 2022
Merged

Conversation

overvenus
Copy link
Member

@overvenus overvenus commented Jun 21, 2022

What problem does this PR solve?

Issue Number: ref #4757

What is changed and how it works?

Add two-phase scheduler to reduce latency spike during rolling restart.

Commits are cherry-picked from fb/latency branch.

Check List

Tests

  • Unit test

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Add two-phase scheduler to reduce latency spike during rolling restart.

@overvenus overvenus added component/scheduler TiCDC inner scheduler component. area/ticdc Issues or PRs related to TiCDC. labels Jun 21, 2022
@ti-chi-bot
Copy link
Member

ti-chi-bot commented Jun 21, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • 3AceShowHand
  • sdojjy

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 21, 2022
@overvenus overvenus added the tide/merge-method-rebase Denotes a PR that should be rebased by tide when it merges. label Jun 21, 2022
@@ -130,6 +130,13 @@ var defaultServerConfig = &ServerConfig{
IteratorSlowReadDuration: 256,
},
Messages: defaultMessageConfig.Clone(),

EnableTwoPhaseScheduler: false,
Copy link
Member Author

@overvenus overvenus Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's disabled for now.

@amyangfei
Copy link
Contributor

Should it be tide/merge-method-merge instead of tide/merge-method-rebase?

@overvenus
Copy link
Member Author

Should it be tide/merge-method-merge instead of tide/merge-method-rebase?

Both are fine, they all keep commit history, while rebase keep a linear history.

@overvenus
Copy link
Member Author

/run-all-tests

@overvenus
Copy link
Member Author

/run-all-tests

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jun 21, 2022
@3AceShowHand
Copy link
Contributor

/run-kafka-integration-test
/run-integration-test

@3AceShowHand
Copy link
Contributor

[2022/06/21 21:13:27.534 +08:00] [PANIC] [schedule_dispatcher.go:167] ["checkpointTs regressed"] [namespace=default] [changefeed=95447eb3-e57f-4ebb-a8db-8b121af228c4] [old=434062543589474316] [new=434062543458402305] [stack="github.com/pingcap/tiflow/cdc/scheduler/internal/base.(*ScheduleDispatcher).Tick\n\t/home/jenkins/agent/workspace/cdc_ghpr_integration_test/go/src/github.com/pingcap/tiflow/cdc/scheduler/internal/base/schedule_dispatcher.go:167\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/base.(*SchedulerV2).Tick\n\t/home/jenkins/agent/workspace/cdc_ghpr_integration_test/go/src/github.com/pingcap/tiflow/cdc/scheduler/internal/base/owner_scheduler.go:79\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).tick\n\t/home/jenkins/agent/workspace/cdc_ghpr_integration_test/go/src/github.com/pingcap/tiflow/cdc/owner/changefeed.go:264\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).Tick\n\t/home/jenkins/agent/workspace/cdc_ghpr_integration_test/go/src/github.com/pingcap/tiflow/cdc/owner/changefeed.go:160\ngithub.com/pingcap/tiflow/cdc/owner.(*ownerImpl).Tick\n\t/home/jenkins/agent/workspace/cdc_ghpr_integration_test/go/src/github.com/pingcap/tiflow/cdc/owner/owner.go:203\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run\n\t/home/jenkins/agent/workspace/cdc_ghpr_integration_test/go/src/github.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:247\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).runEtcdWorker\n\t/home/jenkins/agent/workspace/cdc_ghpr_integration_test/go/src/github.com/pingcap/tiflow/cdc/capture/capture.go:430\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).campaignOwner\n\t/home/jenkins/agent/workspace/cdc_ghpr_integration_test/go/src/github.com/pingcap/tiflow/cdc/capture/capture.go:403\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).run.func2\n\t/home/jenkins/agent/workspace/cdc_ghpr_integration_test/go/src/github.com/pingcap/tiflow/cdc/capture/capture.go:279"]

cdc.log

cdc/api/v1/api.go Outdated Show resolved Hide resolved
@overvenus
Copy link
Member Author

/run-all-tests

@overvenus
Copy link
Member Author

/run-kafka-integration-test
/run-integration-test

@3AceShowHand
Copy link
Contributor

/run-verify-ci

@codecov-commenter
Copy link

Codecov Report

Merging #5955 (60898f6) into master (070e459) will increase coverage by 0.5127%.
The diff coverage is 69.5089%.

Flag Coverage Δ
cdc 63.0320% <70.7519%> (+1.0925%) ⬆️
dm 51.9882% <100.0000%> (-0.0039%) ⬇️
engine 62.4118% <54.6729%> (-0.0131%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             master      #5955        +/-   ##
================================================
+ Coverage   57.1666%   57.6794%   +0.5127%     
================================================
  Files           686        697        +11     
  Lines         80923      82421      +1498     
================================================
+ Hits          46261      47540      +1279     
- Misses        30368      30556       +188     
- Partials       4294       4325        +31     

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 23, 2022
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 23, 2022
@overvenus
Copy link
Member Author

/run-kafka-integration-test
/run-integration-test

1 similar comment
@3AceShowHand
Copy link
Contributor

/run-kafka-integration-test
/run-integration-test

overvenus and others added 20 commits June 24, 2022 22:25
* tp: add table id to heartbeat and support burst remove

Signed-off-by: Neil Shen <[email protected]>

* tp: support basic balance tables

Signed-off-by: Neil Shen <[email protected]>
* add more tests.

* coverage to 79.1

* Update cdc/scheduler/internal/tp/agent.go

Co-authored-by: Neil Shen <[email protected]>

Co-authored-by: Neil Shen <[email protected]>
* cdc: integrate two phase scheduler

Signed-off-by: Neil Shen <[email protected]>

* tp: fix duplicate topic handlers panic

Signed-off-by: Neil Shen <[email protected]>

* tp: add From to messages from scheduler

Signed-off-by: Neil Shen <[email protected]>

* tp: add more comments and fix typo

Signed-off-by: Neil Shen <[email protected]>

* cdc: fix missing table ID in table meta

Signed-off-by: Neil Shen <[email protected]>

* cdc: fix processor.tick panic

Signed-off-by: Neil Shen <[email protected]>

* tp: fix missing processor epoch

Signed-off-by: Neil Shen <[email protected]>

* tp: fix missing checkpoint ts in burstBalanceTask

Signed-off-by: Neil Shen <[email protected]>

* tp: fix lost AddTable request during Commit

Signed-off-by: Neil Shen <[email protected]>

* Revert "cdc: fix processor.tick panic"

This reverts commit ebb0f2c.
* add all changes.

* also check capture status, before rebalance tables.

* add basic move table scheduler implementation.

* add basic implementation of manual scheduling api.

* fix rebalance.

* add teste for manual move table .

* revert change in pb.go

* fix

* fix by review.
* fix by suggestion, use unique state for the table.
* fix TestTableExecutorAddingTableDirectly
* add ut TestTableExecutorAddingTableIndirectly
* add ut TestTableExecutorAddingTableIndirectly
* tp: support balance when add a new capture

Signed-off-by: Neil Shen <[email protected]>

* tp: fix an error when a stopped table state follows by a heartbeat response

Signed-off-by: Neil Shen <[email protected]>

* fix tests

Signed-off-by: Neil Shen <[email protected]>

* tp: add more tests

Signed-off-by: Neil Shen <[email protected]>
… calculation (pingcap#5761)

* fix tp advance checkpoint.
* fix ddl sink log.
* add a ut for replication manager advance checkpoint ts.
…essages. (pingcap#5791)

* agent should return absent directly.
* do not panic when remove table not exist.
* no need to check for table status.
* tp: fill resolved ts when add a new table

Signed-off-by: Neil Shen <[email protected]>

* tp: relax invarint check

Signed-off-by: Neil Shen <[email protected]>

* tp: add metrics

Signed-off-by: Neil Shen <[email protected]>

* tp: record tasks in burstBalance

Signed-off-by: Neil Shen <[email protected]>
* tp: clean up stale capture table metrics

Signed-off-by: Neil Shen <[email protected]>

* tp: refine balance scheduler

Signed-off-by: Neil Shen <[email protected]>

* tp: seperate balance scheduler to basic and balance

Signed-off-by: Neil Shen <[email protected]>

* fix tests

Signed-off-by: Neil Shen <[email protected]>

* Apply suggestions from code review

Co-authored-by: Ling Jin <[email protected]>

* address comments

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: Ling Jin <[email protected]>
…p2p messages. (pingcap#5820)

* add table struct to agent.

* agent add table state machine.

* simplify coordinator.

* agent to be state awared.

* call IsRemoveTableFinished to clean table resource from processor.

* fix message header.

* fix agent.

* prepare new agent ready.

* fix some test.

* add basic ut.

* fix agent handle message ut.

* add all test.

* fix agent handle stopping.

* adjust pipeline table state.

* refine the agent.

* introduce tableManager.

* fix all tests.

* add some new test.

* fix log.

* fix by make check

* fix by review comment.

* fix by review comment.

* fix ut.

* fix check.

* agent fix heartbeat does not refresh each tick.

* remove scheduler log.

* fix ut.

* fix some test.

* fix ut.

* rename

* fix by check.
pingcap#5906)

* tp: clean up metrics and add logs
* tp: cleanup running task when a table has shutdown
* tests: skip check move table results

Signed-off-by: Neil Shen <[email protected]>
* drain capture scheduler schedule logic.

* add drain capture scheduler ut.

* coordinator adjust.

* simplify drain capture, only return table count.

Co-authored-by: Neil Shen <[email protected]>
Signed-off-by: Neil Shen <[email protected]>
@ti-chi-bot ti-chi-bot removed the status/can-merge Indicates a PR has been approved by a committer. label Jun 24, 2022
@overvenus
Copy link
Member Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 26ff1a4

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 24, 2022
@overvenus
Copy link
Member Author

/run-all-tests

@ti-chi-bot
Copy link
Member

@overvenus: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@overvenus
Copy link
Member Author

/run-verify
/run-leak-test

@ti-chi-bot ti-chi-bot merged commit 97cc048 into pingcap:master Jun 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ticdc Issues or PRs related to TiCDC. component/scheduler TiCDC inner scheduler component. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. tide/merge-method-rebase Denotes a PR that should be rebased by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants