Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler(2pc): agent for 2 phase scheduling #5593

Merged
merged 24 commits into from
May 31, 2022

Conversation

3AceShowHand
Copy link
Contributor

@3AceShowHand 3AceShowHand commented May 25, 2022

What problem does this PR solve?

Issue Number: ref #4757

What is changed and how it works?

2 phase scheduling agent implementation.

Check List

Tests

  • Unit test

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

`None`.

@ti-chi-bot
Copy link
Member

ti-chi-bot commented May 25, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • overvenus

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-linked-issue needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 25, 2022
@3AceShowHand 3AceShowHand changed the base branch from master to fb/latency May 25, 2022 12:05
@ti-chi-bot ti-chi-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 25, 2022
@3AceShowHand 3AceShowHand changed the title scheduler(2pc): agent for 2 phase scheduling [DNM] scheduler(2pc): agent for 2 phase scheduling May 25, 2022
@ti-chi-bot ti-chi-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 26, 2022
@ti-chi-bot ti-chi-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 27, 2022
@3AceShowHand 3AceShowHand force-pushed the agent-2-phase-scheduling branch from 365c2bf to 182f2a9 Compare May 27, 2022 12:53
@3AceShowHand 3AceShowHand changed the title [DNM] scheduler(2pc): agent for 2 phase scheduling scheduler(2pc): agent for 2 phase scheduling May 30, 2022
@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 30, 2022
@3AceShowHand 3AceShowHand requested a review from overvenus May 30, 2022 09:25
@3AceShowHand 3AceShowHand added component/scheduler TiCDC inner scheduler component. area/ticdc Issues or PRs related to TiCDC. and removed do-not-merge/needs-linked-issue labels May 30, 2022
cdc/scheduler/internal/table_executor.go Outdated Show resolved Hide resolved
cdc/scheduler/internal/base/agent.go Outdated Show resolved Hide resolved
cdc/processor/pipeline/table.go Outdated Show resolved Hide resolved
cdc/processor/processor_test.go Outdated Show resolved Hide resolved
cdc/scheduler/internal/tp/agent.go Outdated Show resolved Hide resolved
Comment on lines 545 to 548
// This panic will happen only if two messages have been received
// with the same ownerRev but with different ownerIDs.
// This should never happen unless the election via Etcd is buggy.
log.Panic("owner IDs do not match",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if p2p batches messages of different owners?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not panic, because message from different owner should have different revision.

This is for two owner with the same revision, but different owner ID, which is unrecoveriable etcd error.

dm/pkg/schema/tracker_test.go Outdated Show resolved Hide resolved
@@ -573,6 +576,7 @@ func (r *ReplicationSet) handleRemoveTable() ([]*schedulepb.Message, error) {
zap.Any("replicationSet", r), zap.Int64("tableID", r.TableID))
return nil, nil
}
// todo: OldState must be `replicating` here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we deal with these todo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks OldState is only used for logging, we can remove it from the log, since we know that it's must be in desired state.


// pendingTasks is a queue of dispatch table task yet to be processed.
// the Deque stores *dispatchTableTask.
pendingTasks deque.Deque
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a deque? can we handle tasks immediately? Owner control the total number of task that are running concurrently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider that dispatchTableRequest A and B received, sent by owner 1, and followed by heartbeat or dispatchTableRequest sent by owner 2. A and B should not be processed.

By putting tasks in pending states, make sure this tick does not do unnecessary task handling.

But I consider that, no matter owner switch or not, all tasks should be handled, this can reduce unnecessary table rescheduling, and ongoing old task does not affect global table distribution and scheduling correctly, since the heartbeat can help the new owner detect all tables' states.

So, we can remove pendingTasks, just handling all dispatchTableTasks directly.

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label May 31, 2022
@3AceShowHand 3AceShowHand merged commit dfc2d3e into pingcap:fb/latency May 31, 2022
overvenus pushed a commit to overvenus/ticdc that referenced this pull request Jun 21, 2022
* fix some typo.

* update table scheduler proto

* add some new to agent.

* track owner info.

* try to handle dispatch table request.

* add more and more to agent implementation.

* fix update owner info.

* finish handle dispatch table.

* tackle epoch

* remove checkpoint from proto

* handle heartbeat with stopping.

* add benchmark for heartbeat response.

* fix agent.

* fix agent code layout.

* refine benchmark test.

* refine coordinator / capture_manager / relication_manager.

* fix agent.

* add a lot of test.

* revise the code.

* fix by suggestion.

* fix by suggestion.

* remoe pendingTask.

* fix unit test.
overvenus pushed a commit to overvenus/ticdc that referenced this pull request Jun 21, 2022
* fix some typo.

* update table scheduler proto

* add some new to agent.

* track owner info.

* try to handle dispatch table request.

* add more and more to agent implementation.

* fix update owner info.

* finish handle dispatch table.

* tackle epoch

* remove checkpoint from proto

* handle heartbeat with stopping.

* add benchmark for heartbeat response.

* fix agent.

* fix agent code layout.

* refine benchmark test.

* refine coordinator / capture_manager / relication_manager.

* fix agent.

* add a lot of test.

* revise the code.

* fix by suggestion.

* fix by suggestion.

* remoe pendingTask.

* fix unit test.
overvenus pushed a commit to overvenus/ticdc that referenced this pull request Jun 23, 2022
* fix some typo.

* update table scheduler proto

* add some new to agent.

* track owner info.

* try to handle dispatch table request.

* add more and more to agent implementation.

* fix update owner info.

* finish handle dispatch table.

* tackle epoch

* remove checkpoint from proto

* handle heartbeat with stopping.

* add benchmark for heartbeat response.

* fix agent.

* fix agent code layout.

* refine benchmark test.

* refine coordinator / capture_manager / relication_manager.

* fix agent.

* add a lot of test.

* revise the code.

* fix by suggestion.

* fix by suggestion.

* remoe pendingTask.

* fix unit test.
overvenus pushed a commit to overvenus/ticdc that referenced this pull request Jun 23, 2022
* fix some typo.

* update table scheduler proto

* add some new to agent.

* track owner info.

* try to handle dispatch table request.

* add more and more to agent implementation.

* fix update owner info.

* finish handle dispatch table.

* tackle epoch

* remove checkpoint from proto

* handle heartbeat with stopping.

* add benchmark for heartbeat response.

* fix agent.

* fix agent code layout.

* refine benchmark test.

* refine coordinator / capture_manager / relication_manager.

* fix agent.

* add a lot of test.

* revise the code.

* fix by suggestion.

* fix by suggestion.

* remoe pendingTask.

* fix unit test.
overvenus pushed a commit to overvenus/ticdc that referenced this pull request Jun 24, 2022
* fix some typo.

* update table scheduler proto

* add some new to agent.

* track owner info.

* try to handle dispatch table request.

* add more and more to agent implementation.

* fix update owner info.

* finish handle dispatch table.

* tackle epoch

* remove checkpoint from proto

* handle heartbeat with stopping.

* add benchmark for heartbeat response.

* fix agent.

* fix agent code layout.

* refine benchmark test.

* refine coordinator / capture_manager / relication_manager.

* fix agent.

* add a lot of test.

* revise the code.

* fix by suggestion.

* fix by suggestion.

* remoe pendingTask.

* fix unit test.
overvenus pushed a commit to overvenus/ticdc that referenced this pull request Jun 24, 2022
* fix some typo.

* update table scheduler proto

* add some new to agent.

* track owner info.

* try to handle dispatch table request.

* add more and more to agent implementation.

* fix update owner info.

* finish handle dispatch table.

* tackle epoch

* remove checkpoint from proto

* handle heartbeat with stopping.

* add benchmark for heartbeat response.

* fix agent.

* fix agent code layout.

* refine benchmark test.

* refine coordinator / capture_manager / relication_manager.

* fix agent.

* add a lot of test.

* revise the code.

* fix by suggestion.

* fix by suggestion.

* remoe pendingTask.

* fix unit test.
overvenus pushed a commit to overvenus/ticdc that referenced this pull request Jun 24, 2022
* fix some typo.

* update table scheduler proto

* add some new to agent.

* track owner info.

* try to handle dispatch table request.

* add more and more to agent implementation.

* fix update owner info.

* finish handle dispatch table.

* tackle epoch

* remove checkpoint from proto

* handle heartbeat with stopping.

* add benchmark for heartbeat response.

* fix agent.

* fix agent code layout.

* refine benchmark test.

* refine coordinator / capture_manager / relication_manager.

* fix agent.

* add a lot of test.

* revise the code.

* fix by suggestion.

* fix by suggestion.

* remoe pendingTask.

* fix unit test.
overvenus pushed a commit to overvenus/ticdc that referenced this pull request Jun 24, 2022
* fix some typo.

* update table scheduler proto

* add some new to agent.

* track owner info.

* try to handle dispatch table request.

* add more and more to agent implementation.

* fix update owner info.

* finish handle dispatch table.

* tackle epoch

* remove checkpoint from proto

* handle heartbeat with stopping.

* add benchmark for heartbeat response.

* fix agent.

* fix agent code layout.

* refine benchmark test.

* refine coordinator / capture_manager / relication_manager.

* fix agent.

* add a lot of test.

* revise the code.

* fix by suggestion.

* fix by suggestion.

* remoe pendingTask.

* fix unit test.
ti-chi-bot pushed a commit that referenced this pull request Jun 24, 2022
* fix some typo.

* update table scheduler proto

* add some new to agent.

* track owner info.

* try to handle dispatch table request.

* add more and more to agent implementation.

* fix update owner info.

* finish handle dispatch table.

* tackle epoch

* remove checkpoint from proto

* handle heartbeat with stopping.

* add benchmark for heartbeat response.

* fix agent.

* fix agent code layout.

* refine benchmark test.

* refine coordinator / capture_manager / relication_manager.

* fix agent.

* add a lot of test.

* revise the code.

* fix by suggestion.

* fix by suggestion.

* remoe pendingTask.

* fix unit test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ticdc Issues or PRs related to TiCDC. component/scheduler TiCDC inner scheduler component. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/LGT1 Indicates that a PR has LGTM 1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants