Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

snapshot(ticdc): fix ddl puller and ddl manager stuck caused by dead lock #11886

Conversation

wlwilliamx
Copy link
Contributor

@wlwilliamx wlwilliamx commented Dec 16, 2024

What problem does this PR solve?

Issue Number: close #11884

What is changed and how it works?

Summary

This PR fixes a deadlock issue in the Snapshot implementation:

Deadlock in Recursive Read Locking: Although Go’s sync.RWMutex allows recursive read locks, they can result in deadlocks if a write lock is requested during the recursive read lock execution. This blocks the outer read lock from releasing, preventing the write lock from proceeding.

This PR refactors lock usage patterns to avoid recursive read locking.


Root Causes of the Deadlocks

Recursive Read Lock Issue

Recursive calls involving RWMutex.RLock() can result in deadlocks when a write lock is requested during the recursive read lock execution. This behavior arises because Go’s sync.RWMutex prioritizes write locks over read locks.

Here is an example that illustrates the problem:

func (s *Snapshot) Operation() {
    s.rwlock.RLock()
    defer s.rwlock.RUnlock()
    s.NestedOperation() // Second RLock
}

func (s *Snapshot) NestedOperation() {
    s.rwlock.RLock()
    defer s.rwlock.RUnlock()
    // Perform some operations
}

If a write lock is requested while NestedOperation is executing, the following chain of events occurs:

  1. The write lock request blocks new readers, including the recursive RLock in NestedOperation.
  2. NestedOperation cannot complete until its RLock is granted.
  3. The first RLock in Operation cannot release until NestedOperation completes.
  4. Deadlock occurs because the first RLock and the recursive RLock are mutually dependent.

Check List

Tests

  • Unit test
  • Integration test

Questions

Will it cause performance regression or break compatibility?

No.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

None.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 16, 2024
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Dec 16, 2024
Copy link
Contributor

ti-chi-bot bot commented Dec 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, CharlesCheung96

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [3AceShowHand,CharlesCheung96]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 16, 2024
Copy link
Contributor

ti-chi-bot bot commented Dec 16, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-12-16 10:33:56.033692121 +0000 UTC m=+866626.122494663: ☑️ agreed by 3AceShowHand.
  • 2024-12-16 10:59:02.014033006 +0000 UTC m=+868132.102835545: ☑️ agreed by CharlesCheung96.

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Dec 16, 2024
@wlwilliamx
Copy link
Contributor Author

/test cdc-integration-pulsar-test

@wlwilliamx
Copy link
Contributor Author

/retest

@ti-chi-bot ti-chi-bot bot merged commit 0bb4977 into pingcap:master Dec 16, 2024
26 checks passed
@wlwilliamx wlwilliamx changed the title snapshot(ticdc): fix ddl puller and ddl manager stuck caused by two dead lock snapshot(ticdc): fix ddl puller and ddl manager stuck caused by dead lock Dec 17, 2024
@fubinzh
Copy link

fubinzh commented Dec 17, 2024

/cherry-pick release-8.5
/cherry-pick release-7.5

@ti-chi-bot
Copy link
Member

@fubinzh: new pull request created to branch release-7.5: #11887.

In response to this:

/cherry-pick release-8.5
/cherry-pick release-7.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

@fubinzh: new pull request created to branch release-8.5: #11888.

In response to this:

/cherry-pick release-8.5
/cherry-pick release-7.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@fubinzh
Copy link

fubinzh commented Dec 17, 2024

/cherry-pick release-8.1

@ti-chi-bot
Copy link
Member

@fubinzh: new pull request created to branch release-8.1: #11889.

In response to this:

/cherry-pick release-8.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@3AceShowHand 3AceShowHand added needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. labels Dec 17, 2024
ti-chi-bot pushed a commit to ti-chi-bot/tiflow that referenced this pull request Dec 17, 2024
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.1: #11896.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #11897.

Copy link
Contributor

ti-chi-bot bot commented Dec 19, 2024

@wlwilliamx: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cdc-integration-kafka-test d17865a link unknown /test cdc-integration-kafka-test
pull-cdc-integration-pulsar-test d17865a link true /test cdc-integration-pulsar-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

ti-chi-bot bot pushed a commit that referenced this pull request Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
5 participants