Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] [Tizen] Calling ChipDnssdResolve with the CHIP stack locked can cause deadlock #25466

Closed
gharveymn opened this issue Mar 3, 2023 · 5 comments · Fixed by #25573
Closed
Assignees
Labels
tizen For Tizen platform

Comments

@gharveymn
Copy link
Contributor

Reproduction steps

During our (SmartThings) testing on the Tizen platform we have been encountering an issue where the DNS-SD implementation may become deadlocked when attempting to resolve multiple devices at the same time. I have determined that this occurs because we call ChipDnssdResolve with the CHIP stack locked. Consider the following scenario with threads [1] and [2]:

  1. [2] Is in the middle of handling a request for service resolution.
  2. [1] Some function calling ChipDnssdResolve acquires the CHIP stack lock some arbitrary amount of time beforehand.
  3. [2] The service is resolved and this thread acquires the NSD lock (ref)
  4. [1] As a result of calling ChipDnssdResolve, this thread reaches dnssd_create_remote_service and attempts to acquire the NSD lock. It waits.
  5. [2] The callback OnResolve in the Matter SDK attempts to acquire the CHIP stack lock. It also waits (here).

As a result, the mutually dependent NSD and CHIP stack locks become deadlocked.

If it helps, I've also modified the TestDnssd unit test and some build files on a branch in a fork to demonstrate this issue (just a rough patch using your existing work with containers, @arkq). You can find it at https://github.com/gharveymn/connectedhomeip/tree/tizen-deadlock. You can run the test by starting up the docker container and executing gen-cmd.sh and then ninja -C out/tizen-unit-check check-unit.

CC: @msandstedt @arkq @hyunuktak @chulspro

Bug prevalence

Whenever successfully resolving multiple devices concurrently

GitHub hash of the SDK that was being used

3957377

Platform

other

Platform Version(s)

No response

Anything else?

No response

@msandstedt
Copy link
Contributor

I have determined that this occurs because we call ChipDnssdResolve with the CHIP stack locked.

By the way, it is required that we call ChipDnssdResolve with the CHIP stack locked, so it shouldn't be misunderstood that there is an assertion that not doing this could be a solution.

Viable solutions may be:

  • ensuring that NSD and CHIP stack lock are acquired in the same order in both threads
  • reworking Tizen's main loop to execute from the sdk's event loop

@arkq
Copy link
Contributor

arkq commented Mar 4, 2023

OK, I'll look into it. Thanks for reporting!

@arkq arkq self-assigned this Mar 4, 2023
@arkq arkq added the tizen For Tizen platform label Mar 5, 2023
@arkq
Copy link
Contributor

arkq commented Mar 8, 2023

@gharveymn, if you can, please test patch provided in #25573. It's a workaround for bad design of DNS-SD Tizen API, which calls callbacks with it's internal mutex locked...

@gharveymn
Copy link
Contributor Author

Hey @arkq, thanks for the quick update! I'm just gonna let you know that I'm currently experiencing some health issues, so I personally won't be able to test this immediately. This is still a high priority issue for us though, so I'll see if someone else can take over in the mean time.

@gharveymn
Copy link
Contributor Author

@arkq Apologies for the wait. I've live-tested the patch with the same setup as before, and it seems to fix the issue as expected. Thanks for the help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tizen For Tizen platform
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants