Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Crash" on TestInetEndPoint in CI on Darwin #10025

Closed
bzbarsky-apple opened this issue Sep 28, 2021 · 3 comments · Fixed by #10060
Closed

"Crash" on TestInetEndPoint in CI on Darwin #10025

bzbarsky-apple opened this issue Sep 28, 2021 · 3 comments · Fixed by #10060

Comments

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Sep 28, 2021

Problem

Sometimes the TestInetEndPoint test "crashes" in CI on Darwin like so:

../../third_party/pigweed/repo/targets/host/run_test: line 18: 17937 Abort trap: 6           "$@"
INF Test 1/1: [FAIL] TestInetEndPoint

We added diagnostic log upload, but it turns out this is not actually a crash; it's an abort() call, so it looks like that does not generate a diagnostic log...

But the logging we added for VerifyOrDie does catch this in https://github.com/project-chip/connectedhomeip/runs/3733865170?check_suite_focus=true:

[1632840981785] [17937:91664] CHIP: [IN] Async DNS worker thread woke up.
[1632840981785] [17937:91664] CHIP: [IN] Posting DNS completion event to CHIP thread.
[1632840981785] [17937:91640] CHIP: [SPT] VerifyOrDie failure at ../../src/system/SystemTimer.cpp:152: lTimer->mNextTimer != add

Proposed Solution

Sort out why that VerifyOrDie is getting hit and fix it.

@mspang @kpschoedel @andy31415 @woody-apple

@bzbarsky-apple
Copy link
Contributor Author

bzbarsky-apple commented Sep 28, 2021

So the abort is because something is adding a timer that is already in the timer list.

I did some test runs with the log line above replaced with Mac-specific backtrace dumping, and the stack trace (sorry, no line numbers, but I expect them to be pretty obvious for the most part) looks like this:

chip::System::Timer::List::Add(chip::System::Timer*)
chip::System::Timer::MutexedList::Add(chip::System::Timer*)
chip::System::LayerImplSelect::StartTimer(unsigned int, void (*)(chip::System::Layer*, void*), void*)
ServiceEvents(unsigned int)
ServiceNetwork(unsigned int)
TestResolveHostAddress(_nlTestSuite*, void*)

the only real line number ambiguity here is which of the ServiceNetwork calls in TestResolveHostAddress is involved here. The test log has:

1632856161377] [67621:187410] CHIP: [IN] Async DNS worker thread woke up.
[1632856161381] [67621:187410] CHIP: [IN] Posting DNS completion event to CHIP thread.
    DNS name resolution complete: 142.250.69.196
    DNS name resolution complete: 127.0.0.1
[1632856161381] [67621:187411] CHIP: [IN] Async DNS worker thread woke up.
[1632856161381] [67621:187411] CHIP: [IN] Posting DNS completion event to CHIP thread.
../../third_party/pigweed/repo/targets/host/run_test: line 18: 67621 Abort trap: 6           "$@"

which annoyingly mixes ChipLog bits and printf bits, which makes it a little hard to tell what the actual ordering was, because I am not sure I can assume it matches the output order.... Will see if I can fix things here to use ChipLog consistently and reproduce again.

@bzbarsky-apple
Copy link
Contributor Author

bzbarsky-apple commented Sep 29, 2021

OK, so I tried running the unit tests under TSan and then everything becomes clear:

WARNING: ThreadSanitizer: data race (pid=21940)
  Read of size 8 at 0x0001021ac828 by thread T1:
    #0 chip::System::Timer::List::Remove(void (*)(chip::System::Layer*, void*), void*) SystemTimer.cpp:201 (TestInetEndPoint:x86_64+0x10004a71c)
    #1 chip::System::Timer::MutexedList::Remove(void (*)(chip::System::Layer*, void*), void*) SystemTimer.h:156 (TestInetEndPoint:x86_64+0x100046b0a)
    #2 chip::System::LayerImplSelect::CancelTimer(void (*)(chip::System::Layer*, void*), void*) SystemLayerImplSelect.cpp:166 (TestInetEndPoint:x86_64+0x100046a06)
    #3 chip::System::LayerImplSelect::ScheduleWork(void (*)(chip::System::Layer*, void*), void*) SystemLayerImplSelect.cpp:187 (TestInetEndPoint:x86_64+0x100046bf0)
    #4 chip::Inet::AsyncDNSResolverSockets::NotifyChipThread(chip::Inet::DNSResolver*) AsyncDNSResolverSockets.cpp:316 (TestInetEndPoint:x86_64+0x100043437)
    #5 chip::Inet::AsyncDNSResolverSockets::AsyncDNSThreadRun(void*) AsyncDNSResolverSockets.cpp:342 (TestInetEndPoint:x86_64+0x1000428be)

  Previous write of size 8 at 0x0001021ac828 by main thread:
    #0 chip::System::Timer::List::Remove(void (*)(chip::System::Layer*, void*), void*) SystemTimer.cpp:207 (TestInetEndPoint:x86_64+0x10004a7a9)
    #1 chip::System::Timer::MutexedList::Remove(void (*)(chip::System::Layer*, void*), void*) SystemTimer.h:156 (TestInetEndPoint:x86_64+0x100046b0a)
    #2 chip::System::LayerImplSelect::CancelTimer(void (*)(chip::System::Layer*, void*), void*) SystemLayerImplSelect.cpp:166 (TestInetEndPoint:x86_64+0x100046a06)
    #3 chip::System::LayerImplSelect::StartTimer(unsigned int, void (*)(chip::System::Layer*, void*), void*) SystemLayerImplSelect.cpp:123 (TestInetEndPoint:x86_64+0x1000465ae)
    #4 ServiceEvents(unsigned int) TestInetCommonPosix.cpp:464 (TestInetEndPoint:x86_64+0x100031238)

How could we have a data race under MutexedList::Remove you ask?

src/platform/Darwin/SystemPlatformConfig.h:#define CHIP_SYSTEM_CONFIG_NO_LOCKING 1

and from src/system/SystemMutex.h:

#if CHIP_SYSTEM_CONFIG_NO_LOCKING
inline CHIP_ERROR Init(Mutex & aMutex)
{
    return CHIP_NO_ERROR;
}
inline void Mutex::Lock() {}
inline void Mutex::Unlock() {}
#endif // CHIP_SYSTEM_CONFIG_NO_LOCKING

So the timer handling ends up completely unsynchronized, we have data races, and end up in a bad state sometimes, not surprising at that point.

@vivien-apple @kpschoedel @sagar-apple We need to figure out what the actual contract here is going to be when CHIP_SYSTEM_CONFIG_USE_DISPATCH. Two obvious problems:

  1. If GetDispatchQueue() returns null (as it does in this test) we falll back to the non-dispatch codepaths, but those most definitely don't work if CHIP_SYSTEM_CONFIG_NO_LOCKING.
  2. Even if GetDispatchQueue() is non-null, we store things in mTimerList, which ends up racy, because CHIP_SYSTEM_CONFIG_NO_LOCKING is set.

@bzbarsky-apple
Copy link
Contributor Author

Maybe the answer is we should not be using AsyncDNSResolverSockets on Darwin if we're not supposed to be spinning up extra threads? And then we need a different impl of InetLayer::ResolveHostAddress...

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 29, 2021
In the use_dispatch setup the Matter stack is expected to not spin up
threads manually, but our async DNS implementation does just that via
pthreads.  This is leading to random failures in TestInetEndpoint on
Darwin due to thread data races.

project-chip#10025
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 29, 2021
In the use_dispatch setup the Matter stack is expected to not spin up
threads manually, but our async DNS implementation does just that via
pthreads.  This is leading to random failures in TestInetEndpoint on
Darwin due to thread data races.

Fixes project-chip#10025
andy31415 pushed a commit that referenced this issue Sep 29, 2021
In the use_dispatch setup the Matter stack is expected to not spin up
threads manually, but our async DNS implementation does just that via
pthreads.  This is leading to random failures in TestInetEndpoint on
Darwin due to thread data races.

Fixes #10025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant