Use of `mHandleSelectThread` in SystemLayer is data-racy #7818

bzbarsky-apple · 2021-06-22T13:28:40Z

Problem

When I tried to use ScheduleWork from outside the Matter thread, I get thread races reported by TSan like so:

WARNING: ThreadSanitizer: data race (pid=9506)
  Write of size 8 at 0x556d474f9c38 by thread T4 (mutexes: write M97):
    #0 chip::System::Layer::HandleTimeout() ../../../examples/chip-tool/third_party/connectedhomeip/src/system/SystemLayer.cpp:655 (chip-tool+0x1f5a08)
    #1 chip::System::WatchableEventManager::HandleEvents() ../../../examples/chip-tool/third_party/connectedhomeip/src/system/WatchableSocketSelect.cpp:190 (chip-tool+0x1f99ac)
    #2 chip::DeviceLayer::Internal::GenericPlatformManagerImpl_POSIX<chip::DeviceLayer::PlatformManagerImpl>::_RunEventLoop() ../../../examples/chip-tool/third_party/connectedhomeip/src/include/platform/internal/GenericPlatformManagerImpl_POSIX.cpp:190 (chip-tool+0x1dc422)
    #3 chip::DeviceLayer::PlatformManager::RunEventLoop() ../../../examples/chip-tool/third_party/connectedhomeip/src/include/platform/PlatformManager.h:247 (chip-tool+0x1db97d)

  Previous read of size 8 at 0x556d474f9c38 by main thread:
    #0 chip::System::Layer::WakeIOThread() ../../../examples/chip-tool/third_party/connectedhomeip/src/system/SystemLayer.cpp:699 (chip-tool+0x1f5b3c)
    #1 chip::DeviceLayer::Internal::GenericPlatformManagerImpl_POSIX<chip::DeviceLayer::PlatformManagerImpl>::_PostEvent(chip::DeviceLayer::ChipDeviceEvent const*) ../../../examples/chip-tool/third_party/connectedhomeip/src/include/platform/internal/GenericPlatformManagerImpl_POSIX.cpp:144 (chip-tool+0x1dc2ce)
    #2 chip::DeviceLayer::PlatformManager::PostEvent(chip::DeviceLayer::ChipDeviceEvent const*) ../../../examples/chip-tool/third_party/connectedhomeip/src/include/platform/PlatformManager.h:310 (chip-tool+0x1d66e4)
    #3 chip::DeviceLayer::Internal::GenericPlatformManagerImpl<chip::DeviceLayer::PlatformManagerImpl>::_ScheduleWork(void (*)(long), long) ../../../examples/chip-tool/third_party/connectedhomeip/src/include/platform/internal/GenericPlatformManagerImpl.cpp:198 (chip-tool+0x1dbd90)
    #4 chip::DeviceLayer::PlatformManager::ScheduleWork(void (*)(long), long) <null> (chip-tool+0x31cfc)

and indeed RunEventLoop ends up repeatedly calling HandleEvents (with the CHIP lock held). Each call calls Layer::HandleTimeout, which sets mHandleSelectThread, then does some work, then unsets mHandleSelectThread.

On the other hand, ScheduleWork ends up calling Layer::WakeIOThread, which ends up no-opping via early return if pthread_equal(this->mHandleSelectThread, pthread_self()). There's no synchronization around this bit.

This is clearly racy. Futher, since mHandleSelectThread is only set transiently, the only time we take the early return is if we are in the middle of HandleTimeout. It's not clear to me why we're doing this no-opping in the pthread_equal case (I wish it were documented!), but doing so non-deterministically depending on where in the execution of HandleTimeout we are seems fishy.

Proposed Solution

I'm not sure. If we tried to take the lock in WakeIOThread then we could conceivably deadlock if we're already on the thread that holds the lock, and more to the point once we get the lock we are guaranteed that mHandleSelectThread is PTHREAD_NULL....

This looks like fallout from #6561

@kpschoedel

The text was updated successfully, but these errors were encountered:

bzbarsky-apple · 2021-06-22T18:15:44Z

This looks like fallout from #6561

It's not. When I run the same code on top of the pre-#6561 codebase I get the same Tsan failure. So this is a pre-existing problem with GenericPlatformManagerImpl_POSIX::_PostEvent....

bzbarsky-apple · 2021-06-22T18:28:08Z

And this general mHandleSelectThread thing and the way it's used in WakeSelect and HandleSelectResult exists back to Weave, and according to @kpschoedel back to before Weave code was imported into Google's code. But the use of WakeSelect in posting events seems to be new in CHIP. So maybe all the WakeSelect calls in Weave are synchronized somehow? @mrjerryjohns @tcarmelveilleux @andy31415

bzbarsky-apple · 2021-06-22T22:16:19Z

So poking at this some:

This seems to be an optimization to avoid trying to wake up the IO thread if we don't need to.
The optimization in its current form is valid (at least for _PostEvent) because the thing _PostEvent adds to is processed under GenericPlatformManagerImpl_POSIX::ProcessDeviceEvents, which is called after watchState.HandleEvents(), and mHandleSelectThread is only set during part of HandleEvents, specifically under Layer::HandleTimeout.
The data race is sort-of benign in that if we read "too early", after the IO thread wakes up but before it entersHandleTimeout, then we will just do an unnecessary wakeup write. If we read "too late" (after HandleTimeout is done), then we might also do an unnecessary wakeup. If we read while the value is non-null, then either we are on a different thread and then pthread_equal will return false and we will do the wakeup, or we are on the same thread and then we are not racing.
It's not clear to me whether the data race is really benign in the sense that we can't read parts of a pthread_t out of mHandleSelectThread and therefore accidentally get a true result from pthread_equal because we read a value that is not the one that anyone ever wrote to mHandleSelectThread.

I question whether this optimization is really worthwhile, or as complete as it "should" be, but maybe for now we should just address item 4 there by making mHandleSelectThread be std::atomic<pthread_t>?

mHandleSelectThread is used for an optimization: when it's set to the current thread id, WakeIOThread knows that: 1) We're in the middle of running Layer::HandleTimeout 2) We're doing that on the same thread where WakeIOThread was called and hence WakeIOThread doesn't need to do anything, because we're already on the IO thread and it's already awake. The read of this member in WakeIOThread is not synchronized in any way, but as long as we guarantee that it correctly reads a value that was actually assigned to the member (which std::atomic does), things will work correctly. Either the value we read will be equal to the current thread id, in which case we know we're on the one thread that called Layer::HandleTimeout and are inside that function, or it will not be equal to our thread id and then we need to do the actual wakeup work, whatever that value is (including if it's null). Fixes project-chip#7818

mHandleSelectThread is used for an optimization: when it's set to the current thread id, WakeIOThread knows that: 1) We're in the middle of running Layer::HandleTimeout 2) We're doing that on the same thread where WakeIOThread was called and hence WakeIOThread doesn't need to do anything, because we're already on the IO thread and it's already awake. The read of this member in WakeIOThread is not synchronized in any way, but as long as we guarantee that it correctly reads a value that was actually assigned to the member (which std::atomic does), things will work correctly. Either the value we read will be equal to the current thread id, in which case we know we're on the one thread that called Layer::HandleTimeout and are inside that function, or it will not be equal to our thread id and then we need to do the actual wakeup work, whatever that value is (including if it's null). Fixes #7818

mHandleSelectThread is used for an optimization: when it's set to the current thread id, WakeIOThread knows that: 1) We're in the middle of running Layer::HandleTimeout 2) We're doing that on the same thread where WakeIOThread was called and hence WakeIOThread doesn't need to do anything, because we're already on the IO thread and it's already awake. The read of this member in WakeIOThread is not synchronized in any way, but as long as we guarantee that it correctly reads a value that was actually assigned to the member (which std::atomic does), things will work correctly. Either the value we read will be equal to the current thread id, in which case we know we're on the one thread that called Layer::HandleTimeout and are inside that function, or it will not be equal to our thread id and then we need to do the actual wakeup work, whatever that value is (including if it's null). Fixes project-chip#7818

kpschoedel mentioned this issue Jun 22, 2021

Re-factor of how threading and event dispatch is handled in the SDK #7725

Closed

bzbarsky-apple mentioned this issue Jun 22, 2021

Make System::Layer::mHandleSelectThread atomic. #7828

Merged

woody-apple closed this as completed in #7828 Jun 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use of `mHandleSelectThread` in SystemLayer is data-racy #7818

Use of `mHandleSelectThread` in SystemLayer is data-racy #7818

bzbarsky-apple commented Jun 22, 2021

bzbarsky-apple commented Jun 22, 2021

bzbarsky-apple commented Jun 22, 2021

bzbarsky-apple commented Jun 22, 2021

Use of mHandleSelectThread in SystemLayer is data-racy #7818

Use of mHandleSelectThread in SystemLayer is data-racy #7818

Comments

bzbarsky-apple commented Jun 22, 2021

Problem

Proposed Solution

bzbarsky-apple commented Jun 22, 2021

bzbarsky-apple commented Jun 22, 2021

bzbarsky-apple commented Jun 22, 2021

Use of `mHandleSelectThread` in SystemLayer is data-racy #7818

Use of `mHandleSelectThread` in SystemLayer is data-racy #7818