Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of mHandleSelectThread in SystemLayer is data-racy #7818

Closed
bzbarsky-apple opened this issue Jun 22, 2021 · 3 comments · Fixed by #7828
Closed

Use of mHandleSelectThread in SystemLayer is data-racy #7818

bzbarsky-apple opened this issue Jun 22, 2021 · 3 comments · Fixed by #7828

Comments

@bzbarsky-apple
Copy link
Contributor

Problem

When I tried to use ScheduleWork from outside the Matter thread, I get thread races reported by TSan like so:

WARNING: ThreadSanitizer: data race (pid=9506)
  Write of size 8 at 0x556d474f9c38 by thread T4 (mutexes: write M97):
    #0 chip::System::Layer::HandleTimeout() ../../../examples/chip-tool/third_party/connectedhomeip/src/system/SystemLayer.cpp:655 (chip-tool+0x1f5a08)
    #1 chip::System::WatchableEventManager::HandleEvents() ../../../examples/chip-tool/third_party/connectedhomeip/src/system/WatchableSocketSelect.cpp:190 (chip-tool+0x1f99ac)
    #2 chip::DeviceLayer::Internal::GenericPlatformManagerImpl_POSIX<chip::DeviceLayer::PlatformManagerImpl>::_RunEventLoop() ../../../examples/chip-tool/third_party/connectedhomeip/src/include/platform/internal/GenericPlatformManagerImpl_POSIX.cpp:190 (chip-tool+0x1dc422)
    #3 chip::DeviceLayer::PlatformManager::RunEventLoop() ../../../examples/chip-tool/third_party/connectedhomeip/src/include/platform/PlatformManager.h:247 (chip-tool+0x1db97d)

  Previous read of size 8 at 0x556d474f9c38 by main thread:
    #0 chip::System::Layer::WakeIOThread() ../../../examples/chip-tool/third_party/connectedhomeip/src/system/SystemLayer.cpp:699 (chip-tool+0x1f5b3c)
    #1 chip::DeviceLayer::Internal::GenericPlatformManagerImpl_POSIX<chip::DeviceLayer::PlatformManagerImpl>::_PostEvent(chip::DeviceLayer::ChipDeviceEvent const*) ../../../examples/chip-tool/third_party/connectedhomeip/src/include/platform/internal/GenericPlatformManagerImpl_POSIX.cpp:144 (chip-tool+0x1dc2ce)
    #2 chip::DeviceLayer::PlatformManager::PostEvent(chip::DeviceLayer::ChipDeviceEvent const*) ../../../examples/chip-tool/third_party/connectedhomeip/src/include/platform/PlatformManager.h:310 (chip-tool+0x1d66e4)
    #3 chip::DeviceLayer::Internal::GenericPlatformManagerImpl<chip::DeviceLayer::PlatformManagerImpl>::_ScheduleWork(void (*)(long), long) ../../../examples/chip-tool/third_party/connectedhomeip/src/include/platform/internal/GenericPlatformManagerImpl.cpp:198 (chip-tool+0x1dbd90)
    #4 chip::DeviceLayer::PlatformManager::ScheduleWork(void (*)(long), long) <null> (chip-tool+0x31cfc)

and indeed RunEventLoop ends up repeatedly calling HandleEvents (with the CHIP lock held). Each call calls Layer::HandleTimeout, which sets mHandleSelectThread, then does some work, then unsets mHandleSelectThread.

On the other hand, ScheduleWork ends up calling Layer::WakeIOThread, which ends up no-opping via early return if pthread_equal(this->mHandleSelectThread, pthread_self()). There's no synchronization around this bit.

This is clearly racy. Futher, since mHandleSelectThread is only set transiently, the only time we take the early return is if we are in the middle of HandleTimeout. It's not clear to me why we're doing this no-opping in the pthread_equal case (I wish it were documented!), but doing so non-deterministically depending on where in the execution of HandleTimeout we are seems fishy.

Proposed Solution

I'm not sure. If we tried to take the lock in WakeIOThread then we could conceivably deadlock if we're already on the thread that holds the lock, and more to the point once we get the lock we are guaranteed that mHandleSelectThread is PTHREAD_NULL....

This looks like fallout from #6561

@kpschoedel

@bzbarsky-apple
Copy link
Contributor Author

This looks like fallout from #6561

It's not. When I run the same code on top of the pre-#6561 codebase I get the same Tsan failure. So this is a pre-existing problem with GenericPlatformManagerImpl_POSIX::_PostEvent....

@bzbarsky-apple
Copy link
Contributor Author

And this general mHandleSelectThread thing and the way it's used in WakeSelect and HandleSelectResult exists back to Weave, and according to @kpschoedel back to before Weave code was imported into Google's code. But the use of WakeSelect in posting events seems to be new in CHIP. So maybe all the WakeSelect calls in Weave are synchronized somehow? @mrjerryjohns @tcarmelveilleux @andy31415

@bzbarsky-apple
Copy link
Contributor Author

So poking at this some:

  1. This seems to be an optimization to avoid trying to wake up the IO thread if we don't need to.
  2. The optimization in its current form is valid (at least for _PostEvent) because the thing _PostEvent adds to is processed under GenericPlatformManagerImpl_POSIX::ProcessDeviceEvents, which is called after watchState.HandleEvents(), and mHandleSelectThread is only set during part of HandleEvents, specifically under Layer::HandleTimeout.
  3. The data race is sort-of benign in that if we read "too early", after the IO thread wakes up but before it entersHandleTimeout, then we will just do an unnecessary wakeup write. If we read "too late" (after HandleTimeout is done), then we might also do an unnecessary wakeup. If we read while the value is non-null, then either we are on a different thread and then pthread_equal will return false and we will do the wakeup, or we are on the same thread and then we are not racing.
  4. It's not clear to me whether the data race is really benign in the sense that we can't read parts of a pthread_t out of mHandleSelectThread and therefore accidentally get a true result from pthread_equal because we read a value that is not the one that anyone ever wrote to mHandleSelectThread.

I question whether this optimization is really worthwhile, or as complete as it "should" be, but maybe for now we should just address item 4 there by making mHandleSelectThread be std::atomic<pthread_t>?

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Jun 22, 2021
mHandleSelectThread is used for an optimization: when it's set to the
current thread id, WakeIOThread knows that:

1) We're in the middle of running Layer::HandleTimeout
2) We're doing that on the same thread where WakeIOThread was called

and hence WakeIOThread doesn't need to do anything, because we're
already on the IO thread and it's already awake.

The read of this member in WakeIOThread is not synchronized in any
way, but as long as we guarantee that it correctly reads a value that
was actually assigned to the member (which std::atomic does), things
will work correctly.  Either the value we read will be equal to the
current thread id, in which case we know we're on the one thread that
called Layer::HandleTimeout and are inside that function, or it will
not be equal to our thread id and then we need to do the actual wakeup
work, whatever that value is (including if it's null).

Fixes project-chip#7818
woody-apple pushed a commit that referenced this issue Jun 23, 2021
mHandleSelectThread is used for an optimization: when it's set to the
current thread id, WakeIOThread knows that:

1) We're in the middle of running Layer::HandleTimeout
2) We're doing that on the same thread where WakeIOThread was called

and hence WakeIOThread doesn't need to do anything, because we're
already on the IO thread and it's already awake.

The read of this member in WakeIOThread is not synchronized in any
way, but as long as we guarantee that it correctly reads a value that
was actually assigned to the member (which std::atomic does), things
will work correctly.  Either the value we read will be equal to the
current thread id, in which case we know we're on the one thread that
called Layer::HandleTimeout and are inside that function, or it will
not be equal to our thread id and then we need to do the actual wakeup
work, whatever that value is (including if it's null).

Fixes #7818
nikita-s-wrk pushed a commit to nikita-s-wrk/connectedhomeip that referenced this issue Sep 23, 2021
mHandleSelectThread is used for an optimization: when it's set to the
current thread id, WakeIOThread knows that:

1) We're in the middle of running Layer::HandleTimeout
2) We're doing that on the same thread where WakeIOThread was called

and hence WakeIOThread doesn't need to do anything, because we're
already on the IO thread and it's already awake.

The read of this member in WakeIOThread is not synchronized in any
way, but as long as we guarantee that it correctly reads a value that
was actually assigned to the member (which std::atomic does), things
will work correctly.  Either the value we read will be equal to the
current thread id, in which case we know we're on the one thread that
called Layer::HandleTimeout and are inside that function, or it will
not be equal to our thread id and then we need to do the actual wakeup
work, whatever that value is (including if it's null).

Fixes project-chip#7818
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant