Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canceling timers does not work reliably in the LayerImplSelect implementation #19387

Closed
bzbarsky-apple opened this issue Jun 9, 2022 · 6 comments · Fixed by #22375
Closed
Assignees

Comments

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Jun 9, 2022

Problem

When we process timers, we grab a batch out of the list before we start calling them, and now the timers in that batch cannot be canceled. This can lead to use-after-free.

Proposed Solution

Either stop doing the batch thing and process timers one by one (maybe with a cap on how far down the list we go), or make the batch a member so that canceling can remove from the batch too. Discussion on Slack tended toward the former solution.

We should also consider making sure that the Select and FreeRTOS implementations are the same, so our CI tells us something about the FreeRTOS case.

@kpschoedel

@kpschoedel
Copy link
Contributor

kpschoedel commented Jun 9, 2022

I think the code can be identical to that in FreeRTOS::HandlePlatformTimer(). [see below] It would be nice to share, though the implementation class hierarchy diverges early.

@bzbarsky-apple
Copy link
Contributor Author

Once this is fixed, the workaround for this in FailSafeContext::FailSafeTimerExpired can be removed, I think.

@kpschoedel
Copy link
Contributor

I think the code can be identical to that in FreeRTOS::HandlePlatformTimer()

It can't, because in the Select implementation a timer callback that re-adds itself with immediate effect (e.g. using System::Layer::ScheduleWork(), not to be confused with PlatformMgr::ScheduleWork()) will steal all the execution slots. Probably have to use the cancel-from-batch method.

@bzbarsky-apple
Copy link
Contributor Author

Ah, and the FreeRTOS setup does not have that problem?

@kpschoedel
Copy link
Contributor

Ah, and the FreeRTOS setup does not have that problem?

On FreeRTOS, ScheduleWork uses the recent ScheduleLambda<>(), which percolates through to PlatformMgr().PostEvent(), rather than the timer queue. In the Select version, it's approximately StartTimer(0, …).

I don't know why the Select version doesn't use ScheduleLambda, but I seem to remember that originally only existed on LwIP platforms, so that might be the only reason. @kghost

@kpschoedel
Copy link
Contributor

It turns out there's another subtlety with the Select implementation and being able to Cancel from the expired timers. Suppose timers A and B expire, and A's callback wants to schedule B. This cancels B from the expired timer list and re-adds it to the main list, so it doesn't execute in the current batch. This can happen forever.

For StartTimer(), the cancel/re-add behaviour is documented in the API, but for ScheduleWork() it isn't, and this actually occurs in a unit test (TestReadHandler::KillOverQuotaSubscriptions). Treating ScheduleWork() as StartTimer(0) is probably just wrong.

@bzbarsky-apple bzbarsky-apple changed the title Canceling timers does not work reliably in the SystemLayerSelect implementation Canceling timers does not work reliably in the LayerImplSelect implementation Sep 2, 2022
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 2, 2022
To minimize risk, the changes here keep the "grab all the timers we
should fire, then fire them" setup instead of switching to the "fire
the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from
   that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer
   call in ScheduleWork.  This does mean we now allow multiple timers for the
   same callback+appState in the timer list, if they were created by
   ScheduleWork, but that should be OK, since the only reason that pair needs to
   be unique is to allow cancellation and we never want to cancel the things
   ScheduleWork schedules.  As a followup we should stop using the timer list
   for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS
   does, but that involves fixing the unit tests that call ScheduleWork without
   actually running the platfor manager event loop and expect it to work
   somehow.

TestRead was failing the sanity assert for not losing track of timers
to fire, because it was spinning a nested event loop.  The changes to
that test stop it from doing that.

Fixes project-chip#19387
Fixes project-chip#22160
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 3, 2022
To minimize risk, the changes here keep the "grab all the timers we
should fire, then fire them" setup instead of switching to the "fire
the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from
   that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer
   call in ScheduleWork.  This does mean we now allow multiple timers for the
   same callback+appState in the timer list, if they were created by
   ScheduleWork, but that should be OK, since the only reason that pair needs to
   be unique is to allow cancellation and we never want to cancel the things
   ScheduleWork schedules.  As a followup we should stop using the timer list
   for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS
   does, but that involves fixing the unit tests that call ScheduleWork without
   actually running the platfor manager event loop and expect it to work
   somehow.

TestRead was failing the sanity assert for not losing track of timers
to fire, because it was spinning a nested event loop.  The changes to
that test stop it from doing that.

Fixes project-chip#19387
Fixes project-chip#22160
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 3, 2022
To minimize risk, the changes here keep the "grab all the timers we
should fire, then fire them" setup instead of switching to the "fire
the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from
   that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer
   call in ScheduleWork.  This does mean we now allow multiple timers for the
   same callback+appState in the timer list, if they were created by
   ScheduleWork, but that should be OK, since the only reason that pair needs to
   be unique is to allow cancellation and we never want to cancel the things
   ScheduleWork schedules.  As a followup we should stop using the timer list
   for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS
   does, but that involves fixing the unit tests that call ScheduleWork without
   actually running the platfor manager event loop and expect it to work
   somehow.

TestRead was failing the sanity assert for not losing track of timers
to fire, because it was spinning a nested event loop.  The changes to
that test stop it from doing that.

Fixes project-chip#19387
Fixes project-chip#22160
andy31415 added a commit that referenced this issue Sep 5, 2022
* Duplicate src/system/tests/TestSystemScheduleLambda.cpp history in src/system/tests/TestSystemScheduleWork.cpp history.

* Fix timer cancellation to be reliable in LayerImplSelect.

To minimize risk, the changes here keep the "grab all the timers we
should fire, then fire them" setup instead of switching to the "fire
the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from
   that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer
   call in ScheduleWork.  This does mean we now allow multiple timers for the
   same callback+appState in the timer list, if they were created by
   ScheduleWork, but that should be OK, since the only reason that pair needs to
   be unique is to allow cancellation and we never want to cancel the things
   ScheduleWork schedules.  As a followup we should stop using the timer list
   for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS
   does, but that involves fixing the unit tests that call ScheduleWork without
   actually running the platfor manager event loop and expect it to work
   somehow.

TestRead was failing the sanity assert for not losing track of timers
to fire, because it was spinning a nested event loop.  The changes to
that test stop it from doing that.

Fixes #19387
Fixes #22160

* Add a unit test that timer cancelling works even for currently "in progress" timers

* Address review comments.

* Fix shadowing problem.

* Turn off TestSystemScheduleWork on esp32 QEMU for now.

* Turn off TestSystemScheduleWork on the fake platform too.

The fake platform's event loop does not actually process the SystemLayer events.

Co-authored-by: Andrei Litvin <[email protected]>
isiu-apple pushed a commit to isiu-apple/connectedhomeip that referenced this issue Sep 16, 2022
…ip#22375)

* Duplicate src/system/tests/TestSystemScheduleLambda.cpp history in src/system/tests/TestSystemScheduleWork.cpp history.

* Fix timer cancellation to be reliable in LayerImplSelect.

To minimize risk, the changes here keep the "grab all the timers we
should fire, then fire them" setup instead of switching to the "fire
the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from
   that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer
   call in ScheduleWork.  This does mean we now allow multiple timers for the
   same callback+appState in the timer list, if they were created by
   ScheduleWork, but that should be OK, since the only reason that pair needs to
   be unique is to allow cancellation and we never want to cancel the things
   ScheduleWork schedules.  As a followup we should stop using the timer list
   for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS
   does, but that involves fixing the unit tests that call ScheduleWork without
   actually running the platfor manager event loop and expect it to work
   somehow.

TestRead was failing the sanity assert for not losing track of timers
to fire, because it was spinning a nested event loop.  The changes to
that test stop it from doing that.

Fixes project-chip#19387
Fixes project-chip#22160

* Add a unit test that timer cancelling works even for currently "in progress" timers

* Address review comments.

* Fix shadowing problem.

* Turn off TestSystemScheduleWork on esp32 QEMU for now.

* Turn off TestSystemScheduleWork on the fake platform too.

The fake platform's event loop does not actually process the SystemLayer events.

Co-authored-by: Andrei Litvin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment