Canceling timers does not work reliably in the LayerImplSelect implementation #19387

bzbarsky-apple · 2022-06-09T15:20:47Z

Problem

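When we process timers, we move a batch of expired timers out of the list before we start calling their callbacks, and from then on the timers in that batch cannot be canceled. If a callback cancels another timer in the same batch and then frees the state that timer refers to, the "canceled" timer still fires, which can lead to use-after-free.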
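
For illustration, here is a minimal sketch of that failure mode. It is a toy model, not the real `LayerImplSelect` code: `ToyTimer`, `BatchingTimerList`, and `CanceledTimerStillFires` are hypothetical names, and timers are identified by a plain integer id as an assumption.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <utility>

// Hypothetical toy timer; not the real chip::System types.
struct ToyTimer
{
    int id;
    std::function<void()> callback;
};

class BatchingTimerList
{
public:
    void Start(int id, std::function<void()> cb) { mPending.push_back({ id, std::move(cb) }); }

    // Returns true if a pending timer with this id was found and removed.
    bool Cancel(int id)
    {
        for (auto it = mPending.begin(); it != mPending.end(); ++it)
        {
            if (it->id == id)
            {
                mPending.erase(it);
                return true;
            }
        }
        return false; // Timers already moved into a local batch are invisible here.
    }

    // Problematic pattern: move every pending timer into a local batch, then fire them.
    void FireAll()
    {
        std::list<ToyTimer> batch;
        batch.swap(mPending);
        while (!batch.empty())
        {
            ToyTimer t = std::move(batch.front());
            batch.pop_front();
            t.callback();
        }
    }

private:
    std::list<ToyTimer> mPending;
};

// Timer 1's callback tries to cancel timer 2 (for example because it is about
// to free the object timer 2 refers to). Returns true if the cancel was
// reported as a miss and timer 2 fired anyway.
inline bool CanceledTimerStillFires()
{
    BatchingTimerList timers;
    bool cancelSucceeded = true;
    bool secondFired     = false;
    timers.Start(1, [&] { cancelSucceeded = timers.Cancel(2); });
    timers.Start(2, [&] { secondFired = true; });
    timers.FireAll();
    return !cancelSucceeded && secondFired;
}
```

In this toy model `CanceledTimerStillFires()` returns `true`: the cancel request cannot see the local batch, so the second callback runs even though its owner asked for it to be canceled.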

Proposed Solution

Either stop doing the batch thing and process timers one by one (maybe with a cap on how far down the list we go), or make the batch a member so that canceling can remove from the batch too. Discussion on Slack tended toward the former solution.

We should also consider making sure that the Select and FreeRTOS implementations are the same, so our CI tells us something about the FreeRTOS case.

@kpschoedel

kpschoedel · 2022-06-09T15:57:02Z

~~I think the code can be identical to that in FreeRTOS::HandlePlatformTimer().~~ [see below] It would be nice to share, though the implementation class hierarchy diverges early.

bzbarsky-apple · 2022-06-10T02:58:18Z

Once this is fixed, the workaround for this in FailSafeContext::FailSafeTimerExpired can be removed, I think.

kpschoedel · 2022-06-10T14:30:32Z

> I think the code can be identical to that in FreeRTOS::HandlePlatformTimer()

It can't, because in the Select implementation a timer callback that re-adds itself with immediate effect (e.g. using System::Layer::ScheduleWork(), not to be confused with PlatformMgr::ScheduleWork()) will steal all the execution slots. Probably have to use the cancel-from-batch method.
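
A rough sketch of why that happens, under simplifying assumptions: `ToyQueue`, `SelfRescheduler`, and the two helper functions are hypothetical names, and a zero-delay timer is modeled as work that is already expired the moment it is added. A one-at-a-time loop re-checks the queue after every callback, so a callback that re-adds itself is picked up again in the same pass; a batch loop only sees it again on the next pass.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical toy queue of already-expired work; not the real System::Layer.
class ToyQueue
{
public:
    using Callback = std::function<void()>;

    // Models a zero-delay timer: expired as soon as it is added.
    void StartZeroDelay(Callback cb) { mExpired.push_back(std::move(cb)); }

    // One-at-a-time: re-check the queue after every callback, up to maxSlots.
    int FireOneByOne(int maxSlots)
    {
        int fired = 0;
        while (fired < maxSlots && !mExpired.empty())
        {
            Callback cb = std::move(mExpired.front());
            mExpired.pop_front();
            cb();
            ++fired;
        }
        return fired;
    }

    // Batch: snapshot what is expired now; work added by a callback waits for the next pass.
    int FireBatch()
    {
        std::deque<Callback> batch;
        batch.swap(mExpired);
        int fired = 0;
        for (auto & cb : batch)
        {
            cb();
            ++fired;
        }
        return fired;
    }

private:
    std::deque<Callback> mExpired;
};

// A callback that re-adds itself with immediate effect every time it runs.
struct SelfRescheduler
{
    explicit SelfRescheduler(ToyQueue & q) : queue(q) {}
    void Run()
    {
        ++runs;
        queue.StartZeroDelay([this] { Run(); });
    }
    ToyQueue & queue;
    int runs = 0;
};

// How many times the self-rescheduling callback runs in a single pass.
inline int RunsInOnePassOneByOne(int maxSlots)
{
    ToyQueue q;
    SelfRescheduler r(q);
    q.StartZeroDelay([&r] { r.Run(); });
    q.FireOneByOne(maxSlots);
    return r.runs;
}

inline int RunsInOnePassBatch()
{
    ToyQueue q;
    SelfRescheduler r(q);
    q.StartZeroDelay([&r] { r.Run(); });
    q.FireBatch();
    return r.runs;
}
```

With a cap of 10 slots, the one-at-a-time loop spends all 10 on the same callback, while the batch loop runs it once and returns.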

bzbarsky-apple · 2022-06-10T20:02:25Z

Ah, and the FreeRTOS setup does not have that problem?

kpschoedel · 2022-06-10T20:52:09Z

> Ah, and the FreeRTOS setup does not have that problem?

On FreeRTOS, ScheduleWork uses the recent ScheduleLambda<>(), which percolates through to PlatformMgr().PostEvent(), rather than the timer queue. In the Select version, it's approximately StartTimer(0, …).

I don't know why the Select version doesn't use ScheduleLambda, but I seem to remember that it originally existed only on LwIP platforms, so that might be the only reason. @kghost

kpschoedel · 2022-06-10T21:15:45Z

It turns out there's another subtlety with the Select implementation and being able to Cancel from the expired timers. Suppose timers A and B expire, and A's callback wants to schedule B. This cancels B from the expired timer list and re-adds it to the main list, so it doesn't execute in the current batch. This can happen forever.
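
A toy sketch of that starvation, with loud assumptions: `ToyCancelingLayer` and `StarvedTimerRuns` are hypothetical names, timers are identified by integer ids, and A is assumed to be re-armed on every pass so that it keeps expiring together with B. `Start` follows the cancel-then-re-add behaviour described above, and cancellation also reaches into the expired batch.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <utility>

// Hypothetical toy model; not the real LayerImplSelect.
class ToyCancelingLayer
{
public:
    using Callback = std::function<void()>;

    // Cancel any existing timer with this id, then add it to the main list.
    void Start(int id, Callback cb)
    {
        Cancel(id);
        mPending.push_back({ id, std::move(cb) });
    }

    // Removes the timer from both the main list and the expired batch.
    void Cancel(int id)
    {
        auto matches = [id](const Entry & e) { return e.id == id; };
        mPending.remove_if(matches);
        mExpired.remove_if(matches);
    }

    // Everything pending is treated as expired; timers added during the pass wait for the next one.
    void FirePass()
    {
        mExpired.splice(mExpired.end(), mPending);
        while (!mExpired.empty())
        {
            Entry e = std::move(mExpired.front());
            mExpired.pop_front();
            e.cb();
        }
    }

private:
    struct Entry
    {
        int id;
        Callback cb;
    };
    std::list<Entry> mPending;
    std::list<Entry> mExpired;
};

// Returns {number of times A ran, number of times B ran} after `passes` passes.
inline std::pair<int, int> StarvedTimerRuns(int passes)
{
    constexpr int kTimerA = 1;
    constexpr int kTimerB = 2;
    ToyCancelingLayer layer;
    int aRuns = 0;
    int bRuns = 0;
    ToyCancelingLayer::Callback cbB = [&] { ++bRuns; };
    ToyCancelingLayer::Callback cbA;
    cbA = [&] {
        ++aRuns;
        layer.Start(kTimerA, cbA); // assumption: A re-arms itself, so it expires again next pass
        layer.Start(kTimerB, cbB); // pulls B out of the expired batch and re-adds it
    };
    layer.Start(kTimerA, cbA);
    layer.Start(kTimerB, cbB);
    for (int i = 0; i < passes; ++i)
    {
        layer.FirePass();
    }
    return { aRuns, bRuns };
}
```

In this model `StarvedTimerRuns(5)` reports that A ran five times and B never ran: every pass, A removes B from the batch before B gets its turn.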

For StartTimer(), the cancel/re-add behaviour is documented in the API, but for ScheduleWork() it isn't, and this actually occurs in a unit test (TestReadHandler::KillOverQuotaSubscriptions). Treating ScheduleWork() as StartTimer(0) is probably just wrong.

To minimize risk, the changes here keep the "grab all the timers we should fire, then fire them" setup instead of switching to the "fire the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer call in ScheduleWork. This does mean we now allow multiple timers for the same callback+appState in the timer list, if they were created by ScheduleWork, but that should be OK, since the only reason that pair needs to be unique is to allow cancellation and we never want to cancel the things ScheduleWork schedules.

As a followup we should stop using the timer list for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS does, but that involves fixing the unit tests that call ScheduleWork without actually running the platform manager event loop and expect it to work somehow.

TestRead was failing the sanity assert for not losing track of timers to fire, because it was spinning a nested event loop. The changes to that test stop it from doing that.

Fixes project-chip#19387
Fixes project-chip#22160
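
The two parts of the fix can be sketched roughly as follows. This is a toy model under stated assumptions, not the merged code: `ToyFixedLayer`, `InBatchCancelWorks`, and `WorkRunsAfterTwoPasses` are hypothetical names, and an integer id stands in for the callback+appState pair.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <utility>

// Hypothetical toy model; names and shapes are illustrative, not the real API.
class ToyFixedLayer
{
public:
    using Callback = std::function<void()>;

    // Timers are unique per id: starting one cancels any earlier instance.
    void StartTimer(int id, Callback cb)
    {
        CancelTimer(id);
        mPending.push_back({ id, std::move(cb) });
    }

    // Part 2: no CancelTimer call here, so duplicates are allowed and an
    // already-expired entry with the same id is left alone.
    void ScheduleWork(int id, Callback cb) { mPending.push_back({ id, std::move(cb) }); }

    // Part 1: the expired batch is a member, so cancellation can search it too.
    void CancelTimer(int id)
    {
        auto matches = [id](const Entry & e) { return e.id == id; };
        mPending.remove_if(matches);
        mExpired.remove_if(matches);
    }

    // Everything pending is treated as expired; entries added during the pass wait for the next one.
    void FirePass()
    {
        mExpired.splice(mExpired.end(), mPending);
        while (!mExpired.empty())
        {
            Entry e = std::move(mExpired.front());
            mExpired.pop_front();
            e.cb();
        }
    }

private:
    struct Entry
    {
        int id;
        Callback cb;
    };
    std::list<Entry> mPending;
    std::list<Entry> mExpired;
};

// Timer 1 cancels timer 2 while both sit in the expired batch.
inline bool InBatchCancelWorks()
{
    ToyFixedLayer layer;
    bool secondFired = false;
    layer.StartTimer(1, [&] { layer.CancelTimer(2); });
    layer.StartTimer(2, [&] { secondFired = true; });
    layer.FirePass();
    return !secondFired;
}

// Work item A calls ScheduleWork for B while B is already in the expired batch.
inline int WorkRunsAfterTwoPasses()
{
    ToyFixedLayer layer;
    int bRuns = 0;
    ToyFixedLayer::Callback workB = [&] { ++bRuns; };
    layer.ScheduleWork(1, [&] { layer.ScheduleWork(2, workB); });
    layer.ScheduleWork(2, workB);
    layer.FirePass(); // the expired B still fires in this pass
    layer.FirePass(); // the extra B added by A fires in the next pass
    return bRuns;
}
```

In this model the in-batch cancel takes effect, and B is no longer starved: it runs once in the first pass and once more in the second, which is the "multiple timers for the same callback+appState" behaviour the commit message accepts.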

* Duplicate src/system/tests/TestSystemScheduleLambda.cpp history in src/system/tests/TestSystemScheduleWork.cpp history.
* Fix timer cancellation to be reliable in LayerImplSelect.

  To minimize risk, the changes here keep the "grab all the timers we should fire, then fire them" setup instead of switching to the "fire the timers one at a time" approach LayerImplFreeRTOS uses.

  The fix consists of the following parts:

  1) Store the list of timers to fire in a member, so we can cancel things from that list as needed.
  2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer call in ScheduleWork. This does mean we now allow multiple timers for the same callback+appState in the timer list, if they were created by ScheduleWork, but that should be OK, since the only reason that pair needs to be unique is to allow cancellation and we never want to cancel the things ScheduleWork schedules.

  As a followup we should stop using the timer list for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS does, but that involves fixing the unit tests that call ScheduleWork without actually running the platform manager event loop and expect it to work somehow.

  TestRead was failing the sanity assert for not losing track of timers to fire, because it was spinning a nested event loop. The changes to that test stop it from doing that.

  Fixes #19387
  Fixes #22160
* Add a unit test that timer cancelling works even for currently "in progress" timers
* Address review comments.
* Fix shadowing problem.
* Turn off TestSystemScheduleWork on esp32 QEMU for now.
* Turn off TestSystemScheduleWork on the fake platform too. The fake platform's event loop does not actually process the SystemLayer events.

Co-authored-by: Andrei Litvin


bzbarsky-apple added V1.0 looks_critical labels Jun 9, 2022

bzbarsky-apple added the security label Jul 2, 2022

woody-apple assigned bzbarsky-apple Aug 31, 2022

bzbarsky-apple mentioned this issue Sep 1, 2022

DMalloc detection of free space overwrite during TestRead #22160

Closed

bzbarsky-apple changed the title ~~Canceling timers does not work reliably in the SystemLayerSelect implementation~~ Canceling timers does not work reliably in the LayerImplSelect implementation Sep 2, 2022

andy31415 mentioned this issue Sep 2, 2022

Allow 'expired system timers' to be cancelled before execution. #22372

Closed

bzbarsky-apple mentioned this issue Sep 2, 2022

Fix timer cancellation to be reliable in LayerImplSelect. #22375

Merged

andy31415 closed this as completed in #22375 Sep 5, 2022
