Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DMalloc detection of free space overwrite during TestRead #22160

Closed
andy31415 opened this issue Aug 25, 2022 · 7 comments · Fixed by #22375
Closed

DMalloc detection of free space overwrite during TestRead #22160

andy31415 opened this issue Aug 25, 2022 · 7 comments · Fixed by #22375

Comments

@andy31415
Copy link
Contributor

andy31415 commented Aug 25, 2022

Problem

Two instances for this at least:

https://github.com/project-chip/connectedhomeip/runs/8005304886?check_suite_focus=true
https://github.com/project-chip/connectedhomeip/runs/7984389882?check_suite_focus=true

Log says:

...
[1661296765.557679][75818:75818] CHIP:EM: Retransmitting MessageCounter:128264913 on exchange 47182i Send Cnt 1
[1661296765.561911][75818:75818] CHIP:IN: Sending encrypted msg 0x7fc57a2058d0 with MessageCounter:128264913 to 0x0000000011223344 (2) at monotonic time: 000000000019EB6C msec
[1661296765.570471][75818:75818] CHIP:DMG: Unblock report hold after min 57311 seconds
debug-malloc library: dumping program, fatal error
  Error: free space has been overwritten (err 67)
Aborted (core dumped)
INF Test 1/1: [FAIL] TestRead

I am unclear where 57311 seconds come from (seems like an abnormally large seconds count) however timestamps do not seem to point at a large thread block either. Unfortunately TestRead in general is very slow to run and with a lot of logs, making it hard to debug.

@bzbarsky-apple
Copy link
Contributor

That log line is doing:

ChipLogDetail(DataManagement, "Unblock report hold after min %d seconds", readHandler->mMinIntervalFloorSeconds);

and mMinIntervalFloorSeconds is a uint16_t. 57311 is 0xdfdf as a 16-bit number. And debug-malloc uses 0xdf as the "freed memory" marker.

So in this case readHandler is pointing to freed memory and we somehow managed to free it without canceling the OnUnblockHoldReportCallback timer....

@mrjerryjohns @yunhanw-google

@bzbarsky-apple
Copy link
Contributor

Oh, and I expect the "free space has been overwritten" bit is from the immediate next line, which does:

    readHandler->mFlags.Set(ReadHandlerFlags::HoldReport, false);

@andy31415
Copy link
Contributor Author

Did not yet manage to run in RR, however only on my intel vm after a few runs I managed to get a coredump:

#0  0x00005556b54787f2 in chip::Transport::Session::IsSecureSession() const (this=0xdfdfdfdfdfdfdfdf)
    at ../../src/transport/Session.h:226
#1  0x00005556b5479d2b in chip::Messaging::ExchangeContext::NotifyResponseTimeout(bool)
    (this=0x7fc1837eb208, aCloseIfNeeded=true) at ../../src/messaging/ExchangeContext.cpp:487
#2  0x00005556b5479cce in chip::Messaging::ExchangeContext::HandleResponseTimeout(chip::System::Layer*, void*) (aSystemLayer=0x5556b54fb1e0 <gSystemLayer>, aAppState=0x7fc1837eb208)
    at ../../src/messaging/ExchangeContext.cpp:476
#3  0x00005556b54cecfd in chip::System::TimerData::Callback::Invoke() const (this=0x7fff03b0f9a0)
    at ../../src/system/SystemTimer.h:61
#4  0x00005556b54d036d in chip::System::TimerPool<chip::System::TimerList::Node>::Invoke(chip::System::TimerList::Node*) (this=0x5556b54fb7e8 <gSystemLayer+1544>, timer=0x7fc1837e7948)
    at ../../src/system/SystemTimer.h:224
#5  0x00005556b54cfeb8 in chip::System::LayerImplSelect::HandleEvents()
    (this=0x5556b54fb1e0 <gSystemLayer>) at ../../src/system/SystemLayerImplSelect.cpp:476
#6  0x00005556b54a5a45 in ServiceEvents(unsigned int) (aSleepTimeMilliseconds=10)
    at ../../src/inet/tests/TestInetCommonPosix.cpp:449
#7  0x00005556b54a5439 in chip::Test::IOContext::DriveIO() (this=0x7fc183937010)
    at ../../src/transport/raw/tests/NetworkTestHelpers.cpp:58
#8  0x00005556b54a546c in chip::Test::IOContext::DriveIOUntil(std::chrono::duration<unsigned int, std::ratio<1l, 1000l> >, std::function<bool ()>) (this=0x7fc183937010, maxWait=..., completionFunction=...)
    at ../../src/transport/raw/tests/NetworkTestHelpers.cpp:67
#9  0x00005556b537bc22 in (anonymous namespace)::TestReadInteraction::TestReadHandler_MultipleSubscriptionsWithDataVersionFilter(nlTestSuite*, void*) (apSuite=0x5556b54f6120 <(anonymous namespace)::sSuite>, apContext=0x7fc183937008) at ../../src/controller/tests/data_model/TestRead.cpp:2612

@andy31415
Copy link
Contributor Author

Got a RR trace, investigating. For now the trace there seems different:

#8  0x0000713f0489af1d in dmalloc_malloc () from /usr/local/lib/libdmallocthcxx.so
#9  0x000055aeb2c6dd43 in chip::Platform::MemoryAlloc (size=40) at ../../src/lib/support/CHIPMem-Malloc.cpp:91
#10 0x000055aeb2d1f328 in chip::Platform::New<chip::System::TimerList::Node, chip::System::Layer&, std::chrono::duration<unsigned long, std::ratio<1l, 1000l> >&, void (*&)(chip::System::Layer*, void*), void*&> () at ../../src/lib/support/CHIPMem.h:145
#11 0x000055aeb2d1f137 in chip::HeapObjectPool<chip::System::TimerList::Node>::CreateObject<chip::System::Layer&, std::chrono::duration<unsigned long, std::ratio<1l, 1000l> >&, void (*&)(chip::System::Layer*, void*), void*&> (this=0x55aeb2d4a7a8 <gSystemLayer+1544>) at ../../src/lib/support/Pool.h:325
#12 0x000055aeb2d1ee25 in chip::System::TimerPool<chip::System::TimerList::Node>::Create (this=0x55aeb2d4a7a8 <gSystemLayer+1544>, systemLayer=...,
    awakenTime=..., onComplete=0x55aeb2c41480 <chip::app::ReadHandler::OnRefreshSubscribeTimerSyncCallback(chip::System::Layer*, void*)>,
    appState=0x72cd6c52a708) at ../../src/system/SystemTimer.h:194
#13 0x000055aeb2d1dda3 in chip::System::LayerImplSelect::StartTimer (this=0x55aeb2d4a1a0 <gSystemLayer>, delay=...,
    onComplete=0x55aeb2c41480 <chip::app::ReadHandler::OnRefreshSubscribeTimerSyncCallback(chip::System::Layer*, void*)>, appState=0x72cd6c52a708)
    at ../../src/system/SystemLayerImplSelect.cpp:131
#14 0x000055aeb2c41462 in chip::app::ReadHandler::OnUnblockHoldReportCallback (apSystemLayer=0x55aeb2d4a1a0 <gSystemLayer>, apAppState=0x72cd6c52a708)
    at ../../src/app/ReadHandler.cpp:748
#15 0x000055aeb2d1d861 in chip::System::TimerData::Callback::Invoke

@andy31415
Copy link
Contributor Author

ReadHandler seems to be cleared and the free stack trace according to RR is:

#0  chip::app::ReadHandler::~ReadHandler (this=0x72cd6c52a708, __in_chrg=<optimized out>) at ../../src/app/ReadHandler.cpp:59
#1  0x000055aeb2bf22de in chip::Platform::Delete<chip::app::ReadHandler> (p=0x72cd6c52a708) at ../../src/lib/support/CHIPMem.h:168
#2  0x000055aeb2be0e16 in chip::HeapObjectPool<chip::app::ReadHandler>::ReleaseObject (this=0x55aeb2d46138 <chip::app::sInteractionModelEngine+1752>, object=0x72cd6c52a708) at ../../src/lib/support/Pool.h:360
#3  0x000055aeb2c20250 in chip::app::InteractionModelEngine::OnDone (this=0x55aeb2d45a60 <chip::app::sInteractionModelEngine>, apReadObj=...) at ../../src/app/InteractionModelEngine.cpp:291
#4  0x000055aeb2c3c1fb in chip::app::ReadHandler::Close (this=0x72cd6c52a708) at ../../src/app/ReadHandler.cpp:88
#5  0x000055aeb2c3d408 in chip::app::ReadHandler::OnMessageReceived (this=0x72cd6c52a708, apExchangeContext=0x72cd6c52ac08, aPayloadHeader=..., aPayload=...) at ../../src/app/ReadHandler.cpp:275
#6  0x000055aeb2c1edd9 in chip::Messaging::ExchangeHolder::OnMessageReceived (this=0x72cd6c52a790, ec=0x72cd6c52ac08, payloadHeader=..., payload=...) at ../../src/messaging/ExchangeHolder.h:154
#7  0x000055aeb2cc6b33 in chip::Messaging::ExchangeContext::HandleMessage (this=0x72cd6c52ac08, messageCounter=25554252, payloadHeader=..., msgFlags=..., msgBuf=...) at ../../src/messaging/ExchangeContext.cpp:587
#8  0x000055aeb2cc9636 in operator()<chip::Messaging::ExchangeContext> (__closure=0x7ffc77b0e180, ec=0x72cd6c52ac08) at ../../src/messaging/ExchangeMgr.cpp:219
#9  0x000055aeb2cc9b09 in chip::internal::LambdaProxy<chip::Messaging::ExchangeContext, chip::Messaging::ExchangeManager::OnMessageReceived(const chip::PacketHeader&, const chip::PayloadHeader&, const chip::SessionHandle&, chip::SessionMessageDelegate::DuplicateMessage, chip::System::PacketBufferHandle&&)::<lambda(auto:6*)> >::Call(void *, void *) (context=0x7ffc77b0e180, target=0x72cd6c52ac08) at ../../src/lib/support/Pool.h:126
#10 0x000055aeb2c6d8b1 in chip::internal::HeapObjectList::ForEachNode (this=0x769847aecd88, context=0x7ffc77b0e180, lambda=0x55aeb2cc9ae6 <chip::internal::LambdaProxy<chip::Messaging::ExchangeContext, chip::Messaging::ExchangeManager::OnMessageReceived(const chip::PacketHeader&, const chip::PayloadHeader&, const chip::SessionHandle&, chip::SessionMessageDelegate::DuplicateMessage, chip::System::PacketBufferHandle&&)::<lambda(auto:6*)> >::Call(void *, void *)>) at ../../src/lib/support/Pool.cpp:126
#11 0x000055aeb2cc96c9 in chip::HeapObjectPool<chip::Messaging::ExchangeContext>::ForEachActiveObject<chip::Messaging::ExchangeManager::OnMessageReceived(const chip::PacketHeader&, const chip::PayloadHeader&, const chip::SessionHandle&, chip::SessionMessageDelegate::DuplicateMessage, chip::System::PacketBufferHandle&&)::<lambda(auto:6*)> >(struct {...} &&) (this=0x769847aecd78, function=...) at ../../src/lib/support/Pool.h:401
#12 0x000055aeb2cc8ab5 in chip::Messaging::ExchangeManager::OnMessageReceived (this=0x769847aecd60, packetHeader=..., payloadHeader=..., session=..., isDuplicate=chip::SessionMessageDelegate::DuplicateMessage::No, msgBuf=...) at ../../src/messaging/ExchangeMgr.cpp:212
#13 0x000055aeb2cd3180 in chip::SessionManager::SecureUnicastMessageDispatch (this=0x769847aecc70, packetHeader=..., peerAddress=..., msg=...) at ../../src/transport/SessionManager.cpp:715
#14 0x000055aeb2cd23c5 in chip::SessionManager::OnMessageReceived (this=0x769847aecc70, peerAddress=..., msg=...) at ../../src/transport/SessionManager.cpp:542
#15 0x000055aeb2cd4ffd in chip::TransportMgrBase::HandleMessageReceived (this=0x769847aec028, peerAddress=..., msg=...) at ../../src/transport/TransportMgrBase.cpp:76
#16 0x000055aeb2ba5581 in chip::Transport::Base::HandleMessageReceived (this=0x769847aec050, source=..., buffer=...) at ../../src/transport/raw/Base.h:102
#17 0x000055aeb2ba864f in chip::Test::LoopbackTransport::OnMessageReceived (aSystemLayer=0x55aeb2d4a1a0 <gSystemLayer>, aAppState=0x769847aec050) at ../../src/transport/raw/tests/NetworkTestHelpers.h:98
#18 0x000055aeb2d1d861 in chip::System::TimerData::Callback::Invoke (this=0x7ffc77b0e650) at ../../src/system/SystemTimer.h:61
#19 0x000055aeb2d1f040 in chip::System::TimerPool<chip::System::TimerList::Node>::Invoke (this=0x55aeb2d4a7a8 <gSystemLayer+1544>, timer=0x572370e0d108) at ../../src/system/SystemTimer.h:224
#20 0x000055aeb2d1eaff in chip::System::LayerImplSelect::HandleEvents (this=0x55aeb2d4a1a0 <gSystemLayer>) at ../../src/system/SystemLayerImplSelect.cpp:476
#21 0x000055aeb2cf2e68 in ServiceEvents (aSleepTimeMilliseconds=10) at ../../src/inet/tests/TestInetCommonPosix.cpp:449
#22 0x000055aeb2cf2809 in chip::Test::IOContext::DriveIO (this=0x769847aec010) at ../../src/transport/raw/tests/NetworkTestHelpers.cpp:58
#23 0x000055aeb2cf284b in chip::Test::IOContext::DriveIOUntil(std::chrono::duration<unsigned int, std::ratio<1l, 1000l> >, std::function<bool ()>) (this=0x769847aec010, maxWait=..., completionFunction=...) at ../../src/transport/raw/tests/NetworkTestHelpers.cpp:67
#24 0x000055aeb2bb86e2 in (anonymous namespace)::TestReadInteraction::TestReadHandler_MultipleSubscriptions (apSuite=0x55aeb2d45100 <(anonymous namespace)::sSuite>, apContext=0x769847aec008) at ../../src/controller/tests/data_model/TestRead.cpp:1716
#25 0x000055aeb2cf38ad in nlTestRunner (inSuite=0x55aeb2d45100 <(anonymous namespace)::sSuite>, inContext=0x769847aec008) at ../../third_party/nlunit-test/repo/src/nlunit-test.c:213
#26 0x000055aeb2bd5314 in chip::ExecuteTestsWithContext<chip::Test::AppContext> (suite=0x55aeb2d45100 <(anonymous namespace)::sSuite>) at ../../src/lib/support/UnitTestContext.h:46
#27 0x000055aeb2bcdb70 in TestReadInteractionTest () at ../../src/controller/tests/data_model/TestRead.cpp:4441
#28 0x000055aeb2b9cc57 in main () at gen/TestRead.driver.cpp:32

with usage afterwards in:

(rr) print readHandler
$7 = (chip::app::ReadHandler *) 0x72cd6c52a708
(rr) bt 
#0  chip::app::ReadHandler::OnUnblockHoldReportCallback (apSystemLayer=0x55aeb2d4a1a0 <gSystemLayer>, apAppState=0x72cd6c52a708) at ../../src/app/ReadHandler.cpp:742
#1  0x000055aeb2d1d861 in chip::System::TimerData::Callback::Invoke (this=0x7ffc77b0e650) at ../../src/system/SystemTimer.h:61
#2  0x000055aeb2d1f040 in chip::System::TimerPool<chip::System::TimerList::Node>::Invoke (this=0x55aeb2d4a7a8 <gSystemLayer+1544>, timer=0x68342b48) at ../../src/system/SystemTimer.h:224
#3  0x000055aeb2d1eaff in chip::System::LayerImplSelect::HandleEvents (this=0x55aeb2d4a1a0 <gSystemLayer>) at ../../src/system/SystemLayerImplSelect.cpp:476
#4  0x000055aeb2cf2e68 in ServiceEvents (aSleepTimeMilliseconds=10) at ../../src/inet/tests/TestInetCommonPosix.cpp:449
#5  0x000055aeb2cf2809 in chip::Test::IOContext::DriveIO (this=0x769847aec010) at ../../src/transport/raw/tests/NetworkTestHelpers.cpp:58
#6  0x000055aeb2cf284b in chip::Test::IOContext::DriveIOUntil(std::chrono::duration<unsigned int, std::ratio<1l, 1000l> >, std::function<bool ()>) (this=0x769847aec010, maxWait=..., completionFunction=...) at ../../src/transport/raw/tests/NetworkTestHelpers.cpp:67
#7  0x000055aeb2bb86e2 in (anonymous namespace)::TestReadInteraction::TestReadHandler_MultipleSubscriptions (apSuite=0x55aeb2d45100 <(anonymous namespace)::sSuite>, apContext=0x769847aec008) at ../../src/controller/tests/data_model/TestRead.cpp:1716
#8  0x000055aeb2cf38ad in nlTestRunner (inSuite=0x55aeb2d45100 <(anonymous namespace)::sSuite>, inContext=0x769847aec008) at ../../third_party/nlunit-test/repo/src/nlunit-test.c:213
#9  0x000055aeb2bd5314 in chip::ExecuteTestsWithContext<chip::Test::AppContext> (suite=0x55aeb2d45100 <(anonymous namespace)::sSuite>) at ../../src/lib/support/UnitTestContext.h:46
#10 0x000055aeb2bcdb70 in TestReadInteractionTest () at ../../src/controller/tests/data_model/TestRead.cpp:4441
#11 0x000055aeb2b9cc57 in main () at gen/TestRead.driver.cpp:32

@andy31415
Copy link
Contributor Author

Note: in OnUnblockHoldReportCallback, the object is obviously freed and cleared:

(rr) x/40 readHandler
0x72cd6c52a708: 0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf
0x72cd6c52a718: 0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf
0x72cd6c52a728: 0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf
0x72cd6c52a738: 0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf
0x72cd6c52a748: 0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf
0x72cd6c52a758: 0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf
0x72cd6c52a768: 0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf
0x72cd6c52a778: 0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf
0x72cd6c52a788: 0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf
0x72cd6c52a798: 0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf      0xdfdfdfdf

@bzbarsky-apple
Copy link
Contributor

I suspect the key is this bit in the destructor stack:

#0  chip::app::ReadHandler::~ReadHandler (this=0x72cd6c52a708, __in_chrg=<optimized out>) at ../../src/app/ReadHandler.cpp:59
...
#17 0x000055aeb2ba864f in chip::Test::LoopbackTransport::OnMessageReceived (aSystemLayer=0x55aeb2d4a1a0 <gSystemLayer>, aAppState=0x769847aec050) at ../../src/transport/raw/tests/NetworkTestHelpers.h:98
#18 0x000055aeb2d1d861 in chip::System::TimerData::Callback::Invoke (this=0x7ffc77b0e650) at ../../src/system/SystemTimer.h:61
#19 0x000055aeb2d1f040 in chip::System::TimerPool<chip::System::TimerList::Node>::Invoke (this=0x55aeb2d4a7a8 <gSystemLayer+1544>, timer=0x572370e0d108) at ../../src/system/SystemTimer.h:224

So unlike the "normal" non-test case this message is coming off a timer. And then it's very possible that we're hitting #19387 "can't cancel a timer if another timer is firing and the timers were scheduled for close together" issue.

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 2, 2022
To minimize risk, the changes here keep the "grab all the timers we
should fire, then fire them" setup instead of switching to the "fire
the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from
   that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer
   call in ScheduleWork.  This does mean we now allow multiple timers for the
   same callback+appState in the timer list, if they were created by
   ScheduleWork, but that should be OK, since the only reason that pair needs to
   be unique is to allow cancellation and we never want to cancel the things
   ScheduleWork schedules.  As a followup we should stop using the timer list
   for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS
   does, but that involves fixing the unit tests that call ScheduleWork without
   actually running the platfor manager event loop and expect it to work
   somehow.

TestRead was failing the sanity assert for not losing track of timers
to fire, because it was spinning a nested event loop.  The changes to
that test stop it from doing that.

Fixes project-chip#19387
Fixes project-chip#22160
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 3, 2022
To minimize risk, the changes here keep the "grab all the timers we
should fire, then fire them" setup instead of switching to the "fire
the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from
   that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer
   call in ScheduleWork.  This does mean we now allow multiple timers for the
   same callback+appState in the timer list, if they were created by
   ScheduleWork, but that should be OK, since the only reason that pair needs to
   be unique is to allow cancellation and we never want to cancel the things
   ScheduleWork schedules.  As a followup we should stop using the timer list
   for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS
   does, but that involves fixing the unit tests that call ScheduleWork without
   actually running the platfor manager event loop and expect it to work
   somehow.

TestRead was failing the sanity assert for not losing track of timers
to fire, because it was spinning a nested event loop.  The changes to
that test stop it from doing that.

Fixes project-chip#19387
Fixes project-chip#22160
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 3, 2022
To minimize risk, the changes here keep the "grab all the timers we
should fire, then fire them" setup instead of switching to the "fire
the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from
   that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer
   call in ScheduleWork.  This does mean we now allow multiple timers for the
   same callback+appState in the timer list, if they were created by
   ScheduleWork, but that should be OK, since the only reason that pair needs to
   be unique is to allow cancellation and we never want to cancel the things
   ScheduleWork schedules.  As a followup we should stop using the timer list
   for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS
   does, but that involves fixing the unit tests that call ScheduleWork without
   actually running the platfor manager event loop and expect it to work
   somehow.

TestRead was failing the sanity assert for not losing track of timers
to fire, because it was spinning a nested event loop.  The changes to
that test stop it from doing that.

Fixes project-chip#19387
Fixes project-chip#22160
andy31415 added a commit that referenced this issue Sep 5, 2022
* Duplicate src/system/tests/TestSystemScheduleLambda.cpp history in src/system/tests/TestSystemScheduleWork.cpp history.

* Fix timer cancellation to be reliable in LayerImplSelect.

To minimize risk, the changes here keep the "grab all the timers we
should fire, then fire them" setup instead of switching to the "fire
the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from
   that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer
   call in ScheduleWork.  This does mean we now allow multiple timers for the
   same callback+appState in the timer list, if they were created by
   ScheduleWork, but that should be OK, since the only reason that pair needs to
   be unique is to allow cancellation and we never want to cancel the things
   ScheduleWork schedules.  As a followup we should stop using the timer list
   for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS
   does, but that involves fixing the unit tests that call ScheduleWork without
   actually running the platfor manager event loop and expect it to work
   somehow.

TestRead was failing the sanity assert for not losing track of timers
to fire, because it was spinning a nested event loop.  The changes to
that test stop it from doing that.

Fixes #19387
Fixes #22160

* Add a unit test that timer cancelling works even for currently "in progress" timers

* Address review comments.

* Fix shadowing problem.

* Turn off TestSystemScheduleWork on esp32 QEMU for now.

* Turn off TestSystemScheduleWork on the fake platform too.

The fake platform's event loop does not actually process the SystemLayer events.

Co-authored-by: Andrei Litvin <[email protected]>
isiu-apple pushed a commit to isiu-apple/connectedhomeip that referenced this issue Sep 16, 2022
…ip#22375)

* Duplicate src/system/tests/TestSystemScheduleLambda.cpp history in src/system/tests/TestSystemScheduleWork.cpp history.

* Fix timer cancellation to be reliable in LayerImplSelect.

To minimize risk, the changes here keep the "grab all the timers we
should fire, then fire them" setup instead of switching to the "fire
the timers one at a time" approach LayerImplFreeRTOS uses.

The fix consists of the following parts:

1) Store the list of timers to fire in a member, so we can cancel things from
   that list as needed.
2) To avoid canceling things scheduled by ScheduleWork, remove the CancelTimer
   call in ScheduleWork.  This does mean we now allow multiple timers for the
   same callback+appState in the timer list, if they were created by
   ScheduleWork, but that should be OK, since the only reason that pair needs to
   be unique is to allow cancellation and we never want to cancel the things
   ScheduleWork schedules.  As a followup we should stop using the timer list
   for ScheduleWork altogether and use ScheduleLambda like LayerImplFreeRTOS
   does, but that involves fixing the unit tests that call ScheduleWork without
   actually running the platfor manager event loop and expect it to work
   somehow.

TestRead was failing the sanity assert for not losing track of timers
to fire, because it was spinning a nested event loop.  The changes to
that test stop it from doing that.

Fixes project-chip#19387
Fixes project-chip#22160

* Add a unit test that timer cancelling works even for currently "in progress" timers

* Address review comments.

* Fix shadowing problem.

* Turn off TestSystemScheduleWork on esp32 QEMU for now.

* Turn off TestSystemScheduleWork on the fake platform too.

The fake platform's event loop does not actually process the SystemLayer events.

Co-authored-by: Andrei Litvin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment