Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect processing in Engine::ScheduleRun() #14596

Closed
lnikulin opened this issue Jan 31, 2022 · 13 comments
Closed

Incorrect processing in Engine::ScheduleRun() #14596

lnikulin opened this issue Jan 31, 2022 · 13 comments

Comments

@lnikulin
Copy link
Contributor

lnikulin commented Jan 31, 2022

Problem

Any attempt to read/write any attribute results in TIMEOUT error.

Platform

MIPS24k

Root-cause and Proposed Solution

Looks like there is a race between threads, introduced after the following PR #13093.

In function Engine::ScheduleRun() (/src/app/reporting/Engine.cpp)
mRunScheduled flag is set to true after calling systemLayer->ScheduleWork(Run, this);
and intended to be set to false in callback function (Engine::Run()).
However, if callback finishes before ScheduleWork returns, flag mRunScheduled will be set to true, but the scheduled task has been done already.

There is the following statement in src/include/platform/PlatformManager.h which is violated:

 /**
     * ScheduleWork can be called after InitChipStack has been called.  Calls
     * that happen before either StartEventLoopTask or RunEventLoop will queue
     * the work up but that work will NOT run until one of those functions is
     * called.
     *
     * ScheduleWork can be called safely on any thread without locking the
     * stack.  When called from a thread that is not doing the stack work item
     * processing, the callback function may be called (on the work item
     * processing thread) before ScheduleWork returns.
     */
    void ScheduleWork(AsyncWorkFunct workFunct, intptr_t arg = 0);

History of the issue

Original commit (#11155) introduced mRunScheduled flag ea68796ba9 sometimes crashes, see #13093
This was fixed in the following commit #13093 2ddefedca8 which introduced the race issue described above.

Proposed solution

Set mRunScheduled to true before calling ScheduleWork()

lnikulin added a commit to lnikulin/connectedhomeip that referenced this issue Feb 2, 2022
…Layer->ScheduleWork() returns. (project-chip#14596)

- setup mRunScheduled = true before execute ScheduleWork()
lnikulin added a commit to lnikulin/connectedhomeip that referenced this issue Feb 2, 2022
…mLayer->ScheduleWork() returns. (project-chip#14596)

- setup mRunScheduled = true before execute ScheduleWork()
@mrjerryjohns
Copy link
Contributor

This all assumes that ScheduleRun is safe to call from a threading context other than Chip's, which is not explicitly stipulated nor correct. Unless explicitly stated, all SDK APIs must be called from within the Chip thread context itself.

This specific call gives the impression that it's ok to call from a different threading context but in fact, it actually isn't. It accesses a number of SDK state that can change on it during a shutdown (like the InteractionModelEngine singleton, the underlying SessionManager and SystemLayer objects). None of these are accessed safely with a mutex relative to the CHIP thread, so it's not safe to call this API from outside the CHIP threading context.

@mrjerryjohns
Copy link
Contributor

Received timeout in operation get/set attribute value into any clusters (any attributes) from client application (chip-tool),

@lnikulin can you provide more details on the original observed failure and how that translates to an unsafe call to ScheduleRun?

@bzbarsky-apple
Copy link
Contributor

Specifically, what would be most useful here is a stack to the ScheduleRun call involved.

@yufengwangca
Copy link
Contributor

@lnikulin Do we have the stack dump when the issue occurs? wondering which thread other than CHIP main thread is calling Engine::ScheduleRun()

@lnikulin lnikulin closed this as completed Feb 3, 2022
@lnikulin
Copy link
Contributor Author

lnikulin commented Feb 3, 2022

Call Stack looks like following:
on Linux platform

Engine::ScheduleRun() begin
[1643380890.231073][2200:2200] CHIP:DMG: <ThreadID: 139932194578816> SchedWork() begin
[1643380890.231078][2200:2200] CHIP:DMG: <ThreadID: 139932194578816> Signal begin
[1643380890.231083][2200:2200] CHIP:DMG: <ThreadID: 139932194578816> Signal end
[1643380890.231086][2200:2200] CHIP:DMG: <ThreadID: 139932194578816> SchedWork() end
Engine::ScheduleRun() end
[1643380890.231088][2200:2203] CHIP:DMG: <ThreadID: 139932106483456> Engine::Run() begin
[1643380890.231106][2200:2203] CHIP:DMG: <ThreadID: 139932106483456> Run() begin
[1643380890.231110][2200:2203] CHIP:DMG: <ThreadID: 139932106483456> Run() end
[1643380890.231115][2200:2203] CHIP:DMG: <ThreadID: 139932106483456> Engine::Run() end

on MIPS platform

Engine::ScheduleRun() begin
[1643382659.193314][1927:1927] CHIP:DMG: <ThreadID: 2012581508> SchedWork() begin
[1643382659.193372][1927:1927] CHIP:DMG: <ThreadID: 2012581508> Signal begin
[1643382659.193493][1927:1930] CHIP:DMG: <ThreadID: 2006711656> Engine::Run() begin
[1643382659.193547][1927:1930] CHIP:DMG: <ThreadID: 2006711656> Run() begin
[1643382659.193592][1927:1930] CHIP:DMG: <ThreadID: 2006711656> Run() end
[1643382659.193633][1927:1930] CHIP:DMG: <ThreadID: 2006711656> Engine::Run() end
[1643382659.193744][1927:1927] CHIP:DMG: <ThreadID: 2012581508> Signal end
[1643382659.193795][1927:1927] CHIP:DMG: <ThreadID: 2012581508> SchedWork() end
Engine::ScheduleRun() end

NOTE: see #11155 and #13093

@lnikulin lnikulin reopened this Feb 3, 2022
@bzbarsky-apple
Copy link
Contributor

Where are the "SchedWork() begin" and "Signal begin" prints coming from? That is, what code locations?

@lnikulin
Copy link
Contributor Author

lnikulin commented Feb 4, 2022

File src/system/SystemLayerImplSelect.cpp

CHIP_ERROR LayerImplSelect::ScheduleWork(TimerCompleteCallback onComplete, void * appState)
{
    ChipLogDetail(DataManagement, "<ThreadID: %lu> SchedWork() begin", pthread_self());
    VerifyOrReturnError(mLayerState.IsInitialized(), CHIP_ERROR_INCORRECT_STATE);

    CancelTimer(onComplete, appState);

    TimerList::Node * timer = mTimerPool.Create(*this, SystemClock().GetMonotonicTimestamp(), onComplete, appState);
    VerifyOrReturnError(timer != nullptr, CHIP_ERROR_NO_MEMORY);

#if CHIP_SYSTEM_CONFIG_USE_DISPATCH
    dispatch_queue_t dispatchQueue = GetDispatchQueue();
    if (dispatchQueue)
    {
        (void) mTimerList.Add(timer);
        dispatch_async(dispatchQueue, ^{
            this->HandleTimerComplete(timer);
        });
        return CHIP_NO_ERROR;
    }
#endif // CHIP_SYSTEM_CONFIG_USE_DISPATCH

    if (mTimerList.Add(timer) == timer)
    {
        // The new timer is the earliest, so the time until the next event has probably changed.
        ChipLogDetail(DataManagement, "<ThreadID: %lu> Signal begin", pthread_self());
        Signal();
        ChipLogDetail(DataManagement, "<ThreadID: %lu> Signal end", pthread_self());
    }
    ChipLogDetail(DataManagement, "<ThreadID: %lu> SchedWork() end", pthread_self());
    return CHIP_NO_ERROR;
}

@bzbarsky-apple
Copy link
Contributor

OK, but that's not what we were asking. We are asking: what code is calling ScheduleRun, not what code does ScheduleRun call.

Someone is calling ScheduleRun without synchronization on the wrong thread. The question is who that is.

@lnikulin
Copy link
Contributor Author

lnikulin commented Feb 7, 2022

Write attribute emberAfWriteAttribute() from application thread (Write UniqueId attribute of Basic Cluster EP0)

Main application thread = 2012425860
Chip Event loop thread = 2006556008

[1644241241.522429][1927:1927] CHIP:DMG: <ThreadID: 2012425860> SchedWork() begin
[1644241241.522488][1927:1927] CHIP:DMG: <ThreadID: 2012425860> Signal begin
[1644241241.522604][1927:1930] CHIP:DMG: <ThreadID: 2006556008> Engine::Run() START
[1644241241.522654][1927:1930] CHIP:DMG: <ThreadID: 2006556008> Run() begin, set mRunScheduled = 0
[1644241241.522728][1927:1930] CHIP:DMG: <ThreadID: 2006556008> Run() end
[1644241241.522768][1927:1930] CHIP:DMG: <ThreadID: 2006556008> Engine::Run() END
[1644241241.522880][1927:1927] CHIP:DMG: <ThreadID: 2012425860> Signal end
[1644241241.522928][1927:1927] CHIP:DMG: <ThreadID: 2012425860> SchedWork() end
[1644241241.522970][1927:1927] CHIP:DMG: <ThreadID: 2012425860> ScheduleRun() setup mRunSched = 1

Main application thread callstack:

#0  chip::app::reporting::Engine::ScheduleRun (this=0x55beaff0f058 <chip::app::sInteractionModelEngine+3224>)
    at src/app/reporting/Engine.cpp:432
#1  0x000055beaf97b9e5 in MatterReportingAttributeChangeCallback (endpoint=0, clusterId=40, attributeId=18)
    at src/app/util/ember-compatibility-functions.cpp:927
#2  0x000055beaf97b959 in MatterReportingAttributeChangeCallback (endpoint=0, clusterId=40, attributeId=18, mask=64 '@', manufacturerCode=0, type=66 'B',
    data=0x7fffd907e500 " 8bdb812f40eb433f830f1c31e44a78e8\345\a\331\377\177")
    at src/app/util/ember-compatibility-functions.cpp:912
#3  0x000055beaf9777eb in emAfWriteAttribute (endpoint=0, cluster=40, attributeID=18, mask=64 '@', manufacturerCode=0, data=0x7fffd907e500 " 8bdb812f40eb433f830f1c31e44a78e8\345\a\331\377\177", dataType=66 'B',
    overrideReadOnlyAndDataType=true, justTest=false) at src/app/util/attribute-table.cpp:661
#4  0x000055beaf9763ef in emberAfWriteAttribute (endpoint=0, cluster=40, attributeID=18, mask=64 '@', dataPtr=0x7fffd907e500 " 8bdb812f40eb433f830f1c31e44a78e8\345\a\331\377\177", dataType=66 'B')
    at src/app/util/attribute-table.cpp:93

@bzbarsky-apple
Copy link
Contributor

Write attribute emberAfWriteAttribute() from application thread

Ok, this is the thing that's not allowed. This needs to happen off a task queued on the Matter event loop. Otherwise you can race not only with Engine::ScheduleRun but also with writes to attributes happening as a result of Matter message processing, so can end up with a corrupted attribute store.

@lnikulin
Copy link
Contributor Author

lnikulin commented Feb 8, 2022

@bzbarsky-apple
We could wrap emberAfWriteAttribute() call with ScheduleWork (is that what you mean by queuing up the task?) and most probably it will work.
However, emberAfWriteAttribute() call itself ends up in ScheduleRun which calls ScheduleWork at the end, it just not properly handle the case when the callback is called before ScheduleWork returns.
So, it looks like API designed to be thread agnostic, but there is a bug in the sdk ScheduleRun function.

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Feb 8, 2022

We could wrap emberAfWriteAttribute() call with ScheduleWork

Yes, that is what you should do.

However, emberAfWriteAttribute() call itself ends up in ScheduleRun which calls ScheduleWork at the end

This is after it has already done a bunch of work, like writing to the attribute store, and the remaining (async) work is sending subscription updates as needed.

Writing to the attribute store should only be happening from the Matter event loop. If you do it from a random thread, that write will race with writes due to incoming messages, and you will get corrupted data. Does that make sense?

So, it looks like API designed to be thread agnostic

It's not thread agnostic at all.

@lnikulin
Copy link
Contributor Author

lnikulin commented Feb 9, 2022

In case, we use wrapper for emberWriteAttribute and call it through the chip::DeviceLayer::PlatformMgr().ScheduleWork(), our application works without any errors. I'm closing this issue because the issue is not in SDK

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants