Queued deletion setup on Darwin leads to crashes #22320

bzbarsky-apple · 2022-08-31T20:13:33Z

Problem

Steps to reproduce:

1. Ensure that [darwin_framework_tool] Add a shortcut (CTL('^')) to restart the stac… #22268 or equivalent is applied.
2. Modify the code in that PR to add a sleep(5) in CHIPCommandBridge::RestartCommissioners between the controller shutdowns and the controller restarts, unless something has been added since then to make this possible without code changes.
3. Modify the subscribe-all-events command in darwin-framework-tool to turn off auto-resubscribe, unless something has been added to make that possible without code changes.
4. Add a sleep(10) at the beginning of ReadHandler::SendReportData when !aReadMoreChunks is true.
5. Recompile all-clusters-app and darwin-framework-tool.
6. Run all-clusters-app.
7. Run darwin-framework-tool interactive start.
8. Pair all-clusters-app as node id 1.
9. Run any subscribe-all-events 1 2 1 0xFFFF
10. When there is a pause in the messages due to the 10-second sleep from step 4, hit Ctrl+^ to trigger the stack restart.

This crashes with the stack observed in #20085

What's going on is the following sequence of events:

1. We start shutting down the controller.
2. This evicts the ReadClient's secure session.
3. That lands us in SubscriptionCallback::ReportError, which sets mHaveQueuedDeletion to true and queues a task to do the following:
   a. Call our error callback.
   b. Queue a task to the Matter queue to delete the SubscriptionCallback, and hence the ReadClient.
4. We get SubscriptionCallback::OnDone and it's a no-op because mHaveQueuedDeletion is true.
5. Controller shutdown proceeds. It pauses the Matter queue before the task queued in step 3 has had a chance to run, and shuts down the Matter stack.
6. The task from step 3 runs and queues the deletion of the SubscriptionCallback on the Matter queue. But that queue is paused, so the block sits there.
7. After the 5-second sleep ends, we restart the Matter stack. Early in startup this spins up the Matter event loop, which un-pauses the Matter queue, and that runs the "delete the ReadClient" block. At this point very little of the Matter stack is up yet, and we crash because we expect objects to exist that just don't.
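The ordering problem in the sequence above can be reproduced in a toy model. The sketch below is not the Darwin framework code: `PausableQueue`, `SimulateQueuedDeletionRace`, and the `stackReady` flag are hypothetical names invented for illustration, and the "queues" are drained by hand on a single thread rather than by real dispatch queues. It only shows how a deletion block that reaches a paused queue ends up running at the start of the next startup.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a serial dispatch queue that can be paused.
class PausableQueue {
public:
    void Post(std::function<void()> task) { mTasks.push_back(std::move(task)); }
    void Pause() { mPaused = true; }
    void Resume() { mPaused = false; }

    // Runs queued tasks in FIFO order unless paused; returns how many ran.
    std::size_t Drain() {
        std::size_t ran = 0;
        while (!mPaused && !mTasks.empty()) {
            std::function<void()> task = std::move(mTasks.front());
            mTasks.pop_front();
            task();
            ++ran;
        }
        return ran;
    }

private:
    std::deque<std::function<void()>> mTasks;
    bool mPaused = false;
};

// Replays the event order from the issue and returns a log of what happened.
std::vector<std::string> SimulateQueuedDeletionRace() {
    std::vector<std::string> log;
    PausableQueue matterQueue;
    PausableQueue clientQueue;
    bool stackReady = true; // toy flag: "the objects the deletion needs exist"

    // ReportError: hop to the client queue, which later hops back to the
    // Matter queue to perform the deletion.
    clientQueue.Post([&] {
        log.push_back("error callback");
        matterQueue.Post([&] {
            log.push_back(stackReady ? "delete: stack ready"
                                     : "delete: stack NOT ready");
        });
    });

    // Shutdown wins the race: the Matter queue is paused and the stack torn
    // down before the client-queue task has run.
    matterQueue.Pause();
    stackReady = false;

    // The client-queue task now runs and posts the deletion to a paused queue.
    clientQueue.Drain();
    std::size_t ranWhilePaused = matterQueue.Drain();
    assert(ranWhilePaused == 0); // the deletion block just sits there
    (void) ranWhilePaused;

    // Restart: the queue is un-paused early in startup, before the rest of
    // the stack is back, so the stale deletion runs first.
    matterQueue.Resume();
    matterQueue.Drain();
    stackReady = true;
    log.push_back("stack restarted");
    return log;
}
```

Calling `SimulateQueuedDeletionRace()` yields a log in which the deletion entry appears before the "stack restarted" entry and reports that the stack was not ready, which is the toy equivalent of the crash described here.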

Proposed Solution

Still figuring this out.


The basic issue we could run into is that the Matter stack would shut down while our async block was still running on our client queue, and by the time the "delete this object" block was queued on the Matter queue that queue would be paused. Then if the stack was restarted the queue would be unpaused, and the deletion of the ReadClient would happen early in stack startup, when things were not in a good state yet. The fix is to make sure we queue the async deletion without going through the client queue first, and avoid doing the async bits altogether when we can (when the subscription itself errors out). Fixes project-chip#22320

project-chip#22978 accidentally reintroduced the crash that project-chip#22324 had fixed. To avoid more issues along these lines: 1) Add unit tests that reproduce the crashes described in project-chip#22320 (with the changes from project-chip#22978) and project-chip#22935 (without those changes). 2) Change MTRBaseSubscriptionCallback to always invoke its callbacks synchronously, on the Matter queue, so that we can clean up the MTRClusterStateCacheContainer's pointer to the ClusterStateCache before it gets deleted on the Matter queue. 3) Move the queueing of callbacks to the client queue into the consumers of MTRBaseSubscriptionCallback, so they can do whatever sync work they need (like the above cleanup) before going async. 4) Update documentation.
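Points 2 and 3 of that change describe a "synchronous callback, consumer decides when to go async" pattern. The following is a minimal sketch of that pattern under stated assumptions, not the actual MTRBaseSubscriptionCallback implementation: `TaskQueue`, `BaseSubscriptionCallback`, `Cache`, `CacheContainer`, `Outcome`, and `RunSyncThenAsync` are all hypothetical stand-ins, and the queues are drained manually on one thread.

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal FIFO task queue standing in for the client queue.
class TaskQueue {
public:
    void Post(std::function<void()> task) { mTasks.push_back(std::move(task)); }
    void Drain() {
        while (!mTasks.empty()) {
            std::function<void()> task = std::move(mTasks.front());
            mTasks.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> mTasks;
};

// Toy base callback: invokes its handler synchronously, on whatever thread
// calls ReportError (the Matter queue in the scenario being modeled).
class BaseSubscriptionCallback {
public:
    explicit BaseSubscriptionCallback(std::function<void(const std::string &)> onError)
        : mOnError(std::move(onError)) {}

    void ReportError(const std::string & err) {
        if (mOnError) {
            mOnError(err);
        }
    }

private:
    std::function<void(const std::string &)> mOnError;
};

struct Cache { int entries = 3; };                 // stand-in for the cache object
struct CacheContainer { Cache * cache = nullptr; }; // stand-in for its container

struct Outcome {
    bool pointerClearedSynchronously = false;
    bool clientNotifiedBeforeDrain = false;
    std::string clientSaw;
};

Outcome RunSyncThenAsync() {
    TaskQueue clientQueue;
    std::unique_ptr<Cache> cache = std::make_unique<Cache>();
    CacheContainer container{ cache.get() };
    std::string clientSaw;

    // The consumer owns the hop to the client queue, so it can do its
    // synchronous cleanup first and only then go async.
    BaseSubscriptionCallback callback([&](const std::string & err) {
        container.cache = nullptr; // sync: drop the pointer before deletion
        clientQueue.Post([&clientSaw, err] { clientSaw = err; });
    });

    Outcome out;
    callback.ReportError("session evicted");
    out.pointerClearedSynchronously = (container.cache == nullptr);
    cache.reset(); // the cache can now be deleted without a dangling pointer
    out.clientNotifiedBeforeDrain = !clientSaw.empty();
    clientQueue.Drain(); // the client hears about the error later
    out.clientSaw = clientSaw;
    return out;
}
```

In this model the container's pointer is already cleared by the time `ReportError` returns, while the client-facing notification is delivered only when the client queue is drained.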

bzbarsky-apple added darwin V1.0 crash labels Aug 31, 2022

bzbarsky-apple self-assigned this Aug 31, 2022

This was referenced Aug 31, 2022

ReadClient crash when ExchangeManager is already shutdown (so trying to get SystemLayer fails) #20085

Closed

Combine the nearly identical SubscriptionCallback implementations on Darwin #22322

Closed

bzbarsky-apple mentioned this issue Aug 31, 2022

Fix lifetime of Darwin SubscriptionCallback to avoid shutdown crashes. #22324

Merged

bzbarsky-apple closed this as completed in #22324 Sep 1, 2022

This was referenced Sep 1, 2022

[DarwinFrameworkTool] Add autoResubscribe optional parameters to comm… #22339

Merged

[darwin-framework-tool] Add a shortcut (CTL('_')) to stop the stack w… #22362

Merged

bzbarsky-apple mentioned this issue Oct 7, 2022

Better fix for crashes around MTRBaseSubscriptionCallback. #23076

Merged
