Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queued deletion setup on Darwin leads to crashes #22320

Closed
bzbarsky-apple opened this issue Aug 31, 2022 · 0 comments · Fixed by #22324
Closed

Queued deletion setup on Darwin leads to crashes #22320

bzbarsky-apple opened this issue Aug 31, 2022 · 0 comments · Fixed by #22324

Comments

@bzbarsky-apple
Copy link
Contributor

Problem

Steps to reproduce:

  1. Ensure that [darwin_framework_tool] Add a shortcut (CTL('^')) to restart the stac… #22268 or equivalent is applied.
  2. Modify the code in that PR to add a sleep(5) in CHIPCommandBridge::RestartCommissioners between the controller shutdowns and the controller restarts, unless something has been added since then to make this possible without code changes.
  3. Modify the subscribe-all-events command in darwin-framework-tool to turn off auto-resubscribe, unless something has been added to make that possible without code changes.
  4. Add a sleep(10) at the beginning of ReadHandler::SendReportData when !aReadMoreChunks is true.
  5. Recompile all-clusters-app and darwin-framework-tool.
  6. Run all-clusters-app.
  7. Run darwin-framework-tool interactive start.
  8. Pair all-clusters-app as node id 1.
  9. Run any subscribe-all-events 1 2 1 0xFFFF
  10. When there is a pause in the messages due to the 10-second sleep from step 4, hit Ctrl+^ to trigger the stack restart.

This crashes with the stack observed in #20085

What's going on is the following sequence of events:

  1. We start shutting down the controller.
  2. This evicts the ReadClient's secure session.
  3. That lands us in SubscriptionCallback::ReportError, which sets mHaveQueuedDeletion to true and queues a task to do the following:
    a. Call our error callback.
    b. Queue a task to the Matter queue to delete the SubcriptionCallback, and hence the ReadClient.
  4. We get SubscriptionCallback::OnDone and it's a no-op because mHaveQueuedDeletion is true.
  5. Controller shutdown proceeds. It pauses the Matter queue before the task queued in step 3 has had a chance to run, and shuts down the Matter stack.
  6. The task from step 3 runs, queues the deletion of the SubscriptionCallback on the Matter queue. But that queue is paused, so the block sits there.
  7. After the 5-second sleep ends, we restart the Matter stack. Early in startup this spins up the Matter event loop, which un-pauses the Matter queue, and that runs the "delete the ReadClient" block. At this point very little of the Matter stack is up yet, and we crash because we expect objects to exist that just don't.

Proposed Solution

Still figuring this out.

@bzbarsky-apple bzbarsky-apple self-assigned this Aug 31, 2022
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Aug 31, 2022
The basic issue we could run into is that the Matter stack would shut down
while our async block was still running on our client queue, and by the time
the "delete this object" block was queued on the Matter queue that queue would
be paused.  Then if the stack was restarted the queue would be unpaused, and
the deletion of the ReadClient would happen early in stack startup, when things
were not in a good state yet.

The fix is to make sure we queue the async deletion without going through the
client queue first, and avoid doing the async bits altogether when we can (when
the subscription itself errors out).

Fixes project-chip#22320
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Aug 31, 2022
The basic issue we could run into is that the Matter stack would shut down
while our async block was still running on our client queue, and by the time
the "delete this object" block was queued on the Matter queue that queue would
be paused.  Then if the stack was restarted the queue would be unpaused, and
the deletion of the ReadClient would happen early in stack startup, when things
were not in a good state yet.

The fix is to make sure we queue the async deletion without going through the
client queue first, and avoid doing the async bits altogether when we can (when
the subscription itself errors out).

Fixes project-chip#22320
bzbarsky-apple added a commit that referenced this issue Sep 1, 2022
#22324)

The basic issue we could run into is that the Matter stack would shut down
while our async block was still running on our client queue, and by the time
the "delete this object" block was queued on the Matter queue that queue would
be paused.  Then if the stack was restarted the queue would be unpaused, and
the deletion of the ReadClient would happen early in stack startup, when things
were not in a good state yet.

The fix is to make sure we queue the async deletion without going through the
client queue first, and avoid doing the async bits altogether when we can (when
the subscription itself errors out).

Fixes #22320
isiu-apple pushed a commit to isiu-apple/connectedhomeip that referenced this issue Sep 16, 2022
project-chip#22324)

The basic issue we could run into is that the Matter stack would shut down
while our async block was still running on our client queue, and by the time
the "delete this object" block was queued on the Matter queue that queue would
be paused.  Then if the stack was restarted the queue would be unpaused, and
the deletion of the ReadClient would happen early in stack startup, when things
were not in a good state yet.

The fix is to make sure we queue the async deletion without going through the
client queue first, and avoid doing the async bits altogether when we can (when
the subscription itself errors out).

Fixes project-chip#22320
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Oct 7, 2022
project-chip#22978 accidentally
reintroduced the crash that
project-chip#22324 had fixed.  To avoid
more issues along these lines:

1) Add unit tests that reproduce the crashes described in
   project-chip#22320 (with the
   changes from project-chip#22978) and
   project-chip#22935 (without those
   changes).
2) Change MTRBaseSubscriptionCallback to always invoke its callbacks
   synchronously, on the Matter queue, so that we can clean up the
   MTRClusterStateCacheContainer's pointer to the ClusterStateCache before it
   gets deleted on the Matter queue.
3) Move the queueing of callbacks to the client queue into the consumers of
   MTRBaseSubscriptionCallback, so they can do whatever sync work they need
   (like the above cleanup) before going async.
4) Update documentation.
andy31415 pushed a commit that referenced this issue Oct 11, 2022
#22978 accidentally
reintroduced the crash that
#22324 had fixed.  To avoid
more issues along these lines:

1) Add unit tests that reproduce the crashes described in
   #22320 (with the
   changes from #22978) and
   #22935 (without those
   changes).
2) Change MTRBaseSubscriptionCallback to always invoke its callbacks
   synchronously, on the Matter queue, so that we can clean up the
   MTRClusterStateCacheContainer's pointer to the ClusterStateCache before it
   gets deleted on the Matter queue.
3) Move the queueing of callbacks to the client queue into the consumers of
   MTRBaseSubscriptionCallback, so they can do whatever sync work they need
   (like the above cleanup) before going async.
4) Update documentation.
selissia pushed a commit to selissia/connectedhomeip that referenced this issue Oct 12, 2022
…hip#23076)

project-chip#22978 accidentally
reintroduced the crash that
project-chip#22324 had fixed.  To avoid
more issues along these lines:

1) Add unit tests that reproduce the crashes described in
   project-chip#22320 (with the
   changes from project-chip#22978) and
   project-chip#22935 (without those
   changes).
2) Change MTRBaseSubscriptionCallback to always invoke its callbacks
   synchronously, on the Matter queue, so that we can clean up the
   MTRClusterStateCacheContainer's pointer to the ClusterStateCache before it
   gets deleted on the Matter queue.
3) Move the queueing of callbacks to the client queue into the consumers of
   MTRBaseSubscriptionCallback, so they can do whatever sync work they need
   (like the above cleanup) before going async.
4) Update documentation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant