Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow-up issues for CASE async work #26280

Open
mlepage-google opened this issue Apr 27, 2023 · 5 comments
Open

Follow-up issues for CASE async work #26280

mlepage-google opened this issue Apr 27, 2023 · 5 comments

Comments

@mlepage-google
Copy link
Contributor

mlepage-google commented Apr 27, 2023

PR #25695 introduces a refactoring (for CASESession::SendSigma3 as background/async work) that can use a bit more tweaking in subsequent PRs.

Firstly, the background work helper mechanism introduced in that PR should also be applied to the earlier HandleSigma3 changes.
--> done in #26300

WorkHelper constructor should be private, but smart pointers don't like that. Figure out if we can somehow declare sufficient friend-liness to make this build. (Note though that it should work if constructed normally...)
#25695 (comment)
--> done in #26339

There's some subtle/tricky smart/weak ptr logic going on, e.g. in WorkHandler. Document the traps and pitfalls.
#25695 (comment)
--> done in #26339

Document why fabric index is atomic, and why it's changed if we're going to cancel the work anyways. (Mostly because it's another chance to win the cancel race, and if we lose that race, theoretically the fabric index could have been reused with a different fabric, etc.)
#25695 (comment)
#25695 (comment)
--> done in #26339

Consider changing "exit" style to "return" style in SendSigma3b and HandleSigma3b, now that nothing really happens at the exit label.
--> done in #26339

Improve handling if scheduling work fails.
#25695 (comment)
Example failure: #26737

Improve handling if sending status report fails.
#25695 (comment)

@mlepage-google mlepage-google changed the title Follow-up issues for PR #25695 Follow-up issues for CASE async work Apr 28, 2023
mlepage-google added a commit to mlepage-google/connectedhomeip that referenced this issue Apr 28, 2023
WorkHelper was introduced to help with CASESession::SendSigma3.
Now use it to help with CASESession::HandleSigma3.

Part of issue project-chip#26280.
mlepage-google added a commit that referenced this issue May 1, 2023
* Use work helper with CASE handle sigma3

WorkHelper was introduced to help with CASESession::SendSigma3.
Now use it to help with CASESession::HandleSigma3.

Part of issue #26280.

* Unscope a block
@mlepage-google
Copy link
Contributor Author

mlepage-google commented Jun 16, 2023

Regarding "Improve handling if scheduling work fails"...

Work is broken into A/B/C. The intent is to do A and C in foreground, and B in background. (If background processing is not enabled, then B will also be in foreground, but it's still scheduled by A, and still needs to schedule C.)

If we're in A and scheduling B fails, that's not a problem, because we handle the error (send a status report, reset the session, etc.). If we're in C, there's no more scheduling, and we already have the result of any work (including any failure that occurred during B), so again we can handle the error appropriately, as we're in the foreground. (Sending the status report might fail, but that's a separate problem that was always a problem, even before the A/B/C split.)

It's if we're in B, possibly in the background, and we cannot schedule C (on the foreground), that error handling becomes tricky. We can't just call a callback on the CASESession object, as we're not in a position to do so (background vs. foreground). We would need a ScheduleFailureCallback that is safe to call on the CASESession object, even in the background. We can't rely on mutex, because those are platform-specific, and we're in portable code. We can't rely on a timer (being created at this time), because those also would fail to schedule under the same circumstances (i.e. no more slots in event processing queue).

One potential idea is to add an atomic zombie flag to CASESession, and set that in a ScheduleFailureCallback. Then, whenever something happens in CASEServer (in the foreground), check the pinned CASESession, and reap it if it's a zombie (including sending a status report). We'd have to check for a good spot to do this, ensure we get all paths, etc.

Or, if we can ensure that nothing else happens on that pinned CASESession object in the foreground (e.g. timer goes off and it tries to clean up), then we can be sure it's safe to call a ScheduleFailureCallback from a background thread. I'd like to know (confidently) though that it's safe, and it will have to remain safe in the wake of future changes to CASESession; it should be robust not fragile.

Any other ideas?

@bzbarsky-apple
Copy link
Contributor

if we can ensure that nothing else happens on that pinned CASESession object in the foreground

Matter stack shutdown can always happen in the foreground, right? So I don't think we can ensure that.

@bzbarsky-apple
Copy link
Contributor

one potential idea is to add an atomic zombie flag to CASESession

I guess this would get set specifically when the dispatch fails? That seems like it might work, if we can ensure that the CASESession does not die while we are trying to set the flag.

Importantly, the CASE session itself has a pointer to the background work. So the background work could set the flag on itself and the CASE session could read the state from there, right?

@mlepage-google
Copy link
Contributor Author

Yes, that's why I ended up making a separate background work object and going through the hassle of trying to manage its lifetime, because the CASESession object could be altered or more importantly destroyed at any time.

The issue though is, when is the CASESession object checking this flag? That's what I was alluding to here:

One potential idea is to add an atomic zombie flag to CASESession, and set that in a ScheduleFailureCallback. Then, whenever something happens in CASEServer (in the foreground), check the pinned CASESession, and reap it if it's a zombie (including sending a status report). We'd have to check for a good spot to do this, ensure we get all paths, etc.

@bzbarsky-apple
Copy link
Contributor

We would need to change things to keep the unsolicited message handler registered but drop the messages in CASEServer itself if it's in the middle of a non-zombie handshake, etc.

So basically keep ourselves registered, and when Sigma1 received:

  1. If in middle of handshake but it's zombie, tear down the handshake.
  2. If still in middle of handshake, return without responding.
  3. Do a new handshake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants