Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRP failures trying to commission all-clusters-app on m5stack #17471

Closed
bzbarsky-apple opened this issue Apr 18, 2022 · 9 comments · Fixed by #18773
Closed

MRP failures trying to commission all-clusters-app on m5stack #17471

bzbarsky-apple opened this issue Apr 18, 2022 · 9 comments · Fixed by #18773
Assignees

Comments

@bzbarsky-apple
Copy link
Contributor

Problem

Steps to reproduce:

  1. Check out ToT (or revision 06c6e06 if desired).
  2. Compile chip-all-clusters-app for m5stack and flash it.
  3. Compile chip-tool (scripts/examples/gn_build_example.sh examples/chip-tool out/debug/standalone/ or equivalent)
  4. Run `chip-tool pairing ble-wifi ${NODE_ID} ${SSID} ${PASSWORD} 20202021 3840

Expected results: no errors.

Actual results: commissioning does succeed, but before it does that the client prints:

[1650316259160] [73458:29505188] CHIP: [SC] Sending Sigma3
[1650316259160] [73458:29505188] CHIP: [EM] Piggybacking Ack for MessageCounter:3203922871 on exchange: 40969i
[1650316259160] [73458:29505188] CHIP: [IN] Prepared unauthenticated message 0x7fe1f7814c68 to 0x0000000000000000 (0)  of type 0x32 and protocolId (0, 0) on exchange 40969i with MessageCounter:2794651821.
[1650316259160] [73458:29505188] CHIP: [IN] Sending unauthenticated msg 0x7fe1f7814c68 with MessageCounter:2794651821 to 0x0000000000000000 at monotonic time: 00000000D66B3854 msec
[1650316259160] [73458:29505188] CHIP: [SC] Sent Sigma3 msg
[1650316259465] [73458:29505188] CHIP: [EM] Retransmitting MessageCounter:2794651821 on exchange 40969i Send Cnt 0
[1650316259465] [73458:29505188] CHIP: [IN] Sending unauthenticated msg 0x7fe1f7814c68 with MessageCounter:2794651821 to 0x0000000000000000 at monotonic time: 00000000D66B3985 msec
[1650316259805] [73458:29505189] CHIP: [EM] Retransmitting MessageCounter:2794651821 on exchange 40969i Send Cnt 1
[1650316259805] [73458:29505189] CHIP: [IN] Sending unauthenticated msg 0x7fe1f7814c68 with MessageCounter:2794651821 to 0x0000000000000000 at monotonic time: 00000000D66B3AD9 msec
[1650316260181] [73458:29505188] CHIP: [EM] Retransmitting MessageCounter:2794651821 on exchange 40969i Send Cnt 2
[1650316260181] [73458:29505188] CHIP: [IN] Sending unauthenticated msg 0x7fe1f7814c68 with MessageCounter:2794651821 to 0x0000000000000000 at monotonic time: 00000000D66B3C52 msec
[1650316260762] [73458:29505189] CHIP: [EM] Failed to Send CHIP MessageCounter:2794651821 on exchange 40969i sendCount: 3 max retries: 3
[1650316262426] [73458:29505189] CHIP: [EM] Received message of type 0x40 with protocolId (0, 0) and MessageCounter:3203922872 on exchange 40969i
[1650316262426] [73458:29505189] CHIP: [EM] Found matching exchange: 40969i, Delegate: 0x7fe1f80cb640
[1650316262426] [73458:29505189] CHIP: [EM] CHIP MessageCounter:2794651821 not in RetransTable on exchange 40969i
[1650316262426] [73458:29505189] CHIP: [SC] Success status report received. Session was established
[1650316262426] [73458:29505189] CHIP: [IN] New secure session created for device 0x0000000000000012, LSID:55323 PSID:17828!
[1650316262426] [73458:29505189] CHIP: [CTL] OperationalDeviceProxy[6817EFE2DA1CD08E:0000000000000012]: State change 3 --> 4
[1650316262426] [73458:29505189] CHIP: [CTL] Finished commissioning step 'FindOperational' with error '../../../examples/chip-tool/third_party/connectedhomeip/src/controller/CHIPDeviceController.cpp:1610: Success'
[1650316262426] [73458:29505189] CHIP: [CTL] Going from commissioning step 'FindOperational' with lastErr = '../../../examples/chip-tool/third_party/connectedhomeip/src/controller/CHIPDeviceController.cpp:1610: Success' --> 'SendComplete'
[1650316262426] [73458:29505189] CHIP: [CTL] Performing next commissioning step 'SendComplete' with completion status = '../../../examples/chip-tool/third_party/connectedhomeip/src/controller/CHIPDeviceController.cpp:1610: Success'
[1650316262426] [73458:29505189] CHIP: [DMG] ICR moving to [AddingComm]
[1650316262426] [73458:29505189] CHIP: [DMG] ICR moving to [AddedComma]
[1650316262426] [73458:29505189] CHIP: [IN] Prepared secure message 0x7fe1f7814c68 to 0x0000000000000012 (1)  of type 0x8 and protocolId (0, 1) on exchange 40970i with MessageCounter:14020799.
[1650316262426] [73458:29505189] CHIP: [IN] Sending encrypted msg 0x7fe1f7814c68 with MessageCounter:14020799 to 0x0000000000000012 (1) at monotonic time: 00000000D66B4516 msec
[1650316262426] [73458:29505189] CHIP: [DMG] ICR moving to [CommandSen]
[1650316262426] [73458:29505189] CHIP: [EM] Sending Standalone Ack for MessageCounter:3203922872 on exchange 40969i
[1650316262426] [73458:29505189] CHIP: [IN] Prepared unauthenticated message 0x70000c82b4d0 to 0x0000000000000000 (0)  of type 0x10 and protocolId (0, 0) on exchange 40969i with MessageCounter:2794651822.
[1650316262426] [73458:29505189] CHIP: [IN] Sending unauthenticated msg 0x70000c82b4d0 with MessageCounter:2794651822 to 0x0000000000000000 at monotonic time: 00000000D66B4517 msec
[1650316262426] [73458:29505189] CHIP: [EM] Flushed pending ack for MessageCounter:3203922872 on exchange 40969i
[1650316262466] [73458:29505189] CHIP: [EM] Received message of type 0x10 with protocolId (0, 0) and MessageCounter:3203922873 on exchange 40969i
[1650316262466] [73458:29505189] CHIP: [EM] OnMessageReceived failed, err = ../../../examples/chip-tool/third_party/connectedhomeip/src/messaging/ExchangeMgr.cpp:268: CHIP Error 0x00000070: Unsolicited msg with originator bit clear
[1650316262528] [73458:29505189] CHIP: [EM] Received message of type 0x10 with protocolId (0, 0) and MessageCounter:3203922874 on exchange 40969i
[1650316262528] [73458:29505189] CHIP: [EM] OnMessageReceived failed, err = ../../../examples/chip-tool/third_party/connectedhomeip/src/messaging/ExchangeMgr.cpp:268: CHIP Error 0x00000070: Unsolicited msg with originator bit clear
[1650316262559] [73458:29505189] CHIP: [EM] Received message of type 0x10 with protocolId (0, 0) and MessageCounter:3203922875 on exchange 40969i
[1650316262559] [73458:29505189] CHIP: [EM] OnMessageReceived failed, err = ../../../examples/chip-tool/third_party/connectedhomeip/src/messaging/ExchangeMgr.cpp:268: CHIP Error 0x00000070: Unsolicited msg with originator bit clear
[1650316262636] [73458:29505189] CHIP: [EM] Received message of type 0x9 with protocolId (0, 1) and MessageCounter:13811505 on exchange 40970i

and the server prints:

I (35990) chip[IN]: Sending unauthenticated msg 0x3ffca544 with MessageCounter:3203922871 to 0x0000000000000000 at monotonic time: 0000000000008961 msec
I (36010) chip[SC]: Sent Sigma2 msg
E (36010) chip[DL]: Long dispatch time: 1649 ms, for event type 3
I (36010) chip[BLE]: Releasing end point's BLE connection back to application.
E (36020) chip[BLE]: HandleChipConnectionReceived failed, err = d
E (36020) chip[BLE]: failed handle new chip BLE connection, status = d
I (36220) chip[EM]: Received message of type 0x32 with protocolId (0, 0) and MessageCounter:2794651821 on exchange 40969r
I (36220) chip[EM]: Received message of type 0x32 with protocolId (0, 0) and MessageCounter:2794651821 on exchange 40969r
I (36220) chip[SC]: Received Sigma3 msg
W (37750) ROUTE_HOOK: Ignore invalid ICMP packet
E (39410) chip[SC]: Unable to save session resumption state: a0
I (39420) chip[IN]: Prepared unauthenticated message 0x3ffca544 to 0x0000000000000000 (0)  of type 0x40 and protocolId (0, 0) on exchange 40969r with MessageCounter:3203922872.
I (39430) chip[IN]: Sending unauthenticated msg 0x3ffca544 with MessageCounter:3203922872 to 0x0000000000000000 at monotonic time: 00000000000096CF msec
I (39440) chip[IN]: CASE Session established. Setting up the secure channel.
I (39460) chip[IN]: CASE secure channel is available now.
I (39460) chip[IN]: CASE Server enabling CASE session setups
E (39460) chip[DL]: Long dispatch time: 3244 ms, for event type 3
I (39470) chip[EM]: Received message of type 0x32 with protocolId (0, 0) and MessageCounter:2794651821 on exchange 40969r
I (39490) chip[IN]: Prepared unauthenticated message 0x3ffe7734 to 0x0000000000000000 (0)  of type 0x10 and protocolId (0, 0) on exchange 40969r with MessageCounter:3203922873.
I (39500) chip[IN]: Sending unauthenticated msg 0x3ffe7734 with MessageCounter:3203922873 to 0x0000000000000000 at monotonic time: 0000000000009717 msec
I (39520) chip[EM]: Received message of type 0x32 with protocolId (0, 0) and MessageCounter:2794651821 on exchange 40969r
I (39530) chip[IN]: Prepared unauthenticated message 0x3ffe7734 to 0x0000000000000000 (0)  of type 0x10 and protocolId (0, 0) on exchange 40969r with MessageCounter:3203922874.
I (39540) chip[IN]: Sending unauthenticated msg 0x3ffe7734 with MessageCounter:3203922874 to 0x0000000000000000 at monotonic time: 0000000000009743 msec
I (39560) chip[EM]: Received message of type 0x32 with protocolId (0, 0) and MessageCounter:2794651821 on exchange 40969r
I (39560) chip[IN]: Prepared unauthenticated message 0x3ffe7734 to 0x0000000000000000 (0)  of type 0x10 and protocolId (0, 0) on exchange 40969r with MessageCounter:3203922875.
I (39590) chip[IN]: Sending unauthenticated msg 0x3ffe7734 with MessageCounter:3203922875 to 0x0000000000000000 at monotonic time: 000000000000976E msec

Git bisect says this started with #17003. @turon

Proposed Solution

Figure out what's going on and why we are not seeing our acks in a timely manner.

@bzbarsky-apple
Copy link
Contributor Author

So on the client side we send the Sigma3 at monotonic time 3597350996 and then do the resends at:

3597351301
3597351641
3597352018

which are ~300ms, ~300ms, and ~400ms apart previous message. Then we decide the message has failed:

Failed to Send CHIP MessageCounter:2794651821 on exchange 40969i sendCount: 3 max retries: 3

We then eventually get the acks and treat them as unsolicited messages. We don't log timestamps on those, but it's after we send the message with message counter 2794651822, which happens at monotonic time 3597354263, so 3200ms or so after we sent Sigma3.

On the server, we send sigma2 at monotonic time 35169, then receive a Sigma3 (time not logged), then send a StatusReport responding to the Sigma3 at monotonic time 38607, so 3450ms or so after sending Sigma2.

So one plausible hypothesis is that Sigma3 processing takes a long time (several seconds) and happens on the Matter thread, so we never get a chance to send a standalone ack to the Sigma3 even though our processing is taking longer than the standalone ack timeout.... Then we finally finish whatever work it is Sigma3 processing does and all the piled-up resends get delivered... And it's possible that we did more backing off on our unauthenticated resends before, so didn't run into this problem.

I expect a similar problem occurs with Sigma2 if the slow device (e.g. the m5stack in this case) is the CASE initiator...

@turon
Copy link
Contributor

turon commented Apr 19, 2022

TL;DR: CASE ultimately succeeds, but sends excessive retries and logs those message failures as warnings.
Standalone acks may be getting blocked by crypto work for some reason.

@bzbarsky-apple
Copy link
Contributor Author

The reason being that the crypto work happens on the thread that messages are sent/received on...

@bzbarsky-apple
Copy link
Contributor Author

Maybe we should have manual FlushAcks calls before we go off into the long-running crypto bits when handling sigma2/sigma3? Or can we put those crypto bits off on a separate thread?

@turon
Copy link
Contributor

turon commented May 24, 2022

Suggestion from @tcarmelveilleux to explicitly call SendStandaloneAckMessage before long running crypto starts.
Though implications on ReliableMessageMgr need to be considered.
@kghost

@tcarmelveilleux
Copy link
Contributor

Overall, I would recommend the following:

  • On receiving sigma messages, and on pake messages, but not the status reports, and not the PBKDFParam*, we would send a standalone ack before processing the message
  • This behaviour would need to be conditional on a new CHIP_DEVICE_CONFIG variable, which is only set on devices that cannot process Sigma1, 2, 3 in less than ~200ms
  • The PASE and CASE session establishment state machines need to allow a bit more time and retries since these are costly operations

turon added a commit to turon/connectedhomeip that referenced this issue May 24, 2022
@kghost
Copy link
Contributor

kghost commented May 24, 2022

@tcarmelveilleux

On receiving sigma messages, and on pake messages, but not the status reports, and not the PBKDFParam*, we would send a standalone ack before processing the message

It works, but not ideal. Since m5 has a dual core SoC, we should defer the encryption work to another core while the chip core is handling chip events. Blocking chip thread masses up lots of timers, if there is any other on going exchanges, it will suffer the same problem

@tcarmelveilleux
Copy link
Contributor

@kghost Deferring the crypto requires a ton of changes which are riskier. This is why I suggest enable that feature with a CHIP_DEVICE_CONFIG flag

@bzbarsky-apple
Copy link
Contributor Author

Yes, ideally we would do the crypto on a background task on hardware that is multi-core, but we still need something here for slow single-core hardware.

The proposed plan in #17471 (comment) sounds good to me.

turon added a commit to turon/connectedhomeip that referenced this issue May 24, 2022
… in CASE and PASE.

- Added CHIP_CONFIG_SLOW_CRYPTO with default of 1
- Force send a standalone ack response to all CASE and PASE when above flag enabled.
- Adjusted expected # of expected packets in CASE and PASE tests.
@turon turon self-assigned this May 24, 2022
turon added a commit to turon/connectedhomeip that referenced this issue May 24, 2022
bzbarsky-apple added a commit that referenced this issue Jun 1, 2022
…PASE. (#18773)

* [mrp] Fix #17471 - force send standalone ack for CASE.

* [mrp] Adjust expected # of packets in CASE test.

* [mrp] Add CHIP_CONFIG_SLOW_CRYPTO and support to standalone ack for PASE.

* [mrp] Move CHIP_DEVICE_CONFIG_SLOW_CRYPTO to device layer config.

* [mrp] Only send slow crypto ack for crypto ops.

* [mrp] Call FlushAcks rather than SendStandaloneAckMessage.

Co-authored-by: Boris Zbarsky <[email protected]>

* [mrp] Assume fast crypto on linux and darwin.

Co-authored-by: Boris Zbarsky <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants