Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CASE Session Establishment often fails #11233

Closed
samanthamar opened this issue Oct 29, 2021 · 15 comments
Closed

CASE Session Establishment often fails #11233

samanthamar opened this issue Oct 29, 2021 · 15 comments

Comments

@samanthamar
Copy link

samanthamar commented Oct 29, 2021

Often after running the resolve command using chip-device-ctrl, the DUT fails to be commissioned due to the errors in the logs attached.

The workaround is to factory reset the DUT and power cycle the controllers and OTBR, however, this does not always work.

DUT is a lighting device that is Thread enabled.

CASE_fail.txt - DUT
CASE_fail_controller.txt - controller

@cjandhyala
Copy link
Contributor

@samanthamar Can you chk if this IP address 'fd47:92dd:b85c:9bf8:ff81:6e7:958d:e389' is reachable from your controller ?

@cecille
Copy link
Contributor

cecille commented Oct 30, 2021

Device shows CASESession timed out while waiting for a response from the peer. Expected message type was hu
Controller seems to believe things are fine:

Sent SigmaR3 msg
[1635349215.466278][9434:9441] CHIP:IN: New secure session created for device 0x0000000000BC5C02, key 1!!
Commissioning complete

So either that message did not get delivered or the timeout happened before that message was delivered. Timing out seems odd - there doesn't seem to be THAT much time between the sigma1 and sigma3 message send from the controller. We do know the device is receiving the messages because it reports:

Sent SigmaR2 msg
00> <error > [DL] Long dispatch time: 1576 ms, for event type 4

From my reading of the code, though, it's supposed to wait a full 10 seconds before that timer goes off.
From CASESession.cpp

// Wait at most 10 seconds for the response from the peer.
// This timeout value assumes the underlying transport is reliable.
// The session establishment fails if the response is not received within timeout window.
static constexpr ExchangeContext::Timeout kSigma_Response_Timeout = System::Clock::Seconds16(10);

So either that timer is not getting reset properly somehow, or the timer is going off early, or the sigma3 message is being lost.

@cecille
Copy link
Contributor

cecille commented Oct 30, 2021

@pan-apple and @kpschoedel

@pan-apple
Copy link
Contributor

I have noticed these kinds of errors mostly when

  1. The network has reachability issues, or
  2. There are two controller apps running on a machine, and both are listening on the same port. So the messages get delivered to the app that's not expecting it.

@cecille
Copy link
Contributor

cecille commented Oct 30, 2021

In this case, though, it would be the device that missed the message. And it looks like a thread device, so I'm thinking it's physically a different device? Can you confirm the setup @samanthamar ?

@cecille
Copy link
Contributor

cecille commented Oct 30, 2021

I don't think it's reachability issues because we can see the sigma1 message get through.

@pan-apple
Copy link
Contributor

Yes, that's correct. Sigma2 is received by the controller, and Sigma3 was sent. But the device never received it. So reachability and multiple controller apps should not be the issue.

@cecille
Copy link
Contributor

cecille commented Oct 30, 2021

Here's something odd:

00> <detail> [DL] Thread packet RCVD: UDP, len 647
00> <detail> [DL]     src  FD5A:90D5:86D6:1:DEA6:32FF:FEA9:A3E7, port 48767
00> <detail> [DL]     dest FD47:92DD:B85C:9BF8:FF81:6E7:958D:E389, port 5540
00> <info  > [EM] Received message of type 0x32 with vendorId 0x0000 and protocolId 0x0000 on exchange 3201r
00> <error > [EM] CHIP MessageCounter:3129F646 not in RetransTable
00> <error > [SC] CASESession timed out while waiting for a response from the peer. Expected message type was hu
00> <info  > [IN] CASE Session establishment failed: Error CHIP:0x00000032
00> <info  > [IN] CASE Server enabling CASE session setups

If you look there, the exchange manager claims to receive a message of type 0x32 on exchange 3201r. Isn't 0x32 the sigma3 message?

@pan-apple
Copy link
Contributor

@samanthamar this looks like an old branch. The log messages (and a lot of other code) in CASESession.cpp were updated in #9904. Could you confirm what branch/commit you are using?

@pan-apple
Copy link
Contributor

Here's something odd:

00> <detail> [DL] Thread packet RCVD: UDP, len 647
00> <detail> [DL]     src  FD5A:90D5:86D6:1:DEA6:32FF:FEA9:A3E7, port 48767
00> <detail> [DL]     dest FD47:92DD:B85C:9BF8:FF81:6E7:958D:E389, port 5540
00> <info  > [EM] Received message of type 0x32 with vendorId 0x0000 and protocolId 0x0000 on exchange 3201r
00> <error > [EM] CHIP MessageCounter:3129F646 not in RetransTable
00> <error > [SC] CASESession timed out while waiting for a response from the peer. Expected message type was hu
00> <info  > [IN] CASE Session establishment failed: Error CHIP:0x00000032
00> <info  > [IN] CASE Server enabling CASE session setups

If you look there, the exchange manager claims to receive a message of type 0x32 on exchange 3201r. Isn't 0x32 the sigma3 message?

Yeah, I wonder if the timeout happened right when the SigmaR3 was received. The logs do not have timestamps to correlate it.

Also, the state machine is a bit different now, since we added another message to the CASE message exchange (SatusReport).

@cecille
Copy link
Contributor

cecille commented Oct 30, 2021

If status is 0x10, that appears right before 0x32 in the log, I just didn't copy it.

@pan-apple
Copy link
Contributor

0x10 is the standalone ack. 0x40 is the status report.
This log also doesn't have other changes (e.g. SigmaR1 was replaced with Sigma1, and similar changes to other Sigma messages). So this code is definitely from prior to #9904 PR.

@samanthamar
Copy link
Author

Hi all,

  • Setup: 1 OTBR (wifi), 1 RPi controller (wifi) and 1 Thread DUT (lighting). All are co-located together on the same desk.
  • Can confirm we are at the head of the TE6 branch. @pan-apple , the PR you are referring to is merged to master and not TE6 branch though correct?

@pan-apple
Copy link
Contributor

That's right. PR is on master, and TE7 branch. The change was too big to be cherry-picked.
Any chance you can try and reproduce it on master or TE7?

@samanthamar
Copy link
Author

This is resolved in TE 7. Thank you all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants