Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding multiple admins fails with CASE error on TE6 branch #10337

Closed
cjandhyala opened this issue Oct 7, 2021 · 15 comments
Closed

Adding multiple admins fails with CASE error on TE6 branch #10337

cjandhyala opened this issue Oct 7, 2021 · 15 comments
Labels
bug Something isn't working

Comments

@cjandhyala
Copy link
Contributor

cjandhyala commented Oct 7, 2021

Adding 2nd admin(2nd python control) fails On TE6 branch (82bcd4e).

Steps to reproduce :

  1. Launch lighting app on a raspy
  2. commission the lighting app with first python controller (running on raspi) over BLE
  3. Open commissioning window
  4. add the 2nd admin (2nd python controller running on different raspi than 1st controller) using connect -IP cmd
  5. 2nd commissioner resolves the node, but fails to establish CASE session.
    (This setup uses 3 different raspis , one for accessory, one for 1st controller , one 2nd controller)

Attached logs,
1st_controller.txt
2ndController.txt
accessory_log.txt

@bluebin14
Copy link
Contributor

It works for me, however I used our own BLE/Thread device instead of Lighting app. It looks like an mdns/routing problem, can you try with both controllers on same raspi as a test?

@cecille
Copy link
Contributor

cecille commented Oct 8, 2021

It doesn't seem like an mdns problem to me - if you look at line 370/371 on the 2ndController.txt, you can see that avahi resolves the address, and I confirmed with Chaitanya that it's as expected. However, there is an error on line 386 that's concerning - have you seen OnMessageReceived failed, err = ../../src/messaging/ExchangeMgr.cpp:298: CHIP Error 0x0000002A: Invalid message type happen before? Do you know what that might be about?

@jmartinez-silabs
Copy link
Member

I see a lot of flakyness on our platform too with multi-admin (EFR32) on both TE6 branch and ToT. The commissionning flow fails for the 2nd admin

@bluebin14
Copy link
Contributor

bluebin14 commented Oct 8, 2021

controller1 resolves ipv6 address fd7b:a81b:9770:0:dea6:32ff:fe8d:7072 but controller2 gets given ipv4 address 192.168.4.223. Then avahi resolves ipv4 address 192.168.6.222 which is again different. My setup uses ipv6 exclusively and it works.

@bzbarsky-apple
Copy link
Contributor

So it looks like controller2 correctly resolves the device. It sends a Sigma1 (line 383 of 2ndController.txt). The device receives it (line 2358 of accessory_log.txt) and sends a Sigma2 (line 2394). The commissioner receives the Sigma2 but (starting line 384 in 2ndController.txt):

[1633646951.870298][3191:3199] CHIP:EM: Received message of type 0x31 with vendorId 0x0000 and protocolId 0x0000 on exchange 13378i
[1633646951.870406][3191:3199] CHIP:EM: Handling via exchange: 13378i, Delegate: 0x(nil)
[1633646951.870469][3191:3199] CHIP:EM: OnMessageReceived failed, err = ../../src/messaging/ExchangeMgr.cpp:298: CHIP Error 0x0000002A: Invalid message type

So most pressingly, the exchange delegate is null, so the Sigma2 message is not delivered to the CASE state machine and then the handshake fails.

I am also not sure what the deal is with the "invalid message type" message there. That means this check tested true:

        if (ec->IsEncryptionRequired() != packetHeader.GetFlags().Has(Header::FlagValues::kEncryptedMessage))

which is pretty odd in this case.

Anyway, after this the commissioner retransmits Sigma1 a few times, and the other side retransmits Sigma2 a few times, but they all get dropped as duplicates. And in any case, on the server side we had, 5 seconds after sending Sigma2 (line 2396 in the accessory log):

[1633646956.576393][3488:3488] CHIP:SVR: Commissioning failed with error ../../examples/lighting-app/linux/third_party/connectedhomeip/src/platform/DeviceControlServer.cpp:48: CHIP Error 0x00000032: Timeout
[1633646956.576453][3488:3488] CHIP:SVR: Commissioning failed (attempt 1): ../../examples/lighting-app/linux/third_party/connectedhomeip/src/platform/DeviceControlServer.cpp:48: CHIP Error 0x00000032: Timeout

which happens before the retransmits.

But the big question is what closed the CASE establishment exchange on 2ndController after sending Sigma1, so that the delegate ended up null. @pan-apple any ideas?

@bluebin14
Copy link
Contributor

bluebin14 commented Oct 8, 2021

Here is what closed the CASE exchange:
In accessory_log.txt:
At 1633646951: Arm Fail-safe command with ExpiryLengthSeconds 5, meaning 5 seconds max till receiving CommissioningComplete command,
At 1633646956 i.e. 5 seconds later: CHIP Error 0x00000032: Timeout
Looks like 5 seconds is not enough for commissioning.

@tcarmelveilleux
Copy link
Contributor

5 seconds is definitely no enough! CASE can take longer than that on very constrained CPUs

@cecille
Copy link
Contributor

cecille commented Oct 8, 2021

K, so let's bump that up and see if it resolves things - that was set pre-case.

@bzbarsky-apple
Copy link
Contributor

Here is what closed the CASE exchange:
In accessory_log.txt:

That's on the server, though. But the exchange is already closed on the client (commissioner) when we get the Sigma2....

@bzbarsky-apple
Copy link
Contributor

That said, @cjandhyala could you try the change in #10357 and see if that helps?

@jmartinez-silabs
Copy link
Member

jmartinez-silabs commented Oct 12, 2021

Hi, I cherry-picked #10357 on the test_event_6 branch (82bcd4e) for the two controllers.

Multi-admin still often fails on my end for the second admin.
I use the same setup as describe in this ticket except that my end device node is a EFR32 lighting app

I mostly see two different errors when it fails

  1. controller # 2 give me the following error at the OpCSRResponse

Sigma3_TO_Controller#1.txt
Sigma3_TO_EFR32.txt
Sigma3_TO_Controller#2.txt

Received certificate signing request from the device
[1634068509.841325][3044:3052] CHIP:CTL: Getting certificate chain for the device from the issuer
[1634068509.841372][3044:3052] CHIP:CTL: Verifying Certificate Signing Request
[1634068509.841956][3044:3052] CHIP:CR:  ssl err  elliptic curve routines EC_POINT_set_affine_coordinates point is not on curve

[1634068509.842054][3044:3052] CHIP:CR:  ssl err  elliptic curve routines o2i_ECPublicKey EC lib

[1634068509.842138][3044:3052] CHIP:CR:  ssl err  elliptic curve routines eckey_pub_decode decode error

[1634068509.842217][3044:3052] CHIP:CR:  ssl err  x509 certificate routines x509_pubkey_decode public key decode error

[1634068509.842292][3044:3052] CHIP:CTL: Failed to process the certificate signing request
[1634068509.842371][3044:3052] CHIP:CTL: SyncSetKeyValue on StartKeyID
[1634068509.842531][3044:3052] CHIP:CTL: SyncSetKeyValue on PairedDevicec9
Failed to establish secure session to device: 172
  1. Controller # 2 timeout in on packet after sending SigmaR3 msg
Sent SigmaR3 msg
[1634070546.199595][3124:3132] CHIP:IN: New secure session created for device 0x00000000000000CA, key 3!!
[1634070546.199758][3124:3132] CHIP:CTL: Calling commissioning complete
[1634070546.199841][3124:3132] CHIP:DMG: ICR moving to [Initialize]
[1634070546.199961][3124:3132] CHIP:DMG: ICR moving to [AddCommand]
[1634070546.200395][3124:3132] CHIP:IN: Secure message was encrypted: Msg ID 0
[1634070546.200421][3124:3132] CHIP:IN: Build encrypted message 0x36f60d90 to 0x00000000000000CA of type 8 and protocolId 1 on exchange 6460r with MessageCounter 0.
[1634070546.200555][3124:3132] CHIP:IN: Sending encrypted msg 0x36f60d90 to 0x00000000000000CA at utc time: 12316362 msec
[1634070546.200580][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070546.200705][3124:3132] CHIP:DMG: ICR moving to [   Sending]
[1634070551.187118][3124:3132] CHIP:IN: Sending plaintext msg 0x36f60d78 to 0x0000000000000000 at utc time: 12321348 msec
[1634070551.187218][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.187434][3124:3132] CHIP:EM: Retransmit MessageCounter:A5298C7A Send Cnt 1
[1634070551.187521][3124:3132] CHIP:IN: Sending encrypted msg 0x36f60d90 to 0x00000000000000CA at utc time: 12321349 msec
[1634070551.187572][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.187852][3124:3132] CHIP:EM: Retransmit MessageCounter:00000000 Send Cnt 1
[1634070551.440478][3124:3132] CHIP:IN: Sending plaintext msg 0x36f60d78 to 0x0000000000000000 at utc time: 12321602 msec
[1634070551.440554][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.440724][3124:3132] CHIP:EM: Retransmit MessageCounter:A5298C7A Send Cnt 2
[1634070551.440808][3124:3132] CHIP:IN: Sending encrypted msg 0x36f60d90 to 0x00000000000000CA at utc time: 12321602 msec
[1634070551.440896][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.441142][3124:3132] CHIP:EM: Retransmit MessageCounter:00000000 Send Cnt 2
[1634070551.696626][3124:3132] CHIP:IN: Sending plaintext msg 0x36f60d78 to 0x0000000000000000 at utc time: 12321858 msec
[1634070551.696701][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.696857][3124:3132] CHIP:EM: Retransmit MessageCounter:A5298C7A Send Cnt 3
[1634070551.696937][3124:3132] CHIP:IN: Sending encrypted msg 0x36f60d90 to 0x00000000000000CA at utc time: 12321858 msec
[1634070551.697051][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.697226][3124:3132] CHIP:EM: Retransmit MessageCounter:00000000 Send Cnt 3
[1634070551.952746][3124:3132] CHIP:EM: Failed to Send CHIP MessageCounter:A5298C7A sendCount: 3 max retries: 3
[1634070551.952855][3124:3132] CHIP:EM: Failed to Send CHIP MessageCounter:00000000 sendCount: 3 max retries: 3
[1634070558.203486][3124:3132] CHIP:DMG: Time out! failed to receive invoke command response from Exchange: 6460i
[1634070558.203663][3124:3132] CHIP:ZCL: DefaultResponse:
[1634070558.203723][3124:3132] CHIP:ZCL:   Transaction: 0xffffabaf7390
[1634070558.203771][3124:3132] CHIP:ZCL:   status: EMBER_ZCL_STATUS_FAILURE (0x01)
[1634070558.203832][3124:3132] CHIP:CTL: Received failure response 1

[1634070558.203943][3124:3132] CHIP:CTL: SyncSetKeyValue on StartKeyID
[1634070558.204161][3124:3132] CHIP:CTL: SyncSetKeyValue on PairedDeviceca
Failed to establish secure session to device: 1

OTBR fails to update the last SRP service after receiving SigmaR3. I am not sure if it is related.
The following log snipped is from the EFR32 lighting app.

00> <detail> [SC] Received SigmaR3 msg
00> <detail> [DL] Thread packet RCVD: UDP, len 100
00> <detail> [DL]     src  FD11:33::DEA6:32FF:FE72:7537, port 44111
00> <detail> [DL]     dest FD11:22::F387:97AD:2458:B238, port 5540
00> <error > [DL] OnSrpClientNotification: Operation refused for security reasons
00> <error > [DL] OnSrpClientNotification: Operation refused for security reasons
00> <detail> [DL] Thread packet RCVD: UDP, len 645
00> <detail> [DL]     src  FD11:33::DEA6:32FF:FE72:7537, port 44111
00> <detail> [DL]     dest FD11:22::F387:97AD:2458:B238, port 5540

SSL_ERR_Controller#1.txt
SSL_ERR_EFR32.txt
SSL_ERR_Controller#2.txt

I am attaching all 3 logs for both scenario. (Ctrl 1, Ctrl 2, EFR32_Light_app)

Can you make any sense of either of those failures?

@cecille
Copy link
Contributor

cecille commented Oct 18, 2021

I think I'm running into the same issue now, and I'm noticing some logs that suggest we may be adding opcerts twice. It's hard to tell because the device attestation code re-uses some of the same stuff, but I'm seeing two entries in the fabric table for one commissioning.

@jmeg-sfy jmeg-sfy pinned this issue Nov 2, 2021
@jmeg-sfy jmeg-sfy unpinned this issue Nov 4, 2021
@msandstedt
Copy link
Contributor

#12821 will contribute to this.

@woody-apple woody-apple added bug Something isn't working needs v1.0 review labels Jan 25, 2022
@bzbarsky-apple
Copy link
Contributor

@cjandhyala Is this still an issue?

@woody-apple
Copy link
Contributor

Issue Review: Closing issues prior to TE8. Please re-file if this is still an issue on newer SDKs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

8 participants