[BUG] CASE timeout is to long on LIT ICD #35354

ldoering · 2024-09-02T13:34:36Z

Reproduction steps

After sending/receiving a Sigma1 (but no follow up). a CASE Session is opened but no timeout is ever triggered. This leads to an error when trying to etablish a new CASE Sesssion;
[SC] Received error (protocol code 4) during pairing process: src/protocols/secure_channel/CASESession.cpp:2144: CHIP Error 0x000000DB: The Resource is busy and cannot process the request
It also prevents an ICD to leave active mode, draining the battery (#35355).

I reproduced the issue with the LIT-ICD App on the Silabs EFR32 Platform, compiled with the Following Parameters: --icd --low-power enable_openthread_cli=false chip_enable_ota_requestor=true chip_persist_subscriptions=false
I also enabled the openthread log, so I can see poll atempts to check the interval.

Flash the software, commission the device
Trigger a CASE Session from Chiptool, but exit Chiptool directly after sending the Sigma1
Watch the Logs of the Device. It will never timeout the case session
Wait some time (at least the expected case timeout)
Start Chiptool again, Trigger a CASE Session again
Observer the "Resource is busy" Error message in Chiptool and the device log.

The Issue can also be observed in the reverse way, if the DUT triggered the CASE Session, but I did not manage to do that with an unmodified example.

Bug prevalence

Everytime I ran the listed steps.

GitHub hash of the SDK that was being used

0e3434a

Platform

efr32

Platform Version(s)

No response

Anything else?

No response

The text was updated successfully, but these errors were encountered:

bzbarsky-apple · 2024-09-03T22:33:33Z

a CASE Session is opened but no timeout is ever triggered.

This should happen after 30+ seconds, when the device stops waiting for a response to the Sigma2 that it sent.

@ldoering when you "Wait some time (at least the expected case timeout)", what was the actual wait time and where did the "expected case timeout" come from? Can you please attach, not paste, the logs from the device that was the CASE responder in this case?

ldoering · 2024-09-04T12:43:30Z

I expected the session to fail after 30s (observing the retries after several seconds) and waited multiple minutes without observing a timeout. To get the actual expected timeout I added logging to SetResponseTimeout in ExchangeContext.cpp.
Result:
[00:00:06.805][info ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000da00
[00:00:06.805][info ][EM] Set Timeout: 45896846 ms

This timeout ist much longer than expected, it it based on the CHIP_DEVICE_CONFIG_ICD_SLOW_POLL_INTERVAL which is set to 3600000ms (1 hour) in the silabs LIT-ICD sample app (examples/lit-icd-app/silabs/openthread.gni), with additional backoff.

What kind of timeouts are expected in the LIT case? Is the prolonged interval a bug or by design? (I assume a bug, since it also prevents the ICD from sleeping)

bzbarsky-apple · 2024-09-04T22:32:23Z

Hmm. src/messaging/ReliableMessageProtocolConfig.cpp does:

config.mIdleRetransTimeout += ICDConfigurationData::GetInstance().GetSlowPollingInterval();

in GetLocalMRPConfig. And then in CASESession's ComputeRoundTripTimeout we have:

        GetRetransmissionTimeout(localMRPConfig.mActiveRetransTimeout, localMRPConfig.mIdleRetransTimeout,
                                 // Peer will assume we are active, since it's
                                 // responding to our message.
                                 System::SystemClock().GetMonotonicTimestamp(), localMRPConfig.mActiveThresholdTime);

What values are you seeing passed to GetRetransmissionTimeout there? What does it return?

bzbarsky-apple · 2024-09-04T22:33:49Z

@ldoering And definitely a timeout of 45896846s for a Sigma2 seems like a bug.

ldoering · 2024-09-05T12:05:54Z

For the LIT-Mode:

[00:00:08.661][info  ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000b238
[00:00:08.661][info  ][EM] S ComputeRoundTripTimeout: processing 30000
[00:00:08.662][info  ][EM] GetRetransmissionTimeout: active 300 idle 500 tslast 2 thresh 4000
[00:00:08.662][info  ][EM] UA GetMessageReceiptTimeout: ourLastActivity 8662
[00:00:08.662][info  ][EM] GetRetransmissionTimeout: active 3000 idle 3602000 tslast 0 thresh 5000
[00:00:08.663][info  ][EM] Set Timeout: 45896846 ms

I also disabled the LIT Mode and received a timeout of 281907 ms for the SIT configuration.

[00:03:37.693][info  ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000b238
[00:03:37.693][info  ][EM] S ComputeRoundTripTimeout: processing 30000
[00:03:37.693][info  ][EM] GetRetransmissionTimeout: active 300 idle 500 tslast 2 thresh 4000
[00:03:37.694][info  ][EM] UA GetMessageReceiptTimeout: ourLastActivity 217694
[00:03:37.694][info  ][EM] GetRetransmissionTimeout: active 3000 idle 17000 tslast 0 thresh 5000
[00:03:37.694][info  ][EM] Set Timeout: 281907 ms

I also attached the full device log for both modes:

ICD SIT Timeout.txt
ICD LIT Timeout.txt

bzbarsky-apple · 2024-09-06T01:32:15Z

@ldoering Is tslast 0 logging the lastActivityTime value inside the GetRetransmissionTimeout call? If so, both 2 and 0 seem like really odd values for that; who is calling it with those values?

I assume UA GetMessageReceiptTimeout means UnathenticatedSession::GetMessageReceiptTimeout, but that should be passing ourLastActivity for lastActivityTime, which is being logged as very much not being 0 in both of the logs above.

bzbarsky-apple · 2024-09-06T01:35:38Z

Oh, this is interesting. So ourLastActivity is a TimeStamp, but lastActivityTime on GetRetransmissionTimeout is a Timeout.

using Timeout = Milliseconds32;
using Timestamp = Milliseconds64;

Does fixing the type of the lastActivityTime argument of GetRetransmissionTimeout to be Timestamp instead of Timeout change the behavior at all for you?

bzbarsky-apple · 2024-09-06T01:41:31Z

Ah, this even came up in the review in #33093 but it was missed that this is in fact a serious problem because the actual timestamp might not fit in a 32-bit millisecond counter.... That said, I can't quite reconcile this possible issue with the values for ourLastActivity and tslast logged above, but I don't know what casting and logging format was used there....

ldoering · 2024-09-06T06:37:20Z

Sorry to confuse you there. I converted the timestamps to 32bit timeout for printing since the lib did not support direct 64bit printing. I assumed that this would be fine, since all tests run directly after reboot, so 32bit should be sufficient to count the runtime in ms.
Tslast is timeSinceLastActivity, I thought the absolute time passed in the lastActivityTime would not be too helpful. Since this is run directly after receiving the packet it looked good to me.

To me it looks like a problem to use the LIT idle time to calculate timeouts.

bzbarsky-apple · 2024-09-06T17:10:40Z

To me it looks like a problem to use the LIT idle time to calculate timeouts.

Sure, but why is it being used?

I wonder whether one issue is that the threshold (5000) is so close to the active interval (4000). What that's saying is:

In active mode we check for activity ever 4 seconds.
After 5 seconds of inactivity we go into idle mode.
In Idle mode we check every 3602000 seconds or something.

So when computing how MRP will work here, assuming we start in active mode.

First packet will get sent.
Second packet will get sent something like 4 seconds later.
If the device has not gotten the packet by now it will transition to idle mode.
The idle interval will be used for computing the backoff on the remaining retransmits.

What the device actually does is that it never transitions to idle mode at all. But that's not what it claims to do, either to the other side or to itself....

@turon @Damian-Nordic @mkardous-silabs we need to figure out what should actually be happening here, because this is clearly broken.

Damian-Nordic · 2024-09-10T07:23:02Z

The question is if we should treat the active threshold shorter than all the MRP backoffs (using the active interval) as misconfiguration and we should warn a user if such is configured, or we should just use it to cancel a transmission earlier (before retransmitting a message 4 times).

In the latter case, we could make GetRetransmissionTimeout() take a flag indicating if the device operates as LIT and it should be modified as such:

for (uint8_t i = 0; i < CHIP_CONFIG_RMP_DEFAULT_MAX_RETRANS + 1; i++)
{
    System::Clock::Timeout baseInterval;
    if (timeSinceLastActivity + timeout < activityThreshold)
    {
        baseInterval = activeInterval;
    }
    else if (lit)
    {
        break;
    }
    else
    {
         baseInterval = idleInterval;
    }
    timeout += Messaging::ReliableMessageMgr::GetBackoff(baseInterval, i, /* computeMaxPossible */ true);
}

However, then ReliableMessageMgr::CalculateNextRetransTime() would also detect whenever an idle interval is about to be used for LIT and remove the message from the retrans table in such a case.

Btw, System::Clock::Timestamp could be replaced with System::Clock::Timeout as our timers don't handle 64-bit timeouts anyway.

bzbarsky-apple · 2024-09-10T21:02:34Z

As @Damian-Nordic pointed out on Slack, GetRetransmissionTimeout also needs to match the logic in ReliableMessageMgr::CalculateNextRetransTime which always treats things as "active" for non-initial messages on the exchange, which is very much not what GetRetransmissionTimeout() does.

ldoering added bug Something isn't working needs triage labels Sep 2, 2024

ldoering mentioned this issue Sep 2, 2024

[ICD] ICD Fail Safe against battery draining #35355

Open

ldoering changed the title ~~[BUG] CASE timeout not triggered~~ [BUG] CASE timeout not triggered on LIT ICD Sep 4, 2024

ldoering changed the title ~~[BUG] CASE timeout not triggered on LIT ICD~~ [BUG] CASE timeout is to long on LIT ICD Sep 6, 2024

yunhanw-google added this to [Feature] Intermittently Connected Devices (ICD) Sep 9, 2024

mkardous-silabs added the icd Intermittently Connected Devices label Oct 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] CASE timeout is to long on LIT ICD #35354

[BUG] CASE timeout is to long on LIT ICD #35354

ldoering commented Sep 2, 2024 •

edited

Loading

bzbarsky-apple commented Sep 3, 2024

ldoering commented Sep 4, 2024

bzbarsky-apple commented Sep 4, 2024

bzbarsky-apple commented Sep 4, 2024 •

edited

Loading

ldoering commented Sep 5, 2024

bzbarsky-apple commented Sep 6, 2024

bzbarsky-apple commented Sep 6, 2024

bzbarsky-apple commented Sep 6, 2024

ldoering commented Sep 6, 2024

bzbarsky-apple commented Sep 6, 2024 •

edited

Loading

Damian-Nordic commented Sep 10, 2024 •

edited

Loading

bzbarsky-apple commented Sep 10, 2024

[BUG] CASE timeout is to long on LIT ICD #35354

[BUG] CASE timeout is to long on LIT ICD #35354

Comments

ldoering commented Sep 2, 2024 • edited Loading

Reproduction steps

Bug prevalence

GitHub hash of the SDK that was being used

Platform

Platform Version(s)

Anything else?

bzbarsky-apple commented Sep 3, 2024

ldoering commented Sep 4, 2024

bzbarsky-apple commented Sep 4, 2024

bzbarsky-apple commented Sep 4, 2024 • edited Loading

ldoering commented Sep 5, 2024

bzbarsky-apple commented Sep 6, 2024

bzbarsky-apple commented Sep 6, 2024

bzbarsky-apple commented Sep 6, 2024

ldoering commented Sep 6, 2024

bzbarsky-apple commented Sep 6, 2024 • edited Loading

Damian-Nordic commented Sep 10, 2024 • edited Loading

bzbarsky-apple commented Sep 10, 2024

ldoering commented Sep 2, 2024 •

edited

Loading

bzbarsky-apple commented Sep 4, 2024 •

edited

Loading

bzbarsky-apple commented Sep 6, 2024 •

edited

Loading

Damian-Nordic commented Sep 10, 2024 •

edited

Loading