Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] CASE timeout is to long on LIT ICD #35354

Open
ldoering opened this issue Sep 2, 2024 · 12 comments
Open

[BUG] CASE timeout is to long on LIT ICD #35354

ldoering opened this issue Sep 2, 2024 · 12 comments
Labels
bug Something isn't working icd Intermittently Connected Devices needs triage

Comments

@ldoering
Copy link

ldoering commented Sep 2, 2024

Reproduction steps

After sending/receiving a Sigma1 (but no follow up). a CASE Session is opened but no timeout is ever triggered. This leads to an error when trying to etablish a new CASE Sesssion;
[SC] Received error (protocol code 4) during pairing process: src/protocols/secure_channel/CASESession.cpp:2144: CHIP Error 0x000000DB: The Resource is busy and cannot process the request
It also prevents an ICD to leave active mode, draining the battery (#35355).

I reproduced the issue with the LIT-ICD App on the Silabs EFR32 Platform, compiled with the Following Parameters: --icd --low-power enable_openthread_cli=false chip_enable_ota_requestor=true chip_persist_subscriptions=false
I also enabled the openthread log, so I can see poll atempts to check the interval.

  • Flash the software, commission the device
  • Trigger a CASE Session from Chiptool, but exit Chiptool directly after sending the Sigma1
  • Watch the Logs of the Device. It will never timeout the case session
  • Wait some time (at least the expected case timeout)
  • Start Chiptool again, Trigger a CASE Session again
  • Observer the "Resource is busy" Error message in Chiptool and the device log.

The Issue can also be observed in the reverse way, if the DUT triggered the CASE Session, but I did not manage to do that with an unmodified example.

Bug prevalence

Everytime I ran the listed steps.

GitHub hash of the SDK that was being used

0e3434a

Platform

efr32

Platform Version(s)

No response

Anything else?

No response

@bzbarsky-apple
Copy link
Contributor

a CASE Session is opened but no timeout is ever triggered.

This should happen after 30+ seconds, when the device stops waiting for a response to the Sigma2 that it sent.

@ldoering when you "Wait some time (at least the expected case timeout)", what was the actual wait time and where did the "expected case timeout" come from? Can you please attach, not paste, the logs from the device that was the CASE responder in this case?

@ldoering
Copy link
Author

ldoering commented Sep 4, 2024

I expected the session to fail after 30s (observing the retries after several seconds) and waited multiple minutes without observing a timeout. To get the actual expected timeout I added logging to SetResponseTimeout in ExchangeContext.cpp.
Result:
[00:00:06.805][info ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000da00
[00:00:06.805][info ][EM] Set Timeout: 45896846 ms

This timeout ist much longer than expected, it it based on the CHIP_DEVICE_CONFIG_ICD_SLOW_POLL_INTERVAL which is set to 3600000ms (1 hour) in the silabs LIT-ICD sample app (examples/lit-icd-app/silabs/openthread.gni), with additional backoff.

What kind of timeouts are expected in the LIT case? Is the prolonged interval a bug or by design? (I assume a bug, since it also prevents the ICD from sleeping)

@ldoering ldoering changed the title [BUG] CASE timeout not triggered [BUG] CASE timeout not triggered on LIT ICD Sep 4, 2024
@bzbarsky-apple
Copy link
Contributor

Hmm. src/messaging/ReliableMessageProtocolConfig.cpp does:

config.mIdleRetransTimeout += ICDConfigurationData::GetInstance().GetSlowPollingInterval();

in GetLocalMRPConfig. And then in CASESession's ComputeRoundTripTimeout we have:

        GetRetransmissionTimeout(localMRPConfig.mActiveRetransTimeout, localMRPConfig.mIdleRetransTimeout,
                                 // Peer will assume we are active, since it's
                                 // responding to our message.
                                 System::SystemClock().GetMonotonicTimestamp(), localMRPConfig.mActiveThresholdTime);

What values are you seeing passed to GetRetransmissionTimeout there? What does it return?

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Sep 4, 2024

@ldoering And definitely a timeout of 45896846s for a Sigma2 seems like a bug.

@ldoering
Copy link
Author

ldoering commented Sep 5, 2024

For the LIT-Mode:

[00:00:08.661][info  ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000b238
[00:00:08.661][info  ][EM] S ComputeRoundTripTimeout: processing 30000
[00:00:08.662][info  ][EM] GetRetransmissionTimeout: active 300 idle 500 tslast 2 thresh 4000
[00:00:08.662][info  ][EM] UA GetMessageReceiptTimeout: ourLastActivity 8662
[00:00:08.662][info  ][EM] GetRetransmissionTimeout: active 3000 idle 3602000 tslast 0 thresh 5000
[00:00:08.663][info  ][EM] Set Timeout: 45896846 ms

I also disabled the LIT Mode and received a timeout of 281907 ms for the SIT configuration.

[00:03:37.693][info  ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000b238
[00:03:37.693][info  ][EM] S ComputeRoundTripTimeout: processing 30000
[00:03:37.693][info  ][EM] GetRetransmissionTimeout: active 300 idle 500 tslast 2 thresh 4000
[00:03:37.694][info  ][EM] UA GetMessageReceiptTimeout: ourLastActivity 217694
[00:03:37.694][info  ][EM] GetRetransmissionTimeout: active 3000 idle 17000 tslast 0 thresh 5000
[00:03:37.694][info  ][EM] Set Timeout: 281907 ms

I also attached the full device log for both modes:

ICD SIT Timeout.txt
ICD LIT Timeout.txt

@bzbarsky-apple
Copy link
Contributor

@ldoering Is tslast 0 logging the lastActivityTime value inside the GetRetransmissionTimeout call? If so, both 2 and 0 seem like really odd values for that; who is calling it with those values?

I assume UA GetMessageReceiptTimeout means UnathenticatedSession::GetMessageReceiptTimeout, but that should be passing ourLastActivity for lastActivityTime, which is being logged as very much not being 0 in both of the logs above.

@bzbarsky-apple
Copy link
Contributor

Oh, this is interesting. So ourLastActivity is a TimeStamp, but lastActivityTime on GetRetransmissionTimeout is a Timeout.

using Timeout = Milliseconds32;
using Timestamp = Milliseconds64;

Does fixing the type of the lastActivityTime argument of GetRetransmissionTimeout to be Timestamp instead of Timeout change the behavior at all for you?

@bzbarsky-apple
Copy link
Contributor

Ah, this even came up in the review in #33093 but it was missed that this is in fact a serious problem because the actual timestamp might not fit in a 32-bit millisecond counter.... That said, I can't quite reconcile this possible issue with the values for ourLastActivity and tslast logged above, but I don't know what casting and logging format was used there....

@ldoering
Copy link
Author

ldoering commented Sep 6, 2024

Sorry to confuse you there. I converted the timestamps to 32bit timeout for printing since the lib did not support direct 64bit printing. I assumed that this would be fine, since all tests run directly after reboot, so 32bit should be sufficient to count the runtime in ms.
Tslast is timeSinceLastActivity, I thought the absolute time passed in the lastActivityTime would not be too helpful. Since this is run directly after receiving the packet it looked good to me.

To me it looks like a problem to use the LIT idle time to calculate timeouts.

@ldoering ldoering changed the title [BUG] CASE timeout not triggered on LIT ICD [BUG] CASE timeout is to long on LIT ICD Sep 6, 2024
@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Sep 6, 2024

To me it looks like a problem to use the LIT idle time to calculate timeouts.

Sure, but why is it being used?

I wonder whether one issue is that the threshold (5000) is so close to the active interval (4000). What that's saying is:

  1. In active mode we check for activity ever 4 seconds.
  2. After 5 seconds of inactivity we go into idle mode.
  3. In Idle mode we check every 3602000 seconds or something.

So when computing how MRP will work here, assuming we start in active mode.

  1. First packet will get sent.
  2. Second packet will get sent something like 4 seconds later.
  3. If the device has not gotten the packet by now it will transition to idle mode.
  4. The idle interval will be used for computing the backoff on the remaining retransmits.

What the device actually does is that it never transitions to idle mode at all. But that's not what it claims to do, either to the other side or to itself....

@turon @Damian-Nordic @mkardous-silabs we need to figure out what should actually be happening here, because this is clearly broken.

@Damian-Nordic
Copy link
Contributor

Damian-Nordic commented Sep 10, 2024

The question is if we should treat the active threshold shorter than all the MRP backoffs (using the active interval) as misconfiguration and we should warn a user if such is configured, or we should just use it to cancel a transmission earlier (before retransmitting a message 4 times).

In the latter case, we could make GetRetransmissionTimeout() take a flag indicating if the device operates as LIT and it should be modified as such:

for (uint8_t i = 0; i < CHIP_CONFIG_RMP_DEFAULT_MAX_RETRANS + 1; i++)
{
    System::Clock::Timeout baseInterval;
    if (timeSinceLastActivity + timeout < activityThreshold)
    {
        baseInterval = activeInterval;
    }
    else if (lit)
    {
        break;
    }
    else
    {
         baseInterval = idleInterval;
    }
    timeout += Messaging::ReliableMessageMgr::GetBackoff(baseInterval, i, /* computeMaxPossible */ true);
}

However, then ReliableMessageMgr::CalculateNextRetransTime() would also detect whenever an idle interval is about to be used for LIT and remove the message from the retrans table in such a case.

Btw, System::Clock::Timestamp could be replaced with System::Clock::Timeout as our timers don't handle 64-bit timeouts anyway.

@bzbarsky-apple
Copy link
Contributor

As @Damian-Nordic pointed out on Slack, GetRetransmissionTimeout also needs to match the logic in ReliableMessageMgr::CalculateNextRetransTime which always treats things as "active" for non-initial messages on the exchange, which is very much not what GetRetransmissionTimeout() does.

@mkardous-silabs mkardous-silabs added the icd Intermittently Connected Devices label Oct 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working icd Intermittently Connected Devices needs triage
Projects
Development

No branches or pull requests

4 participants