-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] [ReadHandler] Regression in report engine caused by #28104 #28434
Comments
@plan44
Logs in the file, which is weird given that this log has been removed in this specific PR. Have you tried a clean rebuild of the bridge app? The fact that this log is still happening leads me to believe something might not have been updated properly. |
…n of ReportScheduler into the ReadHandler and IM engine. (#28104) - this commit causes endless re-subscriptions and thus flash wear - is being investigated in project-chip/connectedhomeip#28434
I thought I had, but apparently not, my bad, sorry. I did now (deleting the Please find the new log attached. |
So far I wasn't able to replicate the behavior with the linux example bridge app on a raspberrypi 4. I initially build the linux bridge app and commissioned it with an applet TV on c9ce7f2. This didn't show the behavior in the logs here so I tried replicating the subscription resumption shown in the logs. The steps used to try to replicate with resumption were :
From the log lines:
I am suspecting that the flash memory is already full from the previous failures to re-establish the subscription, thus causing the failure in the new log. Would you mind trying to clear the persistant memory to see if that might help solving the issue? Also, I can't seem to reconstruct the proper chain of event? When you built the first time with c9ce7f2, were you resuming a subscription as I assumed? If not, what exactly were the steps that led you to this? |
Thanks for the investigation.
This is not related to flash (BTW, it technically couldn't be, because the persistent data store on Linux is just a file, with no size limits except for the disk size, which is huge). This error just signals that the report response needs to be chunked. It happens at the TLV level, bubbles up to Now this could be the difference between your and my setup: while my bridge is based on the bridge-example, in the meantime it manages a larger number of active endpoints with more data to report, so chunked reports are happening. Probably, in the reduced bridge-example setup, there's no chunking happening at all? |
Responding to your other points:
Correct.
I am a bit reluctant to do that with the bridge I found this problem with - because given that nobody else has complained so far, this seems pretty much an edge case. But on that setup, it is a 100% A-B testable one - if I run the build from c9ce7f2, the app enters the resubscription loop, if I run the build from fe4c679 with the same persistent memory data everything runs normal. I did this back and forth several times. I'd hate to loose that A-B testability and maybe the problem itself, until it becomes more clear what is actually happening. So I'd rather try to reproduce the problem on a second bridge, with a similar set of dynamic endpoints. I'll try that on Monday and will post what I find. |
Thanks for the insight. The chunking is definitely worth investigating, our CI is testing for Chunking capability but this seems to be a corner case where we have re-subscribe + chunking, which might need more testing on our CI. In the meantime another bug was Identified where the readHandler is scheduling itself with a min and a max of 0:
I am reworking the scheduler/readHandler to prevent that. In the event that it doesn't fix your problem I would definitely need to setup a re-subcription configuration with chunking so please let me know if you are able to reproduce this bug on a second bridge. |
@lpbeliveau-silabs Unfortunately, #28536 does not solve the problem. I did a A-B test with full rebuilds of both bf0b45a (which includes d62d80b) and fe4c679. I repeated the A-B test several times, it is still 100% reproducible. Please find two logs attached - this is the same device, with no other changes than once using a build off bf0b45a (loops) and one off fe4c679 (works ok). Comparing the two, I see that things start to go differently only after report chunking gets enabled (at line 423 in Maybe the A-B logs (with details enabled) can already reveal something. I will now also commission a second bridge with the same set of devices and see what happens there (and post the results here) |
@lpbeliveau-silabs As promised, now I commissioned a second bridge with (nearly) the same set of devices. It shows exactly the same A-B behaviour as the original unit. What I did:
testbridge-setup - loops - bf0b45a.txt This clearly shows that the issue cannot be stale persistent data (there is none in the first run, and still the resubscription loop happens). I hope this helps to narrow down the cause for this problem. Just let me know when you need more specific information. |
Adding cert blocker so I can track this for now. |
@plan44 Thank you for providing the logs, there clearly seems to have some chunking related problems. This might be due to a change in behavior where in the previous implementation, chunking would result in cancelling the subscription refresh timer without restarting it were-as now it doesn't cancel it, which might be why we see this problem in your case as your subscription resumption takes longer than the max interval to expire, which might lead to a timer generating a report ahead of the subscription being completed. Weirdly enough, the functioning subscription resumption ends in two failures to establish a case session:
were-as the looping one only has one failure at that point:
And proceeds to receive another sigma 1. |
I have raised my number of devices on the bridge, generated some chunking on commissioning and re-subscription and I am still unable to reproduce the issue. Would it be possible for you to provide the code / executable or at least the endpoint/cluster configurations of your app so I can try to reproduce it more reliably locally? Also, what is the version of your apple tv? I am using tvOS 16.6. I have some hypothesis as to what is happening, it seems the new ReportScheduler ends up not sending a report at max interval once the subscription is done, which might be why the appleTV keeps sending sigma1 over and over, but I cannot reproduce it locally so it is hard to figure out exactly why this is happening in this specific case. On your side, could you try reproducing the issue using chiptool? If this fails with the same intervals, could you try increasing the intervals to5s or 10s to see if it confirms the timing hypothesis? |
The bridge project is open source, http://github.com/plan44/p44mbrd. I don't know if that helps, it has evolved from the original bridge-app quite a bit. I am fully aware that this is hard to debug, and also that the problem is most likely caused by my particular setup of the overall app, which certainly has slightly different timing than the original bridge app had. In particular, the setup is strictly single-threaded which the standard linux examples are not (these need a separate thread in SystemLayerImplSelect). The DNS-SD part is using Avahi, not minimal-mDNS. Still, I cannot see anything actually wrong in that setup, just a bit different from the example. Can you imagine anything in your report engine changes that would rely on a specific order of events being processed, that might not be possible in a strictly single-threaded setup?
I just did the same A-B test on a different hardware platform (RaspberryPi B, vs MediaTek MT7688 before) with a reduced number of devices (just root EP0, bridge EP1 and two "Extended Color Light" (0x010d) EPs). The A-B behaviour is exactly the same. So I doubt it is something caused only by a particular EP configuration only.
Same here: tvOS 16.6 (20M73) Let me know if you have ideas what I could do to help narrow this down. |
I did some more digging in the Rpi A-B logs and a new log from the original bridge app I built from bf0b45a.
So the question is why that report gets delayed so much in the bf0b45a RE, and not in the fe4c679 one. I'm trying to track this down here right now, as it is most likely somehow related to my particular setup and don't want to waste more of your time before I understand that detail. Still, I wonder, on a larger scale, if handling a relatively minor report delay every time with a quite drastic and resource intensive (flash wear) SecureSession restart in 5 second intervals forever is a good idea. Is this mandated by the specs (did not find it so far) or just Apple's choice of implementation? |
…roject-chip#28434) This fixes a ReportEngine problem that was caused by libev-based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain).
@lpbeliveau-silabs I found the problem, finally 😅! All of it was caused by the report timer firing a few milliseconds too early. Yes, it had to do with my setup, which is that it uses a libev mainloop and timers (similar to how Darwin uses dispatch). So my fault, basically 😞 Here's what happened (log from the Rpi bf0b45a build, with some extra instrumentation log lines (
So it boils down to the ReportEngine absolutely relying on timers NEVER firing even only a bit too early. Which is a reasonable assumption, and apparently true for other timer implementations, but not libev's. As adding libev was my contribution (#24232), I should have realized that. My bad, sorry. On the other hand, with timers always comes the question of tolerance. If implemenations may rely on absolutely none in the negative direction, it maybe should be stated in the header for |
…ly too early (project-chip#28434) This fixes a ReportEngine problem that was caused by libev-based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp
…roject-chip#28434) This fixes a ReportEngine problem that was caused by libev based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp
…ly too early (project-chip#28434) This fixes a ReportEngine problem that was caused by libev-based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp
That is definitely a bug in the scheduler.
No, it's not. This assumption definitely does not hold on some platforms. Filed #28929. This is a must-fix for 1.2, @lpbeliveau-silabs |
…roject-chip#28434) This fixes a ReportEngine problem that was caused by libev based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp
) * SystemLayerImplSelect: libev: avoid timers firing slightly too early (#28434) This fixes a ReportEngine problem that was caused by libev based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp * SystemLayerImplSelect: libev: eliminate I/O wakeup thread libev based regular builds do not need a separate I/O wakeup thread, so the wakeup thread can be eliminated in CHIP_SYSTEM_CONFIG_USE_LIBEV case. In normal operation, libev based builds also never call Signal(), so if it is still called it now emits a error log message, to indicate something might be wrong in the setup. However we keep Signal() and related LayerSocketsLoop methods, following the Darwin dispatch implementation, as fallback to select-based event handling seems to be needed for some I/O tests. As noted in a comment to the original libev PR [1], eventually, the select() based mainloop and external mainloop based solutions like Darwin Dispatch and libev should be detangled and extracted into separate classes, adapting all the tests that somehow rely on the select() fallback. As a single self-funded developer however, I cannot possibly be expected to solve this for Apple ;-) [1] #24232 (review) * SystemLayerImplSelect: reword comment: early firing timers can happen - the fix prevents them in normal libev case - however the caller MUST NOT rely on timers *never* firing a bit early
…#28434) (project-chip#28740) * SystemLayerImplSelect: libev: avoid timers firing slightly too early (project-chip#28434) This fixes a ReportEngine problem that was caused by libev based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp * SystemLayerImplSelect: libev: eliminate I/O wakeup thread libev based regular builds do not need a separate I/O wakeup thread, so the wakeup thread can be eliminated in CHIP_SYSTEM_CONFIG_USE_LIBEV case. In normal operation, libev based builds also never call Signal(), so if it is still called it now emits a error log message, to indicate something might be wrong in the setup. However we keep Signal() and related LayerSocketsLoop methods, following the Darwin dispatch implementation, as fallback to select-based event handling seems to be needed for some I/O tests. As noted in a comment to the original libev PR [1], eventually, the select() based mainloop and external mainloop based solutions like Darwin Dispatch and libev should be detangled and extracted into separate classes, adapting all the tests that somehow rely on the select() fallback. As a single self-funded developer however, I cannot possibly be expected to solve this for Apple ;-) [1] project-chip#24232 (review) * SystemLayerImplSelect: reword comment: early firing timers can happen - the fix prevents them in normal libev case - however the caller MUST NOT rely on timers *never* firing a bit early
Reproduction steps
Commit c9ce7f2 from PR #28104 causes a regression in subscription handling, which is 100% reproducible in my (Linux Bridge) commissioned into a AppleTV:
Bug prevalence
Always
GitHub hash of the SDK that was being used
c9ce7f2
Platform
core
Platform Version(s)
n/a
Anything else?
For some reason, the report engine keeps generating a IM:ReportData message that is rejected by the controller (AppleTV) with INVALID_SUBSCRIPTION. This causes the contoller to send a new IM:SubscribeRequest in a new session, which again causes the invalid IM:ReportData, forever (not good for the flash storage).
@bzbarsky-apple comments in #28104:
Here's a log excerpt of the repeating part:
(full log as attachment)
c9ce7f2 - reporting regression.txt
The text was updated successfully, but these errors were encountered: