-
-
Notifications
You must be signed in to change notification settings - Fork 429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(Many) Issues with Sitemap Event Subscription #3620
Comments
@lolodomo Do you think we need an alive event here, too? |
@J-N-K there is already an alive event from what I can see from the logs. However I don't see how it could solve any of those problems, as in all 3 cases the connection is successful and open. |
Correct. Maybe 1 minute is too long and the connection dies before? But I don't understand why the subscription is not found. I'll try to check that. |
For the client the connection is never closed (so I guess the server keeps it open), in my provided script (or also with curl) the connection stays open and will never be terminated by the server. |
I just tried with your script above (slightly modified):
and the result is:
and
|
So the question is why does it occur for me and not for you. Also I noticed the script does not 100% reproduce it for me and it is definitely depending on the pageId. I think it is related to how many updates/changing items a page has. For my "electricity" page where i show multiple items which update frequently (PV, battery, grid power, device power, ..., each 1-2 times per second) the script can reproduce it about 70% of the time, and better with a longer sleep time. For other pages it is more difficult to reproduce. Maybe after all it is related to load? I think also on the android app during the day the sitemaps are not refreshing way more often than at night, where way less items update per second. Also would be interesting if issue 2 and 3 from the bug description also do not occur for you as they occur for me. Do you know if there is a good guide to attach a debugger to my openhabian? If yes I can continue debugging if I find something. |
My development system is started with You can then use remote debugging (in IntelliJ IDEA: "Remote JVM Debug", debugger mode "Attach to remote JVM", command line "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005"). I believe there is a similar setting in Eclipse, but I don't know. Make sure that you checkout the openhab-core repository and use the 3.4.x branch, so you have the correct sources. |
Will try that thanks, also I wonder if your comment from #2891 is related
Especially during the day I would consider my openhab as high-traffic. Actually I think it could be about 50 item updates per second (a request to |
Also I noticed issue 2 and 3 do not occur at night, issue 1 still occurs. Right now I can only imagine it's somehow related to the lower rate of item updates at night. |
@J-N-K I enabled trace logs for
Nearly for all item updates I see that log line with 0 sinks.
How is this possible? the number of sinks is varying without the sinks changing. This aligns with the issue 2 and 3 I described in the description. Any ideas on that? Edit: Nvm I think it is unrelated. Just now (at around 21:30 CEST) magically all sitemaps/subscriptions work unconditionally (as every day around that time). When checking the trace logs then they are similar. I think this is because there are multiple SseBroadcaster instances. |
placing a breakpoint here and here conditionally for one item which is updated every 1s with the openhab rest api shows the following, so somewhere in openhab (core) the events get lost.
|
@J-N-K I'm quite sure I found the issue. I haven't figured out yet why it is so slow and the tasks pile up but I could imagine it's due to the same threadpool contains tasks for sitemap handling and also persistence handling. Possible fixes I can think of right now:
|
There was an issue with persistence handling. This has been fixed in OH4 |
@J-N-K checked with openhab4, it seems a bit faster but doesn't solve the issue, as in my last post described the queue still grows faster then it is processed. after 30min runtime it is at 15000. And also the CPU usage is incredible low. So it is probably still bottlenecked with i/o (as described in my previous post). We should at very least use a different thread pool for sitemap updates as for persistence handling. |
Do you know which task is the bottleneck? IIRC the one-thread solution was introduced to prevent issues with concurrent event processing in consumers. |
Unfortunately I don't. The sitemap subscriptions seem to be fast though. And also as described earlier, the CPU load of openhab is very low (40% on htop with a 4 core system, so 400% would be available). Network traffic is for sure also not the bottleneck, as I can do (non sitemap related) subscriptions as many as I want (SSE or websocket) which are all supplied with all events without any delay or htop spikes. I did not dive into the persistence logic, however I could imagine it is related as I am persisting many items (never noticed anything wrong there though) Sitemap subscriptions definitely need some fix (some suggestions mentioned earlier). They are completely unusable for openhab systems with many (30+) item updates per second. I'm surprised the issue was not noticed earlier. 30-50 updates per second does not sound much for me and also besides sitemap subscriptions openhab handles it with ease. |
I suppose you could try disabling persistence temporarily (unless you've already tried that). That would rule persistence in or out as the culprit. |
@mhilbush disabling the persistence will 99% mitigate the issue. Right now my setup is at the edge of what this single executor thread pool can manage. It's exactly 30 item changes per second. If I increase it to 35 item changes per second the queue will grow. Disabling the persistence will therefore not give us any hints, if it "solves" the issue, we still need a fix, as persistence is a wanted feature. And if it doesn't solve it we need the fix even more. Interesting would be the ratio (of runtime) for persistence tasks and sitemap subscription tasks. To get good numbers I think the only good way to do that would be load testing. Nevertheless I would suggest to at least split the executors, so one will do persistence and another one sitemap tasks. With that separated I could also measure again for my setup where are the limits for each of those executors then (if load tests are too much effort) |
Can you let us know why you have 30 item changes per second? That sounds awfully much |
Clearly! |
That is probably an order of magnitude more than I've experienced personally and I've seen in logs reported to the forum. I'm not saying improving OH's performance here isn't a good idea, but at least in our experience that's well outside the expected number of changes for a home automation system.
I think first understanding why they are one thread in the first place needs to be understood. simply splitting them may solve this specific problem but reintroduce a problem for everyone else. Both need to be accommodated. I don't know the right keywords to find the PR where the one thread was implemented. |
@rkoshak the code was introduced January 2019 (that's more than 4 years) with a single commit and never touched since then. I would assume it's forgotten with the introduction of OH3 ui. @J-N-K @lolodomo why not? If you are interested I can show you my home automation system with openhab. I love it, it's blazing fast, I have multiple tablets in my home with openhab sitemaps open, showing information about photovoltaic, electricity usage of devices, charge rates of car(s), charge rates of batteries, water flow rates, pool control like filter pressure and water flow and much more. Seeing those values update in real time is amazing, it's quite a difference to seeing the PV output 1 time every 10s compared to see 1-2 times a second the output of every PV string. And that's not everything, There's much more on my mind to add. As I said earlier, I'm running it on a raspbery pi, and it's not even close to struggle. CPU load is low and besides the sitemap subscriptions everything works perfectly and fast. And still I'm quite sure it can also handle that with ease, another point to proof that is that I can add as many subscriptions as I want and it's not affecting the issue, it's either all subscriptions (sitemaps) suffering at the same time or none. And even when the issue is occurring, load on htop is not high. |
I'm preparing a test bundle that has a thread per subscriber. This should still ensure the order of the events when they arrive at the subscriber but make the subscribers independent of each other. Would you be willing to test that? |
absolutely, thank you! Edit: Like in this example, #3620 (comment) here the queue was stacked with 1million tasks, though when the sitemap received an update it was the latest value. |
That issue was part of moving all the Eclipse Smart Home stuff back into OH. I think we would need to root around the PRs for ESH to figure how why that change was made.
We don't know that it was UI driven in the first place. That's my main concern. Maybe it is "forgotten" by MainUI but something in persistence or elsewhere depends on it. I just don't want to break things blindly just because the original PR is old. I looks like @J-N-K has you covered so it's all good. |
@rkoshak I also don't know. With my "expertise" looking through the code for a few days now, I would have used this method for sitemap subscription handling, just the same way it is used for websockets, which is really fast. And also this method is already used exactly for this usecase, just only for |
On the KARAF console:
then stop openHAB and re-start. This only works on OH4, latest SNAPSHOTs, otherwise there will be incompatibilities. |
Ok thanks I will try. How to revert back to the original version? |
Installing an arbitrary add-on should refresh the bundle to the original version. A restart is probably needed. |
Tested it, I don't really see a difference. I think I applied it correctly, because on the UI it reads |
If you see only one thread, then it is not working. Can you do
You can hog back to the original version by installing an add-on. |
I installed an add-on, with restart it did not go back to the original version. |
The change is in the |
Not sure how that would help for the issue. The sitemap subscription does not use the |
I have done some more refactoring for the sitemap resource. Pleas ADDITIONALLY
cleaning cache should revert all changes |
@J-N-K thanks! again didn't find the change you did, but whatever it is, that's definitely a huge improvement. Testing with slightly more than 50 updates/s, everything works perfectly. Sitemaps responses are fast and instant, nothing is delayed. |
I noticed some units are changed on the sitemap though, percentages now show e.g. 3700% instead of 37% and liters show 12000l instead of 12l, but I assume it's something fixed in 4.0 M2 (as with this version it works) while your changes are based on 4.0 ? |
Sitemap Event Subscription is incredible unreliable. There are many scenarios where events are not sent or subscriptions stall. This occurs and can be reproduced within the browser (basic UI) and the android app (sitemaps).
Expected Behavior
All changes to items should be reflected in real time for the sitemap which is currently opened.
Current Behavior
There are many issues where this is not the case:
This will create the following log output (grepping for this subscription id):
Also very important to notice is that the connection is kept open! It is not closed as I think was desired here. This tricks the client to think the connection will send events but it never will. You can also easily reproduce it with curl and manually copy the subscription id from request 1 to request 2, as this will always take longer than 2 seconds.
Here somehow only a few events are sent within minutes, while in reality multiple events should be sent per second. In this state, whenever I refresh the page/app manually it will update to the latest correct values which clearly indicates the current values were outdated.
I thought that openhab might be overloaded as indeed I have a setup with 10-30 item updates per second, but I don't think it is as CPU load is <= 50% and also everything loads incredible fast when refreshing manually and also all other services (like persistence) don't encounter any problems.
autoupdate="false"
are never (event-based) updated on the UI when changed on the UI. To reproduce, create such an item, include it to a sitemap asSetpoint
, write a script/rule which listens for commands of this item and update the state of the item. Now the logs look like this when changing it with the setpoint element:but the sitemap event stream never contains this change even for a working subscription (where point 1 and 2 did not occur).
Possible Solution
For me it looks like the sitemap event subscription could need a rework, maybe replacing it entirely with websockets as openhab supports this by now already.
Steps to Reproduce (for Bugs)
provided in the description
Context
I'm using sitemaps to show the state of my home. Sitemaps are awesome and in my opinion far superior to the OH3 UI (e.g. they are blazing fast, no cumbersome login setups required on android, native app navigation works, video streams work through openhab cloud, way easier to setup, etc.). I think it's a pity they don't see a lot of love since OH3 UI.
Your Environment
The text was updated successfully, but these errors were encountered: