[dev.icinga.com #10435] checks start "lagging" when more than two endpoints in a zone #3533
Comments
Updated by sudeshkumar on 2016-01-14 14:58:31 +00:00
I too have a related issue. My setup is a three-node cluster in a single zone. At random, the check results of one of the nodes stop syncing. I have enabled the debug log and confirmed that the checks are happening but the check results are not syncing. I don't see the relay message entries ("notice/ApiListener: Relaying") in the debug log of the affected node when this issue happens. When I went through the code, it seems the check results are pushed into m_RelayQueue but not processed. I can also see the work queue size keeps increasing on the affected node. Please find the attached screenshot.
Updated by sudeshkumar on 2016-01-18 13:46:16 +00:00 For some reason "m_Spawned" is already set to true before it is assigned inside the WorkQueue::Enqueue method, so the worker thread for the API listener relay messages is never created, and that caused the issue. I confirmed it by adding some debug statements and using manual builds. It doesn't happen every time; it only shows up sometimes after stopping and starting Icinga on one of the nodes, and I have been unable to find the exact scenario since the result is indeterminate. Because of that, the OOM (out-of-memory) killer sometimes kills Icinga after it has consumed too much memory. Does anybody have the same issue? Currently I am using my lab instance to test the cluster performance with 6000+ hosts and 38000+ services, all using the check_dummy plugin. Please help me to resolve this.
Updated by mfriedrich on 2016-02-24 22:21:30 +00:00
Please re-test this with 2.4.3.
Updated by penyilas on 2016-03-03 12:16:55 +00:00 dnsmichi wrote: "Please re-test this with 2.4.3."
Hi, I've tested it with CentOS 7 / Icinga 2.4.3, and it's still not working.
Updated by mfriedrich on 2016-07-28 16:14:45 +00:00 We've encountered a similar problem with 4 endpoints in a zone. Our current suggestion is to use 2 endpoints for now, until a proper investigation and fix happen.
Any news?
Would also like to see an update on this. If you simply want to have a group of 'clients' or 'workers' in a pool, you can't. This seems like the simplest of cluster setups, which should easily be supported?
It should. If you can help us fix the issue, i.e. by looking into the current message loop and proposing a fix, or granting us time to look into it, we can speed things up here.
I haven't done much contributing before, but I can take a look through the stack traces above and try to correlate what is going on. Unfortunately C is not my strong point.
What is the status here?
Hello dear core devs, and a happy new year 2018! While evaluating a monitoring "topology" for my private setup, I tried (and failed) to reproduce this (Icinga 2 2.6 on Debian 9). Just for inspiration: what about testing "dis-meshing" the respective zones, e.g. A -> D <- B instead of a full mesh like A <-> B <-> C <-> D <-> A <-> C? Best,
So this has been open for 3.5 years now. I'm new to using Icinga, but is the workaround here to make a separate zone and endpoint for each server I want monitored, or else look at NRPE or similar? I'm still trying to get a sense of how to deploy a simple setup with a master and several clients that can do more than just basic external checks, but I thought the idea was to have a zone for all clients that aren't the master.
Our workaround was simply to shift the bulk of our monitoring (Linux + services) from Icinga2/graphite to Prometheus. We still use Icinga2 for network devices (several thousand), but we just run standalone boxes, or simple 2-node clusters (1 in each geo-location), all of which write to a shared graphite cluster. My experience is that it's easier to just run one big Icinga2 box and separate out things like MySQL to somewhere highly performant. Clustering for load balancing is a bit of a pain, and if you want more than 2 nodes, the bug in this thread stops it working. It'd be nice to see it fixed some day so you can just add workers into a pool (or spin up a container) and voila, but we'll see if anyone wants to sponsor it.
No, that won't work by design. Zones exist for high availability amongst zone members, and to separate specific tasks into roles - master, satellites, clients/agents. Agreed that it is cumbersome that a client would need a single zone and endpoint (see the sketch below), but unless someone comes up with a better solution for this, it will be the one thing you need in the future.
The thing I don't like in issues is when people point out the age of an issue, which implicitly blames the developers. It doesn't really matter whether a ticket is one month or five years old - if there is no solution, no-one willing to work on it, no support requirement, and no sponsor really requiring it, then issues like this won't receive much love. Consider that Icinga is open source software, not something you pay for. Anyhow, I am aware of the problem, and I know that it is hidden somewhere in our routing algorithm. Heck, I haven't found it yet, and neither have my colleagues. If you have more details or a reliable test setup (Vagrant, Docker, etc.) where this always happens, and you can provide the debug logs, gdb backtraces and insights needed to work on a fix, please do so. You can also dive into the code; I've recently improved the development docs even more. If you want us developers to do it, kindly request a quote for sponsoring.
Having this issue fixed won't enable you to spin up 10 endpoints in a zone. By design, all these endpoints need to communicate with each other, and they will balance the checks amongst themselves. While it should work, a general pool of "dumb workers" is not what's built into the cluster design, which uses one binary with different roles defined by configuration and zone trees. If you want something like that, it needs a more fine-grained approach, e.g. disabling all features except for checker/api and then optimizing again for speed and better round-robin/balancing algorithms. The feature request for check groups in #7160 moves in this direction, for example. That being said, this issue and the idea of pooling/grouping is known, but no-one is actively working on a concept or a PoC at the moment. Cheers,
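To make the per-agent zone point above concrete, here is a minimal zones.conf-style sketch, assuming a hypothetical agent named agent1.example.org (the names and address are placeholders, not taken from this thread): every agent gets its own Endpoint and a Zone containing only that endpoint, parented to the zone above it.

    # hypothetical agent endpoint and its dedicated zone
    object Endpoint "agent1.example.org" {
      host = "192.0.2.10"
    }

    object Zone "agent1.example.org" {
      endpoints = [ "agent1.example.org" ]
      parent = "master"
    }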
Besides, for load balancing purposes it's very easy with Icinga2 to set up Icinga2 clients as dedicated "checker satellites". Such a client would indeed only have the checker/api features active and act as a command_endpoint for certain service checks.
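A hedged sketch of that pattern, assuming a hypothetical checker satellite called checker1.example.org in its own zone (the zone/host names and the disk check are illustrative only, not taken from this thread); on the satellite itself only the checker and api features would be enabled (icinga2 feature enable checker api):

    object Endpoint "checker1.example.org" { }

    object Zone "checker-satellite" {
      endpoints = [ "checker1.example.org" ]
      parent = "master"
    }

    # run selected checks on the dedicated checker satellite
    apply Service "disk" {
      check_command = "disk"
      command_endpoint = "checker1.example.org"
      assign where host.vars.os == "Linux"
    }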
Can you tell me where (in which file/files) the routing algorithm is located?
lib/remote - apilistener*, jsonrpc* and partially lib/icinga - clusterevents*
I'm running a setup with a master zone with one node managing a slave zone of 10 satellites for running plugins, and it works.
Meanwhile, the routing has been documented at https://icinga.com/docs/icinga2/latest/doc/19-technical-concepts/#cluster-message-routing
Hi, is there any news about this issue, please (2 masters, many zones, and more than 2 satellites per zone)?
Hi, we have no immediate plans to support more than 2 nodes per zone. But that doesn't mean we will never support it. All the best, Eric
Btw, we use a lot of subzones as a workaround and this is working perfectly.
Regards,
@DisSsha How did you configure zones.conf on the master and satellites? Could you please tell us your Icinga2 topology?
There's currently no plan to work on this in the near future. We'll keep this in mind for when we think about reworking the cluster communication, which also won't happen anytime soon.
Hello, when you talk about subzones, do you have 2 levels of satellite nodes?
Regards
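For readers wondering what such a subzone workaround can look like, here is a hedged zones.conf-style sketch of a two-level satellite tree in which no zone has more than two endpoints (all zone and endpoint names are hypothetical; the commenters' actual configuration is not shown in this thread):

    object Zone "master" {
      endpoints = [ "master1", "master2" ]
    }

    # top-level satellite zone below the masters
    object Zone "satellite" {
      endpoints = [ "sat1", "sat2" ]
      parent = "master"
    }

    # subzones parented to the satellite zone instead of adding
    # more endpoints to a single zone
    object Zone "satellite-a" {
      endpoints = [ "sat3", "sat4" ]
      parent = "satellite"
    }

    object Zone "satellite-b" {
      endpoints = [ "sat5", "sat6" ]
      parent = "satellite"
    }

    # matching object Endpoint definitions omitted for brevity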
This issue has been migrated from Redmine: https://dev.icinga.com/issues/10435
Created by penyilas on 2015-10-22 13:12:26 +00:00
Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-11-09 14:52:15 +00:00 (in Redmine)
Hi,
After I installed 2.3.11, I attached the two remaining nodes back into the cluster, and it looked fine at first...
But I've realized the service checks start "lagging", e.g. the time between two service checks is more than check_interval, sometimes 3-4 times more.
Just to be sure, I added an NRPE date check with a 5s check_interval, and the results are the same as for the cluster heartbeat checks based on the ITL (usually 15-20s between two checks).
In a live environment (with the usual 5m check_interval), the elapsed time is sometimes more than 20 minutes.
If I remove two nodes from the zone (stopping them isn't enough), everything goes back to normal...
My test environment is the same as before:
2 masters (zone master), 3 satellites (zone icinga)
Configs are the same as in https://dev.icinga.org/issues/10131
I've attached stack traces; if you need anything else, just let me know...
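For illustration, a minimal zones.conf-style sketch of that topology, i.e. a satellite zone with three endpoints (the hostnames are placeholders; the reporter's actual configs are in the linked issue):

    object Zone "master" {
      endpoints = [ "master1", "master2" ]
    }

    # three endpoints in one zone - the constellation this issue is about
    object Zone "icinga" {
      endpoints = [ "sat1", "sat2", "sat3" ]
      parent = "master"
    }

    # matching object Endpoint definitions for all five nodes omitted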
Attachments