[dev.icinga.com #10435] checks start "lagging" when more than two endpoints in a zone #3533

Closed
icinga-migration opened this issue Oct 22, 2015 · 31 comments
Labels
area/distributed (Distributed monitoring: master, satellites, clients), bug (Something isn't working), help wanted (Extra attention is needed), needs-sponsoring (Not low on priority but also not scheduled soon without any incentive)

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10435

Created by penyilas on 2015-10-22 13:12:26 +00:00

Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-11-09 14:52:15 +00:00 (in Redmine)

Icinga Version: 2.3.11
Backport?: Not yet backported
Include in Changelog: 1

Hi,

After installing 2.3.11, I attached the two remaining nodes back into the cluster and it looked fine at first...
But then I realized that the service checks start "lagging", i.e. the time between two service checks is longer than check_interval, sometimes 3-4 times longer.
Just to be sure, I added an NRPE date check with a 5s check_interval, and the results are the same as for the ITL-based cluster heartbeat checks (usually 15-20s between two checks).

In a live environment (with the usual 5m check_interval) the elapsed time is sometimes more than 20 minutes.

If I remove the two nodes from the zone (stopping them is not enough), everything is back to normal...

My test environment is the same as before:
2 masters (zone master), 3 satellites (zone icinga)
Configs are the same as in https://dev.icinga.org/issues/10131

I've attached stack traces; if you need anything else, just let me know...


@icinga-migration

Updated by sudeshkumar on 2016-01-14 14:58:31 +00:00

  • File added workqueue.PNG

I have a related issue too. My setup is a three-node cluster in a single zone. At random, the check results of one of the nodes stop syncing. I have enabled the debug log and confirmed that the checks are happening but the check results are not syncing.

I don't see relay message entries ("notice/ApiListener: Relaying") in the debug log of the affected node when this issue happens. When I went through the code, it seemed the check results are pushed into m_RelayQueue but not processed. I can also see that the work queue size keeps increasing on the affected node. Please find the attached screenshot.

@icinga-migration

Updated by sudeshkumar on 2016-01-18 13:46:16 +00:00

For some reason "m_Spawned" is set to true by default before it is assigned inside the "WorkQueue::Enqueue" method. So the worker thread for the ApiListener relay messages is never created, and that caused the issue.
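
For illustration only, here is a minimal, simplified C++ sketch of that suspected failure mode; it is not the actual Icinga 2 WorkQueue code, just the pattern being described:

// Simplified sketch (NOT the real Icinga 2 source): if m_Spawned is already
// true when Enqueue() runs, no worker thread is started and queued relay
// messages only pile up.
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

class WorkQueue {
public:
    void Enqueue(std::function<void()> task)
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_Tasks.push_back(std::move(task));

        // Suspected bug: if m_Spawned is true here although no worker thread
        // exists yet, this branch is skipped and the queue is never drained.
        if (!m_Spawned) {
            m_Spawned = true;
            std::thread([this]() { ProcessItems(); }).detach();
        }
    }

private:
    void ProcessItems()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_Mutex);
                if (m_Tasks.empty())
                    return; // the real queue waits on a condition variable instead
                task = std::move(m_Tasks.front());
                m_Tasks.pop_front();
            }
            task();
        }
    }

    std::mutex m_Mutex;
    std::deque<std::function<void()>> m_Tasks;
    bool m_Spawned = false;
};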

I confirmed it by printing some debug statements and using manual builds. It doesn't happen every time, but sometimes after stopping and starting Icinga on one of the nodes; I was unable to find the exact scenario, as the result is indeterminate. Because of this, the OOM (Out Of Memory) killer sometimes kills Icinga since it uses too much memory.

Is anybody else having the same issue? Currently I am using my lab instance to test the cluster performance with 6000+ hosts and 38000+ services, all using the check_dummy plugin.

Please help me to resolve this.

@icinga-migration

Updated by mfriedrich on 2016-02-24 22:21:30 +00:00

  • Status changed from New to Feedback

Please re-test this with 2.4.3.

@icinga-migration

Updated by penyilas on 2016-03-03 12:16:55 +00:00

dnsmichi wrote:

Please re-test this with 2.4.3.

Hi,

I've tested it with CentOS7/icinga 2.4.3, and it's still not working.
With a 3-node slave zone, it looks like the "check_interval = 5s" checks are only running every 15s.

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:54:15 +00:00

  • Parent Id set to 11313

@icinga-migration

Updated by mfriedrich on 2016-04-14 10:13:33 +00:00

  • Status changed from Feedback to New

@icinga-migration

Updated by mfriedrich on 2016-07-28 16:14:45 +00:00

We've encountered a similar problem with 4 endpoints in a zone. Our current suggestion is to use 2 endpoints for now until a proper investigation and fix happen.

@icinga-migration

Updated by mfriedrich on 2016-07-28 16:15:06 +00:00

  • Relates set to 11948

@icinga-migration

Updated by mfriedrich on 2016-08-19 07:09:50 +00:00

  • Priority changed from Normal to High

@icinga-migration

Updated by mfriedrich on 2016-11-09 14:52:15 +00:00

  • Parent Id deleted 11313

@icinga-migration

Updated by mfriedrich on 2017-01-09 15:29:31 +00:00

  • Relates set to 13861

@icinga-migration added the blocker (Blocks a release or needs immediate attention), bug (Something isn't working) and area/distributed (Distributed monitoring: master, satellites, clients) labels on Jan 17, 2017
@jkroepke

jkroepke commented Feb 8, 2017

Any news?

@iDemonix

Would also like to see an update on this.

If you simply want to have a group of 'clients' or 'workers' in a pool, you can't. This seems like the simplest of cluster setups, which should easily be supported?

@dnsmichi

It should. If you can help us fix the issue, i.e. by looking into the current message loop and proposing a fix, or granting us time to look into it, we can speed things up here.

@dnsmichi removed their assignment on May 31, 2017
@iDemonix

I've not done much contributing before, but I can take a look through the stack traces above and try to correlate what is going on. Unfortunately C is not my strong point.

@SimonHoenscheid

What is the status here?

@dnsmichi added the needs-sponsoring (Not low on priority but also not scheduled soon without any incentive) label and removed the help wanted (Extra attention is needed) label on Nov 9, 2017
@dnsmichi removed the blocker (Blocks a release or needs immediate attention) label on Jan 3, 2018
@Al2Klimov

Al2Klimov commented Jan 7, 2018

Hello dear core devs and a happy new year 2018!

While evaluating the monitoring "topology" for my private setup, I tried (and failed) to reproduce this (Icinga 2 2.6 on Debian 9).

Just for inspiration: What about testing "dis-meshing" the respective zones (by zones.conf)? E.g.:

A -> D <- B
C -> D

instead of

A <-> B <-> C <-> D <-> A <-> C
B <-> D
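
One possible reading of that suggestion, as a hypothetical zones.conf sketch (all names are placeholders; only D carries a host attribute, so A, B and C actively connect to D and never to each other):

// Hypothetical sketch, not taken from any real setup in this thread.
object Endpoint "a.example.com" { }   // no host attribute: nobody connects out to A
object Endpoint "b.example.com" { }
object Endpoint "c.example.com" { }

object Endpoint "d.example.com" {
  host = "d.example.com"              // A, B and C establish their connections to D
  port = 5665
}

object Zone "icinga" {
  endpoints = [ "a.example.com", "b.example.com", "c.example.com", "d.example.com" ]
}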

Best,
AK

@dnsmichi changed the title from "[dev.icinga.com #10435] checks start "lagging" when cluster nodes > 2 in a zone" to "[dev.icinga.com #10435] checks start "lagging" when more than two endpoints in a zone" on Sep 28, 2018
@dnsmichi removed the wishlist label on May 9, 2019
@relrod

relrod commented May 16, 2019

So this has been open for 3.5 years now. I'm new to using Icinga, but is the workaround here to make a separate zone and endpoint for each server I want monitored, or else look at NRPE or similar?

I'm still trying to get a sense of how to deploy a simple setup with a master and several clients that can do more than just basic external checks, but I thought the idea was to have a zone for all clients that aren't the master.

@iDemonix

Our workaround was simply to shift the bulk of our monitoring (Linux + services) from Icinga2/graphite to Prometheus. We still use Icinga2 for network devices (several thousand), but we just run standalone boxes, or simple 2-node clusters (1 in each geo-location), all of which write to a shared graphite cluster.

My experience is that it's easier to just build one big Icinga2 box and separate out things like MySQL to somewhere highly performant. Clustering for load balancing is a bit of a pain, and if you want more than 2 nodes, the bug in this thread stops it from working. It'd be nice to see it fixed some day so you can just add workers into a pool (or spin up a container) and voila, but we'll see if anyone wants to sponsor it.

@dnsmichi

thought the idea was to have a zone for all clients that aren't the master.

No, that won't work by design. Zones exist for High availability amongst zone members, and to separate specific tasks into roles - master, satellites, clients/agents. Agreed that it is cumbersome that a client would need a single zone and endpoint, but unless someone comes up with a better solution for this, it will be the one thing you need in the future.

So this has been open for 3.5 years now.

What I don't like in issues is when people point out the age of an issue, which implicitly blames the developers. It doesn't really matter whether a ticket is one month or five years old - if there are no solutions, no-one willing to work on this, no support requirements, nor any sponsors really requiring it, then issues like this won't receive much love. Consider that Icinga is open source software, not something you'll pay for.

Anyhow, I am aware of the problem, and I know that it is hidden somewhere in our routing algorithm. Heck, I haven't found it yet, and neither have my colleagues.

If you have more details or a reliable test setup (Vagrant, Docker, etc.) where this always happens, and you can provide all the debug logs, gdb backtraces and insights to work on a fix, please do so. You can also dive into the code; I've recently improved the development docs even more. If you want us developers to do it, kindly request a quote for sponsoring.

It'd be nice to see it fixed some day so you can just add workers in to a pool (or spin up a container)

Having this issue fixed won't enable you to spin up 10 endpoints in a zone. By design, all these endpoints need to communicate with each other, and they balance the checks amongst them. While it should work, a general pool of "dumb workers" is not what's built into the cluster design, which uses one binary with different roles defined by configuration and zone trees.

If you want something like that, it needs a more fine-grained approach, e.g. disabling all features except checker/api and then optimizing again for speed and better round-robin/balancing algorithms.

The feature request for check groups in #7160 moves in this direction, for example. That being said, this issue and the idea of pooling/grouping are known, but no-one is actively working on a concept or a PoC at the moment.

Cheers,
Michael

@Corbyn

Corbyn commented May 17, 2019

Besides, for load balancing purposes it's very easy with Icinga2 to set up Icinga2 clients as dedicated "checker satellites". Such a client would indeed only have the checker/api features active and act as command_endpoint for certain service checks.
E.g. we had a memory problem with the check_wmi_plus plugin, so we couldn't run all our Windows service checks on the Icinga2 server. Instead, we dedicated one Icinga2 client on a separate host to each Windows service check (cpu, paging, disk, etc.). So it's absolutely no problem to horizontally scale the load out to Icinga2 clients/satellites.
Regarding high availability, it would indeed be nice to have the possibility of more than two nodes in an Icinga2 server cluster, but a two-node cluster (which works perfectly fine) is already quite a good HA solution.
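
For what it's worth, a hypothetical sketch of that pattern in the Icinga 2 DSL (zone, endpoint, host and command names are placeholders, not from this setup); on the satellite itself only the api and checker features would be enabled:

// Hypothetical sketch of a dedicated "checker satellite" pinned via command_endpoint.
object Endpoint "wmi-checker-1.example.com" { }

object Zone "wmi-checker-1" {
  endpoints = [ "wmi-checker-1.example.com" ]
  parent = "master"
}

apply Service "windows-cpu" {
  check_command = "my-wmi-cpu-check"             // placeholder CheckCommand
  command_endpoint = "wmi-checker-1.example.com" // run the check on the satellite
  assign where host.vars.os == "Windows"
}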

@tkoeck

tkoeck commented Jun 12, 2019

Anyhow, I am aware of the problem, I know that it is somewhere hidden in our routing algorithm. Heck, I haven't found it yet, neither did my colleagues.

Can you tell me where (in which file/files) the routing algorithm is located?

@dnsmichi

lib/remote - apilistener*, jsonrpc* and partially lib/icinga - clusterevents*

@DisSsha

DisSsha commented Sep 9, 2019

I'm running a setup with a Master zone with 1 node managing a slave zone of 10 satellites for running plugins and it works.

@dnsmichi

dnsmichi commented Feb 5, 2020

Meanwhile, the routing has been documented at https://icinga.com/docs/icinga2/latest/doc/19-technical-concepts/#cluster-message-routing

@AurelienFo

Hi, is there any news about this issue, please (2 masters, many zones and more than 2 satellites per zone)?
Thanks!

@lippserd

Hi,

We have no immediate plans to support more than 2 nodes per zone. But that doesn't mean we will never support it.

All the best,
Eric

@DisSsha

DisSsha commented Jan 26, 2021 via email

Btw we use a lot of subzones as a workaround and this is working perfectly.

@firatalkis

@DisSsha How did you configure zone.conf on master and satellite? Could you please tell us your Icinga2 topology?

@N-o-X

N-o-X commented Sep 20, 2021

There's currently no plan to work on this in the near future. We'll keep this in mind for when we think about reworking the cluster communication, which also won't happen anytime soon.

@N-o-X N-o-X closed this as completed Sep 20, 2021
@ymartin-ovh

ymartin-ovh commented Dec 9, 2021

Btw we use a lot of subzones as a workaround and this is working perfectly. Regards. On Tue, 26 Jan 2021 at 17:51, Eric Lippmann [email protected] wrote:

Hi, We have no immediate plans to support more than 2 nodes per zone. But that doesn't mean we will never support it. All the best, Eric (#3533 (comment))

Hello

When you talk about subzones, do you have 2 levels of satellite nodes?

master 1/2 --- satellite 1/2 (zone A) --- satellite 1/2 (subzone A1)
                                      --- satellite 1/2 (subzone A2)
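
For reference, a hypothetical zones.conf sketch of such a two-level layout (all zone and endpoint names are placeholders):

// Hypothetical sketch only; two HA masters, an intermediate satellite zone A,
// and two subzones A1/A2 whose parent is zone A.
object Endpoint "master1.example.com" { }
object Endpoint "master2.example.com" { }
object Zone "master" {
  endpoints = [ "master1.example.com", "master2.example.com" ]
}

object Endpoint "sat-a1.example.com" { }
object Endpoint "sat-a2.example.com" { }
object Zone "zone-a" {
  endpoints = [ "sat-a1.example.com", "sat-a2.example.com" ]
  parent = "master"
}

object Endpoint "sub-a1-1.example.com" { }
object Endpoint "sub-a1-2.example.com" { }
object Zone "subzone-a1" {
  endpoints = [ "sub-a1-1.example.com", "sub-a1-2.example.com" ]
  parent = "zone-a"
}

object Endpoint "sub-a2-1.example.com" { }
object Endpoint "sub-a2-2.example.com" { }
object Zone "subzone-a2" {
  endpoints = [ "sub-a2-1.example.com", "sub-a2-2.example.com" ]
  parent = "zone-a"
}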

Regards
