[dev.icinga.com #10435] checks start "lagging" when more than two endpoints in a zone #3533

Closed
icinga-migration opened this issue Oct 22, 2015 · 31 comments
Labels
area/distributed (Distributed monitoring: master, satellites, clients), bug (Something isn't working), help wanted (Extra attention is needed), needs-sponsoring (Not low on priority but also not scheduled soon without any incentive)

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10435

Created by penyilas on 2015-10-22 13:12:26 +00:00

Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-11-09 14:52:15 +00:00 (in Redmine)

Icinga Version: 2.3.11
Backport?: Not yet backported
Include in Changelog: 1

Hi,

After installing 2.3.11, I attached the two remaining nodes back into the cluster and it looked fine at first...
But then I realized that the service checks start "lagging", i.e. the time between two service checks is longer than check_interval, sometimes 3-4 times longer.
Just to be sure, I added an NRPE date check with a 5s check_interval, and the results are the same as for the ITL-based cluster heartbeat checks (usually 15-20s between two checks).

In a live environment (with the usual 5m check_interval) the elapsed time is sometimes more than 20 minutes.

If I remove the two nodes from the zone (stopping them is not enough), everything is back to normal...

My test environment is the same as before:
2 masters (zone master), 3 satellites (zone icinga)
Configs are the same as in https://dev.icinga.org/issues/10131

I've attached stack traces; if you need anything else, just let me know...


@icinga-migration

Updated by sudeshkumar on 2016-01-14 14:58:31 +00:00

  • File added workqueue.PNG

I have a related issue too. My setup is a three-node cluster in a single zone. At random, the check results of one of the nodes stop syncing. I have enabled the debug log and confirmed that the checks are happening but the check results are not syncing.

I don't see relay message entries ("notice/ApiListener: Relaying") in the debug log of the affected node when this issue happens. When I went through the code, it seemed the check results are pushed into m_RelayQueue but not processed. I can also see that the work queue size keeps increasing on the affected node. Please find the attached screenshot.

@icinga-migration

Updated by sudeshkumar on 2016-01-18 13:46:16 +00:00

For some reason "m_Spawned" is set to true by default before it is assigned inside the "WorkQueue::Enqueue" method. So the worker thread for the ApiListener relay messages is never created, and that caused the issue.
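
For illustration only, here is a minimal, simplified C++ sketch of that suspected failure mode; it is not the actual Icinga 2 WorkQueue code, just the pattern being described:

// Simplified sketch (NOT the real Icinga 2 source): if m_Spawned is already
// true when Enqueue() runs, no worker thread is started and queued relay
// messages only pile up.
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

class WorkQueue {
public:
    void Enqueue(std::function<void()> task)
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_Tasks.push_back(std::move(task));

        // Suspected bug: if m_Spawned is true here although no worker thread
        // exists yet, this branch is skipped and the queue is never drained.
        if (!m_Spawned) {
            m_Spawned = true;
            std::thread([this]() { ProcessItems(); }).detach();
        }
    }

private:
    void ProcessItems()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_Mutex);
                if (m_Tasks.empty())
                    return; // the real queue waits on a condition variable instead
                task = std::move(m_Tasks.front());
                m_Tasks.pop_front();
            }
            task();
        }
    }

    std::mutex m_Mutex;
    std::deque<std::function<void()>> m_Tasks;
    bool m_Spawned = false;
};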

I confirmed it by printing some debug statements and using manual builds. It doesn't happen every time, but sometimes after stopping and starting Icinga on one of the nodes; I was unable to find the exact scenario, as the result is indeterminate. Because of this, the OOM (Out Of Memory) killer sometimes kills Icinga since it uses too much memory.

Is anybody else having the same issue? Currently I am using my lab instance to test the cluster performance with 6000+ hosts and 38000+ services, all using the check_dummy plugin.

Please help me to resolve this.

@icinga-migration

Updated by mfriedrich on 2016-02-24 22:21:30 +00:00

  • Status changed from New to Feedback

Please re-test this with 2.4.3.

@icinga-migration

Updated by penyilas on 2016-03-03 12:16:55 +00:00

dnsmichi wrote:

Please re-test this with 2.4.3.

Hi,

I've tested it with CentOS7/icinga 2.4.3, and it's still not working.
With a 3-node slave zone, it looks like the "check_interval = 5s" checks are only running every 15s.

@icinga-migration

Updated by mfriedrich on 2016-03-04 15:54:15 +00:00

  • Parent Id set to 11313

@icinga-migration

Updated by mfriedrich on 2016-04-14 10:13:33 +00:00

  • Status changed from Feedback to New

@icinga-migration

Updated by mfriedrich on 2016-07-28 16:14:45 +00:00

We've encountered a similar problem with 4 endpoints in a zone. Our current suggestion is to use 2 endpoints for now until a proper investigation and fix happen.

@icinga-migration

Updated by mfriedrich on 2016-07-28 16:15:06 +00:00

  • Relates set to 11948

@icinga-migration

Updated by mfriedrich on 2016-08-19 07:09:50 +00:00

  • Priority changed from Normal to High

@icinga-migration

Updated by mfriedrich on 2016-11-09 14:52:15 +00:00

  • Parent Id deleted 11313

@icinga-migration

Updated by mfriedrich on 2017-01-09 15:29:31 +00:00

  • Relates set to 13861

@icinga-migration added the blocker (Blocks a release or needs immediate attention), bug (Something isn't working) and area/distributed (Distributed monitoring: master, satellites, clients) labels on Jan 17, 2017
@jkroepke

jkroepke commented Feb 8, 2017

Any news?

@iDemonix

Would also like to see an update on this.

If you simply want to have a group of 'clients' or 'workers' in a pool, you can't. This seems like the simplest of cluster setups, which should easily be supported?

@dnsmichi

It should. If you can help us fix the issue, i.e. by looking into the current message loop and proposing a fix, or granting us time to look into it, we can speed things up here.

@dnsmichi removed their assignment on May 31, 2017
@iDemonix

I've not done much contributing before, but I can take a look through the stack traces above and try to correlate what is going on. Unfortunately C is not my strong point.

@SimonHoenscheid

What is the status here?

@dnsmichi added the needs-sponsoring (Not low on priority but also not scheduled soon without any incentive) label and removed the help wanted (Extra attention is needed) label on Nov 9, 2017
@dnsmichi removed the blocker (Blocks a release or needs immediate attention) label on Jan 3, 2018
@Al2Klimov

Al2Klimov commented Jan 7, 2018

Hello dear core devs and a happy new year 2018!

While evaluating the monitoring "topology" for my private setup, I tried (and failed) to reproduce this (Icinga 2 2.6 on Debian 9).

Just for inspiration: What about testing "dis-meshing" the respective zones (by zones.conf)? E.g.:

A -> D <- B
C -> D

instead of

A <-> B <-> C <-> D <-> A <-> C
B <-> D
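
One possible reading of that suggestion, as a hypothetical zones.conf sketch (all names are placeholders; only D carries a host attribute, so A, B and C actively connect to D and never to each other):

// Hypothetical sketch, not taken from any real setup in this thread.
object Endpoint "a.example.com" { }   // no host attribute: nobody connects out to A
object Endpoint "b.example.com" { }
object Endpoint "c.example.com" { }

object Endpoint "d.example.com" {
  host = "d.example.com"              // A, B and C establish their connections to D
  port = 5665
}

object Zone "icinga" {
  endpoints = [ "a.example.com", "b.example.com", "c.example.com", "d.example.com" ]
}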

Best,
AK

@dnsmichi changed the title from "[dev.icinga.com #10435] checks start "lagging" when cluster nodes > 2 in a zone" to "[dev.icinga.com #10435] checks start "lagging" when more than two endpoints in a zone" on Sep 28, 2018
@dnsmichi removed the wishlist label on May 9, 2019
@relrod

relrod commented May 16, 2019

So this has been open for 3.5 years now. I'm new to using Icinga, but is the workaround here to make a separate zone and endpoint for each server I want monitored, or else look at NRPE or similar?

I'm still trying to get a sense of how to deploy a simple setup with a master and several clients that can do more than just basic external checks, but I thought the idea was to have a zone for all clients that aren't the master.

@iDemonix

Our workaround was simply to shift the bulk of our monitoring (Linux + services) from Icinga2/graphite to Prometheus. We still use Icinga2 for network devices (several thousand), but we just run standalone boxes, or simple 2-node clusters (1 in each geo-location), all of which write to a shared graphite cluster.

My experience is that it's easier to just build one big Icinga2 box and separate out things like MySQL to somewhere highly performant. Clustering for load balancing is a bit of a pain, and if you want more than 2 nodes, the bug in this thread stops it from working. It'd be nice to see it fixed some day so you can just add workers into a pool (or spin up a container) and voila, but we'll see if anyone wants to sponsor it.

@dnsmichi

thought the idea was to have a zone for all clients that aren't the master.

No, that won't work by design. Zones exist for High availability amongst zone members, and to separate specific tasks into roles - master, satellites, clients/agents. Agreed that it is cumbersome that a client would need a single zone and endpoint, but unless someone comes up with a better solution for this, it will be the one thing you need in the future.

So this has been open for 3.5 years now.

What I don't like in issues is when people point out the age of an issue, which implicitly blames the developers. It doesn't really matter whether a ticket is one month or five years old - if there are no solutions, no-one willing to work on this, no support requirements, nor any sponsors really requiring it, then issues like this won't receive much love. Consider that Icinga is open source software, not something you'll pay for.

Anyhow, I am aware of the problem, and I know that it is hidden somewhere in our routing algorithm. Heck, I haven't found it yet, and neither have my colleagues.

If you have more details or a reliable test setup (Vagrant, Docker, etc.) where this always happens, and you can provide all the debug logs, gdb backtraces and insights to work on a fix, please do so. You can also dive into the code; I've recently improved the development docs even more. If you want us developers to do it, kindly request a quote for sponsoring.

It'd be nice to see it fixed some day so you can just add workers in to a pool (or spin up a container)

Having this issue fixed won't enable you to spin up 10 endpoints in a zone. By design, all these endpoints need to communicate with each other, and they balance the checks amongst them. While it should work, a general pool of "dumb workers" is not what's built into the cluster design, which uses one binary with different roles defined by configuration and zone trees.

If you want something like that, it needs a more fine-grained approach, e.g. disabling all features except checker/api and then optimizing again for speed and better round-robin/balancing algorithms.

The feature request for check groups in #7160 moves in this direction, for example. That being said, this issue and the idea of pooling/grouping are known, but no-one is actively working on a concept or a PoC at the moment.

Cheers,
Michael

@Corbyn

Corbyn commented May 17, 2019

Besides, for load balancing purposes it's very easy with Icinga2 to set up Icinga2 clients as dedicated "checker satellites". Such a client would indeed only have the checker/api features active and act as command_endpoint for certain service checks.
E.g. we had a memory problem with the check_wmi_plus plugin, so we couldn't run all our Windows service checks on the Icinga2 server. Instead, we dedicated one Icinga2 client on a separate host to each Windows service check (cpu, paging, disk, etc.). So it's absolutely no problem to horizontally scale the load out to Icinga2 clients/satellites.
Regarding high availability, it would indeed be nice to have the possibility of more than two nodes in an Icinga2 server cluster, but a two-node cluster (which works perfectly fine) is already quite a good HA solution.
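
For what it's worth, a hypothetical sketch of that pattern in the Icinga 2 DSL (zone, endpoint, host and command names are placeholders, not from this setup); on the satellite itself only the api and checker features would be enabled:

// Hypothetical sketch of a dedicated "checker satellite" pinned via command_endpoint.
object Endpoint "wmi-checker-1.example.com" { }

object Zone "wmi-checker-1" {
  endpoints = [ "wmi-checker-1.example.com" ]
  parent = "master"
}

apply Service "windows-cpu" {
  check_command = "my-wmi-cpu-check"             // placeholder CheckCommand
  command_endpoint = "wmi-checker-1.example.com" // run the check on the satellite
  assign where host.vars.os == "Windows"
}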

@tkoeck

tkoeck commented Jun 12, 2019

Anyhow, I am aware of the problem, I know that it is somewhere hidden in our routing algorithm. Heck, I haven't found it yet, neither did my colleagues.

Can you tell me where (in which file/files) the routing algorithm is located?

@dnsmichi

lib/remote - apilistener*, jsonrpc* and partially lib/icinga - clusterevents*

@DisSsha

DisSsha commented Sep 9, 2019

I'm running a setup with a Master zone with 1 node managing a slave zone of 10 satellites for running plugins and it works.

@dnsmichi

dnsmichi commented Feb 5, 2020

Meanwhile, the routing has been documented at https://icinga.com/docs/icinga2/latest/doc/19-technical-concepts/#cluster-message-routing

@AurelienFo

Hi, is there any news about this issue, please (2 masters, many zones and more than 2 satellites per zone)?
Thanks!

@lippserd

Hi,

We have no immediate plans to support more than 2 nodes per zone. But that doesn't mean we will never support it.

All the best,
Eric

@DisSsha

DisSsha commented Jan 26, 2021 via email

Btw we use a lot of subzones as a workaround and this is working perfectly.

@firatalkis

@DisSsha How did you configure zone.conf on master and satellite? Could you please tell us your Icinga2 topology?

@N-o-X

N-o-X commented Sep 20, 2021

There's currently no plan to work on this in the near future. We'll keep this in mind for when we think about reworking the cluster communication, which also won't happen anytime soon.

@N-o-X N-o-X closed this as completed Sep 20, 2021
@ymartin-ovh

ymartin-ovh commented Dec 9, 2021

Btw we use a lot of subzones as a workaround and this is working perfectly. Regards. On Tue, 26 Jan 2021 at 17:51, Eric Lippmann [email protected] wrote:

Hi, We have no immediate plans to support more than 2 nodes per zone. But that doesn't mean we will never support it. All the best, Eric (#3533 (comment))

Hello

When you talk about subzones, do you have 2 levels of satellite nodes?

master 1/2 --- satellite 1/2 (zone A) --- satellite 1/2 (subzone A1)
                                      --- satellite 1/2 (subzone A2)
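
For reference, a hypothetical zones.conf sketch of such a two-level layout (all zone and endpoint names are placeholders):

// Hypothetical sketch only; two HA masters, an intermediate satellite zone A,
// and two subzones A1/A2 whose parent is zone A.
object Endpoint "master1.example.com" { }
object Endpoint "master2.example.com" { }
object Zone "master" {
  endpoints = [ "master1.example.com", "master2.example.com" ]
}

object Endpoint "sat-a1.example.com" { }
object Endpoint "sat-a2.example.com" { }
object Zone "zone-a" {
  endpoints = [ "sat-a1.example.com", "sat-a2.example.com" ]
  parent = "master"
}

object Endpoint "sub-a1-1.example.com" { }
object Endpoint "sub-a1-2.example.com" { }
object Zone "subzone-a1" {
  endpoints = [ "sub-a1-1.example.com", "sub-a1-2.example.com" ]
  parent = "zone-a"
}

object Endpoint "sub-a2-1.example.com" { }
object Endpoint "sub-a2-2.example.com" { }
object Zone "subzone-a2" {
  endpoints = [ "sub-a2-1.example.com", "sub-a2-2.example.com" ]
  parent = "zone-a"
}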

Regards
