Icinga2 should failover the IDO DB writer when DB is gone #5271

Closed
xenuser opened this issue May 17, 2017 · 8 comments
Labels
area/db-ido Database output area/distributed Distributed monitoring (master, satellites, clients) no-issue Better asked in one of our support channels

Comments

@xenuser
Contributor

xenuser commented May 17, 2017

In a multi-master setup with IDO DB enabled, there is only one instance writing to the DB. If this "active IDO DB writer" goes away (e.g. icinga2 is killed), the active IDO DB writer role moves to the next master.

This is great, of course. However, when an Icinga2 master can no longer access the database, there is no failover of the IDO DB writer role. Instead, no instance is able to write to the database, even though all the other masters can still connect to it.

Expected Behavior

Icinga2 should fail over when accessing the DB is no longer possible.
This could be done in the same way as when Icinga2 fails over the IDO DB role when an Icinga2 instance itself is gone (using the same timeout value from the config, for example).

One might also consider letting Icinga2 shoot itself in the head after a certain timeout, meaning that Icinga2 shuts down so other Icinga2 instances can take over.

This can be achieved with external tools such as Pacemaker or Monit, but I think the design of Icinga2 should solve this issue on its own.
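For reference, the existing HA behavior is configured on the IdoMysqlConnection object. Below is a minimal sketch of the relevant attributes (host and credentials are placeholders); note that the existing failover only reacts to the active writer going silent, not to a node losing its own DB connection, which is exactly the gap described above.

```
object IdoMysqlConnection "ido-mysql" {
  host = "127.0.0.1"    // placeholder: the local Galera node
  user = "icinga"
  password = "icinga"
  database = "icinga"

  // Existing HA attributes: only one endpoint per zone writes, and a
  // failover is triggered when the active writer's status updates stop
  // for longer than failover_timeout.
  enable_ha = true
  failover_timeout = 60s
}
```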

Current Behavior

Three Icinga2 masters, all with IDO DB enabled; the DB host is 127.0.0.1.
When the local Galera node is shut down, Icinga2 just seems to wait for the DB to come back online.

Steps to Reproduce (for bugs)

  1. Set up 3 masters with IDO DB and Galera (see the zones.conf sketch below)
  2. Each master server also contains a local MariaDB instance
  3. All Icinga2 instances connect to 127.0.0.1
  4. Shut down MariaDB on one of the master nodes.
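To illustrate the setup, here is a minimal zones.conf sketch for the three-master zone described above (the hostnames are hypothetical; as noted further down, three endpoints per zone had known issues at the time, see #3533):

```
// zones.conf shared by all three masters (hypothetical hostnames)
object Endpoint "master1.example.org" { }
object Endpoint "master2.example.org" { }
object Endpoint "master3.example.org" { }

object Zone "master" {
  endpoints = [ "master1.example.org", "master2.example.org", "master3.example.org" ]
}
```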

Context

Normally, one would run an HA DB setup with Pacemaker or similar mechanisms, meaning that there is a floating service IP in case something goes wrong. However, in some scenarios there is no possibility to use floating IPs for HA services. Instead, Icinga2 has to connect to the local DB instance directly.

If this is the case, the IDO DB writer role has no chance to switch to another node if one of the local MariaDB instances goes down (while Icinga2 itself is still running).

For me, this is something that happens in real-life projects. I understand if someone says that I should build a different HA DB setup then; however, this would not be a solution, but more of a workaround. Instead, Icinga2 should behave as expected.

Your Environment

  • Version used (icinga2 --version): 2.6.1
  • Operating System and version: SLES 11 SP 1
  • Enabled features (icinga2 feature list): api checker command ido-mysql influxdb mainlog notification
  • Icinga Web 2 version and modules (System - About): Icingaweb 2.4.1, Director, Fileshipper
@xenuser
Contributor Author

xenuser commented May 18, 2017

See also the discussion on the mailing list for further background information:
https://lists.icinga.org/pipermail/icinga-users/2017-May/011992.html

@dnsmichi
Contributor

Keep in mind that (short) connection losses would then also enforce a failover. When HA plays ping-pong, this is even worse. When the database is not available, you probably have other issues already, and it is safe to say that a dedicated MySQL cluster with a central VIP removes that burden from the monitoring core feature.

I wouldn't bother with local databases and MySQL replication, but rather suggest using a MySQL cluster and a VIP. We've seen setups which ran into multiple issues with MySQL replication and slave lag; that's something you can hardly control within Icinga 2 itself.

Btw - you are running 3 endpoints in a zone, which is known to have bugs (#3533). Keep that in mind for now.

I'd update the docs accordingly for such scenarios, but without any MySQL replication involved. If you really need this feature, please sponsor design and development time for it.
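To illustrate the suggested alternative, here is a minimal sketch of an ido-mysql.conf that points at a cluster VIP instead of 127.0.0.1 (the VIP address and credentials are hypothetical):

```
object IdoMysqlConnection "ido-mysql" {
  // Hypothetical VIP in front of a dedicated MySQL cluster,
  // instead of each master talking to its local Galera node.
  host = "10.0.0.100"
  user = "icinga"
  password = "icinga"
  database = "icinga"

  enable_ha = true    // still only one active IDO writer per zone
}
```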

@dgoetz
Contributor

dgoetz commented May 19, 2017

Also keep in mind that there are more possibilities to prevent split-brain and redundant write requests if all Icinga 2 instances reliably write to one database instance using the VIP. Such a scenario typically runs much more stably and requires less support.

And just to add: to make MySQL highly available, keep in mind to stick to an asynchronous mechanism like master-master replication, because you want Icinga 2 to be able to write data as fast as possible.

@tmey

tmey commented May 23, 2017

The problem will also occur with the suggested VIP solution. Even more: it will occur more often because of the additional network connection. Just imagine masters and databases distributed over multiple datacenters...

If you really need this feature, please sponsor design and development time for it.

To do so, we should understand the existing design of the IDO DB HA.
Is there any design documentation?

@gunnarbeutner
Contributor

Unfortunately there's no documentation for that, sorry. :)

@tmey

tmey commented May 23, 2017

Ok. Then let's have a look into the code:

/lib/db_ido_mysql/idomysqlconnection.cpp#L344
If the last entry in the IDO DB from the other instance is older than 60 seconds, the icinga2 instance takes over.

/lib/db_ido_mysql/idomysqlconnection.cpp#L366
But only if NOT IsPaused. What is happening at this point?

@xenuser
Contributor Author

xenuser commented May 24, 2017

@gunnarbeutner No offense: Is there any documentation regarding internal design (e.g. internal message bus, what affects Icinga2's decision to resume or suspend DB activity etc.)?

The community would highly appreciate it if important design aspects like this were written down, so everybody knows what happens in specific situations.

I have another question regarding IDO MySQL and Icinga2:

I have a demo setup with 1 master, 1 satellite, 1 client. The master has IDO DB activated and connects to a local MySQL instance.

Today, I extended this demo setup to:
1 master with local MySQL and IDO DB + NEW: 1 master WITHOUT IDO DB and WITHOUT local MySQL
2 satellites, both in the same zone
1 client

So I just added a new master, but I don't need IDO MySQL there, so I didn't even install it.
I re-configured zones and so on, and soon both masters were happily living together, exchanging information and being in sync (according to the check results).

Soon I noticed a strange behavior: After some time, masterA (with IDO DB) stopped connecting to MySQL. I noticed that if I stop Icinga2 on masterB, masterA re-connects to MySQL again and works fine.

So, I started to dig, and after a while I noticed that "enable_ha = true" is the default configuration setting for IDO MySQL. So I went ahead and extended my ido-mysql.conf with "enable_ha = false" (see the sketch below).
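A minimal sketch of that change, assuming an otherwise default ido-mysql.conf (credentials are placeholders):

```
object IdoMysqlConnection "ido-mysql" {
  host = "127.0.0.1"
  user = "icinga"
  password = "icinga"
  database = "icinga"

  // Disable HA for this feature so masterA keeps writing to its
  // local DB even while masterB is online.
  enable_ha = false
}
```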

After doing this, masterA still works with the local DB even when masterB is online.

My question is: somehow I understand that enable_ha triggers this behavior. But still: why does masterA stop working with MySQL as soon as masterB is online, although masterB doesn't even have IDO MySQL installed (let alone activated)? Shouldn't there be some sort of mechanism which lets masterB signal masterA: "It is all good, you don't need to think about HA since I am not able to take over anyway."?

And: Why is there even an issue/a failover attempt when masterA can reach the local DB?

Another question:
I noticed that the "IDO DB WRITER" role switches back to a node after a failover, as soon as the formerly broken node is online and working again. Is that really what should happen? Why not let the new active IDO DB WRITER stay?

@dnsmichi dnsmichi added question area/distributed Distributed monitoring (master, satellites, clients) area/db-ido Database output labels May 24, 2017
@dnsmichi
Contributor

I really don't like the tone in this issue.

@gunnarbeutner No offense: Is there any documentation regarding internal design (e.g. internal message bus, what affects Icinga2's decision to resume or suspend DB activity etc.)?

The community would highly appreciate it if important design aspects like this were written down, so everybody knows what happens in specific situations.

First off, I think it is offensive that you're now speaking for our community and using that as an argument.

Second, there are no such design documents. There might be some flipchart drawings somewhere, or notes hidden in our backups, but nothing one could share "somewhere".

You can kindly ask for development support and deeper insights into the code. If you need more than the source code and some tips to get things going by yourself, you can also sponsor development time and one of the Icinga devs will take care of it during work hours.

I'd suggest proceeding in a new issue; this one is burned and contains too many questions which are not part of the original request.

@dnsmichi dnsmichi added no-issue Better asked in one of our support channels and removed no-issue Better asked in one of our support channels question labels May 9, 2019