Icinga2 should failover the IDO DB writer when DB is gone #5271

Closed
xenuser opened this issue May 17, 2017 · 8 comments
Labels
area/db-ido Database output area/distributed Distributed monitoring (master, satellites, clients) no-issue Better asked in one of our support channels

Comments

@xenuser
Contributor

xenuser commented May 17, 2017

In a multi-master setup with IDO DB enabled, there is only one instance writing to the DB. If this "active IDO DB writer" goes away (e.g. icinga2 is killed), the active IDO DB writer role moves to the next master.

This is great, of course. However, when an Icinga2 master can no longer access the database, there is no failover of the IDO DB writer role. Instead, no instance is able to write to the database, even though all the other masters can still connect to it.

Expected Behavior

Icinga2 should fail over when accessing the DB is no longer possible.
This could be done in the same way as when Icinga2 fails over the IDO DB role when an Icinga2 instance itself is gone (using the same timeout value from the config, for example).

One might also consider letting Icinga2 shoot itself in the head after a certain timeout, meaning that Icinga2 shuts down so other Icinga2 instances can take over.

This can be achieved with external tools such as Pacemaker or Monit, but I think the design of Icinga2 should solve this issue on its own.
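For reference, the existing HA behavior is configured on the IdoMysqlConnection object. Below is a minimal sketch of the relevant attributes (host and credentials are placeholders); note that the existing failover only reacts to the active writer going silent, not to a node losing its own DB connection, which is exactly the gap described above.

```
object IdoMysqlConnection "ido-mysql" {
  host = "127.0.0.1"    // placeholder: the local Galera node
  user = "icinga"
  password = "icinga"
  database = "icinga"

  // Existing HA attributes: only one endpoint per zone writes, and a
  // failover is triggered when the active writer's status updates stop
  // for longer than failover_timeout.
  enable_ha = true
  failover_timeout = 60s
}
```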

Current Behavior

Three Icinga2 masters, all with IDO DB enabled; the DB host is 127.0.0.1.
When the local Galera node is shut down, Icinga2 just seems to wait for the DB to come back online.

Steps to Reproduce (for bugs)

  1. Set up 3 masters with IDO DB and Galera (see the zones.conf sketch below)
  2. Each master server also contains a local MariaDB instance
  3. All Icinga2 instances connect to 127.0.0.1
  4. Shut down MariaDB on one of the master nodes.
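To illustrate the setup, here is a minimal zones.conf sketch for the three-master zone described above (the hostnames are hypothetical; as noted further down, three endpoints per zone had known issues at the time, see #3533):

```
// zones.conf shared by all three masters (hypothetical hostnames)
object Endpoint "master1.example.org" { }
object Endpoint "master2.example.org" { }
object Endpoint "master3.example.org" { }

object Zone "master" {
  endpoints = [ "master1.example.org", "master2.example.org", "master3.example.org" ]
}
```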

Context

Normally, one would run an HA DB setup with Pacemaker or similar mechanisms, meaning that there is a floating service IP in case something goes wrong. However, in some scenarios there is no possibility to use floating IPs for HA services. Instead, Icinga2 has to connect to the local DB instance directly.

If this is the case, the IDO DB writer role has no chance to switch to another node if one of the local MariaDB instances goes down (while Icinga2 itself is still running).

For me, this is something that happens in real-life projects. I understand if someone says that I should build a different HA DB setup then; however, this would not be a solution, but more of a workaround. Instead, Icinga2 should behave as expected.

Your Environment

  • Version used (icinga2 --version): 2.6.1
  • Operating System and version: SLES 11 SP 1
  • Enabled features (icinga2 feature list): api checker command ido-mysql influxdb mainlog notification
  • Icinga Web 2 version and modules (System - About): Icingaweb 2.4.1, Director, Fileshipper
@xenuser
Contributor Author

xenuser commented May 18, 2017

See also the discussion on the mailing list for further background information:
https://lists.icinga.org/pipermail/icinga-users/2017-May/011992.html

@dnsmichi
Contributor

Keep in mind that (short) connection losses would then also enforce a failover. When HA plays ping-pong, this is even worse. When the database is not available, you probably have other issues already, and it is safe to say that a dedicated MySQL cluster with a central VIP removes that burden from the monitoring core feature.

I wouldn't bother with local databases and MySQL replication, but rather suggest using a MySQL cluster and a VIP. We've seen setups which ran into multiple issues with MySQL replication and slave lag; that's something you can hardly control within Icinga 2 itself.

Btw - you are running 3 endpoints in a zone, which is known to have bugs (#3533). Keep that in mind for now.

I'd update the docs accordingly for such scenarios, but without any MySQL replication involved. If you really need this feature, please sponsor design and development time for it.
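To illustrate the suggested alternative, here is a minimal sketch of an ido-mysql.conf that points at a cluster VIP instead of 127.0.0.1 (the VIP address and credentials are hypothetical):

```
object IdoMysqlConnection "ido-mysql" {
  // Hypothetical VIP in front of a dedicated MySQL cluster,
  // instead of each master talking to its local Galera node.
  host = "10.0.0.100"
  user = "icinga"
  password = "icinga"
  database = "icinga"

  enable_ha = true    // still only one active IDO writer per zone
}
```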

@dgoetz
Contributor

dgoetz commented May 19, 2017

Also keep in mind that there are more possibilities to prevent split-brain and redundant write requests if all Icinga 2 instances reliably write to one database instance using the VIP. Such a scenario typically runs much more stably and requires less support.

And just to add: to make MySQL highly available, keep in mind to stick to an asynchronous mechanism like master-master replication, because you want Icinga 2 to be able to write data as fast as possible.

@tmey

tmey commented May 23, 2017

The problem will also occur with the suggested VIP solution. Even more: it will occur more often because of the additional network connection. Just imagine masters and databases distributed over multiple datacenters...

If you really need this feature, please sponsor design and development time for it.

To do so, we should understand the existing design of the IDO DB HA.
Is there any design documentation?

@gunnarbeutner
Contributor

Unfortunately there's no documentation for that, sorry. :)

@tmey

tmey commented May 23, 2017

Ok. Then let's have a look into the code:

/lib/db_ido_mysql/idomysqlconnection.cpp#L344
If the last entry in the IDO DB from the other instance is older than 60 seconds, the icinga2 instance takes over.

/lib/db_ido_mysql/idomysqlconnection.cpp#L366
But only if NOT IsPaused. What is happening at this point?

@xenuser
Contributor Author

xenuser commented May 24, 2017

@gunnarbeutner No offense: Is there any documentation regarding internal design (e.g. internal message bus, what affects Icinga2's decision to resume or suspend DB activity etc.)?

The community would highly appreciate it if important design aspects like this were written down, so everybody knows what happens in specific situations.

I have another question regarding IDO MySQL and Icinga2:

I have a demo setup with 1 master, 1 satellite, 1 client. The master has IDO DB activated and connects to a local MySQL instance.

Today, I extended this demo setup to:
1 master with local MySQL and IDO DB + NEW: 1 master WITHOUT IDO DB and WITHOUT local MySQL
2 satellites, both in the same zone
1 client

So I just added a new master, but I don't need IDO MySQL there, so I didn't even install it.
I re-configured zones and so on, and soon both masters were happily living together, exchanging information and being in sync (according to the check results).

Soon I noticed a strange behavior: After some time, masterA (with IDO DB) stopped connecting to MySQL. I noticed that if I stop Icinga2 on masterB, masterA re-connects to MySQL again and works fine.

So, I started to dig, and after a while I noticed that "enable_ha = true" is the default configuration setting for IDO MySQL. So I went ahead and extended my ido-mysql.conf with "enable_ha = false" (see the sketch below).
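A minimal sketch of that change, assuming an otherwise default ido-mysql.conf (credentials are placeholders):

```
object IdoMysqlConnection "ido-mysql" {
  host = "127.0.0.1"
  user = "icinga"
  password = "icinga"
  database = "icinga"

  // Disable HA for this feature so masterA keeps writing to its
  // local DB even while masterB is online.
  enable_ha = false
}
```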

After doing this, masterA still works with the local DB even when masterB is online.

My question is: somehow I understand that enable_ha triggers this behavior. But still: why does masterA stop working with MySQL as soon as masterB is online, although masterB doesn't even have IDO MySQL installed (let alone activated)? Shouldn't there be some sort of mechanism which lets masterB signal masterA: "It is all good, you don't need to think about HA since I am not able to take over anyway."?

And: Why is there even an issue/a failover attempt when masterA can reach the local DB?

Another question:
I noticed that the "IDO DB WRITER" role switches back to a node after a failover, as soon as the formerly broken node is online and working again. Is that really what should happen? Why not let the new active IDO DB WRITER stay?

@dnsmichi dnsmichi added question area/distributed Distributed monitoring (master, satellites, clients) area/db-ido Database output labels May 24, 2017
@dnsmichi
Contributor

I really don't like the tone in this issue.

@gunnarbeutner No offense: Is there any documentation regarding internal design (e.g. internal message bus, what affects Icinga2's decision to resume or suspend DB activity etc.)?

The community would highly appreciate it if important design aspects like this were written down, so everybody knows what happens in specific situations.

First off, I think it is offensive that you're now speaking for our community and using that as an argument.

Second, there are no such design documents. There might be some flipchart drawings somewhere, or notes hidden in our backups, but nothing one could share "somewhere".

You can kindly ask for development support and deeper insights into the code. If you need more than the source code and some tips to get things going by yourself, you can also sponsor development time and one of the Icinga devs will take care of it during work hours.

I'd suggest proceeding in a new issue; this one is burned and contains too many questions which are not part of the original request.

@dnsmichi dnsmichi added no-issue Better asked in one of our support channels and removed no-issue Better asked in one of our support channels question labels May 9, 2019