Icinga2 should failover the IDO DB writer when DB is gone #5271
Comments
See also the discussion on the mailing list for further background information.
Keep in mind that (short) connection losses will also enforce a failover then. When HA plays ping pong this is even worse. When the database is not available, you probably have different issues already, and it is safe to say that a dedicated MySQL cluster with a central VIP removes that burden from the monitoring core feature. I wouldn't bother with local databases and MySQL replication in place, but would suggest using a MySQL cluster and a VIP. We've seen setups which ran into multiple issues with MySQL replication and slave lag; that's something you can hardly control within Icinga 2 itself. Btw - you are running 3 endpoints in a zone, which is known to have bugs (#3533). Keep that in mind for now. I'd update the docs accordingly for scenarios, but without any MySQL replication involved. If you really need this feature, please sponsor design and development time for it.
Also keep in mind that there are more possibilities to prevent split-brain and redundant write requests if all Icinga 2 instances reliably write to one instance using the VIP. Such a scenario typically runs much more stably and requires less support. And just to add: to make MySQL highly available, keep in mind to stick to an asynchronous mechanism like master-master replication, because you want Icinga 2 to be able to write data as fast as possible.
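To illustrate the asynchronous master-master replication mentioned above, here is a minimal my.cnf sketch for a hypothetical two-node setup. All values (server IDs, binlog name, offsets) are placeholder assumptions, not taken from this thread:

```ini
# /etc/mysql/my.cnf on node 1
# (node 2 mirrors this with server-id = 2 and auto_increment_offset = 2)
[mysqld]
server-id                = 1
log_bin                  = mysql-bin
# Interleave auto-increment values so concurrent writes on both
# masters cannot collide on auto-generated primary keys.
auto_increment_increment = 2
auto_increment_offset    = 1
```

Each node is additionally configured as a replica of the other; the interleaved auto-increment settings are the standard precaution for active-active writes.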
The problem will also occur with the suggested VIP solution. Even more: it will occur more often with an additional network connection. Just imagine masters and databases distributed over multiple datacenters....
To do so, we should understand the existing design of IDO-DB HA.
Unfortunately there's no documentation for that, sorry. :)
OK. Then let's have a look at the code: /lib/db_ido_mysql/idomysqlconnection.cpp#L344 /lib/db_ido_mysql/idomysqlconnection.cpp#L366
@gunnarbeutner No offense: Is there any documentation regarding the internal design (e.g. the internal message bus, what affects Icinga2's decision to resume or suspend DB activity, etc.)? The community would highly appreciate it if important design aspects like this were written down, so everybody knows what happens in specific situations.

I have another question regarding IDO MySQL and Icinga2: I have a demo setup with 1 master, 1 satellite, 1 client. The master has IDO DB activated and connects to a local MySQL instance. Today, I extended this demo setting to: So I just added a new master, but I don't need IDO MySQL there, so I didn't even install it. Soon I noticed a strange behavior: after some time, masterA (with IDO DB) stopped connecting to MySQL. I noticed that if I stop Icinga2 on masterB, masterA re-connects to MySQL again and works fine. So I started to dig, and after a while I noticed that "enable_ha = true" is a default configuration setting for IDO MySQL. So I went ahead and extended my ido-mysql.conf with "enable_ha = false". After doing this, masterA still works with the local DB when masterB is online.

My question is: Somehow I understand that enable_ha triggers this behavior. But still: why does masterA stop working with MySQL as soon as masterB is online, although masterB does not even have IDO MySQL installed (and not activated)? Shouldn't there be some sort of mechanism which lets masterB signal masterA: "It is all good, you don't need to think of HA since I am not able to take over anyway."? And: why is there even an issue/a failover attempt when masterA can reach the local DB?

Another question:
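For reference, the behavior discussed above is controlled by the `enable_ha` attribute of the IdoMysqlConnection object. A minimal ido-mysql.conf sketch (user, password, and database values are placeholders; `failover_timeout` is the documented attribute for how long the cluster waits before moving the writer role):

```
object IdoMysqlConnection "ido-mysql" {
  user = "icinga"
  password = "icinga"
  host = "127.0.0.1"
  database = "icinga"

  // Disable the HA writer election; this instance always writes.
  enable_ha = false

  // Only relevant with enable_ha = true; must not be lower than 60s.
  failover_timeout = 60s
}
```

With `enable_ha = true` (the default), only the current paused/active election decides which endpoint writes, regardless of whether the other endpoint actually has the ido-mysql feature installed.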
I really don't like the tone in this issue.
First off, I think it is offensive that you're now speaking for our community and using that as an argument. Second, there are no such design documents. There might be some flipchart drawings somewhere, or notes hidden in our backups, but nothing one could share "somewhere". You can kindly ask for development support and deeper insights into the code. If you need more than the source code and some tips to get things going by yourself, you can also sponsor development time, and one of the Icinga devs will take care of it during work hours. I'd suggest proceeding in a new issue; this one is burned and contains too many questions which are not part of the original request.
In a multi-master setup with IDO DB enabled, there is only one instance writing into the DB. If this "active IDO DB writer" is gone (e.g. kill icinga2), the active IDO DB writer role will move to the next master.
This is great, of course. However, when an Icinga2 master can no longer access the database, there is no failover of the IDO DB writer role. Instead, no instance is able to write to the database although all the other masters are able to connect to the database.
Expected Behavior
Icinga2 should fail over when accessing the DB is no longer possible.
This could be done in the same way Icinga2 fails over the IDO DB role when an instance itself is gone (using the same timeout value in the config, for example).
One might also consider that Icinga2 shoots itself in the head after reaching a certain timeout, meaning that Icinga2 shuts down so other Icinga2 instances can take over.
This can be achieved with external tools, such as Pacemaker or Monit, but I think the design of Icinga2 should solve this issue on its own.
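The external-tool workaround mentioned above can be sketched as a small watchdog. This is a hypothetical illustration, not part of Icinga2 or any of the named tools; the host, port, timeout values, and the `systemctl` call are assumptions for a typical setup. It probes the local MySQL port and, once a failover timeout elapses without a successful probe, stops the local icinga2 service so another master's IDO writer can take over:

```python
import socket
import subprocess
import time


def db_reachable(host="127.0.0.1", port=3306, timeout=2.0):
    """Return True if a TCP connection to the DB port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def seconds_until_giveup(last_ok, now, failover_timeout=60.0):
    """Pure decision helper: seconds left before we should step down.

    last_ok is the timestamp of the last successful probe; once
    failover_timeout seconds have passed without one, return 0.
    """
    return max(0.0, failover_timeout - (now - last_ok))


def watchdog(failover_timeout=60.0, interval=5.0):
    last_ok = time.monotonic()
    while True:
        if db_reachable():
            last_ok = time.monotonic()
        elif seconds_until_giveup(last_ok, time.monotonic(),
                                  failover_timeout) == 0.0:
            # Step down so another master's IDO writer can take over.
            subprocess.run(["systemctl", "stop", "icinga2"], check=False)
            return
        time.sleep(interval)
```

A Monit or Pacemaker resource agent would implement essentially the same check-and-stop loop; the point of the issue is that this decision arguably belongs inside Icinga2's own failover logic.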
Current Behavior
3 Icinga2 masters, all with IDO DB, DB host is 127.0.0.1
When the local Galera node is shut off, Icinga2 just seems to wait for the DB to come back online.
Steps to Reproduce (for bugs)
Context
Normally, one would run a HA DB setup with Pacemaker or similar mechanics, meaning that there is a floating service IP in case something goes wrong. However, in some scenarios there is no possibility to use floating IPs for HA services. Instead, Icinga2 has to connect to the local DB instance directly.
If this is the case, the IDO DB writer role has no chance to switch to another node if one of the local MariaDB instances goes down (while Icinga2 is still running).
For me, this is something which happens in real-life projects. I understand if someone says that I should build other HA DB setups then; however, this would not be a solution, but more of a workaround. Instead, Icinga2 should behave as expected.
Your Environment
Icinga2 version (icinga2 --version): 2.6.1
Enabled features (icinga2 feature list): api checker command ido-mysql influxdb mainlog notification