possible performance/crash issue #13
Replies: 6 comments 3 replies
-
Hi! Huge thanks for the feedback! This behaviour is expected; it is a deliberate consequence of several design decisions I made. Let me explain a bit. History export modules are not first-class citizens in Zabbix. There is no caching or buffering and no retries implemented on the Zabbix side that would allow modules to pause sending data to alternative backends and catch up later. The module has a single chance to send data to the storage backend. If the backend is having performance issues, the module is basically caught between a hammer and an anvil. The following options are available (and none of them is particularly good):
In a way, the current module's behaviour mirrors the relationship between the Zabbix database and Zabbix server: when the database performs poorly, Zabbix may have trouble too. That's why it is very important to monitor the health of the Zabbix database and, if you are using this module, you should monitor the availability of InfluxDB as well, preferably with an independent monitoring setup. I understand that the current module's behaviour may not fit all use cases, so please let me know your thoughts on this topic. Would you prefer option 1? It is relatively easy to add a configuration parameter for this strategy. Or do you want me to pursue option 2? With enough support from other users I might give this a try. Again, with some support from other users, we can even push Zabbix to implement buffering on their side.
-
I have read that influxdb-relay can do the buffering. Maybe you can plug it into your setup between the module and InfluxDB?
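To make the buffering idea concrete, here is a minimal sketch in C of a bounded in-memory buffer placed between a producer and a slow backend. It is not how influxdb-relay is implemented; `line_buffer`, `buffer_push`, `buffer_pop`, the tiny capacity and the drop-oldest policy are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_CAPACITY 4    /* hypothetical, deliberately tiny for illustration */
#define LINE_MAX_LEN 128  /* hypothetical maximum length of one buffered line */

typedef struct {
    char   lines[BUF_CAPACITY][LINE_MAX_LEN];
    size_t head;     /* index of the oldest buffered line */
    size_t count;    /* number of lines currently buffered */
    size_t dropped;  /* lines overwritten because the buffer was full */
} line_buffer;

void buffer_init(line_buffer *b)
{
    assert(b != NULL);
    memset(b, 0, sizeof(*b));
}

/* Append a line. When the buffer is full, the oldest line is overwritten,
 * so the producer never blocks on a slow backend. */
void buffer_push(line_buffer *b, const char *line)
{
    size_t slot;

    assert(b != NULL && line != NULL);
    if (b->count == BUF_CAPACITY) {
        b->head = (b->head + 1) % BUF_CAPACITY;
        b->count--;
        b->dropped++;
    }
    slot = (b->head + b->count) % BUF_CAPACITY;
    strncpy(b->lines[slot], line, LINE_MAX_LEN - 1);
    b->lines[slot][LINE_MAX_LEN - 1] = '\0';
    b->count++;
}

/* Copy the oldest line into out. Returns 1 on success, 0 if the buffer is empty. */
int buffer_pop(line_buffer *b, char *out, size_t out_len)
{
    assert(b != NULL && out != NULL && out_len > 0);
    if (b->count == 0)
        return 0;
    strncpy(out, b->lines[b->head], out_len - 1);
    out[out_len - 1] = '\0';
    b->head = (b->head + 1) % BUF_CAPACITY;
    b->count--;
    return 1;
}
```

A consumer would call `buffer_pop` in a loop whenever the backend accepts writes again. The `dropped` counter is there so that any data loss is at least visible to monitoring.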
-
Hi, thanks for the feedback. It helps my understanding, and at least we have confirmed that the behaviour is by design. It would be nice to have a configurable option to drop data on timeout or similar, but I think it should be off by default to retain the current behaviour for everyone. In the short term I am most likely going to look at one of the influx relay binaries to solve this problem, while I review the DB scaling issue I have on the Influx side, which looks entirely separate from and unrelated to the Zabbix data feed.
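A configurable drop-on-failure option along these lines could be sketched roughly as follows. Everything here is hypothetical: `export_config`, `discard_on_failure`, `max_attempts` and `export_batch` are illustrative names rather than parameters of the module, and the `write_fn` callback stands in for the real HTTP request to InfluxDB. The current blocking behaviour is modelled as retrying until the write succeeds, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical write callback: returns 0 on success, non-zero on timeout or error. */
typedef int (*write_fn)(const char *payload, void *ctx);

typedef struct {
    bool     discard_on_failure; /* hypothetical option; false keeps the blocking behaviour */
    unsigned max_attempts;       /* failed attempts tolerated before discarding */
} export_config;

typedef enum { EXPORT_SENT, EXPORT_DISCARDED } export_result;

/* Try to send one batch. With discard_on_failure off, keep trying until the
 * backend accepts the data; with it on, give up after max_attempts failures. */
export_result export_batch(const export_config *cfg, write_fn writer, void *ctx,
                           const char *payload)
{
    unsigned failures = 0;

    assert(cfg != NULL && writer != NULL);
    for (;;) {
        if (writer(payload, ctx) == 0)
            return EXPORT_SENT;
        failures++;
        if (cfg->discard_on_failure && failures >= cfg->max_attempts)
            return EXPORT_DISCARDED;
    }
}

/* Stand-in backend for trying the sketch: fails a fixed number of times, then succeeds. */
typedef struct {
    int failures_left;
    int calls;
} fake_backend;

int fake_write(const char *payload, void *ctx)
{
    fake_backend *backend = ctx;

    (void)payload;
    backend->calls++;
    if (backend->failures_left > 0) {
        backend->failures_left--;
        return -1;
    }
    return 0;
}
```

With `discard_on_failure` set to `false` the function behaves like the existing design and only returns once the write goes through, so existing users would see no change unless they opt in.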
-
You are welcome! Feel free to throw all your questions at me. Please share your findings afterwards. I think they will make a good addition to the documentation.
-
@anthonysomerset, |
-
It’s not ideal, but it would make sense if this behaviour could be turned on and off depending on the user's preference. Perhaps default to not invoking the discard behaviour, to preserve the current behaviour for existing users.
That being said, we have moved away from using this module because of the availability of native support for TimescaleDB and compression in Zabbix, which means fewer moving parts for us to maintain.
On 18 Jul 2022, at 03:26, i-ky ***@***.***> wrote:
Official loadable module documentation<https://www.zabbix.com/documentation/current/en/manual/config/items/loadablemodules#providing-history-export-callbacks> now says:
In case of internal error in history export module it is recommended that module is written in such a way that it does not block whole monitoring until it recovers but discards data instead and allows Zabbix server to continue running.
@anthonysomerset<https://github.com/anthonysomerset>, do you think that module's behaviour needs to be changed in order to match the above recommendations?
-
Hi there
I need to caveat that this is an "I think" scenario, not confirmed as yet.
I appear to be having routine performance issues with Influx. I would not expect that to be an issue here, but it appears that, as a result of those issues, the Zabbix module is getting backed up trying to write data to Influx and is not timing out or failing in any way.
This appears to cause some silent backlog in Zabbix, and suddenly no data at all is being inserted into Zabbix.
Restarting Zabbix server and/or influxd seems to fix it (I am trying to narrow down which specific daemon at present).
I am raising this somewhat pre-emptively in case there is an obvious reason why the Influx write would block up Zabbix server like this.
EDIT: semi-confirmed this. My influxd instance is currently disk IO constrained, and this backup seems to be causing the writer on the Zabbix side to get stuck and everything else to get stuck behind it.
The moment I stop Influx, the connection drops and things resume nearly immediately.