Icinga2 doesn't close connections #7203
Hi, are you able to reproduce this behaviour with the snapshot packages? Looking at netstat, all your clients are configured to connect to the master.
How exactly are these checks executed, are these command endpoint checks? Could this be reproduced with a simple firewall in between, or does it need an unresponsive client cutting off the connection? Cheers,
Hello @dnsmichi. Regarding how these checks are executed:
True, our agents are configured to connect to the master (the endpoint contains "host"). I think we can test it with a firewall, but we have to catch the right moment. Thanks, Ivan
If your custom repository uses ours as upstream, you may just switch it from release to snapshot. If you build your own packages, you may just build master instead of a tag. Otherwise please share the setup and maybe I'll be able to explain how to get snapshot packages there.
@Punkoivan Here's a howto I wrote a while ago for manually fetching packages for a local DMZ. Maybe this helps; it would be great to know whether only 2.10 is affected or we are dealing with something breaking for 2.11. https://community.icinga.com/t/installation-in-a-dmz-without-internet-access/200 Thanks in advance!
Hello @Al2Klimov, we use essentially 2.10; we only fixed some dependencies (package names, actually), so I believe it's almost the vanilla version from your repository. @dnsmichi, unfortunately, our processes are too slow to update the package and we do not have a "playground" like a staging environment yet - still waiting for servers. What we can do is simulate this by enabling port filtering on the stage servers (where the agent is installed) with our current version. Please let me know if this would help.
I'm not sure if that helps. Since you're on 2.10.2, and not 2.10.5, it is hard to tell where this originates from. For 2.10.x I've fixed a similar problem with TLS shutdowns, but I am not sure whether that's related: #6718 In terms of 2.11, I'll merge the linked PR and would be glad if you could still test it somewhere, e.g. with a newly built stage. I would assume that this doesn't need that many agents, but rather a special configuration and, especially, network settings. Are you on a low-latency connection to these clients?
I did some tests with 2.10.2 and 2.11 and wasn't able to reproduce the behaviour. I'm leaving this issue open for after 2.11 and your ongoing tests, but I'm removing it as a task for the new release.
Makes sense, thank you. And thank you for a great product, guys!
There's an RC planned prior to the release, scheduled for CW30 approx. Would be nice if you can test this one then too. |
We're on our way to set up a staging environment, so we will be able to change versions more efficiently.
I can see some TIME_WAIT things, but no CLOSE_WAIT in my tests. |
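To compare results like the above across nodes, a quick way to tally the TCP states on the API port is to count the state column of netstat/ss output. The snippet below is a minimal sketch using stand-in sample data (hypothetical addresses) so the pipeline is reproducible; on a live node you would feed it real output instead, e.g. `netstat -tan | grep 5665`, and count the last column the same way:

```shell
# Tally TCP connection states seen on the Icinga API port (5665 by default).
# Sample data stands in for real netstat output; the state is in column 6.
netstat_sample='tcp 0 0 10.0.0.1:5665 10.0.1.5:41234 ESTABLISHED
tcp 0 0 10.0.0.1:5665 10.0.1.6:41235 CLOSE_WAIT
tcp 0 0 10.0.0.1:5665 10.0.1.7:41236 CLOSE_WAIT
tcp 0 0 10.0.0.1:5665 10.0.1.8:41237 TIME_WAIT'

# Print each state with its count, most frequent first.
printf '%s\n' "$netstat_sample" | awk '{print $6}' | sort | uniq -c | sort -rn
```

A growing CLOSE_WAIT count is the signature reported in this issue: the peer closed its end, but the local process never closed its own socket.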
@Punkoivan 2.11 RC1 packages are available to test, please do so: https://icinga.com/2019/07/25/icinga-2-11-release-candidate/ |
Hi all, I just wanted to add to this thread because we have a similar issue on version 2.11.2.

We are beginning to deploy Icinga in a very large organization and stumbled upon this issue because we are still in the process of opening some ports; for example, port 5665 may be open in one direction only. In this situation we see lots of UNKNOWN or PENDING states in the web console, but more importantly, the number of open connections on the Icinga satellites skyrockets, causing the service to crash when it hits the configured limit (we have set it to 200,000 for the moment).

While we are fixing the firewall rules, it would be appropriate in any case that open connections do not keep growing without bound and crash the Icinga service on the satellites whenever ports are closed or there is some other network-related issue.
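As a stopgap while the firewall rules (and the leak itself) are being addressed, the per-service descriptor limit can be raised on systemd-based systems with a drop-in override. This only buys time; it does not stop connections from piling up. A sketch, assuming the stock `icinga2` unit name and the 200,000 figure mentioned above:

```shell
# Stopgap only: raise the file-descriptor limit for the icinga2 unit so the
# satellite survives longer while the leaking connections are investigated.
sudo mkdir -p /etc/systemd/system/icinga2.service.d
sudo tee /etc/systemd/system/icinga2.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=200000
EOF
sudo systemctl daemon-reload
sudo systemctl restart icinga2
```

A drop-in is preferable to editing the packaged unit file, since it survives package upgrades.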
Hey there. |
How to reproduce
@Punkoivan If the snapshot packages are still affected, please also test the packages from here/here – pick your OS from the "binary"/"Build" column and then "Download" the "Job artifacts". |
@lippserd - absolutely, attached here. Filtered for Icinga and pulled out the host identifiers.
Is the agent in question already signed? |
Could you share the agent log please? |
Here are the sanitized agent logs you requested from @heyjodom. The certificate for the agent was not auto-signed on the master. I manually signed it with the 'icinga2 ca sign' command and restarted the agent, which seemed to resolve the issue. Master version:
Agent Version
If you want any other info or diagnostics, I'm happy to provide it. This is an HA environment with 2 masters, and each zone has 2 satellites.
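For anyone hitting the same unsigned-agent symptom, the manual signing flow mentioned above looks roughly like this on the master (the fingerprint below is a placeholder for whatever `icinga2 ca list` actually prints):

```shell
# On the signing master: list pending certificate requests ...
icinga2 ca list

# ... then sign the request for the affected agent.
# "5c31ca0e..." is a truncated placeholder fingerprint.
icinga2 ca sign 5c31ca0e...

# Finally, restart the agent on the client so it picks up the signed certificate:
systemctl restart icinga2
```

If the request no longer shows up in `icinga2 ca list`, it may already have been processed; `icinga2 ca list --all` shows processed requests as well.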
@Al2Klimov @lippserd Agreed -- this looks like #7680. Any chance anything is planned to backport that fix from 2.12 to 2.11.x? We tried upgrading a problematic system to 2.11.4 (we were on 2.11.3) and saw the same behavior. We're also looking at running the 2.12 RC client on our side.
We're still on 2.10 =( I will not make promises, since it's hard to say when we're going to update to 2.11.x.
We have the same problem from time to time as well. We also have a two-node master setup and we sign the agents' certificates with the icinga2 ca sign command. We currently run 2.10.7-1 (because I have another problem with 2.11.x at the moment and wanted to wait for 2.12).
Which problem? |
Well, I tried to upgrade to 2.11.3, but it was not working. The clients were not able to connect to the master servers. I needed to get things up and running again quickly, so I sadly didn't have a lot of time to troubleshoot. It looked like the RPC issue to me, but that's just a guess.
Same problem here: too many open files.
Closed this issue because #8292 has been merged. |
Someone closed this issue just now, but due to a missing feature in the GitHub API and the high amount of comments here I can't figure out whether this issue was closed due to a PR merge. Please check by yourself whether this issue is on the correct milestone. |
Hi, we have the same issue in our infrastructure. We are running two Icinga 2 servers (CentOS 7) with version icinga2-2.13.2-1 installed. We increased the open file limit for the system to 65535, but raising the limit didn't solve the problem: the open files simply grow until the next limit is reached. After a restart of the icinga2 service the open files are cleaned up, but only until the limit is hit again. What we found out is that established connections between the Icinga server and the clients are not closed, for example:
Is there a possibility to set a timeout for the connections, or does this have to be fixed in the software package?
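On the timeout question: recent Icinga 2 releases expose some timeout attributes on the ApiListener object, but as far as I can tell they only cover connection setup (TLS handshake, outgoing connect attempts), not reaping established-but-dead connections, which is the symptom here. A sketch, to be verified against the ApiListener documentation for your exact version, since attribute availability differs between releases:

```
/* Edit the existing ApiListener definition, typically in
 * /etc/icinga2/features-available/api.conf. Sketch only: confirm both
 * attributes exist in your version's ApiListener documentation first. */
object ApiListener "api" {
  tls_handshake_timeout = 10s  // drop peers stuck in the TLS handshake
  connect_timeout = 15s        // give up on outgoing connection attempts
}
```

Neither attribute would close the leaked established connections you describe, so opening a new issue as suggested below is likely the right path.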
Hi @janfickler, could you please open a new issue with your findings? Make sure to share some details about your zone setup, e.g. versions, connection directions, and the number of satellites and/or agents. Logs of an example node which has more than one connection would be great as well. All the best,
Describe the bug
We have 50% of hosts in DOWN state and 50% of checks in UNKNOWN state.
The error is:
Checked on backend:
Then checked open files for icinga process:
And few connections for example:
Our /var mount point is full of these messages:
As a result, Icinga cannot process any checks, but in IcingaWeb2 we see that monitoring health is fine.
It was our second master (HA configured), and checks have not actually been moved to the other master, because the health check says: "folks, everything is OK".
To Reproduce
I am not really sure how we can reproduce this.
For me it looks like this:
Expected behavior
Icinga closes connections if some threshold is reached and there is no answer from the remote host.
It shouldn't break anything.
BUT if this occurs, at least this master should be marked "unhealthy" and all checks should be rebalanced to the other master in an HA setup.
Your Environment
Include as many relevant details about the environment you experienced the problem in:
- Version used (icinga2 --version):
- Enabled features (icinga2 feature list):
- Icinga Web 2 modules:
  audit | 1.0.0
  businessprocess | 2.1.0
  director | 1.6.2
  doc | 2.6.2
  fileshipper | 1.1.0
  grafana | 1.3.4
  monitoring | 2.6.2
- Config validation (icinga2 daemon -C):
- zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes: since we're using agents, we have a lot of Endpoints defined.
- HA has been working fine for 2 months.
Additional context
We will increase open files to some value, but I truly believe that this is a critical bug.
Please let me know if I can provide more information.