
Icinga2 doesn't close connections #7203

Closed
Punkoivan opened this issue May 28, 2019 · 38 comments · Fixed by #7864
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working
Comments

@Punkoivan

Describe the bug

We have 50% of hosts in DOWN state and 50% of checks in UNKNOWN state.
The error is:

Exception occurred while checking 'some_host!ip_ipmi': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'

Checked on the backend:

/var/log/icinga2/icinga2.log
[2019-05-27 14:14:13 -0700] critical/ApiListener: Cannot accept new connection.
[2019-05-27 14:14:13 -0700] critical/Socket: accept() failed with error code 24, "Too many open files"

Then checked open files for icinga process:

[ipunko@prod-mon2 ~]$ pgrep icinga
3197
3246
[ipunko@prod-mon2 ~]$ sudo lsof -p 3197 | wc -l
16433
[ipunko@prod-mon2 ~]$ netstat -tonp | grep CLOSE_WAIT | wc -l
(No info could be read for "-p": geteuid()=45452 but you should be root.)
16305

And a few connections, for example:

tcp      295      0 10.8.192.109:5665       10.8.44.169:58020       CLOSE_WAIT  -                    off (0.00/0/0)
tcp      309      0 10.8.192.109:5665       10.8.145.16:43634       CLOSE_WAIT  -                    off (0.00/0/0)
tcp      309      0 10.8.192.109:5665       10.8.145.14:39628       CLOSE_WAIT  -                    off (0.00/0/0)
tcp      309      0 10.8.192.109:5665       10.8.145.13:42508       CLOSE_WAIT  -                    off (0.00/0/0)
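(Editor's aside: the descriptor count can also be read straight from /proc, without root or lsof. A minimal, self-contained sketch; the shell's own PID stands in for the icinga2 PID here so the snippet runs anywhere.)

```shell
# Count open file descriptors of a process via /proc (Linux).
# Substitute the icinga2 PID (e.g. from `pgrep icinga2`) for $$;
# the current shell's PID is used only to keep the example self-contained.
pid=$$
ls "/proc/$pid/fd" | wc -l
```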

Our /var mount point is filling up with these messages:

[root@prod-mon2 log]# egrep -i 'too many open files' icinga2.log |wc -l 
96571507

As a result, Icinga cannot process any checks, but in Icinga Web 2 we see that the monitoring health is fine.

This was our second master (HA configured), and the checks were not actually moved to the other master, because the health check says "folks, everything is OK".

To Reproduce

I am not really sure how we can reproduce this.
For me it looks like this:

  1. Icinga processes checks
  2. The network connection broke while checks were running
  3. Icinga keeps waiting for the connection to be closed (CLOSE_WAIT TCP state)

Expected behavior

Icinga should close connections if some threshold is reached and there is no answer from the remote host.
It shouldn't break anything.
BUT if this occurs, at least this master should be marked "unhealthy" and all checks should be rebalanced to the other master in case of HA.

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version):

icinga2 - The Icinga 2 network monitoring daemon (version: r2.10.2-1)
Copyright (c) 2012-2018 Icinga Development Team (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
System information:
Platform: CentOS Linux
Platform version: 7 (Core)
Kernel: Linux
Kernel version: 3.10.0-862.el7.x86_64
Architecture: x86_64
Build information:
Compiler: GNU 4.8.5
Build host: unknown
Application information:
General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2
Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var
Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

  • Operating System and version:

[ipunko@prod-mon2 ~]$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

  • Enabled features (icinga2 feature list):

[ipunko@prod-mon2 ~]$ sudo icinga2 feature list
Disabled features: compatlog debuglog elasticsearch gelf graphite influxdb livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker command ido-mysql mainlog notification

  • Icinga Web 2 version and modules (System - About):

audit | 1.0.0
businessprocess | 2.1.0
director | 1.6.2
doc | 2.6.2
fileshipper | 1.1.0
grafana | 1.3.4
monitoring | 2.6.2

  • Config validation (icinga2 daemon -C):

[ipunko@prod-mon2 ~]$ sudo icinga2 daemon -C
[2019-05-28 01:08:52 -0700] information/cli: Icinga application loader (version: r2.10.2-1)
[2019-05-28 01:08:52 -0700] information/cli: Loading configuration file(s).
[2019-05-28 01:08:52 -0700] information/ConfigItem: Committing config item(s).
[2019-05-28 01:08:52 -0700] information/ApiListener: My API identity: prod-mon2.com
[2019-05-28 01:08:52 -0700] warning/ApplyRule: Apply rule 'tools_notification_hosts' (in /var/lib/icinga2/api/zones/master/director/notification_apply.conf: 19:1-19:53) for type 'Notification' does not match anywhere!
[2019-05-28 01:08:52 -0700] warning/ApplyRule: Apply rule 'Web_service_check_ports' (in /var/lib/icinga2/api/zones/director-global/director/service_apply.conf: 15:1-15:75) for type 'Service' does not match anywhere!
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1139 Services.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1 IcingaApplication.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 106 Hosts.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1 EventCommand.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1 FileLogger.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 2 NotificationCommands.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1241 Notifications.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1 NotificationComponent.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 9 HostGroups.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1 ApiListener.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 11 Comments.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1 CheckerComponent.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 47 Zones.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 46 Endpoints.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 4 ApiUsers.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 3 Users.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 221 CheckCommands.
[2019-05-28 01:08:52 -0700] information/ConfigItem: Instantiated 1 ServiceGroup.
[2019-05-28 01:08:52 -0700] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2019-05-28 01:08:52 -0700] information/cli: Finished validating the configuration file(s).

  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
    Since we're using agents, we have a lot of Endpoints defined.
    HA has been working fine for 2 months.

Additional context

[ipunko@prod-mon2 ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514843
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
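(Editor's note: the 1024 default above is what the daemon exhausts. As a stop-gap while the leak is investigated, the per-service limit can be raised with a systemd drop-in. This is an illustrative sketch, not official Icinga guidance; the drop-in path, the helper name `write_fd_limit_dropin`, and the 65536 value are all assumptions.)

```shell
# Write a systemd drop-in that raises the open-file limit for a service.
# The function only creates the file; applying it needs root and a reload.
write_fd_limit_dropin() {
  dir="$1"
  mkdir -p "$dir"
  printf '[Service]\nLimitNOFILE=65536\n' > "$dir/limits.conf"
}
# usage (as root):
#   write_fd_limit_dropin /etc/systemd/system/icinga2.service.d
#   systemctl daemon-reload && systemctl restart icinga2
```

Note that raising the limit only buys time: leaked CLOSE_WAIT sockets will eventually exhaust any limit.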

We will increase open files to some value, but I truly believe that this is a critical bug.

Please let me know if I can provide more information.

@dnsmichi
Contributor

Hi,

are you able to reproduce this behaviour with the snapshot packages?

Looking at netstat, all your clients are configured to connect to the master.

Network connection was broke while checks were in place

How exactly are these checks executed, are these command endpoint checks? Could this be reproduced with a simple firewall in between, or does it need an unresponsive client cutting off the connection?

Cheers,
Michael

@dnsmichi dnsmichi added area/distributed Distributed monitoring (master, satellites, clients) needs feedback We'll only proceed once we hear from you again labels May 28, 2019
@dnsmichi dnsmichi added this to the 2.11.0 milestone May 28, 2019
@dnsmichi dnsmichi added the blocker Blocks a release or needs immediate attention label May 28, 2019
@Punkoivan
Author

Hello @dnsmichi
unfortunately we cannot test it from snapshot due to our procedures; we have to use our own repository.

About how these checks are executed:
host checks are executed from the master host; service checks are executed both from the master (SNMP) and via the agents.

Looking at netstat, all your clients are configured to connect to the master.

True, our agents are connected to the master (the "host" attribute is set in the Endpoint).

I think we can test it with a firewall, but we have to catch the right moment.

Thanks, Ivan

@Al2Klimov
Member

If your custom repository uses ours as upstream, you may just switch it from release to snapshot. If you build your own packages, you may just build master instead of a tag. Otherwise, please share your setup and maybe I'll be able to explain how to get snapshot packages there.

@dnsmichi
Contributor

@Punkoivan Here's a howto I wrote a while ago for manually fetching packages for a local DMZ. Maybe this helps, it would be great to know whether only 2.10 is affected or we are dealing with something breaking for 2.11.

https://community.icinga.com/t/installation-in-a-dmz-without-internet-access/200

Thanks in advance!

@Punkoivan
Author

Hello @Al2Klimov, we used 2.10 almost unchanged, just fixed some dependencies (package names, actually), so I believe it's almost the vanilla version from your repository.

@dnsmichi, unfortunately, our processes are too slow to update the package and we do not have a "playground" like a stage environment; we're still waiting for servers.

What we can do: we can simulate this by enabling port filtering on stage servers (where the agent is installed) with our current version.

Please let me know if this can help.

@dnsmichi
Contributor

dnsmichi commented Jun 3, 2019

I'm not sure if that helps. Since you're on 2.10.2 too, and not 2.10.5 it is hard to tell where this originates from. For 2.10.x I've fixed a similar problem with TLS shutdowns, but I am not sure if that's related: #6718

In terms of 2.11, I'll merge the linked PR and would be glad if you could still test them somewhere, e.g. with a newly built stage. I would assume that this doesn't need that many agents, but a special configuration and especially, network setting.

Are you on a low latency connection to these clients?

@dnsmichi
Contributor

dnsmichi commented Jun 5, 2019

I did some tests with 2.10.2 and 2.11 and wasn't able to reproduce the behaviour. I'm leaving this issue open for after 2.11 and your ongoing tests, but I'm removing it as a task for the new release.

@dnsmichi dnsmichi removed the blocker Blocks a release or needs immediate attention label Jun 5, 2019
@dnsmichi dnsmichi removed this from the 2.11.0 milestone Jun 5, 2019
@Punkoivan
Author

Makes sense, thank you.
We're planning to update to 2.11 as soon as it's released.

Thank you for a great product, guys!
We will help improve it, at least with testing and issue reports :)

@dnsmichi
Contributor

There's an RC planned prior to the release, scheduled for approximately CW30. It would be nice if you could test that one too.

@Punkoivan
Author

We're on our way to setting up a stage environment, so we will be able to change versions more efficiently.

@dnsmichi
Contributor

I can see some TIME_WAIT things, but no CLOSE_WAIT in my tests.

@dnsmichi
Contributor

@Punkoivan 2.11 RC1 packages are available to test, please do so: https://icinga.com/2019/07/25/icinga-2-11-release-candidate/

@drapiti

drapiti commented Dec 18, 2019

Hi all, I just wanted to add to this thread because we have a similar issue with version 2.11.2. Since we are beginning to deploy Icinga in a very large organization, we have stumbled upon this issue while still in the process of opening some ports; for example, port 5665 may be open in one direction only. In this condition we see lots of UNKNOWN or PENDING states in the web console, but more importantly, the open connections on the Icinga satellites are skyrocketing, causing the service to crash when it hits a hard limit (we have set it to 200,000 for the moment). While we are fixing the firewall rules, it would in any case be appropriate that open connections do not keep increasing when ports are closed or there is any kind of network-related issue, eventually crashing the Icinga service on the satellites.

@Punkoivan
Author

Hey there.
Unfortunately, we still have not tested 2.11.2, but tomorrow we are going to build it for FreeBSD.
But now I am a little frustrated, since we were hoping that this bug was fixed..

@Al2Klimov Al2Klimov self-assigned this Feb 27, 2020
@Al2Klimov
Member

How to reproduce

  • Two masters
  1. Ensure they're connected via netstat
  2. On one master: iptables -I INPUT -p tcp -m tcp --dport 5665 -j DROP ; iptables -I OUTPUT -p tcp -m tcp --dport 5665 -j DROP
  3. On the other one, wait for the reconnect and check netstat:
tcp        0      1 116.203.71.17:56330     116.203.67.127:5665     SYN_SENT    106        37752      7295/icinga2
tcp        0    661 116.203.71.17:5665      116.203.67.127:55306    ESTABLISHED 106        36448      7295/icinga2

@Al2Klimov
Member

@Punkoivan If the snapshot packages are still affected, please also test the packages from here/here – pick your OS from the "binary"/"Build" column and then "Download" the "Job artifacts".

@Al2Klimov Al2Klimov removed their assignment Feb 27, 2020
Al2Klimov added a commit to Al2Klimov/icinga2 that referenced this issue Jun 17, 2020
@heyjodom

@lippserd - absolutely, attached here. Filtered for Icinga and pulled out host identifiers.

icinga2-network-connects-sanitized.txt

@lippserd
Member

Is the agent in question already signed?

@lippserd
Member

Could you share the agent log please?

@cconstantakis

Here are the sanitized agent logs you requested from @heyjodom
icinga2-san.log

The certificate for the agent was not autosigned on the master. I manually signed it with the 'icinga2 ca sign' command and restarted the agent which seemed to resolve the issue.

Master Version

icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.3-1)

Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1127.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-LTrJQZ9N-project-322-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Agent Version

icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.3-1)

Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1127.8.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-LTrJQZ9N-project-322-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

If you want any other info or diagnostics, I'm happy to provide it.

This is a HA environment with 2 masters and each zone has 2 satellites.

@Al2Klimov
Member

@N-o-X Shall we close this as a dup of #7680?

@heyjodom

@Al2Klimov @lippserd Agreed -- this looks like #7680

Any chance anything is planned to backport that fix from 2.12 to 2.11.x? We tried upgrading to 2.11.4 on a problematic system (we were on 2.11.3) and saw the same behavior.

Also looking on our side at running the 2.12rc client.

@Punkoivan
Author

We're still on 2.10 =( I will not make promises, since it's hard to say when we're going to update to 2.11.x.

@Duffkess

We have the same problem as well from time to time. We also have a two-node master setup, and we sign the agents' certificates with the icinga2 ca sign command. We currently run 2.10.7-1 (because I have another problem with 2.11.x currently and wanted to wait for 2.12).
I see a lot of reconnections in the log, but it doesn't matter whether the client certificate is already signed or not (other than mentioned in #7680).
For example, I have one host which currently has 3000 TCP sessions to one master server (inbound). I don't know why this one has so many; it's not larger or checked more often than other hosts.
Maybe the network connection to this site is just not that good.
I will use a script to track current TCP sessions on the master for better troubleshooting.
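(Editor's sketch of such a tracking script, for anyone following along. The port 5665 filter, the `ss -tan` input format, and the helper name `count_states` are assumptions, not part of the original report.)

```shell
# Count TCP connections per state on the Icinga API port.
# Expects `ss -tan`-style input on stdin: state in column 1,
# local address:port in column 4.
count_states() {
  awk -v port=5665 '
    $4 ~ ":" port "$" { states[$1]++ }        # match local port only
    END { for (s in states) print s, states[s] }
  '
}
# e.g. run periodically from cron:
#   ss -tan | count_states >> /var/log/icinga2-conns.log
```

Logging this over time makes it easy to spot a host whose CLOSE_WAIT count only ever grows.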

@Al2Klimov
Member

I have another problem with the 2.11.x currently

Which problem?

@Duffkess

Duffkess commented Jul 25, 2020

I have another problem with the 2.11.x currently

Which problem?

Well, I tried to upgrade to 2.11.3 but it was not working. The clients were not able to connect to the master servers. I needed to get things up and running again quickly, so sadly I didn't have a lot of time to troubleshoot. It looked like the RPC issue to me, but that's just a guess.
I will try to upgrade to 2.11.4 during a maintenance window over the weekend, for better troubleshooting if this issue persists.
Note: my setup might be a little special:
I have 2 HA masters and a lot of agents connecting without any satellites. 75% of my connection directions are master > client, the rest is client > master.
I will update you in a few weeks on how things went with 2.11.4.

@sebek72

sebek72 commented Sep 23, 2020

same problem here: too many open files
version: 2.11.2-1

@icinga-probot
Contributor

icinga-probot bot commented Oct 13, 2020

Closed this issue because #8292 has been merged.

@icinga-probot icinga-probot bot closed this as completed Oct 13, 2020
@icinga-probot
Contributor

icinga-probot bot commented Oct 13, 2020

Someone closed this issue just now, but due to a missing feature in the GitHub API and the high amount of comments here I can't figure out whether this issue was closed due to a PR merge. Please check by yourself whether this issue is on the correct milestone.

@Al2Klimov Al2Klimov removed the needs feedback We'll only proceed once we hear from you again label Sep 7, 2021
@janfickler

Hi,

we have the same issue in our infrastructure. We are running two Icinga 2 servers (CentOS 7) with version icinga2-2.13.2-1 installed.

We increased the open file limit for the system to 65535,
and under /etc/sysconfig/icinga2 the RLIMIT also to 65535.

But increasing the file limits didn't solve the problem, because the open files then just grow to the next limit. After a restart of the icinga2 service the open files are cleaned up, but only until the limit is reached again.

What we found out is that the established connections between the Icinga server and the clients are not closed.

for example:

sudo netstat -antp | grep icinga2

tcp        0      0 xx.xxx.xxx.xx:5665      xx.xxx.xxx.xxx:35146    ESTABLISHED 109943/icinga2
tcp        0      0 xx.xxx.xxx.xx:5665      xx.xxx.xxx.xxx:38910    ESTABLISHED 109943/icinga2
...

Is there a possibility to set a timeout for the connections, or does this have to be fixed in the software package?
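(Editor's note on the timeout question: Linux does have kernel-level TCP keepalive settings that can eventually reap half-dead connections, but they only apply to sockets that enable SO_KEEPALIVE, and whether Icinga 2's API connections do is not established in this thread. A quick way to inspect the current values:)

```shell
# Print the kernel's TCP keepalive settings (Linux; assumes /proc/sys is
# mounted). These affect only sockets that set SO_KEEPALIVE.
for f in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
  printf '%s = %s\n' "$f" "$(cat /proc/sys/net/ipv4/$f)"
done
```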

@lippserd
Member

Hi @janfickler,

Could you please open a new issue with your findings? Make sure to share some details about your zone setup, e.g. versions, connection directions, number of satellites and/or agents. Logs of an example node which has more than one connection would be great as well.

All the best,
Eric
