[dev.icinga.com #9295] 100 Million Records in Log file: Read from event FD failed #3029

Closed
icinga-migration opened this issue May 19, 2015 · 14 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients) area/windows Windows agent and plugins bug Something isn't working

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/9295

Created by TechIsCool on 2015-05-19 23:49:12 +00:00

Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-07-26 22:51:06 +00:00 (in Redmine)

Icinga Version: 3.2.4
Backport?: Not yet backported
Include in Changelog: 1

Today I had a virtual NFS disk lag due to external issues unrelated to Icinga. On one of my Windows servers, I almost ran out of disk space. Icinga notified me correctly, but it was Icinga itself that was consuming all the space with its log file.

I have a copy of the data, but I don't think anyone wants the 10+ GB files, so I have trimmed it down to just the errors that occurred, with their counts. If you want to see a section, let me know and I will extract it from the file.

This is from icinga2.log, not debug.log, which is where I would have expected to see a file of this size.

2 critical/TcpSocket: getaddrinfo() failed with error code 11001, "No such host is known. "
2 information/Application: Received request to shut down.
2 information/Application: Shutting down...
2 information/ConfigItem: Activated all objects.
3 critical/ApiListener: Cannot connect to host 'abydos.domainname.com' on port '5665'
3 information/ApiClient: No messages for identity 'vault.domainname.com' have been received in the last 60 seconds.
8 critical/ThreadPool: Exception thrown in event handler:
8 rogram Files (x86)\ICINGA2\var/lib/icinga2/icinga2.state.tmp' failed with error code 13, 'Permission denied'
9 warning/ApiClient: Error while sending JSON-RPC message for identity 'vault.domainname.com'
11 critical/TcpSocket: Invalid socket: 10061, "No connection could be made because the target machine actively refused it."
11 warning/ApiListener: Removing API client for endpoint 'vault.domainname.com'. 0 API clients left.
12 critical/ApiListener: Cannot connect to host 'vault.domainname.com' on port '5665'
12 information/ApiListener: New client connection for identity 'vault.domainname.com'
12 warning/ApiClient: API client disconnected for identity 'vault.domainname.com'
14 information/ApiClient: Not sending heartbeat for endpoint 'vault.domainname.com' because we're replaying the log for it.
19 warning/ApiClient: API client disconnected for identity 'abydos.domainname.com'
19 warning/ApiClient: Error while sending JSON-RPC message for identity 'abydos.domainname.com'
19 warning/ApiListener: Removing API client for endpoint 'abydos.domainname.com'. 0 API clients left.
23 information/ApiClient: Reconnecting to API endpoint 'vault.domainname.com' via host 'vault.domainname.com' and port '5665'
24 information/ApiListener: New client connection for identity 'abydos.domainname.com'
25 information/ApiClient: Reconnecting to API endpoint 'abydos.domainname.com' via host 'abydos.domainname.com' and port '5665'
825 information/ApiClient: Not sending heartbeat for endpoint 'abydos.domainname.com' because we're replaying the log for it.
2,178 information/DynamicObject: Dumping program state to file 'C:\Program Files (x86)\ICINGA2\var/lib/icinga2/icinga2.state'
108,616,317 critical/SocketEvents: Read from event FD failed.
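A tally like the one above can be reproduced with a short script. The following is only a sketch, not the method the reporter used: it assumes every line starts with a bracketed timestamp, as in the excerpts quoted later in this thread, and `tally_messages` and `print_tally` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed prefix shape: "[<timestamp>] " at the start of each line,
# as in the log excerpts quoted in this thread.
TIMESTAMP_PREFIX = re.compile(r"^\[[^\]]*\]\s*")


def tally_messages(lines):
    """Count identical log messages after removing the timestamp prefix."""
    counts = Counter()
    for line in lines:
        message = TIMESTAMP_PREFIX.sub("", line.rstrip("\r\n"))
        if message:
            counts[message] += 1
    return counts


def print_tally(path):
    """Print each distinct message with its count, least frequent first."""
    # Iterate over the file object so a multi-gigabyte log is streamed
    # line by line instead of being loaded into memory.
    with open(path, encoding="utf-8", errors="replace") as handle:
        counts = tally_messages(handle)
    for message, count in sorted(counts.items(), key=lambda item: item[1]):
        print(f"{count:,} {message}")
```

Calling `print_tally` with the path to an icinga2.log file prints one line per distinct message. Only distinct messages are held in memory, which keeps the footprint small when a single line dominates the file.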

This log file starts on 5-12-2015 at around 2 PM and ends on 5-19-2015 at 3 PM.

The error that appears continually started on 5-19-2015 at 11:45 and ends on 5-19-2015 at 3 PM.

So in just about 3 hours, 10 GB was consumed.
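To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It assumes the error window runs from 11:45 to 15:00 on 2015-05-19 as stated, and treats "10GB" as 10 * 10^9 bytes produced entirely by the repeated error line, which is an approximation.

```python
from datetime import datetime

# Figures taken from the report above.
start = datetime(2015, 5, 19, 11, 45)
end = datetime(2015, 5, 19, 15, 0)
error_lines = 108_616_317

# Assumption: "10GB" is read as 10 * 10^9 bytes, all from the error line.
log_bytes = 10 * 10**9

seconds = (end - start).total_seconds()
lines_per_second = error_lines / seconds
bytes_per_line = log_bytes / error_lines

print(f"{seconds:.0f} s, {lines_per_second:,.0f} lines/s, {bytes_per_line:.0f} bytes/line")
# prints "11700 s, 9,283 lines/s, 92 bytes/line"
```

Roughly nine thousand identical lines per second suggests a tight loop that logs on every iteration rather than an error that recurs on a timer.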


Relations:

@icinga-migration
Author

Updated by TechIsCool on 2015-05-19 23:50:16 +00:00

The server rebooted and this was cleared, so it's not a production issue for me. I just want it filed as a bug, since logs should not consume the whole drive.

@icinga-migration
Author

Updated by mfriedrich on 2015-06-18 11:47:35 +00:00

  • Category changed from API to Cluster

@icinga-migration
Author

Updated by TechIsCool on 2015-08-19 06:47:52 +00:00

This consumed another 20 GB on a different host, so it is still a problem.

242,603,597 critical/SocketEvents: Read from event FD failed.
117535 critical/ApiListener: Cannot accept new connection.
117531 critical/Socket: accept() failed with error code 10093, "10093, "Either the application has not called WSAStartup, or WSAStartup failed.""

@icinga-migration
Author

Updated by TechIsCool on 2015-08-19 06:48:26 +00:00

Both are still Windows hosts, and the one mentioned is running the latest version.

@icinga-migration
Author

Updated by gbeutner on 2015-11-14 18:19:16 +00:00

  • Relates set to 9569

@icinga-migration
Author

Updated by rafael.voss on 2016-02-25 09:03:38 +00:00

This bug appears on my Windows server after the update from 2.4.1 to 2.4.3, when I start "icinga2.exe daemon" from the command line and kill it with Ctrl+C. It never happened on 2.4.1.

[2016-02-24 22:01:40 W. Europe Standard Time] information/Application: Received request to shut down.
[2016-02-24 22:01:41 W. Europe Standard Time] information/Application: Shutting down...
[2016-02-24 22:01:42 W. Europe Standard Time] information/CheckerComponent: Checker stopped.
[2016-02-24 22:01:42 W. Europe Standard Time] warning/TlsStream: TLS stream was disconnected.
[2016-02-24 22:01:44 W. Europe Standard Time] warning/JsonRpcConnection: API client disconnected for identity 'master'
[2016-02-24 22:01:44 W. Europe Standard Time] warning/ApiListener: Removing API client for endpoint 'master'. 0 API clients left.
[2016-02-24 22:01:44 W. Europe Standard Time] critical/SocketEvents: Read from event FD failed.
[2016-02-24 22:01:44 W. Europe Standard Time] critical/SocketEvents: Read from event FD failed.
[...]
[2016-02-24 22:01:44 W. Europe Standard Time] critical/SocketEvents: Read from event FD failed.

@icinga-migration
Author

Updated by ZianAtFirstWatch on 2016-07-26 22:51:06 +00:00

I also experienced this problem on 2 Windows Server 2012 R2 hosts. They both have the Icinga 2 version 2.4.8 client installed. The main Icinga 2 instance runs on a Debian server.

My log file says something like this:

[2016-07-25 13:13:33 -0700] critical/SocketEvents: Read from event FD failed...
[2016-07-25 13:13:33 -0700] critical/SocketEvents: Read from event FD failed...
[2016-07-25 13:13:33 -0700] critical/SocketEvents: Read from event FD failed...
[2016-07-25 13:13:33 -0700] critical/SocketEvents: Read from event FD failed...
[2016-07-25 13:13:33 -0700] critical/SocketEvents: Read from event FD failed...
[2016-07-25 13:13:33 -0700] critical/SocketEvents: Read from event FD failed...
[2016-07-25 13:13:33 -0700] critical/SocketEvents: Read from event FD failed...

The log is named icinga02.log and I found it at C:\ProgramData\icinga2\var\log\icinga2\icinga2.log

I tried restarting the Icinga service on both Windows computers and the problem went away on one of them. After restarting the remaining computer, the problem no longer recurred.

@icinga-migration icinga-migration added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Jan 17, 2017
@dnsmichi dnsmichi added the area/windows Windows agent and plugins label Aug 17, 2017
@t-rex2

t-rex2 commented Aug 24, 2018

This problem also occurred on some of my Windows Server 2008 R2 hosts with icinga2-agent version 2.8.0 installed:

[2018-08-13 09:13:38 +0200] critical/SocketEvents: Read from event FD failed.
[2018-08-13 09:13:38 +0200] critical/SocketEvents: Read from event FD failed.
[2018-08-13 09:13:38 +0200] critical/SocketEvents: Read from event FD failed.
[2018-08-13 09:13:38 +0200] critical/SocketEvents: Read from event FD failed.

One log was about 7 GB, and on another server it was about 13 GB. As a result, the C: partition ran out of disk space.

@dnsmichi
Contributor

I could silence the logging, but this is actually a real error that occurs when the FD is gone inside the socket IO thread. I haven't yet found out why this only happens on Windows with poll.
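As a middle ground between silencing the message and logging it on every iteration, repeated messages can be throttled so the error stays visible without filling the disk. The sketch below illustrates that idea with Python's standard `logging` module. It is not Icinga's code and not necessarily how the problem was eventually fixed; `RepeatThrottle`, its `interval` parameter, and the injectable `clock` are hypothetical names chosen for this example.

```python
import logging
import time


class RepeatThrottle(logging.Filter):
    """Let a repeated message through at most once per `interval` seconds."""

    def __init__(self, interval=60.0, clock=time.monotonic):
        super().__init__()
        self.interval = interval
        self.clock = clock
        self.last_emitted = {}  # message text -> time it was last let through
        self.suppressed = {}    # message text -> repeats dropped since then

    def filter(self, record):
        message = record.getMessage()
        now = self.clock()
        last = self.last_emitted.get(message)
        if last is not None and now - last < self.interval:
            # Same message seen again within the interval: drop and count it.
            self.suppressed[message] = self.suppressed.get(message, 0) + 1
            return False
        dropped = self.suppressed.pop(message, 0)
        if dropped:
            # Report how many repeats were dropped so the error is not hidden.
            record.msg = f"{message} (suppressed {dropped} repeats)"
            record.args = None
        self.last_emitted[message] = now
        return True
```

Attaching the filter with `handler.addFilter(RepeatThrottle(interval=60.0))` would reduce a flood of identical lines to one line per minute that carries a count of what was dropped. The dictionaries grow with the number of distinct messages, so a real implementation would also need to bound them.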

@Al2Klimov
Member

Hello @t-rex2!

Do I understand you correctly that your Windows agent produces gigabytes of log, with most of the messages being the ones you posted?

Best,
AK

@Al2Klimov Al2Klimov added the needs feedback We'll only proceed once we hear from you again label Mar 8, 2019
@Al2Klimov Al2Klimov self-assigned this Mar 8, 2019
@t-rex2

t-rex2 commented Mar 11, 2019

Hi Al2Klimov,

yes that's right.

Kind regards

@Al2Klimov
Member

This issue seems to have been addressed by #7005.

@dnsmichi
Contributor

Hi @t-rex2,

snapshot packages for Windows are available for testing on https://packages.icinga.com
Would be awesome if you can test them :)

Thanks,
Michael

@dnsmichi
Contributor

I consider this resolved. Either way, please test the snapshot packages prior to the release so that we do not run into any other pitfalls.
