API using "Connection: close" header results in infinite threads #6514
I can confirm the behavior; we are also seeing this. We have a server running 2 Icinga2 instances in Docker containers, but Icinga2 has around 16,000 threads open at the moment, sometimes hitting the nproc ulimit. I can also confirm that sometimes we see threads with 100% CPU time. We are also running r2.9.1-1, but with CentOS 7 as the host and image version. I did not test whether the header is the problem, but since we are doing readiness and liveness checks with an authorized user against the Icinga2 API port in Kubernetes, and Kubernetes seems to send Connection: close in these requests, this should be the problem. @dnsmichi, could you have a look at this please?
One problem I do see is that we don't handle the …
Is that a security issue, since it can be abused for DoS attacks?
You could also just leave the connection open on the client and wait for the server to close it; that puts more (thread) load on the server too. Either way, you shouldn't expose the REST API publicly, and you should filter HTTP requests (e.g. with a proxy). For the problem itself, I am looking for a fix (vacation came in between).
The problem is not the header itself; that's respected and triggered. The problem lies within …
I think I've found something. Note: I am testing with pre-2.10 and the dynamic thread pool, not 2.9.x.

Analysis

HttpResponse->Finish() calls m_Stream->Shutdown() early if either … is sent. This breaks further stream handling, which assumes that m_Stream->Shutdown() may be called again later in the HttpServerConnection handlers. This essentially doesn't close the socket, but changes the polling events and signals the shutdown to a different thread, freeing the lock. Highly likely this then locks up, even if the stack trace says otherwise. The stack trace itself comes from the WQ where the current task is cleared and there's an attempt to free all remaining resources, i.e. m_RequestQueue, which holds a reference to HttpServerConnection. The destructor just hangs there on a circular reference.
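The circular-reference hang can be illustrated with a minimal Python analogy (an assumption on my part, not the actual C++ code; the class name only mirrors the one in the trace). On CPython with the cycle collector disabled, only reference counting remains, much like C++ shared_ptr: a queued task that references its own connection keeps the connection alive, so it is never destroyed.

```python
import gc
import weakref

class HttpServerConnection:
    """Stand-in for the real connection object."""
    def __init__(self):
        self.request_queue = []  # analogue of m_RequestQueue

def make_connection():
    conn = HttpServerConnection()
    # The queued task closes over the connection itself, mirroring
    # m_RequestQueue holding a reference back to HttpServerConnection.
    conn.request_queue.append(lambda: conn)
    return weakref.ref(conn)

gc.disable()               # leave only reference counting, like shared_ptr
ref = make_connection()    # 'conn' went out of scope, but the cycle remains
print(ref() is not None)   # True: the object is never torn down
gc.collect()               # Python's cycle collector can break the cycle;
gc.enable()                # shared_ptr has no equivalent, hence the hang
print(ref() is None)       # True
```

shared_ptr offers no cycle collection, which is why the C++ fix has to break the reference cycle explicitly rather than rely on destructors running.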
Previously this was only visible in the debug log; with the new socket printers we can better check for proper connects and disconnects. refs #6514
Solution?

When a request is finished, we need to explicitly call …

Tests

Thread-watch on macOS: …

- Requests with …
- Requests with …
- Requests with …

Results

Typically my MacBook spawns 31 threads for Icinga 2.

HTTP/1.1

When starting the curl requests, this goes up from 50 to 75, but then drops to 71.

HTTP/1.0

Requests cause 70-80 threads; you can watch them go down and up a little. Good times drop to just 60 threads.

Connection: close

Goes up to 63 threads, down to 57, up to 80, drops to 49.

Clients

Local dashing-icinga2 instance which connects to the local icinga2 daemon. Since the scheduler doesn't do anything within a 10-second interval, threads stay at 31. Firing another bunch of close requests doesn't harm the behaviour; it peaks at 60 but still processes the dashing GET requests.
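For reference, per-process thread counts like the ones quoted above can also be sampled without a GUI tool. A small Linux-only Python helper (my own sketch, not what was used in the test) reads the kernel's thread count from /proc; replace "self" with the icinga2 PID to watch the daemon instead:

```python
import re
import threading

def native_thread_count(pid="self"):
    # Linux-specific: the kernel reports the per-process thread count
    # in /proc/<pid>/status as a "Threads: N" line.
    with open(f"/proc/{pid}/status") as f:
        return int(re.search(r"Threads:\s*(\d+)", f.read()).group(1))

# Sanity check against our own process: spawn five blocked threads
# and confirm the kernel sees exactly five more.
base = native_thread_count()
stop = threading.Event()
workers = [threading.Thread(target=stop.wait) for _ in range(5)]
for w in workers:
    w.start()
grown = native_thread_count() - base
stop.set()
for w in workers:
    w.join()
print(grown)  # 5: each started thread is visible to the kernel
```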
…necting the client

Test results: #6514 (comment)

fixes #6514
Note: This depends on #6633, and as such we cannot backport it; the PR targets 2.10 only.
@neubi4 please test this once merged/inside snapshot packages. This "costs" you an entry on https://www.icinga.com/about/customers/#shareyourstory :-)
Thank you for fixing this @dnsmichi. I'll test this as well. In the meantime we had to downgrade, since we are using the API on the client side as well and multiple customer servers went down.
Triggered snapshot builds for both Debian 9 and CentOS 7.
@dnsmichi deal, and thanks for all your hard work :) I tried to test the snapshot packages using the repo (https://packages.icinga.com/epel/7/snapshot/x86_64/), but could not get it to work in our dev environment. Our two masters fail to connect to each other with these messages:
It seems it fails to handle the client connections; I see the same messages when the Director tries to talk to the API. The same config works without problems with 2.9.1.

icinga2 --version

icinga2 feature list

/etc/icinga2/zones.conf

Log
Hmmm, I've found and fixed something similar some minutes ago in git master; wait for the new builds with e6eb703.
I've started my local HA master setup, and the TLS handshake works fine. Probably this happens either because of the mentioned bug, or because the connection is busy.
I've tried the new builds; it still happens. A short test with curl shows that the connection is accepted, but hangs when transferring any data:
This is all I get; waiting a few minutes does not change anything. I don't have any other configuration like agents, satellites or other nodes in the dev setup, so the connection should not be busy with anything like that. The interesting thing in my logs is that after startup, both masters log the TLS timeout message above, each one the two messages with the connection attempt to and from the other master. After that, there is no attempt on either of the masters to connect to the other node over the next minutes (I have waited 15m now). I'm on vacation (or rather working off overtime) for the rest of the week, so I think the earliest next test (running with the debug log, for example) I can promise is on Monday, but I will try to do it earlier.
Ah, this is the setup with daemonize/-d, isn't it? Can you attach to the running processes and generate a thread list please? I'm doing my best not to harm 2.10 this week; still, I think we've found something here :)
Note to self for tomorrow: Check #6630 and the apilistener thread pool, whether the threads are gone after daemonize().
Ok, I'm sitting with my MacBook on the balcony, I can see it. With daemonize(), …
I just tested the snapshots repo; the cluster works fine, thanks :). I will remember to note that we are running with "-d" when opening the next issue. I also tested the thread issue, with and without Connection: close and with HTTP/1.0; everything is fine, the thread count is not going higher than it should. The only issue I see now is that the deployment after the Director kickstart is not working:
but I guess it's only Director v1.5.1 not being compatible with the new namespace feature, so I think we can close the issue here.
Aaah, that saves my day, thank you :-) The Director needs this small patch for namespaces: Icinga/icingaweb2-module-director#1654 - I'll mark this in the release blog post, plus an extra changelog entry.
In terms of the namespace error with globals, here's a PR for 2.10 with #6669, since this also affects a global zone deployment.
@dnsmichi @neubi4 Sorry to bring that nasty piece up again, but it seems the issue is only half fixed. Using the sequential curl requests with the "Connection: close" header seems to work, but firing up our API client, which uses threaded job handling, results in some errors again. If I don't send the "Connection: close" header, everything is fine (at least for 2.10.2; on some 2.8.x it was the exact opposite, which is why we changed the behaviour). Here is the debug log:
Sometimes there are two entries for a disconnect with the same source port:
Log from client side:
I see some strange behaviour there:
Unfortunately the client is not open source, but the only interaction it has with Icinga is pushing check results and getting a host's services via the API in worker threads (written in Go using http.Client). Environment:
PS: Cannot reopen it, since you closed it ;)
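The closed-source client can't be shared, but the access pattern described above — worker threads each issuing a request with a Connection: close header — can be sketched against a throwaway local server. This is a Python analogy of my own (the /v1 path, port handling, and worker count are all illustrative, not the actual Go client):

```python
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # keep the output quiet
        pass

# Throwaway server on an ephemeral port, one thread per connection.
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

statuses = {}
def worker(i):
    # The header asks the server to tear the connection down after
    # responding -- the pattern that triggered the leaked threads.
    req = urllib.request.Request(f"http://127.0.0.1:{port}/v1",
                                 headers={"Connection": "close"})
    with urllib.request.urlopen(req) as resp:
        statuses[i] = resp.status

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
server.shutdown()
print(sorted(statuses.values()))  # eight clean 200s when close handling works
```

Against a server with the bug, each of these closes would strand a handler thread instead of releasing it, which is what the thread counts above were measuring.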
Our configuration with version 2.10.2 has the same problems as @cubic3d describes. If you want a new issue, please tell us, so that we know where to write down what we find out.
@dnsmichi ping
I just debugged this issue with our icinga2 setup, and my finding is that with "Connection: close" it makes a difference whether the client sends the headers and body in one go or not.
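The split-write behaviour can be reproduced in isolation: send the request line and headers in one TCP write, pause, then send the body. Here is a minimal Python sketch of my own against a local stand-in server (the path and payload are made up; the real test would target the Icinga 2 API over TLS):

```python
import http.server
import socket
import threading
import time

class Handler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        body = self.rfile.read(length)  # blocks until the delayed body arrives
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

sock = socket.create_connection(("127.0.0.1", port))
# First write: request line and headers only, including Connection: close.
sock.sendall(b"POST /v1/actions/process-check-result HTTP/1.1\r\n"
             b"Host: localhost\r\n"
             b"Connection: close\r\n"
             b"Content-Length: 4\r\n"
             b"\r\n")
time.sleep(0.2)        # the body deliberately lands in a later TCP segment
sock.sendall(b"test")  # second write: the body

response = b""
while chunk := sock.recv(4096):  # read until the server closes
    response += chunk
sock.close()
server.shutdown()
print(b"test" in response)  # True: a correct server waits for the body
```

A server that acts on Connection: close as soon as the headers arrive, instead of after the full request has been read, would misbehave exactly on the split-write variant.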
This was an issue targeting 2.10.0, so it cannot be re-opened. Please move your findings to a new issue. Please note that I'm working on other projects now, so the best route would be an escalation via a support contract from an Icinga partner.
@stevie-sy @swegener opened #6799. Would you guys transfer your findings over there?
Expected Behavior
Kill / limit threads after API requests have been handled.
Current Behavior
Multiple fast API requests create threads which are never terminated; thousands of them consume server resources and cause stalls from time to time, consuming 100% load of a CPU core. This happens here easily overnight, causing the server to stall and become unable to handle any additional TCP connections.
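As a rough illustration of the symptom (a hypothetical Python sketch, not Icinga 2's actual threading model): if each request spawns a handler thread that never finishes, the thread count grows by exactly one per request and never comes back down.

```python
import threading

leaked = threading.Event()  # never set: handlers block forever

def handle_request():
    # Stand-in for a connection handler that is never torn down.
    leaked.wait()

before = threading.active_count()
for _ in range(100):  # "100 fast API requests"
    threading.Thread(target=handle_request, daemon=True).start()
delta = threading.active_count() - before
print(delta)  # 100: one stuck thread per request
```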
Possible Solution
Steps to Reproduce (for bugs)
Simple to reproduce using Docker:
docker run --rm -p 80:80 -p 5665:5665 -h icinga2 --name icinga -t jordan/icinga2:latest
latest currently refers to r2.9.1-1 - unfortunately 2.9.1 is not explicitly tagged by jordan on Docker Hub.
Get the generated root account for the API from the container and create some traffic:
curl --insecure --user root:8b46e7f992c24208 -H 'Connection: close' https://localhost:5665/v1#[1-100000]
Run 3-4 of the curl in parallel to make things faster.
Context
The way we are using Icinga2 is kind of odd; to explain it as briefly as possible: We have several master instances running on a host in pure LXC containers, each instance responsible for a subset of "client Icinga instances". We have about 1500 client servers connected via VPN to the host, pushing their own container monitoring data (hosts and services - problems only) to a RabbitMQ instance. A custom "collector" consumes the RabbitMQ queue, aggregates the services into predefined categories and pushes the updates via the Icinga2 API to the master instance.
The structure for each of the masters looks like:
Client-Icinga<->Publisher<->VPN<->RabbitMQ<->Collector<->Master-Icinga
Using this constellation with 2.7 was no problem at all. Starting with 2.8, the missing "Connection: close" led to problems with the API not responding properly on the first request.
Yes, I know we need to restructure our monitoring, but as always, there are too many dependencies; it can't happen overnight ;)
Your Environment

- Version used (icinga2 --version)
- Enabled features (icinga2 feature list)
- Config validation (icinga2 daemon -C)
- If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.