Jaeger-Collector: too many open files #3156
Comments
Interesting, that's the admin port, the same one where we serve the health checks. Are you able to tell what's on the other side of the connection? Perhaps your client isn't properly closing the HTTP connections? Your instructions for reproducing the problem could also be improved. Are you able to reproduce it by running Jaeger with, say, |
It appears you have too many clients that keep opening new connections to the backend instead of reusing an existing connection. You should also check what file descriptor limit you have. The log seems appropriate; it says that accepting an incoming connection was unsuccessful. What more information do you expect to see? |
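If the client side is a JVM application (as it turns out to be for at least one reporter later in this thread), one way to check for a descriptor leak from the client's side is to watch the process's own file descriptor counts. A minimal sketch, assuming a Unix-like JVM where the com.sun.management cast is available:

```java
import java.lang.management.ManagementFactory;

import com.sun.management.UnixOperatingSystemMXBean;

public class FdWatcher {
    public static void main(String[] args) throws InterruptedException {
        // Only works on Unix-like JVMs; elsewhere this cast will throw ClassCastException.
        UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            // A count that climbs steadily under constant load usually means leaked
            // sockets, e.g. HTTP responses that are never closed.
            System.out.printf("open FDs: %d / max: %d%n",
                    os.getOpenFileDescriptorCount(), os.getMaxFileDescriptorCount());
            Thread.sleep(10_000);
        }
    }
}
```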
Unfortunately, this log entry is not created in JSON format like all other log events of the collector, and it is missing an event type such as error or warning. On our side we process the log events with Fluent Bit, so we always expect a JSON string from the Jaeger collector. This message unfortunately breaks the JSON pattern, which is why we did not notice this error for months - it was never identified as an error on our systems. Since this has nothing to do with the "too many open files" problem itself, maybe I should open a separate ticket for the incorrect log output? |
Looks like the log is coming from https://github.com/golang/go/blob/d77f4c0c5c966c37960cd691656fba184ae770ff/src/net/http/server.go#L3017. We're not defining ErrorLog on that http.Server, so these messages bypass our structured logger. Also, since this is a health check port, we should double-check that we're properly closing the req/resp objects. |
I just tried hitting the admin port with
Once
I could really use more information on how to reproduce this. |
I did the same and I can confirm: the root cause seems to be in one of our clients. As mentioned, connections were not closed correctly. Then why does the health check port appear so often in the list of connections? The display is correct, but the interpretation is a bit more involved: all HTTP requests end up waiting in a loop and can only be processed when a connection becomes free. Since the health check is also executed every n seconds, it ends up in that loop as well and thus in the log output. It does not block a connection, it waits for one. I'm testing a new version of the client today and expect the misbehavior to disappear; then I will close the issue. Thank you very much for your great support. And even though it was due to the client, we got better logging in place as a spin-off :-) |
What was wrong: we work on a collectorClient and forgot to close the connection...
Just add:
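A minimal sketch of such a fix, assuming an Apache HttpClient based collectorClient - the class, method, and variable names below are illustrative, not the actual client code:

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

public class CollectorClientFix {
    // 'collectorUrl' and the JSON payload are illustrative placeholders.
    static void postSpans(CloseableHttpClient client, String collectorUrl, String spanBatchJson)
            throws Exception {
        HttpPost post = new HttpPost(collectorUrl);
        post.setEntity(new StringEntity(spanBatchJson, ContentType.APPLICATION_JSON));
        // try-with-resources releases the response and its connection lease even on
        // errors - the missing close is what leaks sockets on both ends.
        try (CloseableHttpResponse response = client.execute(post)) {
            EntityUtils.consume(response.getEntity()); // drain the body so the connection can be reused
        }
    }
}
```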
Everything looks fine, but you get some new log entries in jaeger-collector:
transport.controlbuf.run() causes this info message. A comment in the code marks this as fine. |
@yurishkuro we are hitting the same issue described above. It seems like a jaeger-collector scaling issue on our end, as we see high traffic when the issue occurs. To fix it we are considering increasing the file descriptor limit - will that alone fix the issue? Are there any other parameters that we can tweak to help? env: jaeger version v1.37 |
@mulpk thanks for asking on the existing issue. |
@yurishkuro thanks for the response. Setup:
We had FDs at 1024 before and bumped the limit to 16k, which did not help. I will try a higher FD number like 65k (current CPU/memory stats on the hosts show that it's using less than 30% of system resources). |
Jaeger itself does not expose these; metrics like that are usually surfaced by your orchestration system for all workloads. |
Increasing the queue size will not help with this number. Increasing the number of workers might help, but only to a certain point - the bottleneck is usually the storage. |
Right, makes sense. I see high numbers for jaeger_collector_save_latency_bucket: 10467444. Right now the main problem is that the jaeger-collector process gets stuck when the "too many open files" error occurs. Is there a way to discard the extra spans and make sure jaeger-collector stays operational? |
We do not have that capability right now, but it's a good idea. I don't know how much control we have around FDs, but it's feasible for the collector to monitor this number, as well as the internal queue length, and start responding with "I AM BUSY" and closing incoming connections.
@yurishkuro we increased the FD limit to 65535 and the collector uses under 50 FDs for the entire week except for the peak that you see in the graph. We anticipated it might be a connection-close issue and made sure connections are handled properly, but we still see the issue. lsof on the Jaeger host shows 65535 TCP connections hanging around, e.g.: jaeger-co 1098899 build 15u sock 0,8 0t0 1050943012 protocol: TCP. I'm not sure if it's load that we are not able to handle or some other issue. Any pointers would be helpful? |
How many clients do you have connecting to the collector? Are those incoming connections? |
I'm using a ThreadPoolExecutor in the Java code to send parallel span requests to jaeger-collector. I'm batching 1000 requests, and each request to jaeger-collector consists of 100 spans. Here is the HTTP client connection configuration I'm using:
These requests are sent to a jaeger-collector load balancer that I created, and the load is split between 3 instances running jaeger-collector. Each jaeger-collector instance uses the configuration below. |
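For illustration only (these are not the poster's actual settings), a pooled Apache HttpClient configuration for this kind of fan-out could look roughly like this; all limits and class names are assumptions:

```java
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledClientConfig {
    public static CloseableHttpClient buildClient() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        // Cap how many sockets the client may open in total and per collector host;
        // beyond this, requests wait for a free pooled connection instead of opening new ones.
        cm.setMaxTotal(200);          // illustrative value
        cm.setDefaultMaxPerRoute(50); // illustrative value
        return HttpClients.custom()
                .setConnectionManager(cm)
                // Proactively drop stale/idle connections so they don't pile up as open FDs.
                .evictExpiredConnections()
                .evictIdleConnections(30, TimeUnit.SECONDS)
                .build();
    }
}
```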
This doesn't answer my question, but it looks like you are forcing a synthetic load that creates lots of new client connections, thus causing the issue. You either need to size the Jaeger collector fleet to handle your peak concurrent traffic or use better connection management on the client, such as |
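One hedged sketch of such connection management: share a single pooled client across all executor threads and always release responses, instead of opening a fresh connection per request. The endpoint, pool limits, and thread counts below are placeholders, not recommendations:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class SharedClientLoad {
    public static void main(String[] args) throws Exception {
        // One pooled client shared by all worker threads, instead of a new
        // client/connection per request (limits are illustrative; see the previous sketch).
        CloseableHttpClient client = HttpClients.custom()
                .setMaxConnTotal(200)
                .setMaxConnPerRoute(50)
                .build();
        ExecutorService pool = Executors.newFixedThreadPool(16); // illustrative size

        String url = "http://jaeger-collector.example:9411/api/v2/spans"; // placeholder address
        String batchJson = "[]"; // placeholder span batch

        for (int i = 0; i < 1000; i++) {
            pool.submit(() -> {
                HttpPost post = new HttpPost(url);
                post.setEntity(new StringEntity(batchJson, ContentType.APPLICATION_JSON));
                try (CloseableHttpResponse resp = client.execute(post)) {
                    EntityUtils.consume(resp.getEntity()); // release the connection back to the pool
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        client.close();
    }
}
```

With a bounded pool, traffic bursts queue inside the client instead of turning into thousands of half-open sockets on the collector side.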
Thanks for the response. I'm more inclined toward sizing the Jaeger collector fleet to handle the peak concurrent traffic - we are currently using the /api/v2/spans endpoint for submitting spans in Zipkin JSON v2. As mentioned above, we are sending about 100 spans in each request - do you have any recommendation here? When you say size jaeger-collector, does that mean horizontal or vertical scaling? |
I meant horizontal scaling, to distribute connections across instances. You will have other bottlenecks if you are creating huge traffic spikes - your storage may not be able to handle the load. |
Jaeger-Collector: too many open files
We have a set of collectors in use. In the console, in the log output of the collector (which is created in JSON format), we suddenly see a large number of messages that always have this structure (and are not JSON):
2021/07/21 13:24:45 http: Accept error: accept tcp [::]:14269: accept4: too many open files; retrying in 1s
Since these messages carry no information about their severity, it is unclear whether an actual error is being logged here or whether it is just debug information.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
These messages should be visible as error events, i.e. logged in JSON format with a severity level like the collector's other log output.
Version (please complete the following information):
What troubleshooting steps did you try?
We modified ulimit for the Collector process - no effect