
Jaeger-Collector: too many open files #3156

Open
pbdger opened this issue Jul 21, 2021 · 20 comments
Labels
bug, help wanted (Features that maintainers are willing to accept but do not have cycles to implement)

Comments

@pbdger

pbdger commented Jul 21, 2021

Jaeger-Collector: too many open files

We have a set of collectors in use. In the console, in the log output of the collector (which is emitted in JSON format), we suddenly see a large number of messages that always have this structure (and are not JSON):

2021/07/21 13:24:45 http: Accept error: accept tcp [::]:14269: accept4: too many open files; retrying in 1s

Since these messages carry no severity level, it is unclear whether this is an actual error event or just debug information.

To Reproduce
Steps to reproduce the behavior:

  1. Start Jaeger-Collector
  2. Create traffic

Expected behavior
These messages should be visible as proper log events - in JSON and with a severity level - like the collector's other log output.

Version (please complete the following information):

  • OS: Amazon Linux 2
  • Jaeger version: 1.24 (same behaviour with 1.22)
  • Deployment: bare metal

What troubleshooting steps did you try?
We modified ulimit for the Collector process - no effect

@pbdger pbdger added the bug label Jul 21, 2021
@jpkrohling
Contributor

jpkrohling commented Jul 21, 2021

Interesting, that's the admin port, the same one we serve the health checks on. Are you able to tell what's on the other side of the connection? Perhaps your client isn't properly closing the HTTP connections?

Your instructions to reproduce the problem could also be improved. Are you able to reproduce it by running Jaeger with, say, podman run jaegertracing/all-in-one:1.24.0 on the server side and tracegen as the client?

@yurishkuro
Member

It appears you have too many clients that keep opening new connections to the backend instead of reusing an existing connection. You should also check what file descriptor limit you have.

The log seems appropriate, it says that accept of an incoming connection was unsuccessful. What more information do you expect to see?

@pbdger
Author

pbdger commented Jul 21, 2021

The log seems appropriate, it says that accept of an incoming connection was unsuccessful. What more information do you expect to see?

Unfortunately, the log entry is not emitted in JSON format like all the collector's other log events. A severity level such as error or warning is also missing.

On our side we process the log events with Fluent Bit, so we always expect a JSON string from the Jaeger collector. This message unfortunately breaks the JSON pattern, which is why we did not notice the error for months: it was never identified as an error on our systems.

Since this has nothing to do with the "too many open files" problem itself, should I open a separate ticket for the incorrect log output?

@yurishkuro
Member

Looks like the log is coming from https://github.com/golang/go/blob/d77f4c0c5c966c37960cd691656fba184ae770ff/src/net/http/server.go#L3017. We're not defining ErrorLog *log.Logger on the Server, so it falls back to the standard library logger, which is not JSON-formatted.
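For illustration, a minimal sketch (not the actual collector wiring; the server setup, handler, and port are assumptions) of pointing the admin http.Server's ErrorLog at a zap production logger via zap.NewStdLog, so accept errors come out as JSON like the rest of the log:

package main

import (
	"net/http"

	"go.uber.org/zap"
)

// newAdminServer wires the HTTP server's ErrorLog into zap so that lines like
// "http: Accept error: ..." are emitted as structured JSON instead of plain text.
func newAdminServer(logger *zap.Logger, addr string) *http.Server {
	return &http.Server{
		Addr:     addr,
		Handler:  http.DefaultServeMux,
		ErrorLog: zap.NewStdLog(logger), // *log.Logger backed by the zap core
	}
}

func main() {
	logger, _ := zap.NewProduction() // JSON encoder by default
	defer logger.Sync()

	srv := newAdminServer(logger, ":14269")
	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		logger.Fatal("admin server failed", zap.Error(err))
	}
}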

Also, since this is a health check port, we should double-check that we're properly closing the req/resp objects.

@yurishkuro yurishkuro added the help wanted Features that maintainers are willing to accept but do not have cycles to implement label Jul 21, 2021
@jpkrohling
Contributor

I just tried hitting the admin port with ab, and couldn't reproduce the problem with 100 concurrent clients and 1 million requests:

$ ab -c 100 -n 1000000 http://localhost:14269/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests


Server Software:        
Server Hostname:        localhost
Server Port:            14269

Document Path:          /
Document Length:        102 bytes

Concurrency Level:      100
Time taken for tests:   57.174 seconds
Complete requests:      1000000
Failed requests:        718434
   (Connect: 0, Receive: 0, Length: 718434, Exceptions: 0)
Total transferred:      212137980 bytes
HTML transferred:       103138252 bytes
Requests per second:    17490.39 [#/sec] (mean)
Time per request:       5.717 [ms] (mean)
Time per request:       0.057 [ms] (mean, across all concurrent requests)
Transfer rate:          3623.41 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    3   0.8      3       7
Processing:     0    3   0.8      3       9
Waiting:        0    2   0.8      2       7
Total:          2    6   0.5      6      14

Percentage of the requests served within a certain time (ms)
  50%      6
  66%      6
  75%      6
  80%      6
  90%      6
  95%      7
  98%      7
  99%      7
 100%     14 (longest request)

Once ab finishes, the state seems sane to me:

$ lsof -p 1297434
COMMAND       PID        USER   FD      TYPE  DEVICE SIZE/OFF    NODE NAME
collector 1297434 jpkroehling  cwd       DIR    0,43      820   58547 /home/jpkroehling/Projects/src/github.com/jaegertracing/jaeger
collector 1297434 jpkroehling  rtd       DIR    0,35      176     256 /
collector 1297434 jpkroehling  txt       REG    0,39 31259320  114714 /tmp/go-build4254053393/b001/exe/collector
collector 1297434 jpkroehling  mem       REG    0,34          4327015 /var/lib/sss/mc/passwd (path dev=0,35)
collector 1297434 jpkroehling  mem       REG    0,34          4173794 /usr/lib64/libnss_sss.so.2 (path dev=0,35)
collector 1297434 jpkroehling  mem       REG    0,34          4308749 /usr/lib64/libc-2.33.so (path dev=0,35)
collector 1297434 jpkroehling  mem       REG    0,34          4308763 /usr/lib64/libpthread-2.33.so (path dev=0,35)
collector 1297434 jpkroehling  mem       REG    0,34          4308742 /usr/lib64/ld-2.33.so (path dev=0,35)
collector 1297434 jpkroehling    0u      CHR   136,0      0t0       3 /dev/pts/0
collector 1297434 jpkroehling    1u      CHR   136,0      0t0       3 /dev/pts/0
collector 1297434 jpkroehling    2u      CHR   136,0      0t0       3 /dev/pts/0
collector 1297434 jpkroehling    3r      REG    0,35  9253600 4327015 /var/lib/sss/mc/passwd
collector 1297434 jpkroehling    4u  a_inode    0,14        0   11556 [eventpoll:5,7,8,9]
collector 1297434 jpkroehling    5r     FIFO    0,13      0t0 5921185 pipe
collector 1297434 jpkroehling    6w     FIFO    0,13      0t0 5921185 pipe
collector 1297434 jpkroehling    7u     IPv6 5921192      0t0     TCP *:14269 (LISTEN)
collector 1297434 jpkroehling    8u     IPv6 5920365      0t0     TCP *:cps (LISTEN)
collector 1297434 jpkroehling    9u     IPv6 5920366      0t0     TCP *:14268 (LISTEN)

I could really use more information on how to reproduce this.

@pbdger
Author

pbdger commented Jul 22, 2021

I just tried hitting the admin port with ab, and couldn't reproduce the problem with 100 concurrent clients and 1 million requests:
I could really use more information on how to reproduce this.

I did the same and I can confirm: the root cause seems to be in one of our clients. As mentioned, a connection was not being closed correctly.

Then why does the health check port appear so often in the list of connections?

The display is correct, but the interpretation is a bit more complicated: all HTTP requests end up in a loop and can only be processed when a connection becomes free. Since the health check also runs every n seconds, it ends up in that loop as well, and thus in the log output. It does not block a connection; it waits for one.

I'm still testing a new version of the client today. I expect that the misbehavior will disappear. Then I will close the issue.

Thank you very much for your great support. And even though it was due to the client, we got better logging in place as a spin-off :-)

@pbdger pbdger closed this as completed Jul 22, 2021
@pbdger
Author

pbdger commented Jul 22, 2021

What was wrong:

We were working with collectorClient and forgot to close the connection...

collectorClient := api_v2.NewCollectorServiceClient(connection)
response, err = collectorClient.PostSpans(ctx, r)

Just add:

connection.Close()

Everything looks fine now, but you get some new log entries in jaeger-collector:

{"level":"info","ts":1626955277.52554,"caller":"grpclog/component.go:71","msg":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"","system":"grpc","grpc_log":true}

transport.controlbuf.run() causes this info message. The comment in the gRPC code marks this as fine.
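For reference, a minimal sketch of the corrected client flow with the connection closed, assuming the api_v2 package from Jaeger's proto-gen directory, the default collector gRPC port 14250, and no TLS (the target, request contents, and timeout are illustrative):

package main

import (
	"context"
	"log"
	"time"

	"github.com/jaegertracing/jaeger/proto-gen/api_v2"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Assumed: collector reachable on its default gRPC port 14250, no TLS.
	connection, err := grpc.Dial("localhost:14250",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial collector: %v", err)
	}
	// The missing piece in the original client: release the connection
	// (and its file descriptor on the collector side) when done.
	defer connection.Close()

	collectorClient := api_v2.NewCollectorServiceClient(connection)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	r := &api_v2.PostSpansRequest{} // would carry the span batch in real code
	if _, err := collectorClient.PostSpans(ctx, r); err != nil {
		log.Fatalf("post spans: %v", err)
	}
}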

@mulpk

mulpk commented Jul 29, 2024

@yurishkuro we are hitting the same issue described above:
2021/07/21 13:24:45 http: Accept error: accept tcp [::]:14269: accept4: too many open files; retrying in 1s

It seems like a jaeger-collector scaling issue on our end, as we see high traffic when the issue occurs.

To fix it, we are considering increasing the file descriptor limit - will that alone fix the issue? Are there any other parameters we can tweak to help?

env: jaeger version v1.37

@yurishkuro
Member

@mulpk thanks for asking on the existing issue.

  • Are you monitoring the number of open file descriptors in the collector process? What are the levels and do they match your understanding of the number of clients? As mentioned earlier, clients are supposed to use persistent connections, but you're saying this issue is related to "high traffic" - is high traffic a result of more clients (e.g. maybe you're spawning off many lambda functions), or just more QPS? The latter should not affect FDs.
  • IIRC we used to run with about 120k FDs per Jaeger collector at Uber
  • FDs alone may not help you scale; you also have to look at other health metrics like memory usage and CPU utilization.

@mulpk

mulpk commented Jul 30, 2024

@yurishkuro thanks for the response

setup:
We have a load balancer which distributes load to multiple hosts, and each one runs jaeger-collector as a system process. Looking at the LB stats, I see a high inflow of span data when the "too many open files" error occurs (that's what I was referring to as high traffic before).

  • How do I monitor open file descriptors in the collector process? I don't see them in the metrics I have - do I need to enable some flag on the collector process to monitor open file descriptors?

  • Do you have a recommendation on the flags below as well? I see they can help with scaling (the current value of jaeger_collector_in_queue_latency_bucket is 10291739 - it should be a lower value, right? I guess I need to increase the workers)
    collector.queue-size: 6000
    collector.num-workers: 200

We had the FD limit at 1024 before and bumped it to 16k, which did not help. I will try a higher FD limit like 65k (current CPU/memory stats on the hosts show that the process is using less than 30% of system resources).

@yurishkuro
Member

how do I monitor open file descriptors in the collector process?

Jaeger itself does not expose these, usually metrics like that are surfaced by your orchestration system for all workloads.
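Not Jaeger-specific, but for a quick external check: a minimal sketch (Linux only, hypothetical helper, not part of Jaeger) that counts the entries under /proc/<pid>/fd for the collector's PID, which is roughly the number lsof reports:

package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strconv"
)

// openFDCount returns the number of open file descriptors for a pid by
// listing /proc/<pid>/fd (roughly what `ls /proc/<pid>/fd | wc -l` shows).
func openFDCount(pid int) (int, error) {
	entries, err := os.ReadDir(filepath.Join("/proc", strconv.Itoa(pid), "fd"))
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: fdcount <pid>")
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		log.Fatalf("invalid pid: %v", err)
	}
	n, err := openFDCount(pid)
	if err != nil {
		log.Fatalf("reading /proc: %v", err)
	}
	fmt.Printf("pid %d has %d open file descriptors\n", pid, n)
}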

@yurishkuro
Member

jaeger_collector_in_queue_latency_bucket

Increasing the queue size will not help with this number. Increasing the number of workers might help, but only up to a point - the bottleneck is usually the storage.

@mulpk

mulpk commented Jul 30, 2024

Right, makes sense. I see high numbers for jaeger_collector_save_latency_bucket: 10467444

Right now the main problem is that the jaeger-collector process gets stuck when the "too many open files" error occurs. Is there a way to discard the extra spans and keep the collector operational?

@yurishkuro
Member

We do not have that capability right now, but it's a good idea. I don't know how much control we have over FDs, but it's feasible for the collector to monitor this number, as well as the internal queue length, and start responding with "I AM BUSY" and closing incoming connections.
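For illustration, a minimal sketch of that idea in Go (hypothetical, not existing Jaeger behavior; the queue gauge, threshold, endpoint, and port wiring are all assumptions): an HTTP guard that answers 503 and drops the keep-alive connection, freeing its file descriptor, once an internal queue-length gauge crosses a limit.

package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// queueLength would be maintained by the span-processing pipeline (hypothetical).
var queueLength atomic.Int64

const maxQueueLength = 50000

// busyGuard rejects new work with 503 and closes the connection once the
// internal queue crosses the threshold, instead of letting requests pile up.
func busyGuard(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if queueLength.Load() > maxQueueLength {
			w.Header().Set("Connection", "close") // tells net/http to close after responding
			http.Error(w, "collector busy, retry later", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/traces", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted) // placeholder for real span ingestion
	})
	log.Fatal(http.ListenAndServe(":14268", busyGuard(mux)))
}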

@yurishkuro yurishkuro reopened this Jul 30, 2024
@mulpk

mulpk commented Aug 26, 2024

@yurishkuro we increased the FD limit to 65535, and the process used under 50 FDs for the entire week except for the peak you see in the graph. We anticipated it might be a connection-close issue and made sure connections are handled properly, but we still see the issue.

lsof on the Jaeger host shows the 65535 TCP connections hanging around:

jaeger-co 1098899 build 15u sock 0,8 0t0 1050943012 protocol: TCP
jaeger-co 1098899 build 16u sock 0,8 0t0 1050898799 protocol: TCP

I'm not sure if it's load that we are not able to handle or some other issue. Any pointers would be helpful.

[Screenshot 2024-08-26 at 1:28:16 PM]

@yurishkuro
Member

How many clients do you have connecting to the collector? Are those incoming connections?

@mulpk

mulpk commented Aug 26, 2024

I'm using a ThreadPoolExecutor in the Java code to send parallel span requests to jaeger-collector. I'm batching 1000 requests, and each request to jaeger-collector consists of 100 spans.

Here is the HTTP client connection configuration I'm using:

PoolingHttpClientConnectionManager poolingHttpClientConnectionManager = new PoolingHttpClientConnectionManager();
poolingHttpClientConnectionManager.setDefaultMaxPerRoute(60);
poolingHttpClientConnectionManager.setMaxTotal(240);
RequestConfig rc = RequestConfig.custom()
        .setConnectTimeout((int) Duration.ofSeconds(5L).toMillis())
        .setSocketTimeout((int) Duration.ofSeconds(20L).toMillis()).build();
CloseableHttpClient client = HttpClients.custom()
        .setConnectionManager(poolingHttpClientConnectionManager) // was the undeclared `connectionManager`
        .setDefaultRequestConfig(rc).build();

These requests are sent to a jaeger-collector load balancer that I created, and the load is split between 3 instances running jaeger-collector.

Each jaeger-collector instance uses the configuration below:
collector.queue-size: 50000
collector.num-workers: 4000

@yurishkuro
Member

This doesn't answer my question, but it looks like you are forcing a synthetic load that creates lots of new client connections, and that is what causes the issue. You either need to size the Jaeger collector fleet to handle your peak concurrent traffic or use better connection management on the client (see the sketch after this list), such as:

  • multiple OTLP requests from a single client could be sent through a single gRPC connection
  • clients should maintain long lived connections, not open/close them per request
  • clients should not leave many idle connections lingering unclosed
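A minimal sketch of that long-lived-connection pattern in Go (the package name, target handling, insecure credentials, and keepalive values are illustrative assumptions, not Jaeger client APIs): dial once, share the connection across goroutines, and close it only at shutdown.

package exporter

import (
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

var (
	connOnce sync.Once
	conn     *grpc.ClientConn
	connErr  error
)

// CollectorConn returns one process-wide connection to the collector.
// gRPC multiplexes concurrent RPCs over a single connection, so callers
// should reuse this instead of dialing (and closing) per request.
func CollectorConn(target string) (*grpc.ClientConn, error) {
	connOnce.Do(func() {
		conn, connErr = grpc.Dial(target,
			grpc.WithTransportCredentials(insecure.NewCredentials()),
			grpc.WithKeepaliveParams(keepalive.ClientParameters{
				Time:    30 * time.Second, // ping idle connections so they stay healthy
				Timeout: 10 * time.Second,
			}),
		)
	})
	return conn, connErr
}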

@mulpk

mulpk commented Aug 26, 2024

Thanks for the response. I'm more inclined toward sizing the Jaeger collector fleet to handle our peak concurrent traffic - we are currently using the /api/v2/spans endpoint for submitting spans in Zipkin JSON v2. As mentioned above, we are sending about 100 spans in each request - do you have any recommendation here?

When you say size jaeger-collector, does that mean horizontal or vertical scaling?

@yurishkuro
Member

I meant horizontal scaling, to distribute connections across instances. You will have other bottlenecks if you are creating huge traffic spikes - your storage may not be able to handle the load.
