
Too many open files #5460

Closed
ture-karlsson opened this issue May 16, 2022 · 10 comments
@ture-karlsson

ture-karlsson commented May 16, 2022

Bug Report

My td-agent-bit is continuously using more and more file descriptors and eventually stops working. I have increased the file limits on the system:

# sysctl fs.file-max
fs.file-max = 2000000
# ulimit -Hn
1000000
# ulimit -Sn
1000000
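Worth noting: `fs.file-max` and the shell's `ulimit` do not necessarily apply to a systemd service, which gets its own `LimitNOFILE`. The limit that actually governs the process is recorded in `/proc/<pid>/limits`. A quick check (sketch; shown against the current shell's PID, substitute `$(pidof td-agent-bit)` for the agent):

```shell
# Read the per-process descriptor limit straight from /proc; for a systemd
# service this reflects the unit's LimitNOFILE, not the login shell's ulimit.
pid=$$   # for the agent: pid=$(pidof td-agent-bit)
grep "Max open files" "/proc/$pid/limits"
```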

After a couple of hours of running the agent, it starts to produce these log messages over and over:

May 16 13:16:32 <hostname> td-agent-bit[74459]: accept4: Too many open files
May 16 13:16:32 <hostname> td-agent-bit[74459]: [log] error opening log file /var/log/fluentbit.log. Using stderr.
May 16 13:16:32 <hostname> td-agent-bit[74459]: [2022/05/16 13:16:32] [error] [input:tcp:tcp.4] could not accept new connection

Even though the limits above aren't reached:

# lsof | grep $(pidof td-agent-bit) | wc -l
13847

However, the count when it stops is always somewhere in that range (13000-14000). Is it normal for it to use this many files? It feels like the number just keeps growing and old descriptors are never closed.
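As an aside, `lsof | grep <pid>` also matches memory-mapped files and, depending on the lsof version, can repeat entries per thread, so it tends to overcount. Counting the entries in `/proc/<pid>/fd` gives the exact figure that the NOFILE limit is checked against (sketch; uses the current shell's PID for illustration):

```shell
# Count actual open file descriptors; this is the number compared against
# "Max open files". Substitute $(pidof td-agent-bit) for the agent.
pid=$$
ls "/proc/$pid/fd" | wc -l
```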

To Reproduce

  • Steps to reproduce the problem:
    This is my config:
# cat /etc/td-agent-bit/td-agent-bit.conf
[SERVICE]
    Flush        5
    Daemon       Off
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    Plugins_File plugins.conf
    Log_Level    info
    Log_File     /var/log/fluentbit.log

[INPUT]
    Name        tcp
    Tag         tag1
    Listen      0.0.0.0
    Port        5170
    Chunk_Size  32
    Buffer_Size 64
    Format      json

[INPUT]
    Name        tcp
    Tag         tag2
    Listen      0.0.0.0
    Port        5171
    Chunk_Size  32
    Buffer_Size 64
    Format      json

[INPUT]
    Name        tcp
    Tag         tag3
    Listen      0.0.0.0
    Port        5172
    Chunk_Size  32
    Buffer_Size 64
    Format      json

[INPUT]
    Name        tcp
    Tag         tag4
    Listen      0.0.0.0
    Port        5173
    Chunk_Size  32
    Buffer_Size 64
    Format      json

[INPUT]
    Name        tcp
    Tag         tag5
    Listen      0.0.0.0
    Port        5174
    Chunk_Size  32
    Buffer_Size 64
    Format      json

[FILTER]
    Name modify
    Match tag1 
    Set role tag1
    Set env tag1
    Remove save_at_master
    Remove pid
    Remove file
    Remove line
    Remove process_guid
    Remove logger_name
    Remove @timestamp

[FILTER]
    Name modify
    Match tag2
    Remove save_at_master
    Remove pid
    Remove file
    Remove process_guid
    Remove logger_name
    Remove @timestamp

[FILTER]
    Name modify
    Match tag3
    Remove pid
    Remove file
    Remove process_guid
    Remove logger_name
    Remove @timestamp

[FILTER]
    Name modify
    Match tag4
    Remove @timestamp

[FILTER]
    Name modify
    Match tag5
    Remove @timestamp

[OUTPUT]
    Name es
    Logstash_Format True
    Logstash_Prefix tag1
    Match tag1
    Host es.example.com
    Port 443
    Type _doc
    tls On
    tls.verify On
    net.keepalive Off

[OUTPUT]
    Name  es
    Logstash_Format True
    Logstash_Prefix tag2
    Match tag2
    Host es.example.com
    Port 443
    Type _doc
    tls On
    tls.verify On
    net.keepalive Off

[OUTPUT]
    Name  es
    Logstash_Format True
    Logstash_Prefix tag3
    Match tag3
    Host es.example.com
    Port 443
    Type _doc
    tls On
    tls.verify On
    net.keepalive Off

[OUTPUT]
    Name  es
    Logstash_Format True
    Logstash_Prefix tag4
    Match tag4
    Host es.example.com
    Port 443
    Type _doc
    tls On
    tls.verify On
    net.keepalive Off

[OUTPUT]
    Name  es
    Logstash_Format True
    Logstash_Prefix tag5
    Match tag5
    Host es.example.com
    Port 443
    Type _doc
    tls On
    tls.verify On
    net.keepalive Off
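Unrelated to the limit itself, but note that `net.keepalive Off` makes every flush open a fresh TLS connection (with its own DNS lookup) to the ES endpoint, across all five outputs, which adds descriptor churn. If the backend tolerates persistent connections, a hedged alternative is to enable keepalive with an idle timeout (both are documented Fluent Bit networking options; the `...` stands for the other keys shown above):

```
[OUTPUT]
    Name  es
    ...
    net.keepalive              On
    net.keepalive_idle_timeout 30
```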

When the agent is stuck, I can get it running again by restarting the service. After a restart it uses far fewer file descriptors, but the number keeps climbing:

# lsof | grep $(pidof td-agent-bit) | wc -l
13949
# systemctl restart td-agent-bit
# lsof | grep $(pidof td-agent-bit) | wc -l
2444
# lsof | grep $(pidof td-agent-bit) | wc -l
2652
# lsof | grep $(pidof td-agent-bit) | wc -l
2782

Expected behavior
td-agent-bit should not stop working.

Your Environment

  • Version used: td-agent-bit.x86_64 1.9.3-1
  • Configuration: see above
  • Environment name and version (e.g. Kubernetes? What version?): it is running as a normal systemd service
  • Server type and version: VM
  • Operating System and version: CentOS 7.9
  • Filters and plugins:

Additional context
Is there some other file limit that I'm not aware of? Is it normal for it to use this many open files? It feels like they are not being closed properly.

@patrick-stephens
Contributor

It looks like it's all socket comms, so are there a load of open connections in an error or wait state?

What does the Fluent Bit log show, any connection issues?

@ture-karlsson
Author

ture-karlsson commented May 16, 2022

This keeps showing up in the logs at the same time:

[2022/05/16 13:26:41] [error] [input:tcp:tcp.4] could not accept new connection
[2022/05/16 13:26:41] [error] [sched] cannot do timeout_create()
[2022/05/16 13:26:41] [ warn] [net] getaddrinfo(host='es.example.com', err=24): DNS query cancelled
[2022/05/16 13:26:41] [error] [sched]  'retry request' could not be created. the system might be running out of memory or file descriptors.
[2022/05/16 13:26:41] [ warn] [engine] retry for chunk '74459-1652700396.997021478.flb' could not be scheduled: input=tcp.4 > output=es.4
[2022/05/16 13:27:11] [error] [sched] cannot do timeout_create()
[2022/05/16 13:27:11] [ warn] [net] getaddrinfo(host='es.example.com', err=24): DNS query cancelled
[2022/05/16 13:27:11] [error] [sched]  'retry request' could not be created. the system might be running out of memory or file descriptors.
[2022/05/16 13:27:11] [ warn] [engine] retry for chunk '74459-1652700426.915530723.flb' could not be scheduled: input=tcp.4 > output=es.4
[2022/05/16 13:27:16] [error] [sched] cannot do timeout_create()
[2022/05/16 13:27:16] [ warn] [net] getaddrinfo(host='es.example.com', err=24): DNS query cancelled
[2022/05/16 13:27:16] [error] [sched]  'retry request' could not be created. the system might be running out of memory or file descriptors.
[2022/05/16 13:27:16] [ warn] [engine] retry for chunk '74459-1652700434.82784915.flb' could not be scheduled: input=tcp.2 > output=es.2
[2022/05/16 13:27:21] [error] [sched] cannot do timeout_create()

How can I see if they are in an error or wait state?

@patrick-stephens
Contributor

Netstat or similar to see what connections you have, e.g. https://transang.me/check-for-listening-ports-in-linux/

Just wondering if failed or failing connections are not being cleaned up.

@ture-karlsson
Author

ture-karlsson commented May 17, 2022

OK, netstat shows td-agent-bit binding to the ports in our config:

# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:5170            0.0.0.0:*               LISTEN      88553/td-agent-bit  
tcp        0      0 0.0.0.0:5171            0.0.0.0:*               LISTEN      88553/td-agent-bit  
tcp        0      0 0.0.0.0:5172            0.0.0.0:*               LISTEN      88553/td-agent-bit  
tcp        0      0 0.0.0.0:5173            0.0.0.0:*               LISTEN      88553/td-agent-bit  
tcp        0      0 0.0.0.0:5174            0.0.0.0:*               LISTEN      88553/td-agent-bit 

I'm not very familiar with how fluentbit handles these connections, but I have identified that it's the last input, port 5174, that is causing the problem. For some reason I get a bunch of open files per connection:

# lsof -i:5174 | wc -l
841
# lsof | grep 5174 | wc -l
10920

Is this common?

If I break it down by unique client hostnames, I get the following numbers:

# lsof -i:5174 | awk '{print $9}' | awk -F '->' '{print $2}' | wc -l
832
# lsof -i:5174 | awk '{print $9}' | awk -F '->' '{print $2}' | sort -u | wc -l
825

# lsof | grep 5174 | awk '{print $9}' | awk -F '->' '{print $2}' | wc -l
10829
# lsof | grep 5174 | awk '{print $9}' | awk -F '->' '{print $2}' | sort -u | wc -l
829

I don't know if I'm barking up the wrong tree here.

@ture-karlsson
Author

@patrick-stephens do you have any idea what to do here?

@patrick-stephens
Contributor

Does netstat show any other connections in WAIT states, e.g. TIME_WAIT or CLOSE_WAIT?
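One way to answer this (sketch; assumes the `ss` tool from iproute2 and the five listen ports from the config above):

```shell
# Tally TCP connection states on the agent's listen ports; a pile of
# CLOSE_WAIT entries here would mean the peers closed their end but
# fluent-bit never called close() on its side.
ss -tan | awk '$4 ~ /:517[0-4]$/ {print $1}' | sort | uniq -c
```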

@ture-karlsson
Author

Nope, they are all in LISTEN state.

@lecaros
Contributor

lecaros commented May 19, 2022

Which user is running the agent?
Have you modified limits for that user?
Is it running as a service?

Can you share output of the following commands?

grep "Max open files" /proc/<fluent-agent-id>/limits
ulimit -Sn
ulimit -Hn
systemctl show -p DefaultLimitNOFILE
systemctl show <agent-service-name> | grep LimitNOFILE

@merveyilmaz-netrd

merveyilmaz-netrd commented May 23, 2022

You need to increase the ulimit of the fluent-bit service. Steps:

1. Copy the fluent-bit service file to the /etc/systemd/system directory:

cp /usr/lib/systemd/system/fluent-bit.service /etc/systemd/system/

2. Insert the LimitNOFILE=20000 option (adjust the number to your requirements) into the [Service] section of /etc/systemd/system/fluent-bit.service, like below:

[Unit]
Description=Fluent Bit
Documentation=https://docs.fluentbit.io/manual/
Requires=network.target
After=network.target

[Service]
Type=simple
ExecStart=/opt/fluent-bit/bin/fluent-bit -c /etc/fluent-bit/fluent-bit.conf
Restart=always
LimitNOFILE=20000

[Install]
WantedBy=multi-user.target

3. Restart the fluent-bit service:

systemctl restart fluent-bit
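An equivalent approach that survives package upgrades is a systemd drop-in rather than a full copy of the unit: create /etc/systemd/system/fluent-bit.service.d/limits.conf containing just the override:

```
[Service]
LimitNOFILE=20000
```

then run systemctl daemon-reload and restart the service.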

@ture-karlsson
Author

ture-karlsson commented May 23, 2022

Yes, thanks guys, that was it. I forgot about the file limit in the service itself. It has been running for 3 days now without issues, so it looks like it is working properly now. I will close this.
