
Timeouts after upgrading to 1.8.5 #4042

Closed
rmarchei opened this issue Sep 2, 2021 · 11 comments · Fixed by #4122
rmarchei commented Sep 2, 2021

Bug Report

After upgrading to 1.8.5, Fluent Bit times out while connecting to Fluentd:

Sep 02 00:27:20 admin01 fluent-bit[121051]: [2021/09/02 00:27:20] [error] [upstream] connection #-1 to fluentd.service.consul:24224 timed out after 10 seconds
Sep 02 00:27:56 admin01 fluent-bit[121051]: [2021/09/02 00:27:56] [error] [upstream] connection #-1 to fluentd.service.consul:24224 timed out after 10 seconds

Output is configured as:

[OUTPUT]
    Name                 forward
    Match                fluentd.*
    Host                 fluentd.service.consul
    Port                 24224
    Require_ack_response True

fluentd.service.consul resolves without issues, and it is even listed in /etc/hosts.

The behavior is the same with 1.8.6; downgrading to 1.8.4 solves the issue.

Everything is running on Ubuntu 20.04.3.
Fluent Bit version: td-agent-bit 1.8.5
Fluentd version: td-agent 4.2.0-1

Full log from the last Fluent Bit restart:

Sep 02 00:27:05 admin01 systemd[1]: Started Fluent Bit.
Sep 02 00:27:05 admin01 fluent-bit[121051]: Fluent Bit v1.8.6
Sep 02 00:27:05 admin01 fluent-bit[121051]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Sep 02 00:27:05 admin01 fluent-bit[121051]: * Copyright (C) 2015-2018 Treasure Data
Sep 02 00:27:05 admin01 fluent-bit[121051]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Sep 02 00:27:05 admin01 fluent-bit[121051]: * https://fluentbit.io
Sep 02 00:27:05 admin01 fluent-bit[121051]: [2021/09/02 00:27:05] [ info] [engine] started (pid=121051)
Sep 02 00:27:05 admin01 fluent-bit[121051]: [2021/09/02 00:27:05] [ info] [storage] version=1.1.1, initializing...
Sep 02 00:27:05 admin01 fluent-bit[121051]: [2021/09/02 00:27:05] [ info] [storage] root path '/var/lib/fluentbit/storage'
Sep 02 00:27:05 admin01 fluent-bit[121051]: [2021/09/02 00:27:05] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
Sep 02 00:27:05 admin01 fluent-bit[121051]: [2021/09/02 00:27:05] [ info] [storage] backlog input plugin: storage_backlog.1
Sep 02 00:27:05 admin01 fluent-bit[121051]: [2021/09/02 00:27:05] [ info] [cmetrics] version=0.2.1
Sep 02 00:27:05 admin01 fluent-bit[121051]: [2021/09/02 00:27:05] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 4.8M
Sep 02 00:27:05 admin01 fluent-bit[121051]: [2021/09/02 00:27:05] [ info] [sp] stream processor started
Sep 02 00:27:05 admin01 fluent-bit[121051]: [2021/09/02 00:27:05] [ info] [input:tail:tail.user_action] inotify_fs_add(): inode=774739 watch_fd=1 name=/var/log/app/mylog.log
Sep 02 00:27:20 admin01 fluent-bit[121051]: [2021/09/02 00:27:20] [error] [upstream] connection #-1 to fluentd.service.consul:24224 timed out after 10 seconds
Sep 02 00:27:56 admin01 fluent-bit[121051]: [2021/09/02 00:27:56] [error] [upstream] connection #-1 to fluentd.service.consul:24224 timed out after 10 seconds
capricornusx commented Sep 7, 2021

Same case here (connecting to Elasticsearch):

Ubuntu 20.04.3 LTS
Fluent Bit 1.8.6
Elasticsearch 7.14.0

Downgrading to 1.8.4 solves the issue.

Sep 07 08:46:49 monitoring td-agent-bit[1550873]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: * https://fluentbit.io
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: [2021/09/07 08:46:49] [ info] [engine] started (pid=1550873)
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: [2021/09/07 08:46:49] [ info] [storage] version=1.1.1, initializing...
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: [2021/09/07 08:46:49] [ info] [storage] root path '/storage'
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: [2021/09/07 08:46:49] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: [2021/09/07 08:46:49] [ info] [storage] backlog input plugin: storage_backlog.1
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: [2021/09/07 08:46:49] [ info] [cmetrics] version=0.2.1
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: [2021/09/07 08:46:49] [ info] [input:forward:main_input] listening on 0.0.0.0:24224
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: [2021/09/07 08:46:49] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 244.1M
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: [2021/09/07 08:46:49] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2021
Sep 07 08:46:49 monitoring td-agent-bit[1550873]: [2021/09/07 08:46:49] [ info] [sp] stream processor started
Sep 07 08:47:18 monitoring td-agent-bit[1550873]: [2021/09/07 08:47:18] [error] [upstream] connection #-1 to localhost:9200 timed out after 10 seconds
Sep 07 08:47:18 monitoring td-agent-bit[1550873]: [2021/09/07 08:47:18] [error] [upstream] connection #-1 to localhost:9200 timed out after 10 seconds
Sep 07 08:47:18 monitoring td-agent-bit[1550873]: [2021/09/07 08:47:18] [error] [upstream] connection #-1 to localhost:9200 timed out after 10 seconds
Sep 07 08:47:18 monitoring td-agent-bit[1550873]: [2021/09/07 08:47:18] [error] [upstream] connection #-1 to localhost:9200 timed out after 10 seconds
Sep 07 08:47:39 monitoring td-agent-bit[1550873]: [2021/09/07 08:47:39] [error] [upstream] connection #-1 to localhost:9200 timed out after 10 seconds
...

@leonardo-albertovich (Collaborator)

Hello folks, I'm working on this issue and I have a few questions that would help me figure out whether something needs to be fixed and, if so, what:

  • Are the timeouts sporadic or constant?
  • Do the timeouts show as soon as you start fluent-bit?
  • If they do not show as soon as you start fluent-bit, does fluent-bit ever go back to working once the timeouts start appearing, or does it keep failing on every attempt until you restart it?
  • Would you be able to provide your configuration files? (With any sensitive information redacted of course)
  • Can you easily reproduce the problem?

I noticed that @rmarchei mentioned DNS resolution, but I didn't see any DNS-related errors in the log; a DNS error should cause fluent-bit to print something like this:

[ warn] [net] getaddrinfo(host='xxxxxxxxxxxxx.xxx', err=12): Timeout while contacting DNS servers

Thanks for reporting the issue; any further assistance with reproduction and troubleshooting would be greatly appreciated.

@rmarchei (Author)

I think this issue has been fixed in #4050.


rmarchei commented Sep 20, 2021

Actually, it still seems to be broken in 1.8.7.

* Are the timeouts sporadic or constant?

Constant:

Sep 20 12:26:10 front02 fluent-bit[1237001]: [2021/09/20 12:26:10] [error] [upstream] connection #-1 to fluentd.service.consul:24224 timed out after 10 seconds
Sep 20 12:26:14 front02 fluent-bit[1237001]: [2021/09/20 12:26:14] [error] [upstream] connection #-1 to fluentd.service.consul:24224 timed out after 10 seconds
Sep 20 12:26:19 front02 fluent-bit[1237001]: [2021/09/20 12:26:19] [error] [upstream] connection #-1 to fluentd.service.consul:24224 timed out after 10 seconds
Sep 20 12:26:25 front02 fluent-bit[1237001]: [2021/09/20 12:26:25] [error] [upstream] connection #-1 to fluentd.service.consul:24224 timed out after 10 seconds
Sep 20 12:26:29 front02 fluent-bit[1237001]: [2021/09/20 12:26:29] [error] [upstream] connection #-1 to fluentd.service.consul:24224 timed out after 10 seconds
Sep 20 12:26:34 front02 fluent-bit[1237001]: [2021/09/20 12:26:34] [error] [upstream] connection #-1 to fluentd.service.consul:24224 timed out after 10 seconds
* Do the timeouts show as soon as you start fluent-bit?

Yes. To be precise, they appear as soon as the tailed logs have new entries.

* Would you be able to provide your configuration files? (With any sensitive information redacted of course)

Here is the conf:

[SERVICE]
    flush                     5
    daemon                    off
    log_level                 info
    storage.path              /var/lib/fluentbit/storage
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 5M

    Parsers_File parsers.conf
    Plugins_File plugins.conf

    HTTP_Server  Off
    HTTP_Listen  127.0.0.1
    HTTP_Port    2020
[INPUT]
    Name             tail
    Alias            tail.user_action
    Tag              fluentd.rdb.archive.user_action
    Key              context
    Rotate_Wait      10
    Path             /var/log/test/fluentd/user_action.log
    Refresh_Interval 10
    Multiline        Off
    Parser           json
    DB               /var/lib/fluentbit/user_action.db
    DB.Sync          Extra
[FILTER]
    Name             record_modifier
    Match            fluentd.rdb.archive.user_action
    Whitelist_key    context
[FILTER]
    Name             nest
    Match            fluentd.rdb.archive.user_action
    Operation        lift
    Nested_under     context
[OUTPUT]
    Name                 forward
    Match                fluentd.*
    Host                 fluentd.service.consul
    Port                 24224
    Require_ack_response True
    Retry_Limit          False
* Can you easily reproduce the problem?

Yes: I just have to install a version > 1.8.4, restart Fluent Bit, and the issue appears.

@leonardo-albertovich (Collaborator)

Could you please message me on the Fluent Slack server so we can talk a little more about this?

@rmarchei (Author)

> Could you please message me in the fluent slack server to talk a little bit more about this?

Sure, no problem. Which username (or channel) should I message?

@leonardo-albertovich (Collaborator)

Leonardo Almiñana

@leonardo-albertovich (Collaborator)

After carefully verifying the issue, we (@rmarchei and I) found that when a hostname used as an endpoint by an output plugin is present in the /etc/hosts file, the DNS client changes the returned socket type and protocol from TCP to UDP, which in the end is what caused the connection timeouts for him.

We are still investigating to find the root cause. In the meantime, any users affected by this issue can comment out or remove the /etc/hosts entry for the hostname to work around it.
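For anyone curious about the mechanism: the socket type a resolver hands back depends on the hints passed to getaddrinfo(3). Below is a minimal sketch (not Fluent Bit's actual resolver code; "localhost" and port 24224 are just placeholders) showing how setting ai_socktype to SOCK_STREAM restricts every result to TCP:

```c
/* Sketch: constrain getaddrinfo() results to TCP via the hints struct.
 * Without the ai_socktype hint, the resolver may return both SOCK_STREAM
 * (TCP) and SOCK_DGRAM (UDP) entries for the same host, and blindly using
 * the first entry can yield a UDP socket where TCP was intended. */
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res, *rp;
    int rc;

    /* Ask only for TCP results; this is the kind of constraint that
     * matters when /etc/hosts entries are involved. */
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;     /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;   /* TCP only */

    rc = getaddrinfo("localhost", "24224", &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }

    for (rp = res; rp != NULL; rp = rp->ai_next) {
        /* Every entry must be TCP because of the hints above. */
        assert(rp->ai_socktype == SOCK_STREAM);
        printf("family=%d socktype=%d protocol=%d\n",
               rp->ai_family, rp->ai_socktype, rp->ai_protocol);
    }

    freeaddrinfo(res);
    puts("all results are SOCK_STREAM");
    return 0;
}
```

If the hints are zeroed out entirely (ai_socktype left as 0), the same call may return UDP entries as well, which matches the symptom described above.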

rolandjitsu commented Oct 25, 2021

I'm seeing this in versions > v1.8.3 too, with the forward output, even after #4122 was released in v1.8.8.

@leonardo-albertovich (Collaborator)

@rolandjitsu :

According to the source code, v1.8.3 shouldn't be impacted by this particular issue: it stems from a change in how query sockets are created that was introduced in v1.8.5. I'd encourage you to open a new issue with the appropriate information.

Regards.

@rolandjitsu

> According to the source code v1.8.3 shouldn't be impacted by this particular issue as it was related to a change in how query sockets are created which was changed in v1.8.5 so I'd encourage you to submit a new issue with the appropriate information.

OK, thanks. One interesting behaviour I noticed: if the network connection timeout is too low (e.g. 5s), the DNS issue results in no logs being sent at all; but if I set it to something like 30s, logs still go through even when the DNS issue is reported.
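For reference, the connection timeout mentioned above can be tuned per output via Fluent Bit's net.connect_timeout networking option; a sketch based on the forward output from earlier in this thread (the 30-second value is just the workaround described here, not a general recommendation):

[OUTPUT]
    Name                 forward
    Match                fluentd.*
    Host                 fluentd.service.consul
    Port                 24224
    Require_ack_response True
    net.connect_timeout  30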
