DNS resolution timeout/failure in >= 1.8.5 #4050

Closed
remisauvat opened this issue Sep 3, 2021 · 31 comments

Comments

@remisauvat

Bug Report

Describe the bug
Hi, I have been facing a DNS resolution timeout/failure since upgrading to >= 1.8.5 with the forward module to a fluentd instance. It works fine with 1.8.4. I am running on Ubuntu 20.04, and the local resolver accepts UDP and TCP requests. I tried setting net.dns.mode to UDP, but it changes nothing. I am guessing there might be an issue with 1.8.5 and the changes to the DNS resolution library. I still get the same error when setting the upstream to www.google.com.

To Reproduce
I have replaced the real fluentd hostname with fluentd.example.org in this log

[2021/09/03 11:27:49] [ info] [engine] started (pid=2320367)
[2021/09/03 11:27:49] [ info] [storage] version=1.1.1, initializing...
[2021/09/03 11:27:49] [ info] [storage] root path '/var/td-agent-bit/storage'
[2021/09/03 11:27:49] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/09/03 11:27:49] [ info] [storage] backlog input plugin: storage_backlog.2
[2021/09/03 11:27:49] [ info] [cmetrics] version=0.2.1
[2021/09/03 11:27:49] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 95.4M
[2021/09/03 11:27:55] [ info] [http_server] listen iface=127.0.0.1 tcp_port=2020
[2021/09/03 11:27:55] [ info] [sp] stream processor started
[2021/09/03 11:27:55] [ info] [input:tail:tail_proftpd_log] inotify_fs_add(): inode=516227 watch_fd=1 name=/var/log/proftpd/commandsAsJson.log
[2021/09/03 11:27:55] [ info] [input:tail:tail_history_log] inotify_fs_add(): inode=260310 watch_fd=1 name=/var/td-agent-bit/input/commandsAsJson.history.log
[2021/09/03 11:30:34] [ warn] [net] getaddrinfo(host='fluentd.example.org', err=12): Timeout while contacting DNS servers
[2021/09/03 11:30:34] [error] [output:forward:forward_to_fluentd] no upstream connections available
[2021/09/03 11:30:34] [ warn] [engine] failed to flush chunk '2320367-1630661431.54171690.flb', retry in 10 seconds: task_id=0, input=tail_proftpd_log > output=forward_to_fluentd (out_id=0)
  • Steps to reproduce the problem:
# Output all logs to fluentd instances
[OUTPUT]
    Name forward
    Alias forward_to_fluentd
    Match das.scanner.*
    Upstream upstream.conf
    Retry_Limit False
    tls on

# upstream.conf
[UPSTREAM]
    name    forward-balancing
[NODE]
    name    fluentd
    host    fluentd.example.org
    port    24224
    tls     on

Expected behavior
Messages should be sent to the upstream fluentd service.

Your Environment

  • Version used: Failed in 1.8.5 and 1.8.6. Works in 1.8.4
  • Server type and version: Public cloud VM
  • Operating System and version: Ubuntu 20.04
  • Filters and plugins: grep and nest
@rmarchei

rmarchei commented Sep 3, 2021

I guess it's the same issue I hit after upgrading to 1.8.5

@magichair

magichair commented Sep 3, 2021

I am also seeing this and previously reported against aws/aws-for-fluent-bit#233 when they bumped their version from fluent-bit 1.8.3 -> 1.8.6. I'll add to this report that it is definitely not a host networking issue as nslookup is able to resolve the hostnames with no issues when SSH'd directly to these containers. Thanks.

@zhonghui12
Contributor

@edsiper, many of our customers are facing this issue and they need to downgrade to 1.8.3 for now. Could you please take a look at this issue? Thanks.

@branttaylor

We are seeing this in our environment as well. As others have mentioned, downgrading to 1.8.4 fixes the problem.

@pnl0dg7k

pnl0dg7k commented Sep 8, 2021

How do you install a specific version?
Is it possible to do it as shown below?
apt-get -y install td-agent-bit-1.8.4

@NathanNZ

NathanNZ commented Sep 8, 2021

Close, apt-get -y install td-agent-bit=1.8.4 will do the trick!

@remisauvat
Author

I'll add to this report that it is definitely not a host networking issue as nslookup is able to resolve the hostnames with no issues
Same for me: host resolution is working correctly.

getent hosts  fluentd.company.org
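
A comparable check can be scripted. The sketch below resolves a name through the operating system resolver using Python's `socket.getaddrinfo`. That is a different code path from the asynchronous DNS client that Fluent Bit uses from 1.8.5 onward, so a success here only rules out host-level resolution problems. The function name `resolve_with_system_resolver` is made up for illustration, and the default port 24224 is simply the one from the config in this report.

```python
import socket


def resolve_with_system_resolver(hostname, port=24224):
    """Return the sorted, de-duplicated addresses the OS resolver gives for hostname."""
    # getaddrinfo goes through the system resolver configuration
    # (for example /etc/hosts and /etc/resolv.conf on Linux).
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_with_system_resolver("fluentd.example.org")` raises `socket.gaierror` if the system resolver cannot resolve the name.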

@PettitWesley
Contributor

We have multiple different reports of this bug in the AWS distro repo: aws/aws-for-fluent-bit#233

We're working on repro'ing and some preliminary investigation.

@PettitWesley
Contributor

Hey folks, here are the results of my repro attempts. I was able to confirm this issue report, at least for the datadog output.

Base Config

I tested 4 different outputs:

  • cloudwatch_logs
  • es (sending to Amazon OpenSearch, and I also sourced credentials from the STS API, so two AWS service endpoints are called)
  • http (sending to google.com, because I'm just trying to test DNS resolution)
  • Datadog

Starting with this configuration, I made some simple modifications in each test (results below; the only modification was adding the net.dns.mode setting in versions that support it).
[INPUT]
    Name dummy
    Tag dummy

[OUTPUT]
    Name datadog
    Match *
    Host http-intake.logs.datadoghq.com
    TLS On
    apikey  REDACTED
    dd_service my-test-service-dns-issue
    dd_source fluent-bit
    dd_tags project:example
    provider ecs


[OUTPUT]
    Name  http
    Match *
    Host  google.com
    Port  80
    URI   /

[OUTPUT]
    Name  es
    Match *
    Host  REDACTED
    Port  443
    Index my_index
    Type  my_type
    AWS_Auth On
    AWS_Region us-west-2
    tls     On
    AWS_Role_Arn REDACTED

[OUTPUT]
    Name cloudwatch_logs
    Match   *
    region us-east-1
    log_group_name fluent-bit-cloudwatch
    log_stream_prefix from-fluent-bit-
    auto_create_group On
    net.dns.mode TCP

Testing Env

Just my Mac on my local home network, running fluent bit in Docker.

Versions Tested

This confirms the issue is in 1.8.5+, at least for the datadog output.

For AWS for Fluent Bit customers, see our release page to map fluent bit versions to our versions: https://github.com/aws/aws-for-fluent-bit/releases

✅ == DNS resolution works.
❌ == DNS resolution failed. I specifically saw this message:

[2021/09/14 01:23:09] [ warn] [net] getaddrinfo(host='http-intake.logs.datadoghq.com', err=12): Timeout while contacting DNS servers

Version | cloudwatch_logs | datadog | http (google.com) | es (send to Amazon OpenSearch)
1.8.0
1.8.1
1.8.2
1.8.3
1.8.4
1.8.5
1.8.5 with TCP DNS
1.8.5 with UDP DNS
1.8.6
1.8.6 with TCP DNS
1.8.6 with UDP DNS

@PettitWesley
Contributor

PettitWesley commented Sep 14, 2021

There are 3 commits that together add a new feature: each output supports a net.dns.mode option, whose value can be either TCP or UDP, which controls how Fluent Bit contacts DNS servers.

These changes seem to have been introduced right before this issue started.

@branttaylor

Thanks @PettitWesley. Since you've mentioned some differences in outputs, I'll add that we're using the gelf output and seeing the failures in v1.8.5+, like you're seeing with the datadog output.

@PettitWesley
Contributor

@edsiper We have now also seen one instance of this error with kinesis_firehose:

[2021/09/02 01:10:43] [ warn] [net] getaddrinfo(host='firehose.us-east-1.amazonaws.com', err=12): Timeout while contacting DNS servers

It still remains most easily reproducible in datadog for me.

leonardo-albertovich linked a pull request Sep 14, 2021 that will close this issue
@leonardo-albertovich
Collaborator

@PettitWesley In the end, I traced the issue to plugins that did not implement the config_map interface, which caused their net_setup not to be initialized. This issue was not exposed before because create_conn handled the case where the timeout was set to zero by marking the connection so that the timeout handler ignored it. The async DNS client did not apply such logic, which is why the issue started showing up.

There are a few more plugins (such as gelf and bigquery) that do not implement the interface, but kinesis_firehose seems to do it, so I think we'll have to talk about that one a bit more to determine whether it's the same issue or not.
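
To make the failure mode concrete, here is a small Python model of the behavior described above. It is an illustration only, not Fluent Bit's C code: `NetSetup`, `connection_timed_out`, and `dns_lookup_timed_out` are hypothetical names, and the 10-second value is just an example of a non-zero timeout.

```python
from dataclasses import dataclass


@dataclass
class NetSetup:
    # Hypothetical stand-in for a plugin's network settings.
    # 0 models a structure that was never initialized with defaults.
    connect_timeout: float = 0


def connection_timed_out(setup: NetSetup, elapsed: float) -> bool:
    """Connection path: a zero timeout is treated as 'no deadline'."""
    if setup.connect_timeout == 0:
        return False
    return elapsed >= setup.connect_timeout


def dns_lookup_timed_out(setup: NetSetup, elapsed: float) -> bool:
    """DNS path without the zero guard: a zero timeout expires at once."""
    return elapsed >= setup.connect_timeout


# A plugin whose settings were never initialized, and one with a real timeout.
uninitialized = NetSetup()
configured = NetSetup(connect_timeout=10)
```

In this model, `connection_timed_out(uninitialized, 0.5)` is `False` while `dns_lookup_timed_out(uninitialized, 0.5)` is `True`, which mirrors a lookup that is reported as timed out almost immediately. With `configured`, both functions return `False` at 0.5 seconds.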

@lubingfeng

Thanks Leonardo. A few questions:

  1. When was the config_map interface introduced?
  2. As Fluent Bit 1.8.4 works with DataDog but not Fluent Bit 1.8.5, I am wondering what changes in Fluent Bit 1.8.5 triggered this issue (if any)?
  3. To address this issue, does this require DataDog to implement the config_map interface or something else? What about other plugins?

edsiper added a commit that referenced this issue Sep 14, 2021
For plugins that do not implement a config map interface, the networking
setup was missing, leading to connect_timeout=0, no keep alive, plus
others.

In this patch we always initialize the plugin instance network defaults,
but this becomes a fixed value that cannot be changed through the
configuration.

The long term solution is to migrate plugins to use config maps.

Signed-off-by: Eduardo Silva <[email protected]>
edsiper added a commit that referenced this issue Sep 15, 2021
@edsiper
Member

edsiper commented Sep 15, 2021

Fixed in #4088

@edsiper edsiper closed this as completed Sep 15, 2021
edsiper added a commit that referenced this issue Sep 16, 2021
pwhelan pushed a commit to pwhelan/fluent-bit that referenced this issue Sep 16, 2021
fluent#4088)

@paynejacob

We are still seeing this issue in 1.8.7 rancher/rancher#34772

@rmarchei

rmarchei commented Oct 5, 2021

We are still seeing this issue in 1.8.7 rancher/rancher#34772

Maybe you hit this? In that case, it should be fixed in the next release.

@paynejacob

That looks like it could be our issue, we will test with the next release.

@liyanhui1228

We are still seeing this DNS issue for >= 1.8.5 on Windows. Is there a similar fix also needed for fluent-bit-windows? Or is there any workaround?

@leonardo-albertovich
Collaborator

@liyanhui1228 It's fixed in 1.8.9.

@liyanhui1228

I tried 1.8.9 as well and encountered the same error though.

@PettitWesley
Contributor

@liyanhui1228 Do you see it repeatedly or sporadically?

Also make sure you test the net.dns.mode setting as UDP and as TCP.

The DNS error handling could perhaps be improved to reduce sporadic errors, or we need some sort of auto-retry: #4257

@liyanhui1228

Thanks @PettitWesley! Setting net.dns.mode to TCP worked for me, while UDP doesn't work. Without setting net.dns.mode, the DNS issue happens all the time and no logs can be exported. Do you know if this setting is only required for 1.8.9 and later versions? We are currently using 1.8.3, and we didn't need to set it for DNS to work.

@leonardo-albertovich
Collaborator

That setting is not available in 1.8.3 because the asynchronous DNS client was introduced in 1.8.5. However, there were some issues that were fixed in the subsequent versions, which is why I suggested updating to 1.8.9.

Would you be able to test if you are able to issue DNS lookups through UDP where you are running fluent-bit? It would be odd for you not to be able to since that's actually the mainstream way of doing it but I'm asking because you mentioned that switching to TCP helped.

Would you be able to share the configuration you're using and context required to reproduce the issue?
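
One way to run that UDP check without extra tooling is a small probe that sends the same A-record query over UDP and over TCP and reports which transport gets an answer. This is a minimal sketch using only the Python standard library. The function names are made up for illustration, and the DNS server address you pass in (for example the nameserver listed in /etc/resolv.conf) is an assumption about your environment.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def recv_exact(sock, size):
    """Read exactly size bytes from a stream socket."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("DNS server closed the connection early")
        data += chunk
    return data


def query_udp(server, hostname, timeout=3.0):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname), (server, 53))
        response, _ = sock.recvfrom(4096)
    return response


def query_tcp(server, hostname, timeout=3.0):
    query = build_query(hostname)
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        # DNS over TCP prefixes each message with a 2-byte length.
        sock.sendall(struct.pack("!H", len(query)) + query)
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        return recv_exact(sock, length)


def answer_count(response):
    """Return the ANCOUNT field from a DNS response header."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    return struct.unpack("!H", response[6:8])[0]


def probe(server, hostname):
    """Try both transports and describe the outcome of each."""
    results = {}
    for name, query in (("UDP", query_udp), ("TCP", query_tcp)):
        try:
            count = answer_count(query(server, hostname))
            results[name] = f"ok, {count} answer(s)"
        except (OSError, ValueError) as exc:
            results[name] = f"failed: {exc}"
    return results
```

Calling `probe(server, "fluentd.example.org")` with your resolver's IPv4 address returns a dictionary with one entry per transport. If UDP fails or times out while TCP succeeds, that would be consistent with net.dns.mode TCP helping.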

@ssplatt

ssplatt commented Nov 11, 2021

I don't think I mentioned this in this thread, but TCP did not fix this 100% in my environment. It did reduce the errors significantly. The main thing I needed to configure was Retry_Limit, which defaults to 2 and is far too low in these cases. In testing I found that 4 was the maximum number of retries a chunk needed to succeed, so I landed on a setting of 10 to be safe. In cases where I absolutely need the logs to go through I set 0, but I also found that this disables the log output showing that chunks are retrying.

MiniCodeMonkey added a commit to Geocodio/docker-fluentbit-docker-client that referenced this issue Feb 23, 2022
@morampudisouji

With the 1.8.15 version of the Fluent Bit image, we are seeing DNS errors with the es output plugin. Below is a sample log from the Fluent Bit pod.

[2022/04/27 17:47:00] [ warn] [net] getaddrinfo(host=xxxx', err=12): Timeout while contacting DNS servers
[2022/04/27 17:47:01] [ warn] [http_client] cannot increase buffer: current=512000 requested=544768 max=512000
[2022/04/27 17:47:07] [ info] [input:tail:tail.1] inode=188774993 handle rotation(): /var/log/containers/fluent-bit-5lddk_istio-system_istio-proxy-d9a45bb8902ccc4688952ec360a4debbf22952c9b3541ff8e5e935acac99920b.log => /var/lib/docker/containers/d9a45bb8902ccc4688952ec360a4debbf22952c9b3541ff8e5e935acac99920b/d9a45bb8902ccc4688952ec360a4debbf22952c9b3541ff8e5e935acac99920b-json.log.4

When I tried the 1.8.4 Fluent Bit image, I didn't see any errors related to DNS.

@PettitWesley
Contributor

@morampudisouji Have you tried net.dns.mode UDP and net.dns.mode TCP?

@morampudisouji

Thanks for the reply.
Yes, I tried setting both net.dns.mode TCP and net.dns.mode UDP.
With net.dns.mode UDP, logs are not sent to Elasticsearch.
With net.dns.mode TCP, logs flow to Elasticsearch, but I still see DNS errors in the Fluent Bit log.

@PettitWesley
Contributor

@morampudisouji So with TCP, you see that the logs are there, so it retries and DNS succeeds? Is the failure only one time on startup or something? It may be best to open a new issue for this.

@morampudisouji

It's not a startup error; the errors keep appearing continuously.

@AshutoshNirkhe

We are hitting the same timeout issues in 1.8.15 as @morampudisouji.
I tried adding net.dns.mode TCP. It helped a bit, but not significantly.

Logs do flow on retry, but some of them may be getting lost.
Do we have any fix for this issue?
