DNS resolution timeout/failure in >= 1.8.5 #4050

remisauvat · 2021-09-03T10:12:11Z

Bug Report

Describe the bug
Hi, I have been facing a DNS resolution timeout/failure since upgrading to >= 1.8.5, using the forward module to send to a fluentd instance. It works fine with 1.8.4. I am running on Ubuntu 20.04 and the local resolver accepts UDP and TCP requests. I tried setting net.dns.mode to UDP but it changes nothing. I am guessing there might be an issue with 1.8.5 and the changes to the DNS resolution library. I still get the same error when setting the upstream to www.google.com.

To Reproduce
I have replaced the real fluentd hostname with fluentd.example.org in this log

[2021/09/03 11:27:49] [ info] [engine] started (pid=2320367)
[2021/09/03 11:27:49] [ info] [storage] version=1.1.1, initializing...
[2021/09/03 11:27:49] [ info] [storage] root path '/var/td-agent-bit/storage'
[2021/09/03 11:27:49] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/09/03 11:27:49] [ info] [storage] backlog input plugin: storage_backlog.2
[2021/09/03 11:27:49] [ info] [cmetrics] version=0.2.1
[2021/09/03 11:27:49] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 95.4M
[2021/09/03 11:27:55] [ info] [http_server] listen iface=127.0.0.1 tcp_port=2020
[2021/09/03 11:27:55] [ info] [sp] stream processor started
[2021/09/03 11:27:55] [ info] [input:tail:tail_proftpd_log] inotify_fs_add(): inode=516227 watch_fd=1 name=/var/log/proftpd/commandsAsJson.log
[2021/09/03 11:27:55] [ info] [input:tail:tail_history_log] inotify_fs_add(): inode=260310 watch_fd=1 name=/var/td-agent-bit/input/commandsAsJson.history.log
[2021/09/03 11:30:34] [ warn] [net] getaddrinfo(host='fluentd.example.org', err=12): Timeout while contacting DNS servers
[2021/09/03 11:30:34] [error] [output:forward:forward_to_fluentd] no upstream connections available
[2021/09/03 11:30:34] [ warn] [engine] failed to flush chunk '2320367-1630661431.54171690.flb', retry in 10 seconds: task_id=0, input=tail_proftpd_log > output=forward_to_fluentd (out_id=0)

Steps to reproduce the problem:

# Output all logs to fluentd instances
[OUTPUT]
    Name forward
    Alias forward_to_fluentd
    Match das.scanner.*
    Upstream upstream.conf
    Retry_Limit False
    tls on

# upstream.conf (the file referenced by the Upstream key above)
[UPSTREAM]
    name    forward-balancing
[NODE]
    name    fluentd
    host    fluentd.example.org
    port    24224
    tls     on

Expected behavior
Messages should be sent to the upstream fluentd service.

Your Environment

Version used: Failed in 1.8.5 and 1.8.6. Works in 1.8.4
Server type and version: Public cloud VM
Operating System and version: Ubuntu 20.04
Filters and plugins: grep and nest

rmarchei · 2021-09-03T12:09:58Z

I guess it's the same issue I hit after upgrading to 1.8.5

magichair · 2021-09-03T13:23:28Z

I am also seeing this and previously reported it in aws/aws-for-fluent-bit#233, when they bumped their version from fluent-bit 1.8.3 -> 1.8.6. I'll add to this report that it is definitely not a host networking issue as nslookup is able to resolve the hostnames with no issues when SSH'd directly into these containers. Thanks.

zhonghui12 · 2021-09-03T16:39:08Z

@edsiper, many of our customers are facing this issue and they need to downgrade to 1.8.3 for now. Could you please take a look at this issue? Thanks.

branttaylor · 2021-09-07T16:58:50Z

We are seeing this in our environment as well. As others have mentioned, downgrading to 1.8.4 fixes the problem.

pnl0dg7k · 2021-09-08T08:36:41Z

How do you install a specific version?
Is it possible to do it as shown below?
apt-get -y install td-agent-bit-1.8.4

NathanNZ · 2021-09-08T09:24:49Z

Close, apt-get -y install td-agent-bit=1.8.4 will do the trick!

remisauvat · 2021-09-09T09:22:26Z

"I'll add to this report that it is definitely not a host networking issue as nslookup is able to resolve the hostnames with no issues"

Same for me: host resolution is working correctly.

getent hosts fluentd.company.org

PettitWesley · 2021-09-13T23:19:17Z

We have multiple different reports of this bug in the AWS distro repo: aws/aws-for-fluent-bit#233

We're working on repro'ing and some preliminary investigation.

PettitWesley · 2021-09-14T01:31:02Z

Hey folks, here are the results of my repro attempts. I was able to confirm this issue report, at least for the datadog output.

Base Config

I tested 4 different outputs:

cloudwatch_logs
es (sending to Amazon OpenSearch, and I also sourced credentials from the STS API, so two AWS service endpoints are called)
http (sending to google.com, because I'm just trying to test DNS resolution)
datadog

Starting with this configuration, I made one simple modification in each test: adding the net.dns.mode setting in versions that support it. Results are below.

[INPUT]
    Name dummy
    Tag dummy

[OUTPUT]
    Name datadog
    Match *
    Host http-intake.logs.datadoghq.com
    TLS On
    apikey  REDACTED
    dd_service my-test-service-dns-issue
    dd_source fluent-bit
    dd_tags project:example
    provider ecs


[OUTPUT]
    Name  http
    Match *
    Host  google.com
    Port  80
    URI   /

[OUTPUT]
    Name  es
    Match *
    Host  REDACTED
    Port  443
    Index my_index
    Type  my_type
    AWS_Auth On
    AWS_Region us-west-2
    tls     On
    AWS_Role_Arn REDACTED

[OUTPUT]
    Name cloudwatch_logs
    Match   *
    region us-east-1
    log_group_name fluent-bit-cloudwatch
    log_stream_prefix from-fluent-bit-
    auto_create_group On
    net.dns.mode TCP

Testing Env

Just my Mac on my local home network, running fluent bit in Docker.

Versions Tested

This confirms the issue is in 1.8.5+, at least for the datadog output.

For AWS for Fluent Bit customers, see our release page to map fluent bit versions to our versions: https://github.com/aws/aws-for-fluent-bit/releases

✅ == DNS resolution works.
❌ == DNS resolution failed. I specifically saw this message:

[2021/09/14 01:23:09] [ warn] [net] getaddrinfo(host='http-intake.logs.datadoghq.com', err=12): Timeout while contacting DNS servers

Version	cloudwatch_logs	datadog	http (google.com)	es (send to Amazon OpenSearch)
1.8.0	✅	✅	✅	✅
1.8.1	✅	✅	✅	✅
1.8.2	✅	✅	✅	✅
1.8.3	✅	✅	✅	✅
1.8.4	✅	✅	✅	✅
1.8.5	✅	❌	✅	✅
1.8.5 with TCP DNS	✅	❌	✅	✅
1.8.5 with UDP DNS	✅	❌	✅	✅
1.8.6	✅	❌	✅	✅
1.8.6 with TCP DNS	✅	❌	✅	✅
1.8.6 with UDP DNS	✅	❌	✅	✅

PettitWesley · 2021-09-14T01:53:30Z

There are 3 commits which together add a new feature: each output supports a net.dns.mode option, whose value can be either TCP or UDP, which controls how Fluent Bit contacts DNS servers.

These changes seem to have been introduced right before this issue started.

branttaylor · 2021-09-14T01:59:14Z

Thanks @PettitWesley, since you've mentioned some differences in outputs I'll add that we're using the gelf output and seeing the failures in v1.8.5+ like you're seeing with the datadog output.

PettitWesley · 2021-09-14T18:00:22Z

@edsiper We now also have seen one instance of this error with kinesis_firehose:

[2021/09/02 01:10:43] [ warn] [net] getaddrinfo(host='firehose.us-east-1.amazonaws.com', err=12): Timeout while contacting DNS servers

It still remains most easily reproducible in datadog for me.

leonardo-albertovich · 2021-09-14T20:20:00Z

@PettitWesley In the end I traced the issue to plugins that do not implement the config_map interface, which caused their net_setup not to be initialized. This was not exposed before because create_conn handled the case where the timeout was set to zero by marking the connection so that the timeout handler ignored it. The async DNS client did not apply such logic, which is why the issue started showing up.

There are a few more plugins (such as gelf and bigquery) that do not implement the interface, but kinesis_firehose seems to do it, so I think we'll have to talk about that one a bit more to determine whether it's the same issue or not.

lubingfeng · 2021-09-14T20:41:13Z

Thanks Leonardo. A few questions:

When was the config_map interface introduced?
As DataDog works with Fluent Bit 1.8.4 but not with Fluent Bit 1.8.5, I am wondering what changes in Fluent Bit 1.8.5 triggered this issue (if any)?
To address this issue, does this require DataDog to implement the config_map interface or something else? What about other plugins?

edsiper · 2021-09-15T15:28:58Z

Fixed in #4088

For plugins that do not implement a config map interface, the networking setup was missing, leading to connect_timeout=0, no keep alive, plus others. In this patch we always initialize the plugin instance network defaults, but this becomes a fixed value that cannot be changed through the configuration. The long term solution is to migrate plugins to use config maps. Signed-off-by: Eduardo Silva

paynejacob · 2021-10-05T20:30:10Z

We are still seeing this issue in 1.8.7 rancher/rancher#34772

rmarchei · 2021-10-05T20:43:32Z

"We are still seeing this issue in 1.8.7 rancher/rancher#34772"

Maybe you hit this? If so, it should be fixed in the next release.

paynejacob · 2021-10-06T05:28:54Z

That looks like it could be our issue, we will test with the next release.

liyanhui1228 · 2021-11-11T20:02:08Z

We are still seeing this DNS issue for >= 1.8.5 on Windows. Is there a similar fix also needed for fluent-bit-windows? Or is there any workaround?

leonardo-albertovich · 2021-11-11T20:31:23Z

@liyanhui1228 It's fixed in 1.8.9.

liyanhui1228 · 2021-11-11T21:40:03Z

I tried 1.8.9 as well and encountered the same error though.

PettitWesley · 2021-11-11T21:52:46Z

@liyanhui1228 Do you see it repeatedly or sporadically?

Also make sure you test the net.dns.mode setting both as UDP and as TCP.

The DNS error handling could perhaps be improved to reduce sporadic errors, or we may need some sort of auto-retry: #4257

liyanhui1228 · 2021-11-11T22:19:33Z

Thanks @PettitWesley! Setting net.dns.mode to TCP worked for me, while UDP doesn't work. Without setting net.dns.mode, the DNS issue happens all the time and no logs can be exported. Do you know if this setting is only required for 1.8.9 and later versions? We are currently using 1.8.3 and we didn't need to set it for DNS to work.
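
For readers who want to run the same comparison, the setting goes inside each [OUTPUT] section. The fragment below is a minimal sketch that reuses the http output from the test configuration earlier in this thread; the choice of output and host is only illustrative, and it assumes a version (1.8.5 or later, per this thread) that supports the option.

```text
[OUTPUT]
    Name  http
    Match *
    Host  google.com
    Port  80
    URI   /
    # Illustrative: test once with TCP, then again with UDP, and compare
    net.dns.mode TCP
```

Changing the last line to net.dns.mode UDP and restarting gives the second half of the comparison.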

leonardo-albertovich · 2021-11-11T22:45:49Z

That setting is not available in 1.8.3 because the asynchronous DNS client was introduced in 1.8.5. However, there were some issues that were ironed out in subsequent versions, which is why I suggested updating to 1.8.9.

Would you be able to test if you are able to issue DNS lookups through UDP where you are running fluent-bit? It would be odd for you not to be able to since that's actually the mainstream way of doing it but I'm asking because you mentioned that switching to TCP helped.

Would you be able to share the configuration you're using and context required to reproduce the issue?

ssplatt · 2021-11-11T23:06:24Z

I don't think I commented on this in this thread, but TCP did not fix this 100% in my environment. It did decrease it significantly. The main thing I needed to configure was the Retry_Limit, which defaults to 2 and is far too low in these cases. In testing I found that 4 was the maximum number of tries a chunk needed to succeed, so I landed on a setting of 10 to be safe. In cases where I absolutely need the logs to go through I set 0, but I also found that this disables the log output showing that chunks are retrying.
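
The combination described above might look like the following sketch. It reuses the forward output and the placeholder hostname from the original report, which are assumptions here since this comment does not name its output plugin. The Retry_Limit value of 10 is the one chosen in this comment.

```text
[OUTPUT]
    Name         forward
    Match        *
    Host         fluentd.example.org
    Port         24224
    # DNS over TCP reduced, but did not eliminate, the timeouts in this report
    net.dns.mode TCP
    # Allow enough retries for a chunk to survive several failed lookups
    Retry_Limit  10
```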

morampudisouji · 2022-04-29T19:27:35Z

With the 1.8.15 version of the Fluent Bit image, we are seeing DNS errors with the es output plugin. Below is a sample log from the Fluent Bit pod.

[2022/04/27 17:47:00] [ warn] [net] getaddrinfo(host='xxxx', err=12): Timeout while contacting DNS servers
[2022/04/27 17:47:01] [ warn] [http_client] cannot increase buffer: current=512000 requested=544768 max=512000
[2022/04/27 17:47:07] [ info] [input:tail:tail.1] inode=188774993 handle rotation(): /var/log/containers/fluent-bit-5lddk_istio-system_istio-proxy-d9a45bb8902ccc4688952ec360a4debbf22952c9b3541ff8e5e935acac99920b.log => /var/lib/docker/containers/d9a45bb8902ccc4688952ec360a4debbf22952c9b3541ff8e5e935acac99920b/d9a45bb8902ccc4688952ec360a4debbf22952c9b3541ff8e5e935acac99920b-json.log.4

When I tried the 1.8.4 Fluent Bit image, I didn't see any errors related to DNS.

PettitWesley · 2022-04-29T19:32:54Z

@morampudisouji Have you tried net.dns.mode UDP and net.dns.mode TCP?

morampudisouji · 2022-04-29T19:37:22Z

Thanks for the reply.
Yes, I tried both net.dns.mode TCP and net.dns.mode UDP.
With net.dns.mode UDP, logs are not sent to Elasticsearch.
With net.dns.mode TCP, logs flow to Elasticsearch, but I still see the DNS error in the Fluent Bit log.

PettitWesley · 2022-04-29T19:41:57Z

@morampudisouji So with TCP, you see that the logs are there, so it retries and DNS succeeds? Is the failure only one time on startup or something? It may be best to open a new issue for this.

morampudisouji · 2022-04-29T21:22:34Z

It's not a startup error; the errors keep popping up continuously.

AshutoshNirkhe · 2022-10-04T10:00:58Z

We are hit by the same timeout issues in 1.8.15 as @morampudisouji.
I tried adding net.dns.mode TCP. It did help a bit, but not significantly.

Logs do flow on retry, but some of them may be getting lost.
Do we have any fix for this issue?

PettitWesley mentioned this issue Sep 13, 2021: Seeing Timeout while contacting DNS servers with latest v2.19.1 aws/aws-for-fluent-bit#233 (Closed)

leonardo-albertovich linked a pull request Sep 14, 2021 that will close this issue: network: added default timeout for corner cases #4087 (Closed)

edsiper closed this as completed Sep 15, 2021

edsiper added the bug and fixed labels Sep 15, 2021

rmarchei mentioned this issue Sep 16, 2021: Timeouts after upgrading to 1.8.5 #4042 (Closed)

PettitWesley mentioned this issue Sep 18, 2021: DNS query cancelled error on v1.8.3 when sending to http-intake.logs.datadoghq.eu (works in v1.8.2) #3944 (Closed)

paynejacob mentioned this issue Sep 23, 2021: fixing dns resolution issues in fluent-bit rancher/charts#1493 (Merged)

tai-acall mentioned this issue Sep 30, 2021: (ecs): specify the Firelens container image version aws/aws-cdk#16733 (Closed)

cmurphy mentioned this issue Oct 5, 2021: revert to last known working fluentbit image rancher/charts#1527 (Merged)

stevenarvar mentioned this issue Oct 29, 2021: DNS resolution timeout/failure in 1.8.9 #4260 (Closed)

MiniCodeMonkey added a commit to Geocodio/docker-fluentbit-docker-client that referenced this issue Feb 23, 2022: fix: Downgrade to fluenbit 1.8.4 (c9a88d7). Due to fluent/fluent-bit#4050

AshutoshNirkhe mentioned this issue Oct 6, 2022: err 12 timeout while contacting dns servers fluent/helm-charts#264 (Open)

gavenkoa mentioned this issue Apr 2, 2023: New ASYNC net.dns.resolver fails with getaddrinfo(err=12): Timeout while contacting DNS servers with Elasticsearch shipper #7105 (Closed)
