DNS resolution issues still occur in 2.21.0 #253
Comments
CFN has been updated to use the LogConfiguration SecretOptions but the issue persists.
Saw a report of DNS issues with 2.20.0 and ES and CW outputs: #235 (comment)
Memory doesn't appear to be the issue based on what the graphs in Datadog are showing.
I pushed a change to all my services to increase their memory, but I haven't seen any change in memory use or in the amount of errors produced. So I don't think a memory limit or the OOM killer is the issue.
Yeah, the OOM issues are a symptom of not being able to flush logs, not the cause.
Deployed 2.21.0 to one of my services and this issue is still present. The S3 memory leak fix is a plus though, since I have many other services which are using that and have had issues with S3 timeouts.
Hi @ssplatt, I tried to reproduce the DNS issue on my side but failed. My task seems to run fine. I ran it several times and saw this error log once, but then it was retried and there were no more warnings. To be specific, here is my task definition:
I am not sure if the issue is related to the env/config setup. So could you please try running in a different/clean env to see if the DNS issue still exists? I am not sure if there is a real bug in the upstream code about it. And there is one more option you could try with DNS to resolve the issue:
Would running in us-east-1 be part of the problem? I have this error in 8 different applications, deployed across 3 different VPCs (environments), so 24 deployments in total. I guess if it's not a Datadog issue or a Fluent Bit issue, then it could be a VPC DNS issue? But I also have fluentd deployed, which we were hoping to phase out, and it does not have this problem.
Do you mean the region for my CW group? I also tried
@ssplatt My guess is that it's an issue with your VPC DNS, yes. To confirm: you tried both net.dns.mode TCP and UDP, right?
Interesting. This does sort of indicate it's a problem specific to some interaction between your DNS settings and Fluent Bit. Can you also curl/dig the Datadog endpoint in your VPC? Does your company have a support relationship with AWS? If you do, we could possibly try to use that to debug this more closely with you. On GitHub we are more limited to just trying to repro with your config in our envs or giving you suggestions of things to try.
Actually, the verbose output of dig on the Datadog endpoint in your VPC might possibly be interesting. We could compare it with ours. Unfortunately I am not an expert on these DNS things 😐
I haven't tried the TCP or UDP setting yet, I'll try that tomorrow. We do have a support relationship with AWS, so I'll open a ticket to see if they can help track down a DNS issue in our VPCs. We are also running many Datadog agent containers. Actually, each one of these applications has both a Datadog agent and a Fluent Bit sidecar. The Datadog agent handles metrics and APM. I don't believe we are having any DNS issues with the Datadog agents.
I missed that your test was in us-east-1. I meant that all of my deployments are in us-east-1.
I have only been testing on one task for a few minutes, but TCP seems to be working. UDP produced the error very quickly. Is UDP the default? I have a ticket open with AWS to figure out if there is an issue with DNS in my VPC setup.
And the nslookup using the VPC DNS:
@ssplatt The default value for this option is NULL, so you need to set it manually. Upstream has released the change for this option, but it seems they haven't updated the official docs for it yet. It selects the primary DNS connection type (TCP or UDP).
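For readers hitting the same thing, here is a minimal sketch of how that option could be set on the Datadog output in a classic Fluent Bit config; the match pattern, API key placeholder, and overall layout are assumptions for illustration, not the reporter's actual configuration:

```
[OUTPUT]
    # Hypothetical Datadog output section; Match and apikey are placeholders
    Name          datadog
    Match         *
    Host          http-intake.logs.datadoghq.com
    TLS           on
    apikey        <your-datadog-api-key>
    # Select TCP as the primary DNS connection type instead of UDP
    net.dns.mode  TCP
```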
And nslookup in the VPC works, which uses UDP. Why would Fluent Bit using UDP fail?
@zhonghui12 @ssplatt Looking at the code in ... With it set to TCP, it seems to work completely fine?
@PettitWesley I spun up a test EC2 instance, installed Docker on it, and then created a testing container based off AWSFFB with a custom configuration that is essentially just an "input forward" and an "output datadog". I haven't seen any DNS errors, even with UDP, but I have seen "failed to flush chunk". Interestingly, I'm sending logs in a loop with an order number in the message and I'm seeing them all come through in order in DD with no loss. Once I added the "Retry_Limit False" setting I see no more messages about retrying logs at all. I also see a few logs showing DNS timeouts for s3.amazonaws and our internal fluentd server, but only 1 or 2 over the past week. So I think these may just be transient errors exacerbated by the sheer number of logs we are sending, the fragility of UDP, and the default retry limit of 2. I'll go ahead and close this and reopen if I run into more frequent errors or complete loss of logging.
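For context, a rough sketch of what that minimal test config (forward input plus Datadog output with unlimited retries) might look like; the listen address, port, match pattern, and API key are placeholder assumptions rather than the reporter's actual files:

```
[INPUT]
    # Forward input for the test container's log traffic
    Name    forward
    Listen  0.0.0.0
    Port    24224

[OUTPUT]
    # Datadog output; apikey is a placeholder
    Name         datadog
    Match        *
    Host         http-intake.logs.datadoghq.com
    TLS          on
    apikey       <your-datadog-api-key>
    # Retry failed flushes indefinitely instead of the default limited retry count
    Retry_Limit  False
```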
FWIW, we're still seeing DNS resolution errors a few times per hour with
It might be a good idea to have an automatic DNS retry in Fluent Bit core if more people experience this problem. This could be implemented with a for loop around the
Giving a 2022 update: still seeing that error coming out, with the destination being Firehose. However, data does make it to Firehose, so I'm unsure what's happening there.
Copy of issue from upstream fluent/fluent-bit#4157
Fluent Bit 1.8.7, AWS for Fluent Bit 2.20.0. Running in AWS Fargate using FireLens.
Datadog output does deliver some logs but drops many.
Many messages per minute like
[2021/10/04 13:56:07] [ warn] [engine] failed to flush chunk '1-1633355763.278469126.flb', retry in 10 seconds: task_id=0, input=forward.1 > output=datadog.1 (out_id=1)
followed by
[2021/10/04 13:56:17] [ warn] [engine] chunk '1-1633355763.278469126.flb' cannot be retried: task_id=0, input=forward.1 > output=datadog.1
All of our services running fb 1.8.7/awsffb 2.20.0 are throwing this same error, in multiple environments.

CloudFormation example:
I do have a lot of occurrences of
[2021/10/04 18:31:01] [ warn] [net] getaddrinfo(host='http-intake.logs.datadoghq.com', err=12): Timeout while contacting DNS servers
as well.