DNS resolution issues still occur in 2.21.0 #253
Comments
CFN has been updated to use the LogConfiguration SecretOptions but the issue persists.
Saw a report of DNS issues with 2.20.0 and ES and CW outputs: #235 (comment)
Memory doesn't appear to be the issue based on what the graphs in Datadog are showing.
I pushed a change to all my services to increase their memory, but I haven't seen any change in memory use or in the amount of errors produced. So I don't think a memory limit or the OOM killer is the issue.
Yeah, the OOM issues are a symptom of not being able to flush logs, not the cause.
Deployed 2.21.0 to one of my services and this issue is still present. The S3 memory leak fix is a plus though, since I have many other services which are using that and have had issues with S3 timeouts.
Hi @ssplatt, I tried to reproduce the DNS issue on my side but failed. My task seems to run fine. I ran it several times and saw this error log once, but then it was retried and there were no more warnings. To be specific, here is my task definition:
I am not sure if the issue is related to the env/config setup. So could you please try running in a different/clean env to see if the DNS issue still exists? I am not sure if there is a real bug in the upstream code about it. And there is one more option you could try with DNS to resolve the issue:
Would running in us-east-1 be part of the problem? I have this error in 8 different applications, deployed across 3 different VPCs (environments), so 24 deployments in total. I guess if it's not a Datadog issue or a Fluent Bit issue, then it could be a VPC DNS issue? But I also have fluentd deployed, which we were hoping to phase out, and it does not have this problem.
Do you mean the region for my CW group? I also tried
@ssplatt My guess is that it's an issue with your VPC DNS, yes. To confirm: you tried both net.dns.mode TCP and UDP, right?
Interesting. This does sort of indicate it's a problem specific to some interaction between your DNS settings and Fluent Bit. Can you also curl/dig the Datadog endpoint in your VPC? Does your company have a support relationship with AWS? If you do, we could possibly try to use that to debug this more closely with you. On GitHub we are more limited to just trying to repro with your config in our envs or giving you suggestions of things to try.
Actually, the verbose output of dig on the Datadog endpoint in your VPC might possibly be interesting. We could compare it with ours. Unfortunately I am not an expert on these DNS things 😐
I haven't tried the TCP or UDP setting yet, I'll try that tomorrow. We do have a support relationship with AWS, so I'll open a ticket to see if they can help track down a DNS issue in our VPCs. We are also running many Datadog agent containers. Actually, each one of these applications has both a Datadog agent and a Fluent Bit sidecar. The Datadog agent handles metrics and APM. I don't believe we are having any DNS issues with the Datadog agents.
I missed that your test was in us-east-1. I meant that all of my deployments are in us-east-1.
I have only been testing on one task for a few minutes, but TCP seems to be working. UDP produced the error very quickly. Is UDP the default? I have a ticket open with AWS to figure out if there is an issue with DNS in my VPC setup.
And the nslookup using the VPC DNS:
@ssplatt The default value for this option is NULL, so you need to set it manually. Upstream has released the change for this option, but it seems they haven't updated the official docs for it yet. It selects the primary DNS connection type (TCP or UDP).
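For readers hitting the same thing, here is a minimal sketch of how that option could be set on the Datadog output in a classic Fluent Bit config; the match pattern, API key placeholder, and overall layout are assumptions for illustration, not the reporter's actual configuration:

```
[OUTPUT]
    # Hypothetical Datadog output section; Match and apikey are placeholders
    Name          datadog
    Match         *
    Host          http-intake.logs.datadoghq.com
    TLS           on
    apikey        <your-datadog-api-key>
    # Select TCP as the primary DNS connection type instead of UDP
    net.dns.mode  TCP
```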
And nslookup in the VPC works, which uses UDP. Why would Fluent Bit using UDP fail?
@zhonghui12 @ssplatt Looking at the code in ... With it set to TCP, it seems to work completely fine?
@PettitWesley I spun up a test EC2 instance, installed Docker on it, and then created a testing container based off AWSFFB with a custom configuration that is essentially just an "input forward" and an "output datadog". I haven't seen any DNS errors, even with UDP, but I have seen "failed to flush chunk". Interestingly, I'm sending logs in a loop with an order number in the message and I'm seeing them all come through in order in DD with no loss. Once I added the "Retry_Limit False" setting I see no more messages about retrying logs at all. I also see a few logs showing DNS timeouts for s3.amazonaws and our internal fluentd server, but only 1 or 2 over the past week. So I think these may just be transient errors exacerbated by the sheer number of logs we are sending, the fragility of UDP, and the default retry limit of 2. I'll go ahead and close this and reopen if I run into more frequent errors or complete loss of logging.
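For context, a rough sketch of what that minimal test config (forward input plus Datadog output with unlimited retries) might look like; the listen address, port, match pattern, and API key are placeholder assumptions rather than the reporter's actual files:

```
[INPUT]
    # Forward input for the test container's log traffic
    Name    forward
    Listen  0.0.0.0
    Port    24224

[OUTPUT]
    # Datadog output; apikey is a placeholder
    Name         datadog
    Match        *
    Host         http-intake.logs.datadoghq.com
    TLS          on
    apikey       <your-datadog-api-key>
    # Retry failed flushes indefinitely instead of the default limited retry count
    Retry_Limit  False
```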
FWIW, we're still seeing DNS resolution errors a few times per hour with
It might be a good idea to have an automatic DNS retry in Fluent Bit core if more people experience this problem. This could be implemented with a for loop around the
Giving a 2022 update: still seeing that error coming out, with the destination being Firehose. However, data does make it to Firehose, so I'm unsure what's happening there.
Copy of issue from upstream fluent/fluent-bit#4157
Fluent Bit 1.8.7, AWS for Fluent Bit 2.20.0. Running in AWS Fargate using FireLens.
Datadog output does deliver some logs but drops many.
Many messages per minute like
[2021/10/04 13:56:07] [ warn] [engine] failed to flush chunk '1-1633355763.278469126.flb', retry in 10 seconds: task_id=0, input=forward.1 > output=datadog.1 (out_id=1)
followed by
[2021/10/04 13:56:17] [ warn] [engine] chunk '1-1633355763.278469126.flb' cannot be retried: task_id=0, input=forward.1 > output=datadog.1
All of our services running fb 1.8.7/awsffb 2.20.0 are throwing this same error, in multiple environments.

CloudFormation example:
I do have a lot of occurrences of
[2021/10/04 18:31:01] [ warn] [net] getaddrinfo(host='http-intake.logs.datadoghq.com', err=12): Timeout while contacting DNS servers
as well.