DNS resolution timeout/failure in >= 1.8.5 #4050
Comments
I guess it's the same issue I hit after upgrading to 1.8.5 |
I am also seeing this and previously reported against aws/aws-for-fluent-bit#233 when they bumped their version from fluent-bit 1.8.3 -> 1.8.6. I'll add to this report that it is definitely not a host networking issue as |
@edsiper, many of our customers are facing this issue and they need to downgrade to 1.8.3 for now. Could you please take a look at this issue? Thanks. |
We are seeing this in our environment as well. As others have mentioned, downgrading to 1.8.4 fixes the problem. |
How do you install a specific version? |
Close, |
We have multiple different reports of this bug in the AWS distro repo: aws/aws-for-fluent-bit#233 We're working on repro'ing and some preliminary investigation. |
Hey folks, here are the results of my repro attempts. I was able to confirm this issue report, at least for the datadog output.
Base Config
I tested 4 different outputs:
Testing Env
Just my Mac on my local home network, running Fluent Bit in Docker.
Versions Tested
This confirms the issue is in 1.8.5+, at least for the datadog output. For AWS for Fluent Bit customers, see our release page to map Fluent Bit versions to our versions: https://github.com/aws/aws-for-fluent-bit/releases
✅ == DNS resolution works.
|
Thanks @PettitWesley, since you've mentioned some differences in outputs I'll add that we're using the |
@edsiper We have now also seen one instance of this error with
It still remains most easily reproducible in datadog for me. |
@PettitWesley In the end I traced the issue to plugins that do not implement the config_map interface, which left their net_setup uninitialized. The issue was not exposed earlier because create_conn handled a timeout of zero by marking the connection so that the timeout handler ignored it. The async DNS client did not apply that logic, which is why the problem started showing up. There are a few more plugins (such as gelf and bigquery) that do not implement the interface, but kinesis_firehose does seem to implement it, so I think we'll have to talk about that one a bit more to determine whether it's the same issue. |
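The mechanism described above can be modelled in a few lines. This is only an illustrative sketch, not Fluent Bit's code: `NetSetup`, `is_expired`, and the 10-second default are hypothetical names and values chosen for the example. It shows why a zero timeout is harmless when the caller treats zero as "disabled" and fatal when it does not.

```python
from dataclasses import dataclass


@dataclass
class NetSetup:
    # Hypothetical stand-in for a plugin's network settings (not the real struct).
    connect_timeout: int = 10  # seconds; this default is an assumption


def is_expired(started_at: float, now: float, timeout: int,
               zero_means_disabled: bool) -> bool:
    """Return True when a pending operation should be aborted as timed out."""
    if timeout == 0 and zero_means_disabled:
        # Connection path: a zero timeout is ignored by the timeout handler.
        return False
    # Path without the special case: a zero timeout expires immediately.
    return now - started_at >= timeout


# A plugin without a config map ends up with a zero (uninitialized) timeout.
uninitialized = NetSetup(connect_timeout=0)
# With defaults always applied, the timeout is non-zero.
initialized = NetSetup()

connect_expired = is_expired(0.0, 0.05, uninitialized.connect_timeout, True)
dns_expired = is_expired(0.0, 0.05, uninitialized.connect_timeout, False)
dns_expired_after_fix = is_expired(0.0, 0.05, initialized.connect_timeout, False)
```

In this model the connection path survives the zero timeout, the DNS path times out after 50 ms, and applying defaults makes the DNS path behave again.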
Thanks Leonardo. A few questions:
|
For plugins that do not implement a config map interface, the networking setup was missing, leading to connect_timeout=0, no keep alive, plus others. In this patch we always initialize the plugin instance network defaults, but this becomes a fixed value that cannot be changed through the configuration. The long term solution is to migrate plugins to use config maps. Signed-off-by: Eduardo Silva <[email protected]>
Fixed in #4088 |
We are still seeing this issue in 1.8.7 rancher/rancher#34772 |
Maybe you hit this? In that case it should be fixed in the next release. |
That looks like it could be our issue, we will test with the next release. |
We are still seeing this DNS issue for >= 1.8.5 on Windows. Is there a similar fix also needed for fluent-bit-windows? Or is there any workaround? |
@liyanhui1228 It's fixed in 1.8.9. |
I tried 1.8.9 as well and encountered the same error though. |
@liyanhui1228 Do you see it repeatedly or sporadically? Also make sure you test the The DNS error handling could maybe be improved to reduce sporadic errors, or we need some sort of auto-retry: #4257 |
Thanks @PettitWesley ! Setting the |
That setting is not available in 1.8.3 because the asynchronous DNS client was introduced in 1.8.5, however, there were some issues that were polished in the subsequent versions which is why I suggested updating to 1.8.9. Would you be able to test if you are able to issue DNS lookups through UDP where you are running fluent-bit? It would be odd for you not to be able to since that's actually the mainstream way of doing it but I'm asking because you mentioned that switching to TCP helped. Would you be able to share the configuration you're using and context required to reproduce the issue? |
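One way to check whether lookups work over both transports from the host running fluent-bit is a small standalone script. The sketch below is an assumption-laden example, not part of Fluent Bit: the function names are made up for illustration, and it builds a bare A-record query by hand following the RFC 1035 wire format (DNS over TCP prefixes each message with a two-byte length).

```python
import socket
import struct


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes from a stream socket."""
    buf = b""
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("server closed the connection early")
        buf += chunk
    return buf


def lookup_udp(server: str, hostname: str, timeout: float = 5.0) -> bytes:
    """Send the query over UDP and return the raw response."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname), (server, 53))
        data, _ = sock.recvfrom(4096)
        return data


def lookup_tcp(server: str, hostname: str, timeout: float = 5.0) -> bytes:
    """Send the query over TCP (length-prefixed) and return the raw response."""
    query = build_query(hostname)
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(query)) + query)
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        return recv_exact(sock, length)
```

Calling `lookup_udp(resolver, "fluentd.example.org")` and `lookup_tcp(resolver, "fluentd.example.org")` with the address of the local resolver should either return bytes or raise a timeout, which shows which transport is failing. The resolver address depends on the environment and is not given in this thread.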
I don't think I mentioned this in this thread, but TCP did not fix this 100% for me in my environment. It did decrease it significantly. The main thing I needed to configure was the retry_limit, which defaults to 2 and is far too low in these cases. In testing I found that 4 was the maximum number of attempts a chunk needed to succeed, so I landed on a setting of 10 to be safe. In cases where I absolutely need the logs to go through I set 0, but I also found that this disables the log output showing that chunks are retrying. |
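A rough way to see why a higher retry limit helps with sporadic DNS failures is the arithmetic below. It is a back-of-the-envelope sketch under two assumptions that are not from this thread: each attempt fails independently with the same probability, and a retry limit of N means one initial attempt plus up to N retries. `drop_probability` is a hypothetical helper, not a Fluent Bit function.

```python
def drop_probability(failure_rate: float, retry_limit: int) -> float:
    """Probability that the first attempt and every retry all fail.

    Assumes independent attempts with a constant per-attempt failure rate.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if retry_limit < 0:
        raise ValueError("retry_limit must not be negative")
    # One initial attempt plus retry_limit retries.
    return failure_rate ** (retry_limit + 1)


# Example: compare a low and a high retry limit at a 50% failure rate.
low_limit = drop_probability(0.5, 2)
high_limit = drop_probability(0.5, 10)
```

Real DNS failures tend to come in bursts, so independence is optimistic, but the model still shows the drop rate shrinking geometrically as the limit grows.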
With the 1.8.15 Fluent Bit image, we are seeing DNS errors with the es output plugin. Below is a sample log from the Fluent Bit pod.
When I tried the 1.8.4 Fluent Bit image, I didn't see any errors related to DNS. |
@morampudisouji Have you tried |
Thanks for reply, |
@morampudisouji So with TCP, you see that the logs are there, so it retries and DNS succeeds? Is the failure only one time on startup or something? It may be best to open a new issue for this. |
It's not a startup error; the errors keep popping up continuously. |
We are hitting the same timeout issues in 1.8.15 as @morampudisouji. Logs do flow on retry, but some of them may be getting lost. |
Bug Report
Describe the bug
Hi, I am facing a DNS resolution timeout/failure since upgrading to >= 1.8.5 with the forward module to a fluentd instance. It is working fine with 1.8.4. I am running on Ubuntu 20.04 and the local resolver accepts UDP and TCP requests. I tried to set net.dns.mode UDP but it changes nothing. I am guessing there might be an issue with 1.8.5 and the changes to the DNS resolution library. I still have the same error when setting the upstream to www.google.com.
To Reproduce
I have replaced the real fluentd hostname with fluentd.example.org in this log
Expected behavior
Messages should be sent to the upstream fluentd service.
Your Environment