[flb_network] DNS auto retry (Timeout while contacting DNS servers) 🌐 #4257

Closed
matthewfala opened this issue Oct 29, 2021 · 6 comments

@matthewfala
Contributor

Is your feature request related to a problem? Please describe.

Some Fluent Bit users are occasionally experiencing DNS timeout errors that lead to log loss. The current workaround is to set retry_limit to a value above 2, as proposed by @ssplatt and @PettitWesley. However, this may be too broad a solution for addressing transient DNS errors. See aws/aws-for-fluent-bit#253 (comment)

Describe the solution you'd like

A cleaner solution than having a DNS failure fail the entire request would be to add a retry on DNS timeout directly to the DNS lookup method.

This could be implemented with a for loop around the ares_getaddrinfo() call and its timer block here:

ares_getaddrinfo(lookup_context->ares_channel, node, service, &ares_hints,

The retry condition could be result_code == ARES_ETIMEOUT (from here), which corresponds to the "Timeout while contacting DNS servers" error message here.
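A minimal sketch of the idea, written as a standalone, blocking c-ares loop rather than Fluent Bit's event-loop-driven lookup; the retry cap and helper names are illustrative assumptions, not existing Fluent Bit code:

```c
#include <ares.h>
#include <sys/select.h>
#include <stddef.h>

/* Illustrative retry cap; not an existing Fluent Bit setting. */
#define DNS_LOOKUP_RETRIES 3

struct lookup_result {
    int status;                 /* ARES_SUCCESS, ARES_ETIMEOUT, ... */
    struct ares_addrinfo *res;  /* filled in on success */
};

/* c-ares reports the result code to this callback, not as a return value. */
static void on_addrinfo(void *arg, int status, int timeouts,
                        struct ares_addrinfo *res)
{
    struct lookup_result *lr = arg;
    (void) timeouts;
    lr->status = status;
    lr->res = res;
}

/* Drive the channel until no queries are pending (simplified, blocking). */
static void process_channel(ares_channel channel)
{
    int nfds;
    fd_set readers, writers;
    struct timeval tv, *tvp;

    for (;;) {
        FD_ZERO(&readers);
        FD_ZERO(&writers);
        nfds = ares_fds(channel, &readers, &writers);
        if (nfds == 0) {
            break;
        }
        tvp = ares_timeout(channel, NULL, &tv);
        select(nfds, &readers, &writers, NULL, tvp);
        ares_process(channel, &readers, &writers);
    }
}

/* Retry the lookup only when the failure was a DNS timeout. */
static int getaddrinfo_with_retry(ares_channel channel, const char *node,
                                  const char *service,
                                  const struct ares_addrinfo_hints *hints,
                                  struct ares_addrinfo **out)
{
    struct lookup_result lr;
    int attempt;

    for (attempt = 0; attempt < DNS_LOOKUP_RETRIES; attempt++) {
        lr.status = ARES_SUCCESS;
        lr.res = NULL;
        ares_getaddrinfo(channel, node, service, hints, on_addrinfo, &lr);
        process_channel(channel);
        if (lr.status != ARES_ETIMEOUT) {
            break;  /* success, or a non-transient error worth surfacing */
        }
    }

    *out = lr.res;
    return lr.status;
}
```

In Fluent Bit itself the wait would presumably be handled by the existing event loop and lookup timer, so only the retry-on-ARES_ETIMEOUT loop would need to be added around the existing call.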

Describe alternatives you've considered

This issue could also be mitigated with:

  1. DNS caching
  • Very difficult to do and not recommended by developers online.
  2. Configurable DNS retry
  • Asking people to specify a DNS retry limit may be overkill, since most people would probably want DNS to be retried anyway (see the sketch after this list).
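If option 2 were pursued anyway, the retry count could ship with a sensible default so that most users never need to touch it. The struct, field, and default below are hypothetical, not existing Fluent Bit configuration:

```c
/* Hypothetical settings; dns_retries is not an existing Fluent Bit option. */
struct net_setup {
    int connect_timeout;   /* seconds; Fluent Bit's default is 10 */
    int dns_retries;       /* 0 means "not set by the user" */
};

#define DEFAULT_DNS_RETRIES 3   /* illustrative default */

/* Retry by default; only honor an explicit override when one is given. */
static int net_dns_retries(const struct net_setup *net)
{
    return net->dns_retries > 0 ? net->dns_retries : DEFAULT_DNS_RETRIES;
}
```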

Additional context

@matthewfala
Contributor Author

It appears the DNS timeout was introduced as a fairly high-level design decision (not part of the Unix API response) 2 months ago in this commit.
Here is the line that sets the timeout error, which is invoked when the DNS lookup timer times out.

One design flaw may be that the timer times out based on connect_timeout (here). That means that if someone sets the flb_upstream u->net.connect_timeout to something unreasonably small that previously worked for their connection timeout use case but is, in some cases, too short for the DNS lookup, the DNS resolution might be prone to failure. I'm thinking the DNS lookup timeout may need to be separate from connect_timeout, with its own default value for backwards compatibility.
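A minimal sketch of that separation, assuming a hypothetical dns_lookup_timeout field alongside the existing connect_timeout; the names and default below are illustrative, not actual Fluent Bit code:

```c
/* Hypothetical: arm the DNS lookup timer independently of connect_timeout. */
#define DEFAULT_DNS_LOOKUP_TIMEOUT 10   /* seconds; illustrative default kept
                                         * equal to today's connect_timeout
                                         * default for backwards compatibility */

struct upstream_net {
    int connect_timeout;      /* existing u->net.connect_timeout */
    int dns_lookup_timeout;   /* hypothetical; 0 means "not set" */
};

/* Timeout used when creating the DNS lookup timer. */
static int dns_lookup_timeout(const struct upstream_net *net)
{
    if (net->dns_lookup_timeout > 0) {
        return net->dns_lookup_timeout;
    }
    /* Fall back to a DNS-specific default rather than connect_timeout, so a
     * small connect_timeout can no longer starve the DNS lookup. */
    return DEFAULT_DNS_LOOKUP_TIMEOUT;
}
```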

Actually, it doesn't seem like anyone sets connect_timeout other than the AWS credential provider services (which don't use DNS?). The default value is 10s, which seems reasonably high. I'm not sure what the DNS lookup behavior was before this commit, when a DNS call took more than 10 seconds.

@matthewfala matthewfala changed the title [flb_network] DNS auto retry 🌐 [flb_network] DNS auto retry 🌐 (Timeout while contacting DNS servers) Oct 29, 2021
@matthewfala matthewfala changed the title [flb_network] DNS auto retry 🌐 (Timeout while contacting DNS servers) [flb_network] DNS auto retry (Timeout while contacting DNS servers) 🌐 Oct 29, 2021
@PettitWesley
Contributor

other than the AWS credential provider services (which don't use DNS?).

The STS and EKS providers do use DNS since they have to find the STS service.

@matthewfala
Contributor Author

I see. The STS connect timeout, which is set to FLB_AWS_CREDENTIAL_NET_TIMEOUT, is also a generous value (5 seconds). It's not apparent that this would be a problem.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@matthewfala
Contributor Author

It seems like most DNS problems are a byproduct of other Fluent Bit issues and not really DNS resolution failures. We've been seeing fewer DNS-related problems lately, so I can close this.
