Sync DNS getaddrinfo() potentially hangs indefinitely #6140
Comments
@matthewfala I think you need to add a link and sentence introducing this testing tool shown in your first config snippet ;)
@edsiper we are seeing multiple customers experiencing this issue when they send logs to CloudWatch. Any suggestions?
This issue was closed because it has been stalled for 5 days with no activity.
This problem mainly affected the CloudWatch_Logs C plugin because it used the sync network stack. The async network stack is not affected. We worked around the sync network stack's hangs by migrating CloudWatch_Logs to the async network stack. Please see: #6339. Closing for now. If you are using the sync network stack and are experiencing hangs, please feel free to reopen this issue.
Bug Report
One contact has reported that Fluent Bit stops sending logs to CloudWatch after 48+ hours. Another contact reported that their Fluent Bit instance hangs after 2 hours. Logs continue to be ingested over TCP, but are no longer sent out by the CloudWatch output.
The CloudWatch and S3 plugins use sync networking. Sync networking does not use the c-ares library for DNS; it calls getaddrinfo() instead. We believe the pause is caused by getaddrinfo() hanging.
To replicate the issue, we sent logs to 4 different TCP inputs, each routed to a separate CloudWatch stream. Each stream received 2,000 200-byte log records per second, using the following application load test config
The following file was used but is most likely not important.
Running Fluent Bit in a debugger, we eventually found that it stops sending logs to CloudWatch when it hangs inside the getaddrinfo() call at:
fluent-bit/src/flb_network.c
Line 1226 in 02447c8
It's not clear whether this is how the customer's CloudWatch output is hanging, but in our replication attempts, this is how we get CloudWatch to hang.
We were only able to reproduce this issue when net.keepalive for the CloudWatch output was set to off.
More info
According to skarnet.org, https://skarnet.org/software/s6-dns/getaddrinfo.html, getaddrinfo() has no good way of applying timeouts and should not be used in practice.
Suggestions
Apply connect_timeout to the sync DNS lookup as well, so that a stalled getaddrinfo() call cannot freeze the output.
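To illustrate the idea of bounding a blocking lookup, here is a minimal Python sketch, not Fluent Bit code. Python's socket.getaddrinfo wraps the same blocking call and likewise takes no timeout, so the sketch runs it on a worker thread and stops waiting after a deadline. The names resolve_with_timeout and _RESOLVER_POOL are hypothetical, and the `resolver` parameter exists only so the behavior can be exercised without a real DNS server.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical shared pool: a stuck lookup occupies one worker thread
# instead of blocking the caller.
_RESOLVER_POOL = ThreadPoolExecutor(max_workers=4, thread_name_prefix="dns")


def resolve_with_timeout(host, port, timeout_s, resolver=socket.getaddrinfo):
    """Return resolver(host, port), or raise TimeoutError after timeout_s.

    `resolver` defaults to the blocking socket.getaddrinfo, which has no
    timeout parameter of its own.
    """
    future = _RESOLVER_POOL.submit(resolver, host, port)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker cannot be interrupted; it keeps running until the
        # underlying call returns. Only the caller is unblocked.
        raise TimeoutError(
            f"DNS lookup for {host!r} exceeded {timeout_s}s"
        ) from None
```

The trade-off in this approach is that the stalled lookup is abandoned rather than cancelled, so a resolver that never returns still ties up a worker thread. That is one reason an asynchronous resolver is usually the more robust fix.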
Other problems
Currently, io_timeout, which bounds how long we wait for network responses, is set to infinite:
fluent-bit/src/flb_network.c
Line 112 in 6c117f4
This may be a problem as well. If the CloudWatch server never responds to a request, the CloudWatch plugin will wait forever. This seems like a big problem.
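As a rough illustration of what a finite I/O timeout changes, the following Python sketch (again, not Fluent Bit code) bounds a blocking recv. The helper name recv_with_timeout is hypothetical. In C, a comparable effect is typically obtained with the SO_RCVTIMEO socket option or by waiting in poll() with a timeout before reading.

```python
import socket


def recv_with_timeout(sock, nbytes, timeout_s):
    """Blocking recv that gives up after timeout_s seconds.

    Returns the received bytes, or None if nothing arrived in time.
    """
    previous = sock.gettimeout()
    sock.settimeout(timeout_s)  # a timeout of None would mean "wait forever"
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None
    finally:
        sock.settimeout(previous)  # restore the caller's setting
```

With a bound like this, a silent peer turns into an error that the caller can retry or report, instead of an indefinite pause.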
Screenshots
Three CloudWatch outputs are waiting on a lock. The fourth CloudWatch output holds the lock and is itself hanging in a getaddrinfo() call.
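The stall pattern described here can be reproduced in miniature. The Python sketch below is a simulation under assumed names (dns_lock, stuck_holder, try_flush, simulate), not the actual Fluent Bit locking code. One thread holds a shared lock while its simulated lookup never returns, and three other threads wait for that lock. The waiters use a bounded acquire, so they can at least observe that they are stuck.

```python
import threading
import time

dns_lock = threading.Lock()  # hypothetical stand-in for the shared lock


def stuck_holder(release):
    # Holds the lock for as long as the simulated lookup is "hanging".
    with dns_lock:
        release.wait()


def try_flush(results, name, wait_s):
    # A waiter that bounds its wait instead of blocking forever.
    acquired = dns_lock.acquire(timeout=wait_s)
    results[name] = acquired
    if acquired:
        dns_lock.release()


def simulate(waiters=3, wait_s=0.1):
    release = threading.Event()
    holder = threading.Thread(target=stuck_holder, args=(release,))
    holder.start()
    while not dns_lock.locked():  # wait until the holder owns the lock
        time.sleep(0.001)

    results = {}
    threads = [
        threading.Thread(target=try_flush, args=(results, f"cloudwatch.{i}", wait_s))
        for i in range(waiters)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    release.set()  # let the simulated lookup finish
    holder.join()
    return results
```

Calling simulate() returns a dict that maps each waiter name to whether it obtained the lock. While the holder is stuck, every waiter reports False, which mirrors how a single blocked lookup can stall every output that shares the lock.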
Your Environment
Additional context
Potentially related issues
#4606