Sync DNS getaddrinfo() potentially hangs indefinitely #6140

Closed
matthewfala opened this issue Oct 5, 2022 · 5 comments

Comments

@matthewfala
Contributor

matthewfala commented Oct 5, 2022

Bug Report

We have a contact who reported that Fluent Bit stops sending logs to CloudWatch after 48+ hours of operation. Another contact reported that their Fluent Bit instance hangs after 2 hours. Logs continue to be ingested via the TCP input, but the CloudWatch output stops sending them.

The CloudWatch and S3 plugins use sync networking. The sync networking stack does not use the c-ares library for DNS resolution, but instead calls the blocking getaddrinfo(). We believe the pause is caused by getaddrinfo() hanging.

To replicate the issue, we sent logs to 4 different TCP inputs, each routed to a separate CloudWatch stream. 2,000 200-byte logs were sent per stream per second using the following load-test configuration:

{
    "component": "synchronizer",
    "config": {
        "repeat": 1,
        "waitBefore": 0.5,
        "waitAfter": 10,
        "waitBetween": 0.01,
        "isAsync": true
    },
    "children": [
        {
            "generator": {
                "name": "basic",
                "config": {
                    "contentLength": 200,
                    "batchSize": 2000,
                    "key": "log"
                }
            },
            "datajet": {
                "name": "tcp",
                "config": {
                    "host": "0.0.0.0",
                    "port": 5270
                }
                },
            "stage": {
                "batchRate": 1,
                "maxBatches": 1000000
            }
        },
        {
            "generator": {
                "name": "basic",
                "config": {
                    "contentLength": 200,
                    "batchSize": 2000,
                    "key": "log"
                }
            },
            "datajet": {
                "name": "tcp",
                "config": {
                    "host": "0.0.0.0",
                    "port": 5271,
                    "key": "log"
                }
                },
            "stage": {
                "batchRate": 1,
                "maxBatches": 1000000
            }
        },
        {
            "generator": {
                "name": "basic",
                "config": {
                    "contentLength": 200,
                    "batchSize": 2000,
                    "key": "log"
                }
            },
            "datajet": {
                "name": "tcp",
                "config": {
                    "host": "0.0.0.0",
                    "port": 5272
                }
                },
            "stage": {
                "batchRate": 1,
                "maxBatches": 1000000
            }
        },
        {
            "generator": {
                "name": "basic",
                "config": {
                    "contentLength": 200,
                    "batchSize": 2000,
                    "key": "log"
                }
            },
            "datajet": {
                "name": "tcp",
                "config": {
                    "host": "0.0.0.0",
                    "port": 5273
                }
                },
            "stage": {
                "batchRate": 1,
                "maxBatches": 1000000
            }
        }
    ]
}
[SERVICE]
     # See:
     # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_firelens.html
     # https://github.com/aws-samples/amazon-ecs-firelens-under-the-hood/tree/master/generated-configs/fluent-bit
     Flush        10
     Grace        30
     Log_Level    info
     Parsers_File ./fluent-parser.conf
[INPUT]
     Name        tcp
     Tag         TCP_A
     Listen      0.0.0.0
     Port        5270
     Format      json
[INPUT]
     Name        tcp
     Tag         TCP_B
     Listen      0.0.0.0
     Port        5271
     Format      json
[INPUT]
     Name        tcp
     Tag         TCP_C
     Listen      0.0.0.0
     Port        5272
     Format      json
[INPUT]
     Name        tcp
     Tag         TCP_D
     Listen      0.0.0.0
     Port        5273
     Format      json
[OUTPUT]
     Name cloudwatch_logs
     Match TCP_A
     log_stream_prefix TCP_A
     log_group_name cloudwatch_freeze
     auto_create_group true
     region us-west-2
     log_key log
     net.keepalive      off
     workers 1
[OUTPUT]
     Name cloudwatch_logs
     Match TCP_B
     log_stream_prefix TCP_B
     log_group_name cloudwatch_freeze
     auto_create_group true
     region us-west-2
     log_key log
     net.keepalive      off
     workers 1
[OUTPUT]
     Name cloudwatch_logs
     Match TCP_C
     log_stream_prefix TCP_C
     log_group_name cloudwatch_freeze
     auto_create_group true
     region us-west-2
     log_key log
     net.keepalive      off
     workers 1
[OUTPUT]
     Name cloudwatch_logs
     Match TCP_D
     log_stream_prefix TCP_D
     log_group_name cloudwatch_freeze
     auto_create_group true
     region us-west-2
     log_key log
     net.keepalive      off
     workers 1

The following parsers file was also used, but is most likely not relevant to the issue:

# Parsers *must* be in a file separate from the main configuration
# See https://fluentbit.io/documentation/0.13/parser/

[PARSER]
    # A Parser to capture the beginning of a QueryLog multi-line entry.
    Name   QueryLogSeparator
    Format regex
    Regex  (?<log>-{20,})

[PARSER]
    # A Parser to capture QueryLog statements in their entirety.
    Name   QueryLog
    Format regex
    Regex  (?<start>-{20,})(?<content>[\S\s]+?EOE)

Running Fluent Bit in a debugger, we eventually found that it stops sending logs to CloudWatch when it hangs on the getaddrinfo() call:

ret = getaddrinfo(host, _port, &hints, &res);

It's not clear whether this is exactly how the customers' CloudWatch outputs hang, but in our replication attempts this is how we got the CloudWatch output to hang.

We were only able to reproduce this issue when net.keepalive was set to off for the CloudWatch outputs.
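This is consistent with the failure mode: with keepalive disabled, every flush opens a fresh connection and therefore performs a fresh DNS lookup, multiplying exposure to a hung getaddrinfo(). A possible mitigation while the root cause stands (net.keepalive and net.keepalive_idle_timeout are standard Fluent Bit net.* options; the value of 30 seconds here is an arbitrary example) is to leave keepalive on:

```
[OUTPUT]
     Name cloudwatch_logs
     Match TCP_A
     log_stream_prefix TCP_A
     log_group_name cloudwatch_freeze
     auto_create_group true
     region us-west-2
     log_key log
     net.keepalive              on
     net.keepalive_idle_timeout 30
     workers 1
```

This does not remove the hang; it only makes the blocking DNS call far less frequent.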

More info

According to skarnet.org (https://skarnet.org/software/s6-dns/getaddrinfo.html), getaddrinfo() offers no good way to set a timeout and, the author argues, should not be used in practice.
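Because getaddrinfo() itself cannot be interrupted, one commonly cited workaround is to run it on a detached helper thread and wait on a condition variable with pthread_cond_timedwait(); on timeout, the lookup is abandoned and the helper thread cleans up after itself when it finally returns. This is only a sketch of that pattern, not Fluent Bit's code; the names (getaddrinfo_timeout, struct lookup) are illustrative:

```c
/* Sketch: bounding getaddrinfo() with a timeout by running it on a
 * detached helper thread. Illustrative only, not Fluent Bit's code. */
#include <netdb.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct lookup {
    char *host, *port;
    struct addrinfo *res;
    int ret;
    int done;       /* resolver finished */
    int abandoned;  /* caller timed out and gave up */
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

static void *resolver(void *arg)
{
    struct lookup *lk = arg;
    struct addrinfo hints, *res = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    int ret = getaddrinfo(lk->host, lk->port, &hints, &res);

    pthread_mutex_lock(&lk->lock);
    if (lk->abandoned) {
        /* Caller gave up long ago: clean everything up ourselves. */
        pthread_mutex_unlock(&lk->lock);
        if (ret == 0) freeaddrinfo(res);
        free(lk->host); free(lk->port); free(lk);
        return NULL;
    }
    lk->ret = ret;
    lk->res = res;
    lk->done = 1;
    pthread_cond_signal(&lk->cond);
    pthread_mutex_unlock(&lk->lock);
    return NULL;
}

/* Returns the getaddrinfo() result, or EAI_AGAIN on timeout/error. */
int getaddrinfo_timeout(const char *host, const char *port,
                        struct addrinfo **res, int timeout_sec)
{
    struct lookup *lk = calloc(1, sizeof(*lk));
    struct timespec deadline;
    pthread_t tid;

    lk->host = strdup(host);
    lk->port = strdup(port);
    pthread_mutex_init(&lk->lock, NULL);
    pthread_cond_init(&lk->cond, NULL);

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    if (pthread_create(&tid, NULL, resolver, lk) != 0) {
        free(lk->host); free(lk->port); free(lk);
        return EAI_AGAIN;
    }
    pthread_detach(tid);

    pthread_mutex_lock(&lk->lock);
    int rc = 0;
    while (!lk->done && rc == 0) {
        rc = pthread_cond_timedwait(&lk->cond, &lk->lock, &deadline);
    }
    if (!lk->done) {        /* timed out: abandon the stuck lookup */
        lk->abandoned = 1;
        pthread_mutex_unlock(&lk->lock);
        return EAI_AGAIN;
    }
    int ret = lk->ret;
    *res = lk->res;
    pthread_mutex_unlock(&lk->lock);
    free(lk->host); free(lk->port); free(lk);
    return ret;
}
```

The cost of this pattern is that an abandoned lookup still holds a thread until the underlying call returns, which is why resolver libraries like c-ares (already used by the async stack) are the more robust fix.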

Suggestions

Apply the configured connect_timeout to the sync DNS resolution path so the network call cannot freeze indefinitely.

Other problems

Currently io_timeout, which governs how long we wait for network responses, is set to infinite:

net->io_timeout = 0; /* Infinite time */

This may be a problem as well: if the CloudWatch endpoint fails to respond to some request, the cloudwatch_logs plugin will block forever. This seems like a big problem.
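For blocking sockets, a generic way to bound reads and writes is the SO_RCVTIMEO/SO_SNDTIMEO socket options. The sketch below shows the technique in isolation (it is not Fluent Bit's net layer, and the helper name set_io_timeout is hypothetical):

```c
/* Sketch: bounding blocking socket I/O with SO_RCVTIMEO/SO_SNDTIMEO,
 * as an alternative to an infinite io_timeout. Illustrative only. */
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>

/* After this call, recv()/send() on fd fail with EAGAIN/EWOULDBLOCK
 * instead of blocking forever once `seconds` have elapsed. */
int set_io_timeout(int fd, int seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0) {
        return -1;
    }
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}
```

The caller then treats a timed-out recv() as a retryable error rather than waiting on an unresponsive endpoint indefinitely.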

Screenshots

[Screenshot: Screen Shot 2022-10-04 at 5 41 28 PM]

Three cloudwatch_logs output workers are hanging on a lock; the one worker holding the lock is hanging inside a getaddrinfo() call.

[Screenshot: Screen Shot 2022-10-04 at 5 42 32 PM]

Your Environment

  • Version used: master, hash 02447c8, closest version 1.9.9
  • Configuration: see above
  • Environment name and version (e.g. Kubernetes? What version?): EC2 instance
  • Server type and version: EC2 instance
  • Operating System and version: amazon linux 2
  • Filters and plugins: tcp input, cloudwatch output

Additional context

Potentially related issues

#4606

@matthewfala matthewfala changed the title Sync DNS getaddrinfo() potentially hangs indefinitely Sync DNS getaddrinfo() potentially hangs indefinitely on network error Oct 5, 2022
@matthewfala matthewfala changed the title Sync DNS getaddrinfo() potentially hangs indefinitely on network error Sync DNS getaddrinfo() potentially hangs indefinitely Oct 5, 2022
@PettitWesley
Contributor

@matthewfala I think you need to add a link and sentence introducing this testing tool shown in your first config snippet ;)

@lubingfeng

@edsiper we are seeing multiple customers experiencing this issue when they send logs to CloudWatch. Any suggestions?

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Jan 13, 2023
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 18, 2023
@PettitWesley PettitWesley reopened this Jan 18, 2023
@matthewfala
Contributor Author

This problem mainly impacted the cloudwatch_logs C plugin due to its use of the sync network stack. The problem does not affect the async network stack.

We worked around the sync network stack's hang issue by migrating cloudwatch_logs to the async network stack. Please see: #6339

Closing for now.

If you are using the sync network stack and are experiencing hangs, please feel free to reopen this issue.
