
Fluentbit Stops Processing Incoming Events #847

Closed · notarize-tgall opened this issue Aug 5, 2024 · 1 comment

notarize-tgall commented Aug 5, 2024

Describe the question/issue

Fluentbit begins consuming a full vCPU and stops processing inputs. The issue occurs intermittently in production environments: a processing thread gets stuck in an epoll_wait loop, causing high CPU utilization and unresponsiveness to new log events.

8-9 hours after the fluentbit container stops processing new events, our main API container starts failing health checks on a simple ready endpoint.

Curling one of the http inputs defined below (port 8888, the unused one) from within the fluentbit container times out while awaiting a response (timeout set to 60s).
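The probe looked roughly like this (a runnable sketch: a local Python listener stands in for the http input so the snippet is self-contained; the port and payload are placeholders, and the real probe hit port 8888 from inside the container):

```shell
#!/bin/sh
# Stand-in listener so this sketch is runnable; the real probe targets
# fluent-bit's http input directly (port 8888 in the config below).
python3 -m http.server 18888 --bind 127.0.0.1 >/dev/null 2>&1 &
SRV=$!
sleep 1
# A healthy input answers promptly; a wedged event loop leaves curl
# hanging until --max-time (60s here in production) expires.
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 60 \
  -X POST -H 'Content-Type: application/json' -d '{"probe":"ok"}' \
  http://127.0.0.1:18888/)
kill $SRV
echo "HTTP $code"
```

On a wedged container, curl exits with code 28 (operation timed out) and no status is printed.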

All containers that exhibit this issue log the following line within a few milliseconds of when they stop processing. It is unclear whether this is a cause or a symptom, as we also see it in the logs of healthy containers.

[error] [http_client] broken connection to blobstoreurl.blob.core.windows.net:443 ?

ps -T reveals a single thread responsible for the high CPU usage. A gdb thread dump shows:

Thread 21 (Thread 0x7f7d5f9fd700 (LWP 36)):
#0  0x00007f7de938a84c in epoll_wait () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000000000a4d741 in _mk_event_wait_2 (loop=0x7f7d5ea0a000, timeout=0) at /tmp/fluent-bit-1.9.10/lib/monkey/mk_core/mk_event_epoll.c:445
        ctx = 0x7f7d5ea00000
        ret = 0
#2  0x0000000000a4db78 in mk_event_wait_2 (loop=0x7f7d5ea0a000, timeout=0) at /tmp/fluent-bit-1.9.10/lib/monkey/mk_core/mk_event.c:204
No locals.
#3  0x00000000004f7366 in flb_engine_start (config=0x7f7de6619c40) at /tmp/fluent-bit-1.9.10/src/flb_engine.c:776
        __flb_event_priority_live_foreach_iter = 2
        __flb_event_priority_live_foreach_n_events = 8
        ret = 0
        ts = 1722566136463224561
        tmp = "24.0K\000\000\000\000\000\000\000\000\000\000"
        t_flush = {tm = {tv_sec = 1, tv_nsec = 0}}
        event = 0x7f7d5ea18640
        evl = 0x7f7d5ea0a000
        evl_bktq = 0x7f7d5ea13000
        sched = 0x7f7d5ea331e0
        dns_ctx = {lookups = {prev = 0x7f7d5f9f2a70, next = 0x7f7d5f9f2a70}, lookups_drop = {prev = 0x7f7d5f9f2a80, next = 0x7f7d5f9f2a80}}
#4  0x00000000004d3f1c in flb_lib_worker (data=0x7f7de6618000) at /tmp/fluent-bit-1.9.10/src/flb_lib.c:626
        ret = 0
        ctx = 0x7f7de6618000
        config = 0x7f7de6619c40
#5  0x00007f7deaf9044b in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6  0x00007f7de938a52f in clone () from /lib64/libc.so.6
No symbol table info available.
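For reference, the stack above came from a two-step triage inside the container (a sketch; the fluent-bit PID and the SYS_PTRACE capability are assumptions about your setup, and $$ is used here only so the first command is runnable as-is):

```shell
#!/bin/sh
# 1) List threads with per-thread CPU: the spinning thread shows ~100% in
#    the %CPU column. Substitute the fluent-bit PID for $$.
ps -T -o pid,spid,pcpu,comm -p $$
# 2) Attach gdb (requires CAP_SYS_PTRACE; see the ECS task config below)
#    and dump every thread's stack:
#    gdb -p "$FLB_PID" -batch -ex 'thread apply all bt full'
```

The SPID column from ps matches the LWP number gdb prints for each thread, which is how we tied the hot thread to the epoll_wait frame.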

Configuration

[SERVICE]
    flush     1
    log_level debug
    parsers_file /fluent-bit/parsers/parsers.conf

# Note this input is inactive; should not receive any events
[INPUT]
    name http
    tag  api_logs
    listen 0.0.0.0
    port 8888

[INPUT]
    name http
    tag  telegraf_metrics
    listen 0.0.0.0
    port 9999

[FILTER]
    name rewrite_tag
    match *-firelens-*
    rule $log \"TraceToken\"\: adx_event false

[FILTER]
    name parser
    match adx_event
    key_name log
    parser json
    reserve_data true

[OUTPUT]
    name                  azure_blob
    match                 adx_event
    account_name          ${AZURE_STORAGE_ACCOUNT_NAME}
    shared_key            ${AZURE_STORAGE_SHARED_KEY}
    path                  ${AWS_ENVIRONMENT}
    blob_type             blockblob
    container_name        api-logs
    auto_create_container on
    tls                   on

# Note the corresponding input is inactive; should not receive any events
[OUTPUT]
    name                  azure_blob
    match                 api_logs
    account_name          ${AZURE_STORAGE_ACCOUNT_NAME}
    shared_key            ${AZURE_STORAGE_SHARED_KEY}
    path                  ${AWS_ENVIRONMENT}
    blob_type             blockblob
    container_name        api-logs
    auto_create_container on
    tls                   on

[OUTPUT]
    name                  azure_blob
    match                 telegraf_metrics
    account_name          ${AZURE_STORAGE_ACCOUNT_NAME}
    shared_key            ${AZURE_STORAGE_SHARED_KEY}
    path                  ${AWS_ENVIRONMENT}
    blob_type             blockblob
    container_name        api-metrics
    auto_create_container on
    tls                   on

[OUTPUT]
    name                  http
    match                 *-firelens-*
    host                  ${SUMOLOGIC_HOST}
    port                  443
    uri                   ${SUMOLOGIC_ENDPOINT}
    format                json_lines
    tls                   on
    tls.verify            off

Fluent Bit Log Output

Last 100 lines (can provide more -- either other containers exhibiting the issue, or further back, if helpful)


@timestamp @message
2024-08-02 14:45:44.343 [2024/08/02 14:45:44] [debug] [upstream] KA connection #89 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:45:44.343 [2024/08/02 14:45:44] [debug] [upstream] KA connection #89 to blobstoreurl.blob.core.windows.net:443 has been disconnected by the remote service
2024-08-02 14:45:44.343 [2024/08/02 14:45:44] [error] [output:azure_blob:azure_blob.2] error requesting container properties
2024-08-02 14:45:44.343 [2024/08/02 14:45:44] [error] [http_client] broken connection to blobstoreurl.blob.core.windows.net:443 ?
2024-08-02 14:44:48.883 [2024/08/02 14:44:48] [debug] [upstream] KA connection #87 to blobstoreurl.blob.core.windows.net:443 has been disconnected by the remote service
2024-08-02 14:44:48.883 [2024/08/02 14:44:48] [error] [http_client] broken connection to blobstoreurl.blob.core.windows.net:443 ?
2024-08-02 14:44:48.883 [2024/08/02 14:44:48] [debug] [upstream] KA connection #87 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:44:48.883 [2024/08/02 14:44:48] [error] [output:azure_blob:azure_blob.0] error requesting container properties
2024-08-02 14:44:46.045 [2024/08/02 14:44:46] [error] [output:azure_blob:azure_blob.2] error sending append_blob
2024-08-02 14:44:46.045 [2024/08/02 14:44:46] [debug] [upstream] KA connection #91 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:44:46.045 [2024/08/02 14:44:46] [debug] [upstream] KA connection #91 to blobstoreurl.blob.core.windows.net:443 has been disconnected by the remote service
2024-08-02 14:44:46.045 [2024/08/02 14:44:46] [error] [http_client] broken connection to blobstoreurl.blob.core.windows.net:443 ?
2024-08-02 14:43:28.876 [2024/08/02 14:43:28] [debug] [upstream] drop keepalive connection #-1 to endpoint2.collection.us2.sumologic.com:443 (keepalive idle timeout)
2024-08-02 14:43:28.876 [2024/08/02 14:43:28] [debug] [upstream] drop keepalive connection #-1 to blobstoreurl.blob.core.windows.net:443 (keepalive idle timeout)
2024-08-02 14:43:27.376 [2024/08/02 14:43:27] [debug] [upstream] drop keepalive connection #-1 to endpoint2.collection.us2.sumologic.com:443 (keepalive idle timeout)
2024-08-02 14:43:27.376 [2024/08/02 14:43:27] [debug] [upstream] drop keepalive connection #-1 to blobstoreurl.blob.core.windows.net:443 (keepalive idle timeout)
2024-08-02 14:42:58.501 [2024/08/02 14:42:58] [debug] [out flush] cb_destroy coro_id=1322
2024-08-02 14:42:58.501 [2024/08/02 14:42:58] [debug] [upstream] KA connection #86 to endpoint2.collection.us2.sumologic.com:443 is now available
2024-08-02 14:42:58.501 [2024/08/02 14:42:58] [ info] [output:http:http.3] endpoint2.collection.us2.sumologic.com:443, HTTP status=200
2024-08-02 14:42:58.413 [2024/08/02 14:42:58] [error] [http_client] broken connection to blobstoreurl.blob.core.windows.net:443 ?
2024-08-02 14:42:58.413 [2024/08/02 14:42:58] [error] [output:azure_blob:azure_blob.0] error sending append_blob
2024-08-02 14:42:58.399 [2024/08/02 14:42:58] [debug] [out flush] cb_destroy coro_id=2641
2024-08-02 14:42:58.399 [2024/08/02 14:42:58] [debug] [task] destroy task=0x7f999ac04d60 (task_id=5)
2024-08-02 14:42:58.399 [2024/08/02 14:42:58] [ info] [output:azure_blob:azure_blob.0] blob id ZmxiLTE3MjI2MDk3NzguMzgwMy5pZA== committed successfully
2024-08-02 14:42:58.399 [2024/08/02 14:42:58] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:42:58.388 [2024/08/02 14:42:58] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:42:58.388 [2024/08/02 14:42:58] [debug] [http_client] not using http_proxy for header
2024-08-02 14:42:58.388 [2024/08/02 14:42:58] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 has been assigned (recycled)
2024-08-02 14:42:58.388 [2024/08/02 14:42:58] [ info] [output:azure_blob:azure_blob.0] content appended to blob successfully
2024-08-02 14:42:58.380 [2024/08/02 14:42:58] [debug] [http_client] not using http_proxy for header
2024-08-02 14:42:58.380 [2024/08/02 14:42:58] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 has been assigned (recycled)
2024-08-02 14:42:58.380 [2024/08/02 14:42:58] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:42:58.376 [2024/08/02 14:42:58] [debug] [http_client] not using http_proxy for header
2024-08-02 14:42:58.376 [2024/08/02 14:42:58] [debug] [upstream] KA connection #86 to endpoint2.collection.us2.sumologic.com:443 has been assigned (recycled)
2024-08-02 14:42:58.376 [2024/08/02 14:42:58] [debug] [http_client] not using http_proxy for header
2024-08-02 14:42:58.376 [2024/08/02 14:42:58] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 has been assigned (recycled)
2024-08-02 14:42:58.376 [2024/08/02 14:42:58] [debug] [task] created task=0x7f999ac04d60 id=5 OK
2024-08-02 14:42:58.376 [2024/08/02 14:42:58] [debug] [output:http:http.3] task_id=0 assigned to thread #1
2024-08-02 14:42:58.376 [2024/08/02 14:42:58] [debug] [task] created task=0x7f999ac043c0 id=0 OK
2024-08-02 14:42:58.351 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=1009, records=1, input=emitter_for_rewrite_tag.1
2024-08-02 14:42:58.350 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=506, records=1, input=forward.1
2024-08-02 14:42:58.334 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=879, records=1, input=forward.1
2024-08-02 14:42:58.334 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=469, records=1, input=forward.1
2024-08-02 14:42:58.327 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=496, records=1, input=forward.1
2024-08-02 14:42:58.326 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=1023, records=1, input=emitter_for_rewrite_tag.1
2024-08-02 14:42:58.325 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=506, records=1, input=forward.1
2024-08-02 14:42:58.309 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=895, records=1, input=forward.1
2024-08-02 14:42:58.308 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=469, records=1, input=forward.1
2024-08-02 14:42:58.301 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=496, records=1, input=forward.1
2024-08-02 14:42:58.249 [2024/08/02 14:42:58] [debug] [input chunk] update output instances with new chunk size diff=557, records=1, input=forward.1
2024-08-02 14:42:57.974 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=984, records=1, input=emitter_for_rewrite_tag.1
2024-08-02 14:42:57.974 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=448, records=1, input=forward.1
2024-08-02 14:42:57.971 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=967, records=1, input=forward.1
2024-08-02 14:42:57.971 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=440, records=1, input=forward.1
2024-08-02 14:42:57.963 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=476, records=1, input=forward.1
2024-08-02 14:42:57.959 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=503, records=1, input=forward.1
2024-08-02 14:42:57.959 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=1006, records=1, input=emitter_for_rewrite_tag.1
2024-08-02 14:42:57.947 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=470, records=1, input=forward.1
2024-08-02 14:42:57.943 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=466, records=1, input=forward.1
2024-08-02 14:42:57.943 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=892, records=1, input=forward.1
2024-08-02 14:42:57.935 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=490, records=1, input=forward.1
2024-08-02 14:42:57.693 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=1009, records=1, input=emitter_for_rewrite_tag.1
2024-08-02 14:42:57.692 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=508, records=1, input=forward.1
2024-08-02 14:42:57.594 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=885, records=1, input=forward.1
2024-08-02 14:42:57.594 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=469, records=1, input=forward.1
2024-08-02 14:42:57.587 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=496, records=1, input=forward.1
2024-08-02 14:42:57.442 [2024/08/02 14:42:57] [debug] [out flush] cb_destroy coro_id=1322
2024-08-02 14:42:57.442 [2024/08/02 14:42:57] [debug] [upstream] KA connection #88 to endpoint2.collection.us2.sumologic.com:443 is now available
2024-08-02 14:42:57.442 [2024/08/02 14:42:57] [ info] [output:http:http.3] endpoint2.collection.us2.sumologic.com:443, HTTP status=200
2024-08-02 14:42:57.442 [2024/08/02 14:42:57] [debug] [task] destroy task=0x7f999ac043c0 (task_id=0)
2024-08-02 14:42:57.430 [2024/08/02 14:42:57] [debug] [upstream] KA connection #93 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:42:57.430 [2024/08/02 14:42:57] [debug] [out flush] cb_destroy coro_id=268
2024-08-02 14:42:57.430 [2024/08/02 14:42:57] [ info] [output:azure_blob:azure_blob.2] blob id ZmxiLTE3MjI2MDk3NzcuNDExMy5pZA== committed successfully
2024-08-02 14:42:57.430 [2024/08/02 14:42:57] [debug] [task] destroy task=0x7f999ac04d60 (task_id=5)
2024-08-02 14:42:57.419 [2024/08/02 14:42:57] [debug] [upstream] KA connection #93 to blobstoreurl.blob.core.windows.net:443 has been assigned (recycled)
2024-08-02 14:42:57.419 [2024/08/02 14:42:57] [debug] [upstream] KA connection #93 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:42:57.419 [2024/08/02 14:42:57] [debug] [http_client] not using http_proxy for header
2024-08-02 14:42:57.419 [2024/08/02 14:42:57] [ info] [output:azure_blob:azure_blob.2] content appended to blob successfully
2024-08-02 14:42:57.411 [2024/08/02 14:42:57] [debug] [upstream] KA connection #93 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:42:57.411 [2024/08/02 14:42:57] [debug] [upstream] KA connection #93 to blobstoreurl.blob.core.windows.net:443 has been assigned (recycled)
2024-08-02 14:42:57.411 [2024/08/02 14:42:57] [debug] [http_client] not using http_proxy for header
2024-08-02 14:42:57.410 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=1009, records=1, input=emitter_for_rewrite_tag.1
2024-08-02 14:42:57.409 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=506, records=1, input=forward.1
2024-08-02 14:42:57.401 [2024/08/02 14:42:57] [ info] [output:azure_blob:azure_blob.0] blob id ZmxiLTE3MjI2MDk3NzcuMzgzMC5pZA== committed successfully
2024-08-02 14:42:57.401 [2024/08/02 14:42:57] [debug] [task] destroy task=0x7f999ac04dd0 (task_id=6)
2024-08-02 14:42:57.401 [2024/08/02 14:42:57] [debug] [out flush] cb_destroy coro_id=2640
2024-08-02 14:42:57.401 [2024/08/02 14:42:57] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:42:57.397 [2024/08/02 14:42:57] [debug] [http_client] not using http_proxy for header
2024-08-02 14:42:57.391 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=469, records=1, input=forward.1
2024-08-02 14:42:57.391 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=895, records=1, input=forward.1
2024-08-02 14:42:57.390 [2024/08/02 14:42:57] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:42:57.390 [2024/08/02 14:42:57] [debug] [http_client] not using http_proxy for header
2024-08-02 14:42:57.390 [2024/08/02 14:42:57] [ info] [output:azure_blob:azure_blob.0] content appended to blob successfully
2024-08-02 14:42:57.390 [2024/08/02 14:42:57] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 has been assigned (recycled)
2024-08-02 14:42:57.384 [2024/08/02 14:42:57] [debug] [input chunk] update output instances with new chunk size diff=496, records=1, input=forward.1
2024-08-02 14:42:57.383 [2024/08/02 14:42:57] [debug] [http_client] not using http_proxy for header
2024-08-02 14:42:57.383 [2024/08/02 14:42:57] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 is now available
2024-08-02 14:42:57.383 [2024/08/02 14:42:57] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 has been assigned (recycled)
2024-08-02 14:42:57.378 [2024/08/02 14:42:57] [debug] [http_client] not using http_proxy for header
2024-08-02 14:42:57.378 [2024/08/02 14:42:57] [debug] [upstream] KA connection #92 to blobstoreurl.blob.core.windows.net:443 has been assigned (recycled)

Fluent Bit Version Info

amazon/aws-for-fluent-bit:2.32.2

Cluster Details

We're using ECS Fargate with a sidecar deployment.

Relevant portions of ECS task config:

{
    "containerDefinitions": [
        {
            "name": "prod-api-service",
            "essential": true,
            "readonlyRootFilesystem": false,
            "ulimits": [
                {
                    "name": "nofile",
                    "softLimit": 1048576,
                    "hardLimit": 1048576
                }
            ],
            "logConfiguration": {
                "logDriver": "awsfirelens"
            }
        },
        {
            "name": "prod-log-router",
            "cpu": 0,
            "memory": 512,
            "portMappings": [],
            "essential": true,
            "command": [
                "/bin/sh",
                "-c",
                "echo $FLUENT_BIT_CONFIG | base64 -d > /tmp/main.conf; /entrypoint.sh" // we're doing some shenanigans to load the config file here
            ],
            "linuxParameters": {
                "capabilities": {
                    "add": [
                        "SYS_PTRACE" // added to use gdb
                    ],
                    "drop": []
                }
            },
            "user": "0",
            "readonlyRootFilesystem": false,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "api-log-router",
                    "awslogs-create-group": "true",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "firelens"
                }
            },
            "systemControls": [],
            "firelensConfiguration": {
                "type": "fluentbit",
                "options": {
                    "config-file-type": "file",
                    "config-file-value": "/tmp/main.conf",
                    "enable-ecs-log-metadata": "true"
                }
            }
        }
    ]
}

Application Details

The application sends two types of logs over STDOUT through docker's fluentd log driver (via firelens; no customization on our end here). Most logs are small and fit into a single page, but some rogue log statements contain large blobs of text, which the fluentd log driver splits into hundreds of pages.
We also send telegraf metrics in via the HTTP input plugin.

Steps to reproduce issue

Unclear. We have multiple test environments, but only see this intermittently in production.
In our application, the logs that get published to ADX (the ones picked up by the rewrite_tag filter) are gated behind an environment variable. Disabling it greatly reduces the occurrence rate, but does not eliminate the issue.
Similarly, with telegraf metrics disabled (but ADX events left on), the issue still occurs.

Related Issues

fluent/fluent-bit#1958 exhibits similar symptoms, though it was resolved many versions ago (1.4.x).

notarize-tgall (author) commented:

We were able to reproduce this issue by continuously posting events, severing fluentbit's network connection, then restoring it. It appears to be a core HTTP client issue.

Fluentbit 2.0+ included a large refactoring of the HTTP facilities via fluent/fluent-bit#5918.
After testing a newer version of fluentbit (3.x), we can no longer reproduce this issue.

If you find yourself in a similar position, you can keep the firelens log driver on your app container and run your own fluentbit image listening on the forward protocol's port 24224; the fluentd docker log driver will continue to work.
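The replacement image's config only needs a forward input in front of your existing filters and outputs (a minimal sketch; the listen address and port mirror the firelens/fluentd log-driver defaults):

[INPUT]
    name   forward
    listen 0.0.0.0
    port   24224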
