Describe the question/issue
Fluent Bit begins consuming a full vCPU indefinitely and stops processing inputs. The issue occurs intermittently in production. A processing thread is stuck in an epoll_wait loop, causing high CPU utilization and leaving the pipeline unresponsive to new log events.
8-9 hours after the Fluent Bit container stops processing new events, our main API container starts failing health checks on a simple ready endpoint.
curl-ing one of the HTTP inputs below (port 8888, the unused one) from within the Fluent Bit container times out while waiting for a response (timeout set to 60s).
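For reference, the probe looks roughly like this (the URI path and JSON payload are arbitrary placeholders; on a healthy container the input responds almost immediately):
# From inside the Fluent Bit container; port 8888 is the idle api_logs input.
curl -sS -m 60 -X POST -H 'Content-Type: application/json' \
    -d '{"probe": "ping"}' \
    http://127.0.0.1:8888/probe
# On a wedged container this hangs for the full 60s and curl exits with code 28 (operation timed out).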
All containers that exhibit this issue log the following line within a few milliseconds of when they stop processing. It is unclear whether this is a cause or a symptom, as we also see it in the logs of healthy containers:
[error] [http_client] broken connection to blobstoreurl.blob.core.windows.net:443 ?
ps -T shows a single thread responsible for the high CPU usage. A gdb thread dump reveals:
Thread 21 (Thread 0x7f7d5f9fd700 (LWP 36)):
#0 0x00007f7de938a84c in epoll_wait () from /lib64/libc.so.6
No symbol table info available.
#1 0x0000000000a4d741 in _mk_event_wait_2 (loop=0x7f7d5ea0a000, timeout=0) at /tmp/fluent-bit-1.9.10/lib/monkey/mk_core/mk_event_epoll.c:445
ctx = 0x7f7d5ea00000
ret = 0
#2 0x0000000000a4db78 in mk_event_wait_2 (loop=0x7f7d5ea0a000, timeout=0) at /tmp/fluent-bit-1.9.10/lib/monkey/mk_core/mk_event.c:204
No locals.
#3 0x00000000004f7366 in flb_engine_start (config=0x7f7de6619c40) at /tmp/fluent-bit-1.9.10/src/flb_engine.c:776
__flb_event_priority_live_foreach_iter = 2
__flb_event_priority_live_foreach_n_events = 8
ret = 0
ts = 1722566136463224561
tmp = "24.0K\000\000\000\000\000\000\000\000\000\000"
t_flush = {tm = {tv_sec = 1, tv_nsec = 0}}
event = 0x7f7d5ea18640
evl = 0x7f7d5ea0a000
evl_bktq = 0x7f7d5ea13000
sched = 0x7f7d5ea331e0
dns_ctx = {lookups = {prev = 0x7f7d5f9f2a70, next = 0x7f7d5f9f2a70}, lookups_drop = {prev = 0x7f7d5f9f2a80, next = 0x7f7d5f9f2a80}}
#4 0x00000000004d3f1c in flb_lib_worker (data=0x7f7de6618000) at /tmp/fluent-bit-1.9.10/src/flb_lib.c:626
ret = 0
ctx = 0x7f7de6618000
config = 0x7f7de6619c40
#5 0x00007f7deaf9044b in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6 0x00007f7de938a52f in clone () from /lib64/libc.so.6
No symbol table info available.
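Note the timeout=0 in frames #1 and #2, which is consistent with the event loop spinning without ever blocking. For completeness, the dump above was captured with something close to the following (this assumes pidof is available in the image and the binary is named fluent-bit):
# Identify the hot thread (SPID column) and its CPU share.
ps -T -p "$(pidof fluent-bit)" -o spid,pcpu,comm
# Attach gdb and dump full backtraces for every thread.
gdb -p "$(pidof fluent-bit)" -batch -ex 'thread apply all bt full'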
Configuration
[SERVICE]
    flush        1
    log_level    debug
    parsers_file /fluent-bit/parsers/parsers.conf

# Note this input is inactive; should not receive any events
[INPUT]
    name   http
    tag    api_logs
    listen 0.0.0.0
    port   8888

[INPUT]
    name   http
    tag    telegraf_metrics
    listen 0.0.0.0
    port   9999

[FILTER]
    name  rewrite_tag
    match *-firelens-*
    rule  $log \"TraceToken\"\: adx_event false

[FILTER]
    name         parser
    match        adx_event
    key_name     log
    parser       json
    reserve_data true

[OUTPUT]
    name                  azure_blob
    match                 adx_event
    account_name          ${AZURE_STORAGE_ACCOUNT_NAME}
    shared_key            ${AZURE_STORAGE_SHARED_KEY}
    path                  ${AWS_ENVIRONMENT}
    blob_type             blockblob
    container_name        api-logs
    auto_create_container on
    tls                   on

# Note the corresponding input is inactive; should not receive any events
[OUTPUT]
    name                  azure_blob
    match                 api_logs
    account_name          ${AZURE_STORAGE_ACCOUNT_NAME}
    shared_key            ${AZURE_STORAGE_SHARED_KEY}
    path                  ${AWS_ENVIRONMENT}
    blob_type             blockblob
    container_name        api-logs
    auto_create_container on
    tls                   on

[OUTPUT]
    name                  azure_blob
    match                 telegraf_metrics
    account_name          ${AZURE_STORAGE_ACCOUNT_NAME}
    shared_key            ${AZURE_STORAGE_SHARED_KEY}
    path                  ${AWS_ENVIRONMENT}
    blob_type             blockblob
    container_name        api-metrics
    auto_create_container on
    tls                   on

[OUTPUT]
    Name       http
    Match      *-firelens-*
    Host       ${SUMOLOGIC_HOST}
    Port       443
    URI        ${SUMOLOGIC_ENDPOINT}
    Format     json_lines
    tls        On
    tls.verify Off
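For illustration, a record shaped roughly like the one below (the field values are made up, not taken from our logs) is what the ADX path is built around: the fluentd log driver wraps the application's stdout line in a log key, the rewrite_tag rule retags any record whose log field contains "TraceToken": to adx_event, and the parser filter then expands the embedded JSON before the azure_blob output ships it.
{
    "source": "stdout",
    "container_name": "api",
    "log": "{\"TraceToken\": \"abc-123\", \"message\": \"order created\"}"
}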
Fluent Bit Log Output
Last 100 lines (can provide more -- either other containers exhibiting the issue, or further back, if helpful)
Fluent Bit Version Info
amazon/aws-for-fluent-bit:2.32.2
Cluster Details
We're using ECS Fargate with a sidecar deployment.
Relevant portions of ECS task config:
Application Details
The application sends two types of logs over STDOUT through Docker's fluentd log driver (via FireLens; no customization on our end here). Most logs are small and fit into a single page, but we have some rogue log statements that contain large blobs of logs (the fluentd log driver splits these into hundreds of pages).
We're also sending Telegraf metrics in via the HTTP input plugin.
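For context, the Telegraf side is just the stock HTTP output pointed at the second input; the snippet below is a sketch rather than our exact config (the URL and serializer options are assumptions):
[[outputs.http]]
  url = "http://127.0.0.1:9999/telegraf"
  method = "POST"
  data_format = "json"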
Steps to reproduce issue
Unclear. We have multiple test environments, but are only seeing this in production intermittently.
In our application, the logs that get published to ADX (the ones picked up by the rewrite_tag filter) are gated behind an environment variable. Disabling them greatly reduces the occurrence rate, but does not eliminate the issue entirely.
Similarly, with Telegraf metrics disabled (but ADX events left on), the issue still occurs.
Related Issues
fluent/fluent-bit#1958 exhibits similar symptoms, though it was resolved many versions ago (1.4.x).
We were able to reproduce this issue by continuously posting events, severing Fluent Bit's network connection, and then restoring it. It appears to be a core HTTP client issue.
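For anyone trying to reproduce it, this is roughly the shape of the loop we used; the commands below are a sketch, and the port, payload, and iptables rules are assumptions for a local setup rather than our production tooling:
# 1. Keep posting events so the azure_blob output stays busy.
while true; do
    curl -s -m 5 -X POST -H 'Content-Type: application/json' \
        -d '{"metric": "probe", "value": 1}' http://127.0.0.1:9999/ > /dev/null
    sleep 0.1
done &
# 2. Sever outbound connectivity to the blob endpoint (here: drop all outbound 443 traffic).
iptables -A OUTPUT -p tcp --dport 443 -j DROP
# 3. Wait for the "[error] [http_client] broken connection" line, then restore connectivity.
sleep 120
iptables -D OUTPUT -p tcp --dport 443 -j DROP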
Fluent Bit 2.0+ includes a large refactoring of the HTTP facilities via fluent/fluent-bit#5918.
After testing a newer version of Fluent Bit (3.x), we can no longer reproduce this issue.
If you find yourself in a similar position, you can keep the FireLens log driver on your app container and run your own Fluent Bit image with a forward input listening on port 24224; the fluentd Docker log driver will continue to work.
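A minimal sketch of what that custom image needs, assuming the default fluentd log driver target of port 24224 (the rest of the pipeline above stays unchanged):
[INPUT]
    name   forward
    listen 0.0.0.0
    port   24224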