
cloudwatch_logs blocks on read forever #4606

Closed
fdr opened this issue Jan 12, 2022 · 26 comments
Labels: Stale, waiting-for-user (Waiting for more information, tests or requested changes)

Comments

fdr commented Jan 12, 2022

Bug Report

fluent-bit stops sending logs forever.

Expected behavior

It should keep sending logs.

Your Environment

  • Version used: 1.8.11
  • Configuration:
[SERVICE]
    flush        5
    daemon       Off
    log_level    info
    parsers_file parsers.conf
    plugins_file plugins.conf
    http_server  Off

[INPUT]
    name            systemd
    tag             ssh
    systemd_Filter  _SYSTEMD_UNIT=sshd.service

[INPUT]
    name            tail
    tag             audit
    path            /var/log/audit/audit.log
    db              audit.db

[INPUT]
    name            systemd
    tag             pgbouncer
    systemd_filter  _SYSTEMD_UNIT=pgbouncer.service

[INPUT]
    name            syslog
    path            /run/postgres-audit.socket
    tag             postgres
    syslog_format   rfc5424


[OUTPUT]
    Name cloudwatch_logs
    Match *
    region us-west-2
    log_group_name vnplueg6ejhkhl6fuszhsgjsfa
    log_stream_prefix mdlxmfh65faezgmcplz323cchu-
  • Operating System and version: AlmaLinux 8.5

Additional context

Thread stack traces look like this:

(gdb) thread apply all bt

Thread 3 (Thread 0x7f28679ff700 (LWP 1888330)):
#0  0x00007f286c5f80f7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000000000976901 in _mk_event_wait ()
#2  0x0000000000976c10 in mk_event_wait ()
#3  0x00000000004bc615 in log_worker_collector ()
#4  0x00000000004e22d8 in step_callback ()
#5  0x00007f286e30f17a in start_thread () from /lib64/libpthread.so.0
#6  0x00007f286c5f7dc3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f28687ff700 (LWP 1888329)):
#0  0x00007f286e3189e4 in read () from /lib64/libpthread.so.0
#1  0x00007f286d8004d2 in sock_read () from /lib64/libcrypto.so.10
#2  0x00007f286d7fe337 in BIO_read () from /lib64/libcrypto.so.10
#3  0x00007f286db714e4 in ssl3_read_n () from /lib64/libssl.so.10
#4  0x00007f286db72cd5 in ssl3_read_bytes () from /lib64/libssl.so.10
#5  0x00007f286db6f714 in ssl3_read_internal () from /lib64/libssl.so.10
#6  0x00000000004f0f99 in tls_net_read ()
#7  0x00000000004f14a9 in flb_tls_net_read ()
#8  0x00000000004fca5e in flb_io_net_read ()
#9  0x00000000004fef9c in flb_http_do ()
#10 0x00000000005da2ce in request_do ()
#11 0x00000000005d97d8 in flb_aws_client_request ()
#12 0x0000000000594920 in put_log_events ()
#13 0x0000000000592235 in send_log_events ()
#14 0x0000000000593455 in process_and_send ()
#15 0x000000000059038d in cb_cloudwatch_flush ()
#16 0x00000000004c89f0 in output_pre_cb_flush ()
#17 0x0000000000978dc6 in co_init ()
#18 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f286e738bc0 (LWP 1888328)):
#0  0x00007f286c5c3d98 in nanosleep () from /lib64/libc.so.6
#1  0x00007f286c5c3c9e in sleep () from /lib64/libc.so.6
#2  0x000000000045289d in flb_main ()
#3  0x000000000045290b in main ()
(gdb) q

It has been blocked for about three days, since not long after it started.

Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu systemd[1]: Started TD Agent Bit.
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: Fluent Bit v1.8.11
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: * Copyright (C) 2015-2018 Treasure Data
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: * https://fluentbit.io
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:03] [ info] [engine] started (pid=1888328)
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:03] [ info] [storage] version=1.1.5, initializing...
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:03] [ info] [storage] in-memory
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:03] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:03] [ info] [cmetrics] version=0.2.2
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:03] [ info] [in_syslog] UDP buffer size set to 32768 bytes
Jan 09 14:31:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:03] [ info] [sp] stream processor started
Jan 09 14:31:04 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:04] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51280075 watch_fd=1 name=/var/log/audit/audit.log
Jan 09 14:31:08 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:08] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream mdlxmfh65faezgmcplz323cchu-ssh in log group vnplueg6ejhkhl6fuszhsgjsfa
Jan 09 14:31:08 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:08] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Log Stream mdlxmfh65faezgmcplz323cchu-ssh already exists
Jan 09 14:31:11 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:11] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream mdlxmfh65faezgmcplz323cchu-audit in log group vnplueg6ejhkhl6fuszhsgjsfa
Jan 09 14:31:12 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:12] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Log Stream mdlxmfh65faezgmcplz323cchu-audit already exists
Jan 09 14:31:12 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:12] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream mdlxmfh65faezgmcplz323cchu-pgbouncer in log group vnplueg6ejhkhl6fuszhsgjsfa
Jan 09 14:31:12 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:12] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Log Stream mdlxmfh65faezgmcplz323cchu-pgbouncer already exists
Jan 09 14:31:18 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:18] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream mdlxmfh65faezgmcplz323cchu-postgres in log group vnplueg6ejhkhl6fuszhsgjsfa
Jan 09 14:31:18 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:31:18] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Log Stream mdlxmfh65faezgmcplz323cchu-postgres already exists
Jan 09 14:52:34 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:52:34] [ info] [input:tail:tail.1] inode=51280075 handle rotation(): /var/log/audit/audit.log => /var/log/audit/audit.log.1
Jan 09 14:52:34 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:52:34] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51280075 watch_fd=1
Jan 09 14:52:34 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:52:34] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51280075 watch_fd=2 name=/var/log/audit/audit.log.1
Jan 09 14:52:34 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:52:34] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51594127 watch_fd=3 name=/var/log/audit/audit.log
Jan 09 14:52:43 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 14:52:43] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51280075 watch_fd=2
Jan 09 15:15:21 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 15:15:21] [ info] [input:tail:tail.1] inode=51594127 handle rotation(): /var/log/audit/audit.log => /var/log/audit/audit.log.1
Jan 09 15:15:21 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 15:15:21] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51594127 watch_fd=3
Jan 09 15:15:21 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 15:15:21] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51594127 watch_fd=4 name=/var/log/audit/audit.log.1
Jan 09 15:15:21 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 15:15:21] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51129053 watch_fd=5 name=/var/log/audit/audit.log
Jan 09 15:15:28 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 15:15:28] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51594127 watch_fd=4
Jan 09 15:38:15 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 15:38:15] [ info] [input:tail:tail.1] inode=51129053 handle rotation(): /var/log/audit/audit.log => /var/log/audit/audit.log.1
Jan 09 15:38:15 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 15:38:15] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51129053 watch_fd=5
Jan 09 15:38:15 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 15:38:15] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51129053 watch_fd=6 name=/var/log/audit/audit.log.1
Jan 09 15:38:15 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 15:38:15] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51594220 watch_fd=7 name=/var/log/audit/audit.log
Jan 09 15:38:23 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 15:38:23] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51129053 watch_fd=6
Jan 09 16:00:58 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:00:58] [ info] [input:tail:tail.1] inode=51594220 handle rotation(): /var/log/audit/audit.log => /var/log/audit/audit.log.1
Jan 09 16:00:58 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:00:58] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51594220 watch_fd=7
Jan 09 16:00:58 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:00:58] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51594220 watch_fd=8 name=/var/log/audit/audit.log.1
Jan 09 16:00:58 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:00:58] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51401458 watch_fd=9 name=/var/log/audit/audit.log
Jan 09 16:01:03 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:01:03] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51594220 watch_fd=8
Jan 09 16:23:50 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:23:50] [ info] [input:tail:tail.1] inode=51401458 handle rotation(): /var/log/audit/audit.log => /var/log/audit/audit.log.1
Jan 09 16:23:50 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:23:50] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51401458 watch_fd=9
Jan 09 16:23:50 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:23:50] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51401458 watch_fd=10 name=/var/log/audit/audit.log.1
Jan 09 16:23:50 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:23:50] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51280075 watch_fd=11 name=/var/log/audit/audit.log
Jan 09 16:23:58 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:23:58] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51401458 watch_fd=10
Jan 09 16:46:30 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:46:30] [ info] [input:tail:tail.1] inode=51280075 handle rotation(): /var/log/audit/audit.log => /var/log/audit/audit.log.1
Jan 09 16:46:30 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:46:30] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51280075 watch_fd=11
Jan 09 16:46:30 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:46:30] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51280075 watch_fd=12 name=/var/log/audit/audit.log.1
Jan 09 16:46:30 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:46:30] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51594127 watch_fd=13 name=/var/log/audit/audit.log
Jan 09 16:46:38 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 16:46:38] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51280075 watch_fd=12
Jan 09 17:09:08 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:09:08] [ info] [input:tail:tail.1] inode=51594127 handle rotation(): /var/log/audit/audit.log => /var/log/audit/audit.log.1
Jan 09 17:09:08 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:09:08] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51594127 watch_fd=13
Jan 09 17:09:08 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:09:08] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51594127 watch_fd=14 name=/var/log/audit/audit.log.1
Jan 09 17:09:08 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:09:08] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51129053 watch_fd=15 name=/var/log/audit/audit.log
Jan 09 17:09:13 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:09:13] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51594127 watch_fd=14
Jan 09 17:31:46 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:31:46] [ info] [input:tail:tail.1] inode=51129053 handle rotation(): /var/log/audit/audit.log => /var/log/audit/audit.log.1
Jan 09 17:31:46 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:31:46] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51129053 watch_fd=15
Jan 09 17:31:46 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:31:46] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51129053 watch_fd=16 name=/var/log/audit/audit.log.1
Jan 09 17:31:46 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:31:46] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51594220 watch_fd=17 name=/var/log/audit/audit.log
Jan 09 17:31:53 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:31:53] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51129053 watch_fd=16
Jan 09 17:54:09 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:54:09] [ info] [input:tail:tail.1] inode=51594220 handle rotation(): /var/log/audit/audit.log => /var/log/audit/audit.log.1
Jan 09 17:54:09 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:54:09] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51594220 watch_fd=17
Jan 09 17:54:09 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:54:09] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51594220 watch_fd=18 name=/var/log/audit/audit.log.1
Jan 09 17:54:09 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:54:09] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51401458 watch_fd=19 name=/var/log/audit/audit.log
Jan 09 17:54:18 i-mdlxmfh65faezgmcplz323cchu td-agent-bit[1888328]: [2022/01/09 17:54:18] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51594220 watch_fd=18
[centos@i-mdlxmfh65faezgmcplz323cchu ~]$ date
Wed Jan 12 17:55:15 UTC 2022

Note that, unlike some of my other reports of this kind, no errors at all have been displayed since the program started.

fdr commented Jan 12, 2022

I have a second occurrence on another, unrelated server:

Thread 3 (Thread 0x7f41c4fff700 (LWP 2778312)):
#0  0x00007f41c9d5f0f7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000000000976901 in _mk_event_wait ()
#2  0x0000000000976c10 in mk_event_wait ()
#3  0x00000000004bc615 in log_worker_collector ()
#4  0x00000000004e22d8 in step_callback ()
#5  0x00007f41cba7417a in start_thread () from /lib64/libpthread.so.0
#6  0x00007f41c9d5edc3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f41c5dff700 (LWP 2778311)):
#0  0x00007f41cba7d9e4 in read () from /lib64/libpthread.so.0
#1  0x00007f41caf674d2 in sock_read () from /lib64/libcrypto.so.10
#2  0x00007f41caf65337 in BIO_read () from /lib64/libcrypto.so.10
#3  0x00007f41cb2d84e4 in ssl3_read_n () from /lib64/libssl.so.10
#4  0x00007f41cb2d9db0 in ssl3_read_bytes () from /lib64/libssl.so.10
#5  0x00007f41cb2db4e4 in ssl3_get_message () from /lib64/libssl.so.10
#6  0x00007f41cb2cf8f7 in ssl3_get_server_certificate () from /lib64/libssl.so.10
#7  0x00007f41cb2d4732 in ssl3_connect () from /lib64/libssl.so.10
#8  0x00007f41cb2de2fe in ssl23_connect () from /lib64/libssl.so.10
#9  0x00000000004f1133 in tls_net_handshake ()
#10 0x00000000004f18c1 in flb_tls_session_create ()
#11 0x00000000004fc0e1 in flb_io_net_connect ()
#12 0x00000000004e08ec in create_conn ()
#13 0x00000000004e0e5f in flb_upstream_conn_get ()
#14 0x00000000005d9c36 in request_do ()
#15 0x00000000005d97d8 in flb_aws_client_request ()
#16 0x0000000000594920 in put_log_events ()
#17 0x0000000000592235 in send_log_events ()
#18 0x0000000000593455 in process_and_send ()
#19 0x000000000059038d in cb_cloudwatch_flush ()
#20 0x00000000004c89f0 in output_pre_cb_flush ()
#21 0x0000000000978dc6 in co_init ()
#22 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f41cbea0bc0 (LWP 2778310)):
#0  0x00007f41c9d2ad98 in nanosleep () from /lib64/libc.so.6
#1  0x00007f41c9d2ac9e in sleep () from /lib64/libc.so.6
#2  0x000000000045289d in flb_main ()
#3  0x000000000045290b in main ()

fdr commented Jan 12, 2022

This may be a duplicate of #4553; the errors may be incidental.

fdr commented Jan 25, 2022

@PettitWesley, sorry to bother you, but reading the commit logs, this seems closest to the areas of the code you've been modifying. Do you have any thoughts on the blocking behavior, know who might be the best person to handle this, or need any more information? I have a few live cases rolling around.

@PettitWesley
Contributor

@fdr Sorry I missed this issue- are you using the AWS Distro or not? We recently have been adding some extra patches... one of which is for an issue that can cause it to get stuck on network IO: #4098
https://github.com/aws/aws-for-fluent-bit/releases

fdr commented Jan 26, 2022

@fdr Sorry I missed this issue- are you using the AWS Distro or not?

No, I'm not. AlmaLinux of recent vintage here. Thanks for considering my question.

fdr commented Jan 26, 2022

Looks like 1.8.12 may contain some of these? I'll upgrade and see what happens.

fdr commented Jan 27, 2022

Alas, 1.8.12 is affected as well. Are some of the fixes missing from it? Temporally (merge vs. release date) it seems like they should be included, but I haven't used git describe or anything like that just yet.

Jan 26 21:01:24 i-bjrjy5jarney5n6ize4lk7rhna systemd[1]: td-agent-bit.service: Failed with result 'timeout'.
Jan 26 21:01:24 i-bjrjy5jarney5n6ize4lk7rhna systemd[1]: Stopped TD Agent Bit.
Jan 26 21:01:24 i-bjrjy5jarney5n6ize4lk7rhna systemd[1]: Started TD Agent Bit.
Jan 26 21:01:24 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2084989]: Fluent Bit v1.8.12
Jan 26 21:01:24 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2084989]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Jan 26 21:01:24 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2084989]: * Copyright (C) 2015-2018 Treasure Data
Jan 26 21:01:24 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2084989]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Jan 26 21:01:24 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2084989]: * https://fluentbit.io
Thread 3 (Thread 0x7f9352fff700 (LWP 2084996)):
#0  0x00007f9357bbb0f7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000000000979cfa in _mk_event_wait ()
#2  0x000000000097a057 in mk_event_wait ()
#3  0x00000000004bc935 in log_worker_collector ()
#4  0x00000000004e292c in step_callback ()
#5  0x00007f93598d217a in start_thread () from /lib64/libpthread.so.0
#6  0x00007f9357bbadc3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f9353dff700 (LWP 2084995)):
#0  0x00007f93598db9e4 in read () from /lib64/libpthread.so.0
#1  0x00007f9358dc34d2 in sock_read () from /lib64/libcrypto.so.10
#2  0x00007f9358dc1337 in BIO_read () from /lib64/libcrypto.so.10
#3  0x00007f93591344e4 in ssl3_read_n () from /lib64/libssl.so.10
#4  0x00007f9359135cd5 in ssl3_read_bytes () from /lib64/libssl.so.10
#5  0x00007f93591374e4 in ssl3_get_message () from /lib64/libssl.so.10
#6  0x00007f935912e4b6 in ssl3_get_key_exchange () from /lib64/libssl.so.10
#7  0x00007f935913005b in ssl3_connect () from /lib64/libssl.so.10
#8  0x00007f935913a2fe in ssl23_connect () from /lib64/libssl.so.10
#9  0x00000000004f17c0 in tls_net_handshake ()
#10 0x00000000004f1fa2 in flb_tls_session_create ()
#11 0x00000000004fcbaa in flb_io_net_connect ()
#12 0x00000000004e0ecc in create_conn ()
#13 0x00000000004e143f in flb_upstream_conn_get ()
#14 0x00000000005dc77f in request_do ()
#15 0x00000000005dc321 in flb_aws_client_request ()
#16 0x0000000000595f62 in put_log_events ()
#17 0x00000000005937a9 in send_log_events ()
#18 0x0000000000594a97 in process_and_send ()
#19 0x00000000005918fe in cb_cloudwatch_flush ()
#20 0x00000000004c8d2c in output_pre_cb_flush ()
#21 0x000000000097c206 in co_init ()
#22 0x6469752038393130 in ?? ()
#23 0x3d6469756120303d in ?? ()
#24 0x7365732030303031 in ?? ()
#25 0x627573203139383d in ?? ()
#26 0x6d65747379733d6a in ?? ()
#27 0x65747379733a755f in ?? ()
#28 0x646873733a725f6d in ?? ()
#29 0x30732d30733a745f in ?? ()
#30 0x323031632e30633a in ?? ()
#31 0x6f273d67736d2033 in ?? ()
#32 0x206e69676f6c3d70 in ?? ()
#33 0x20303030313d6469 in ?? ()
#34 0x73752f223d657865 in ?? ()
#35 0x732f6e6962732f72 in ?? ()
#36 0x736f682022646873 in ?? ()
#37 0x203f3d656d616e74 in ?? ()
#38 0x2e35333d72646461 in ?? ()
#39 0x2e3331312e363032 in ?? ()
#40 0x6d72657420313432 in ?? ()
--Type <RET> for more, q to quit, c to continue without paging--c
#41 0x6873733d6c616e69 in ?? ()
#42 0x6375733d73657220 in ?? ()
#43 0x49551d2773736563 in ?? ()
#44 0x22746f6f72223d44 in ?? ()
#45 0x63223d4449554120 in ?? ()
#46 0x492022736f746e65 in ?? ()
#47 0x6f746e6563223d44 in ?? ()
#48 0x94f26100d7922273 in ?? ()
#49 0x6ca3819b8d1005a6 in ?? ()
#50 0x7079740c01da676f in ?? ()
#51 0x4c5f524553553d65 in ?? ()
#52 0x736d2054554f474f in ?? ()
#53 0x2874696475613d67 in ?? ()
#54 0x3737383233343631 in ?? ()
#55 0x323a3338302e3831 in ?? ()
#56 0x3a29343431303839 in ?? ()
#57 0x3036313d64697020 in ?? ()
#58 0x3d64697520383931 in ?? ()
#59 0x313d646975612030 in ?? ()
#60 0x3d73657320303030 in ?? ()
#61 0x6a62757320313938 in ?? ()
#62 0x5f6d65747379733d in ?? ()
#63 0x6d65747379733a75 in ?? ()
#64 0x5f646873733a725f in ?? ()
#65 0x3a30732d30733a74 in ?? ()
#66 0x33323031632e3063 in ?? ()
#67 0x706f273d67736d20 in ?? ()
#68 0x69206e69676f6c3d in ?? ()
#69 0x6520303030313d64 in ?? ()
#70 0x7273752f223d6578 in ?? ()
#71 0x73732f6e6962732f in ?? ()
#72 0x74736f6820226468 in ?? ()
#73 0x61203f3d656d616e in ?? ()
#74 0x322e35333d726464 in ?? ()
#75 0x322e3331312e3630 in ?? ()
#76 0x696d726574203134 in ?? ()
#77 0x206873733d6c616e in ?? ()
#78 0x636375733d736572 in ?? ()
#79 0x4449551d27737365 in ?? ()
#80 0x2022746f6f72223d in ?? ()
#81 0x6563223d44495541 in ?? ()
#82 0x44492022736f746e in ?? ()
#83 0x736f746e6563223d in ?? ()
#84 0xa694f26100d79222 in ?? ()
#85 0x6f6ca3810c171505 in ?? ()
#86 0x657079740b01da67 in ?? ()
#87 0x4f4c5f524553553d in ?? ()
#88 0x3d67736d204e4947 in ?? ()
#89 0x3631287469647561 in ?? ()
#90 0x3831373738323334 in ?? ()
#91 0x3839323a3438302e in ?? ()
#92 0x70203a2935343130 in ?? ()
#93 0x39313036313d6469 in ?? ()
#94 0x20303d6469752038 in ?? ()
#95 0x3030313d64697561 in ?? ()
#96 0x39383d7365732030 in ?? ()
#97 0x733d6a6275732031 in ?? ()
#98 0x3a755f6d65747379 in ?? ()
#99 0x725f6d6574737973 in ?? ()
#100 0x3a745f646873733a in ?? ()
#101 0x30633a30732d3073 in ?? ()
#102 0x6d2033323031632e in ?? ()
#103 0x6c3d706f273d6773 in ?? ()
#104 0x3d6469206e69676f in ?? ()
#105 0x6578652030303031 in ?? ()
#106 0x732f7273752f223d in ?? ()
#107 0x646873732f6e6962 in ?? ()
#108 0x616e74736f682022 in ?? ()
#109 0x646461203f3d656d in ?? ()
#110 0x3630322e35333d72 in ?? ()
#111 0x3134322e3331312e in ?? ()
#112 0x616e696d72657420 in ?? ()
#113 0x6572206873733d6c in ?? ()
#114 0x7365636375733d73 in ?? ()
#115 0x223d4449551d2773 in ?? ()
#116 0x55412022746f6f72 in ?? ()
#117 0x746e6563223d4449 in ?? ()
#118 0x223d44492022736f in ?? ()
#119 0x9222736f746e6563 in ?? ()
#120 0x1505a694f26100d7 in ?? ()
#121 0xda676f6ca3816821 in ?? ()
#122 0x553d657079740b01 in ?? ()
#123 0x524154535f524553 in ?? ()
#124 0x75613d67736d2054 in ?? ()
#125 0x3334363128746964 in ?? ()
#126 0x302e383137373832 in ?? ()
#127 0x31303839323a3438 in ?? ()
#128 0x646970203a293634 in ?? ()
#129 0x203839313036313d in ?? ()
#130 0x756120303d646975 in ?? ()
#131 0x20303030313d6469 in ?? ()
#132 0x203139383d736573 in ?? ()
#133 0x7379733d6a627573 in ?? ()
#134 0x79733a755f6d6574 in ?? ()
#135 0x733a725f6d657473 in ?? ()
#136 0x30733a745f646873 in ?? ()
#137 0x632e30633a30732d in ?? ()
#138 0x67736d2033323031 in ?? ()
#139 0x676f6c3d706f273d in ?? ()
#140 0x30313d6469206e69 in ?? ()
#141 0x223d657865203030 in ?? ()
#142 0x6962732f7273752f in ?? ()
#143 0x2022646873732f6e in ?? ()
#144 0x656d616e74736f68 in ?? ()
#145 0x3d72646461203f3d in ?? ()
#146 0x312e3630322e3533 in ?? ()
#147 0x74203134322e3331 in ?? ()
#148 0x3d6c616e696d7265 in ?? ()
#149 0x3d73657220687373 in ?? ()
#150 0x2773736563637573 in ?? ()
#151 0x6f72223d4449551d in ?? ()
#152 0x444955412022746f in ?? ()
#153 0x736f746e6563223d in ?? ()
#154 0x6563223d44492022 in ?? ()
#155 0x00d79222736f746e in ?? ()
#156 0xa7e42805a694f261 in ?? ()
#157 0x9701da676f6ca381 in ?? ()
#158 0x5952433d65707974 in ?? ()
#159 0x5f59454b5f4f5450 in ?? ()
#160 0x67736d2052455355 in ?? ()
#161 0x312874696475613d in ?? ()
#162 0x3137373832333436 in ?? ()
#163 0x39323a3538302e38 in ?? ()
#164 0x203a293734313038 in ?? ()
#165 0x313036313d646970 in ?? ()
#166 0x303d646975203839 in ?? ()
#167 0x30313d6469756120 in ?? ()
#168 0x383d736573203030 in ?? ()
#169 0x3d6a627573203139 in ?? ()
#170 0x755f6d6574737973 in ?? ()
#171 0x5f6d65747379733a in ?? ()
#172 0x745f646873733a72 in ?? ()
#173 0x633a30732d30733a in ?? ()
#174 0x2033323031632e30 in ?? ()
#175 0x3d706f273d67736d in ?? ()
#176 0x20796f7274736564 in ?? ()
#177 0x7265733d646e696b in ?? ()
#178 0x533d706620726576 in ?? ()
#179 0x36313a3635324148 in ?? ()
#180 0x343a34313a33323a in ?? ()
#181 0x3a65363a66323a36 in ?? ()
#182 0x32393a37373a3862 in ?? ()
#183 0x373a33663a36343a in ?? ()
#184 0x3a34373a39363a31 in ?? ()
#185 0x33373a33393a3031 in ?? ()
#186 0x663a62303a34663a in ?? ()
#187 0x3a66653a33643a35 in ?? ()
#188 0x34643a34653a3238 in ?? ()
#189 0x623a36613a37663a in ?? ()
#190 0x3a35353a34653a36 in ?? ()
#191 0x69642062373a3861 in ?? ()
#192 0x3d6e6f6974636572 in ?? ()
#193 0x323d64697073203f in ?? ()
#194 0x7320393437343730 in ?? ()
#195 0x303030313d646975 in ?? ()
#196 0x2f223d6578652020 in ?? ()
#197 0x6e6962732f727375 in ?? ()
#198 0x682022646873732f in ?? ()
#199 0x3d656d616e74736f in ?? ()
#200 0x3f3d72646461203f in ?? ()
#201 0x616e696d72657420 in ?? ()
#202 0x3d736572203f3d6c in ?? ()
#203 0x2773736563637573 in ?? ()
#204 0x6f72223d4449551d in ?? ()
#205 0x444955412022746f in ?? ()
#206 0x736f746e6563223d in ?? ()
#207 0x223d444955532022 in ?? ()
#208 0x9222736f746e6563 in ?? ()
#209 0x3709a694f26100d7 in ?? ()
#210 0xda676f6ca381d058 in ?? ()
#211 0x553d657079740901 in ?? ()
#212 0x20444e455f524553 in ?? ()
#213 0x696475613d67736d in ?? ()
#214 0x3832333436312874 in ?? ()
#215 0x3335312e38313737 in ?? ()
#216 0x383431303839323a in ?? ()
#217 0x313d646970203a29 in ?? ()
#218 0x6975203839313036 in ?? ()
#219 0x6469756120303d64 in ?? ()
#220 0x657320303030313d in ?? ()
#221 0x7573203139383d73 in ?? ()
#222 0x65747379733d6a62 in ?? ()
#223 0x747379733a755f6d in ?? ()
#224 0x6873733a725f6d65 in ?? ()
#225 0x732d30733a745f64 in ?? ()
#226 0x3031632e30633a30 in ?? ()
#227 0x273d67736d203332 in ?? ()
#228 0x6e69676f6c3d706f in ?? ()
#229 0x303030313d646920 in ?? ()
#230 0x752f223d65786520 in ?? ()
#231 0x2f6e6962732f7273 in ?? ()
#232 0x6f68202264687373 in ?? ()
#233 0x3f3d656d616e7473 in ?? ()
#234 0x35333d7264646120 in ?? ()
#235 0x3331312e3630322e in ?? ()
#236 0x726574203134322e in ?? ()
#237 0x73733d6c616e696d in ?? ()
#238 0x75733d7365722068 in ?? ()
#239 0x551d277373656363 in ?? ()
#240 0x746f6f72223d4449 in ?? ()
#241 0x223d444955412022 in ?? ()
#242 0x2022736f746e6563 in ?? ()
#243 0x746e6563223d4449 in ?? ()
#244 0xf26100d79222736f in ?? ()
#245 0xa3813d6f3709a694 in ?? ()
#246 0x79740c01da676f6c in ?? ()
#247 0x5f524553553d6570 in ?? ()
#248 0x6d2054554f474f4c in ?? ()
#249 0x74696475613d6773 in ?? ()
#250 0x3738323334363128 in ?? ()
#251 0x3a3335312e383137 in ?? ()
#252 0x2939343130383932 in ?? ()
#253 0x36313d646970203a in ?? ()
#254 0x6469752038393130 in ?? ()
#255 0x3d6469756120303d in ?? ()
#256 0x7365732030303031 in ?? ()
#257 0x627573203139383d in ?? ()
#258 0x6d65747379733d6a in ?? ()
#259 0x65747379733a755f in ?? ()
#260 0x646873733a725f6d in ?? ()
#261 0x30732d30733a745f in ?? ()
#262 0x323031632e30633a in ?? ()
#263 0x6f273d67736d2033 in ?? ()
#264 0x206e69676f6c3d70 in ?? ()
#265 0x20303030313d6469 in ?? ()
#266 0x73752f223d657865 in ?? ()
#267 0x732f6e6962732f72 in ?? ()
#268 0x736f682022646873 in ?? ()
#269 0x203f3d656d616e74 in ?? ()
#270 0x2e35333d72646461 in ?? ()
#271 0x2e3331312e363032 in ?? ()
#272 0x6d72657420313432 in ?? ()
#273 0x6873733d6c616e69 in ?? ()
#274 0x6375733d73657220 in ?? ()
#275 0x49551d2773736563 in ?? ()
#276 0x22746f6f72223d44 in ?? ()
#277 0x63223d4449554120 in ?? ()
#278 0x492022736f746e65 in ?? ()
#279 0x6f746e6563223d44 in ?? ()
#280 0x0000000000002273 in ?? ()
#281 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f9359cfbbc0 (LWP 2084989)):
#0  0x00007f9357b86d98 in nanosleep () from /lib64/libc.so.6
#1  0x00007f9357b86c9e in sleep () from /lib64/libc.so.6
#2  0x0000000000452bbd in flb_main ()
#3  0x0000000000452c2b in main ()

fdr commented Jan 28, 2022

@PettitWesley yeah, I think 1.8.12 has the patch you mention, but it doesn't do the trick:

git log --grep=openssl -i -n1 v1.8.12
commit 1e8203ee9a25ee156835e5de90251d2d7a4d9380
Author: Ramya Krishnamoorthy <[email protected]>
Date:   Fri Jan 21 16:40:03 2022 -0800

    tls: openssl: fix error handling for OpenSSL apis (#4584)
    
    Signed-off-by: Ramya Krishnamoorthy <[email protected]>

Also, looking at the patch, the label name relative to the condition it's checked against is... suspicious? WANT_READ goes to retry_read, and WANT_WRITE also goes to retry_read? I haven't read the surrounding context, just the commit message, and was wondering if it's an oversight.

     if (ret == FLB_TLS_WANT_READ) {
         goto retry_read;
     }
+    else if (ret == FLB_TLS_WANT_WRITE) {
+        goto retry_read;
+    }

fdr commented Feb 25, 2022

Picked up a few more of these. 1.8.12.

No symbol table info available.
#1  0x00007f6af95d44d2 in sock_read () from /lib64/libcrypto.so.10
No symbol table info available.
#2  0x00007f6af95d2337 in BIO_read () from /lib64/libcrypto.so.10
No symbol table info available.
#3  0x00007f6af99454e4 in ssl3_read_n () from /lib64/libssl.so.10
No symbol table info available.
#4  0x00007f6af9946cd5 in ssl3_read_bytes () from /lib64/libssl.so.10
No symbol table info available.
#5  0x00007f6af9943714 in ssl3_read_internal () from /lib64/libssl.so.10
No symbol table info available.
#6  0x00000000004f1613 in tls_net_read ()
No symbol table info available.
#7  0x00000000004f1b36 in flb_tls_net_read ()
No symbol table info available.
#8  0x00000000004fd527 in flb_io_net_read ()
No symbol table info available.
#9  0x00000000004ffa65 in flb_http_do ()
No symbol table info available.
#10 0x00000000005dce17 in request_do ()
No symbol table info available.
#11 0x00000000005dc321 in flb_aws_client_request ()
--Type <RET> for more, q to quit, c to continue without paging--c
No symbol table info available.
#12 0x0000000000595f62 in put_log_events ()
No symbol table info available.
#13 0x00000000005937a9 in send_log_events ()
No symbol table info available.
#14 0x0000000000594a97 in process_and_send ()
No symbol table info available.
#15 0x00000000005918fe in cb_cloudwatch_flush ()
No symbol table info available.
#16 0x00000000004c8d2c in output_pre_cb_flush ()

fdr commented Feb 25, 2022

Another one cropped up, 1.8.12.

#0  0x00007f8a9ab919e4 in read () from /lib64/libpthread.so.0
#1  0x00007f8a9a0794d2 in sock_read () from /lib64/libcrypto.so.10
#2  0x00007f8a9a077337 in BIO_read () from /lib64/libcrypto.so.10
#3  0x00007f8a9a3ea4e4 in ssl3_read_n () from /lib64/libssl.so.10
#4  0x00007f8a9a3ebdb0 in ssl3_read_bytes () from /lib64/libssl.so.10
#5  0x00007f8a9a3ed4e4 in ssl3_get_message () from /lib64/libssl.so.10
#6  0x00007f8a9a3e18f7 in ssl3_get_server_certificate () from /lib64/libssl.so.10
#7  0x00007f8a9a3e6732 in ssl3_connect () from /lib64/libssl.so.10
#8  0x00007f8a9a3f02fe in ssl23_connect () from /lib64/libssl.so.10
#9  0x00000000004f17c0 in tls_net_handshake ()
#10 0x00000000004f1fa2 in flb_tls_session_create ()
#11 0x00000000004fcbaa in flb_io_net_connect ()
#12 0x00000000004e0ecc in create_conn ()
#13 0x00000000004e143f in flb_upstream_conn_get ()
#14 0x00000000005dc77f in request_do ()
#15 0x00000000005dc321 in flb_aws_client_request ()
#16 0x0000000000595f62 in put_log_events ()
#17 0x00000000005937a9 in send_log_events ()
#18 0x0000000000594a97 in process_and_send ()
#19 0x00000000005918fe in cb_cloudwatch_flush ()
#20 0x00000000004c8d2c in output_pre_cb_flush ()
#21 0x000000000097c206 in co_init ()
#22 0x0000000000000000 in ?? ()

@PettitWesley
Contributor

@krispraws Does this match up with the SSL read fix? Or does this look like something different?

fdr commented Feb 25, 2022

Out of curiosity, how do you time things out when doing synchronous network I/O?

    /*
     * Remove async flag from upstream
     * CW output runs in sync mode; because the CW API currently requires
     * PutLogEvents requests to a log stream to be made serially
     */
    upstream->flags &= ~(FLB_IO_ASYNC);

fdr commented Mar 3, 2022

Alright, new information: I decided to see whether fluent-bit would un-stick if I injected some kind of fault at the OS level. Indeed, it does.

First, I ran ss and noticed that the connection to CloudWatch Logs over https was in the ESTAB state. Not too much of a surprise there.

I was going to use ss -K, except my kernel (AlmaLinux, a.k.a. RHEL) doesn't have CONFIG_INET_DIAG_DESTROY enabled, so I had to do something a bit more creative.

Using ss (or lsof), I took note of the file descriptor of the stalled connection, and used a GDB session like the following to turn on TCP keepalive and set a limit on the number of missed keepalives. The result: Linux promptly closed the connection (it had probably been dead a long time), some logs printed (notice the many days with no logs at all; they're posted after the GDB session), and logs began flowing once more.

# Linux socket-option constants, taken from the system headers
set variable $SOL_SOCKET = 1
set variable $SO_KEEPALIVE = 9
set variable $SOL_TCP = 6
set variable $TCP_KEEPIDLE = 4
set variable $TCP_KEEPINTVL = 5
set variable $TCP_KEEPCNT = 6

# A heap int holding 1, reused for every option below: 1s idle, 1s probe
# interval, 1 missed probe allowed, and keepalive enabled
set variable $one = (int*)malloc(sizeof(int))
p *$one = 1
# File descriptor of the stalled connection, found earlier with ss/lsof
set variable $fd = 42
p (int)setsockopt($fd, $SOL_TCP, $TCP_KEEPIDLE, $one, sizeof(int))
p (int)setsockopt($fd, $SOL_TCP, $TCP_KEEPINTVL, $one, sizeof(int))
p (int)setsockopt($fd, $SOL_TCP, $TCP_KEEPCNT, $one, sizeof(int))
p (int)setsockopt($fd, $SOL_SOCKET, $SO_KEEPALIVE, $one, sizeof(int))

Note the large gap in dates below; I ran this in GDB on Mar 03.

Feb 26 17:53:36 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/02/26 17:53:36] [error] [http_client] broken connection to logs.us-west-2.amazonaws.com:443 ?
Feb 26 17:53:36 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/02/26 17:53:36] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
Feb 26 17:53:36 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/02/26 17:53:36] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
Feb 26 17:53:36 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/02/26 17:53:36] [ warn] [engine] failed to flush chunk '2405476-1645898005.384637165.flb', retry in 8 seconds: task_id=1, input=syslog.3 > output=cloudwatch_logs.0>
Mar 03 05:19:47 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:47] [error] [aws_client] connection initialization error
Mar 03 05:19:47 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:47] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
Mar 03 05:19:47 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:47] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
Mar 03 05:19:48 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:48] [error] [aws_client] auth error, refreshing creds
Mar 03 05:19:48 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:48] [error] [output:cloudwatch_logs:cloudwatch_logs.0] PutLogEvents API responded with error='ExpiredTokenException', message='The security token inclu>
Mar 03 05:19:48 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:48] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
Mar 03 05:19:48 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:48] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
Mar 03 05:19:48 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:48] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=51594220 watch_fd=187
Mar 03 05:19:48 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:48] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream bjrjy5jarney5n6ize4lk7rhna-ssh in log group s6fnjf23q5ed3dpodkclgxryi4
Mar 03 05:19:48 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:48] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Log Stream bjrjy5jarney5n6ize4lk7rhna-ssh already exists
Mar 03 05:19:50 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:50] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream bjrjy5jarney5n6ize4lk7rhna-pgbouncer in log group s6fnjf23q5ed3dpodkclgxryi4
Mar 03 05:19:50 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:50] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Log Stream bjrjy5jarney5n6ize4lk7rhna-pgbouncer already exists
Mar 03 05:19:52 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:52] [ warn] [engine] failed to flush chunk '2405476-1645898248.888454056.flb', retry in 10 seconds: task_id=0, input=tail.1 > output=cloudwatch_logs.0 >
Mar 03 05:19:52 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:52] [ warn] [engine] failed to flush chunk '2405476-1645898248.888623740.flb', retry in 6 seconds: task_id=1, input=syslog.3 > output=cloudwatch_logs.0>
Mar 03 05:19:52 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:52] [ info] [input:tail:tail.1] inotify_fs_add(): inode=51560016 watch_fd=188 name=/var/log/audit/audit.log
Mar 03 05:19:54 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:54] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream bjrjy5jarney5n6ize4lk7rhna-audit in log group s6fnjf23q5ed3dpodkclgxryi4
Mar 03 05:19:54 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:54] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Log Stream bjrjy5jarney5n6ize4lk7rhna-audit already exists
Mar 03 05:19:59 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:59] [error] [http_client] broken connection to logs.us-west-2.amazonaws.com:443 ?
Mar 03 05:19:59 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:59] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
Mar 03 05:19:59 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:59] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
Mar 03 05:19:59 i-bjrjy5jarney5n6ize4lk7rhna td-agent-bit[2405476]: [2022/03/03 05:19:59] [ warn] [engine] chunk '2405476-1645898248.888623740.flb' cannot be retried: task_id=1, input=syslog.3 > output=cloudwatch_logs.0

So, if it's too hairy to enable asynchronous I/O at this time, you could consider TCP keepalive as a workaround, by running setsockopt a handful of times. Better than the plugin not working indefinitely, I think.
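
For concreteness, here is roughly what that handful of setsockopt calls could look like in C if done by the plugin itself right after the socket connects. This is just a sketch of the workaround, not fluent-bit code; the function name and the idle/interval/count values are made up for illustration.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn on TCP keepalive for an already-connected socket so that a dead
 * peer eventually surfaces as an error on the blocking read, instead of
 * hanging forever. Values are illustrative: start probing after 60s of
 * idle, probe every 10s, give up after 5 unanswered probes. */
static int enable_tcp_keepalive(int fd)
{
    int on = 1;
    int idle = 60, intvl = 10, cnt = 5;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0) {
        return -1;
    }
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}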

fdr commented Mar 15, 2022

@krispraws @PettitWesley hello folks...just wanted to raise this again. Seems bad that any connection problem stops logs forever, so long as the connection state stays in TCP ESTAB, which, in normal operation, it will.

PettitWesley commented Mar 16, 2022

Using ss (or lsof), I took note of the file descriptor of the stalled connection, and used a GDB session like the following to turn on TCP keepalive and set a limit on the number of missed keepalives. The result: Linux promptly closed the connection (it had probably been dead a long time), some logs printed (notice the many days with no logs at all; they're posted after the GDB session), and logs began flowing once more.

Ok, so please forgive me, I really don't know what I'm doing with all of this low-level networking stuff, but the PR #3096 that you linked should probably fix this, right?

Also, do you have any thoughts on whether this is related to: aws/aws-for-fluent-bit#293

I'm gonna build that PR and enable the new setting and see if it reduces the rate of connection errors.

@PettitWesley
Contributor

@krispraws I would appreciate your help and expertise on this

fdr commented Mar 16, 2022

I'll give a crash course on the low level network stuff...

If you have reasons not to use asynchronous I/O (and to implement per-syscall timeouts that way), you definitely want keepalive on. I consider aws/aws-for-fluent-bit#293 a very bad idea: without the errors generated when keepalive packets go unanswered, the client has no way to make progress (as seen in this bug report).

TCP keepalive via setsockopt (not to be confused with various application-level features of a similar name) could just as well have been named "makedead": it tells the operating system to helpfully close connections on which heartbeats cannot be exchanged (after some number of missed heartbeats, at some spacing, following some idle interval). With the operating system doing this, an error can be delivered to the application performing the otherwise-blocking system call. It so happens that exchanging those heartbeats also keeps sessions fresh in unforgiving routers, hence the name; it fulfills both purposes. False positives can occur, e.g. the network connection would in fact have recovered after some number of seconds, but a too-stringent keepalive setting jumps the gun on closing it.

So, alternatives. I think only the first one is worth entertaining at all, but I mention them all to give some background on how people have dealt with this problem.

  1. Use asynchronous I/O. When the select syscall times out (typically after a brief period), control returns to your program, and your own code has a chance to handle things, e.g. implementing a longer user-defined timeout and giving a helpful error message about which deadline in particular was violated (a minimal sketch of this pattern follows this list). But if you are happy with synchronous I/O, then it's not clear what implementing your own timeouts has going for it. That said, asynchronous I/O does seem to be the usual way things get done in fluentbit. From what I can tell, the TCP connection in question is dead and never coming back, so there's no viable error handling other than to terminate promptly and try again, which limits the appeal of this approach.
  2. Use some form of signaling, like pthread_cancel. The problem is that it can cause any syscall to return EINTR. I wouldn't do it: it necessarily lacks precision, and it's only really useful where one has limited control over a program or library, or knows something very special about how it is synchronized around its syscalls.
  3. Use multiple processes, so you can exit the timed-out process. Superficially similar to the second option in that it relies on signaling, but because the whole process is destroyed on a timeout, there's less potential for injected signals to cause accumulated error in a long-running supervisor. This option finds purchase as a way to avoid the very, very difficult (often impossible) task of making a process with significant code in it signal-safe.
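
To make option 1 concrete, here is a minimal sketch of a blocking read bounded by a poll() timeout. This is generic POSIX code under my own naming (read_with_timeout is not a fluent-bit function), not the actual fluent-bit event loop.

#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Read from fd, but give up after timeout_ms instead of blocking
 * indefinitely. Returns bytes read, 0 on EOF, or -1 on error/timeout
 * (errno is set to ETIMEDOUT when the deadline expires). */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready < 0) {
        return -1;              /* poll() itself failed */
    }
    if (ready == 0) {
        errno = ETIMEDOUT;      /* deadline expired: caller can report and retry */
        return -1;
    }
    return read(fd, buf, len);  /* data, EOF, or a socket error is ready now */
}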

As for #3096: I don't know fluentbit software engineering well, but if it provides a nice, generic way to apply keepalive and it works for cloudwatch_logs sockets, then it would do the trick. That said, I'd suggest making keepalive the default; otherwise, users who are not as strict about watching for expected log traffic are left in the dark for indeterminate periods. I went to the trouble of all this because I use fluentbit for storing auditing-related logs that must be retained.

PettitWesley commented Mar 16, 2022

Thanks for that explanation.

If you have reasons not to use asynchronous I/O

So the reason the cloudwatch_logs plugin uses sync IO (in contrast to basically all other plugins, which use async) is an unfortunate mismatch between Fluent Bit's concurrency model and the CloudWatch PutLogEvents API. The API expects a sequence token with each request that uploads logs to a log stream, and a successful request returns the next sequence token.

This doesn't match up well with Fluent Bit's concurrency model, which is partly explained here: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency

But basically, if I make an async request to PutLogEvents, then until it returns I can't upload any more logs to the same stream, because I need that sequence token. Hence I chose sync IO.
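
To illustrate the constraint with a rough sketch (the helper names here are hypothetical, not the plugin's real functions): every PutLogEvents call on a stream must carry the token returned by the previous call, which forces uploads to a given stream to be strictly serial.

/* Hypothetical sketch of the PutLogEvents sequencing constraint. */
struct batch;                                  /* an opaque chunk of log events */

/* stand-in for the PutLogEvents call: takes the previous sequence token and
 * returns the nextSequenceToken on success, or NULL on failure */
const char *put_log_events(const char *stream, struct batch *b,
                           const char *sequence_token);

static int send_batches_serially(const char *stream,
                                 struct batch **batches, int n)
{
    const char *token = NULL;   /* the first request to a fresh stream has no token */
    int i;

    for (i = 0; i < n; i++) {
        const char *next = put_log_events(stream, batches[i], token);
        if (next == NULL) {
            return -1;          /* request failed; retry later with the same token */
        }
        token = next;           /* the next request must present this token */
    }
    return 0;
}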

With workers, Eduardo changed how things work a bit, and I think you can now use a mutex lock in output plugins. So I could experiment with changing to async IO and having a lock on each log stream, so that if a request to that stream has already been issued, subsequent flush calls just block on the lock. This would probably be better than the current behavior, which is that every flush call just blocks until the request finishes. This is also why cloudwatch_logs only supports a single worker thread.

EDIT: Actually, I think the above doesn't work: if a new flush call blocks on a lock held by a now-inactive coroutine, you can get deadlock very easily. The worker thread is now blocked on the lock, and so the coroutine holding it won't ever wake up. This is the case where there is a single worker thread; with multiple workers, if all are sending to the same log stream, you can get it as well. A more complicated solution is needed, using custom logic in the fluent bit event loop/engine.

fdr commented Mar 16, 2022

Thanks for that explanation.

If you have reasons not to use asynchronous I/O

So the reason the cloudwatch_logs plugin uses sync IO (in contrast to basically all other plugins, which use async) is an unfortunate mismatch between Fluent Bit's concurrency model and the CloudWatch PutLogEvents API. The API expects a sequence token with each request that uploads logs to a log stream, and a successful request returns the next sequence token.

This doesn't match up well with Fluent Bit's concurrency model, which is partly explained here: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency

But basically, if I make an async request to PutLogEvents, then until it returns I can't upload any more logs to the same stream, because I need that sequence token. Hence I chose sync IO.

Well, it could be that synchronous I/O with keepalive is genuinely the best choice for quite a while, in that there may not be a profitable way to solve the problem for an acceptable level of complication (three syscalls).

Though, given the use of cooperative tasking, I think there are other effects here, besides better error messages, that you would want to address: the cloudwatch_logs output plugin sits on the cooperatively-tasked mutex you describe for perhaps hundreds to thousands of times longer than an async implementation would, only waiting for I/O. This might have interesting effects on the flow of logs to multiple outputs, or trigger log loss for input sources like syslog unix datagrams, among other things, since the fluentbit process is effectively frozen for far longer than designed.

lecaros commented Mar 31, 2022

Hi, we've recently merged some code fixes related to connection management. Is this issue still reproducible for you using 1.8.15 or 1.9.1?

fdr commented Mar 31, 2022

Hi, we've recently merged some code fixes related to connection management. Is this issue still reproducible for you using 1.8.15 or 1.9.1?

Can you summarize what has changed that may impact this problem? Evaluating this imposes some cost not only on me but on my team, and we have a pretty good idea what the problem is: system calls on blocking I/O are not obliged to return at any particular time (i.e. possibly never), fluent-bit and/or the cloudwatch output module can't time them out, and it can't schedule its other work items either, since cloudwatch sits on the cooperative dispatch forever. I was skimming the git commit logs and it's not obvious what would address this issue.

@lecaros
Copy link
Contributor

lecaros commented Apr 12, 2022

Hi @fdr, it would be great if you could provide a repro scenario for this using either 1.8.15 or 1.9.1.
Thanks!

lecaros added the waiting-for-user label (Waiting for more information, tests or requested changes) and removed the status: waiting-for-triage label on Apr 12, 2022
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions bot added the Stale label on Jul 12, 2022
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions bot added the Stale label on Oct 14, 2022
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Oct 19, 2022