
out_http: amd64.c:121: crash: Assertion `0' failed. #557

Closed
edsiper opened this issue Apr 7, 2018 · 78 comments

@edsiper
Member

edsiper commented Apr 7, 2018

Reported on #429, we will track this problem here:

[2018/04/07 01:44:14] [ info] [out_http] HTTP STATUS=200
[2018/04/07 01:44:15] [ info] [out_http] HTTP STATUS=200
fluent-bit: /tmp/src/lib/monkey/deps/flb_libco/amd64.c:121: crash: Assertion `0' failed.
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

[engine] caught signal
#0  0x7f6defdfa529      in  ???() at ???:0
#1  0x7f6defdf1e66      in  ???() at ???:0
#2  0x7f6defdf1f11      in  ???() at ???:0
#3  0x55994d80c389      in  ???() at ???:0
#4  0xffffffffffffffff  in  ???() at ???:0

This problem is associated with incorrect use of the co-routines implementation; most likely there is an unexpected, explicit return from a co-routine that triggers it.
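
For context, flb_libco follows the upstream libco contract that a co-routine's entry function must never return: it has to hand control back with co_switch() instead. A minimal sketch of that contract, using the plain libco API for illustration (this is not fluent-bit source):

#include <libco.h>

static cothread_t main_co;

static void worker(void)
{
    /* ... do the co-routine's work ... */

    co_switch(main_co);  /* correct: yield control back explicitly */

    /* If execution ever falls off the end of this function (an explicit
     * or implicit return), libco lands in its internal crash() handler,
     * which is the Assertion `0' failure reported from amd64.c:121 above. */
}

int main(void)
{
    main_co = co_active();
    cothread_t worker_co = co_create(65536, worker);

    co_switch(worker_co);  /* enter the co-routine */
    co_delete(worker_co);
    return 0;
}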

cc: @StevenACoffman @jgsqware @onorua

@edsiper
Member Author

edsiper commented Apr 9, 2018

@StevenACoffman @jgsqware @onorua

are you facing this crash only when the HTTP server is enabled?

@onorua
Contributor

onorua commented Apr 9, 2018

To be honest, I did not try disabling the HTTP server, as we use it to expose metrics.

@StevenACoffman
Contributor

StevenACoffman commented Apr 10, 2018

@edsiper We had not noticed excessive restarts previously, when the liveness and readiness probes were omitted. I will disable the HTTP server, remove the prometheus annotations and the liveness and readiness probes, start over, and let it run overnight.

@edsiper
Member Author

edsiper commented Apr 10, 2018

thanks. I was able to track down the issue in 0.13-dev; indeed the problem is in the HTTP server, and a fix will be available shortly.

note: If you see a restart when the HTTP server is off, that's a separate problem.

@jgsqware

jgsqware commented Apr 10, 2018 via email

@edsiper
Member Author

edsiper commented Apr 10, 2018

I've fixed the issues associated with the HTTP server that generated the crash. Everyone, please upgrade to the following Docker image:

  • fluent/fluent-bit-0.13-dev:0.16

@edsiper
Member Author

edsiper commented Apr 10, 2018

@StevenACoffman
Contributor

@edsiper FYI: after 24 hours running fluent/fluent-bit-0.13-dev:0.15 without the HTTP server, I only had 4 restarts on a single pod, and one on another. The other 20 pods had no issues. Not the same issue as what you fixed, but if you are curious, all 5 restarts were this error:

[2018/04/10 03:27:50] [ info] [out_http] HTTP STATUS=200
[engine] caught signal
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

#0  0x7ffb141f2e63      in  ???() at ???:0
#1  0x55de3bed7aec      in  ???() at ???:0
#2  0x55de3c0e7415      in  msgpack_pack_ext_body() at lib/msgpack-2.1.3/include/msgpack/pack_template.h:890
#3  0x55de3c0e7415      in  msgpack_pack_object() at lib/msgpack-2.1.3/src/objectc.c:72
#4  0x55de3bed84ce      in  ???() at ???:0
#5  0x55de3bed8d1c      in  ???() at ???:0
#6  0x55de3be7b3da      in  ???() at ???:0
#7  0x55de3be78e2c      in  ???() at ???:0
#8  0x55de3be7a98c      in  ???() at ???:0
#9  0x55de3be9e99d      in  ???() at ???:0
#10 0x55de3be9f7e8      in  ???() at ???:0
#11 0x55de3be9d34c      in  ???() at ???:0
#12 0x55de3bea3721      in  ???() at ???:0
#13 0x55de3be7aecb      in  ???() at ???:0
#14 0x55de3be81a86      in  ???() at ???:0
#15 0x55de3be21ef3      in  ???() at ???:0
#16 0x7ffb140ea2e0      in  ???() at ???:0
#17 0x55de3be20449      in  ???() at ???:0
#18 0xffffffffffffffff  in  ???() at ???:0

I have just applied the fluent/fluent-bit-0.13-dev:0.16 changes and will monitor it and let you know.

@StevenACoffman
Contributor

@edsiper fluent/fluent-bit-0.13-dev:0.16 with HTTP_Server On is in CrashLoopBackOff for all the containers it was applied to:

[2018/04/10 20:57:33] [ info] [out_http] HTTP STATUS=200
[2018/04/10 20:57:33] [ info] [out_http] HTTP STATUS=200
[engine] caught signal
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

#0  0x55a212d8e727      in  __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:87
#1  0x55a212d8e75e      in  mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
#2  0x55a212d8eb64      in  mk_http_thread_purge() at lib/monkey/mk_server/mk_http_thread.c:197
#3  0x55a212d8e8e6      in  thread_cb_init_vars() at lib/monkey/mk_server/mk_http_thread.c:104
#4  0x55a212d995e6      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#5  0xffffffffffffffff  in  ???() at ???:0

Some fail immediately, some run for a minute, but then they all crash with the same error.

@edsiper
Member Author

edsiper commented Apr 10, 2018

thanks for the info; troubleshooting.

solsson added a commit to Yolean/fluent-bit-kubernetes-logging that referenced this issue Apr 12, 2018
@marckamerbeek

marckamerbeek commented Apr 16, 2018

I was hesitant to start a new issue... but I faced some crashes as well.

Version: fluent/fluent-bit-0.13-dev:0.16
Kubernetes: 1.8.4

[engine] caught signal
Fluent-Bit v0.13.0
Copyright (C) Treasure Data
#0  0x7f59d36f0f08      in  ???() at ???:0
#1  0x56214c2c0e8b      in  ???() at ???:0
#2  0x56214c2bf538      in  ???() at ???:0
#3  0x56214c30eebe      in  ???() at ???:0
#4  0x56214c30f1a5      in  ???() at ???:0
#5  0x56214c30fd7c      in  ???() at ???:0
#6  0x56214c2b243a      in  ???() at ???:0
#7  0x56214c2afe8c      in  ???() at ???:0
#8  0x56214c2b19ec      in  ???() at ???:0
#9  0x56214c2d59fd      in  ???() at ???:0
#10 0x56214c2d6848      in  ???() at ???:0
#11 0x56214c2d4237      in  ???() at ???:0
#12 0x56214c2b1f2b      in  ???() at ???:0
#13 0x56214c2b8ae6      in  ???() at ???:0
#14 0x56214c258f53      in  ???() at ???:0
#15 0x7f59d35f02e0      in  ???() at ???:0
#16 0x56214c2574a9      in  ???() at ???:0
#17 0xffffffffffffffff  in  ???() at ???:0

There are several hundred pods running, but it was caused by only one pod. It's an nginx pod. I changed its log format to JSON, and then Fluent Bit started to crash on that particular node.

I figured out that the JSON was not correct. The JSON parser of Fluent Bit crashes without giving any clue. The C implementation of the JSON parser is probably a limited one.

I attached a file with the wrong log and one with the fixed one. This way it might be reproducible.

[wrong_json.log](https://github.com/fluent/fluent-bit/files/1914262/wrong_json.log)
[fixed_json.log](https://github.com/fluent/fluent-bit/files/1914265/fixed_json.log)
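
For reference, Fluent Bit's JSON tokenizer is jsmn (see the JSMN_* build flags later in this thread). jsmn itself signals malformed or truncated input through a negative return code that the caller has to check; a minimal, illustrative sketch with the plain jsmn API (not fluent-bit code, and the sample record is hypothetical):

#include <stdio.h>
#include <string.h>
#include "jsmn.h"

int main(void)
{
    /* Hypothetical truncated nginx-style record: the JSON is cut off. */
    const char *js = "{\"status\": 200, \"upstream\":";
    jsmn_parser parser;
    jsmntok_t tokens[32];

    jsmn_init(&parser);
    int rc = jsmn_parse(&parser, js, strlen(js), tokens, 32);
    if (rc < 0) {
        /* JSMN_ERROR_PART / JSMN_ERROR_INVAL: treat the line as a plain
         * string instead of handing broken structure to the next stage. */
        printf("invalid JSON (jsmn rc=%d), keeping record as raw text\n", rc);
        return 1;
    }
    printf("parsed %d tokens\n", rc);
    return 0;
}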





@edsiper
Member Author

edsiper commented Apr 16, 2018

@marckamerbeek what you have reported looks like a different issue. I've filed #567 for your case; let's follow up there.

@jgsqware

We are currently experiencing the same since we upgraded to 0.16.

@edsiper
Member Author

edsiper commented Apr 16, 2018

would you please try the following image, which has several fixes on the HTTP server side?

edsiper/fluent-bit-0.13-next:0.17

note: this image is 150 MB in size and should be used only to try to reproduce the main problem in question

@jgsqware

Currently, no restart with your version.

@jgsqware

10 minutes without a restart. I'll let it run all night long and keep you posted.

@edsiper
Member Author

edsiper commented Apr 16, 2018

thanks @jgsqware

@StevenACoffman
Contributor

StevenACoffman commented Apr 16, 2018

@edsiper I have applied the change and am going to leave it running overnight as well. Here are results from one hour:

$ kubectl get pods -n kangaroo -l k8s-app=fluent-bit-logging
NAME               READY     STATUS    RESTARTS   AGE
fluent-bit-24c7c   3/3       Running   4          1h
fluent-bit-2f4ph   3/3       Running   0          1h
fluent-bit-884lq   3/3       Running   0          1h
fluent-bit-ccbp8   3/3       Running   1          1h
fluent-bit-fmq4k   2/3       Running   0          1h
fluent-bit-g7mjq   3/3       Running   10         1h
fluent-bit-mk5cf   3/3       Running   0          1h
fluent-bit-mtqhc   3/3       Running   0          1h
fluent-bit-pd629   3/3       Running   0          1h
fluent-bit-pfgsv   3/3       Running   2          1h
fluent-bit-s2phj   2/3       Running   0          1h
fluent-bit-tsbv5   3/3       Running   0          1h

kubectl logs -n kangaroo fluent-bit-g7mjq -c fluent-bit --previous results:

[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200
[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200
[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200
[engine] caught signal
[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200
[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200
[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200

All the pods with restarts exhibit the same behavior. Sometimes [engine] caught signal is the last log message, but generally it appears a few lines before the end.

@StevenACoffman
Contributor

StevenACoffman commented Apr 16, 2018

If it is helpful, here are the startup logs:

[2018/04/16 23:14:56] [ info] [engine] started
[2018/04/16 23:14:56] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/16 23:14:56] [ info] [filter_kube] local POD info OK
[2018/04/16 23:14:56] [ info] [filter_kube] testing connectivity with API server...
[2018/04/16 23:14:56] [ info] [filter_kube] API server connectivity OK
[2018/04/16 23:14:56] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2018/04/16 23:14:58] [ info] [out_http] HTTP STATUS=200

@StevenACoffman
Contributor

With log level set to debug:

[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [input tail.0] [mem buf] size = 5212894
[2018/04/16 23:20:56] [debug] [input] tail.0 paused (mem buf overlimit)
[2018/04/16 23:20:56] [debug] [in_tail] file=/var/log/containers/fluent-bit-g4nst_kangaroo_fluent-bit-5327b6357ef3287060d07e9ff6523f6ec9c592b5cd7d085617605cdd197bc631.log read=32753 lines=214
[2018/04/16 23:20:56] [ info] [out_http] HTTP STATUS=200
[2018/04/16 23:20:56] [debug] [task] created task=0x563255dcbd10 id=2 OK
[2018/04/16 23:20:57] [ info] [out_http] HTTP STATUS=200

@StevenACoffman
Contributor

@edsiper That looks like it was generated by

flb_error("[out_http] %s:%i, HTTP status=%i",

@StevenACoffman
Contributor

StevenACoffman commented Apr 20, 2018

It looks like HTTP STATUS was changed to HTTP status in defac78, so I think this is the latest version.

@StevenACoffman
Contributor

Although I do see:

[2018/04/20 18:35:02] [ warn] [filter_kube] annotation parser 'json' not found (ns='teachers' pod_name='go-spew-5d99bbd878-5jq8p')

Which you changed in b4b1e09, so maybe it is an older version that got pushed to that Docker tag?

I did see one that got this message before it crashed with the same error:

[2018/04/20 18:36:23] [error] [out_http] no upstream connections available

@StevenACoffman
Contributor

StevenACoffman commented Apr 20, 2018

Ah, right. The configmap overwrote the new parser.conf with the old json-test, so I needed to update it there to match. I can't tell how to verify that I am not using an older version. When I run kubectl exec -it fluent-bit-ncv9l -n kangaroo -c fluent-bit -- /bin/sh and then /fluent-bit/bin/fluent-bit --version, I get Fluent Bit v0.13.0.

When I run ./fluent-bit --sosreport I get:

Fluent-Bit v0.13.0
Copyright (C) Treasure Data

[2018/04/20 19:04:35] [ info] [engine] started

Fluent Bit Enterprise - SOS Report
==================================
The following report aims to be used by Fluent Bit and Fluentd Enterprise
Customers of Treasure Data. For more details visit:

    https://fluentd.treasuredata.com


[Fluent Bit]
    Edition		Community Edition
    Version		0.13.0
    Built Flags		 JSMN_PARENT_LINKS JSMN_STRICT FLB_HAVE_TLS FLB_HAVE_SQLDB FLB_HAVE_BUFFERING FLB_HAVE_METRICS FLB_HAVE_HTTP_SERVER FLB_HAVE_FLUSH_LIBCO FLB_HAVE_SYSTEMD FLB_HAVE_FORK FLB_HAVE_TIMESPEC_GET FLB_HAVE_PROXY_GO FLB_HAVE_JEMALLOC JEMALLOC_MANGLE FLB_HAVE_LIBBACKTRACE FLB_HAVE_REGEX FLB_HAVE_C_TLS FLB_HAVE_ACCEPT4 FLB_HAVE_INOTIFY

[Operating System]
    Name		Linux
    Release		4.4.0-1020-aws
    Version		#29-Ubuntu SMP Wed Jun 14 15:54:52 UTC 2017

[Hardware]
    Architecture	x86_64
    Processors		2

[Built Plugins]
    Inputs              cpu mem kmsg tail proc disk systemd netif dummy head exec health serial stdin tcp mqtt lib forward random syslog
    Filters             grep stdout throttle kubernetes parser record_modifier
    Outputs             counter es exit file forward http influxdb kafka kafka-rest nats null plot splunk stdout td lib flowcounter

[SERVER] Runtime configuration
    Flush		5
    Daemon		Off
    Log_Level		Info

@StevenACoffman
Contributor

StevenACoffman commented Apr 20, 2018

With HTTP_Server Off, the prometheus annotations removed, and the liveness and readiness probes removed, I got a restart at 44 minutes on a single pod:

$ POD_ID=fluent-bit-bqlsb;kubectl logs -n kangaroo $POD_ID -c fluent-bit --previous
[2018/04/20 19:57:54] [ info] [engine] started
[2018/04/20 19:57:54] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/20 19:57:54] [ info] [filter_kube] local POD info OK
[2018/04/20 19:57:54] [ info] [filter_kube] testing connectivity with API server...
[2018/04/20 19:57:55] [ info] [filter_kube] API server connectivity OK
[2018/04/20 19:57:55] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[engine] caught signal
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

#0  0x55ae209bf572      in  msgpack_pack_object() at lib/msgpack-2.1.3/src/objectc.c:103
#1  0x55ae207aeaeb      in  ???() at ???:0
#2  0x55ae207af339      in  ???() at ???:0
#3  0x55ae2075179a      in  ???() at ???:0
#4  0x55ae2074f1ec      in  ???() at ???:0
#5  0x55ae20750d4c      in  ???() at ???:0
#6  0x55ae20774f4d      in  ???() at ???:0
#7  0x55ae20775d98      in  ???() at ???:0
#8  0x55ae20773787      in  ???() at ???:0
#9  0x55ae2075128b      in  ???() at ???:0
#10 0x55ae20757ee2      in  ???() at ???:0
#11 0x55ae206f82b3      in  ???() at ???:0
#12 0x7fcde44382e0      in  ???() at ???:0
#13 0x55ae206f6809      in  ???() at ???:0
#14 0xffffffffffffffff  in  ???() at ???:0

@StevenACoffman
Contributor

After more than two hours, I have no further restarts with the HTTP server off. I will let it run overnight, and until Monday.

@edsiper
Member Author

edsiper commented Apr 20, 2018

thanks for the report. Focusing on this specific error:

#0  0x55ae209bf572      in  msgpack_pack_object() at lib/msgpack-2.1.3/src/objectc.c:103

my assumption is that some invalid data was passed to the msgpack-c library, and it looks closely related to #567, where passing an invalid map (in this case a missing value for a key) generates a small corruption.
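
To illustrate the kind of corruption meant here, a minimal sketch with the plain msgpack-c API (illustrative only, not the fluent-bit code path): a map header that promises more entries than are actually packed leaves a buffer that later consumers cannot walk safely.

#include <msgpack.h>

int main(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer pck;

    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pck, &sbuf, msgpack_sbuffer_write);

    /* Header says: map with 2 key/value pairs, i.e. 4 objects follow. */
    msgpack_pack_map(&pck, 2);

    msgpack_pack_str(&pck, 3);
    msgpack_pack_str_body(&pck, "key", 3);
    msgpack_pack_int(&pck, 1);

    /* Deliberate bug: the second key is written but its value is not,
     * so the buffer is short relative to its own header. A consumer that
     * trusts the declared size and walks two key/value pairs ends up
     * reading data that was never written -- the small corruption
     * described above. */
    msgpack_pack_str(&pck, 7);
    msgpack_pack_str_body(&pck, "missing", 7);

    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}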

I have done two changes:

  • upgraded the JSON parser, which fixes the problem (with a custom PR on their repo)
  • double-check the unpacking status

Please test the following new image:

edsiper/fluent-bit-0.13-next:0.18

@StevenACoffman
Contributor

With the HTTP server on, metrics on, and liveness probes applied, they are all in CrashLoopBackOff after running for only a few minutes:

[2018/04/20 23:17:43] [ info] [engine] started
[2018/04/20 23:17:43] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/20 23:17:43] [ info] [filter_kube] local POD info OK
[2018/04/20 23:17:43] [ info] [filter_kube] testing connectivity with API server...
[2018/04/20 23:17:48] [ info] [filter_kube] API server connectivity OK
[2018/04/20 23:17:48] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[engine] caught signal

@edsiper
Member Author

edsiper commented Apr 20, 2018

are you able to see this issue again?

#0  0x55ae209bf572      in  msgpack_pack_object() at lib/msgpack-2.1.3/src/objectc.c:103

@StevenACoffman
Contributor

I got this once:

[engine] caught signal
[2018/04/20 23:22:47] [ warn] [input] cannot disable event for tail.0
[2018/04/20 23:22:47] [ warn] [input] cannot disable event for tail.0
[2018/04/20 23:22:47] [ warn] [input] cannot disable event for tail.0
[2018/04/20 23:22:48] [ warn] [engine] service will stop in 5 seconds
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

@StevenACoffman
Contributor

When I disable the HTTP Server, I'm still getting similar immediate crash behavior.

@edsiper
Member Author

edsiper commented Apr 22, 2018

@StevenACoffman

I've pushed the following new image:

edsiper/fluent-bit-0.13-next:0.18-2

the co-routines library was not built with the required pre-processor flags to enable multithreading support; hence, when the HTTP server was started (in a separate thread) and spawned multiple co-routines while the Fluent Bit outputs were doing the same, it led to data corruption.

After adding the missing flags I was not able to reproduce the crash. Would you please double-check on your end?
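
A minimal sketch of the failure mode being described, under the assumption that the co-routine library keeps its "currently active co-routine" in library-level state (the exact pre-processor flag flb_libco needs is not shown here): if that state is a plain global shared by the HTTP-server thread and the engine thread, the threads overwrite each other's value, while a thread-local slot keeps them isolated.

#include <pthread.h>
#include <stdio.h>

static void *global_active;           /* shared by all threads: racy    */
static __thread void *thread_active;  /* one slot per thread: isolated  */

static void *worker(void *arg)
{
    global_active = arg;   /* both threads write the same slot          */
    thread_active = arg;   /* each thread writes its own slot           */
    printf("thread %p sees global=%p thread_local=%p\n",
           arg, global_active, thread_active);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, worker, (void *) 0x1);
    pthread_create(&b, NULL, worker, (void *) 0x2);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}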

@StevenACoffman
Contributor

With edsiper/fluent-bit-0.13-next:0.18-2, HTTP_Server On, the prometheus annotations applied, and the liveness and readiness probes applied, things are more stable:

$ kubectl get pods -n kangaroo -l k8s-app=fluent-bit-logging
NAME               READY     STATUS             RESTARTS   AGE
fluent-bit-2jwj5   3/3       Running            0          51m
fluent-bit-4nksl   3/3       Running            0          51m
fluent-bit-56dl2   3/3       Running            0          51m
fluent-bit-92qvj   3/3       Running            0          51m
fluent-bit-crvhn   3/3       Running            0          51m
fluent-bit-fnqqp   3/3       Running            2          51m
fluent-bit-lrlpx   3/3       Running            0          51m
fluent-bit-m2ghf   2/3       CrashLoopBackOff   15         51m
fluent-bit-q56gt   3/3       Running            0          51m
fluent-bit-s7rvz   3/3       Running            0          51m
fluent-bit-x2vh5   3/3       Running            0          51m
fluent-bit-xkncq   3/3       Running            0          51m
$ POD_ID=fluent-bit-m2ghf; kubectl logs -n kangaroo $POD_ID -c fluent-bit --previous
[2018/04/22 13:16:35] [ info] [engine] started
[2018/04/22 13:16:36] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/22 13:16:36] [ info] [filter_kube] local POD info OK
[2018/04/22 13:16:36] [ info] [filter_kube] testing connectivity with API server...
[2018/04/22 13:16:36] [ info] [filter_kube] API server connectivity OK
[2018/04/22 13:16:36] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[engine] caught signal
$ POD_ID=fluent-bit-fnqqp; kubectl logs -n kangaroo $POD_ID -c fluent-bit --previous
[2018/04/22 12:24:39] [ info] [engine] started
[2018/04/22 12:24:39] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/22 12:24:39] [ info] [filter_kube] local POD info OK
[2018/04/22 12:24:39] [ info] [filter_kube] testing connectivity with API server...
[2018/04/22 12:24:39] [ info] [filter_kube] API server connectivity OK
[2018/04/22 12:24:39] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2018/04/22 12:24:40] [ warn] [filter_kube] annotation parser 'json' not found (ns='kangaroo' pod_name='go-spew-pgm52')
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[engine] caught signal

@StevenACoffman
Contributor

I adjusted that one pod that was in CrashLoopBackOff to Log_Level debug:

$ POD_ID=fluent-bit-ngmph; kubectl logs -n kangaroo $POD_ID -c fluent-bit
[2018/04/22 14:18:51] [ info] [engine] started
[2018/04/22 14:18:51] [debug] [in_tail] inotify watch fd=20
[2018/04/22 14:18:51] [debug] [in_tail] scanning path /var/log/containers/*.log
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/consortium-books-holdings-1524268140-q9sp5_teachers_consortium-books-holdings-4807e8810c330d5a9d69f239c9c2480ece3d1d0f9156626bb7c31fef2e5d975d.log, offset=382774
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/eviz-hackathon-77fbb76f-2xw6v_teachers_eureka-yup-9726a3ba513d5a05d064d7c3b8b33a95ecd480388c457eb6ea0cf57ab9c91a3b.log, offset=1091595
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/eviz-hackathon-77fbb76f-2xw6v_teachers_eureka-yup-e0125f0f07d00e2e9b3749b60f8bf0654214096564dd4725c86ec58c0497e056.log, offset=229024
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/eviz-hackathon-77fbb76f-2xw6v_teachers_eviz-hackathon-e969ec8e57c1d14ce5f9a208297e815a539863ae161f2e46081c489874e3bb46.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/fluent-bit-ngmph_kangaroo_fluent-bit-0065e2781b06f877ae5207d0eeac9a7c1a3c592f9c7f034fbea0018ec221701a.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/fluent-bit-ngmph_kangaroo_fluent-bit-7d05c4331e95df14fccdea3732be855e975c64854aa88c5687f7952721d5291b.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/fluent-bit-ngmph_kangaroo_go-kafka-logsink-af433108645cd7378b5f435c20480a084af395b4a5236b4f7f341139fd7ad775.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/fluent-bit-ngmph_kangaroo_go-s3-logsink-ab15461b1abc4064b5bee72b73c6670e116513138ec32969e784d538b2f5a21d.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/go-spew-5d99bbd878-5jq8p_teachers_go-spew-2796b3ae129b6f38835e28724ce92d0b08ee16b248fa8630de0f68e06e6cc062.log, offset=7495520
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/go-spew-gbbzk_kangaroo_go-spew-71d3b944a5a9c944409d2356aca4a3706b3435b4fb93823afe82f12c4a04a119.log, offset=301401
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/hive-metastore-65467bc98f-2tgbm_teachers_hive-metastore-ff0fbfc32857f67a512a9096d5ee72477a2fcae99b56d5074a0e6d842a3a504e.log, offset=847530
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/ingress-nginx-1911116232-jrq5t_kube-ingress_dashboard-ingress-controller-b513dbbf44d2b9639d85b1b9c661aa28ac8a798be40035526bca0785a615de5d.log, offset=3899152
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/jenkins-slave-pkgd2-mb2s9_teachers_curl-47d37a334a4fe421f2a9db10969756046d1013efe78e9b3d5a5f0302a4062e5c.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/jenkins-slave-pkgd2-mb2s9_teachers_docker-aaf0fa3cb424d1307b62ac8b26efd36e73ad219ff57557bf88e53e4a0c4075bb.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/jenkins-slave-pkgd2-mb2s9_teachers_jnlp-93828032008ad110576779239086784470d2c0e4702e160c21701acf3abab963.log, offset=7985
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/jenkins-slave-pkgd2-mb2s9_teachers_kubectl-faadee10626ff802d1833ab5e553825d5141a7449a020875062371f546064376.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/kube-proxy-ip-172-28-154-19.ec2.internal_kube-system_kube-proxy-8dea9e1e6f7b0837857a51efc732a0489d9fe67f13ead0454f016cc581c63a0e.log, offset=179002
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/kube2iam-bxjp9_kube-system_kube2iam-c42d3e90ca4ae7089c946cdfdde1532f8c8e2d6f949bc12a271ac9356fb52b51.log, offset=130367
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/licensing-service-dstar-2598-e3c971-6ccd87468-5xk98_teachers_go-eureka-2dcbe2818580157b4f6880f54a594859fd2f99583886362ff0da6f9d65f0d22a.log, offset=4548006
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/licensing-service-dstar-2598-e3c971-6ccd87468-5xk98_teachers_licensing-service-dstar-2598-e3c971-2871d75cdaa1ef18acaaeecae012cb3d619ecbc9cc287c2e6f53325fd3004452.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/licensing-service-dstar-2598-e3c971-6ccd87468-5xk98_teachers_licensing-service-dstar-2598-e3c971-74eb7c39bc5e375a5d613117c1a4372c3d417a09adb69b6d50d1fa06f6c34107.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/selenium-python-6479976d89-tfg9x_teachers_selenium-python-c22ceee7b235c9310ec00920d8716b72911a8c0820d77b77a1e9cb35788f5309.log, offset=221
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/servicegraph-7677cbf586-qxszr_istio-system_servicegraph-8dd37c54695619087e413e9a9160c9274104d5e89795fc37d54eaf85cdb7cf56.log, offset=129
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_backup-sidecar-144b980cb69f453cd5254e7752b9676c6224c635e4255f1e8c0ad43e702afc3a.log, offset=1218185
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_eureka-yup-5837c6caafbab943a2266e10bb63aa2c888d8b011212b21d5134958a5ed74545.log, offset=5638801
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_eureka-yup-79e46f557cfd8b201296e93012b8781cead883c49411fe747b322840fdf1dd18.log, offset=65418
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_restore-settings-51cd10ed588a7e477766adacb4abc21b6ebf3da0e4148949e47d192ebf5fa724.log, offset=374
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_restore-settings-8e9e10aa191a12184e49919f7ebc1915a8da858dc0a5064b04ad59dd2cb4204a.log, offset=374
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_restore-settings-8fc4ead750d8022aa26a9bbb987f270851e12da6f946ae126ebbd074bfc28235.log, offset=374
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_restore-settings-c2d228943e009424bc7950b526bc555beb9debb3472b2664ac63aca9bec77d7a.log, offset=374
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_sonarqube-04431a036b0aec39da0e661a5b76961c59443905450a840fef1c9d91a96dfb31.log, offset=355692
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/test-nginix-release-nginx-ingress-controller-4c2r8_kangaroo_nginx-ingress-controller-99ddb30b5661fb5a3a90b441b9494b58be44f0a8f3724e5a717d4309b97b7501.log, offset=65346
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/test-prometheus-prometheus-node-exporter-9xjt6_kangaroo_prometheus-node-exporter-59f18e0015509c0fe053c685e6d13b35ca316160596a3f777a7eef4380284854.log, offset=5335
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/weave-net-gtc5g_kube-system_weave-c3d87840f863d35d7006baacbfa6053ba97f18c6fcd96d35268e211658f36b41.log, offset=5198496
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/weave-net-gtc5g_kube-system_weave-npc-f307ee0f21479416fd7e27a333da6d05028be8d601d8193724d075196c338c7c.log, offset=56754
[2018/04/22 14:18:51] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/22 14:18:51] [ info] [filter_kube] local POD info OK
[2018/04/22 14:18:51] [ info] [filter_kube] testing connectivity with API server...
[2018/04/22 14:18:51] [debug] [filter_kube] API Server (ns=kangaroo, pod=fluent-bit-ngmph) http_do=0, HTTP Status: 200
[2018/04/22 14:18:51] [ info] [filter_kube] API server connectivity OK
[2018/04/22 14:18:51] [debug] [router] input=tail.0 'DYNAMIC TAG'
[2018/04/22 14:18:51] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[engine] caught signal

@StevenACoffman
Contributor

With edsiper/fluent-bit-0.13-next:0.18-2, HTTP_Server Off, the prometheus annotations removed, the liveness and readiness probes removed, and Log_Level info, whatever pod gets scheduled to that node still has a problem:

$ kubectl get pods -n kangaroo -l k8s-app=fluent-bit-logging
NAME               READY     STATUS             RESTARTS   AGE
fluent-bit-2k4d7   3/3       Running            0          1m
fluent-bit-485r8   3/3       Running            0          1m
fluent-bit-87wzl   3/3       Running            0          1m
fluent-bit-btf6t   3/3       Running            0          1m
fluent-bit-chns9   3/3       Running            0          1m
fluent-bit-dfzsz   3/3       Running            0          1m
fluent-bit-dsd6v   3/3       Running            0          1m
fluent-bit-gjjhb   2/3       CrashLoopBackOff   3          1m
fluent-bit-j4sg2   3/3       Running            0          1m
fluent-bit-pd4h9   3/3       Running            0          1m
fluent-bit-rn9n8   3/3       Running            0          1m
fluent-bit-tvh4k   3/3       Running            0          1m

I don't see an obvious error message:

$ POD_ID=fluent-bit-gjjhb; kubectl logs -n kangaroo $POD_ID -c fluent-bit
[2018/04/22 14:27:37] [ info] [engine] started
[2018/04/22 14:27:37] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/22 14:27:37] [ info] [filter_kube] local POD info OK
[2018/04/22 14:27:37] [ info] [filter_kube] testing connectivity with API server...
[2018/04/22 14:27:37] [ info] [filter_kube] API server connectivity OK
[engine] caught signal

@StevenACoffman
Contributor

@edsiper I am not sure what's going on with that one pod, but I have not seen any other restarts after 30 minutes.

@edsiper
Member Author

edsiper commented Apr 23, 2018

@StevenACoffman

I've pushed a new 0.13 dev image:

fluent/fluent-bit-0.13-dev:0.18

this version merges the latest changes and also adds some minor information about the signal being trapped (which is fundamental to continue troubleshooting); please give it a try...

@StevenACoffman
Contributor

With fluent/fluent-bit-0.13-dev:0.18 and HTTP_Server Off, I have no restarts after an hour:

$ kubectl get pods -n kangaroo -l k8s-app=fluent-bit-logging
NAME               READY     STATUS    RESTARTS   AGE
fluent-bit-2g259   3/3       Running   0          1h
fluent-bit-2l998   3/3       Running   0          1h
fluent-bit-62zvd   3/3       Running   0          1h
fluent-bit-6pjh7   3/3       Running   0          1h
fluent-bit-72dm4   3/3       Running   0          1h
fluent-bit-8rggc   3/3       Running   0          1h
fluent-bit-8tx9c   3/3       Running   0          1h
fluent-bit-fdxqs   3/3       Running   0          1h
fluent-bit-g7rgp   3/3       Running   0          1h
fluent-bit-nf76d   3/3       Running   0          1h
fluent-bit-rrz6n   3/3       Running   0          1h
fluent-bit-zhn9w   3/3       Running   0          1h

I'm going to turn the HTTP_Server On, with prometheus annotations and liveness/readiness checks applied.

@StevenACoffman
Contributor

StevenACoffman commented Apr 23, 2018

Well, so far, so good. With fluent/fluent-bit-0.13-dev:0.18 and HTTP_Server On, with prometheus annotations, liveness probes, and readiness probes applied, I'm not seeing any restarts:

$ kubectl get pods -n kangaroo -l k8s-app=fluent-bit-logging
NAME               READY     STATUS    RESTARTS   AGE
fluent-bit-889qs   3/3       Running   0          8h
fluent-bit-8qj5m   3/3       Running   0          8h
fluent-bit-gwg5c   3/3       Running   0          8h
fluent-bit-jj8b5   3/3       Running   0          8h
fluent-bit-k6bmx   3/3       Running   0          8h
fluent-bit-nqspb   3/3       Running   0          8h
fluent-bit-nqt99   3/3       Running   0          8h
fluent-bit-rqkjs   3/3       Running   0          8h
fluent-bit-s7pnv   3/3       Running   0          8h
fluent-bit-tf622   3/3       Running   0          8h
fluent-bit-wq5h9   3/3       Running   0          8h
fluent-bit-xqvhm   3/3       Running   0          8h

I will continue to monitor it, and I have some additional clusters I can apply this to. Nice work!

@StevenACoffman
Contributor

StevenACoffman commented Apr 23, 2018

Please disregard this last error message (since deleted). I mistakenly reverted to an older version.

@edsiper
Member Author

edsiper commented Apr 23, 2018 via email

@StevenACoffman
Contributor

I let it run in three clusters overnight, and I have had no more restarts. I think this issue can be closed as fixed! Thank you!

@jgsqware

Same for me! Thanks

@edsiper
Member Author

edsiper commented Apr 24, 2018

thanks to everyone who helped troubleshoot this issue! Now it's time to close it :)

Fixed.

@marckamerbeek

Great job Eduardo!

micklonge pushed a commit to micklonge/elk_fluent-bit that referenced this issue Jun 26, 2018
When using multiple outputs of the same type (e.g. http), if there is
a problem when flushing data to one of them, some retries are enqueued
into the scheduler. But if the other output succeeds, it does not take
into account that some 'retries' still exist and removes the 'task',
generating a corruption in memory. Configuration example:

  [SERVICE]
    Flush     1
    Log_Level info

  [INPUT]
    Name      random
    Samples   1

  [OUTPUT]                 <--- http:0
    Name      http
    Match     *

  [OUTPUT]                 <--- http:1
    Name      http
    Match     *

If, when starting Fluent Bit, both http:0 and http:1 are down, the task
generated by the 'random' input will create two retries (one for each
output); but if http:0 comes back up and succeeds, it will destroy the
task, leading to corruption in the retry that still exists for http:1.

This patch makes sure to avoid deleting a task if some retries are
still in use by the scheduler.

Signed-off-by: Eduardo Silva <[email protected]>
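
A sketch of the guard this patch describes, with hypothetical names (these are not the real fluent-bit structures): a task may only be destroyed once no output is flushing it and no retry for it is still queued in the scheduler.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical types for illustration only. */
struct demo_task {
    int users;    /* outputs currently flushing this task                */
    int retries;  /* retries for this task still queued in the scheduler */
};

/* Destroy the task only when no output and no scheduled retry reference it. */
static int demo_task_destroy_if_unused(struct demo_task *task)
{
    if (task->users > 0 || task->retries > 0) {
        return 0;  /* still referenced: keep the task and its data alive */
    }
    free(task);
    return 1;
}

int main(void)
{
    struct demo_task *t = calloc(1, sizeof(*t));

    /* http:0 succeeded (no active users) but http:1 still has a retry queued. */
    t->users = 0;
    t->retries = 1;
    printf("destroyed now? %d\n", demo_task_destroy_if_unused(t));         /* 0 */

    /* Once the retry for http:1 completes, the task can finally go away. */
    t->retries = 0;
    printf("destroyed after retry? %d\n", demo_task_destroy_if_unused(t)); /* 1 */
    return 0;
}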