
out_http: amd64.c:121: crash: Assertion `0' failed. #557

Closed
edsiper opened this issue Apr 7, 2018 · 78 comments

@edsiper
Member

edsiper commented Apr 7, 2018

Reported on #429, we will track this problem here:

[2018/04/07 01:44:14] [ info] [out_http] HTTP STATUS=200
[2018/04/07 01:44:15] [ info] [out_http] HTTP STATUS=200
fluent-bit: /tmp/src/lib/monkey/deps/flb_libco/amd64.c:121: crash: Assertion `0' failed.
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

[engine] caught signal
#0  0x7f6defdfa529      in  ???() at ???:0
#1  0x7f6defdf1e66      in  ???() at ???:0
#2  0x7f6defdf1f11      in  ???() at ???:0
#3  0x55994d80c389      in  ???() at ???:0
#4  0xffffffffffffffff  in  ???() at ???:0

This problem is associated with incorrect use of the co-routines implementation; most likely there is an unexpected, explicit return from a co-routine that triggers it.
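
For context, flb_libco follows the upstream libco contract that a co-routine's entry function must never return: it has to hand control back with co_switch() instead. A minimal sketch of that contract, using the plain libco API for illustration (this is not fluent-bit source):

#include <libco.h>

static cothread_t main_co;

static void worker(void)
{
    /* ... do the co-routine's work ... */

    co_switch(main_co);  /* correct: yield control back explicitly */

    /* If execution ever falls off the end of this function (an explicit
     * or implicit return), libco lands in its internal crash() handler,
     * which is the Assertion `0' failure reported from amd64.c:121 above. */
}

int main(void)
{
    main_co = co_active();
    cothread_t worker_co = co_create(65536, worker);

    co_switch(worker_co);  /* enter the co-routine */
    co_delete(worker_co);
    return 0;
}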

cc: @StevenACoffman @jgsqware @onorua

@edsiper
Member Author

edsiper commented Apr 9, 2018

@StevenACoffman @jgsqware @onorua

are you facing this crash only when the HTTP server is enabled?

@onorua
Contributor

onorua commented Apr 9, 2018

To be honest, I did not try disabling the HTTP server, as we use it to expose metrics.

@StevenACoffman
Contributor

StevenACoffman commented Apr 10, 2018

@edsiper We had not noticed excessive restarts previously, when the liveness and readiness probes were omitted. I will disable the HTTP server, remove the prometheus annotations and the liveness and readiness probes, start over, and let it run overnight.

@edsiper
Member Author

edsiper commented Apr 10, 2018

thanks. I was able to track down the issue in 0.13-dev; indeed the problem is in the HTTP server, and a fix will be available shortly.

note: If you see a restart when the HTTP server is off, that's a separate problem.

@jgsqware

jgsqware commented Apr 10, 2018 via email

@edsiper
Member Author

edsiper commented Apr 10, 2018

I've fixed the issues associated with the HTTP server that generated the crash. Everyone, please upgrade to the following Docker image:

  • fluent/fluent-bit-0.13-dev:0.16

@edsiper
Member Author

edsiper commented Apr 10, 2018

@StevenACoffman
Contributor

@edsiper FYI: after 24 hours running fluent/fluent-bit-0.13-dev:0.15 without the HTTP server, I only had 4 restarts on a single pod, and one on another. The other 20 pods had no issues. Not the same issue as what you fixed, but if you are curious, all 5 restarts were this error:

[2018/04/10 03:27:50] [ info] [out_http] HTTP STATUS=200
[engine] caught signal
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

#0  0x7ffb141f2e63      in  ???() at ???:0
#1  0x55de3bed7aec      in  ???() at ???:0
#2  0x55de3c0e7415      in  msgpack_pack_ext_body() at lib/msgpack-2.1.3/include/msgpack/pack_template.h:890
#3  0x55de3c0e7415      in  msgpack_pack_object() at lib/msgpack-2.1.3/src/objectc.c:72
#4  0x55de3bed84ce      in  ???() at ???:0
#5  0x55de3bed8d1c      in  ???() at ???:0
#6  0x55de3be7b3da      in  ???() at ???:0
#7  0x55de3be78e2c      in  ???() at ???:0
#8  0x55de3be7a98c      in  ???() at ???:0
#9  0x55de3be9e99d      in  ???() at ???:0
#10 0x55de3be9f7e8      in  ???() at ???:0
#11 0x55de3be9d34c      in  ???() at ???:0
#12 0x55de3bea3721      in  ???() at ???:0
#13 0x55de3be7aecb      in  ???() at ???:0
#14 0x55de3be81a86      in  ???() at ???:0
#15 0x55de3be21ef3      in  ???() at ???:0
#16 0x7ffb140ea2e0      in  ???() at ???:0
#17 0x55de3be20449      in  ???() at ???:0
#18 0xffffffffffffffff  in  ???() at ???:0

I have just applied the fluent/fluent-bit-0.13-dev:0.16 changes and will monitor it and let you know.

@StevenACoffman
Contributor

@edsiper fluent/fluent-bit-0.13-dev:0.16 with HTTP_Server On is in CrashLoopBackOff for all the containers it was applied to:

[2018/04/10 20:57:33] [ info] [out_http] HTTP STATUS=200
[2018/04/10 20:57:33] [ info] [out_http] HTTP STATUS=200
[engine] caught signal
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

#0  0x55a212d8e727      in  __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:87
#1  0x55a212d8e75e      in  mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
#2  0x55a212d8eb64      in  mk_http_thread_purge() at lib/monkey/mk_server/mk_http_thread.c:197
#3  0x55a212d8e8e6      in  thread_cb_init_vars() at lib/monkey/mk_server/mk_http_thread.c:104
#4  0x55a212d995e6      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#5  0xffffffffffffffff  in  ???() at ???:0

Some fail immediately, some run for a minute, but then they all crash with the same error.

@edsiper
Member Author

edsiper commented Apr 10, 2018

thanks for the info; troubleshooting.

solsson added a commit to Yolean/fluent-bit-kubernetes-logging that referenced this issue Apr 12, 2018
@marckamerbeek

marckamerbeek commented Apr 16, 2018

I was hesitant to start a new issue... but I faced some crashes as well.

Version: fluent/fluent-bit-0.13-dev:0.16
Kubernetes: 1.8.4

[engine] caught signal
Fluent-Bit v0.13.0
Copyright (C) Treasure Data
#0  0x7f59d36f0f08      in  ???() at ???:0
#1  0x56214c2c0e8b      in  ???() at ???:0
#2  0x56214c2bf538      in  ???() at ???:0
#3  0x56214c30eebe      in  ???() at ???:0
#4  0x56214c30f1a5      in  ???() at ???:0
#5  0x56214c30fd7c      in  ???() at ???:0
#6  0x56214c2b243a      in  ???() at ???:0
#7  0x56214c2afe8c      in  ???() at ???:0
#8  0x56214c2b19ec      in  ???() at ???:0
#9  0x56214c2d59fd      in  ???() at ???:0
#10 0x56214c2d6848      in  ???() at ???:0
#11 0x56214c2d4237      in  ???() at ???:0
#12 0x56214c2b1f2b      in  ???() at ???:0
#13 0x56214c2b8ae6      in  ???() at ???:0
#14 0x56214c258f53      in  ???() at ???:0
#15 0x7f59d35f02e0      in  ???() at ???:0
#16 0x56214c2574a9      in  ???() at ???:0
#17 0xffffffffffffffff  in  ???() at ???:0

There are several hundred pods running, but it was caused by only one pod. It's an nginx pod. I changed its log format to JSON, and then Fluent Bit started to crash on that particular node.

I figured out that the JSON was not correct. The JSON parser of Fluent Bit crashes without giving any clue. The C implementation of the JSON parser is probably a limited one.

I attached a file with the wrong log and one with the fixed one. This way it might be reproducible.

[wrong_json.log](https://github.com/fluent/fluent-bit/files/1914262/wrong_json.log)
[fixed_json.log](https://github.com/fluent/fluent-bit/files/1914265/fixed_json.log)
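
For reference, Fluent Bit's JSON tokenizer is jsmn (see the JSMN_* build flags later in this thread). jsmn itself signals malformed or truncated input through a negative return code that the caller has to check; a minimal, illustrative sketch with the plain jsmn API (not fluent-bit code, and the sample record is hypothetical):

#include <stdio.h>
#include <string.h>
#include "jsmn.h"

int main(void)
{
    /* Hypothetical truncated nginx-style record: the JSON is cut off. */
    const char *js = "{\"status\": 200, \"upstream\":";
    jsmn_parser parser;
    jsmntok_t tokens[32];

    jsmn_init(&parser);
    int rc = jsmn_parse(&parser, js, strlen(js), tokens, 32);
    if (rc < 0) {
        /* JSMN_ERROR_PART / JSMN_ERROR_INVAL: treat the line as a plain
         * string instead of handing broken structure to the next stage. */
        printf("invalid JSON (jsmn rc=%d), keeping record as raw text\n", rc);
        return 1;
    }
    printf("parsed %d tokens\n", rc);
    return 0;
}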





@edsiper
Member Author

edsiper commented Apr 16, 2018

@marckamerbeek what you have reported looks like a different issue. I've filed #567 for your case; let's follow up there.

@jgsqware

We are currently experiencing the same since we upgraded to 0.16.

@edsiper
Member Author

edsiper commented Apr 16, 2018

would you please try the following image, which has several fixes on the HTTP server side?

edsiper/fluent-bit-0.13-next:0.17

note: this image is 150 MB in size and should be used only to try to reproduce the main problem in question

@jgsqware

Currently, no restart with your version.

@jgsqware

10 minutes without a restart. I'll let it run all night long and keep you posted.

@edsiper
Member Author

edsiper commented Apr 16, 2018

thanks @jgsqware

@StevenACoffman
Contributor

StevenACoffman commented Apr 16, 2018

@edsiper I have applied the change and am going to leave it running overnight as well. Here are results from one hour:

$ kubectl get pods -n kangaroo -l k8s-app=fluent-bit-logging
NAME               READY     STATUS    RESTARTS   AGE
fluent-bit-24c7c   3/3       Running   4          1h
fluent-bit-2f4ph   3/3       Running   0          1h
fluent-bit-884lq   3/3       Running   0          1h
fluent-bit-ccbp8   3/3       Running   1          1h
fluent-bit-fmq4k   2/3       Running   0          1h
fluent-bit-g7mjq   3/3       Running   10         1h
fluent-bit-mk5cf   3/3       Running   0          1h
fluent-bit-mtqhc   3/3       Running   0          1h
fluent-bit-pd629   3/3       Running   0          1h
fluent-bit-pfgsv   3/3       Running   2          1h
fluent-bit-s2phj   2/3       Running   0          1h
fluent-bit-tsbv5   3/3       Running   0          1h

kubectl logs -n kangaroo fluent-bit-g7mjq -c fluent-bit --previous results:

[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200
[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200
[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200
[engine] caught signal
[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200
[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200
[2018/04/16 23:05:02] [ info] [out_http] HTTP STATUS=200

All the pods with restarts exhibit the same behavior. Sometimes [engine] caught signal is the last log message, but generally it appears a few lines before the end.

@StevenACoffman
Contributor

StevenACoffman commented Apr 16, 2018

If it is helpful, here are the startup logs:

[2018/04/16 23:14:56] [ info] [engine] started
[2018/04/16 23:14:56] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/16 23:14:56] [ info] [filter_kube] local POD info OK
[2018/04/16 23:14:56] [ info] [filter_kube] testing connectivity with API server...
[2018/04/16 23:14:56] [ info] [filter_kube] API server connectivity OK
[2018/04/16 23:14:56] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2018/04/16 23:14:58] [ info] [out_http] HTTP STATUS=200

@StevenACoffman
Contributor

With log level set to debug:

[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [filter_kube] could not merge JSON log as requested
[2018/04/16 23:20:56] [debug] [input tail.0] [mem buf] size = 5212894
[2018/04/16 23:20:56] [debug] [input] tail.0 paused (mem buf overlimit)
[2018/04/16 23:20:56] [debug] [in_tail] file=/var/log/containers/fluent-bit-g4nst_kangaroo_fluent-bit-5327b6357ef3287060d07e9ff6523f6ec9c592b5cd7d085617605cdd197bc631.log read=32753 lines=214
[2018/04/16 23:20:56] [ info] [out_http] HTTP STATUS=200
[2018/04/16 23:20:56] [debug] [task] created task=0x563255dcbd10 id=2 OK
[2018/04/16 23:20:57] [ info] [out_http] HTTP STATUS=200

@StevenACoffman
Contributor

@edsiper That looks like it was generated by

flb_error("[out_http] %s:%i, HTTP status=%i",

@StevenACoffman
Contributor

StevenACoffman commented Apr 20, 2018

It looks like HTTP STATUS was changed to HTTP status in defac78, so I think this is the latest version.

@StevenACoffman
Contributor

Although I do see:

[2018/04/20 18:35:02] [ warn] [filter_kube] annotation parser 'json' not found (ns='teachers' pod_name='go-spew-5d99bbd878-5jq8p')

Which you changed in b4b1e09, so maybe it is an older version that got pushed to that Docker tag?

I did see one that got this message before it crashed with the same error:

[2018/04/20 18:36:23] [error] [out_http] no upstream connections available

@StevenACoffman
Contributor

StevenACoffman commented Apr 20, 2018

Ah, right. The configmap overwrote the new parser.conf with the old json-test, so I needed to update it there to match. I can't tell how to verify that I am not using an older version. When I run kubectl exec -it fluent-bit-ncv9l -n kangaroo -c fluent-bit -- /bin/sh and then /fluent-bit/bin/fluent-bit --version, I get Fluent Bit v0.13.0.

When I run ./fluent-bit --sosreport I get:

Fluent-Bit v0.13.0
Copyright (C) Treasure Data

[2018/04/20 19:04:35] [ info] [engine] started

Fluent Bit Enterprise - SOS Report
==================================
The following report aims to be used by Fluent Bit and Fluentd Enterprise
Customers of Treasure Data. For more details visit:

    https://fluentd.treasuredata.com


[Fluent Bit]
    Edition		Community Edition
    Version		0.13.0
    Built Flags		 JSMN_PARENT_LINKS JSMN_STRICT FLB_HAVE_TLS FLB_HAVE_SQLDB FLB_HAVE_BUFFERING FLB_HAVE_METRICS FLB_HAVE_HTTP_SERVER FLB_HAVE_FLUSH_LIBCO FLB_HAVE_SYSTEMD FLB_HAVE_FORK FLB_HAVE_TIMESPEC_GET FLB_HAVE_PROXY_GO FLB_HAVE_JEMALLOC JEMALLOC_MANGLE FLB_HAVE_LIBBACKTRACE FLB_HAVE_REGEX FLB_HAVE_C_TLS FLB_HAVE_ACCEPT4 FLB_HAVE_INOTIFY

[Operating System]
    Name		Linux
    Release		4.4.0-1020-aws
    Version		#29-Ubuntu SMP Wed Jun 14 15:54:52 UTC 2017

[Hardware]
    Architecture	x86_64
    Processors		2

[Built Plugins]
    Inputs              cpu mem kmsg tail proc disk systemd netif dummy head exec health serial stdin tcp mqtt lib forward random syslog
    Filters             grep stdout throttle kubernetes parser record_modifier
    Outputs             counter es exit file forward http influxdb kafka kafka-rest nats null plot splunk stdout td lib flowcounter

[SERVER] Runtime configuration
    Flush		5
    Daemon		Off
    Log_Level		Info

@StevenACoffman
Contributor

StevenACoffman commented Apr 20, 2018

With HTTP_Server Off, the prometheus annotations removed, and the liveness and readiness probes removed, I got a restart at 44 minutes on a single pod:

$ POD_ID=fluent-bit-bqlsb;kubectl logs -n kangaroo $POD_ID -c fluent-bit --previous
[2018/04/20 19:57:54] [ info] [engine] started
[2018/04/20 19:57:54] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/20 19:57:54] [ info] [filter_kube] local POD info OK
[2018/04/20 19:57:54] [ info] [filter_kube] testing connectivity with API server...
[2018/04/20 19:57:55] [ info] [filter_kube] API server connectivity OK
[2018/04/20 19:57:55] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[engine] caught signal
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

#0  0x55ae209bf572      in  msgpack_pack_object() at lib/msgpack-2.1.3/src/objectc.c:103
#1  0x55ae207aeaeb      in  ???() at ???:0
#2  0x55ae207af339      in  ???() at ???:0
#3  0x55ae2075179a      in  ???() at ???:0
#4  0x55ae2074f1ec      in  ???() at ???:0
#5  0x55ae20750d4c      in  ???() at ???:0
#6  0x55ae20774f4d      in  ???() at ???:0
#7  0x55ae20775d98      in  ???() at ???:0
#8  0x55ae20773787      in  ???() at ???:0
#9  0x55ae2075128b      in  ???() at ???:0
#10 0x55ae20757ee2      in  ???() at ???:0
#11 0x55ae206f82b3      in  ???() at ???:0
#12 0x7fcde44382e0      in  ???() at ???:0
#13 0x55ae206f6809      in  ???() at ???:0
#14 0xffffffffffffffff  in  ???() at ???:0

@StevenACoffman
Contributor

After more than two hours, I have no further restarts with the HTTP server off. I will let it run overnight, and until Monday.

@edsiper
Member Author

edsiper commented Apr 20, 2018

thanks for the report. Focusing on this specific error:

#0  0x55ae209bf572      in  msgpack_pack_object() at lib/msgpack-2.1.3/src/objectc.c:103

my assumption is that some invalid data was passed to the msgpack-c library, and it looks closely related to #567, where passing an invalid map (in this case a missing value for a key) generates a small corruption.
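
To illustrate the kind of corruption meant here, a minimal sketch with the plain msgpack-c API (illustrative only, not the fluent-bit code path): a map header that promises more entries than are actually packed leaves a buffer that later consumers cannot walk safely.

#include <msgpack.h>

int main(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer pck;

    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pck, &sbuf, msgpack_sbuffer_write);

    /* Header says: map with 2 key/value pairs, i.e. 4 objects follow. */
    msgpack_pack_map(&pck, 2);

    msgpack_pack_str(&pck, 3);
    msgpack_pack_str_body(&pck, "key", 3);
    msgpack_pack_int(&pck, 1);

    /* Deliberate bug: the second key is written but its value is not,
     * so the buffer is short relative to its own header. A consumer that
     * trusts the declared size and walks two key/value pairs ends up
     * reading data that was never written -- the small corruption
     * described above. */
    msgpack_pack_str(&pck, 7);
    msgpack_pack_str_body(&pck, "missing", 7);

    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}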

I have done two changes:

  • upgraded the JSON parser, which fixes the problem (with a custom PR on their repo)
  • double-check the unpacking status

Please test the following new image:

edsiper/fluent-bit-0.13-next:0.18

@StevenACoffman
Contributor

With the HTTP server on, metrics on, and liveness probes applied, they are all in CrashLoopBackOff after running for only a few minutes:

[2018/04/20 23:17:43] [ info] [engine] started
[2018/04/20 23:17:43] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/20 23:17:43] [ info] [filter_kube] local POD info OK
[2018/04/20 23:17:43] [ info] [filter_kube] testing connectivity with API server...
[2018/04/20 23:17:48] [ info] [filter_kube] API server connectivity OK
[2018/04/20 23:17:48] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[engine] caught signal

@edsiper
Member Author

edsiper commented Apr 20, 2018

are you able to see this issue again?

#0  0x55ae209bf572      in  msgpack_pack_object() at lib/msgpack-2.1.3/src/objectc.c:103

@StevenACoffman
Contributor

I got this once:

[engine] caught signal
[2018/04/20 23:22:47] [ warn] [input] cannot disable event for tail.0
[2018/04/20 23:22:47] [ warn] [input] cannot disable event for tail.0
[2018/04/20 23:22:47] [ warn] [input] cannot disable event for tail.0
[2018/04/20 23:22:48] [ warn] [engine] service will stop in 5 seconds
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

@StevenACoffman
Contributor

When I disable the HTTP Server, I'm still getting similar immediate crash behavior.

@edsiper
Member Author

edsiper commented Apr 22, 2018

@StevenACoffman

I've pushed the following new image:

edsiper/fluent-bit-0.13-next:0.18-2

the co-routines library was not built with the required pre-processor flags to enable multithreading support; hence, when the HTTP server was started (in a separate thread) and spawned multiple co-routines while the Fluent Bit outputs were doing the same, it led to data corruption.

After adding the missing flags I was not able to reproduce the crash. Would you please double-check on your end?
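
A minimal sketch of the failure mode being described, under the assumption that the co-routine library keeps its "currently active co-routine" in library-level state (the exact pre-processor flag flb_libco needs is not shown here): if that state is a plain global shared by the HTTP-server thread and the engine thread, the threads overwrite each other's value, while a thread-local slot keeps them isolated.

#include <pthread.h>
#include <stdio.h>

static void *global_active;           /* shared by all threads: racy    */
static __thread void *thread_active;  /* one slot per thread: isolated  */

static void *worker(void *arg)
{
    global_active = arg;   /* both threads write the same slot          */
    thread_active = arg;   /* each thread writes its own slot           */
    printf("thread %p sees global=%p thread_local=%p\n",
           arg, global_active, thread_active);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, worker, (void *) 0x1);
    pthread_create(&b, NULL, worker, (void *) 0x2);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}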

@StevenACoffman
Contributor

With edsiper/fluent-bit-0.13-next:0.18-2, HTTP_Server On, the prometheus annotations applied, and the liveness and readiness probes applied, things are more stable:

$ kubectl get pods -n kangaroo -l k8s-app=fluent-bit-logging
NAME               READY     STATUS             RESTARTS   AGE
fluent-bit-2jwj5   3/3       Running            0          51m
fluent-bit-4nksl   3/3       Running            0          51m
fluent-bit-56dl2   3/3       Running            0          51m
fluent-bit-92qvj   3/3       Running            0          51m
fluent-bit-crvhn   3/3       Running            0          51m
fluent-bit-fnqqp   3/3       Running            2          51m
fluent-bit-lrlpx   3/3       Running            0          51m
fluent-bit-m2ghf   2/3       CrashLoopBackOff   15         51m
fluent-bit-q56gt   3/3       Running            0          51m
fluent-bit-s7rvz   3/3       Running            0          51m
fluent-bit-x2vh5   3/3       Running            0          51m
fluent-bit-xkncq   3/3       Running            0          51m
$ POD_ID=fluent-bit-m2ghf; kubectl logs -n kangaroo $POD_ID -c fluent-bit --previous
[2018/04/22 13:16:35] [ info] [engine] started
[2018/04/22 13:16:36] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/22 13:16:36] [ info] [filter_kube] local POD info OK
[2018/04/22 13:16:36] [ info] [filter_kube] testing connectivity with API server...
[2018/04/22 13:16:36] [ info] [filter_kube] API server connectivity OK
[2018/04/22 13:16:36] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[engine] caught signal
$ POD_ID=fluent-bit-fnqqp; kubectl logs -n kangaroo $POD_ID -c fluent-bit --previous
[2018/04/22 12:24:39] [ info] [engine] started
[2018/04/22 12:24:39] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/22 12:24:39] [ info] [filter_kube] local POD info OK
[2018/04/22 12:24:39] [ info] [filter_kube] testing connectivity with API server...
[2018/04/22 12:24:39] [ info] [filter_kube] API server connectivity OK
[2018/04/22 12:24:39] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2018/04/22 12:24:40] [ warn] [filter_kube] annotation parser 'json' not found (ns='kangaroo' pod_name='go-spew-pgm52')
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:45] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:3000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[2018/04/22 12:24:46] [ info] [out_http] 127.0.0.1:4000, HTTP status=200
[engine] caught signal

@StevenACoffman
Contributor

I adjusted that one pod that was in CrashLoopBackOff to Log_Level debug:

$ POD_ID=fluent-bit-ngmph; kubectl logs -n kangaroo $POD_ID -c fluent-bit
[2018/04/22 14:18:51] [ info] [engine] started
[2018/04/22 14:18:51] [debug] [in_tail] inotify watch fd=20
[2018/04/22 14:18:51] [debug] [in_tail] scanning path /var/log/containers/*.log
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/consortium-books-holdings-1524268140-q9sp5_teachers_consortium-books-holdings-4807e8810c330d5a9d69f239c9c2480ece3d1d0f9156626bb7c31fef2e5d975d.log, offset=382774
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/eviz-hackathon-77fbb76f-2xw6v_teachers_eureka-yup-9726a3ba513d5a05d064d7c3b8b33a95ecd480388c457eb6ea0cf57ab9c91a3b.log, offset=1091595
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/eviz-hackathon-77fbb76f-2xw6v_teachers_eureka-yup-e0125f0f07d00e2e9b3749b60f8bf0654214096564dd4725c86ec58c0497e056.log, offset=229024
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/eviz-hackathon-77fbb76f-2xw6v_teachers_eviz-hackathon-e969ec8e57c1d14ce5f9a208297e815a539863ae161f2e46081c489874e3bb46.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/fluent-bit-ngmph_kangaroo_fluent-bit-0065e2781b06f877ae5207d0eeac9a7c1a3c592f9c7f034fbea0018ec221701a.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/fluent-bit-ngmph_kangaroo_fluent-bit-7d05c4331e95df14fccdea3732be855e975c64854aa88c5687f7952721d5291b.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/fluent-bit-ngmph_kangaroo_go-kafka-logsink-af433108645cd7378b5f435c20480a084af395b4a5236b4f7f341139fd7ad775.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/fluent-bit-ngmph_kangaroo_go-s3-logsink-ab15461b1abc4064b5bee72b73c6670e116513138ec32969e784d538b2f5a21d.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/go-spew-5d99bbd878-5jq8p_teachers_go-spew-2796b3ae129b6f38835e28724ce92d0b08ee16b248fa8630de0f68e06e6cc062.log, offset=7495520
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/go-spew-gbbzk_kangaroo_go-spew-71d3b944a5a9c944409d2356aca4a3706b3435b4fb93823afe82f12c4a04a119.log, offset=301401
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/hive-metastore-65467bc98f-2tgbm_teachers_hive-metastore-ff0fbfc32857f67a512a9096d5ee72477a2fcae99b56d5074a0e6d842a3a504e.log, offset=847530
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/ingress-nginx-1911116232-jrq5t_kube-ingress_dashboard-ingress-controller-b513dbbf44d2b9639d85b1b9c661aa28ac8a798be40035526bca0785a615de5d.log, offset=3899152
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/jenkins-slave-pkgd2-mb2s9_teachers_curl-47d37a334a4fe421f2a9db10969756046d1013efe78e9b3d5a5f0302a4062e5c.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/jenkins-slave-pkgd2-mb2s9_teachers_docker-aaf0fa3cb424d1307b62ac8b26efd36e73ad219ff57557bf88e53e4a0c4075bb.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/jenkins-slave-pkgd2-mb2s9_teachers_jnlp-93828032008ad110576779239086784470d2c0e4702e160c21701acf3abab963.log, offset=7985
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/jenkins-slave-pkgd2-mb2s9_teachers_kubectl-faadee10626ff802d1833ab5e553825d5141a7449a020875062371f546064376.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/kube-proxy-ip-172-28-154-19.ec2.internal_kube-system_kube-proxy-8dea9e1e6f7b0837857a51efc732a0489d9fe67f13ead0454f016cc581c63a0e.log, offset=179002
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/kube2iam-bxjp9_kube-system_kube2iam-c42d3e90ca4ae7089c946cdfdde1532f8c8e2d6f949bc12a271ac9356fb52b51.log, offset=130367
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/licensing-service-dstar-2598-e3c971-6ccd87468-5xk98_teachers_go-eureka-2dcbe2818580157b4f6880f54a594859fd2f99583886362ff0da6f9d65f0d22a.log, offset=4548006
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/licensing-service-dstar-2598-e3c971-6ccd87468-5xk98_teachers_licensing-service-dstar-2598-e3c971-2871d75cdaa1ef18acaaeecae012cb3d619ecbc9cc287c2e6f53325fd3004452.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/licensing-service-dstar-2598-e3c971-6ccd87468-5xk98_teachers_licensing-service-dstar-2598-e3c971-74eb7c39bc5e375a5d613117c1a4372c3d417a09adb69b6d50d1fa06f6c34107.log, offset=0
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/selenium-python-6479976d89-tfg9x_teachers_selenium-python-c22ceee7b235c9310ec00920d8716b72911a8c0820d77b77a1e9cb35788f5309.log, offset=221
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/servicegraph-7677cbf586-qxszr_istio-system_servicegraph-8dd37c54695619087e413e9a9160c9274104d5e89795fc37d54eaf85cdb7cf56.log, offset=129
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_backup-sidecar-144b980cb69f453cd5254e7752b9676c6224c635e4255f1e8c0ad43e702afc3a.log, offset=1218185
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_eureka-yup-5837c6caafbab943a2266e10bb63aa2c888d8b011212b21d5134958a5ed74545.log, offset=5638801
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_eureka-yup-79e46f557cfd8b201296e93012b8781cead883c49411fe747b322840fdf1dd18.log, offset=65418
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_restore-settings-51cd10ed588a7e477766adacb4abc21b6ebf3da0e4148949e47d192ebf5fa724.log, offset=374
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_restore-settings-8e9e10aa191a12184e49919f7ebc1915a8da858dc0a5064b04ad59dd2cb4204a.log, offset=374
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_restore-settings-8fc4ead750d8022aa26a9bbb987f270851e12da6f946ae126ebbd074bfc28235.log, offset=374
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_restore-settings-c2d228943e009424bc7950b526bc555beb9debb3472b2664ac63aca9bec77d7a.log, offset=374
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/sonarqube-77784d95b-tfn9x_teachers_sonarqube-04431a036b0aec39da0e661a5b76961c59443905450a840fef1c9d91a96dfb31.log, offset=355692
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/test-nginix-release-nginx-ingress-controller-4c2r8_kangaroo_nginx-ingress-controller-99ddb30b5661fb5a3a90b441b9494b58be44f0a8f3724e5a717d4309b97b7501.log, offset=65346
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/test-prometheus-prometheus-node-exporter-9xjt6_kangaroo_prometheus-node-exporter-59f18e0015509c0fe053c685e6d13b35ca316160596a3f777a7eef4380284854.log, offset=5335
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/weave-net-gtc5g_kube-system_weave-c3d87840f863d35d7006baacbfa6053ba97f18c6fcd96d35268e211658f36b41.log, offset=5198496
[2018/04/22 14:18:51] [debug] [in_tail] add to scan queue /var/log/containers/weave-net-gtc5g_kube-system_weave-npc-f307ee0f21479416fd7e27a333da6d05028be8d601d8193724d075196c338c7c.log, offset=56754
[2018/04/22 14:18:51] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/22 14:18:51] [ info] [filter_kube] local POD info OK
[2018/04/22 14:18:51] [ info] [filter_kube] testing connectivity with API server...
[2018/04/22 14:18:51] [debug] [filter_kube] API Server (ns=kangaroo, pod=fluent-bit-ngmph) http_do=0, HTTP Status: 200
[2018/04/22 14:18:51] [ info] [filter_kube] API server connectivity OK
[2018/04/22 14:18:51] [debug] [router] input=tail.0 'DYNAMIC TAG'
[2018/04/22 14:18:51] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[engine] caught signal

@StevenACoffman
Contributor

With edsiper/fluent-bit-0.13-next:0.18-2, HTTP_Server Off, the prometheus annotations removed, the liveness and readiness probes removed, and Log_Level info, whatever pod gets scheduled to that node still has a problem:

$ kubectl get pods -n kangaroo -l k8s-app=fluent-bit-logging
NAME               READY     STATUS             RESTARTS   AGE
fluent-bit-2k4d7   3/3       Running            0          1m
fluent-bit-485r8   3/3       Running            0          1m
fluent-bit-87wzl   3/3       Running            0          1m
fluent-bit-btf6t   3/3       Running            0          1m
fluent-bit-chns9   3/3       Running            0          1m
fluent-bit-dfzsz   3/3       Running            0          1m
fluent-bit-dsd6v   3/3       Running            0          1m
fluent-bit-gjjhb   2/3       CrashLoopBackOff   3          1m
fluent-bit-j4sg2   3/3       Running            0          1m
fluent-bit-pd4h9   3/3       Running            0          1m
fluent-bit-rn9n8   3/3       Running            0          1m
fluent-bit-tvh4k   3/3       Running            0          1m

I don't see an obvious error message:

$ POD_ID=fluent-bit-gjjhb; kubectl logs -n kangaroo $POD_ID -c fluent-bit
[2018/04/22 14:27:37] [ info] [engine] started
[2018/04/22 14:27:37] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/22 14:27:37] [ info] [filter_kube] local POD info OK
[2018/04/22 14:27:37] [ info] [filter_kube] testing connectivity with API server...
[2018/04/22 14:27:37] [ info] [filter_kube] API server connectivity OK
[engine] caught signal

@StevenACoffman
Contributor

@edsiper I am not sure what's going on with that one pod, but I have not seen any other restarts after 30 minutes.

@edsiper
Member Author

edsiper commented Apr 23, 2018

@StevenACoffman

I've pushed a new 0.13 dev image:

fluent/fluent-bit-0.13-dev:0.18

this version merges the latest changes and also adds some minor information about the signal being trapped (which is fundamental to continue troubleshooting); please give it a try...

@StevenACoffman
Contributor

With fluent/fluent-bit-0.13-dev:0.18 and HTTP_Server Off, I have no restarts after an hour:

$ kubectl get pods -n kangaroo -l k8s-app=fluent-bit-logging
NAME               READY     STATUS    RESTARTS   AGE
fluent-bit-2g259   3/3       Running   0          1h
fluent-bit-2l998   3/3       Running   0          1h
fluent-bit-62zvd   3/3       Running   0          1h
fluent-bit-6pjh7   3/3       Running   0          1h
fluent-bit-72dm4   3/3       Running   0          1h
fluent-bit-8rggc   3/3       Running   0          1h
fluent-bit-8tx9c   3/3       Running   0          1h
fluent-bit-fdxqs   3/3       Running   0          1h
fluent-bit-g7rgp   3/3       Running   0          1h
fluent-bit-nf76d   3/3       Running   0          1h
fluent-bit-rrz6n   3/3       Running   0          1h
fluent-bit-zhn9w   3/3       Running   0          1h

I'm going to turn the HTTP_Server On, with prometheus annotations and liveness/readiness checks applied.

@StevenACoffman
Contributor

StevenACoffman commented Apr 23, 2018

Well, so far, so good. With fluent/fluent-bit-0.13-dev:0.18 and HTTP_Server On, with prometheus annotations, liveness probes, and readiness probes applied, I'm not seeing any restarts:

$ kubectl get pods -n kangaroo -l k8s-app=fluent-bit-logging
NAME               READY     STATUS    RESTARTS   AGE
fluent-bit-889qs   3/3       Running   0          8h
fluent-bit-8qj5m   3/3       Running   0          8h
fluent-bit-gwg5c   3/3       Running   0          8h
fluent-bit-jj8b5   3/3       Running   0          8h
fluent-bit-k6bmx   3/3       Running   0          8h
fluent-bit-nqspb   3/3       Running   0          8h
fluent-bit-nqt99   3/3       Running   0          8h
fluent-bit-rqkjs   3/3       Running   0          8h
fluent-bit-s7pnv   3/3       Running   0          8h
fluent-bit-tf622   3/3       Running   0          8h
fluent-bit-wq5h9   3/3       Running   0          8h
fluent-bit-xqvhm   3/3       Running   0          8h

I will continue to monitor it, and I have some additional clusters I can apply this to. Nice work!

@StevenACoffman
Contributor

StevenACoffman commented Apr 23, 2018

Please disregard this last error message (since deleted). I mistakenly reverted to an older version.

@edsiper
Member Author

edsiper commented Apr 23, 2018 via email

@StevenACoffman
Contributor

I let it run in three clusters overnight, and I have had no more restarts. I think this issue can be closed as fixed! Thank you!

@jgsqware

Same for me! Thanks

@edsiper
Member Author

edsiper commented Apr 24, 2018

thanks to everyone who helped troubleshoot this issue! Now it's time to close it :)

Fixed.

@marckamerbeek

Great job Eduardo!

micklonge pushed a commit to micklonge/elk_fluent-bit that referenced this issue Jun 26, 2018
When using multiple outputs of the same type (e.g. http), if there is
a problem when flushing data to one of them, some retries are enqueued
into the scheduler. But if the other output succeeds, it does not take
into account that some 'retries' still exist and removes the 'task',
generating a corruption in memory. Configuration example:

  [SERVICE]
    Flush     1
    Log_Level info

  [INPUT]
    Name      random
    Samples   1

  [OUTPUT]                 <--- http:0
    Name      http
    Match     *

  [OUTPUT]                 <--- http:1
    Name      http
    Match     *

If, when starting Fluent Bit, both http:0 and http:1 are down, the task
generated by the 'random' input will create two retries (one for each
output); but if http:0 comes back up and succeeds, it will destroy the
task, leading to corruption in the retry that still exists for http:1.

This patch makes sure to avoid deleting a task if some retries are
still in use by the scheduler.

Signed-off-by: Eduardo Silva <[email protected]>
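
A sketch of the guard this patch describes, with hypothetical names (these are not the real fluent-bit structures): a task may only be destroyed once no output is flushing it and no retry for it is still queued in the scheduler.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical types for illustration only. */
struct demo_task {
    int users;    /* outputs currently flushing this task                */
    int retries;  /* retries for this task still queued in the scheduler */
};

/* Destroy the task only when no output and no scheduled retry reference it. */
static int demo_task_destroy_if_unused(struct demo_task *task)
{
    if (task->users > 0 || task->retries > 0) {
        return 0;  /* still referenced: keep the task and its data alive */
    }
    free(task);
    return 1;
}

int main(void)
{
    struct demo_task *t = calloc(1, sizeof(*t));

    /* http:0 succeeded (no active users) but http:1 still has a retry queued. */
    t->users = 0;
    t->retries = 1;
    printf("destroyed now? %d\n", demo_task_destroy_if_unused(t));         /* 0 */

    /* Once the retry for http:1 completes, the task can finally go away. */
    t->retries = 0;
    printf("destroyed after retry? %d\n", demo_task_destroy_if_unused(t)); /* 1 */
    return 0;
}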