cloudwatch_logs blocks on read forever #4606
I have a second occurrence on another, unrelated server:
May be a duplicate of #4553; the errors may be incidental.
@PettitWesley, sorry to bother you, but reading the commit logs, this seems perhaps closest to the areas of the code you've modified. Do you have any thoughts on the blocking behavior, or know who might be the best person to handle this, or do you need any more information? I have a few live cases rolling around.
@fdr Sorry I missed this issue - are you using the AWS Distro or not? We have recently been adding some extra patches... one of which is for an issue that can cause it to get stuck on network IO: #4098
No, I'm not. AlmaLinux of recent vintage here. Thanks for considering my question.
Looks like 1.8.12 may contain some of these fixes? I'll upgrade and see what happens.
Alas, 1.8.12 is affected as well. Are some of the fixes un-applied to it? Temporally (merge vs. release date) it seems like they should be, but I haven't used …
@PettitWesley yeah, I think 1.8.12 has the patch you mention, but it doesn't do the trick.
Also, looking at the patch, this label name relative to what it's being checked by is... suspicious? WANT_READ is a retry_read, and WANT_WRITE is also a retry_read? I haven't read the context, just the commit message, and was wondering if it was an oversight.
Picked up a few more of these. 1.8.12.
Another one cropped up, 1.8.12.
@krispraws Does this match up with the SSL read fix? Or does this look like something different?
Out of curiosity, how do you time things out when doing synchronous network I/O?

```c
/*
 * Remove async flag from upstream
 * CW output runs in sync mode; because the CW API currently requires
 * PutLogEvents requests to a log stream to be made serially
 */
upstream->flags &= ~(FLB_IO_ASYNC);
```
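For background (and not a description of what fluent-bit actually does): one generic POSIX way to put an upper bound on blocking socket calls is a per-socket receive/send timeout set with setsockopt. A minimal sketch, assuming a plain connected TCP socket fd and an illustrative 30-second deadline:

```c
/* Generic POSIX sketch, not fluent-bit code: bound blocking socket I/O
 * with per-socket timeouts. The fd and the timeout value are assumptions
 * made for this example. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>

static int set_io_timeouts(int fd, int seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    /* After this, recv()/read() fails with EAGAIN/EWOULDBLOCK instead of
     * blocking indefinitely once `seconds` pass with no data. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        perror("setsockopt(SO_RCVTIMEO)");
        return -1;
    }
    /* Same bound for send()/write(). */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0) {
        perror("setsockopt(SO_SNDTIMEO)");
        return -1;
    }
    return 0;
}
```

With something like this in place, a read that would otherwise block forever instead fails after the deadline, and the caller can treat that as a retryable error.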
Alright, new information... so I decided to see what would happen if I injected some kind of fault at the OS level, to check whether fluent-bit would un-stick. Indeed, it does. Firstly, I ran … I was going to use … Using …
Note the large gap in dates below. I ran this in GDB on …
So, if it's too hairy to enable asynchronous I/O at this time, you could consider TCP keepalive as a workaround, by running setsockopt a handful of times. Better than the plugin not working indefinitely, I think.
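To make that suggestion concrete, here is a minimal sketch of the kind of setsockopt calls meant, assuming a Linux host and an already-connected socket fd; the timing values are illustrative assumptions, not recommendations:

```c
/* Sketch: enable TCP keepalive so a silently dead peer eventually produces
 * an error on the blocked read. Generic Linux illustration, not fluent-bit
 * code; the idle/interval/count values are assumptions for the example. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static int enable_keepalive(int fd)
{
    int on = 1;
    int idle = 60;      /* seconds of idleness before the first probe        */
    int interval = 10;  /* seconds between unanswered probes                 */
    int count = 6;      /* unanswered probes before the connection is reset  */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        perror("setsockopt(SO_KEEPALIVE)");
        return -1;
    }
    /* The three options below are Linux-specific (TCP_KEEPIDLE et al.). */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0) {
        perror("setsockopt(TCP_KEEP*)");
        return -1;
    }
    return 0;
}
```

With these set, a peer that disappears while the connection stays in ESTAB will, after roughly idle + interval * count seconds, cause the blocked read to fail with an error instead of hanging forever.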
@krispraws @PettitWesley hello folks... just wanted to raise this again. Seems bad that any connection problem stops logs forever, so long as the connection state stays in TCP ESTAB, which, in normal operation, it will.
Ok, so please forgive me, I really don't know what I'm doing with all of this low-level networking stuff, but the PR #3096 that you linked should probably fix this, right? Also, do you have any thoughts on whether this is related to aws/aws-for-fluent-bit#293? I'm gonna build that PR, enable the new setting, and see if it reduces the rate of connection errors.
@krispraws I would appreciate your help and expertise on this.
I'll give a crash course on the low-level network stuff... If you have reasons not to use asynchronous I/O and implement timeouts of individual syscalls that way, you definitely want keepalive on. I consider aws/aws-for-fluent-bit#293 a very bad idea: without the errors created by missing keepalive packets, the client has no way to make progress (as seen in this bug report). TCP keepalive using … So, alternatives. I only think the first one is worth entertaining at all, but I mention them all to give background on how people have dealt with this problem.
As for #3096: I don't know fluent-bit software engineering well, but if it was a nice generic way to apply keepalive and it worked for …
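For completeness, a sketch of one common way to bound a blocking read without going fully asynchronous: wait for readability with poll() and a deadline before issuing the read. This is a generic POSIX illustration, not fluent-bit code, and the timeout handling is an assumption made for the example:

```c
/* Generic POSIX sketch, not fluent-bit code: wait for readability with a
 * deadline before calling the (otherwise blocking) read. */
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t read_with_deadline(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    int rc = poll(&pfd, 1, timeout_ms);
    if (rc == 0) {
        /* Nothing arrived within the deadline; surface it as a timeout. */
        errno = ETIMEDOUT;
        return -1;
    }
    if (rc < 0) {
        return -1;  /* poll() itself failed; errno is already set */
    }
    /* The socket is readable (data, EOF, or a pending error), so this
     * read will not block indefinitely. */
    return read(fd, buf, len);
}
```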
Thanks for that explanation.
So the reason why the cloudwatch_logs plugin uses sync IO (in contrast to basically all other plugins, which use async) is an unfortunate mismatch between Fluent Bit's concurrency model and the CloudWatch PutLogEvents API. The API expects a sequence token with each request that uploads logs to a log stream, and a successful request returns the next sequence token. This doesn't match up well with the Fluent Bit concurrency model, which is partly explained here: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency

Basically, if I make an async request to PutLogEvents, then until it returns I can't upload any more logs to the same stream, because I need that sequence token. Hence I chose sync IO. With workers, Eduardo changed how things work a bit, and I think you can now use a mutex lock in output plugins.

EDIT: Actually, I think the above doesn't work: if a new flush call blocks on a lock held by a now-inactive coroutine, then you can get deadlock very easily. The thread is now blocked on the lock, and so the coroutine holding it won't ever wake up. This is the case with a single worker thread; with multiple workers, if all are sending to the same log stream, you can get this as well. A more complicated solution is needed that uses custom logic in the Fluent Bit event loop/engine.
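To illustrate the serial dependency described above: each PutLogEvents call has to carry the sequence token returned by the previous call for the same log stream, so there is nothing useful to do concurrently against that stream. A rough sketch; all names here (cw_client, log_batch, cw_put_log_events) are hypothetical stand-ins, not the plugin's actual API:

```c
/* Hypothetical sketch of the sequence-token chaining described above;
 * cw_client, log_batch and cw_put_log_events are invented names, not
 * the real cloudwatch_logs plugin API. */
struct cw_client;   /* opaque client handle                        */
struct log_batch;   /* one batch of log events for a single stream */

struct put_result {
    int   ok;
    char *next_sequence_token;
};

/* Wrapper around a single PutLogEvents request (declaration only). */
struct put_result cw_put_log_events(struct cw_client *client,
                                    struct log_batch *batch,
                                    const char *sequence_token);

static int flush_stream(struct cw_client *client, struct log_batch *batches,
                        int n, const char *sequence_token)
{
    for (int i = 0; i < n; i++) {
        /* The request must carry the token from the previous response... */
        struct put_result r = cw_put_log_events(client, &batches[i],
                                                sequence_token);
        if (!r.ok) {
            return -1;
        }
        /* ...so the next batch cannot be sent until this one has returned. */
        sequence_token = r.next_sequence_token;
    }
    return 0;
}
```

Issuing the second request before the first returns would mean guessing the token, which the API rejects; that constraint is what pushed the plugin toward sync IO.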
Well, it could be that synchronous I/O with keepalive is genuinely the best choice for quite a while, in that there may not be a profitable way to solve the problem for an acceptable level of complication (three syscalls). Though, given the use of cooperative tasking, I think there are other effects here, besides better error messages, that you would want to solve: the cloudwatch_logs output plugin sits on the cooperatively-tasked mutex you describe for perhaps hundreds to thousands of times longer than an async implementation would, only waiting for I/O. This might have interesting effects on the flow of logs to multiple outputs, or trigger log loss for input sources like syslog unix datagrams, among other things, since the fluent-bit process is effectively frozen for far longer than designed.
Hi, we've recently merged some code fixes related to connection management. Is this issue still reproducible for you using 1.8.15 or 1.9.1?
Can you summarize what has changed that may impact this problem? Evaluating this imposes a cost not only on me but on my team, and we have a pretty good idea what the problem is: system calls on blocking I/O are not obliged to return at any particular time (i.e., possibly never), fluent-bit and/or the cloudwatch output module can't time them out, and can't schedule its other work items either, since cloudwatch sits on the cooperative dispatch forever. I was skimming the git commit logs and it's not obvious what would address this issue.
Hi @fdr, it would be great if you could provide a repro scenario for this using either 1.8.15 or 1.9.1.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the …
This issue was closed because it has been stalled for 5 days with no activity.
Bug Report
fluent-bit stops sending logs forever.
Expected behavior
It should keep sending logs.
Your Environment
Additional context
Thread stack traces look like this:
It has been blocked since about three days ago, not long after it started.
Note that unlike some of my other reports of this kind, there have been no errors displayed at all since the program started.