Unexpected Multiple Server Behavior #1719
I tried to reproduce this problem, but it doesn't happen.
I shut down
So maybe, the problem is outside of
If your network is sometimes unstable,
@repeatedly I'm going to take down one of the servers tomorrow morning. I will capture a more complete log when I do.
@repeatedly Performed maintenance this morning and was able to catch it in the act. Logs of the events are below. Yes, this is google-fluentd, but I've personally seen this happen with td-agent also. Clearly, downing the one server causes all of them to detach upstream. Why? I have no idea. Also, the servers do not all have the same host; I was just hiding the IPs. They are internal, though, so I included them below.
I see. I will check the timeout case with heartbeat.
We see the same thing happening with SSL connections. At least in the SSL case, the reason seems to be that the SSL socket connect is blocking and does not have a timeout, so all heartbeat connects to the non-responding server block indefinitely. I already tried a quick fix around sock.connect in
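A minimal sketch of the kind of fix described above, assuming a plain TCP heartbeat (the helper name `heartbeat_reachable?` is made up here, not fluentd's actual code): Ruby's `Socket.tcp` accepts a `connect_timeout` option, so the connect cannot block indefinitely on a dead or unresponsive server.

```ruby
require 'socket'

# Sketch: probe a forward server with a bounded connect, so a heartbeat to
# an unresponsive node fails after `timeout` seconds instead of hanging.
# Socket.tcp raises Errno::ETIMEDOUT when connect_timeout expires.
def heartbeat_reachable?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { |_sock| true }
rescue Errno::ETIMEDOUT, Errno::ECONNREFUSED, SocketError
  false
end
```

For SSL the same idea applies, but the TLS handshake itself also needs a deadline (e.g. driving `connect_nonblock` with `IO.select`), since a bounded TCP connect alone does not cover a handshake that never completes.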
@repeatedly I know this issue likely flies under the radar a bit, but it makes reliably managing a cluster of fluentd servers nearly impossible. If you can't down one server without effectively downing them all, leading to negative upstream effects, then you don't really have a fault-tolerant cluster. You have a single point of failure that happens to be spread over multiple machines.
Yeah, we need to fix it, and I have considered several ways of writing the patch.
The problem is that heartbeats are handled on only one thread, so the elapsed time of a previous heartbeat affects the other heartbeats. There are several approaches to this problem, e.g. calculating the precise elapsed time across heartbeats, or launching a thread for each server, but that seems heavy...
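A rough sketch of the "thread per server" approach mentioned above (the class, struct, and method names here are hypothetical, not fluentd's actual implementation): each node's heartbeat runs on its own thread, so a slow or hanging heartbeat to one server cannot delay failure detection for the others.

```ruby
require 'socket'

# Hypothetical per-server heartbeat monitor: one thread per node, so the
# elapsed time of one node's heartbeat never skews the others' schedules.
class HeartbeatMonitor
  Node = Struct.new(:host, :port, :available)

  def initialize(nodes, interval: 1)
    @nodes = nodes
    @interval = interval
  end

  # Launch one heartbeat loop per node; each loop pings independently.
  def start
    @threads = @nodes.map do |node|
      Thread.new do
        loop do
          node.available = ping(node)
          sleep @interval
        end
      end
    end
  end

  private

  # Bounded TCP probe; SystemCallError covers all Errno::* failures.
  def ping(node)
    Socket.tcp(node.host, node.port, connect_timeout: 2) { true }
  rescue SystemCallError, SocketError
    false
  end
end
```

The trade-off the comment alludes to: N nodes means N long-lived threads, which is heavier than the single-threaded timer loop but isolates each node's latency.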
Maybe mpeltonen's case is a different issue.
@repeatedly I will test this patch today and get back to you ASAP.
Move tick check to after heartbeat for avoiding the impact of other node heartbeat. fix #1719
Fluentd v0.14 - technically google-fluentd, but observed in td-agent as well.
CentOS
Problem: In this multiple-server forwarding config, if I take down any of the machines, they will all eventually be marked "no nodes available" even though two of them are still available. Am I misunderstanding this multi-server configuration option?
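The forwarding config referenced here did not survive in this excerpt; a representative out_forward setup of this kind (hypothetical internal IPs and names) might look like the following, where a heartbeat to each `<server>` decides whether that node is available:

```
<match **>
  @type forward
  heartbeat_type tcp        # probe each server over TCP
  heartbeat_interval 1s
  <server>
    name fwd1
    host 10.0.0.1           # hypothetical internal IP
    port 24224
  </server>
  <server>
    name fwd2
    host 10.0.0.2
    port 24224
  </server>
  <server>
    name fwd3
    host 10.0.0.3
    port 24224
  </server>
</match>
```

With a healthy cluster, taking down fwd1 should only mark fwd1 unavailable; this issue describes fwd2 and fwd3 also being detached.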