Unexpected Multiple Server Behavior #1719
Closed · GLStephen opened this issue Oct 18, 2017 · 10 comments

@GLStephen commented Oct 18, 2017

Fluentd v0.14 (technically google-fluentd, but observed with td-agent as well), running on CentOS.

<match tag.**>
  @type forward

  <server>
    name fluentd-groupb-1
    host x.y.z.w
    port 24228
    weight 60
  </server>

  <server>
    host x.y.z.w
    port 24228
    weight 60
  </server>

  <server>
    host x.y.z.w
    port 24228
    weight 60
  </server>

  .... other stuff ....
</match>

Problem: with this multiple-server forwarding config, if I take down any one of the machines, they will all eventually be marked as "no nodes available", even though two of them are still available. Am I misunderstanding this multi-server configuration option?

@repeatedly (Member)

host x.y.z.w

Do all servers have the same host?

I tried to reproduce this problem, but it doesn't happen.

  • config
# out.conf
<source>
  @type forward
</source>

<match test.**>
  @type forward
  <buffer tag>
    @type memory
    flush_interval 5s
  </buffer>
  <server>
    host 127.0.0.1
    port 24225
    weight 60
  </server>
  <server>
    host 127.0.0.1
    port 24226
    weight 60
  </server>
  <server>
    host 127.0.0.1
    port 24227
    weight 60
  </server>
</match>

# in1.conf
<source>
  @type forward
  port 24225
</source>

<match test.**>
  @type stdout
</match>

# in2.conf
<source>
  @type forward
  port 24226
</source>

<match test.**>
  @type stdout
</match>

# in3.conf
<source>
  @type forward
  port 24227
</source>

<match test.**>
  @type stdout
</match>
  • log

I shut down the in3.conf instance, but out_forward sends events to the other nodes.

[...snip...]
2017-10-20 08:25:14 +0900 [info]: adding match pattern="test.**" type="forward"
2017-10-20 08:25:14 +0900 [info]: #0 adding forwarding server 'fluentd-groupb-1' host="127.0.0.1" port=24225 weight=60 plugin_id="object:3fedc8c4926c"
2017-10-20 08:25:14 +0900 [info]: #0 adding forwarding server '127.0.0.1:24226' host="127.0.0.1" port=24226 weight=60 plugin_id="object:3fedc8c4926c"
2017-10-20 08:25:14 +0900 [info]: #0 adding forwarding server '127.0.0.1:24227' host="127.0.0.1" port=24227 weight=60 plugin_id="object:3fedc8c4926c"
2017-10-20 08:25:14 +0900 [info]: adding source type="forward"
2017-10-20 08:25:14 +0900 [info]: #0 starting fluentd worker pid=95658 ppid=95630 worker=0
2017-10-20 08:25:14 +0900 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2017-10-20 08:25:14 +0900 [info]: #0 listening port port=24224 bind="0.0.0.0"
2017-10-20 08:25:14 +0900 [info]: #0 fluentd worker is now running worker=0
2017-10-20 08:29:06 +0900 [warn]: #0 detached forwarding server '' host="127.0.0.1" port=24227 phi=16.360984605887207 phi_threshold=16
2017-10-20 08:31:40 +0900 [warn]: #0 recovered forwarding server '' host="127.0.0.1" port=24227

So maybe the problem is outside of out_forward's alive-monitoring routine. Is there no "detached forwarding server" line in your log?

@repeatedly (Member)

If your network is sometimes unstable, setting phi_failure_detector false may help.
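
For example, a minimal sketch of where that option sits in an out_forward block (the match pattern, host, and port below are placeholders; the commented-out lines just point at related knobs):

<match test.**>
  @type forward
  # rely on plain heartbeat timeouts instead of the phi accuracy calculation
  phi_failure_detector false
  # heartbeat_interval 1   # how often heartbeats are sent
  # hard_timeout 60        # how long a silent node is tolerated before detaching
  <server>
    host 127.0.0.1
    port 24225
  </server>
</match>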

@GLStephen (Author)

@repeatedly I'm going to down one of the servers tomorrow morning and will capture a more complete log when I do.

@GLStephen (Author) commented Oct 20, 2017

@repeatedly Performed maintenance this morning and was able to catch it in the act; logs of the events are below. Yes, this is google-fluentd, but I've personally seen this happen with td-agent also. Clearly, downing the one server causes all of them to detach upstream. Why, I have no idea.

Also, the servers do not all have the same host; I was just hiding the IPs. They are internal, though, so I've included them below.

=================
=== Web/Forwarder

2017-10-20 09:46:49 +0000 [info]: adding match pattern="prod.**" type="forward"
2017-10-20 09:46:50 +0000 [info]: #0 adding forwarding server 'fluentd-groupb-1' host="10.142.0.53" port=24228 weight=60 plugin_id="object:3f8d9f375bc4"
2017-10-20 09:46:50 +0000 [info]: #0 adding forwarding server '10.142.0.54:24228' host="10.142.0.54" port=24228 weight=60 plugin_id="object:3f8d9f375bc4"
2017-10-20 09:46:50 +0000 [info]: #0 adding forwarding server '10.142.0.55:24228' host="10.142.0.55" port=24228 weight=60 plugin_id="object:3f8d9f375bc4"
2017-10-20 09:46:50 +0000 [info]: adding match pattern="fluent.**" type="null"
2017-10-20 09:46:50 +0000 [info]: adding match pattern="**" type="google_cloud"
2017-10-20 09:46:50 +0000 [info]: #0 Detected GCE platform
2017-10-20 09:46:50 +0000 [info]: #0 Logs viewer address: https://redacted//
2017-10-20 09:46:50 +0000 [info]: adding source type="unix"
2017-10-20 09:46:50 +0000 [info]: adding source type="prometheus"
2017-10-20 09:46:50 +0000 [info]: adding source type="prometheus_monitor"
2017-10-20 09:46:50 +0000 [info]: #0 starting fluentd worker pid=1033 ppid=1028 worker=0
2017-10-20 09:46:50 +0000 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2017-10-20 09:46:50 +0000 [info]: #0 listening fluent socket on /var/run/td-agent/td-agent.sock
2017-10-20 09:46:50 +0000 [info]: #0 fluentd worker is now running worker=0
2017-10-20 12:05:12 +0000 [warn]: #0 detached forwarding server '' host="10.142.0.55" port=24228 hard_timeout=true
=========== configuration changes to *one* of the aggregator servers; ~7 minutes pass ===========
=========== things come back up just fine, only one server is lost during the process ===========
2017-10-20 12:12:37 +0000 [warn]: #0 recovered forwarding server '' host="10.142.0.55" port=24228

=========== the first shutdown was real clean and the aggregator was clean ===========
=========== below is the second shutdown ===========
=========== in this shutdown the aggregator server deadlocks ===========
=========== somehow that downstream deadlock affects the status of *all* ===========
=========== of the servers in the set ===========

2017-10-20 12:40:20 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=31.065658695999446 slow_flush_log_threshold=20.0 plugin_id="object:3f8d9f375bc4"
2017-10-20 12:55:14 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=31.055869156998597 slow_flush_log_threshold=20.0 plugin_id="object:3f8d9f375bc4"
2017-10-20 12:56:09 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=31.072869617000833 slow_flush_log_threshold=20.0 plugin_id="object:3f8d9f375bc4"
=========== servers detach as a group ===========
2017-10-20 12:59:23 +0000 [warn]: #0 detached forwarding server 'fluentd-groupb-1' host="10.142.0.53" port=24228 hard_timeout=true
2017-10-20 12:59:23 +0000 [warn]: #0 detached forwarding server '' host="10.142.0.54" port=24228 hard_timeout=true
2017-10-20 12:59:23 +0000 [warn]: #0 detached forwarding server '' host="10.142.0.55" port=24228 hard_timeout=true
2017-10-20 13:00:25 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-10-20 13:00:26 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Errno::ETIMEDOUT error="Connection timed out - connect(2) for \"10.142.0.55\" port 24228"
  2017-10-20 13:00:25 +0000 [warn]: #0 /opt/google-fluentd/embedded/lib/ruby/gems/2.2.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/socket.rb:59:in `initialize'
  2017-10-20 13:00:25 +0000 [warn]: #0 /opt/google-fluentd/embedded/lib/ruby/gems/2.2.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/socket.rb:59:in `new'
  2017-10-20 13:00:25 +0000 [warn]: #0 /opt/google-fluentd/embedded/lib/ruby/gems/2.2.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/socket.rb:59:in `socket_create_tcp'
  2017-10-20 13:00:25 +0000 [warn]: #0 /opt/google-fluentd/embedded/lib/ruby/gems/2.2.0/gems/fluentd-0.14.21/lib/fluent/plugin/out_forward.rb:334:in `create_transfer_socket'
2017-10-20 13:00:25 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-10-20 13:00:26 +0000 chunk="55bf9ed1957edd3630d91985eb2be454" error_class=Errno::ETIMEDOUT error="Connection timed out - connect(2) for \"10.142.0.55\" port 24228"
2017-10-20 13:00:32 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2017-10-20 13:00:32 +0000 chunk="55bf9ed1957edd3630d91985eb2be454" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:00:32 +0000 [warn]: #0 suppressed same stacktrace
2017-10-20 13:00:32 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=2017-10-20 13:00:40 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:00:32 +0000 [warn]: #0 suppressed same stacktrace
2017-10-20 13:00:50 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=2017-10-20 13:00:50 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:00:50 +0000 [warn]: #0 suppressed same stacktrace
2017-10-20 13:01:20 +0000 [warn]: #0 failed to flush the buffer. retry_time=6 next_retry_seconds=2017-10-20 13:01:20 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:01:20 +0000 [warn]: #0 suppressed same stacktrace
2017-10-20 13:02:30 +0000 [warn]: #0 failed to flush the buffer. retry_time=7 next_retry_seconds=2017-10-20 13:02:30 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:02:30 +0000 [warn]: #0 suppressed same stacktrace
2017-10-20 13:04:43 +0000 [warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2017-10-20 13:04:43 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:04:43 +0000 [warn]: #0 suppressed same stacktrace
=========== servers recover as a group, only the one server was downed ===========
2017-10-20 13:08:30 +0000 [warn]: #0 recovered forwarding server 'fluentd-groupb-1' host="10.142.0.53" port=24228
2017-10-20 13:08:30 +0000 [warn]: #0 recovered forwarding server '' host="10.142.0.54" port=24228
2017-10-20 13:08:56 +0000 [warn]: #0 recovered forwarding server '' host="10.142.0.55" port=24228
2017-10-20 13:09:06 +0000 [warn]: #0 retry succeeded. chunk_id="55bf9ed18fa6b9be5b9f5ae0902bee88"



========================================================================
=== Aggregator Log - One Server of 3 Stopped and Started
========================================================================

=========== this is a clean shutdown, matches above in timing ===========

2017-10-20 12:04:10 +0000 [debug]: #0 calling after_shutdown on output plugin type=:bigquery plugin_id="bqPlugin1"
2017-10-20 12:04:10 +0000 [debug]: #0 calling after_shutdown on output plugin type=:bigquery plugin_id="bqPlugin2"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on input plugin type=:forward plugin_id="object:3f92aa9ec8bc"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin1"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin2"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin3"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin4"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin5"
2017-10-20 12:04:10 +0000 [info]: fluent/log.rb:316:info: Worker 1 finished with status 0

2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on input plugin type=:forward plugin_id="object:3fcec70ec1a0"
2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin1"
2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin2"
2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin3"
2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin4"
2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin5"
2017-10-20 12:04:24 +0000 [info]: fluent/log.rb:316:info: Worker 0 finished with status 0

=========== the events below are the second shutdown above ===========
=========== they look like the deadlock we've seen and match the ===========
=========== recovery and multiple failures above ===========
=========== *for all the forwarders*, but only one was taken down ===========

2017-10-20 12:58:16 +0000 [info]: fluent/log.rb:316:info: Received graceful stop
2017-10-20 12:58:16 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 12:58:16 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 12:58:16 +0000 [debug]: #0 getting start to shutdown main process
2017-10-20 12:58:16 +0000 [debug]: #1 getting start to shutdown main process
2017-10-20 12:58:16 +0000 [info]: #1 fluentd worker is now stopping worker=1
2017-10-20 12:58:16 +0000 [info]: #1 shutting down fluentd worker worker=1
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on input plugin type=:forward plugin_id="object:3fecc2623a10"
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on output plugin type=:bigquery plugin_id="bqPlugin1"
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on output plugin type=:bigquery plugin_id="bqPlugin2"
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on output plugin type=:bigquery plugin_id="bqPlugin3"
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on output plugin type=:bigquery plugin_id="bqPlugin4"
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on output plugin type=:bigquery plugin_id="bqPlugin5"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown input plugin type=:forward plugin_id="object:3fecc2623a10"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown output plugin type=:bigquery plugin_id="bqPlugin1"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown output plugin type=:bigquery plugin_id="bqPlugin2"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown output plugin type=:bigquery plugin_id="bqPlugin3"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown output plugin type=:bigquery plugin_id="bqPlugin4"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown output plugin type=:bigquery plugin_id="bqPlugin5"
2017-10-20 12:58:16 +0000 [info]: #0 fluentd worker is now stopping worker=0
2017-10-20 12:58:16 +0000 [info]: #0 shutting down fluentd worker worker=0
=========== this stuff repeats a bit ===========
2017-10-20 13:00:16 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:00:16 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:01:12 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:01:12 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:02:17 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:02:17 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:03:32 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:03:32 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:04:57 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:04:57 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:06:32 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:06:32 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:08:17 +0000 [warn]: #1 [bqPlugin2] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_1 thread=#<Thread:0x007fd9857a3288@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #0 [bqPlugin4] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007f4841aed9c8@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #0 [bqPlugin4] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_1 thread=#<Thread:0x007f4841aed6f8@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #1 [bqPlugin1] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007fd98593f0b0@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #0 [bqPlugin5] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:enqueue_thread thread=#<Thread:0x007f4841aed4a0@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #1 [bqPlugin1] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_1 thread=#<Thread:0x007fd98593ee30@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #0 [bqPlugin4] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007f48423b1758@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #0 [bqPlugin13] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_1 thread=#<Thread:0x007f48422a1c28@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
=========== irrelevant log showing more deadlock-like behavior ===========
2017-10-20 13:08:17 +0000 [info]: fluent/log.rb:316:info: Worker 1 finished with signal SIGQUIT
2017-10-20 13:08:17 +0000 [info]: fluent/log.rb:316:info: Worker 0 finished with signal SIGQUIT
=========== configuration changes were made while the server was ===========
=========== winding down, so the restart is nearly immediate ===========
2017-10-20 13:08:33 +0000 [info]: parsing config file is succeeded path="/etc/td-agent/td-agent.conf"
2017-10-20 13:08:35 +0000 [warn]: [bqPlugin1] secondary type should be same with primary one primary="Fluent::Plugin::BigQueryOutput" secondary="Fluent::Plugin::FileOutput"
2017-10-20 13:08:35 +0000 [warn]: [bqPlugin2] secondary type should be same with primary one primary="Fluent::Plugin::BigQueryOutput" secondary="Fluent::Plugin::FileOutput"
2017-10-20 13:08:35 +0000 [warn]: [bqPlugin3] secondary type should be same with primary one primary="Fluent::Plugin::BigQueryOutput" secondary="Fluent::Plugin::FileOutput"
2017-10-20 13:08:35 +0000 [warn]: [bqPlugin4] secondary type should be same with primary one primary="Fluent::Plugin::BigQueryOutput" secondary="Fluent::Plugin::FileOutput"
2017-10-20 13:08:35 +0000 [warn]: [bqPlugin5] secondary type should be same with primary one primary="Fluent::Plugin::BigQueryOutput" secondary="Fluent::Plugin::FileOutput"
2017-10-20 13:08:35 +0000 [info]: using configuration file: <ROOT>

@repeatedly (Member)

I see. I will check the timeout case with heartbeat.

@mpeltonen

We see the same thing happening with SSL connections. At least in the SSL case, the reason seems to be that the SSL socket connect is blocking and has no timeout, so all heartbeat connects to the non-responding server block indefinitely.

I already tried a quick fix around sock.connect in socket_create_tls, but it seems something also needs to be done about the SO_LINGER setting, because closing the socket in the timeout case blocks in the close call. I didn't debug much further, but I suspect it may be because SO_LINGER is set to @send_timeout, which is quite long by default, and the SSLSocket already has data buffered to send at that stage. Not closing the socket on timeout at all seemed to cause a resource leak and made the whole fluentd process get stuck after a while.
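
For illustration only (this is not fluentd's socket helper, just a rough Ruby sketch of the bounded connect/handshake described above; the host, port, and timeout values are made-up placeholders):

require 'socket'
require 'openssl'

# Sketch: bound both the TCP connect and the TLS handshake so that a dead
# peer cannot block the caller (e.g. a heartbeat loop) indefinitely.
def tls_connect_with_timeout(host, port, connect_timeout: 5, handshake_timeout: 5)
  # TCP connect with an explicit timeout (raises Errno::ETIMEDOUT on expiry).
  tcp = Socket.tcp(host, port, connect_timeout: connect_timeout)

  ssl = OpenSSL::SSL::SSLSocket.new(tcp, OpenSSL::SSL::SSLContext.new)
  ssl.sync_close = true  # closing the SSL socket also closes the TCP socket

  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + handshake_timeout
  begin
    ssl.connect_nonblock
  rescue IO::WaitReadable, IO::WaitWritable => e
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raise Errno::ETIMEDOUT, "TLS handshake to #{host}:#{port}" if remaining <= 0
    readable = e.is_a?(IO::WaitReadable) ? [tcp] : nil
    writable = e.is_a?(IO::WaitWritable) ? [tcp] : nil
    IO.select(readable, writable, nil, remaining) or
      raise Errno::ETIMEDOUT, "TLS handshake to #{host}:#{port}"
    retry
  end
  ssl
rescue StandardError
  if tcp
    # Hard-close on failure: SO_LINGER on with timeout 0, so close itself
    # does not block trying to flush buffered data.
    tcp.setsockopt(Socket::SOL_SOCKET, Socket::SO_LINGER, [1, 0].pack('ii')) rescue nil
    tcp.close rescue nil
  end
  raise
end

Whether fluentd should hard-close like that is a separate question; the sketch only shows that the connect, the handshake, and the close can all be bounded.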

@GLStephen (Author)

@repeatedly I know this issue likely flies under the radar a bit, but it makes managing a cluster of fluentd servers reliably nearly impossible. If you can't down one server without effectively downing them all, leading to negative upstream effects, then you don't really have a fault-tolerant cluster. You have a single point of failure that happens to be spread over multiple machines.

@repeatedly (Member)

this issue makes managing a cluster of fluentd servers reliably nearly impossible.

Yeah, we need to fix it, and I have considered several ways of writing the patch.
A very simple patch for the hard_timeout case is below.

diff --git a/lib/fluent/plugin/out_forward.rb b/lib/fluent/plugin/out_forward.rb
index 2b5b4d7..24f790c 100644
--- a/lib/fluent/plugin/out_forward.rb
+++ b/lib/fluent/plugin/out_forward.rb
@@ -395,9 +395,6 @@ module Fluent::Plugin
 
     def on_timer
       @nodes.each {|n|
-        if n.tick
-          rebuild_weight_array
-        end
         begin
           log.trace "sending heartbeat", host: n.host, port: n.port, heartbeat_type: @heartbeat_type
           n.usock = @usock if @usock
@@ -407,6 +404,9 @@ module Fluent::Plugin
         rescue => e
           log.debug "unexpected error happen during heartbeat", host: n.host, port: n.port, heartbeat_type: @heartbeat_type, error: e
         end
+        if n.tick
+          rebuild_weight_array
+        end
       }
     end

The problem is that heartbeats are handled on only one thread, so the elapsed time of a previous heartbeat affects the other heartbeats.
Could you test the above patch? I can't reproduce your environment, so I'm not sure this patch fixes your case correctly.

There are several other approaches to this problem, e.g. calculating the precise elapsed time across heartbeats or launching a thread per server, but they seem heavy...
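
To make the ordering problem concrete, here is a toy Ruby model (this is not fluentd's actual out_forward code; the node names, timings, and the simplification that a successful send refreshes the node's heartbeat are all made up for illustration):

HARD_TIMEOUT = 60   # seconds without a heartbeat before a node is detached
BLOCK_TIME   = 75   # time wasted blocking on the unreachable node's heartbeat

Node = Struct.new(:name, :reachable, :last_heartbeat)

# Run the sequential heartbeat loop a few times, either checking expiry
# before sending the heartbeat (the pre-patch ordering via tick) or after
# (the patched ordering), and report which nodes got detached.
def run(check_before_send)
  now = 0
  nodes = [
    Node.new('groupb-1', true,  0),
    Node.new('groupb-2', true,  0),
    Node.new('groupb-3', false, 0),   # the server that was taken down
  ]
  detached = []

  3.times do
    nodes.each do |node|
      check = -> { detached << node.name if now - node.last_heartbeat > HARD_TIMEOUT }

      check.call if check_before_send     # tick first (pre-patch)
      if node.reachable
        node.last_heartbeat = now          # heartbeat succeeds immediately
      else
        now += BLOCK_TIME                  # heartbeat blocks the whole loop
      end
      check.call unless check_before_send  # tick after heartbeat (patched)
    end
  end
  detached.uniq
end

puts "tick before heartbeat: #{run(true).inspect}"
# => ["groupb-1", "groupb-2", "groupb-3"]  (healthy nodes detached too)
puts "tick after heartbeat:  #{run(false).inspect}"
# => ["groupb-3"]                          (only the downed node detached)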

@repeatedly (Member) commented Nov 1, 2017

Maybe mpeltonen's case is a different issue.
If you notice bad code in the socket handling/settings, please open a new issue.

@GLStephen (Author)

@repeatedly I will test this patch today and get back to you ASAP

repeatedly added a commit that referenced this issue Nov 7, 2017
Move tick check to after heartbeat for avoiding the impact of other node heartbeat. fix #1719
repeatedly added a commit that referenced this issue Nov 14, 2017
Move tick check to after heartbeat for avoiding the impact of other node heartbeat. fix #1719