Unexpected Multiple Server Behavior #1719
Closed · GLStephen opened this issue Oct 18, 2017 · 10 comments

@GLStephen commented Oct 18, 2017

Fluentd v0.14 (technically google-fluentd, but observed with td-agent as well), running on CentOS.

<match tag.**>
  @type forward

  <server>
    name fluentd-groupb-1
    host x.y.z.w
    port 24228
    weight 60
  </server>

  <server>
    host x.y.z.w
    port 24228
    weight 60
  </server>

  <server>
    host x.y.z.w
    port 24228
    weight 60
  </server>

  .... other stuff ....
</match>

Problem: with this multiple-server forwarding config, if I take down any one of the machines, they will all eventually be marked as "no nodes available", even though two of them are still available. Am I misunderstanding this multi-server configuration option?

@repeatedly (Member)

host x.y.z.w

Do all servers have the same host?

I tried to reproduce this problem, but it doesn't happen.

  • config
# out.conf
<source>
  @type forward
</source>

<match test.**>
  @type forward
  <buffer tag>
    @type memory
    flush_interval 5s
  </buffer>
  <server>
    host 127.0.0.1
    port 24225
    weight 60
  </server>
  <server>
    host 127.0.0.1
    port 24226
    weight 60
  </server>
  <server>
    host 127.0.0.1
    port 24227
    weight 60
  </server>
</match>

# in1.conf
<source>
  @type forward
  port 24225
</source>

<match test.**>
  @type stdout
</match>

# in2.conf
<source>
  @type forward
  port 24226
</source>

<match test.**>
  @type stdout
</match>

# in3.conf
<source>
  @type forward
  port 24227
</source>

<match test.**>
  @type stdout
</match>
  • log

I shut down the in3.conf instance, but out_forward sends events to the other nodes.

[...snip...]
2017-10-20 08:25:14 +0900 [info]: adding match pattern="test.**" type="forward"
2017-10-20 08:25:14 +0900 [info]: #0 adding forwarding server 'fluentd-groupb-1' host="127.0.0.1" port=24225 weight=60 plugin_id="object:3fedc8c4926c"
2017-10-20 08:25:14 +0900 [info]: #0 adding forwarding server '127.0.0.1:24226' host="127.0.0.1" port=24226 weight=60 plugin_id="object:3fedc8c4926c"
2017-10-20 08:25:14 +0900 [info]: #0 adding forwarding server '127.0.0.1:24227' host="127.0.0.1" port=24227 weight=60 plugin_id="object:3fedc8c4926c"
2017-10-20 08:25:14 +0900 [info]: adding source type="forward"
2017-10-20 08:25:14 +0900 [info]: #0 starting fluentd worker pid=95658 ppid=95630 worker=0
2017-10-20 08:25:14 +0900 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2017-10-20 08:25:14 +0900 [info]: #0 listening port port=24224 bind="0.0.0.0"
2017-10-20 08:25:14 +0900 [info]: #0 fluentd worker is now running worker=0
2017-10-20 08:29:06 +0900 [warn]: #0 detached forwarding server '' host="127.0.0.1" port=24227 phi=16.360984605887207 phi_threshold=16
2017-10-20 08:31:40 +0900 [warn]: #0 recovered forwarding server '' host="127.0.0.1" port=24227

So maybe the problem is outside of out_forward's alive-monitoring routine. Is there no "detached forwarding server" line in your log?

@repeatedly (Member)

If your network is sometimes unstable, setting phi_failure_detector false may help.
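
For example, a minimal sketch of where that option sits in an out_forward block (the match pattern, host, and port below are placeholders; the commented-out lines just point at related knobs):

<match test.**>
  @type forward
  # rely on plain heartbeat timeouts instead of the phi accuracy calculation
  phi_failure_detector false
  # heartbeat_interval 1   # how often heartbeats are sent
  # hard_timeout 60        # how long a silent node is tolerated before detaching
  <server>
    host 127.0.0.1
    port 24225
  </server>
</match>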

@GLStephen (Author)

@repeatedly I'm going to down one of the servers tomorrow morning and will capture a more complete log when I do.

@GLStephen (Author) commented Oct 20, 2017

@repeatedly Performed maintenance this morning and was able to catch it in the act; logs of the events are below. Yes, this is google-fluentd, but I've personally seen this happen with td-agent also. Clearly, downing the one server causes all of them to detach upstream. Why, I have no idea.

Also, the servers do not all have the same host; I was just hiding the IPs. They are internal, though, so I've included them below.

=================
=== Web/Forwarder

2017-10-20 09:46:49 +0000 [info]: adding match pattern="prod.**" type="forward"
2017-10-20 09:46:50 +0000 [info]: #0 adding forwarding server 'fluentd-groupb-1' host="10.142.0.53" port=24228 weight=60 plugin_id="object:3f8d9f375bc4"
2017-10-20 09:46:50 +0000 [info]: #0 adding forwarding server '10.142.0.54:24228' host="10.142.0.54" port=24228 weight=60 plugin_id="object:3f8d9f375bc4"
2017-10-20 09:46:50 +0000 [info]: #0 adding forwarding server '10.142.0.55:24228' host="10.142.0.55" port=24228 weight=60 plugin_id="object:3f8d9f375bc4"
2017-10-20 09:46:50 +0000 [info]: adding match pattern="fluent.**" type="null"
2017-10-20 09:46:50 +0000 [info]: adding match pattern="**" type="google_cloud"
2017-10-20 09:46:50 +0000 [info]: #0 Detected GCE platform
2017-10-20 09:46:50 +0000 [info]: #0 Logs viewer address: https://redacted//
2017-10-20 09:46:50 +0000 [info]: adding source type="unix"
2017-10-20 09:46:50 +0000 [info]: adding source type="prometheus"
2017-10-20 09:46:50 +0000 [info]: adding source type="prometheus_monitor"
2017-10-20 09:46:50 +0000 [info]: #0 starting fluentd worker pid=1033 ppid=1028 worker=0
2017-10-20 09:46:50 +0000 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2017-10-20 09:46:50 +0000 [info]: #0 listening fluent socket on /var/run/td-agent/td-agent.sock
2017-10-20 09:46:50 +0000 [info]: #0 fluentd worker is now running worker=0
2017-10-20 12:05:12 +0000 [warn]: #0 detached forwarding server '' host="10.142.0.55" port=24228 hard_timeout=true
=========== configuration changes to *one* of the aggregator servers; ~7 minutes pass ===========
=========== things come back up just fine, only one server is lost during the process ===========
2017-10-20 12:12:37 +0000 [warn]: #0 recovered forwarding server '' host="10.142.0.55" port=24228

=========== the first shutdown was real clean and the aggregator was clean ===========
=========== below is the second shutdown ===========
=========== in this shutdown the aggregator server deadlocks ===========
=========== somehow that downstream deadlock affects the status of *all* ===========
=========== of the servers in the set ===========

2017-10-20 12:40:20 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=31.065658695999446 slow_flush_log_threshold=20.0 plugin_id="object:3f8d9f375bc4"
2017-10-20 12:55:14 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=31.055869156998597 slow_flush_log_threshold=20.0 plugin_id="object:3f8d9f375bc4"
2017-10-20 12:56:09 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=31.072869617000833 slow_flush_log_threshold=20.0 plugin_id="object:3f8d9f375bc4"
=========== servers detach as a group ===========
2017-10-20 12:59:23 +0000 [warn]: #0 detached forwarding server 'fluentd-groupb-1' host="10.142.0.53" port=24228 hard_timeout=true
2017-10-20 12:59:23 +0000 [warn]: #0 detached forwarding server '' host="10.142.0.54" port=24228 hard_timeout=true
2017-10-20 12:59:23 +0000 [warn]: #0 detached forwarding server '' host="10.142.0.55" port=24228 hard_timeout=true
2017-10-20 13:00:25 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-10-20 13:00:26 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Errno::ETIMEDOUT error="Connection timed out - connect(2) for \"10.142.0.55\" port 24228"
  2017-10-20 13:00:25 +0000 [warn]: #0 /opt/google-fluentd/embedded/lib/ruby/gems/2.2.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/socket.rb:59:in `initialize'
  2017-10-20 13:00:25 +0000 [warn]: #0 /opt/google-fluentd/embedded/lib/ruby/gems/2.2.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/socket.rb:59:in `new'
  2017-10-20 13:00:25 +0000 [warn]: #0 /opt/google-fluentd/embedded/lib/ruby/gems/2.2.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/socket.rb:59:in `socket_create_tcp'
  2017-10-20 13:00:25 +0000 [warn]: #0 /opt/google-fluentd/embedded/lib/ruby/gems/2.2.0/gems/fluentd-0.14.21/lib/fluent/plugin/out_forward.rb:334:in `create_transfer_socket'
2017-10-20 13:00:25 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-10-20 13:00:26 +0000 chunk="55bf9ed1957edd3630d91985eb2be454" error_class=Errno::ETIMEDOUT error="Connection timed out - connect(2) for \"10.142.0.55\" port 24228"
2017-10-20 13:00:32 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2017-10-20 13:00:32 +0000 chunk="55bf9ed1957edd3630d91985eb2be454" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:00:32 +0000 [warn]: #0 suppressed same stacktrace
2017-10-20 13:00:32 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=2017-10-20 13:00:40 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:00:32 +0000 [warn]: #0 suppressed same stacktrace
2017-10-20 13:00:50 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=2017-10-20 13:00:50 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:00:50 +0000 [warn]: #0 suppressed same stacktrace
2017-10-20 13:01:20 +0000 [warn]: #0 failed to flush the buffer. retry_time=6 next_retry_seconds=2017-10-20 13:01:20 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:01:20 +0000 [warn]: #0 suppressed same stacktrace
2017-10-20 13:02:30 +0000 [warn]: #0 failed to flush the buffer. retry_time=7 next_retry_seconds=2017-10-20 13:02:30 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:02:30 +0000 [warn]: #0 suppressed same stacktrace
2017-10-20 13:04:43 +0000 [warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2017-10-20 13:04:43 +0000 chunk="55bf9ed18fa6b9be5b9f5ae0902bee88" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-10-20 13:04:43 +0000 [warn]: #0 suppressed same stacktrace
=========== servers recover as a group, only the one server was downed ===========
2017-10-20 13:08:30 +0000 [warn]: #0 recovered forwarding server 'fluentd-groupb-1' host="10.142.0.53" port=24228
2017-10-20 13:08:30 +0000 [warn]: #0 recovered forwarding server '' host="10.142.0.54" port=24228
2017-10-20 13:08:56 +0000 [warn]: #0 recovered forwarding server '' host="10.142.0.55" port=24228
2017-10-20 13:09:06 +0000 [warn]: #0 retry succeeded. chunk_id="55bf9ed18fa6b9be5b9f5ae0902bee88"



========================================================================
=== Aggregator Log - One Server of 3 Stopped and Started
========================================================================

=========== this is a clean shutdown, matches above in timing ===========

2017-10-20 12:04:10 +0000 [debug]: #0 calling after_shutdown on output plugin type=:bigquery plugin_id="bqPlugin1"
2017-10-20 12:04:10 +0000 [debug]: #0 calling after_shutdown on output plugin type=:bigquery plugin_id="bqPlugin2"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on input plugin type=:forward plugin_id="object:3f92aa9ec8bc"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin1"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin2"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin3"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin4"
2017-10-20 12:04:10 +0000 [debug]: #1 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin5"
2017-10-20 12:04:10 +0000 [info]: fluent/log.rb:316:info: Worker 1 finished with status 0

2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on input plugin type=:forward plugin_id="object:3fcec70ec1a0"
2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin1"
2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin2"
2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin3"
2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin4"
2017-10-20 12:04:24 +0000 [debug]: #0 calling terminate on output plugin type=:bigquery plugin_id="bqPlugin5"
2017-10-20 12:04:24 +0000 [info]: fluent/log.rb:316:info: Worker 0 finished with status 0

=========== the events below are the second shutdown above ===========
=========== they look like the deadlock we've seen and match the ===========
=========== recovery and multiple failures above ===========
=========== *for all the forwarders*, but only one was taken down ===========

2017-10-20 12:58:16 +0000 [info]: fluent/log.rb:316:info: Received graceful stop
2017-10-20 12:58:16 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 12:58:16 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 12:58:16 +0000 [debug]: #0 getting start to shutdown main process
2017-10-20 12:58:16 +0000 [debug]: #1 getting start to shutdown main process
2017-10-20 12:58:16 +0000 [info]: #1 fluentd worker is now stopping worker=1
2017-10-20 12:58:16 +0000 [info]: #1 shutting down fluentd worker worker=1
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on input plugin type=:forward plugin_id="object:3fecc2623a10"
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on output plugin type=:bigquery plugin_id="bqPlugin1"
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on output plugin type=:bigquery plugin_id="bqPlugin2"
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on output plugin type=:bigquery plugin_id="bqPlugin3"
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on output plugin type=:bigquery plugin_id="bqPlugin4"
2017-10-20 12:58:16 +0000 [debug]: #1 calling stop on output plugin type=:bigquery plugin_id="bqPlugin5"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown input plugin type=:forward plugin_id="object:3fecc2623a10"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown output plugin type=:bigquery plugin_id="bqPlugin1"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown output plugin type=:bigquery plugin_id="bqPlugin2"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown output plugin type=:bigquery plugin_id="bqPlugin3"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown output plugin type=:bigquery plugin_id="bqPlugin4"
2017-10-20 12:58:16 +0000 [debug]: #1 preparing shutdown output plugin type=:bigquery plugin_id="bqPlugin5"
2017-10-20 12:58:16 +0000 [info]: #0 fluentd worker is now stopping worker=0
2017-10-20 12:58:16 +0000 [info]: #0 shutting down fluentd worker worker=0
=========== this stuff repeats a bit ===========
2017-10-20 13:00:16 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:00:16 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:01:12 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:01:12 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:02:17 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:02:17 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:03:32 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:03:32 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:04:57 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:04:57 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:06:32 +0000 [debug]: #1 fluentd main process get SIGTERM
2017-10-20 13:06:32 +0000 [debug]: #0 fluentd main process get SIGTERM
2017-10-20 13:08:17 +0000 [warn]: #1 [bqPlugin2] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_1 thread=#<Thread:0x007fd9857a3288@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #0 [bqPlugin4] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007f4841aed9c8@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #0 [bqPlugin4] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_1 thread=#<Thread:0x007f4841aed6f8@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #1 [bqPlugin1] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007fd98593f0b0@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #0 [bqPlugin5] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:enqueue_thread thread=#<Thread:0x007f4841aed4a0@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #1 [bqPlugin1] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_1 thread=#<Thread:0x007fd98593ee30@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #0 [bqPlugin4] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_0 thread=#<Thread:0x007f48423b1758@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
2017-10-20 13:08:17 +0000 [warn]: #0 [bqPlugin13] thread doesn't exit correctly (killed or other reason) plugin=Fluent::Plugin::BigQueryOutput title=:flush_thread_1 thread=#<Thread:0x007f48422a1c28@/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:70 aborting> error=nil
=========== irrelevant log showing more deadlock-like behavior ===========
2017-10-20 13:08:17 +0000 [info]: fluent/log.rb:316:info: Worker 1 finished with signal SIGQUIT
2017-10-20 13:08:17 +0000 [info]: fluent/log.rb:316:info: Worker 0 finished with signal SIGQUIT
=========== configuration changes were made while the server was ===========
=========== winding down, so the restart is nearly immediate ===========
2017-10-20 13:08:33 +0000 [info]: parsing config file is succeeded path="/etc/td-agent/td-agent.conf"
2017-10-20 13:08:35 +0000 [warn]: [bqPlugin1] secondary type should be same with primary one primary="Fluent::Plugin::BigQueryOutput" secondary="Fluent::Plugin::FileOutput"
2017-10-20 13:08:35 +0000 [warn]: [bqPlugin2] secondary type should be same with primary one primary="Fluent::Plugin::BigQueryOutput" secondary="Fluent::Plugin::FileOutput"
2017-10-20 13:08:35 +0000 [warn]: [bqPlugin3] secondary type should be same with primary one primary="Fluent::Plugin::BigQueryOutput" secondary="Fluent::Plugin::FileOutput"
2017-10-20 13:08:35 +0000 [warn]: [bqPlugin4] secondary type should be same with primary one primary="Fluent::Plugin::BigQueryOutput" secondary="Fluent::Plugin::FileOutput"
2017-10-20 13:08:35 +0000 [warn]: [bqPlugin5] secondary type should be same with primary one primary="Fluent::Plugin::BigQueryOutput" secondary="Fluent::Plugin::FileOutput"
2017-10-20 13:08:35 +0000 [info]: using configuration file: <ROOT>

@repeatedly (Member)

I see. I will check the timeout case with heartbeat.

@mpeltonen

We see the same thing happening with SSL connections. At least in the SSL case, the reason seems to be that the SSL socket connect is blocking and has no timeout, so all heartbeat connects to the non-responding server block indefinitely.

I already tried a quick fix around sock.connect in socket_create_tls, but it seems something also needs to be done about the SO_LINGER setting, because closing the socket in the timeout case blocks in the close call. I didn't debug much further, but I suspect it may be because SO_LINGER is set to @send_timeout, which is quite long by default, and the SSLSocket already has data buffered to send at that stage. Not closing the socket on timeout at all seemed to cause a resource leak and made the whole fluentd process get stuck after a while.
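
For illustration only (this is not fluentd's socket helper, just a rough Ruby sketch of the bounded connect/handshake described above; the host, port, and timeout values are made-up placeholders):

require 'socket'
require 'openssl'

# Sketch: bound both the TCP connect and the TLS handshake so that a dead
# peer cannot block the caller (e.g. a heartbeat loop) indefinitely.
def tls_connect_with_timeout(host, port, connect_timeout: 5, handshake_timeout: 5)
  # TCP connect with an explicit timeout (raises Errno::ETIMEDOUT on expiry).
  tcp = Socket.tcp(host, port, connect_timeout: connect_timeout)

  ssl = OpenSSL::SSL::SSLSocket.new(tcp, OpenSSL::SSL::SSLContext.new)
  ssl.sync_close = true  # closing the SSL socket also closes the TCP socket

  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + handshake_timeout
  begin
    ssl.connect_nonblock
  rescue IO::WaitReadable, IO::WaitWritable => e
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raise Errno::ETIMEDOUT, "TLS handshake to #{host}:#{port}" if remaining <= 0
    readable = e.is_a?(IO::WaitReadable) ? [tcp] : nil
    writable = e.is_a?(IO::WaitWritable) ? [tcp] : nil
    IO.select(readable, writable, nil, remaining) or
      raise Errno::ETIMEDOUT, "TLS handshake to #{host}:#{port}"
    retry
  end
  ssl
rescue StandardError
  if tcp
    # Hard-close on failure: SO_LINGER on with timeout 0, so close itself
    # does not block trying to flush buffered data.
    tcp.setsockopt(Socket::SOL_SOCKET, Socket::SO_LINGER, [1, 0].pack('ii')) rescue nil
    tcp.close rescue nil
  end
  raise
end

Whether fluentd should hard-close like that is a separate question; the sketch only shows that the connect, the handshake, and the close can all be bounded.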

@GLStephen (Author)

@repeatedly I know this issue likely flies under the radar a bit, but it makes managing a cluster of fluentd servers reliably nearly impossible. If you can't down one server without effectively downing them all, leading to negative upstream effects, then you don't really have a fault-tolerant cluster. You have a single point of failure that happens to be spread over multiple machines.

@repeatedly (Member)

this issue makes managing a cluster of fluentd servers reliably nearly impossible.

Yeah, we need to fix it, and I have considered several ways of writing the patch.
A very simple patch for the hard_timeout case is below.

diff --git a/lib/fluent/plugin/out_forward.rb b/lib/fluent/plugin/out_forward.rb
index 2b5b4d7..24f790c 100644
--- a/lib/fluent/plugin/out_forward.rb
+++ b/lib/fluent/plugin/out_forward.rb
@@ -395,9 +395,6 @@ module Fluent::Plugin
 
     def on_timer
       @nodes.each {|n|
-        if n.tick
-          rebuild_weight_array
-        end
         begin
           log.trace "sending heartbeat", host: n.host, port: n.port, heartbeat_type: @heartbeat_type
           n.usock = @usock if @usock
@@ -407,6 +404,9 @@ module Fluent::Plugin
         rescue => e
           log.debug "unexpected error happen during heartbeat", host: n.host, port: n.port, heartbeat_type: @heartbeat_type, error: e
         end
+        if n.tick
+          rebuild_weight_array
+        end
       }
     end

The problem is that heartbeats are handled on only one thread, so the elapsed time of a previous heartbeat affects the other heartbeats.
Could you test the above patch? I can't reproduce your environment, so I'm not sure this patch fixes your case correctly.

There are several other approaches to this problem, e.g. calculating the precise elapsed time across heartbeats or launching a thread per server, but they seem heavy...
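
To make the ordering problem concrete, here is a toy Ruby model (this is not fluentd's actual out_forward code; the node names, timings, and the simplification that a successful send refreshes the node's heartbeat are all made up for illustration):

HARD_TIMEOUT = 60   # seconds without a heartbeat before a node is detached
BLOCK_TIME   = 75   # time wasted blocking on the unreachable node's heartbeat

Node = Struct.new(:name, :reachable, :last_heartbeat)

# Run the sequential heartbeat loop a few times, either checking expiry
# before sending the heartbeat (the pre-patch ordering via tick) or after
# (the patched ordering), and report which nodes got detached.
def run(check_before_send)
  now = 0
  nodes = [
    Node.new('groupb-1', true,  0),
    Node.new('groupb-2', true,  0),
    Node.new('groupb-3', false, 0),   # the server that was taken down
  ]
  detached = []

  3.times do
    nodes.each do |node|
      check = -> { detached << node.name if now - node.last_heartbeat > HARD_TIMEOUT }

      check.call if check_before_send     # tick first (pre-patch)
      if node.reachable
        node.last_heartbeat = now          # heartbeat succeeds immediately
      else
        now += BLOCK_TIME                  # heartbeat blocks the whole loop
      end
      check.call unless check_before_send  # tick after heartbeat (patched)
    end
  end
  detached.uniq
end

puts "tick before heartbeat: #{run(true).inspect}"
# => ["groupb-1", "groupb-2", "groupb-3"]  (healthy nodes detached too)
puts "tick after heartbeat:  #{run(false).inspect}"
# => ["groupb-3"]                          (only the downed node detached)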

@repeatedly (Member) commented Nov 1, 2017

Maybe mpeltonen's case is a different issue.
If you notice bad code in the socket handling/settings, please open a new issue.

@GLStephen (Author)

@repeatedly I will test this patch today and get back to you ASAP

repeatedly added a commit that referenced this issue Nov 7, 2017
Move tick check to after heartbeat for avoiding the impact of other node heartbeat. fix #1719
repeatedly added a commit that referenced this issue Nov 14, 2017
Move tick check to after heartbeat for avoiding the impact of other node heartbeat. fix #1719