Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NATS ouput not reconnecting on failure #1953

Closed
m4ce opened this issue Oct 26, 2016 · 9 comments
Closed

NATS ouput not reconnecting on failure #1953

m4ce opened this issue Oct 26, 2016 · 9 comments
Labels
bug unexpected problem or unintended behavior
Milestone

Comments

@m4ce
Copy link
Contributor

m4ce commented Oct 26, 2016

Looks like if the NATS server happens to be unreachable and later comes back, telegraf never recovers:

2016/10/26 10:35:10 I! Output [nats] buffer fullness: 10045 / 10000 metrics. Total gathered metrics: 650575. Total dropped metrics: 9125.
2016/10/26 10:35:10 E! Error writing to output [nats]: FAILED to send NATS message: nats: connection closed
@sparrc
Copy link
Contributor

sparrc commented Oct 26, 2016

can you provide your config file, telegraf version, and steps to reproduce?

@wallyqs
Copy link

wallyqs commented Oct 26, 2016

nats client has MaxReconnect which if set to -1 means that it would never stop reconnecting, maybe could be adopted for the output plugin for cases like this?

@m4ce
Copy link
Contributor Author

m4ce commented Oct 26, 2016

The sender (version 1.1.0 - master):

# Send telegraf measurements to NATS
[[outputs.nats]]
  servers = ["nats://localhost:4222"]
  subject = "telegraf"
  data_format = "influx"

The receiver (version 1.0.1):

[[inputs.nats_consumer]]
  ## urls of NATS servers
  servers = ["nats://localhost:4222"]

  ## Use Transport Layer Security
  secure = false

  ## subject(s) to consume
  subjects = ["telegraf"]

  ## name a queue group
  queue_group = "telegraf_consumers"

  data_format = "influx"

NATS server running:

2016/10/26 12:58:40 I! Output [nats] buffer fullness: 81 / 10000 metrics. Total gathered metrics: 81. Total dropped metrics: 0.
2016/10/26 12:58:40 I! Output [nats] wrote batch of 81 metrics in 177.151µs

Stopping NATS server:

# docker stop nats
nats
# date
Wed Oct 26 12:59:09 UTC 2016

Telegraf receiver reports NATS server is down:

2016/10/26 12:40:04 Service for input nats_consumer failed to start, exiting
nats: no servers available for connection

Telegraf sender keeps sending although NATS server is down (not sure why this happens):

2016/10/26 12:59:10 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 216. Total dropped metrics: 0.
2016/10/26 12:59:10 I! Output [nats] wrote batch of 45 metrics in 88.885µs
2016/10/26 12:59:20 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 261. Total dropped metrics: 0.
2016/10/26 12:59:20 I! Output [nats] wrote batch of 45 metrics in 82.019µs
2016/10/26 12:59:30 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 306. Total dropped metrics: 0.
2016/10/26 12:59:30 I! Output [nats] wrote batch of 45 metrics in 104.788µs
2016/10/26 12:59:40 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 351. Total dropped metrics: 0.
2016/10/26 12:59:40 I! Output [nats] wrote batch of 45 metrics in 68.794µs
2016/10/26 12:59:50 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 396. Total dropped metrics: 0.
2016/10/26 12:59:50 I! Output [nats] wrote batch of 45 metrics in 110.215µs
2016/10/26 13:00:00 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 441. Total dropped metrics: 0.
2016/10/26 13:00:00 I! Output [nats] wrote batch of 45 metrics in 164.256µs
2016/10/26 13:00:10 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 486. Total dropped metrics: 0.
2016/10/26 13:00:10 I! Output [nats] wrote batch of 45 metrics in 81.29µs
2016/10/26 13:00:20 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 531. Total dropped metrics: 0.
2016/10/26 13:00:20 I! Output [nats] wrote batch of 45 metrics in 74.832µs
2016/10/26 13:00:30 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 576. Total dropped metrics: 0.
2016/10/26 13:00:30 I! Output [nats] wrote batch of 45 metrics in 73.991µs
2016/10/26 13:00:40 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 621. Total dropped metrics: 0.
2016/10/26 13:00:40 I! Output [nats] wrote batch of 45 metrics in 78.428µs
2016/10/26 13:00:50 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 666. Total dropped metrics: 0.
2016/10/26 13:00:50 I! Output [nats] wrote batch of 45 metrics in 168.461µs
2016/10/26 13:01:00 I! Output [nats] buffer fullness: 45 / 10000 metrics. Total gathered metrics: 711. Total dropped metrics: 0.
2016/10/26 13:01:00 I! Output [nats] wrote batch of 45 metrics in 72.772µs

After a minute, telegraf sender detects the NATS server is down and start reporting the errors:

2016/10/26 13:01:10 E! Error writing to output [nats]: FAILED to send NATS message: nats: connection closed
2016/10/26 13:01:20 I! Output [nats] buffer fullness: 90 / 10000 metrics. Total gathered metrics: 801. Total dropped metrics: 0.
2016/10/26 13:01:20 E! Error writing to output [nats]: FAILED to send NATS message: nats: connection closed
2016/10/26 13:01:30 I! Output [nats] buffer fullness: 135 / 10000 metrics. Total gathered metrics: 846. Total dropped metrics: 0.
2016/10/26 13:01:30 E! Error writing to output [nats]: FAILED to send NATS message: nats: connection closed
2016/10/26 13:01:40 I! Output [nats] buffer fullness: 180 / 10000 metrics. Total gathered metrics: 891. Total dropped metrics: 0.
2016/10/26 13:01:40 E! Error writing to output [nats]: FAILED to send NATS message: nats: connection closed
2016/10/26 13:01:50 I! Output [nats] buffer fullness: 225 / 10000 metrics. Total gathered metrics: 936. Total dropped metrics: 0.
2016/10/26 13:01:50 E! Error writing to output [nats]: FAILED to send NATS message: nats: connection closed

Starting NATS server back up:

# docker start nats
nats
# date
Wed Oct 26 13:02:49 UTC 2016

Both telegraf sender and receiver never recover.

@sparrc sparrc added the bug unexpected problem or unintended behavior label Oct 26, 2016
@sparrc sparrc added this to the 1.2.0 milestone Oct 26, 2016
@sparrc
Copy link
Contributor

sparrc commented Oct 26, 2016

OK, looks like a bug, PRs appreciated

@sparrc sparrc modified the milestones: 1.1.0, 1.2.0 Oct 26, 2016
@m4ce
Copy link
Contributor Author

m4ce commented Oct 26, 2016

@sparrc, actually the nats consumer recovers at some point, only the sender does not. Sorry, missed to see that.

@pires
Copy link
Contributor

pires commented Oct 26, 2016

@wallyqs so why not set the default to -1? Anyway, it's a simple fix, on it.

@sparrc
Copy link
Contributor

sparrc commented Oct 26, 2016

@pires please make the change in the nats output and the nats_consumer input as well,

thanks!

@pires
Copy link
Contributor

pires commented Oct 26, 2016

@sparrc 👍

sparrc pushed a commit that referenced this issue Oct 26, 2016
* NATS output plug-in now retries to reconnect forever after a lost connection.

* NATS input plug-in now retries to reconnect forever after a lost connection.

* Fixes #1953
@wallyqs
Copy link

wallyqs commented Oct 26, 2016

Thanks @pires :) I think it is a case by case thing... I thought that for telegraf case not giving up when forwarding output was expected so suggested the never stop reconnecting option, but yeah by default client stops reconnecting when no servers available on max retries then closing the connection, triggering the close callback and letting user decide how to handle that case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

4 participants