Telegraf 1.3.0 stops sending any metrics after a few hours processing them ... maybe kafka's fault? #2870

Closed
nfirvine opened this issue May 31, 2017 · 9 comments

@nfirvine
Contributor

Bug report

(See https://community.influxdata.com/t/need-help-triaging-a-potential-telegraf-1-3-0-bug-with-kafka-consumer/1123 for original report.)

Since upgrading to Telegraf 1.3.0, we're encountering a recurring problem. Our setup:

  • pulling from Kafka with the kafka consumer input and writing to InfluxDB
  • 6 Telegraf consumers pulling from 6 Kafka partitions

After a few hours of running fine, one random Telegraf instance out of the 6 starts complaining in a loop that basic local collectors (cpu, mem, etc.) cannot finish collecting within the collection interval (20s):

2017-05-30T22:07:04Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (20s)
2017-05-30T22:07:04Z E! Error in plugin [inputs.kernel_vmstat]: took longer to collect than collection interval (20s)
2017-05-30T22:07:04Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (20s)
2017-05-30T22:07:06Z E! Error in plugin [inputs.netstat]: took longer to collect than collection interval (20s)
...

At this point, Telegraf stops consuming from Kafka and stops writing to InfluxDB (including metrics from the internal plugin).

A restart of Telegraf fixes the problem, and it churns away for hours more before the problem crops up again.

The log messages before the error loop begins show nothing out of the ordinary.

Relevant telegraf.conf:

[global_tags]
  dc = "den"
  env = "prd"
  role = "xm_telegraf_dataproc"

[agent]
  hostname = "redacted"
  interval = "20s"
  round_interval = true
  metric_buffer_limit = 100000
  flush_buffer_when_full = true
  collection_jitter = "15s"
  flush_interval = "20s"
  flush_jitter = "15s"
  debug = false
  quiet = false

#
# OUTPUTS:
#
[[outputs.influxdb]]
  database = "telegraf"
  password = "telegraf"
  urls = ["http://redacted:8086"]
  username = "telegraf"

#
# INPUTS:
#
[[inputs.conntrack]]
[[inputs.cpu]]
  fielddrop = ["time_*"]
  percpu = true
  totalcpu = true
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.internal]]
[[inputs.kernel]]
[[inputs.kernel_vmstat]]
[[inputs.mem]]
[[inputs.net]]
  drop = ["net_icmp"]
  interfaces = ["eth0"]
[[inputs.netstat]]
[[inputs.ping]]
  count = 1
  timeout = 1.0
  urls = ["redacted"]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

System info:

  • CentOS 7
  • Telegraf 1.3.0

Steps to reproduce:

Cannot repro reliably.

Additional info:

Stack trace, taken about 20 min after the problem started:
telegraf-stuck.gostack.txt

@danielnelson
Contributor

From the stack it looks like the output is deadlocked:

goroutine 26116 [semacquire, 81 minutes]:
sync.runtime_SemacquireMutex(0xc420193b2c)
	/usr/local/go/src/runtime/sema.go:62 +0x34
sync.(*Mutex).Lock(0xc420193b28)
	/usr/local/go/src/sync/mutex.go:87 +0x9d
github.com/influxdata/telegraf/internal/buffer.(*Buffer).Add(0xc420193b20, 0xc4203c6000, 0x3e8, 0x3e8)
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/internal/buffer/buffer.go:48 +0xfe
github.com/influxdata/telegraf/internal/models.(*RunningOutput).Write(0xc4202440a0, 0x13f43a0, 0xc4206a7040)
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/internal/models/running_output.go:148 +0x30e
github.com/influxdata/telegraf/agent.(*Agent).flush.func1(0xc4206a7040, 0xc4202440a0)
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:238 +0x59
created by github.com/influxdata/telegraf/agent.(*Agent).flush
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:243 +0xa5
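
For illustration only, here is a minimal, self-contained Go sketch of the pattern this stack suggests; the Buffer type and the hanging write callback are invented for the example and are not Telegraf's actual buffer code. One goroutine holds the buffer mutex while its output write never returns, so every later Add() parks in sync.Mutex.Lock the way goroutine 26116 does above.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Buffer is a stand-in for a metric buffer guarded by a single mutex.
type Buffer struct {
	mu      sync.Mutex
	metrics []string
}

// Add appends a metric; it blocks if another goroutine holds mu.
func (b *Buffer) Add(m string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.metrics = append(b.metrics, m)
}

// Flush drains the buffer while holding mu. If the write callback
// never returns, the mutex is never released.
func (b *Buffer) Flush(write func([]string)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	write(b.metrics) // hypothetical output write that hangs
	b.metrics = nil
}

func main() {
	b := &Buffer{}

	// A flush whose output write blocks forever (stands in for a stuck output).
	go b.Flush(func([]string) { select {} })
	time.Sleep(100 * time.Millisecond)

	// This Add never returns: it is parked waiting on b.mu,
	// which is what the "81 minutes" goroutine in the stack looks like.
	done := make(chan struct{})
	go func() { b.Add("cpu usage_idle=99"); close(done) }()

	select {
	case <-done:
		fmt.Println("Add returned (no deadlock)")
	case <-time.After(time.Second):
		fmt.Println("Add is blocked: buffer mutex held by a stuck flush")
	}
}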

Do you see anything in the InfluxDB logs?

@nfirvine
Contributor Author

nfirvine commented May 31, 2017

No, nothing abnormal. After trimming out queries and httpd, it's basically this:

May 30 20:55:00 redacted influxd[55465]: [I] 2017-05-30T20:55:00Z executing continuous query kafka_offset_rate (2017-05-30 20:54:00 +0000 UTC to 2017-05-30 20:55:00 +0000 UTC) service=continuous_querier
May 30 20:55:01 redacted influxd[55465]: [I] 2017-05-30T20:55:01Z compacted level 2 group (0) into /var/lib/influxdb/data/telegraf/telegraf/267/000002468-000000003.tsm.tmp (#0) engine=tsm1
May 30 20:55:01 redacted influxd[55465]: [I] 2017-05-30T20:55:01Z compacted level 2 2 files into 1 files in 26.917691534s engine=tsm1
May 30 20:55:01 redacted influxd[55465]: [I] 2017-05-30T20:55:01Z finished continuous query kafka_offset_rate (2017-05-30 20:54:00 +0000 UTC to 2017-05-30 20:55:00 +0000 UTC) in 528.868023ms service=continuous_querier
May 30 20:55:28 redacted influxd[55465]: [I] 2017-05-30T20:55:28Z beginning level 1 compaction of group 0, 2 TSM files engine=tsm1
May 30 20:55:28 redacted influxd[55465]: [I] 2017-05-30T20:55:28Z compacting level 1 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002469-000000001.tsm (#0) engine=tsm1
May 30 20:55:28 redacted influxd[55465]: [I] 2017-05-30T20:55:28Z compacting level 1 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002470-000000001.tsm (#1) engine=tsm1
May 30 20:55:28 redacted influxd[55465]: [I] 2017-05-30T20:55:28Z Snapshot for path /var/lib/influxdb/data/telegraf/telegraf/267 written in 7.171588542s engine=tsm1
May 30 20:55:47 redacted influxd[55465]: [I] 2017-05-30T20:55:47Z compacted level 1 group (0) into /var/lib/influxdb/data/telegraf/telegraf/267/000002470-000000002.tsm.tmp (#0) engine=tsm1
May 30 20:55:47 redacted influxd[55465]: [I] 2017-05-30T20:55:47Z compacted level 1 2 files into 1 files in 18.70101496s engine=tsm1
May 30 20:56:00 redacted influxd[55465]: [I] 2017-05-30T20:56:00Z executing continuous query kafka_offset_rate (2017-05-30 20:55:00 +0000 UTC to 2017-05-30 20:56:00 +0000 UTC) service=continuous_querier
May 30 20:56:00 redacted influxd[55465]: [I] 2017-05-30T20:56:00Z finished continuous query kafka_offset_rate (2017-05-30 20:55:00 +0000 UTC to 2017-05-30 20:56:00 +0000 UTC) in 1.995773ms service=continuous_querier
May 30 20:56:07 redacted influxd[55465]: [I] 2017-05-30T20:56:07Z compacted full group (0) into /var/lib/influxdb/data/telegraf/telegraf/267/000002464-000000005.tsm.tmp (#0) engine=tsm1
May 30 20:56:07 redacted influxd[55465]: [I] 2017-05-30T20:56:07Z compacted full 2 files into 1 files in 2m37.272165524s engine=tsm1
May 30 20:56:10 redacted influxd[55465]: [I] 2017-05-30T20:56:10Z Snapshot for path /var/lib/influxdb/data/telegraf/telegraf/267 written in 7.908628018s engine=tsm1
May 30 20:56:45 redacted influxd[55465]: [I] 2017-05-30T20:56:45Z beginning level 1 compaction of group 0, 2 TSM files engine=tsm1
May 30 20:56:45 redacted influxd[55465]: [I] 2017-05-30T20:56:45Z compacting level 1 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002471-000000001.tsm (#0) engine=tsm1
May 30 20:56:45 redacted influxd[55465]: [I] 2017-05-30T20:56:45Z compacting level 1 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002472-000000001.tsm (#1) engine=tsm1
May 30 20:56:45 redacted influxd[55465]: [I] 2017-05-30T20:56:45Z Snapshot for path /var/lib/influxdb/data/telegraf/telegraf/267 written in 6.127200818s engine=tsm1
May 30 20:57:00 redacted influxd[55465]: [I] 2017-05-30T20:57:00Z compacted level 1 group (0) into /var/lib/influxdb/data/telegraf/telegraf/267/000002472-000000002.tsm.tmp (#0) engine=tsm1
May 30 20:57:00 redacted influxd[55465]: [I] 2017-05-30T20:57:00Z compacted level 1 2 files into 1 files in 14.716663446s engine=tsm1
May 30 20:57:00 redacted influxd[55465]: [I] 2017-05-30T20:57:00Z beginning level 2 compaction of group 0, 2 TSM files engine=tsm1
May 30 20:57:00 redacted influxd[55465]: [I] 2017-05-30T20:57:00Z compacting level 2 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002470-000000002.tsm (#0) engine=tsm1
May 30 20:57:00 redacted influxd[55465]: [I] 2017-05-30T20:57:00Z compacting level 2 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002472-000000002.tsm (#1) engine=tsm1

We do have one measurement exceeding InfluxDB's max-values-per-tag limit (100000), but I can't see how that would be related.
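
(For reference, the limit in question is InfluxDB's tag cardinality guard, max-values-per-tag; assuming the default influxdb.conf it looks like this:)

[data]
  # Writes that would push a tag key past this many distinct values are rejected (default shown).
  max-values-per-tag = 100000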

@nfirvine
Contributor Author

It seems to be affecting the same VM out of the 6 Telegraf instances. We hit the problem again after only 10 minutes of running since the last restart. I'm going to reboot the server to see if it makes a difference.

@danielnelson
Contributor

Crossing my fingers

@nfirvine
Contributor Author

nfirvine commented Jun 1, 2017

Unfortunately that didn't fix it :( About 90 minutes after rebooting, one of the nodes hit the problem again. Got the stack trace here:

telegraf-stuck2.gostack.txt

This means I'll have to downgrade.

@danielnelson
Contributor

Anything in dmesg?

@nfirvine
Contributor Author

nfirvine commented Jun 1, 2017

Nope.

@danielnelson
Contributor

Is it possible to narrow down the plugins? I'm especially interested in whether it still occurs if the influxdb output is replaced with a file output, and also if you remove the kafka consumer plugin.
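
For the file-output test, a minimal stanza along these lines (the path is just a placeholder) could replace the [[outputs.influxdb]] section in telegraf.conf:

[[outputs.file]]
  files = ["/tmp/telegraf-metrics.out"]
  data_format = "influx"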

@nfirvine
Contributor Author

nfirvine commented Jun 2, 2017

Sorry, I've downgraded to 1.2 now. Reproducing the problem is hard enough (it takes hours or days, possibly depending on volume), let alone reproducing and bisecting, so I'm going to have to step away from this issue for the time being. Will close for now.
