Telegraf 1.3.0 stops sending any metrics after a few hours processing them ... maybe kafka's fault? #2870

Closed
nfirvine opened this issue May 31, 2017 · 9 comments

@nfirvine
Contributor

Bug report

(See https://community.influxdata.com/t/need-help-triaging-a-potential-telegraf-1-3-0-bug-with-kafka-consumer/1123 for original report.)

Since upgrading to Telegraf 1.3.0, we're encountering a recurring problem. Our setup:

  • pulling from Kafka with the kafka consumer input and writing to InfluxDB
  • 6 Telegraf consumers pulling from 6 Kafka partitions

After a few hours of running fine, one random Telegraf instance out of the 6 starts complaining in a loop that basic local collectors (cpu, mem, etc.) cannot finish collecting within the collection interval (20s):

2017-05-30T22:07:04Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (20s)
2017-05-30T22:07:04Z E! Error in plugin [inputs.kernel_vmstat]: took longer to collect than collection interval (20s)
2017-05-30T22:07:04Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (20s)
2017-05-30T22:07:06Z E! Error in plugin [inputs.netstat]: took longer to collect than collection interval (20s)
...

At this point, Telegraf stops consuming from Kafka and stops writing to InfluxDB (including metrics from the internal plugin).

A restart of Telegraf fixes the problem, and it churns away for hours more before the problem crops up again.

The log messages before the error loop begins show nothing out of the ordinary.

Relevant telegraf.conf:

[global_tags]
  dc = "den"
  env = "prd"
  role = "xm_telegraf_dataproc"

[agent]
  hostname = "redacted"
  interval = "20s"
  round_interval = true
  metric_buffer_limit = 100000
  flush_buffer_when_full = true
  collection_jitter = "15s"
  flush_interval = "20s"
  flush_jitter = "15s"
  debug = false
  quiet = false

#
# OUTPUTS:
#
[[outputs.influxdb]]
  database = "telegraf"
  password = "telegraf"
  urls = ["http://redacted:8086"]
  username = "telegraf"

#
# INPUTS:
#
[[inputs.conntrack]]
[[inputs.cpu]]
  fielddrop = ["time_*"]
  percpu = true
  totalcpu = true
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.internal]]
[[inputs.kernel]]
[[inputs.kernel_vmstat]]
[[inputs.mem]]
[[inputs.net]]
  drop = ["net_icmp"]
  interfaces = ["eth0"]
[[inputs.netstat]]
[[inputs.ping]]
  count = 1
  timeout = 1.0
  urls = ["redacted"]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

System info:

  • CentOS 7
  • Telegraf 1.3.0

Steps to reproduce:

Cannot repro reliably.

Additional info:

Stack trace, taken about 20 min after the problem started:
telegraf-stuck.gostack.txt

@danielnelson
Contributor

From the stack it looks like the output is deadlocked:

goroutine 26116 [semacquire, 81 minutes]:
sync.runtime_SemacquireMutex(0xc420193b2c)
	/usr/local/go/src/runtime/sema.go:62 +0x34
sync.(*Mutex).Lock(0xc420193b28)
	/usr/local/go/src/sync/mutex.go:87 +0x9d
github.com/influxdata/telegraf/internal/buffer.(*Buffer).Add(0xc420193b20, 0xc4203c6000, 0x3e8, 0x3e8)
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/internal/buffer/buffer.go:48 +0xfe
github.com/influxdata/telegraf/internal/models.(*RunningOutput).Write(0xc4202440a0, 0x13f43a0, 0xc4206a7040)
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/internal/models/running_output.go:148 +0x30e
github.com/influxdata/telegraf/agent.(*Agent).flush.func1(0xc4206a7040, 0xc4202440a0)
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:238 +0x59
created by github.com/influxdata/telegraf/agent.(*Agent).flush
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:243 +0xa5
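
For illustration only, here is a minimal, self-contained Go sketch of the pattern this stack suggests; the Buffer type and the hanging write callback are invented for the example and are not Telegraf's actual buffer code. One goroutine holds the buffer mutex while its output write never returns, so every later Add() parks in sync.Mutex.Lock the way goroutine 26116 does above.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Buffer is a stand-in for a metric buffer guarded by a single mutex.
type Buffer struct {
	mu      sync.Mutex
	metrics []string
}

// Add appends a metric; it blocks if another goroutine holds mu.
func (b *Buffer) Add(m string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.metrics = append(b.metrics, m)
}

// Flush drains the buffer while holding mu. If the write callback
// never returns, the mutex is never released.
func (b *Buffer) Flush(write func([]string)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	write(b.metrics) // hypothetical output write that hangs
	b.metrics = nil
}

func main() {
	b := &Buffer{}

	// A flush whose output write blocks forever (stands in for a stuck output).
	go b.Flush(func([]string) { select {} })
	time.Sleep(100 * time.Millisecond)

	// This Add never returns: it is parked waiting on b.mu,
	// which is what the "81 minutes" goroutine in the stack looks like.
	done := make(chan struct{})
	go func() { b.Add("cpu usage_idle=99"); close(done) }()

	select {
	case <-done:
		fmt.Println("Add returned (no deadlock)")
	case <-time.After(time.Second):
		fmt.Println("Add is blocked: buffer mutex held by a stuck flush")
	}
}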

Do you see anything in the InfluxDB logs?

@nfirvine
Contributor Author

nfirvine commented May 31, 2017

No, nothing abnormal. After trimming out queries and httpd, it's basically this:

May 30 20:55:00 redacted influxd[55465]: [I] 2017-05-30T20:55:00Z executing continuous query kafka_offset_rate (2017-05-30 20:54:00 +0000 UTC to 2017-05-30 20:55:00 +0000 UTC) service=continuous_querier
May 30 20:55:01 redacted influxd[55465]: [I] 2017-05-30T20:55:01Z compacted level 2 group (0) into /var/lib/influxdb/data/telegraf/telegraf/267/000002468-000000003.tsm.tmp (#0) engine=tsm1
May 30 20:55:01 redacted influxd[55465]: [I] 2017-05-30T20:55:01Z compacted level 2 2 files into 1 files in 26.917691534s engine=tsm1
May 30 20:55:01 redacted influxd[55465]: [I] 2017-05-30T20:55:01Z finished continuous query kafka_offset_rate (2017-05-30 20:54:00 +0000 UTC to 2017-05-30 20:55:00 +0000 UTC) in 528.868023ms service=continuous_querier
May 30 20:55:28 redacted influxd[55465]: [I] 2017-05-30T20:55:28Z beginning level 1 compaction of group 0, 2 TSM files engine=tsm1
May 30 20:55:28 redacted influxd[55465]: [I] 2017-05-30T20:55:28Z compacting level 1 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002469-000000001.tsm (#0) engine=tsm1
May 30 20:55:28 redacted influxd[55465]: [I] 2017-05-30T20:55:28Z compacting level 1 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002470-000000001.tsm (#1) engine=tsm1
May 30 20:55:28 redacted influxd[55465]: [I] 2017-05-30T20:55:28Z Snapshot for path /var/lib/influxdb/data/telegraf/telegraf/267 written in 7.171588542s engine=tsm1
May 30 20:55:47 redacted influxd[55465]: [I] 2017-05-30T20:55:47Z compacted level 1 group (0) into /var/lib/influxdb/data/telegraf/telegraf/267/000002470-000000002.tsm.tmp (#0) engine=tsm1
May 30 20:55:47 redacted influxd[55465]: [I] 2017-05-30T20:55:47Z compacted level 1 2 files into 1 files in 18.70101496s engine=tsm1
May 30 20:56:00 redacted influxd[55465]: [I] 2017-05-30T20:56:00Z executing continuous query kafka_offset_rate (2017-05-30 20:55:00 +0000 UTC to 2017-05-30 20:56:00 +0000 UTC) service=continuous_querier
May 30 20:56:00 redacted influxd[55465]: [I] 2017-05-30T20:56:00Z finished continuous query kafka_offset_rate (2017-05-30 20:55:00 +0000 UTC to 2017-05-30 20:56:00 +0000 UTC) in 1.995773ms service=continuous_querier
May 30 20:56:07 redacted influxd[55465]: [I] 2017-05-30T20:56:07Z compacted full group (0) into /var/lib/influxdb/data/telegraf/telegraf/267/000002464-000000005.tsm.tmp (#0) engine=tsm1
May 30 20:56:07 redacted influxd[55465]: [I] 2017-05-30T20:56:07Z compacted full 2 files into 1 files in 2m37.272165524s engine=tsm1
May 30 20:56:10 redacted influxd[55465]: [I] 2017-05-30T20:56:10Z Snapshot for path /var/lib/influxdb/data/telegraf/telegraf/267 written in 7.908628018s engine=tsm1
May 30 20:56:45 redacted influxd[55465]: [I] 2017-05-30T20:56:45Z beginning level 1 compaction of group 0, 2 TSM files engine=tsm1
May 30 20:56:45 redacted influxd[55465]: [I] 2017-05-30T20:56:45Z compacting level 1 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002471-000000001.tsm (#0) engine=tsm1
May 30 20:56:45 redacted influxd[55465]: [I] 2017-05-30T20:56:45Z compacting level 1 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002472-000000001.tsm (#1) engine=tsm1
May 30 20:56:45 redacted influxd[55465]: [I] 2017-05-30T20:56:45Z Snapshot for path /var/lib/influxdb/data/telegraf/telegraf/267 written in 6.127200818s engine=tsm1
May 30 20:57:00 redacted influxd[55465]: [I] 2017-05-30T20:57:00Z compacted level 1 group (0) into /var/lib/influxdb/data/telegraf/telegraf/267/000002472-000000002.tsm.tmp (#0) engine=tsm1
May 30 20:57:00 redacted influxd[55465]: [I] 2017-05-30T20:57:00Z compacted level 1 2 files into 1 files in 14.716663446s engine=tsm1
May 30 20:57:00 redacted influxd[55465]: [I] 2017-05-30T20:57:00Z beginning level 2 compaction of group 0, 2 TSM files engine=tsm1
May 30 20:57:00 redacted influxd[55465]: [I] 2017-05-30T20:57:00Z compacting level 2 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002470-000000002.tsm (#0) engine=tsm1
May 30 20:57:00 redacted influxd[55465]: [I] 2017-05-30T20:57:00Z compacting level 2 group (0) /var/lib/influxdb/data/telegraf/telegraf/267/000002472-000000002.tsm (#1) engine=tsm1

We do have one measurement exceeding InfluxDB's max-values-per-tag limit (100000), but I can't see how that would be related.
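
(For reference, the limit in question is InfluxDB's tag cardinality guard, max-values-per-tag; assuming the default influxdb.conf it looks like this:)

[data]
  # Writes that would push a tag key past this many distinct values are rejected (default shown).
  max-values-per-tag = 100000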

@nfirvine
Contributor Author

It seems to be affecting the same VM out of the 6 Telegraf instances. We hit the problem again after only 10 minutes of running since the last restart. I'm going to reboot the server to see if it makes a difference.

@danielnelson
Contributor

Crossing my fingers

@nfirvine
Contributor Author

nfirvine commented Jun 1, 2017

Unfortunately that didn't fix it :( About 90 minutes after rebooting, one of the nodes hit the problem again. Got the stack trace here:

telegraf-stuck2.gostack.txt

This means I'll have to downgrade.

@danielnelson
Contributor

Anything in dmesg?

@nfirvine
Contributor Author

nfirvine commented Jun 1, 2017

Nope.

@danielnelson
Contributor

Is it possible to narrow down the plugins? I'm especially interested in whether it still occurs if the influxdb output is replaced with a file output, and also if you remove the kafka consumer plugin.
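
For the file-output test, a minimal stanza along these lines (the path is just a placeholder) could replace the [[outputs.influxdb]] section in telegraf.conf:

[[outputs.file]]
  files = ["/tmp/telegraf-metrics.out"]
  data_format = "influx"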

@nfirvine
Contributor Author

nfirvine commented Jun 2, 2017

Sorry, I've downgraded to 1.2 now. Reproducing the problem is hard enough (it takes hours or days, possibly depending on volume), let alone reproducing and bisecting, so I'm going to have to step away from this issue for the time being. Will close for now.
