Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'systemctl restart telegraf' failing (related to logparser ?) #4610

Closed
bolek2000 opened this issue Aug 29, 2018 · 11 comments · Fixed by #4938 or #5169
Closed

'systemctl restart telegraf' failing (related to logparser ?) #4610

bolek2000 opened this issue Aug 29, 2018 · 11 comments · Fixed by #4938 or #5169
Assignees
Labels
area/tail bug unexpected problem or unintended behavior
Milestone

Comments

@bolek2000
Copy link

bolek2000 commented Aug 29, 2018

Relevant telegraf.conf:

Agent runs some basic system metric plus one logparser that parses a preprocessed (sorted) nginx acces.log. The preprocessing script runs every 10 minutes and extracts 10 minutes (in the past, to make sure nginx buffer is flushed) of logdata to a sorted file for ingestion into telegraf logparser (every run about 10.000 lines)

Logparser config (one file in /etc/telegraf/telegraf.d):

[[inputs.logparser]]
    tagexclude = ["path"]
   files = ["/var/log/nginx/access.log.sorted.10min"]
   from_beginning = true
   watch_method = "inotify"
   [inputs.logparser.grok]
        patterns = ['''
%{CLIENT:client_ip:drop} %{NOTSPACE:ident:drop} %{NOTSPACE:auth:drop} \[%{HTTPDATE:ts:ts-httpd}\] \"%{WORD:http_method:drop} %{PATHLEVEL1:pathlevel1:tag}(\/|\?)?.* HTTP\/%{NUMBER:http_version:drop}\" %{RESPONSE_CODE} (?:%{NUMBER:resp_bytes:int}|-) \"%{DATA:referer:drop}\" \"%{DATA:user_agent:drop}\" +(?:%{NUMBER:request_time:float}|-) %{LOAD_BALANCING_TIMES:lb_times:drop}(?:%{NUMBER:upstream_time:float}|-) %{GREEDYDATA:pipe:drop}+ \"%{GREEDYDATA:gzip_ratio:drop}\" \"(www.)?%{DOMAIN_MATCH:domain:tag}~%{DATA:mobile:tag} %{DATA:bot:tag} %{DATA:cache_status:tag}\"
        ''']
        custom_patterns = '''
        DOMAIN_MATCH (exampe1.com|example2.com|example3.com)
        PATHLEVEL1 \/([^ "/?]+)?
        LOAD_BALANCING_TIMES (((\d+\.\d+),\s)+)?
        '''
     measurement = "proxy_access_log"

System info:

telegraf 1.7.3 on Ubuntu 18.04 server

Steps to reproduce:

  1. systemctl restart telegraf

Expected behavior:

Restart succeeds

Actual behavior:

sometimes (not always) the restart doesn't finish and the process gets killed by systemd

Additional info:

Journal:

Aug 28 18:10:01 us-proxy1 CRON[30491]: (root) CMD (/usr/local/bin/ws/prep_telegraf_access_previous_10min.sh)
Aug 28 18:20:01 us-proxy1 CRON[31065]: (root) CMD (/usr/local/bin/ws/prep_telegraf_access_previous_10min.sh)
Aug 28 18:21:32 us-proxy1 systemd[1]: telegraf.service: State 'stop-sigterm' timed out. Killing.
Aug 28 18:21:32 us-proxy1 systemd[1]: telegraf.service: Killing process 30512 (telegraf) with signal SIGKILL.
Aug 28 18:21:32 us-proxy1 systemd[1]: telegraf.service: Killing process 30524 (telegraf) with signal SIGKILL.
Aug 28 18:21:32 us-proxy1 systemd[1]: telegraf.service: Main process exited, code=killed, status=9/KILL
Aug 28 18:21:32 us-proxy1 systemd[1]: telegraf.service: Failed with result 'timeout'.
Aug 28 18:21:32 us-proxy1 systemd[1]: telegraf.service: Triggering OnFailure= dependencies.
Aug 28 18:21:32 us-proxy1 systemd-unit-status-mail.sh[31089]: Status mail sent to: root for unit: telegraf.service

In the Log below one can see that telegraf restart is triggered 18:20:02 (from the script in CRON above) and then logparser tries to parse a line even before it flushes the cashed metrics. Also the Log-entry doesn't exist in that way and is somehow cut off. This is always seen when this problem occurs, so seems to be related to logparser.
After flushing the old metrics it takes the Systemd TimeoutSec before it gets killed and restarted (successfully).

telegraf.log:

2018-08-28T18:19:51Z D! Output [influxdb] wrote batch of 65 metrics in 271.871656ms
2018-08-28T18:20:02Z D! Output [influxdb] buffer fullness: 77 / 10000 metrics. 
2018-08-28T18:20:02Z D! Grok no match found for: " \"GET /some-url/file_mobile.png HTTP/2.0\" 200 10786 \"https://www.example.com/some-URL\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36\" +0.000 - .+ \"-\" \"www.example.com~0 0 HIT\""
2018-08-28T18:20:02Z I! Hang on, flushing any cached metrics before shutdown
2018-08-28T18:20:02Z D! Output [influxdb] buffer fullness: 5 / 10000 metrics. 
2018-08-28T18:20:02Z D! Output [influxdb] wrote batch of 77 metrics in 266.754045ms
2018-08-28T18:20:02Z D! Output [influxdb] wrote batch of 5 metrics in 133.367373ms
2018-08-28T18:21:32Z D! Attempting connection to output: influxdb
2018-08-28T18:21:32Z D! Successfully connected to output: influxdb
2018-08-28T18:21:32Z I! Starting Telegraf v1.7.3
2018-08-28T18:21:32Z I! Loaded inputs: inputs.cpu inputs.disk inputs.diskio inputs.mem inputs.processes inputs.system inputs.internal inputs.nginx inputs.kernel inputs.swap inputs.net inputs.netstat inputs.procstat inputs.logparser
2018-08-28T18:21:32Z I! Loaded aggregators: 
2018-08-28T18:21:32Z I! Loaded processors: 
2018-08-28T18:21:32Z I! Loaded outputs: influxdb
2018-08-28T18:21:32Z I! Tags enabled: dc=us-vultr host=us-proxy1
2018-08-28T18:21:32Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"us-proxy1", Flush Interval:10s

@danielnelson danielnelson added bug unexpected problem or unintended behavior area/tail labels Aug 29, 2018
@danielnelson
Copy link
Contributor

Does it behave the same if you use reload?

systemctl reload telegraf

@bolek2000
Copy link
Author

That is what I tried afterwards after I discovered these problems. I changed to reload, but this was even worse. telegraf totally stopped sending metrics. And a weird cut off log entry in the log again.

2018-08-28T19:59:51Z D! Output [influxdb] wrote batch of 49 metrics in 261.062396ms
2018-08-28T20:00:02Z D! Output [influxdb] buffer fullness: 91 / 10000 metrics. 
2018-08-28T20:00:02Z D! Grok no match found for: "Firefox/61.0\" +0.000 - .+ \"-\" \"example.com~0 0 HIT\""
2018-08-28T20:00:02Z I! Reloading Telegraf config
2018-08-28T20:00:02Z I! Hang on, flushing any cached metrics before shutdown
2018-08-28T20:00:02Z D! Output [influxdb] buffer fullness: 8 / 10000 metrics. 
2018-08-28T20:00:03Z D! Output [influxdb] wrote batch of 91 metrics in 1.015846591s
2018-08-28T20:00:03Z D! Output [influxdb] wrote batch of 8 metrics in 132.065843ms

Then the logfile ends and nothing is written to it anymore until I changed back to restart. As you can see, telegraf was still running, and doing reloads, but stopped working at all.

root@us-proxy1:/var/log/telegraf# systemctl status telegraf
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
Loaded: loaded (/lib/systemd/system/telegraf.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/telegraf.service.d
└─10-renice.conf
Active: active (running) since Tue 2018-08-28 19:50:03 UTC; 24h ago
Docs: https://github.com/influxdata/telegraf
Process: 15355 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
Main PID: 3940 (telegraf)
Tasks: 15 (limit: 4915)
CGroup: /system.slice/telegraf.service
└─3940 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf

Aug 29 20:00:02 us-proxy1 systemd[1]: Reloading The plugin-driven server agent for reporting metrics in
Aug 29 20:00:02 us-proxy1 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics int
Aug 29 20:10:02 us-proxy1 systemd[1]: Reloading The plugin-driven server agent for reporting metrics in
Aug 29 20:10:02 us-proxy1 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics int
Aug 29 20:20:02 us-proxy1 systemd[1]: Reloading The plugin-driven server agent for reporting metrics in
Aug 29 20:20:02 us-proxy1 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics int
Aug 29 20:30:02 us-proxy1 systemd[1]: Reloading The plugin-driven server agent for reporting metrics in
Aug 29 20:30:02 us-proxy1 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics int
Aug 29 20:40:02 us-proxy1 systemd[1]: Reloading The plugin-driven server agent for reporting metrics in
Aug 29 20:40:02 us-proxy1 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics int

@bolek2000
Copy link
Author

By the way I was using restart because of all the other bugs with logparser not realizing that a file is overwritten or a second logparser not working, when using reload etc. So now also my latest workaroud is broken.
#3573
#3906
So this alltogether renders the logparser plugin nearly impossible to use in a production environment.

@bolek2000
Copy link
Author

Now I tried:

systemctl stop telegraf
systemctl start telegraf
Aug 30 07:21:33 us-proxy1 systemd[1]: telegraf.service: State 'stop-sigterm' timed out. Killing.
Aug 30 07:21:33 us-proxy1 systemd[1]: telegraf.service: Killing process 21364 (telegraf) with signal SI
Aug 30 07:21:33 us-proxy1 systemd[1]: telegraf.service: Main process exited, code=killed, status=9/KILL
Aug 30 07:21:33 us-proxy1 systemd[1]: telegraf.service: Failed with result 'timeout'.

@danielnelson
Copy link
Contributor

I think this might be addressed in a branch I am working on to fix #4283 and #4457. I will try to get the branch to a state that we can use it for testing.

@danielnelson danielnelson self-assigned this Aug 31, 2018
@bolek2000
Copy link
Author

bolek2000 commented Sep 6, 2018

On Ubuntu 14.04 the restart problem shows up diffferently. Here telegraf processes are not stopped properly (because of lack of systemd kill mechanisms ?) and are piling up. The problem is, that they still produce load and after a couple of weeks the load rises dramatically. I need to use SIGKILL to get rid of these old processes.
Luckily we always renice all telegraf processes. This problem I saw first after upgrade to 1.7.3. Now I am on 1.7.4, and will investigate further.

ps output:

telegraf 1563 0.0 0.0 1003892 28044 ? SNl Aug19 7:05 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 1935 0.0 0.0 1017372 27208 ? SNl Aug17 10:54 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 2039 0.0 0.0 944696 30632 ? SNl Aug15 13:29 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 3460 0.0 0.0 873876 21540 ? SNl Aug18 6:07 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 3962 0.0 0.0 1145272 23416 ? SNl Sep03 3:37 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 4625 0.0 0.0 1008120 20760 ? SNl Aug30 5:59 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 6201 0.0 0.0 800144 21940 ? SNl Sep03 1:47 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 7341 0.0 0.0 1116200 25868 ? SNl Aug15 8:18 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 7397 0.0 0.0 989612 21636 ? SNl Aug22 4:45 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 7486 0.0 0.0 1175912 20688 ? SNl Aug12 5:49 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 7676 0.0 0.0 1061232 20804 ? SNl Aug11 7:17 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 8980 0.0 0.0 996752 22380 ? SNl Aug27 3:45 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 9331 0.0 0.0 882072 22780 ? SNl Aug28 2:44 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 11023 0.0 0.0 1148444 21196 ? SNl Aug29 7:37 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 13973 0.0 0.0 865680 22664 ? SNl Aug16 5:39 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 16798 0.0 0.0 1002036 20988 ? SNl Aug14 10:22 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 16949 0.0 0.0 1193360 20604 ? SNl Aug16 5:44 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 17206 0.1 0.1 1278460 42116 ? SNl Sep03 7:13 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 17448 0.1 0.1 1070228 44864 ? SNl Sep05 2:49 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 20516 0.6 0.1 868852 44124 ? SNl Sep05 4:09 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 20925 0.0 0.0 1201556 28128 ? SNl Aug18 8:07 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 21015 88.5 0.1 943640 57636 ? SNl 06:40 5:34 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 21112 0.0 0.0 1144216 23932 ? SNl Aug09 8:11 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 21371 0.0 0.0 987500 25088 ? SNl Aug26 3:19 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 21423 0.0 0.0 988300 27048 ? SNl Aug15 10:01 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 21671 0.2 0.1 860656 44532 ? SNl Sep04 5:33 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 22608 0.0 0.0 1222176 22664 ? SNl Aug17 10:03 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 22740 0.0 0.0 1193360 20756 ? SNl Aug31 2:57 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 23460 0.0 0.0 1070484 31452 ? SNl Aug18 7:39 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 24589 0.0 0.0 914568 27468 ? SNl Aug17 7:54 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 26609 26.4 0.1 1295908 60244 ? SNl 03:10 57:19 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 27268 0.0 0.0 1069428 22580 ? SNl Aug15 5:26 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 27924 0.0 0.0 1004948 20480 ? SNl Aug22 3:52 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 28619 0.0 0.0 932272 23776 ? SNl Aug08 8:07 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 29501 0.0 0.0 1077624 20536 ? SNl Aug27 3:02 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 30318 0.0 0.1 781704 46588 ? SNl Sep03 1:13 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 32118 0.4 0.1 1066516 42432 ? SNl Sep05 5:31 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

On every telegraf restart this error mesage shows up:
telegraf.log:

2018-09-06T07:20:08Z E! Unable to remove pidfile: remove /var/run/telegraf/telegraf.pid: no such file or directory
2018-09-06T07:30:09Z E! Unable to remove pidfile: remove /var/run/telegraf/telegraf.pid: no such file or directory

Still the files are created on startup:

-rw-r--r-- 1 telegraf telegraf 4 Sep 6 07:20 /var/run/telegraf/telegraf.pid
-rw-r--r-- 1 telegraf telegraf 5 Sep 6 07:30 /var/run/telegraf/telegraf.pid

@bolek2000
Copy link
Author

I migrated all my logparsers to the tail plugin with grok data_format and that worked very well. Also I am on telegraf 1.8 on all hosts now. Then I switched to reload in my preprocessing script, what also works.
But with restart I still have this behavior sometimes, but it seems less than with logparser.

@danielnelson
Copy link
Contributor

Glad to hear the tail plugin is working better than logparser, though I am somewhat surprised that it helped with the rotation issue. I am working on a redesign of the shutdown procedure for 1.9 which I think will fix this issue.

@bolek2000
Copy link
Author

So far I had telegraf 1.8 and tail plugin running on 3 LTS versions of Ubuntu on 7 servers. On some machines I didn't have reload nor restart enabled on some either periodic restart or reload. This mixed test setup ran for over a week now and on all machines the input files were overwritten with new preprocessed logfile entries every 10 minutes. If this is comparable to a normal logrotation, I could not see the rotation issue anymore during this time.
But I still have the issue that sometimes the plugin/telegraf have hickups, where I see the cut off logentries (like described above) and here: influxdata/influxdb#9664 (reportd in wrong project).
This behavior I found so far only on instances doing periodic restarts and on machines, where telegraf was continiously running. Also I discovered, that in the last case the memory consuption started rising after these "hickups". If I do periodic restarts/reloads I don't realize this leaking of course. So I can confirm, that there is some kind of leaking still happening in this setup. Now I have all instances on periodic reloads and will report, if I also see the leaking there...

@danielnelson
Copy link
Contributor

I think this is the same as reported in #4457

@bolek2000
Copy link
Author

bolek2000 commented Dec 10, 2018

I have now 1.10.0~a50fb5a2-0 (and had 1.9.0 with memory leak) but still see this behavior on Ubuntu 14.04. Before upgrade to 1.9.0. I did a periodic reload of the agent, what worked mostly (sometimes the agent just stopped sending metrics). Couldn't test 1.9+ on Ubuntu 18.04/16.04 yet. Also only with periodic reload (and without) of the agent the tail plugin is loosing a lot of metrics, what worked better in 1.8. , thats why I switched to restart again.

2018-12-08T07:00:01Z D! [outputs.influxdb] wrote batch of 196 metrics in 15.322917ms
2018-12-08T07:00:01Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2018-12-08T07:00:02Z D! [agent] Stopping service inputs
2018-12-08T07:00:02Z D! [inputs.tail] tail removed for file: /var/log/nginx/wp_access.log.sorted.10min
2018-12-08T07:00:02Z D! [agent] Input channel closed
2018-12-08T07:00:02Z D! [agent] Processor channel closed
2018-12-08T07:00:02Z I! [agent] Hang on, flushing any cached metrics before shutdown
2018-12-08T07:00:02Z D! [outputs.influxdb] wrote batch of 573 metrics in 23.730696ms
2018-12-08T07:00:02Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2018-12-08T07:00:02Z D! [agent] Closing outputs
2018-12-08T07:00:02Z E! Unable to remove pidfile: remove /var/run/telegraf/telegraf.pid: no such file or directory
2018-12-08T07:00:04Z I! Starting Telegraf 
2018-12-08T07:00:04Z I! Loaded inputs: inputs.nginx inputs.cpu inputs.disk inputs.diskio inputs.kernel inputs.processes inputs.swap inputs.net inputs.procstat inputs.mem inputs.system inputs.internal inputs.netstat inputs.memcached inputs.mysql inputs.httpjson inputs.phpfpm inputs.tail
2018-12-08T07:00:04Z I! Loaded aggregators: 
2018-12-08T07:00:04Z I! Loaded processors: regex regex regex
2018-12-08T07:00:04Z I! Loaded outputs: influxdb

ubnt14_04_problems_telegraf

In the metrics you can see:

  1. rising process numbers (I already had a limit of 10 telegraf processes in place)
  2. rising written metrics (because these hanging processes still read tail-plugin logs, I guess)
  3. high load while the metrics output is high (load drops after I killed all dangling telegraf processes)

Now I downgraded to 1.8.3 and periodic reloads for this host to get some kind of stability for production. The most important metrics comes from reading the webserver logs, what did'nt work well with 1.9+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tail bug unexpected problem or unintended behavior
Projects
None yet
3 participants