outputs.influxdb not buffering points on telegraf 1.19.1 #9514

tesibelda · 2021-07-18T17:26:42Z

With telegraf 1.19.1, internal_write reports buffer_size values for outputs.influxdb that stay below metric_batch_size, even when the influxdb instance is down and many more points have been generated. The configuration below also includes the http output for comparison; it buffers points correctly (as in previous versions).

Relevant telegraf.conf:

[global_tags]
  region = "eu-west-1"

[agent]
  metric_buffer_limit = 100000
  flush_interval = "2s"

[[inputs.internal]]
  interval="4ms"

[[outputs.influxdb]]
  urls = [ "http://localhost:8086" ]
  skip_database_creation = true

[[outputs.http]]
  url = "http://127.0.0.1:8080/telegraf"

[[outputs.file]]
  files = [ "telegraf.out" ]

System info:

Linux on AMD64 with telegraf 1.19.1 (https://dl.influxdata.com/telegraf/releases/telegraf-1.19.1_linux_amd64.tar.gz) and 1.18.3 (https://dl.influxdata.com/telegraf/releases/telegraf-1.18.3_linux_amd64.tar.gz)
No processes are listening on localhost:8086 nor 127.0.0.1:8080.

Steps to reproduce:

Use the above configuration with the 1.19.1 binary and run it for 20s. Rename telegraf.out to telegraf-1.19.1.out.
Then run the same configuration with the 1.18.3 binary for 20s, and rename telegraf.out to telegraf-1.18.3.out.
Compare the last lines of telegraf-1.19.1.out and telegraf-1.18.3.out:
$ tail telegraf-1.19.1.out | grep write | tail -3
internal_write,host=xxxxx,output=influxdb,region=eu-west-1,version=1.19.1 metrics_written=29000i,metrics_dropped=0i,buffer_size=682i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=1823419i,errors=29i,metrics_added=29682i 1626627449853000000
internal_write,host=xxxxx,output=http,region=eu-west-1,version=1.19.1 metrics_dropped=0i,buffer_size=29682i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=11507427i,errors=0i,metrics_added=29682i,metrics_written=0i 1626627449853000000
internal_write,host=xxxxx,output=file,region=eu-west-1,version=1.19.1 metrics_dropped=0i,buffer_size=682i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=17995062i,errors=0i,metrics_added=29682i,metrics_written=29000i 1626627449853000000
$ tail telegraf-1.18.3.out | grep write | tail -3
internal_write,host=xxxxx,output=influxdb,region=eu-west-1,version=1.18.3 metrics_dropped=0i,buffer_size=29718i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=3341971i,errors=29i,metrics_added=29718i,metrics_written=0i 1626627746465000000
internal_write,host=xxxxx,output=http,region=eu-west-1,version=1.18.3 buffer_limit=100000i,metrics_filtered=0i,write_time_ns=13574467i,errors=0i,metrics_added=29718i,metrics_written=0i,metrics_dropped=0i,buffer_size=29718i 1626627746465000000
internal_write,host=xxxxx,output=file,region=eu-west-1,version=1.18.3 metrics_added=29718i,metrics_written=29000i,metrics_dropped=0i,buffer_size=718i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=20608019i,errors=0i 1626627746465000000
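
To compare the two runs side by side, a minimal Python sketch (not part of the original report) can pull the relevant fields out of internal_write lines. The function name summarize is hypothetical, and the parser assumes no escaped spaces or commas, which holds for the lines above but not for line protocol in general.

```python
def summarize(line):
    """Return (output, version, buffer_size, metrics_written) for one internal_write line."""
    # Line protocol layout: "<measurement>,<tags> <fields> <timestamp>".
    # Assumption: no escaped spaces or commas anywhere in the line.
    head, field_part, _timestamp = line.strip().split(" ")
    measurement, *tag_items = head.split(",")
    if measurement != "internal_write":
        raise ValueError("not an internal_write line: " + measurement)
    tags = dict(item.split("=", 1) for item in tag_items)
    fields = {}
    for item in field_part.split(","):
        key, value = item.split("=", 1)
        fields[key] = int(value.rstrip("i"))  # integer fields carry an "i" suffix
    return tags["output"], tags["version"], fields["buffer_size"], fields["metrics_written"]


# The influxdb line from the 1.19.1 run above.
sample = (
    "internal_write,host=xxxxx,output=influxdb,region=eu-west-1,version=1.19.1 "
    "metrics_written=29000i,metrics_dropped=0i,buffer_size=682i,buffer_limit=100000i,"
    "metrics_filtered=0i,write_time_ns=1823419i,errors=29i,metrics_added=29682i "
    "1626627449853000000"
)
print(summarize(sample))
```

In the 1.19.1 influxdb line, metrics_added (29682) equals metrics_written (29000) plus buffer_size (682), and metrics_written is exactly errors (29) times 1000. That would be consistent with each failed batch being counted as written, assuming metric_batch_size was left at its default of 1000.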

Expected behavior:

internal_write should report similar numbers for output=influxdb and output=http, in particular for metrics_written and buffer_size, just as it does with telegraf 1.18.3. Gathered metric points should be buffered for both unavailable outputs, but with 1.19.1 only the http output buffers them.

Actual behavior:

With the 1.19.1 binary, internal_write reports a high number for metrics_written even though influxdb is not up, and a low number for buffer_size, as if it weren't buffering the gathered points:
internal_write,host=xxx,output=influxdb,...,version=1.19.1 metrics_written=29000i,...,buffer_size=682i

In contrast, the 1.18.3 binary works as expected:
internal_write,host=xxxxx,output=influxdb,...,version=1.18.3 ...buffer_size=29718i,...,metrics_written=0i

http output also works as expected for both binary versions:
internal_write,host=xxxxx,output=http,...,version=1.19.1 ...buffer_size=29682i,...,metrics_written=0i
internal_write,host=xxxxx,output=http,...,version=1.18.3 ...,metrics_written=0i,...,buffer_size=29718i

tesibelda · 2021-07-18T17:28:12Z

Maybe related to #9296

MyaLongmire · 2021-07-21T16:06:02Z

I ran your exact config without anything running on localhost:8086 and got a connection refused error.

2021-07-21T16:01:53Z I! Starting Telegraf 1.19.1
2021-07-21T16:01:53Z I! Loaded inputs: internal
2021-07-21T16:01:53Z I! Loaded aggregators: 
2021-07-21T16:01:53Z I! Loaded processors: 
2021-07-21T16:01:53Z I! Loaded outputs: file influxdb
2021-07-21T16:01:53Z I! Tags enabled: host=pop-os
2021-07-21T16:01:53Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"pop-os", Flush Interval:2s
2021-07-21T16:01:53Z E! [outputs.influxdb] When writing to [http://localhost:8086]: failed doing req: Post "http://localhost:8086/write?db=telegraf": dial tcp [::1]:8086: connect: connection refused
2021-07-21T16:01:54Z E! [outputs.influxdb] When writing to [http://localhost:8086]: failed doing req: Post "http://localhost:8086/write?db=telegraf": dial tcp [::1]:8086: connect: connection refused

When I start up localhost:8086 I get metrics_written=0i every time. Here is my output:

internal_memstats,host=pop-os mallocs=94407i,heap_sys_bytes=66289664i,heap_in_use_bytes=14401536i,heap_objects=32843i,num_gc=5i,sys_bytes=76366856i,total_alloc_bytes=18766256i,pointer_lookups=0i,frees=61564i,heap_alloc_bytes=12239736i,heap_idle_bytes=51888128i,heap_released_bytes=50503680i,alloc_bytes=12239736i 1626883163489000000
internal_agent,go_version=1.16.5,host=pop-os,version=1.19.1 metrics_written=0i,metrics_dropped=0i,metrics_gathered=1i,gather_errors=0i 1626883163489000000
internal_gather,host=pop-os,input=internal,version=1.19.1 metrics_gathered=1i,gather_time_ns=0i,errors=0i 1626883163489000000
internal_write,host=pop-os,output=influxdb,version=1.19.1 metrics_dropped=0i,buffer_size=0i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=0i,errors=0i,metrics_added=0i,metrics_written=0i 1626883163489000000
internal_write,host=pop-os,output=file,version=1.19.1 buffer_size=0i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=0i,errors=0i,metrics_added=0i,metrics_written=0i,metrics_dropped=0i 1626883163489000000
internal_memstats,host=pop-os mallocs=94680i,frees=61585i,heap_alloc_bytes=12253216i,heap_idle_bytes=51888128i,alloc_bytes=12253216i,total_alloc_bytes=18779736i,pointer_lookups=0i,heap_released_bytes=50503680i,heap_objects=33095i,num_gc=5i,sys_bytes=76366856i,heap_sys_bytes=66289664i,heap_in_use_bytes=14401536i 1626883163492000000
internal_agent,go_version=1.16.5,host=pop-os,version=1.19.1 gather_errors=0i,metrics_written=0i,metrics_dropped=0i,metrics_gathered=6i 1626883163492000000
internal_gather,host=pop-os,input=internal,version=1.19.1 errors=0i,metrics_gathered=6i,gather_time_ns=101613i 1626883163492000000

Can you clarify what you mean by, "No processes are listening on localhost:8086"?

tesibelda · 2021-07-21T18:42:13Z

The idea of this configuration is to test the buffering mechanism of outputs.influxdb with telegraf 1.19.1, compared with 1.18.3 and also with outputs.http. To do that, the test runs with no influxdb listening on localhost:8086 and no endpoint listening on 127.0.0.1:8080; only telegraf and the OS are running on the test machine. Since a lot of metrics are collected, they should all go into the buffer and none should be reported as written.

Telegraf 1.18.3 does exactly this as reported with:
internal_write,host=xxxxx,output=influxdb,...,version=1.18.3 ...buffer_size=29718i,...,metrics_written=0i

Telegraf 1.19.1 doesn't seem to buffer metrics correctly for outputs.influxdb as reported with:
internal_write,host=xxx,output=influxdb,...,version=1.19.1 metrics_written=29000i,...,buffer_size=682i
Here buffer_size should be something like 29k and metrics_written should be 0.
But telegraf 1.19.1 does it well for outputs.http as reported with:
internal_write,host=xxxxx,output=http,...,version=1.19.1 ...buffer_size=29682i,...,metrics_written=0i

If internal_write is reporting real values, telegraf 1.19.1 agents that lose connectivity with influxdb for any reason will not buffer metrics properly, and if the connectivity problem takes a while to clear, most metrics will be lost. Because of this behavior, buffer_size will never reach 'metric_buffer_limit' no matter how long the connection is down; it will not even reach 'metric_batch_size'.
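
The difference between the two behaviors can be sketched with a toy model in Python. This is not Telegraf's implementation: ToyBuffer, simulate, and the keep_on_error flag are hypothetical names, and the model simply assumes a flush is triggered every batch_size points (1000 here) against an output that always fails.

```python
from collections import deque


class ToyBuffer:
    """Toy bounded buffer; illustrates the symptom only, not Telegraf's code."""

    def __init__(self, limit, batch_size):
        self.limit = limit
        self.batch_size = batch_size
        self.items = deque()
        self.written = 0
        self.dropped = 0

    def add(self, metric):
        # When full, drop the oldest metric to make room for the new one.
        if len(self.items) >= self.limit:
            self.items.popleft()
            self.dropped += 1
        self.items.append(metric)

    def flush(self, write, keep_on_error):
        # Take up to one batch off the front of the buffer and try to send it.
        count = min(self.batch_size, len(self.items))
        batch = [self.items.popleft() for _ in range(count)]
        if write(batch) or not keep_on_error:
            # With keep_on_error=False a failed batch is wrongly counted as written.
            self.written += len(batch)
        else:
            # Expected behavior: put the failed batch back, preserving order.
            self.items.extendleft(reversed(batch))


def simulate(total, keep_on_error, limit=100000, batch_size=1000):
    """Add `total` metrics while every write fails; return (written, buffer_size)."""
    buf = ToyBuffer(limit, batch_size)
    for i in range(total):
        buf.add(i)
        if (i + 1) % batch_size == 0:
            buf.flush(lambda batch: False, keep_on_error)  # output is always down
    return buf.written, len(buf.items)


print(simulate(29682, keep_on_error=False))  # mimics the 1.19.1 influxdb numbers
print(simulate(29682, keep_on_error=True))   # mimics the 1.18.3 / http numbers
```

With 29682 points added, the faulty mode ends with 29000 reported as written and 682 in the buffer, while the expected mode ends with 0 written and all 29682 buffered, matching the shape of the numbers reported above.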

MyaLongmire · 2021-07-22T15:17:36Z

Please try pr #9526 and see if it fixes your problem.

tesibelda · 2021-07-24T14:23:29Z

pr #9526 artifacts behave as expected.
The test result was:
internal_write,host=xxxx,output=influxdb,...,version=unknown ...,buffer_size=24865i,..,metrics_written=0i
Thanks

tesibelda added the bug (unexpected problem or unintended behavior) label Jul 18, 2021

telegraf-tiger bot added the area/tail label Jul 18, 2021

helenosheaa added area/influxdb and removed area/tail labels Jul 21, 2021

MyaLongmire mentioned this issue Jul 21, 2021 in "Fix metrics reported as written but not actually written" #9526 (merged)

reimda closed this as completed in #9526 Jul 28, 2021
