outputs.influxdb not buffering points on telegraf 1.19.1 #9514

Closed
tesibelda opened this issue Jul 18, 2021 · 5 comments · Fixed by #9526
Labels
area/influxdb bug unexpected problem or unintended behavior

Comments

@tesibelda

With telegraf 1.19.1, internal_write reports buffer_size values for outputs.influxdb that stay below metric_batch_size, even when the influxdb instance is down and many more points have been generated. The configuration below also includes the http output for comparison, which buffers points correctly (as in previous versions).

Relevant telegraf.conf:

[global_tags]
  region = "eu-west-1"

[agent]
  metric_buffer_limit = 100000
  flush_interval = "2s"

[[inputs.internal]]
  interval="4ms"

[[outputs.influxdb]]
  urls = [ "http://localhost:8086" ]
  skip_database_creation = true

[[outputs.http]]
  url = "http://127.0.0.1:8080/telegraf"

[[outputs.file]]
  files = [ "telegraf.out" ]

System info:

Linux on AMD64 with telegraf 1.19.1 (https://dl.influxdata.com/telegraf/releases/telegraf-1.19.1_linux_amd64.tar.gz) and 1.18.3 (https://dl.influxdata.com/telegraf/releases/telegraf-1.18.3_linux_amd64.tar.gz)
No processes are listening on localhost:8086 or 127.0.0.1:8080.

Steps to reproduce:

  1. Run the above configuration with the 1.19.1 binary for 20s, then rename telegraf.out to telegraf-1.19.1.out
  2. Run the same configuration with the 1.18.3 binary for 20s, then rename telegraf.out to telegraf-1.18.3.out
  3. Compare the last lines of telegraf-1.19.1.out and telegraf-1.18.3.out
    $ tail telegraf-1.19.1.out | grep write | tail -3
    internal_write,host=xxxxx,output=influxdb,region=eu-west-1,version=1.19.1 metrics_written=29000i,metrics_dropped=0i,buffer_size=682i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=1823419i,errors=29i,metrics_added=29682i 1626627449853000000
    internal_write,host=xxxxx,output=http,region=eu-west-1,version=1.19.1 metrics_dropped=0i,buffer_size=29682i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=11507427i,errors=0i,metrics_added=29682i,metrics_written=0i 1626627449853000000
    internal_write,host=xxxxx,output=file,region=eu-west-1,version=1.19.1 metrics_dropped=0i,buffer_size=682i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=17995062i,errors=0i,metrics_added=29682i,metrics_written=29000i 1626627449853000000
    $ tail telegraf-1.18.3.out | grep write | tail -3
    internal_write,host=xxxxx,output=influxdb,region=eu-west-1,version=1.18.3 metrics_dropped=0i,buffer_size=29718i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=3341971i,errors=29i,metrics_added=29718i,metrics_written=0i 1626627746465000000
    internal_write,host=xxxxx,output=http,region=eu-west-1,version=1.18.3 buffer_limit=100000i,metrics_filtered=0i,write_time_ns=13574467i,errors=0i,metrics_added=29718i,metrics_written=0i,metrics_dropped=0i,buffer_size=29718i 1626627746465000000
    internal_write,host=xxxxx,output=file,region=eu-west-1,version=1.18.3 metrics_added=29718i,metrics_written=29000i,metrics_dropped=0i,buffer_size=718i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=20608019i,errors=0i 1626627746465000000
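
To compare the two runs without reading the raw lines by eye, the relevant counters can be pulled out with a small script. This is a minimal sketch, not part of telegraf: `parse_internal_write` and `summarize` are hypothetical helper names, and the parser assumes simple line protocol with integer fields and no escaped spaces or commas, which holds for the internal_write lines quoted above.

```python
# Simplified line-protocol parsing for the internal_write lines in this report.
# Assumption: no escaped spaces/commas and integer-only fields ("123i").

LINE_1_19_1 = (
    "internal_write,host=xxxxx,output=influxdb,region=eu-west-1,version=1.19.1 "
    "metrics_written=29000i,metrics_dropped=0i,buffer_size=682i,buffer_limit=100000i,"
    "metrics_filtered=0i,write_time_ns=1823419i,errors=29i,metrics_added=29682i "
    "1626627449853000000"
)
LINE_1_18_3 = (
    "internal_write,host=xxxxx,output=influxdb,region=eu-west-1,version=1.18.3 "
    "metrics_dropped=0i,buffer_size=29718i,buffer_limit=100000i,metrics_filtered=0i,"
    "write_time_ns=3341971i,errors=29i,metrics_added=29718i,metrics_written=0i "
    "1626627746465000000"
)


def parse_internal_write(line):
    """Split one line into a tag dict and an integer field dict."""
    head, field_part, _timestamp = line.strip().split(" ")
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    tags["measurement"] = measurement
    fields = {}
    for pair in field_part.split(","):
        key, value = pair.split("=", 1)
        fields[key] = int(value.rstrip("i"))  # drop the integer suffix
    return tags, fields


def summarize(line):
    """Return (output, version, buffer_size, metrics_written) for one line."""
    tags, fields = parse_internal_write(line)
    return (
        tags["output"],
        tags["version"],
        fields["buffer_size"],
        fields["metrics_written"],
    )


if __name__ == "__main__":
    for line in (LINE_1_19_1, LINE_1_18_3):
        print(summarize(line))
```

Feeding every `grep write` line from both output files through `summarize` gives a compact side-by-side view of buffer_size and metrics_written per output and version.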

Expected behavior:

internal_write should report similar numbers for output=influxdb and output=http, in particular for metrics_written and buffer_size, just as it does with telegraf 1.18.3. Gathered metric points should be buffered for both unavailable outputs, but with 1.19.1 only the http output buffers them.

Actual behavior:

With the 1.19.1 binary, internal_write reports a high metrics_written value even though influxdb is not up, and a low buffer_size value, as if it were not buffering the gathered points.
internal_write,host=xxx,output=influxdb,...,version=1.19.1 metrics_written=29000i,...,buffer_size=682i

By contrast, the 1.18.3 binary works as expected.
internal_write,host=xxxxx,output=influxdb,...,version=1.18.3 ...buffer_size=29718i,...,metrics_written=0i

The http output also works as expected with both binary versions:
internal_write,host=xxxxx,output=http,...,version=1.19.1 ...buffer_size=29682i,...,metrics_written=0i
internal_write,host=xxxxx,output=http,...,version=1.18.3 ...,metrics_written=0i,...,buffer_size=29718i
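
The counters quoted in this report can also be checked arithmetically. In every sample line, metrics_added equals metrics_written + buffer_size + metrics_dropped; that relation is an observation from these samples, not a documented guarantee. The sketch below, with hypothetical helper names `counters_balance` and `buffered_while_down`, uses values copied from the lines above to show which output/version pair fails to hold its points while the endpoint is down.

```python
# Counter values copied from the internal_write lines in this report.
SAMPLES = {
    ("influxdb", "1.19.1"): {
        "metrics_added": 29682, "metrics_written": 29000,
        "buffer_size": 682, "metrics_dropped": 0, "errors": 29,
    },
    ("http", "1.19.1"): {
        "metrics_added": 29682, "metrics_written": 0,
        "buffer_size": 29682, "metrics_dropped": 0, "errors": 0,
    },
    ("influxdb", "1.18.3"): {
        "metrics_added": 29718, "metrics_written": 0,
        "buffer_size": 29718, "metrics_dropped": 0, "errors": 29,
    },
    ("http", "1.18.3"): {
        "metrics_added": 29718, "metrics_written": 0,
        "buffer_size": 29718, "metrics_dropped": 0, "errors": 0,
    },
}


def counters_balance(fields):
    """Observed relation: every added metric is written, buffered, or dropped."""
    return fields["metrics_added"] == (
        fields["metrics_written"] + fields["buffer_size"] + fields["metrics_dropped"]
    )


def buffered_while_down(fields):
    """For an unreachable output: nothing written, everything still buffered."""
    return (
        fields["metrics_written"] == 0
        and fields["buffer_size"] == fields["metrics_added"]
    )


if __name__ == "__main__":
    for (output, version), fields in SAMPLES.items():
        print(output, version, counters_balance(fields), buffered_while_down(fields))
```

All four samples balance, so the counters are internally consistent; only the influxdb output on 1.19.1 reports points as written instead of buffered.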

@tesibelda tesibelda added the bug unexpected problem or unintended behavior label Jul 18, 2021
@tesibelda
Author

Maybe related to #9296

@MyaLongmire
Contributor

I ran your exact config with nothing running on localhost:8086 and got a connection refused error.

2021-07-21T16:01:53Z I! Starting Telegraf 1.19.1
2021-07-21T16:01:53Z I! Loaded inputs: internal
2021-07-21T16:01:53Z I! Loaded aggregators: 
2021-07-21T16:01:53Z I! Loaded processors: 
2021-07-21T16:01:53Z I! Loaded outputs: file influxdb
2021-07-21T16:01:53Z I! Tags enabled: host=pop-os
2021-07-21T16:01:53Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"pop-os", Flush Interval:2s
2021-07-21T16:01:53Z E! [outputs.influxdb] When writing to [http://localhost:8086]: failed doing req: Post "http://localhost:8086/write?db=telegraf": dial tcp [::1]:8086: connect: connection refused
2021-07-21T16:01:54Z E! [outputs.influxdb] When writing to [http://localhost:8086]: failed doing req: Post "http://localhost:8086/write?db=telegraf": dial tcp [::1]:8086: connect: connection refused

When I start up localhost:8086 I get metrics_written=0i every time. Here is my output:

internal_memstats,host=pop-os mallocs=94407i,heap_sys_bytes=66289664i,heap_in_use_bytes=14401536i,heap_objects=32843i,num_gc=5i,sys_bytes=76366856i,total_alloc_bytes=18766256i,pointer_lookups=0i,frees=61564i,heap_alloc_bytes=12239736i,heap_idle_bytes=51888128i,heap_released_bytes=50503680i,alloc_bytes=12239736i 1626883163489000000
internal_agent,go_version=1.16.5,host=pop-os,version=1.19.1 metrics_written=0i,metrics_dropped=0i,metrics_gathered=1i,gather_errors=0i 1626883163489000000
internal_gather,host=pop-os,input=internal,version=1.19.1 metrics_gathered=1i,gather_time_ns=0i,errors=0i 1626883163489000000
internal_write,host=pop-os,output=influxdb,version=1.19.1 metrics_dropped=0i,buffer_size=0i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=0i,errors=0i,metrics_added=0i,metrics_written=0i 1626883163489000000
internal_write,host=pop-os,output=file,version=1.19.1 buffer_size=0i,buffer_limit=100000i,metrics_filtered=0i,write_time_ns=0i,errors=0i,metrics_added=0i,metrics_written=0i,metrics_dropped=0i 1626883163489000000
internal_memstats,host=pop-os mallocs=94680i,frees=61585i,heap_alloc_bytes=12253216i,heap_idle_bytes=51888128i,alloc_bytes=12253216i,total_alloc_bytes=18779736i,pointer_lookups=0i,heap_released_bytes=50503680i,heap_objects=33095i,num_gc=5i,sys_bytes=76366856i,heap_sys_bytes=66289664i,heap_in_use_bytes=14401536i 1626883163492000000
internal_agent,go_version=1.16.5,host=pop-os,version=1.19.1 gather_errors=0i,metrics_written=0i,metrics_dropped=0i,metrics_gathered=6i 1626883163492000000
internal_gather,host=pop-os,input=internal,version=1.19.1 errors=0i,metrics_gathered=6i,gather_time_ns=101613i 1626883163492000000

Can you clarify what you mean by, "No processes are listening on localhost:8086"?

@tesibelda
Author

tesibelda commented Jul 21, 2021

The idea of this configuration is to test the buffering mechanism of outputs.influxdb in telegraf 1.19.1, compared with 1.18.3 and with outputs.http. To do that, the test runs with no influxdb listening on localhost:8086 and no endpoint listening on 127.0.0.1:8080; only telegraf and the OS are running on the test machine. Since a lot of metrics are collected, they should go into the buffer and none should be reported as written.

Telegraf 1.18.3 does exactly this as reported with:
internal_write,host=xxxxx,output=influxdb,...,version=1.18.3 ...buffer_size=29718i,...,metrics_written=0i

Telegraf 1.19.1 doesn't seem to buffer metrics correctly for outputs.influxdb as reported with:
internal_write,host=xxx,output=influxdb,...,version=1.19.1 metrics_written=29000i,...,buffer_size=682i
Here buffer_size should be something like 29k and metrics_written should be 0.
But telegraf 1.19.1 buffers correctly for outputs.http, as reported with:
internal_write,host=xxxxx,output=http,...,version=1.19.1 ...buffer_size=29682i,...,metrics_written=0i

If internal_write is reporting real values, telegraf 1.19.1 agents that lose connectivity with influxdb for any reason will not buffer metrics properly, and if connectivity takes a while to recover, most metrics will be lost. Because of this behavior, buffer_size will never reach 'metric_buffer_limit' no matter how long the connection is down; it will not even reach 'metric_batch_size'.
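
To illustrate the difference between the two behaviors, here is a toy model of a batching output buffer. It is a hypothetical sketch, not telegraf's code, and it does not claim to describe the actual cause of the bug: `ToyOutputBuffer`, `discard_on_error` and `simulate` are made-up names, and the batch size of 1000 is assumed because the configuration above does not set metric_batch_size (1000 is telegraf's default). The model attempts one write per full batch against an endpoint that always refuses the connection.

```python
class ToyOutputBuffer:
    """Hypothetical batching buffer; illustrates the symptom only."""

    def __init__(self, batch_size, limit, discard_on_error):
        self.batch_size = batch_size
        self.limit = limit
        # Assumed switch: True mimics the 1.19.1 counters, False the 1.18.3 ones.
        self.discard_on_error = discard_on_error
        self.buffer = []
        self.added = self.written = self.dropped = self.errors = 0
        self._since_last_attempt = 0

    def _write(self, batch):
        # Stand-in for an output whose endpoint is down.
        raise ConnectionError("connection refused")

    def add(self, metric):
        self.added += 1
        if len(self.buffer) >= self.limit:
            self.buffer.pop(0)  # buffer full: drop the oldest metric
            self.dropped += 1
        self.buffer.append(metric)
        self._since_last_attempt += 1
        if self._since_last_attempt >= self.batch_size:
            self._since_last_attempt = 0
            self.flush_batch()

    def flush_batch(self):
        batch = self.buffer[: self.batch_size]
        try:
            self._write(batch)
        except ConnectionError:
            self.errors += 1
            if not self.discard_on_error:
                return  # keep the batch buffered so it can be retried
        # Reached on success, or on failure when errors are not honored.
        del self.buffer[: len(batch)]
        self.written += len(batch)


def simulate(total, discard_on_error):
    """Add `total` metrics to a buffer sized like the config in this report."""
    out = ToyOutputBuffer(
        batch_size=1000, limit=100000, discard_on_error=discard_on_error
    )
    for i in range(total):
        out.add(i)
    return out


if __name__ == "__main__":
    for flag in (False, True):
        out = simulate(29682, discard_on_error=flag)
        print(flag, out.errors, out.written, len(out.buffer))
```

With 29682 metrics added, the model that keeps failed batches ends with everything buffered and nothing written, while the model that discards them ends with 29 errors, 29000 "written" and 682 buffered, the same shape as the 1.19.1 counters for output=influxdb.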

@MyaLongmire
Contributor

Please try PR #9526 and see if it fixes your problem.

@tesibelda
Author

The artifacts from PR #9526 behave as expected.
The test result was:
internal_write,host=xxxx,output=influxdb,...,version=unknown ...,buffer_size=24865i,..,metrics_written=0i
Thanks
