[Feature Discussion] Dealing with congestion by adding internal buffer handling options #2905
My current thoughts on this are that we should provide a way for an input to know when metrics have been successfully emitted by either one output or all outputs. The input can then ack queue messages or advance its offset.
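Very roughly, and only as an illustration (these names are made up for the sketch and are not an actual Telegraf API), the input side of that signal could look something like this:

```go
// Purely illustrative sketch; these names are invented, not a Telegraf API.
package main

import "fmt"

// DeliveryInfo would be handed back to an input once one output (or all
// outputs, depending on configuration) has written a tracked batch of metrics.
type DeliveryInfo struct {
	ID        uint64
	Delivered bool
}

// ackLoop shows how a queue-based input could use the signal: it acks the
// message / advances its offset only after delivery is confirmed.
func ackLoop(deliveries <-chan DeliveryInfo, ack func(id uint64)) {
	for d := range deliveries {
		if d.Delivered {
			ack(d.ID)
		} else {
			fmt.Printf("batch %d not delivered, leaving it on the queue\n", d.ID)
		}
	}
}

func main() {
	ch := make(chan DeliveryInfo, 2)
	ch <- DeliveryInfo{ID: 1, Delivered: true}
	ch <- DeliveryInfo{ID: 2, Delivered: false}
	close(ch)
	ackLoop(ch, func(id uint64) { fmt.Println("acked batch", id) })
}
```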
This is the current behavior.
Metrics cannot be aggregated in general, but this should become an option as more aggregators are added.
I'm not very interested in this; I think it is more trouble than it is worth.
Yeah, maybe this is covered by the ack signal mentioned above.
Absolutely, such a signal that can be captured by "interested" plugins would be a very flexible solution.
Persisting to disk could be a very useful failover feature for Telegraf. For me personally it would be the best improvement I could imagine. I use Telegraf to monitor a bunch of automated devices/"things" in my home and at other places, and the machines running Telegraf don't have any internet failover. All metrics collected during an internet blackout are gone in this case. Another use case would be a failure of the output target, e.g. the InfluxDB instance. The file system of the machine running Telegraf is something you can rely on even when the internet connection fails, the output machine fails, or other uncommon events happen. So I don't agree with @danielnelson that this would be more trouble than it is worth. Please consider implementing a feature like the one mentioned in #802.
Dropping into the discussion here. @thannaske, wouldn't using Kafka or any of the MQ inputs/outputs solve this if the "ack / congestion" issue is fixed in Telegraf? I would prefer that Telegraf did not acquire features that other products, built for that task, can do much better.
@Anderen2 I don't know how most users are using Telegraf, but in my case I would prefer not to set up an additional software stack just to buffer data in case of loss of connectivity. I'm running Telegraf on many small embedded *nix systems to collect sensor statistics. These systems would not be capable of running Kafka or anything similar in addition to Telegraf. I also think you can't compare a configurable buffer with a fully-functional message pipeline; that seems like overkill to me.
@thannaske Hmm, I see your point. However, this is quite ineffective when Telegraf keeps consuming the queue and tosses everything away once its internal buffer is full. To avoid I/O congestion we'd also prefer that the Telegraf buffers could be in-memory only (as I assume they will not be as write-efficient as Kafka). But I have nothing against persisting buffers as long as they are toggleable in the configuration. As a side note, have you looked into MQTT? It's a quite lightweight protocol created specifically for telemetry from low-end devices. There are tons of brokers for it, both feature-rich and lightweight ones.
This has been addressed for the 1.9.0 release in #4938. The kafka consumer, and the other queue consumers, have a new option (sketched below). This is a pretty big change for the queue consumers, so I would appreciate any testing that can be done.
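If I remember the option name correctly, it is `max_undelivered_messages`; the snippet below is only a sketch, so please check the kafka_consumer README for your version for the exact name and default:

```toml
[[inputs.kafka_consumer]]
  brokers = ["localhost:9092"]
  topics = ["telegraf"]

  ## Maximum number of messages the consumer reads and holds in flight
  ## before pausing and waiting for the outputs to confirm delivery.
  max_undelivered_messages = 1000
```

With this in place the consumer only marks/commits messages after the outputs have confirmed delivery, instead of reading ahead and dropping data when the buffer fills.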
I was investigating how we might introduce Kafka into our dataflow (application data, monitoring and operational commands, to and from IoT devices, InfluxDB and our Docker cluster) to improve the robustness of the data streams, only to find out that, despite all the required pieces being available, Telegraf would seem to be a bad fit for passing data from one stage to the next.
If I understand correctly, during maintenance windows, network glitches, downtimes, etc., Telegraf would continue to poll/read inputs and, if/when the metric_buffer_limit is reached, drop all new data. Making this work transparently is why we're looking at Kafka in the first place. (Related issue: #2265)
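For reference, the buffer being discussed is the in-memory one configured in the agent section; something along these lines (the values are just the usual examples):

```toml
[agent]
  ## Metrics are flushed to the outputs in batches of at most this size.
  metric_batch_size = 1000

  ## Each output keeps up to this many metrics in memory while it cannot
  ## write; metrics beyond this limit are dropped.
  metric_buffer_limit = 10000
```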
I do not know the Telegraf code, and @sparrc points out in #2240 that inputs and outputs are designed to be independent, so getting feedback from one to the other would require fundamental changes.
So I was hoping to take a step back, put a few thoughts out there and see if this is useful input.
Handling full buffers
The output buffer is obviously an awesome feature, but I'd love to be able to mitigate a full buffer by:
Reacting to output plugin state
If it knows what is going on in other parts of Telegraf, something like a message queue input plugin might have better options for dealing with downstream congestion (stop polling / tell the source to keep replay info / store the offset / keep feeding for the benefit of still-active outputs / ...).
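To make that concrete, here is a purely hypothetical shape for such a feedback channel; nothing like this exists in Telegraf today and all the names are invented:

```go
// Hypothetical sketch; none of these types exist in Telegraf.
package main

import "fmt"

// OutputState is what an "interested" input would be allowed to observe.
type OutputState int

const (
	OutputHealthy   OutputState = iota
	OutputCongested             // buffer is filling up
	OutputBlocked               // writes are failing and the buffer is full
)

// CongestionAware could be implemented by inputs that want feedback.
type CongestionAware interface {
	OnOutputStateChange(output string, state OutputState)
}

// Example reaction: a queue consumer pauses polling while an output is
// blocked, so the broker keeps the messages instead of Telegraf dropping them.
type queueConsumer struct{ paused bool }

func (q *queueConsumer) OnOutputStateChange(output string, state OutputState) {
	q.paused = state == OutputBlocked
	fmt.Printf("output %s changed state to %d, paused=%v\n", output, state, q.paused)
}

func main() {
	var c CongestionAware = &queueConsumer{}
	c.OnOutputStateChange("influxdb", OutputBlocked)
}
```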
Buffers at other stages
Moving or duplicating the buffer feature to input/processor/aggregator plugins might be useful too, but only makes sense if there are triggers that would tell the respective plugins to start filling their buffers.
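Purely as an illustration of the idea, and with entirely made-up option names that Telegraf does not support, such a per-input buffer with a trigger might be configured like this:

```toml
# Hypothetical configuration, not supported by Telegraf; shown only to
# illustrate per-plugin buffers driven by an explicit trigger.
[[inputs.mqtt_consumer]]
  servers = ["tcp://localhost:1883"]
  topics = ["sensors/#"]

  ## Made-up options: start buffering locally (instead of dropping) when
  ## any output reports that its buffer is full.
  buffer_on = "output_blocked"
  buffer_limit = 50000
```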