Possible truncation outputting to udp in influx format #2881

Closed
danielnelson opened this issue Jun 3, 2017 · 9 comments
Labels
bug unexpected problem or unintended behavior

Comments

@danielnelson
Contributor

Bug report

Triggered by reports of malformed metrics in #2862 and based on a code inspection, it appears that when using udp with the socket_writer output, points may be truncated when serialized in influx format.

This could occur anywhere we have a fixed output buffer.

Relevant telegraf.conf:

N/A

System info:

1.3.1

Steps to reproduce:

I have not tested this!

  1. Send a very large point in influxdb format using udp socket_writer
  2. Metrics are truncated?

Expected behavior:

Use field splitting where possible so points fit into the buffer, and warn if a point cannot possibly fit (see the sketch after this report).

Actual behavior:

I think it will be truncated.

Additional info:

#2880 (comment)
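
As a rough sketch of the field-splitting idea in the expected behavior above (not Telegraf code; the series-key handling, field formatting, and escaping here are simplified assumptions), one could greedily pack field pairs into line-protocol lines that stay under a byte limit and warn when a single field cannot possibly fit:

```go
package main

import (
	"fmt"
	"strings"
)

// splitFields greedily packs field pairs into line-protocol lines no longer
// than maxLen bytes. A field that cannot fit even on its own line is an error.
func splitFields(seriesKey string, fields map[string]string, timestamp int64, maxLen int) ([]string, error) {
	prefix := seriesKey + " "
	suffix := fmt.Sprintf(" %d", timestamp)

	var lines []string
	var cur []string
	curLen := len(prefix) + len(suffix)

	flush := func() {
		if len(cur) > 0 {
			lines = append(lines, prefix+strings.Join(cur, ",")+suffix)
			cur = nil
			curLen = len(prefix) + len(suffix)
		}
	}

	for k, v := range fields {
		pair := k + "=" + v
		if len(prefix)+len(pair)+len(suffix) > maxLen {
			return nil, fmt.Errorf("field %q cannot possibly fit in %d bytes", k, maxLen)
		}
		sep := 0
		if len(cur) > 0 {
			sep = 1 // comma between field pairs
		}
		if curLen+sep+len(pair) > maxLen {
			flush()
			sep = 0
		}
		cur = append(cur, pair)
		curLen += sep + len(pair)
	}
	flush()
	return lines, nil
}

func main() {
	fields := map[string]string{
		"usage_user":   "1.5",
		"usage_system": "0.5",
		"usage_idle":   "98",
	}
	lines, err := splitFields("cpu,host=a", fields, 1496500000000000000, 60)
	if err != nil {
		fmt.Println("warn:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Lines that share the same measurement, tags, and timestamp merge back into a single point on the InfluxDB side, so splitting a point's fields this way should preserve the data.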

@danielnelson danielnelson added the bug unexpected problem or unintended behavior label Jun 3, 2017
@danielnelson danielnelson added this to the 1.4.0 milestone Jun 3, 2017
@danielnelson
Contributor Author

cc @oplehto

@phemmer
Contributor

phemmer commented Jun 3, 2017

Valid case, but should be rather hard to trigger. The IP protocol will split the packet if it's too large. Maximum payload size is a little under 64KiB, which is an insanely huge point.

Edit: Actually I think what's likely to happen is that the write operation will throw an error. So nothing will be written. Not truncation.
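
A minimal sketch (not from the thread) that checks this claim: writing an oversized datagram with Go's net package surfaces a local error rather than silently truncating. The address here is arbitrary and no listener is needed, since the error occurs on the sending side.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:9999")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// 70000 bytes exceeds the ~64KiB maximum UDP payload, so Write should
	// return something like "write: message too long" instead of truncating.
	_, err = conn.Write(make([]byte, 70000))
	fmt.Println(err)
}
```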

@oplehto
Contributor

oplehto commented Jun 3, 2017

There are no errors thrown, at least in our case. When the bug is triggered by a single large metric in a metrics batch (the metrics gathered during an interval), the rest of the batch is quietly dropped. Thus a subset of the metrics is sent cleanly, but there are random gaps in the data.

@phemmer
Contributor

phemmer commented Jun 3, 2017

That's not what I get. When I test I get:

2017-06-03T21:35:21Z E! Error writing to output [socket_writer]: write udp 127.0.0.1:62494->127.0.0.1:9999: write: message too long

@danielnelson
Contributor Author

It looks like we send one point per packet, so there shouldn't be gaps in data due to this. Maybe the network is being overloaded by the batching, causing packets to not be received?

Looks like I was concerned about nothing here. We could run the output through the Split function but I think it's not needed at this time.
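
For illustration only, a hypothetical splitter (not the Split function referenced above) could break the serialized batch at newline boundaries so each chunk fits in one datagram:

```go
package main

import (
	"bytes"
	"fmt"
)

// splitDatagrams packs whole line-protocol lines into chunks of at most
// maxSize bytes. A single line longer than maxSize is returned as an error
// instead of being truncated.
func splitDatagrams(buf []byte, maxSize int) ([][]byte, error) {
	var chunks [][]byte
	var cur []byte
	for _, line := range bytes.SplitAfter(buf, []byte("\n")) {
		if len(line) == 0 {
			continue
		}
		if len(line) > maxSize {
			return nil, fmt.Errorf("line of %d bytes exceeds max datagram size %d", len(line), maxSize)
		}
		if len(cur)+len(line) > maxSize {
			chunks = append(chunks, cur)
			cur = nil
		}
		cur = append(cur, line...)
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks, nil
}

func main() {
	batch := []byte("cpu usage=1 1\nmem used=2 2\ndisk free=3 3\n")
	chunks, err := splitDatagrams(batch, 20)
	if err != nil {
		fmt.Println("warn:", err)
		return
	}
	for _, c := range chunks {
		fmt.Printf("%q\n", c)
	}
}
```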

@danielnelson danielnelson removed this from the 1.4.0 milestone Jun 7, 2017
@phemmer
Contributor

phemmer commented Jun 7, 2017

@danielnelson
Actually there are gaps, unfortunately. The socket_writer plugin takes the batch passed to Write() and splits it into individual points. If a write error occurs on any of the points, Write() returns an error, meaning the rest of the points are skipped.

Solving this is really tricky. The reason to return an error is that telegraf will re-call Write() with the same points again, as a retry mechanism. However, in the case of a bad point, it will fail each and every time. We could skip points that fail to write, but when the problem is not the point itself but, say, the remote end, we'd be skipping a lot of points where a retry might work.
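
A simplified, self-contained sketch of the write pattern described here (the metric and socketWriter types are hypothetical stand-ins, not Telegraf's real ones): returning on the first per-point error leaves the rest of the batch unwritten until the whole batch is retried.

```go
package main

import (
	"errors"
	"fmt"
)

type metric string

type socketWriter struct {
	maxPayload int
}

// writeOne stands in for sending one serialized point as one datagram.
func (w *socketWriter) writeOne(m metric) error {
	if len(m) > w.maxPayload {
		return errors.New("write: message too long")
	}
	return nil
}

// Write mirrors the pattern: on any per-point error, return immediately,
// leaving later points in the batch unwritten; they only come back if the
// whole batch is retried.
func (w *socketWriter) Write(batch []metric) error {
	for i, m := range batch {
		if err := w.writeOne(m); err != nil {
			return fmt.Errorf("point %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	w := &socketWriter{maxPayload: 10}
	batch := []metric{"ok1", "this point is far too long", "ok2"}
	fmt.Println(w.Write(batch)) // "ok2" is never written
}
```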

@danielnelson
Contributor Author

All of the points should be resent, so what would happen in this case is that the socket_writer would be completely stuck, right? A possible exception would be an ICMP error for a previous send.

@phemmer
Contributor

phemmer commented Jun 7, 2017

When I looked at the code, it seemed like telegraf would eventually give up on the write and skip to the next batch. It looked to be driven by new data coming in, but the code was somewhat hard to follow, so I didn't dig into it too deeply.

@sspaink
Contributor

sspaink commented Jan 18, 2022

Closing due to inactivity. This might still be an issue, but after so many years a new issue with a reproducible use case would be the better way to resolve it.

@sspaink sspaink closed this as completed Jan 18, 2022