Possible truncation outputting to udp in influx format #2881

Closed
danielnelson opened this issue Jun 3, 2017 · 9 comments
Labels
bug unexpected problem or unintended behavior

Comments

@danielnelson
Contributor

Bug report

Triggered by reports of malformed metrics in #2862 and based on a code inspection, it appears that when using udp with the socket_writer output, points may be truncated when serialized in influx format.

This could occur anywhere we have a fixed output buffer.

Relevant telegraf.conf:

N/A

System info:

1.3.1

Steps to reproduce:

I have not tested this!

  1. Send a very large point in influxdb format using udp socket_writer
  2. Metrics are truncated?

Expected behavior:

Use field splitting where possible so points fit into the buffer, and warn if a point cannot possibly fit (see the sketch after this report).

Actual behavior:

I think it will be truncated.

Additional info:

#2880 (comment)
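
As a rough sketch of the field-splitting idea in the expected behavior above (not Telegraf code; the series-key handling, field formatting, and escaping here are simplified assumptions), one could greedily pack field pairs into line-protocol lines that stay under a byte limit and warn when a single field cannot possibly fit:

```go
package main

import (
	"fmt"
	"strings"
)

// splitFields greedily packs field pairs into line-protocol lines no longer
// than maxLen bytes. A field that cannot fit even on its own line is an error.
func splitFields(seriesKey string, fields map[string]string, timestamp int64, maxLen int) ([]string, error) {
	prefix := seriesKey + " "
	suffix := fmt.Sprintf(" %d", timestamp)

	var lines []string
	var cur []string
	curLen := len(prefix) + len(suffix)

	flush := func() {
		if len(cur) > 0 {
			lines = append(lines, prefix+strings.Join(cur, ",")+suffix)
			cur = nil
			curLen = len(prefix) + len(suffix)
		}
	}

	for k, v := range fields {
		pair := k + "=" + v
		if len(prefix)+len(pair)+len(suffix) > maxLen {
			return nil, fmt.Errorf("field %q cannot possibly fit in %d bytes", k, maxLen)
		}
		sep := 0
		if len(cur) > 0 {
			sep = 1 // comma between field pairs
		}
		if curLen+sep+len(pair) > maxLen {
			flush()
			sep = 0
		}
		cur = append(cur, pair)
		curLen += sep + len(pair)
	}
	flush()
	return lines, nil
}

func main() {
	fields := map[string]string{
		"usage_user":   "1.5",
		"usage_system": "0.5",
		"usage_idle":   "98",
	}
	lines, err := splitFields("cpu,host=a", fields, 1496500000000000000, 60)
	if err != nil {
		fmt.Println("warn:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Lines that share the same measurement, tags, and timestamp merge back into a single point on the InfluxDB side, so splitting a point's fields this way should preserve the data.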

@danielnelson danielnelson added the bug unexpected problem or unintended behavior label Jun 3, 2017
@danielnelson danielnelson added this to the 1.4.0 milestone Jun 3, 2017
@danielnelson
Contributor Author

cc @oplehto

@phemmer
Contributor

phemmer commented Jun 3, 2017

Valid case, but should be rather hard to trigger. The IP protocol will split the packet if it's too large. Maximum payload size is a little under 64KiB, which is an insanely huge point.

Edit: Actually I think what's likely to happen is that the write operation will throw an error. So nothing will be written. Not truncation.
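
A minimal sketch (not from the thread) that checks this claim: writing an oversized datagram with Go's net package surfaces a local error rather than silently truncating. The address here is arbitrary and no listener is needed, since the error occurs on the sending side.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:9999")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// 70000 bytes exceeds the ~64KiB maximum UDP payload, so Write should
	// return something like "write: message too long" instead of truncating.
	_, err = conn.Write(make([]byte, 70000))
	fmt.Println(err)
}
```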

@oplehto
Contributor

oplehto commented Jun 3, 2017

There are no errors thrown, at least in our case. When the bug is triggered by a single large metric in a metrics batch (the metrics gathered during an interval), the rest of the batch is quietly dropped. Thus a subset of the metrics is sent cleanly, but there are random gaps in the data.

@phemmer
Contributor

phemmer commented Jun 3, 2017

That's not what I get. When I test I get:

2017-06-03T21:35:21Z E! Error writing to output [socket_writer]: write udp 127.0.0.1:62494->127.0.0.1:9999: write: message too long

@danielnelson
Contributor Author

It looks like we send one point per packet, so there shouldn't be gaps in data due to this. Maybe the network is being overloaded by the batching, causing packets to not be received?

Looks like I was concerned about nothing here. We could run the output through the Split function but I think it's not needed at this time.
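
For illustration only, a hypothetical splitter (not the Split function referenced above) could break the serialized batch at newline boundaries so each chunk fits in one datagram:

```go
package main

import (
	"bytes"
	"fmt"
)

// splitDatagrams packs whole line-protocol lines into chunks of at most
// maxSize bytes. A single line longer than maxSize is returned as an error
// instead of being truncated.
func splitDatagrams(buf []byte, maxSize int) ([][]byte, error) {
	var chunks [][]byte
	var cur []byte
	for _, line := range bytes.SplitAfter(buf, []byte("\n")) {
		if len(line) == 0 {
			continue
		}
		if len(line) > maxSize {
			return nil, fmt.Errorf("line of %d bytes exceeds max datagram size %d", len(line), maxSize)
		}
		if len(cur)+len(line) > maxSize {
			chunks = append(chunks, cur)
			cur = nil
		}
		cur = append(cur, line...)
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks, nil
}

func main() {
	batch := []byte("cpu usage=1 1\nmem used=2 2\ndisk free=3 3\n")
	chunks, err := splitDatagrams(batch, 20)
	if err != nil {
		fmt.Println("warn:", err)
		return
	}
	for _, c := range chunks {
		fmt.Printf("%q\n", c)
	}
}
```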

@danielnelson danielnelson removed this from the 1.4.0 milestone Jun 7, 2017
@phemmer
Contributor

phemmer commented Jun 7, 2017

@danielnelson
Actually there are gaps, unfortunately. The socket_writer plugin takes the batch passed to Write() and splits it into individual points. If a write error occurs on any of the points, Write() returns an error, meaning the rest of the points are skipped.

Solving this is really tricky. The reason to return an error is that telegraf will re-call Write() with the same points again, as a retry mechanism. However, in the case of a bad point, it will fail each and every time. We could skip points that fail to write, but when the problem is not the point itself but, say, the remote end, we'd be skipping a lot of points where a retry might work.
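
A simplified, self-contained sketch of the write pattern described here (the metric and socketWriter types are hypothetical stand-ins, not Telegraf's real ones): returning on the first per-point error leaves the rest of the batch unwritten until the whole batch is retried.

```go
package main

import (
	"errors"
	"fmt"
)

type metric string

type socketWriter struct {
	maxPayload int
}

// writeOne stands in for sending one serialized point as one datagram.
func (w *socketWriter) writeOne(m metric) error {
	if len(m) > w.maxPayload {
		return errors.New("write: message too long")
	}
	return nil
}

// Write mirrors the pattern: on any per-point error, return immediately,
// leaving later points in the batch unwritten; they only come back if the
// whole batch is retried.
func (w *socketWriter) Write(batch []metric) error {
	for i, m := range batch {
		if err := w.writeOne(m); err != nil {
			return fmt.Errorf("point %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	w := &socketWriter{maxPayload: 10}
	batch := []metric{"ok1", "this point is far too long", "ok2"}
	fmt.Println(w.Write(batch)) // "ok2" is never written
}
```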

@danielnelson
Contributor Author

All of the points should be resent, so what would happen in this case is that the socket_writer would be completely stuck, right? A possible exception would be an ICMP error for a previous send.

@phemmer
Contributor

phemmer commented Jun 7, 2017

When I looked at the code, it seemed like telegraf would eventually give up on the write and skip to the next batch. It looked to be driven by new data coming in, but the code was somewhat hard to follow, so I didn't dig into it too deeply.

@sspaink
Contributor

sspaink commented Jan 18, 2022

Closing due to inactivity. This might still be an issue, but after so many years a new issue with a reproducible use case would be the better way to resolve it.

@sspaink sspaink closed this as completed Jan 18, 2022