Monitor application socket buffers #3436
Comments
Seems like this would also relate to #3039 |
If we can use The connection index pool might work okay in place of addr/ports. I imagine many would rather give up per-connection metrics for per-process ones in order to reduce cardinality, so maybe we start with this? For dealing with pids, I feel like we just need something fundamentally similar to what we have in procstat, but much better. You define a query and the name to map to it. |
I was just showing where you could see the numbers. I personally would detest telegraf shelling out to external utilities to gather this information.
For my use case I would not be able to use this. The objective is to know when there is congestion somewhere. If I have 999 clients with a 0-length buffer, and 1 client with a non-0-length buffer, any sort of average, percentile, etc, isn't going to indicate an issue. |
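To make the 999-versus-1 point concrete, here is a minimal Go sketch. The 64 KiB queue size and the `summarize` helper are made up for illustration and are not Telegraf code. With one backed-up socket out of 1000, the mean is tiny and the nearest-rank 99th percentile is still zero; only a per-connection view (or the maximum) shows the problem.

```go
package main

import (
	"fmt"
	"sort"
)

// summarize returns the mean, the nearest-rank 99th percentile and the
// maximum of a set of per-socket queue sizes. Hypothetical helper.
func summarize(queues []int) (mean float64, p99, maxQ int) {
	if len(queues) == 0 {
		return 0, 0, 0
	}
	sorted := append([]int(nil), queues...) // copy so the input is untouched
	sort.Ints(sorted)

	sum := 0
	for _, q := range sorted {
		sum += q
	}
	mean = float64(sum) / float64(len(sorted))

	// Nearest-rank percentile: rank = ceil(0.99 * n), in integer arithmetic.
	rank := (99*len(sorted) + 99) / 100
	p99 = sorted[rank-1]
	maxQ = sorted[len(sorted)-1]
	return mean, p99, maxQ
}

func main() {
	queues := make([]int, 1000) // 999 idle sockets with an empty buffer...
	queues[0] = 65536           // ...and one congested socket (assumed size)

	mean, p99, maxQ := summarize(queues)
	fmt.Printf("mean=%.3f p99=%d max=%d\n", mean, p99, maxQ)
	// prints "mean=65.536 p99=0 max=65536"
}
```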
@phemmer planning to implement this and wanted to confirm my planned metric format... When enabling this feature, I would emit a new metric series in the form (line-protocol format)
Would that work for you? I plan to allow for config filter-settings for the protocol type and the state... |
The problem with that format is going to be the key. If the application has 2 open sockets, they're going to overwrite each other. That's what all this was about in the original report:
|
@phemmer yeah I know, so you need to use the |
I don't know. I don't have a solution which makes me feel all warm and fuzzy. Even if the data is in telegraf without being de-duped, I don't know that there's much use to that. Telegraf doesn't have the advanced capabilities for doing analysis and aggregation that you can do once the data is in a database of some sort. Yes you could feed it through an external custom processor, but at that point I might question why have telegraf gather the metrics, as if you have to create a custom processor, why not have it just gather the metrics itself, then feed that to telegraf? Since I'm obviously not able to come to a decision, I'd say go for whatever you want. |
Will try to implement that and then think about some kind of processor to do the indexing that you suggested earlier. I think that is a good idea anyway... The fact that Telegraf does not squash the metrics is good, as your database might be capable of just inserting the raw data, e.g. into separate rows. The same goes for JSON output (or the like): you will get the metrics as gathered, without dedup... Thanks for your thoughts and comments! Very much appreciated! |
Feature Request
Opening a feature request kicks off a discussion.
Proposal:
Telegraf could monitor application socket send/recv buffer sizes.
Current behavior:
No such feature
Desired behavior:
Such a feature
Use case: [Why is this important (helps with prioritizing requests)]
The thoughts on this are that if there is some sort of congestion somewhere, the buffers will start filling up. On a local application, if the application isn't processing incoming data fast enough, the receive buffer will start to fill up. If the remote application isn't receiving fast enough, or if there is network congestion, the send buffer will start filling up.
These numbers are visible in the `Recv-Q` and `Send-Q` columns in `netstat` output. Also in `/proc/net/tcp` `tx_queue`/`rx_queue`.
The sticky part is how we want to monitor this, especially without causing cardinality explosion, since these buffers are tracked on a per-socket basis.
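For reference, reading the `tx_queue`/`rx_queue` values out of a `/proc/net/tcp` line could look roughly like the Go sketch below. It relies on the fifth whitespace-separated column holding both values in hex, joined by a colon. The `parseQueues` name and the sample line are invented for illustration; this is not how Telegraf does or would do it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseQueues extracts tx_queue and rx_queue from one data line of
// /proc/net/tcp. Hypothetical helper, shown only as a sketch.
func parseQueues(line string) (tx, rx uint64, err error) {
	fields := strings.Fields(line)
	if len(fields) < 5 {
		return 0, 0, fmt.Errorf("short line: %q", line)
	}
	// Column 5 looks like "00000000:00000A28" (tx_queue:rx_queue, hex).
	txHex, rxHex, ok := strings.Cut(fields[4], ":")
	if !ok {
		return 0, 0, fmt.Errorf("malformed queue column: %q", fields[4])
	}
	if tx, err = strconv.ParseUint(txHex, 16, 64); err != nil {
		return 0, 0, err
	}
	if rx, err = strconv.ParseUint(rxHex, 16, 64); err != nil {
		return 0, 0, err
	}
	return tx, rx, nil
}

func main() {
	// Made-up sample line in the /proc/net/tcp layout.
	line := "   1: 0100007F:1F90 0100007F:C350 01 00000000:00000A28 00:00000000 00000000  1000        0 54321"
	tx, rx, err := parseQueues(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("tx_queue=%d rx_queue=%d\n", tx, rx)
	// prints "tx_queue=0 rx_queue=2600"
}
```

The header line of the file has no colon in that column, so this sketch rejects it with an error instead of returning zeros.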
My original thought was to make this part of the procstat input, but it's not a one-to-one relationship. And I don't like the idea of aggregating as when there are multiple connections to various endpoints, only one of them may be an issue.
So then the next thought is a measurement for all network connections, and then a field which contains the PID using the connection, plus fields for addrs/ports. But if PID, addrs & ports are fields not tags (preventing cardinality explosion), we don't have a tag which will let us perform grouping & aggregations in InfluxDB.
My only current thought is a connection index pool. Basically a pool of numbers, and every time a new connection is seen, we grab a number from the pool (if the pool is empty, create a new number as size of pool + 1), and that uniquely identifies the connection across polling intervals. Once the connection goes away, telegraf returns that number to the pool.
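A rough Go sketch of that pool idea, to show the moving parts. Everything here is an assumption for illustration: the `IndexPool` type, the string connection key (for example a local/remote address pair), and the choice to reuse the lowest free number first.

```go
package main

import (
	"fmt"
	"sort"
)

// IndexPool hands out small reusable integers to connections so that the
// index, not the address/port, can serve as a low-cardinality tag.
type IndexPool struct {
	free   []int          // released indices waiting to be reused
	next   int            // next never-used index
	active map[string]int // connection key -> assigned index
}

func NewIndexPool() *IndexPool {
	return &IndexPool{next: 1, active: make(map[string]int)}
}

// Update is called once per polling interval with the keys of all
// connections currently seen. It returns the key -> index mapping, which
// callers should treat as read-only.
func (p *IndexPool) Update(seen []string) map[string]int {
	current := make(map[string]struct{}, len(seen))
	for _, k := range seen {
		current[k] = struct{}{}
	}
	// Connections that went away give their index back to the pool.
	for k, idx := range p.active {
		if _, ok := current[k]; !ok {
			p.free = append(p.free, idx)
			delete(p.active, k)
		}
	}
	sort.Ints(p.free) // reuse the lowest free index first
	// New connections take a free index, or a fresh one if the pool is empty.
	for _, k := range seen {
		if _, ok := p.active[k]; ok {
			continue
		}
		var idx int
		if len(p.free) > 0 {
			idx, p.free = p.free[0], p.free[1:]
		} else {
			idx = p.next
			p.next++
		}
		p.active[k] = idx
	}
	return p.active
}

func main() {
	pool := NewIndexPool()
	pool.Update([]string{"a", "b", "c"})
	m := pool.Update([]string{"a", "c", "d"}) // "b" is gone, "d" is new
	fmt.Println(m["a"], m["c"], m["d"])
	// prints "1 3 2"
}
```

The number of distinct index values is bounded by the peak number of simultaneous connections, which is what keeps the tag cardinality in check.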