
Monitor application socket buffers #3436

Closed
phemmer opened this issue Nov 6, 2017 · 9 comments · Fixed by #15423

@phemmer
Contributor

phemmer commented Nov 6, 2017

Feature Request


Proposal:

Telegraf could monitor application socket send/recv buffer sizes.

Current behavior:

No such feature

Desired behavior:

Such a feature

Use case:

The idea is that if there is congestion somewhere, the buffers will start filling up. On the local side, if the application isn't processing incoming data fast enough, the receive buffer will start to fill up. If the remote application isn't receiving fast enough, or if there is network congestion, the send buffer will start filling up.

These numbers are visible in the Recv-Q and Send-Q columns of netstat output, and in the tx_queue/rx_queue fields of /proc/net/tcp.
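For reference, here is a minimal sketch (not Telegraf code, just an illustration) of reading those tx_queue/rx_queue values straight from /proc/net/tcp in Go, with no shell-out to netstat or ss:

```go
// Sketch only: print per-socket send/receive queue sizes from /proc/net/tcp.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Scan() // skip the header line
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 5 {
			continue
		}
		// Column 5 is "tx_queue:rx_queue", both hex-encoded.
		queues := strings.SplitN(fields[4], ":", 2)
		if len(queues) != 2 {
			continue
		}
		tx, _ := strconv.ParseUint(queues[0], 16, 64)
		rx, _ := strconv.ParseUint(queues[1], 16, 64)
		// fields[1]/fields[2] are the hex-encoded local/remote "addr:port".
		fmt.Printf("local=%s remote=%s tx_queue=%d rx_queue=%d\n",
			fields[1], fields[2], tx, rx)
	}
}
```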

The sticky part is how we want to monitor this, especially without causing a cardinality explosion, since these buffers are tracked on a per-socket basis.
My original thought was to make this part of the procstat input, but it's not a one-to-one relationship. And I don't like the idea of aggregating, since when there are multiple connections to various endpoints, only one of them may be an issue.
So the next thought is a measurement for all network connections, with a field containing the PID using the connection, plus fields for addrs/ports. But if PID, addrs & ports are fields, not tags (preventing a cardinality explosion), we don't have a tag that lets us perform grouping & aggregations in InfluxDB.

My only current thought is a connection index pool. Basically a pool of numbers, and every time a new connection is seen, we grab a number from the pool (if the pool is empty, create a new number as size of pool + 1), and that uniquely identifies the connection across polling intervals. Once the connection goes away, telegraf returns that number to the pool.
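As an illustration of that index-pool idea, here is a hedged Go sketch; the names (ConnKey, IndexPool) are made up for the example and are not Telegraf code:

```go
// Sketch of a connection index pool: a free-list of integers that keeps a
// connection's index stable across polling intervals, and recycles indexes
// once connections go away.
package pool

// ConnKey is a hypothetical identity for a connection.
type ConnKey struct {
	LocalAddr, RemoteAddr string
	LocalPort, RemotePort uint16
}

type IndexPool struct {
	free    []int           // indexes returned by closed connections
	next    int             // next brand-new index when the free list is empty
	indexes map[ConnKey]int // index currently assigned to each live connection
}

func NewIndexPool() *IndexPool {
	return &IndexPool{indexes: make(map[ConnKey]int), next: 1}
}

// Acquire returns the stable index for a connection, assigning one if new.
func (p *IndexPool) Acquire(k ConnKey) int {
	if idx, ok := p.indexes[k]; ok {
		return idx
	}
	var idx int
	if n := len(p.free); n > 0 {
		idx, p.free = p.free[n-1], p.free[:n-1]
	} else {
		idx = p.next // pool empty: grow it by one
		p.next++
	}
	p.indexes[k] = idx
	return idx
}

// Release returns a closed connection's index to the pool for reuse.
func (p *IndexPool) Release(k ConnKey) {
	if idx, ok := p.indexes[k]; ok {
		delete(p.indexes, k)
		p.free = append(p.free, idx)
	}
}
```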

@phemmer
Contributor Author

phemmer commented Nov 6, 2017

Seems like this would also relate to #3039

@danielnelson
Contributor

If we can use /proc/net/tcp it will probably be cheaper than calling netstat; otherwise I guess we should try using the ss and iproute2 utilities.

The connection index pool might work okay in place of addr/ports. I imagine many would rather give up per-connection metrics for per-process metrics in order to reduce cardinality; maybe we start with this?

For dealing with PIDs, I feel like we just need to do something fundamentally similar to what we have in procstat, but much better: you define a query and the name to map to it.

@phemmer
Contributor Author

phemmer commented Nov 7, 2017

If we can use /proc/net/tcp it will probably be cheaper than calling netstat; otherwise I guess we should try using the ss and iproute2 utilities.

I was just showing where you could see the numbers. I personally would detest telegraf shelling out to external utilities to gather this information.

The connection index pool might work okay in place of addr/ports. I imagine many would rather give up per-connection metrics for per-process metrics in order to reduce cardinality; maybe we start with this?

For my use case I would not be able to use this. The objective is to know when there is congestion somewhere. If I have 999 clients with a 0-length buffer, and 1 client with a non-0-length buffer, any sort of average, percentile, etc, isn't going to indicate an issue.

@telegraf-tiger telegraf-tiger bot closed this as completed Apr 8, 2021
@influxdata influxdata deleted a comment from telegraf-tiger bot Apr 8, 2021
@sspaink sspaink reopened this Apr 8, 2021
@srebhan srebhan self-assigned this Nov 1, 2023
@srebhan
Member

srebhan commented Nov 3, 2023

@phemmer I'm planning to implement this and wanted to confirm my planned metric format... When enabling this feature, I would emit a new metric series of the form (line-protocol format)

prostat_netstat,host=prash-laptop,pattern=influxd,process_name=influxd,user=root,proto=tcp,status=listen local_addr="127.0.0.1",local_port=8086u,remote_addr="192.168.0.1",remote_port=63012u,tx_queue=0u,rx_queue=0u,timeout=0u <timestamp>

Would that work for you? I plan to allow config filter settings for the protocol type and the state...
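Purely as a sketch of what such filter settings might look like (the option names below are placeholders for illustration, not the final plugin configuration):

```toml
[[inputs.procstat]]
  pattern = "influxd"
  ## hypothetical switch to emit the per-socket series shown above
  socket_metrics = true
  ## hypothetical filters for protocol type and socket state
  socket_protocols = ["tcp", "udp"]
  socket_states = ["listen", "established"]
```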

@srebhan srebhan added the "waiting for response" label Nov 3, 2023
@phemmer
Contributor Author

phemmer commented Nov 3, 2023

The problem with that format is going to be the series key: if the application has 2 open sockets, they're going to overwrite each other.

That's what all this was about in the original report:

The sticky part is how we want to monitor this, especially without causing a cardinality explosion, since these buffers are tracked on a per-socket basis.
My original thought was to make this part of the procstat input, but it's not a one-to-one relationship. And I don't like the idea of aggregating, since when there are multiple connections to various endpoints, only one of them may be an issue.
So the next thought is a measurement for all network connections, with a field containing the PID using the connection, plus fields for addrs/ports. But if PID, addrs & ports are fields, not tags (preventing a cardinality explosion), we don't have a tag that lets us perform grouping & aggregations in InfluxDB.

My only current thought is a connection index pool. Basically a pool of numbers, and every time a new connection is seen, we grab a number from the pool (if the pool is empty, create a new number as size of pool + 1), and that uniquely identifies the connection across polling intervals. Once the connection goes away, telegraf returns that number to the pool.

@telegraf-tiger telegraf-tiger bot removed the "waiting for response" label Nov 3, 2023
@srebhan
Member

srebhan commented Nov 3, 2023

@phemmer yeah, I know; that's why you need to use the converter processor to choose the fields that should be tags. This is done to avoid the cardinality explosion. You still get multiple metrics in Telegraf, one per socket/connection, but you need additional handling if you want to send them to e.g. InfluxDB. This can be aggregation, indexing as you suggest, or something else to make the metrics distinguishable...
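For example, a converter processor configuration along these lines could promote the address/port fields of the proposed series to tags, at the cost of higher cardinality (field names taken from the metric example above):

```toml
[[processors.converter]]
  namepass = ["prostat_netstat"]
  [processors.converter.fields]
    ## promote these fields to tags so each socket becomes its own series
    tag = ["local_addr", "local_port", "remote_addr", "remote_port"]
```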

@phemmer
Contributor Author

phemmer commented Nov 3, 2023

I don't know. I don't have a solution which makes me feel all warm and fuzzy. Even if the data is in Telegraf without being de-duped, I don't know that there's much use in that. Telegraf doesn't have the advanced capabilities for analysis and aggregation that you get once the data is in a database of some sort. Yes, you could feed it through an external custom processor, but at that point I'd question why have Telegraf gather the metrics at all: if you have to create a custom processor, why not have it gather the metrics itself and then feed them to Telegraf?

Since I'm obviously not able to come to a decision, I'd say go for whatever you want.

@srebhan
Member

srebhan commented Nov 3, 2023

I will try to implement that and then think about some kind of processor to do the indexing you suggested earlier. I think that is a good idea anyway...

The fact that Telegraf does not squash the metrics is good, as your database might be able to insert the raw data into separate rows. The same goes for JSON output (or the like): you get the metrics as gathered, without deduplication...

Thanks for your thoughts and comments! Very much appreciated!

@srebhan
Member

srebhan commented May 31, 2024

@phemmer please test the binary in PR #15423 and let me know if this is what you intended!
