
Monitor application socket buffers #3436

Closed
phemmer opened this issue Nov 6, 2017 · 9 comments · Fixed by #15423

@phemmer
Contributor

phemmer commented Nov 6, 2017

Feature Request


Proposal:

Telegraf could monitor application socket send/recv buffer sizes.

Current behavior:

No such feature

Desired behavior:

Such a feature

Use case:

The idea is that if there is congestion somewhere, the buffers will start filling up. On the local side, if the application isn't processing incoming data fast enough, the receive buffer will start to fill up. If the remote application isn't receiving fast enough, or if there is network congestion, the send buffer will start filling up.

These numbers are visible in the Recv-Q and Send-Q columns of netstat output, and in the tx_queue/rx_queue fields of /proc/net/tcp.
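For reference, here is a minimal sketch (not Telegraf code, just an illustration) of reading those tx_queue/rx_queue values straight from /proc/net/tcp in Go, with no shell-out to netstat or ss:

```go
// Sketch only: print per-socket send/receive queue sizes from /proc/net/tcp.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Scan() // skip the header line
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 5 {
			continue
		}
		// Column 5 is "tx_queue:rx_queue", both hex-encoded.
		queues := strings.SplitN(fields[4], ":", 2)
		if len(queues) != 2 {
			continue
		}
		tx, _ := strconv.ParseUint(queues[0], 16, 64)
		rx, _ := strconv.ParseUint(queues[1], 16, 64)
		// fields[1]/fields[2] are the hex-encoded local/remote "addr:port".
		fmt.Printf("local=%s remote=%s tx_queue=%d rx_queue=%d\n",
			fields[1], fields[2], tx, rx)
	}
}
```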

The sticky part is how we want to monitor this, especially without causing a cardinality explosion, since these buffers are tracked on a per-socket basis.
My original thought was to make this part of the procstat input, but it's not a one-to-one relationship. And I don't like the idea of aggregating, since when there are multiple connections to various endpoints, only one of them may be an issue.
So the next thought is a measurement for all network connections, with a field containing the PID using the connection, plus fields for addrs/ports. But if PID, addrs & ports are fields, not tags (preventing a cardinality explosion), we don't have a tag that lets us perform grouping & aggregations in InfluxDB.

My only current thought is a connection index pool. Basically a pool of numbers, and every time a new connection is seen, we grab a number from the pool (if the pool is empty, create a new number as size of pool + 1), and that uniquely identifies the connection across polling intervals. Once the connection goes away, telegraf returns that number to the pool.
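As an illustration of that index-pool idea, here is a hedged Go sketch; the names (ConnKey, IndexPool) are made up for the example and are not Telegraf code:

```go
// Sketch of a connection index pool: a free-list of integers that keeps a
// connection's index stable across polling intervals, and recycles indexes
// once connections go away.
package pool

// ConnKey is a hypothetical identity for a connection.
type ConnKey struct {
	LocalAddr, RemoteAddr string
	LocalPort, RemotePort uint16
}

type IndexPool struct {
	free    []int           // indexes returned by closed connections
	next    int             // next brand-new index when the free list is empty
	indexes map[ConnKey]int // index currently assigned to each live connection
}

func NewIndexPool() *IndexPool {
	return &IndexPool{indexes: make(map[ConnKey]int), next: 1}
}

// Acquire returns the stable index for a connection, assigning one if new.
func (p *IndexPool) Acquire(k ConnKey) int {
	if idx, ok := p.indexes[k]; ok {
		return idx
	}
	var idx int
	if n := len(p.free); n > 0 {
		idx, p.free = p.free[n-1], p.free[:n-1]
	} else {
		idx = p.next // pool empty: grow it by one
		p.next++
	}
	p.indexes[k] = idx
	return idx
}

// Release returns a closed connection's index to the pool for reuse.
func (p *IndexPool) Release(k ConnKey) {
	if idx, ok := p.indexes[k]; ok {
		delete(p.indexes, k)
		p.free = append(p.free, idx)
	}
}
```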

@phemmer
Contributor Author

phemmer commented Nov 6, 2017

Seems like this would also relate to #3039

@danielnelson
Contributor

If we can use /proc/net/tcp it will probably be cheaper than calling netstat; otherwise I guess we should try using the ss and iproute2 utilities.

The connection index pool might work okay in place of addr/ports. I imagine many would rather give up per-connection metrics for per-process metrics in order to reduce cardinality; maybe we start with this?

For dealing with PIDs, I feel like we just need to do something fundamentally similar to what we have in procstat, but much better: you define a query and the name to map to it.

@phemmer
Contributor Author

phemmer commented Nov 7, 2017

If we can use /proc/net/tcp it will probably be cheaper than calling netstat; otherwise I guess we should try using the ss and iproute2 utilities.

I was just showing where you could see the numbers. I personally would detest telegraf shelling out to external utilities to gather this information.

The connection index pool might work okay in place of addr/ports. I imagine many would rather give up per-connection metrics for per-process metrics in order to reduce cardinality; maybe we start with this?

For my use case I would not be able to use this. The objective is to know when there is congestion somewhere. If I have 999 clients with a 0-length buffer, and 1 client with a non-0-length buffer, any sort of average, percentile, etc, isn't going to indicate an issue.

@telegraf-tiger telegraf-tiger bot closed this as completed Apr 8, 2021
@influxdata influxdata deleted a comment from telegraf-tiger bot Apr 8, 2021
@sspaink sspaink reopened this Apr 8, 2021
@srebhan srebhan self-assigned this Nov 1, 2023
@srebhan
Member

srebhan commented Nov 3, 2023

@phemmer I'm planning to implement this and wanted to confirm my planned metric format... When enabling this feature, I would emit a new metric series of the form (line-protocol format)

prostat_netstat,host=prash-laptop,pattern=influxd,process_name=influxd,user=root,proto=tcp,status=listen local_addr="127.0.0.1",local_port=8086u,remote_addr="192.168.0.1",remote_port=63012u,tx_queue=0u,rx_queue=0u,timeout=0u <timestamp>

Would that work for you? I plan to allow config filter settings for the protocol type and the state...
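Purely as a sketch of what such filter settings might look like (the option names below are placeholders for illustration, not the final plugin configuration):

```toml
[[inputs.procstat]]
  pattern = "influxd"
  ## hypothetical switch to emit the per-socket series shown above
  socket_metrics = true
  ## hypothetical filters for protocol type and socket state
  socket_protocols = ["tcp", "udp"]
  socket_states = ["listen", "established"]
```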

@srebhan srebhan added the "waiting for response" label Nov 3, 2023
@phemmer
Contributor Author

phemmer commented Nov 3, 2023

The problem with that format is going to be the series key: if the application has 2 open sockets, they're going to overwrite each other.

That's what all this was about in the original report:

The sticky part is how we want to monitor this, especially without causing a cardinality explosion, since these buffers are tracked on a per-socket basis.
My original thought was to make this part of the procstat input, but it's not a one-to-one relationship. And I don't like the idea of aggregating, since when there are multiple connections to various endpoints, only one of them may be an issue.
So the next thought is a measurement for all network connections, with a field containing the PID using the connection, plus fields for addrs/ports. But if PID, addrs & ports are fields, not tags (preventing a cardinality explosion), we don't have a tag that lets us perform grouping & aggregations in InfluxDB.

My only current thought is a connection index pool. Basically a pool of numbers, and every time a new connection is seen, we grab a number from the pool (if the pool is empty, create a new number as size of pool + 1), and that uniquely identifies the connection across polling intervals. Once the connection goes away, telegraf returns that number to the pool.

@telegraf-tiger telegraf-tiger bot removed the "waiting for response" label Nov 3, 2023
@srebhan
Member

srebhan commented Nov 3, 2023

@phemmer yeah, I know; that's why you need to use the converter processor to choose the fields that should be tags. This is done to avoid the cardinality explosion. You still get multiple metrics in Telegraf, one per socket/connection, but you need additional handling if you want to send them to e.g. InfluxDB. This can be aggregation, indexing as you suggest, or something else to make the metrics distinguishable...
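For example, a converter processor configuration along these lines could promote the address/port fields of the proposed series to tags, at the cost of higher cardinality (field names taken from the metric example above):

```toml
[[processors.converter]]
  namepass = ["prostat_netstat"]
  [processors.converter.fields]
    ## promote these fields to tags so each socket becomes its own series
    tag = ["local_addr", "local_port", "remote_addr", "remote_port"]
```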

@phemmer
Contributor Author

phemmer commented Nov 3, 2023

I don't know. I don't have a solution which makes me feel all warm and fuzzy. Even if the data is in Telegraf without being de-duped, I don't know that there's much use in that. Telegraf doesn't have the advanced capabilities for analysis and aggregation that you get once the data is in a database of some sort. Yes, you could feed it through an external custom processor, but at that point I'd question why have Telegraf gather the metrics at all: if you have to create a custom processor, why not have it gather the metrics itself and then feed them to Telegraf?

Since I'm obviously not able to come to a decision, I'd say go for whatever you want.

@srebhan
Member

srebhan commented Nov 3, 2023

I will try to implement that and then think about some kind of processor to do the indexing you suggested earlier. I think that is a good idea anyway...

The fact that Telegraf does not squash the metrics is good, as your database might be able to insert the raw data into separate rows. The same goes for JSON output (or the like): you get the metrics as gathered, without deduplication...

Thanks for your thoughts and comments! Very much appreciated!

@srebhan
Member

srebhan commented May 31, 2024

@phemmer please test the binary in PR #15423 and let me know if this is what you intended!
