Should system.net.tcp.retrans_segs actually use a monotonic count? #2630

Closed
jalaziz opened this issue Jun 28, 2016 · 6 comments · Fixed by DataDog/integrations-core#1551

@jalaziz
Contributor

jalaziz commented Jun 28, 2016

The documentation seems to indicate that system.net.tcp.retrans_segs and similar metrics are measured as gauges. However, it seems that it's actually being measured as a rate in the network check. I understand that rates are stored as gauges, and that a rate is simply storing the difference between the previous and current value, but since it's time normalized it can result in odd values.

For example, we typically see things like 10.62 retransmitted segments even when using a rollup with sum. This could also simply be due to the fact that the measured interval used for normalization could differ from the flush interval passed into the flush method.

Using a monotonic count seems more appropriate here. Although, I could also see the rate being useful as well.
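
To make the difference concrete, here is a minimal sketch in plain Python. It is not the Agent's actual code; the function names and sample values are made up for illustration. It assumes a rate is computed as the delta between two readings divided by the measured time between them, as described above, and that a monotonic count submits the raw delta instead.

```python
def as_rate(samples):
    """Turn (timestamp_seconds, raw_counter) samples into per-second rates."""
    points = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # Time normalization: the delta is divided by the measured interval.
        points.append((t1, (v1 - v0) / (t1 - t0)))
    return points


def as_monotonic_count(samples):
    """Turn the same samples into per-interval deltas (whole segments)."""
    points = []
    for (_, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0:
            # Assumption: a decreasing counter means a reset, so skip the point.
            continue
        points.append((t1, delta))
    return points


# Hypothetical readings of a retransmitted-segments counter, collected
# at slightly irregular intervals (15.2 s, then 14.7 s).
samples = [(0.0, 1000), (15.2, 1160), (29.9, 1320)]

rates = as_rate(samples)              # fractional values, roughly 10.5 and 10.9
counts = as_monotonic_count(samples)  # whole deltas: 160 and 160

total_from_counts = sum(value for _, value in counts)  # 320 segments
sum_of_rates = sum(value for _, value in rates)        # not a segment count
```

Summing the count points gives the number of segments retransmitted over the window, while summing the rate points gives a number with no direct meaning unless each point is first multiplied by its own interval.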

@olivielpeau
Member

@jalaziz Indeed, currently system.net.tcp.retrans_segs is sent from the network check using the rate function, which means that each point of a system.net.tcp.retrans_segs time series represents the number of retransmitted segments per second, as an average since the previous point (i.e. since the previous collection).

So from Datadog there's no easy way to query, for instance, the number of segment retransmits that occurred over a certain timeframe. If I understand correctly, this is the kind of query that you'd like to be able to make? In that case a monotonic counter would make sense, yes.

As an aside, since changing the metric type would not be backwards-compatible, if we decide to make the change we'll need to either use another metric name for the counter or wait for a major release of the Agent.

@jalaziz
Contributor Author

jalaziz commented Jun 28, 2016

Totally understand that we'd have to use a different name.

Some things that could be done now though are:

  1. Fix the documentation here to indicate that these metrics are actually rates. Interestingly, the documentation for the system.net.udp metrics correctly identifies them as rates.
  2. Fix the default units for these metrics. Again, the system.net.udp series of metrics have the correct units. I can always fix it myself for my account at least.

To be able to query the number of segments (and datagrams for UDP), I agree that a new set of metrics using a monotonic counter would be ideal. I'm happy to submit a PR for this. I'll try to come up with a reasonable name for the new metrics. Maybe simply suffix them with _count?

@olivielpeau
Member

olivielpeau commented Jul 6, 2016

@jalaziz : The default units for the system.net.udp metrics are correct now, and the units in the documentation have also been updated.

Feel free to open a PR for the new metrics. A _count suffix sounds reasonable to me. Thanks!

jalaziz added a commit to jalaziz/dd-agent that referenced this issue Sep 7, 2016
Add monotonic counts for tcp segments and udp datagrams. This allows more
precise counting of incoming and outgoing segments and datagrams for a given
time period.

Fixes DataDog#2630
@ejholmes
Contributor

Could it also be a flag in network.yaml? I'd prefer not to have the rates at all, since they aren't as useful as counts, and metrics with a _count suffix pollute the namespace a bit.

@jalaziz
Contributor Author

jalaziz commented Sep 27, 2016

@ejholmes Are you suggesting that the flag would disable the rates and not use the suffix? Or it would just disable the rates?

The reason for the suffix is to allow these metrics to be backwards compatible. If we remove the suffix, you'd probably have to fix the units in your account manually. @olivielpeau thoughts?

@ejholmes
Contributor

@jalaziz yeah, I'm advocating for the default to use rate, but to be able to use a monotonic_count if configured in network.yaml. Doing this, then changing the metric values in DataDog would be preferable IMO (but I can go either way).

jalaziz added a commit to jalaziz/datadog-integrations-core that referenced this issue May 15, 2018
Add monotonic counts for tcp segments and udp datagrams. This allows
more precise counting of incoming and outgoing segments and datagrams
for a given time period.

Count and rate metrics are also configurable, allowing users to prefer
one type over the other.

Fixes DataDog/dd-agent#2630
ofek pushed a commit to DataDog/integrations-core that referenced this issue May 22, 2018
* [network] Add monotonic counts for some metrics

Add monotonic counts for tcp segments and udp datagrams. This allows
more precise counting of incoming and outgoing segments and datagrams
for a given time period.

Count and rate metrics are also configurable, allowing users to prefer
one type over the other.

Fixes DataDog/dd-agent#2630

* [network] Use '.count' as the count suffix

* [network] Add new count metrics to metadata.csv

* [network] Fix typo in example config
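
As a rough illustration of the behavior these commits describe (rates and counts individually switchable, counts published under a '.count' suffix), here is a small Python sketch. The helper name and the two flag names are hypothetical and chosen for this example only; the defaults shown (rates on, counts off) are an assumption that mirrors the "default to rate" suggestion in the thread, and the real option names in the check's config may differ.

```python
def plan_submissions(metric, raw_value, collect_rate=True, collect_count=False):
    """Return the (kind, metric_name, value) tuples a check would submit.

    `collect_rate` and `collect_count` are hypothetical flag names standing in
    for whatever options the network check's config actually exposes.
    """
    submissions = []
    if collect_rate:
        # Existing behavior: same metric name, submitted as a rate.
        submissions.append(("rate", metric, raw_value))
    if collect_count:
        # New behavior: a separate name keeps the change backwards compatible.
        submissions.append(("monotonic_count", f"{metric}.count", raw_value))
    return submissions


# Defaults keep the old rate-only behavior.
default_plan = plan_submissions("system.net.tcp.retrans_segs", 1320)

# A user who only wants counts turns rates off and counts on.
count_only_plan = plan_submissions(
    "system.net.tcp.retrans_segs", 1320, collect_rate=False, collect_count=True
)
```

Keeping the rate under the original name and the count under a suffixed name lets both coexist, so existing dashboards built on the rate keep working.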