Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Reducing confusion around the System.Runtime gc-heap-size EventCounter #77530

Closed
noahfalk opened this issue Oct 27, 2022 · 8 comments
Assignees
Labels
area-Diagnostics-coreclr discussion enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@noahfalk
Copy link
Member

noahfalk commented Oct 27, 2022

In our set of System.Runtime counters we have a bunch of GC related counters and one of them is 'gc-heap-size'. I want to propose deprecating it or making it less initially visible and instead encourage developers to use the 'gc-committed' counter to measure the size of the GC heap. I'm opening this issue to solicit feedback and discuss if and how we should do this.

Why deprecate/move/hide it?

My main goal is to reduce confusion and simplify the diagnostics story around the GC:

  1. I believe when most developers are tracking memory usage and they care about the GC, they want to start with a single canonical size metric. However today we publish two values: "gc-committed" and "gc-heap-size". These values measure different things and the reported value could vary drastically. Even with a careful reading of the documentation I expect most developers would struggle to explain what differs about these measurements and they won't know which one is better for their needs.
  2. gc-heap-size excludes all accounting for fragmentation and it excludes other committed memory that is being managed by the GC such as capacity for new allocations in the near future. This means gc-heap-size can give a value that can be substantially less than the total amount of memory the GC has acquired from the OS. Developers that rely solely on the gc-heap-size counter also wouldn't be alerted to memory growth caused by fragmentation or large caches of committed memory.

Potential for confusion aside, there is nothing inherently wrong with the gc-heap-size counter for developers who can interpret its meaning correctly. While we could certainly improve the docs as a mitigation, the scenario will be more robust if it doesn't require all developers to read docs to avoid the sharp edges.

What would we change?

I've thought of a few options, and I'd be glad to hear other suggestions.

  1. We could create a new category of metrics, for example "System.Runtime.GC", and move the counter there. We get other requests for various detailed/niche GC counters and this might be a place to put them that allows advanced users/scenarios to get access to them without drowning simple use-cases in unneeded complexity. If this change proved low impact it might provide precedent to move more GC counters out of the default set. Currently over half of all System.Runtime counters are for the GC. A downside of this approach is it might feel odd that this new GC category doesn't have all the GC metrics?
  2. We could update the documentation to remove mention of gc-heap-size or mark it as being obsolete without making any changes to the product. This may not much make impact however because it is unclear how many devs consult the documentation in the first place.
  3. We could remove the counter from System.Runtime and remove the docs for it, then document the breaking change as part of .NET 8.
  4. We could mark the counter with some metadata indicating it is deprecated and update the dotnet-counters tool not to display counters with these markings. The downside of this approach is that it wouldn't impact any other tools without further work and even without that it is already getting somewhat complex. It becomes unclear if or when we would ever completely remove the counter.
  5. Instead of moving or removing the counter we could create a default view in dotnet-counters and document a set of recommended initial counters for developers monitoring their services. Such a recommendation would be purely optional so it is unclear how many other monitoring tools would choose to adopt it.

What is impact of moving/removing a counter?

As far as I am aware we have never done this in the BCL since EventCounters were introduced so we don't have a clear precedent to learn from. This is roughly what I would expect:

  • Apps that upgrade to .NET 8 continue to build as before (no breaks at compile time)
  • Most if not all apps would continue running fine. App behavior would only be affected for an app that set up an EventListener to self-monitor its own counters and the logic in the listener took a hard dependency on specific named counters being present. This is possible, but probably rare.
  • dotnet-counters would cease reporting the gc-heap-size counter in its default output
  • Other monitoring tools could either gracefully degrade with missing data for the gc-heap-size counter, or they might fail in a more dramatic way if they took a hard dependency on the counter being present. A hard dependency seems less likely to me, but it is impossible to know for sure without investigating each such tool individually.

Your feedback?

Currently the options I am leaning towards are 5 followed by 1, but I could really imagine this going in many different directions and feedback is much appreciated!

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 27, 2022
@noahfalk noahfalk self-assigned this Oct 27, 2022
@ghost
Copy link

ghost commented Oct 27, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

In our set of System.Runtime counters we have a bunch of GC related counters and one of them is 'gc-heap-size'. I want to propose deprecating it or making it less initially visible and instead encourage developers to use the 'gc-committed' counter to measure the size of the GC heap. I'm opening this issue to solicit feedback and discuss if and how we should do this.

Why deprecate/move/hide it?

My main goal is to reduce confusion and simplify the diagnostics story around the GC:

  1. I believe when most developers are tracking memory usage and they care about the GC, they want to start with a single canonical size metric. However today we publish two values: "gc-committed" and "gc-heap-size". These values measure different things and the reported value could vary drastically. Even with a careful reading of the documentation I expect most developers would struggle to explain what differs about these measurements and they won't know which one is better for their needs.
  2. gc-heap-size excludes all accounting for fragmentation and it excludes other committed memory that is being managed by the GC such as capacity for new allocations in the near future. This means gc-heap-size can give a value that can be substantially less than the total amount of memory the GC has acquired from the OS. Developers that rely solely on the gc-heap-size counter also wouldn't be alerted to memory growth caused by fragmentation or large caches of committed memory.

Potential for confusion aside, there is nothing inherently wrong with the gc-heap-size counter for developers who can interpret its meaning correctly. While we could certainly improve the docs as a mitigation, the scenario will be more robust if it doesn't require all developers to read docs to avoid the sharp edges.

What would we change?

I've thought of a few options, and I'd be glad to hear other suggestions.

  1. We could create a new category of metrics, for example "System.Runtime.GC", and move the counter there. We get other requests for various detailed/niche GC counters and this might be a place to put them that allows advanced users/scenarios to get access to them without drowning simple use-cases in unneeded complexity. If this change proved low impact it might provide precedent to move more GC counters out of the default set. Currently over half of all System.Runtime counters are for the GC. A downside of this approach is it might feel odd that this new GC category doesn't have all the GC metrics?
  2. We could update the documentation to remove mention of gc-heap-size or mark it as being obsolete without making any changes to the product. This may not much make impact however because it is unclear how many devs consult the documentation in the first place.
  3. We could remove the counter from System.Runtime and remove the docs for it, then document the breaking change as part of .NET 8.
  4. We could mark the counter with some metadata indicating it is deprecated and update the dotnet-counters tool not to display counters with these markings. The downside of this approach is that it wouldn't impact any other tools without further work and even without that it is already getting somewhat complex. It becomes unclear if or when we would ever completely remove the counter.
  5. Instead of moving or removing the counter we could create a default view in dotnet-counters and document a set of recommended initial counters for developers monitoring their services. Such a recommendation would be purely optional so it is unclear how many other monitoring tools would choose to adopt it.

What is impact of moving/removing a counter?

As far as I am aware we have never done this in the BCL since EventCounters introduced so we don't have a clear precedent to learn from. This is roughly what I would expect:

  • Apps that upgrade to .NET 8 continue to build as before (no breaks at compile time)
  • Most if not all apps would continue running fine. App behavior would only be affected for an app that set up an EventListener to self-monitor its own counters and the logic in the listener took a hard dependency on specific named counters being present. This is possible, but probably very rare.
  • dotnet-counters would cease reporting the gc-heap-size counter in its default output
  • Other monitoring tools could either gracefully degrade with missing data for the gc-heap-size counter, or they might fail in a more dramatic way if they took a hard dependency on the counter being present. A hard dependency seems less likely to me, but it is impossible to know for sure without investigating each such tool individually.

Your feedback?

Currently the options I am leaning towards are 5 followed by 1, but I could really imagine this going in many different directions and feedback is much appreciated!

Author: noahfalk
Assignees: -
Labels:

area-GC-coreclr, untriaged

Milestone: -

@ghost
Copy link

ghost commented Oct 27, 2022

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

Issue Details

In our set of System.Runtime counters we have a bunch of GC related counters and one of them is 'gc-heap-size'. I want to propose deprecating it or making it less initially visible and instead encourage developers to use the 'gc-committed' counter to measure the size of the GC heap. I'm opening this issue to solicit feedback and discuss if and how we should do this.

Why deprecate/move/hide it?

My main goal is to reduce confusion and simplify the diagnostics story around the GC:

  1. I believe when most developers are tracking memory usage and they care about the GC, they want to start with a single canonical size metric. However today we publish two values: "gc-committed" and "gc-heap-size". These values measure different things and the reported value could vary drastically. Even with a careful reading of the documentation I expect most developers would struggle to explain what differs about these measurements and they won't know which one is better for their needs.
  2. gc-heap-size excludes all accounting for fragmentation and it excludes other committed memory that is being managed by the GC such as capacity for new allocations in the near future. This means gc-heap-size can give a value that can be substantially less than the total amount of memory the GC has acquired from the OS. Developers that rely solely on the gc-heap-size counter also wouldn't be alerted to memory growth caused by fragmentation or large caches of committed memory.

Potential for confusion aside, there is nothing inherently wrong with the gc-heap-size counter for developers who can interpret its meaning correctly. While we could certainly improve the docs as a mitigation, the scenario will be more robust if it doesn't require all developers to read docs to avoid the sharp edges.

What would we change?

I've thought of a few options, and I'd be glad to hear other suggestions.

  1. We could create a new category of metrics, for example "System.Runtime.GC", and move the counter there. We get other requests for various detailed/niche GC counters and this might be a place to put them that allows advanced users/scenarios to get access to them without drowning simple use-cases in unneeded complexity. If this change proved low impact it might provide precedent to move more GC counters out of the default set. Currently over half of all System.Runtime counters are for the GC. A downside of this approach is it might feel odd that this new GC category doesn't have all the GC metrics?
  2. We could update the documentation to remove mention of gc-heap-size or mark it as being obsolete without making any changes to the product. This may not much make impact however because it is unclear how many devs consult the documentation in the first place.
  3. We could remove the counter from System.Runtime and remove the docs for it, then document the breaking change as part of .NET 8.
  4. We could mark the counter with some metadata indicating it is deprecated and update the dotnet-counters tool not to display counters with these markings. The downside of this approach is that it wouldn't impact any other tools without further work and even without that it is already getting somewhat complex. It becomes unclear if or when we would ever completely remove the counter.
  5. Instead of moving or removing the counter we could create a default view in dotnet-counters and document a set of recommended initial counters for developers monitoring their services. Such a recommendation would be purely optional so it is unclear how many other monitoring tools would choose to adopt it.

What is impact of moving/removing a counter?

As far as I am aware we have never done this in the BCL since EventCounters introduced so we don't have a clear precedent to learn from. This is roughly what I would expect:

  • Apps that upgrade to .NET 8 continue to build as before (no breaks at compile time)
  • Most if not all apps would continue running fine. App behavior would only be affected for an app that set up an EventListener to self-monitor its own counters and the logic in the listener took a hard dependency on specific named counters being present. This is possible, but probably very rare.
  • dotnet-counters would cease reporting the gc-heap-size counter in its default output
  • Other monitoring tools could either gracefully degrade with missing data for the gc-heap-size counter, or they might fail in a more dramatic way if they took a hard dependency on the counter being present. A hard dependency seems less likely to me, but it is impossible to know for sure without investigating each such tool individually.

Your feedback?

Currently the options I am leaning towards are 5 followed by 1, but I could really imagine this going in many different directions and feedback is much appreciated!

Author: noahfalk
Assignees: noahfalk
Labels:

area-Diagnostics-coreclr, area-GC-coreclr, discussion, untriaged

Milestone: -

@noahfalk noahfalk removed area-GC-coreclr untriaged New issue has not been triaged by the area owner labels Oct 27, 2022
@noahfalk
Copy link
Member Author

@davidfowl @jkotas @Maoni0 @davmason @sebastienros @reyang @cijothomas @samsp-msft @jander-msft - You all might be some folks who have an opinion on this or you know others who do?

@jkotas
Copy link
Member

jkotas commented Oct 27, 2022

We get other requests for various detailed/niche GC counters and this might be a place to put them that allows advanced users/scenarios to get access to them without drowning simple use-cases in unneeded complexity.

Would this also reduce the overhead of the monitoring the counters if you just want simple default set?

A downside of this approach is it might feel odd that this new GC category doesn't have all the GC metrics?

We can duplicate the simple counters in both sets.

I like option 1. I would do all counter adjustments in one bigger breaking change instead of a numbing trickle of smaller breaking changes over several releases. What would be the GC counters to keep in the simple set and GC counters to move or add to the advanced set?

@tommcdon tommcdon added this to the 8.0.0 milestone Oct 27, 2022
@reyang
Copy link

reyang commented Oct 28, 2022

My vote for 5.

@noahfalk would you consider keeping existing things unchanged, and introducing new categories where things are better sorted out, then encouraging folks (via documentation, examples) folks to move to the newly introduced categories? I feel this is the same when we compare Win32 perf counters vs. event counters, it seems scary if certain Win32 perf counters can be renamed or disappear.

Regarding the tooling, I don't know if users would treat dotnet-counters as an ad-hoc troubleshooting tool (e.g. like exception message, in general a developer shouldn't expect the message to follow specific format and parse it), or they would treat the output as established contract (e.g. windbg where the output has been used by multiple extensions). Seems dotnet-counters can be used for both scenarios when I looked at the command line arguments it is offering.

@noahfalk
Copy link
Member Author

Would this also reduce the overhead of the monitoring the counters if you just want simple default set?

Yep, although I expect it is a really small difference. Most of these counters probably generate a value in under 100ns, and a typical app might poll them on a timer once every 10-60 seconds. So this is 0.00001% kind of perf savings.

We can duplicate the simple counters in both sets.

True :)

What would be the GC counters to keep in the simple set and GC counters to move or add to the advanced set?

Probably a very subjective question that could be a topic of discussion all on its own. If we were going to do something like that I'd be tempted to be a little bold and move most of our current GC related stats. My starting proposal would be we keep the minimal set of info a dev needs to look at it to determine if the GC is impacting their app negatively. IMO that is only three metrics:

  1. Is the GC causing higher VM usage higher than I want? gc-commited
  2. Is the GC causing the service to lose too much throughput? %-time-in-gc (and fix the issues in the current way this stat is calculated)
  3. Is the GC inducing more latency than I want? Add a GC pause histogram which doesn't currently exist

Everything else like generation sizes, special heap sizes, collection counts, budgets, fragmentation, allocation rates, etc. would go in the dedicated GC section and you look at it when you have need to go deeper.

would you consider keeping existing things unchanged, and introducing new categories where things are better sorted out, then encouraging folks (via documentation, examples) folks to move to the newly introduced categories?

Certainly it seems technically possible to do a two phase plan where first we add/duplicate things where we want it + update docs, then later we do a second phase where we remove/de-dupe things we don't want. Doing it in one phase vs. two phases probably hinges on how disruptive we believe the change is and I'm hoping this discussion is one avenue that we get some feedback on it.

Seems dotnet-counters can be used for both scenarios when I looked at the command line arguments it is offering.

Yeah I believe that is accurate. "dotnet-counters monitor" is designed for ad-hoc live monitoring by a human and "dotnet-counters collect" is designed for offline review. You could run it as a simple form of free production monitoring that pre-dates more comprehensive options like dotnet-monitor or OpenTelemetry.

Thanks! And more feedback definitely welcome.

@CodeBlanch
Copy link
Contributor

@noahfalk

I like aspects of 1 & 5. Really like the idea of "recommended" counters. I'm imaging docs like...

Counter Description Recommendation First available in
GC Committed Bytes (gc-committed) The number of bytes committed by the GC Recommended mechanism for tracking memory consumption by the GC. .NET 6
% Time in GC since last GC (time-in-gc) The percent of time in GC since the last GC Recommended mechanism for tracking throughput loss due to time spent performing GC. .NET Core 3.1
GC Heap Size (gc-heap-size) The number of megabytes thought to be allocated based on GC.GetTotalMemory(Boolean) Not recommended. Does not include fragmentation or standby memory reserved for future allocations. .NET Core 3.1

...which seems much clearer. Doesn't require obsoleting anything, but does require reading docs 😄

I could also imagine groups of counters like System.Runtime.GC.Recommended \ System.Runtime.GC.Full.

@hoyosjs hoyosjs added the enhancement Product code improvement that does NOT require public API changes/additions label Oct 31, 2022
@tommcdon tommcdon modified the milestones: 8.0.0, 9.0.0 Aug 7, 2023
@tommcdon
Copy link
Member

Closing this issue. Feel free to open new issues for documentation gaps or other specific suggestions.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 23, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Diagnostics-coreclr discussion enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
Development

No branches or pull requests

6 participants