EventCounter vs PerformanceCounter documentation & guidance #346
The short answer: we want to make it easy to use EventCounter connected to a persistent store and I expect most scenarios will do so, but EventCounter in isolation from the rest of the end-to-end workflow has very minimal intrinsic persistence.

To return to the original question, how does persistence play a role here? There is a tiny amount: when component X called WriteMetric() to update the counter, that value was not immediately transmitted. Instead it was stored in memory so that it could be aggregated with other updates. The listener decides the duration of this persistence; typical values are likely to range from 1 second to 10 minutes, but that isn't required. If the process emitting the counters terminates, all in-memory data is naturally lost, but updates that were already emitted to the logging system may or may not be persisted depending on how the end-to-end flow has been set up. It is relatively easy to configure ETW, LTTng, or EventPipe to log into a file, so that is one way the data might be persisted. More commonly, performance counter data is persisted by transmitting the updates to a time-series database. I imagine that most people using counters in other contexts already have some form of persistence and probably a graphical viewer, so I see our role as making it easy to ingest the counter messages from the logging system into the user's persistent store.

Your interest in persistence on the Twitter thread seems a little different from what I expect most users are looking for. Typically I expect users to want persistence so they can do diagnostics / analytics over the historical data. If I understand correctly, you are looking to restore the value back into the memory of a service instance so that it can survive reboots, and arguably the data is functional at this point: the correctness of future counter values depends on the restoration of this state.
Trying to do this with EventCounters would likely run into two issues:
On the counter-emitting side, your code can update the counter value immediately after creating it. On the counter-receiving side, the listener needs to specify an update rate when it begins listening. I think we have a hard-coded lower bound of 1 second right now. In theory you could imagine we lowered that bound in the future, but EventCounter doesn't store any data until a listener indicates interest. If the counter producer hasn't logged any data between when listening started and when the first update is transmitted, then the listener is getting statistics from a sample of size 0. Not an error, but probably not useful.
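To make the emitting side concrete, here is a minimal sketch (the EventSource name and the metric are made up for illustration) of an EventCounter being written to; the write itself is cheap, but as described above nothing is aggregated or transmitted unless a listener has enabled the source and chosen an interval:

```csharp
using System.Diagnostics.Tracing;

[EventSource(Name = "Demo-MyService")]   // hypothetical source name
sealed class MyServiceEventSource : EventSource
{
    public static readonly MyServiceEventSource Log = new MyServiceEventSource();
    private readonly EventCounter _requestTime;

    private MyServiceEventSource()
    {
        _requestTime = new EventCounter("request-time-ms", this);
    }

    public void ReportRequestTime(float elapsedMs)
    {
        // Aggregated in memory only while a listener is subscribed;
        // with no listener attached, the value is simply dropped.
        _requestTime.WriteMetric(elapsedMs);
    }
}
```

Application code would then call `MyServiceEventSource.Log.ReportRequestTime(42.0f);` wherever the measurement is taken.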
There are two tasks to do:
I don't have exact numbers to quote, but I can offer my basic mental model. We've got plans to do some real performance investigation - this is just educated guesswork and I make no promises you would see this if you experimented today:
Overall my sense is that unless you go nuts with huge numbers of counters, lots of parallel listeners, or run an app on some very constrained hardware, the performance overhead of counters probably won't be a meaningful concern. Anecdotally, I can tell you that the ASP.NET team turns on all the default runtime and ASP.NET counters, emitting once per second, when they do performance benchmarking for TechEmpower. It is a pretty sensitive benchmark and they don't have enough measurement precision to discern any difference in the results. Sorry we got a little long there, but hopefully that was more useful than the Twitter-size answer : ) Cheers!
Thank you for the detailed response @noahfalk ... I've read it several times and wanted to confirm my layman's understanding + interpretations are correct, apologies in advance if they are not 🙂:
Some questions remain around my specific scenario: I'm still not sure if it would be possible and practical to wire up re-loading of state from a durable persistence store. For example, the passages below imply that even if on start-up one were to reload state from somewhere, the value is discarded unless there is a listener already present and attached, which is almost impossible to guarantee without blocking. In the PerformanceCounter design this is not an issue, as either an emitter or a listener keeps the current value alive in memory. Say my code emits a PerformanceCounter of type NumberOfItems64 and sets the raw value to 5. Even if there is no listener attached, the value is 5, and when a listener does attach, it will get the value of 5 and report that to Azure Monitor.
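The classic behavior being described looks roughly like this (a sketch of the old Windows-only API; "MyCategory" and "DeadLetteredMessages" are hypothetical names, and the category must already have been created, e.g. by an installer via PerformanceCounterCategory.Create, before this runs):

```csharp
using System.Diagnostics;

// Writable instance of a pre-existing NumberOfItems64 counter.
var counter = new PerformanceCounter("MyCategory", "DeadLetteredMessages", readOnly: false);

// The raw value lives in OS-managed shared memory, so a listener that
// attaches later still observes 5 even though no listener existed at write time.
counter.RawValue = 5;
```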
PS. If my understanding of the above is correct, whilst I understand many of the trade-offs in favor of the new design, the nature and properties of EventCounters are very different from PerformanceCounters. That is, by design you do not want to use memory-mapped files in the new EventCounter approach as they are not cross-platform, but there is a loss of functionality, as PerformanceCounters do remain alive, at least in memory, as long as either one listener or one emitter is attached and running... couldn't we combine the new design and keep (even if opt-in) the properties of the old design (memory-mapped files) 🙂 ?
No worries at all. I'm actually using this opportunity to figure out if my descriptions are good and where people might get confused. I've got a TODO item to write documentation for some of this new work and you are giving me a dry-run at it with free fast feedback ; )
If a listener hooks up, it could ask to receive updates for your counter every 10 seconds; this will cause the callback to be invoked at that interval. When I was referring to discarding state if there was no listener, I was talking about the in-memory aggregation state inside EventCounter itself, not your own variables.
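A sketch of the kind of PollingCounter being discussed, reading a global MyData.g_myValue through a callback (the MyData / g_myValue names follow the replies' wording; the EventSource name is made up):

```csharp
using System.Diagnostics.Tracing;
using System.Threading;

static class MyData
{
    // Owned entirely by application code; EventCounter types never store it.
    public static long g_myValue;
}

[EventSource(Name = "Demo-MyMicroservice")]   // hypothetical source name
sealed class MicroserviceEventSource : EventSource
{
    public static readonly MicroserviceEventSource Log = new MicroserviceEventSource();
    private readonly PollingCounter _myCounter;

    private MicroserviceEventSource()
    {
        // The lambda runs only at the interval a listener asked for
        // (e.g. every 10 seconds); no listener, no callbacks.
        _myCounter = new PollingCounter("my-value", this,
            () => Volatile.Read(ref MyData.g_myValue));
    }
}
```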
In the example above you could always replace MyData::g_myValue with a piece of memory that is backed by a memory-mapped file. There is also nothing which prevents a listener from remembering the last logged value it observed, regardless of whether the emitting process has terminated and no new updates are being sent. If you used one of those approaches, does that get you the properties of the solution you are looking for?
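The memory-mapped-file variant of that idea might be sketched like this (the file name is made up; on first run the value starts at zero, on later runs it is whatever the previous process left behind):

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

const string path = "counter-state.bin";   // hypothetical backing file

using var mmf = MemoryMappedFile.CreateFromFile(
    path, FileMode.OpenOrCreate, mapName: null, capacity: sizeof(long));
using var view = mmf.CreateViewAccessor(0, sizeof(long));

long value = view.ReadInt64(0);   // restored from the previous run (0 on first run)
view.Write(0, value + 1);         // the update survives process restarts via the file
```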
So in the new .NET Core approach, and to continue with the Service Fabric scenario / example / wish-list 🙂: I will have an EventSource inside of which I have a PollingCounter instance called DeadLetteredMessagesPollingCounter. On microservice start-up I initialize the value of the DeadLetteredMessagesPollingCounter with the dead-lettered message count I read from a durable, persisted store. A Log Analytics agent running on either Linux or Windows (roadmap) will then be able to subscribe to the DeadLetteredMessagesPollingCounter, configured via the Azure Portal, specifying how often the callback-report loop occurs? In the same microservice, each time a dead-lettered-message scenario is encountered, application logic needs to increment the g_myValue variable / object. Moreover, on each successful retry of a dead-lettered message, application logic needs to decrement the value of g_myValue. Wondering if this means application code also has to coordinate access to g_myValue, or are you planning to provide some level of support for this within the EventCounter types, including PollingCounter? For example, from the old PerformanceCounter documentation we have the following choice:
PS. A thought on the above: it would be most valuable to formalize and document the initialization scenario/pattern for PollingCounter, hopefully with examples in C#.
I've got this flagged to come back to, just wanted to give you the heads-up that there is a flurry of activity I have to attend to, getting changes wrapped for Preview 7. Once that calms down (couple of days?) I'll be back to this : )
Yes, with a healthy dollop of hand-waving : ) We are still pretty early exploring the Azure integration part of the puzzle so I won't have anything concrete, but my super rough goal is you deploy the app to Azure, you configure something saying you want the counters (maybe via portal? maybe something in the app?), and then the counters show up in Azure logs/graphs/reports some place that makes sense.
In PollingCounter you have total control over g_myValue so you get to decide what synchronization primitives, if any, are going to be used.
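For example, if g_myValue is updated from multiple threads, a sketch of coordinating it with interlocked operations (one choice among several; a lock would also work) could look like:

```csharp
using System.Threading;

static class DeadLetterMetrics
{
    private static long _count;   // the value a PollingCounter callback would read

    public static void OnDeadLettered() => Interlocked.Increment(ref _count);
    public static void OnRetrySucceeded() => Interlocked.Decrement(ref _count);

    // Interlocked.Read gives an atomic 64-bit read, even on 32-bit platforms.
    public static long Read() => Interlocked.Read(ref _count);
}
```

The PollingCounter's callback would then simply be `() => DeadLetterMetrics.Read()`.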
We are definitely planning to get this stuff documented as part of our 3.0 work. If you are looking for examples, some already exist inside the runtime: https://github.com/dotnet/coreclr/blob/37ff0f54f4259e2e9629c62dfc7602c37ee3a97a/src/System.Private.CoreLib/src/System/Diagnostics/Eventing/RuntimeEventSource.cs#L56. I expect we'll make something more simplified for demo purposes though.
To continue the journey, to go another round, depending on your point of view 🙂: as interlocked operations won't be part of the PollingCounter, documentation and an advanced sample would be the next best thing. An advanced PollingCounter with interlocked operations C# sample, please 😄 Regarding consumers: given EventCounter is built upon ETW when hosting on Windows, if you modify the Log Analytics agent to consume EventCounter traces, in theory this means the Log Analytics agent, or similar consumers, will then be able to consume and forward any ETW trace? I think the Log Analytics agent currently does not support consuming ETW traces, as per the below image from Azure. I hope your scope includes any ETW trace as opposed to specific ETW traces of schema type EventCounter? If the above is way down your roadmap though, I would like to implement my own monitoring microservice to consume ETW traces (including EventCounter) which other processes on the node emit. Wondering if you plan to also provide documentation and samples for how to write high-performance consumers for EventCounter? For example, I have been investigating / planning to use KrabsETW, which is used in production by the Office 365 Security team.
Request noted : ) #368
ETW is one option but EventPipe is a new 2nd option. I can't predict which one we would try to use in this scenario yet. It could depend on what Log Analytics already has in place and whether we are building a single xplat solution or two different Windows/non-Windows solutions.
Certainly I don't want to scope it smaller than it needs to be, but there are a few factors that might be an issue if we aimed for the bigger scope. Whether these will ultimately be an issue I don't know; it's just what comes to mind.
I don't have a great sense of the timing because I need to reach out to partner teams I haven't worked with much before, and I don't know what their timetables are going to be. From the outside, Microsoft often seems like an atomic unit, but if you think of it as a group of 100,000 employees, it makes sense that statistically any given person only works with a tiny fraction of the company, and that fraction changes over time. Thankfully it's still a friendly and supportive bunch. I'd be pretty surprised if it was less than six months, but I have no idea what the upper limit is.
Yeah, we'll need to. One current example that probably isn't bad is the dotnet-counters command line tool. It is a simple viewer that prints the counter values to the console. Full source is available here: https://github.com/dotnet/diagnostics/tree/master/src/Tools/dotnet-counters
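An even smaller in-process consumer than dotnet-counters can be sketched with an EventListener, which subscribes to a source's counters by passing the EventCounterIntervalSec argument (the "System.Runtime" filter below is just an example; substitute whatever source you care about):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

sealed class ConsoleCounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Runtime")   // the source(s) you care about
        {
            // Ask for counter updates once per second.
            EnableEvents(source, EventLevel.LogAlways, EventKeywords.All,
                new Dictionary<string, string> { ["EventCounterIntervalSec"] = "1" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        if (e.EventName != "EventCounters" || e.Payload == null) return;

        // Each counter update arrives as a dictionary payload.
        if (e.Payload[0] is IDictionary<string, object> payload)
            Console.WriteLine($"{payload["Name"]}: mean {payload["Mean"]}");
    }
}
```

Constructing the listener (`var listener = new ConsoleCounterListener();`) and keeping it alive is enough to start receiving updates in-process.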
You should be able to use any parser, though I'm not familiar with that one specifically. The one we use is called TraceEvent, available here: https://github.com/microsoft/perfview/tree/master/src/TraceEvent Cheers!
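A rough out-of-process sketch with TraceEvent, under stated assumptions (the Microsoft.Diagnostics.Tracing.TraceEvent NuGet package, Windows, an elevated process, and a hypothetical "Demo-MyService" provider name), might look like:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Session;

using (var session = new TraceEventSession("DemoCounterSession"))
{
    // Ask the provider to emit its counters once per second.
    session.EnableProvider("Demo-MyService", TraceEventLevel.Always,
        options: new TraceEventProviderOptions
        {
            Arguments = new Dictionary<string, string> { ["EventCounterIntervalSec"] = "1" }
        });

    session.Source.Dynamic.All += evt =>
    {
        if (evt.EventName == "EventCounters")
            Console.WriteLine(evt.PayloadValue(0));   // the counter payload
    };
    session.Source.Process();   // blocks, pumping events until the session stops
}
```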
Roger... really appreciate the engagement above and think it's time for me to digest the info further and incorporate it into my R&D! For completeness, and in the hope it influences the work yet to come, from the Hidden Treasure: Intrusion Detection with ETW (Part 2) article the following stands out:
Have linked to the Tracing and Counters Interest Group - Announcements
Great to see this is scheduled for the 5.0 milestone!
Going to close this issue because the doc work is being tracked by #515 and this issue was primarily about answering questions. If there is anything I missed, just let me know and we can reopen / open a new issue as appropriate.
On the full .NET Framework, a PerformanceCounter value seems to survive as long as one producer or one listener exists. Should all producers and listeners of the PerformanceCounter crash or restart, or the node restart, the value is lost. Does EventCounter running on the .NET Core runtime behave in the same manner?
Official guidance states that PerformanceCounters should not be created and then immediately used, due to the latency of enabling the counters; is the same true for EventCounter running on the .NET Core runtime?
What is the envisaged pattern to re-load an EventCounter on server restart (in light of previous question on delay of usage after creation)?
Lastly, any guidance on limits and resource utilization for how many EventCounter counters can be monitored per host on Windows / Linux? For example:
4.1 Can we collect/report on 1000 individual EventCounter counters every 10 seconds? What is the resulting data size (e.g. 1 GB / day) compared to PerformanceCounter?
4.2 What is the impact on host / container CPU and memory whilst monitoring 1000 individual EventCounter counters every 10 seconds?