EventSource::SendManifest causes lost events which result in unmergeable ETL files #77014
Comments
@davmason, starting with .NET 6 this will likely require producing a minimal repro and reaching out to the ETW folks. The additional overhead of the manifest events, especially when there are many processes, is also a good example of one of the negative side effects of emitting the manifest.
@brianrob: I guess that some of the lost events are part of the manifest, which leaves it incomplete and causes the merge error. Since I cannot debug into Windows to check whether the ETW infrastructure knows more about non-manifest-based providers during merging, I can only speculate. There were no lost buffers, but some events (e.g. 4) were dropped. I do not know of any other key events during merging which could cause ETW to say that the data is corrupt. That could be OS dependent, which would be even worse, because then we would need a fix on several Windows versions...
It's super interesting to me that dropped events would cause a merge error, but I suppose anything is possible. The manifest events themselves aren't special, other than that they're much larger than most events. They actually aren't used at all by xperf, as xperf doesn't know about them. Either way, if you can repro this reliably, it does feel like it's worth investigating. @davmason, is this something you can take a look at?
@AloisKraus, how often do you see the merge error? I played around with it, and when there are many .NET 6.0 processes I can reliably see the dropped events, but I haven't been able to reproduce the merge error.
@davmason: I have opened a case TrackingID#2210140040001578 which contains an example ETL file which is broken. Yes, we can reliably reproduce the issue when we start our application, which starts ca. 50+ .NET 4.8 and some .NET 6.0 processes. When we start/stop profiling for the first time right after startup, it reliably produces unmergeable ETL files.
Thanks @AloisKraus. The .NET team does not own xperf, so I am trying to get your case routed to the appropriate people who can help; I have no particular knowledge about xperf itself. For an immediate workaround, have you considered using dotnet-trace? It only works on .NET Core 3.1 and forward, but you can trace at a per-process level instead of a per-machine level.
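As a sketch, tracing a single process with dotnet-trace could look like the following; the process id, provider, buffer size, and output file name are placeholders, not values from this thread:

    rem Trace one .NET Core 3.1+/5/6 process via EventPipe instead of a machine-wide ETW session
    dotnet-trace collect --process-id 1234 --providers Microsoft-Windows-DotNETRuntime --buffersize 512 --output app.nettrace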
Hi @davmason, I will try out dotnet-trace for Docker containers, but the majority is still .NET 4.8 plus large amounts of C++. Using dotnet-trace is therefore not an option if I want to understand system-level issues. We use machine-wide ETW recording to automate perf trending tests to see the impact of antivirus, OS changes, our software, and many other things. In fact, I have written ETWAnalyzer to mass-analyze ETW data, which is working very well. Perhaps it is also something to consider for some key metrics for dotnet itself? The FileWriter example shows how it can be used to generate mass ETW data.
Sorry to hear that.
Thanks for the suggestion @AloisKraus! Feel free to open another issue to track this as an enhancement.
Since this is an issue external to .NET Core, I am marking it with the corresponding label.
@tommcdon: I have found the cause of the xperf merge error. I was running an xperf 10.0.22621 build which had this merge bug; it no longer happens with the latest Win 11 package with version 10.0.22621. Still, the event-loss issue remains. Would it be possible to add a variable to skip sending the manifest at trace session start, so one can later do a rundown where everything can be collected, even with lost events? I do this already for .NET processes, which works well. The rundown session always gets lost events, but for simplicity I reset the lost-event counter of the rundown session ETL to get WPA to open these files without warning messages.
Hi @AloisKraus, we can use this issue to track the request to introduce a configuration flag to not emit the manifest events. For the existing runtimes, are you able to avoid lost events by increasing your buffer size while tracing?
Runtimes before .NET 6, including .NET 4.8, do not emit many large events, and by default I use 512 KB buffers, which have been working fine for many years. The ETW manifest alone is already 537 KB, so it will cause lost events whenever more than one process is trying to write chunked manifest data to ETW.
As far as I know, lost events happen when an ETW buffer becomes full and no other buffer is already prepared to switch over to; all events logged to that buffer are then discarded. This will not scale with the number of .NET 6+ processes.
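For illustration, increasing the buffer size and buffer count of the user-mode session with tracelog might look like this; the session name, output file, and sizes are placeholders rather than a verified fix:

    rem Sketch: 1 MB buffers and up to 512 buffers instead of the default 512 KB buffers
    tracelog -start ClrSession -guid #e13c0d23-ccbc-4e12-931b-d9cc2eee27e4 -f clr.etl -b 1024 -min 64 -max 512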
@davmason, does this manifest always match the native provider? If it does, this feels like a good reason to not emit this event, like pre .NET 6, as the events may be coming from an unregistered native provider or from the EventSource. Thus, tools must know about the provider without depending on the manifest being in the event stream. When NativeRuntimeEventSource was initially created, it was special-cased out of emitting its manifest, because it being an EventSource was an implementation detail and it is not actually an EventSource that represents its own set of events.
After discussion, it should be safe to not emit the manifest for NativeRuntimeEventSource |
Description
When doing ETW recording on Windows, I have found that whenever a .NET 6.0 application is traced, it emits the ETW manifest every time a .NET runtime session is started. These events are large (30+ KB), and with multiple processes (I have many .NET services running on that box) they produce an event storm which causes ETW to drop events.
That in itself is not too bad, but sometimes the ETL files become corrupt so that the merge operation with xperf or PerfView fails. That is an issue.
Reproduction Steps
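The exact steps were not listed here; based on the description above, a representative sequence might look like the following (session name, file names, buffer size, and process counts are placeholders):

    rem Start a machine-wide CLR session with default-sized buffers
    tracelog -start ClrSession -guid #e13c0d23-ccbc-4e12-931b-d9cc2eee27e4 -f clr.etl -b 512
    rem Start the application, which launches 50+ .NET 4.8 and several .NET 6.0 processes
    rem Stop the session and merge; lost events are reported and the merge sometimes fails
    tracelog -stop ClrSession
    xperf -merge clr.etl merged.etl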
Expected behavior
No lost events
Actual behavior
Lost events and sometimes unmergeable ETL files
I can provide example ETL files which are broken on a private channel.
Regression?
This was not the case with .NET Core 3.1
Known Workarounds
So far none. Does filtering the event "fix" this when I later do a proper rundown?
tracelog -enableex ClrSession -guid #e13c0d23-ccbc-4e12-931b-d9cc2eee27e4 -EventIdFilter -out 1 65534
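For illustration, the later rundown could then be captured in a separate session against the rundown provider; the session name, flags, and level below are placeholders, and the GUID is assumed to be that of Microsoft-Windows-DotNETRuntimeRundown:

    rem Sketch: separate rundown session started after the main trace; flags/level are illustrative
    tracelog -start ClrRundownSession -guid #a669021c-c450-4609-a035-5af59af4df18 -f rundown.etl -flag 0xffffffff -level 5
    rem Wait for the rundown events to flush, then stop
    tracelog -stop ClrRundownSession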
Configuration
No response
Other information
No response