[release/8.0-staging] Fixes deadlock for IncrementingPollingCounter callbacks #108648

Merged: 1 commit, Oct 11, 2024

noahfalk (Member) commented Oct 8, 2024

Modified backport of #105548 to release/8.0-staging

/cc @noahfalk @eterekhin

Customer Impact

  • Customer reported
  • Found internally

The servicing request comes from the Microsoft Exchange team via internal email. This bug causes their service to occasionally hang at startup when a monitoring tool has enabled listening to the System.Runtime EventCounters. We've already had variants of this bug reported by multiple external customers, for example #93175.

The underlying issue is a deadlock caused by a lock ordering issue between the static constructor lock and the EventListener lock. It is fixed by changing the thread we issue the IncrementingPollingCounter callback on so that the EventListener lock isn't held when the callback runs.
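
To illustrate the principle only: the sketch below is not the runtime's code, and `CounterRegistry`, `_gate`, and both `Poll*` methods are hypothetical names. It contrasts invoking user callbacks while a lock is held (the deadlock-prone shape, since a callback may block on something such as a static constructor whose owner is waiting for the same lock) with taking a snapshot under the lock and invoking the callbacks after releasing it. The actual fix achieves the second shape by moving the callback to a different thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the counter machinery; NOT the runtime's actual code.
public sealed class CounterRegistry
{
    // Plays the role of the EventListener lock in the description above.
    private readonly object _gate = new object();
    private readonly List<Func<double>> _callbacks = new List<Func<double>>();

    // Lets a callback observe whether it is running under the lock.
    public bool IsGateHeldByCurrentThread => Monitor.IsEntered(_gate);

    public void Add(Func<double> callback)
    {
        lock (_gate) { _callbacks.Add(callback); }
    }

    // Deadlock-prone shape: user code runs while _gate is held.
    public double[] PollHoldingLock()
    {
        lock (_gate)
        {
            var results = new double[_callbacks.Count];
            for (int i = 0; i < results.Length; i++) results[i] = _callbacks[i]();
            return results;
        }
    }

    // Safer shape: copy the callbacks under the lock, invoke them after releasing it.
    public double[] PollOutsideLock()
    {
        Func<double>[] snapshot;
        lock (_gate) { snapshot = _callbacks.ToArray(); }

        var results = new double[snapshot.Length];
        for (int i = 0; i < results.Length; i++) results[i] = snapshot[i]();
        return results;
    }
}
```

A callback that reports `IsGateHeldByCurrentThread` sees the lock held under `PollHoldingLock` and not held under `PollOutsideLock`, which is the property that removes the lock-ordering conflict.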

Regression

  • Yes
  • No

To the best of my understanding, this bug has been present since the counters were first introduced in .NET Core 3. However, it's possible that specific details have shifted over time, allowing the bug to be hit more easily.

Testing

I manually tested in a debugger stepping through all the modified code and verifying the expected behavior.

Risk

Low - I have guarded all the changed behavior with an opt-in AppContext switch (System.Diagnostics.Tracing.CounterCallbackOnTimerThread) and verified in the debugger that the switch operates as expected. The code change is also relatively isolated and has gotten some testing in our 9.0 development branches.

More details about the code change

This is a modified backport of #105548. It mostly preserves the logic of the original fix in .NET 9 with a few adjustments:

  • Added a config switch System.Diagnostics.Tracing.CounterCallbackOnTimerThread that must be set to true to opt in to the fix behavior. The .NET 9 fix was documented as a breaking change because it slightly modifies the timing and thread used for the first call to an IncrementingPollingCounter callback. I did not want anyone in 8.0 to be opted into this by default.
  • The opt-in switch sets the property CounterCallbackOnTimerThread and I added this condition to several of the if checks in the code. It's more than would be strictly necessary, just to make it obvious when code reviewing individual methods that the new code paths are unreachable unless the app opts in.
  • The original 9.0 change had a bit more refactoring that wasn't essential (renaming a method, removing an unneeded lock() scope) and I removed that here to reduce the code delta.
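
As a rough sketch of what opting in could look like, assuming the switch is picked up through the standard runtime configuration properties (the description above only names the switch and says it must be set to true), an app's `[appname].runtimeconfig.json` might contain:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Diagnostics.Tracing.CounterCallbackOnTimerThread": true
    }
  }
}
```

When the setting is authored in a project's `runtimeconfig.template.json` instead, the `configProperties` object sits at the top level without the `runtimeOptions` wrapper.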


Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

noahfalk changed the title from "Fixes deadlock for IncrementingPollingCounter callbacks" to "[release/8.0-staging] Fixes deadlock for IncrementingPollingCounter callbacks" Oct 8, 2024

noahfalk (Member, Author) commented Oct 8, 2024

@brianrob @davmason @jkotas @tarekgh - could someone take a look at this servicing change? Thanks!

jeffschwMSFT (Member) left a comment

lgtm. please get a code review. we will take for consideration in 8.0.x

jeffschwMSFT added this to the 8.0.x milestone Oct 8, 2024
jeffschwMSFT added the Servicing-consider (Issue for next servicing release review) label Oct 8, 2024
leecow added the Servicing-approved (Approved for servicing release) label and removed the Servicing-consider (Issue for next servicing release review) label Oct 10, 2024
leecow modified the milestones: 8.0.x, 8.0.11 Oct 10, 2024
noahfalk merged commit 6c5d00f into dotnet:release/8.0-staging Oct 11, 2024
175 of 180 checks passed
github-actions bot locked and limited conversation to collaborators Nov 10, 2024