Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release/9.0] bgc deadlock fix #108774

Merged
merged 3 commits into from
Oct 11, 2024

Conversation

github-actions[bot]
Copy link
Contributor

@github-actions github-actions bot commented Oct 11, 2024

Backport of #108773 to release/9.0

/cc @Maoni0

Customer Impact

  • Customer reported
  • Found internally

This was found as a hang during the build process, specifically always when it executes this command -

"some_dir\dotnet.exe" "repo_root\artifacts\bin\ILLink.Tasks\Release\net9.0\illink.dll" @"C:\Users\andya\AppData\Local\Temp\MSBuildTemp\tmp.rsp"

some_dir could be repo_root\.dotnet or C:\Program Files\dotnet.
This will using the runtime from the 9.0 RC1 build.

This issue can cause applications to deadlock on startup when using DATAS (on by default in 9.0 for SVR workloads).

Regression

  • Yes
  • No

this was a regression introduced in #105545. this would only manifest during the beginning of a process when we needed to grow the # of heaps (and can grow) and we haven't done a gen2 GC yet so we set it to doing one.

the problem is the check I made did not include the ephemeral GC that may happen at the beginning of a BGC before we set gc_background_running to true. so at the end of that eph GC, we are in calculate_new_heap_count and set trigger_initial_gen2_p to true, not realizing we are already in a BGC.

then during the joined_generation_to_condemn at the beginning of the next GC, if our conclusion was still doing an eph GC we'd make it a BGC due to trigger_initial_gen2_p which obviously would cause problem if a BGC is already in progress (I do have an assert for this but we haven't seen this in a dbg build...).

the fix is to simply use the right check - is_bgc_in_progress() instead of background_running_p() which includes that eph GC case.

Testing

#105545 went through the GC team's stress testing but this particular issue was not observed (this manifests as a hang in retail and an assert in debug but neither was observed). this basically only happens during the beginning of a process. I could repro it with that particular illink command so I tested it using that command. will be adding more testing specifically for this area in .NET 10.

Risk

Low, this missed a case where we should be checking during a possible ephemeral GC at the beginning of a BGC and the fix is to add that check.

IMPORTANT: If this backport is for a servicing release, please verify that:

  • The PR target branch is release/X.0-staging, not release/X.0.

Copy link
Member

@mrsharm mrsharm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mangod9
Copy link
Member

mangod9 commented Oct 11, 2024

@Maoni0 @markples there was a build break due to factoring changes around distribute_free_list which I have tried to fix, please take a look.

Copy link
Member

@jeffschwMSFT jeffschwMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. we will take for consideration in 9 GA

@jeffschwMSFT jeffschwMSFT added the Servicing-consider Issue for next servicing release review label Oct 11, 2024
@jeffschwMSFT jeffschwMSFT added this to the 9.0.0 milestone Oct 11, 2024
@markples
Copy link
Member

@Maoni0 @markples there was a build break due to factoring changes around distribute_free_list which I have tried to fix, please take a look.

This is a correct fix. The build break is due to calling is_bgc_in_progress from an ISOLATED_ method (rather than an instance one). The distribute_free_list work coincidentally also did this, which is why the code change was already available; it had been applied and reverted in 9 branch.

@jeffschwMSFT jeffschwMSFT added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Oct 11, 2024
@jeffschwMSFT jeffschwMSFT merged commit 6aafadb into release/9.0 Oct 11, 2024
10 checks passed
@jkotas jkotas deleted the backport/pr-108773-to-release/9.0 branch October 12, 2024 04:40
@github-actions github-actions bot locked and limited conversation to collaborators Nov 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-GC-coreclr Servicing-approved Approved for servicing release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants