Enable crash dumps in AzDo Build pipelines #14440

Open
2 tasks done
JulieLeeMSFT opened this issue Jan 31, 2024 · 11 comments

Comments

@JulieLeeMSFT
Member

JulieLeeMSFT commented Jan 31, 2024

  • This issue is blocking
  • This issue is causing unreasonable pain

Recently, we had an intermittent runtime crash in the VMR build for preview 1, and it took multiple days to reproduce the issue and pinpoint the cause. Because there is currently no infrastructure for collecting crash dumps from AzDO build pipelines, it was an extremely painful process.

With a build as complex as the VMR, it is essential to make the system diagnosable and to have crash dump capability in AzDO builds.

It was especially painful to identify the exact VMR commit that introduced the regression. A VMR commit does not correspond to a single commit in the runtime repo; it represents a batch of commits, one for each repo flowing into installer. So it was not possible to simply check out commits in the VMR to identify the specific offending commit in runtime.

cc @markwilkie @agocke @jkotas @mthalman @MichaelSimons @hoyosjs @tommcdon

@garath
Member

garath commented Jan 31, 2024

Is the crash happening during the build or during tests? Which pipeline in particular are you interested in?

Helix does collect dumps so I'd like to get some details to understand why that wasn't working here. Some docs on how it works may be found here: https://github.com/dotnet/arcade/blob/b4e9225c6c2f9da42fbb611a5e8942a08476fe89/Documentation/Dumps/Dumps.md

@agocke
Member

agocke commented Jan 31, 2024

This is about AzDO builds, so Helix doesn't help.

@jkotas
Member

jkotas commented Jan 31, 2024

Related / partial duplicate: dotnet/dnceng#1290

@riarenas
Member

riarenas commented Feb 7, 2024

Would it help at all if we looked into this feature from 1ES? https://eng.ms/docs/cloud-ai-platform/devdiv/one-engineering-system-1es/1es-docs/1es-hosted-azure-devops-pools/hold-machine-for-debugging?

Extracting dumps is one of the scenarios that is specifically called out.

@markwilkie
Member

Another one to make sure we consider in triage, @ilyas1974 and @garath.

@dougbu
Member

dougbu commented Mar 21, 2024

> Would it help at all if we looked into this feature from 1ES? https://eng.ms/docs/cloud-ai-platform/devdiv/one-engineering-system-1es/1es-docs/1es-hosted-azure-devops-pools/hold-machine-for-debugging?
>
> Extracting dumps is one of the scenarios that is specifically called out.

Note that this option sounds costly because it means all build machines get held for a while after a build completes. It sounds like holding a machine only after a failure may eventually be implemented, though using the feature may remain expensive even then.

@missymessa
Member

This is a feature that would fit well in the Arcade SDK.

missymessa added this to the "Tracking for other teams" milestone on May 30, 2024

@ericstj
Member

ericstj commented Jul 26, 2024

Adding a ref-count to this. It would have been super useful for the recent bug we were chasing in 9.0, which only reproduced in the build and not in any of the tests (because it involved R2R and a long-lived process / stress).

With the latest built-in crash support in .NET, I think we could make this work for everyone by having Arcade set it up at the build entrypoint and having the Arcade templates ensure they pull the dumps from artifacts. @hoyosjs @ellahathaway

Here's the way @ellahathaway was doing this for VMR: dotnet/sdk#42320

I think that could be generalized, and the rough edges (like crossgen error) fixed.
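To make the "set it up at the build entrypoint" idea concrete, here is a minimal sketch, not what Arcade or the linked VMR change actually does. It assumes the documented .NET runtime settings `DOTNET_DbgEnableMiniDump`, `DOTNET_DbgMiniDumpType`, and `DOTNET_DbgMiniDumpName`; the helper names, the dump directory, and the `crash.%p.dmp` file pattern are hypothetical.

```python
import os
from pathlib import Path


def crash_dump_env(dump_dir, base_env=None):
    """Return a copy of the environment with .NET crash dump collection enabled.

    Hypothetical helper: child processes started with this environment
    (msbuild, crossgen, and so on) inherit the settings.
    """
    env = dict(os.environ if base_env is None else base_env)
    dump_dir = Path(dump_dir)
    env.update({
        "DOTNET_DbgEnableMiniDump": "1",  # collect a dump when a process crashes
        "DOTNET_DbgMiniDumpType": "4",    # 4 = full dump (1 = mini, 2 = heap, 3 = triage)
        # %p expands to the crashing process id; the file name is an assumption.
        "DOTNET_DbgMiniDumpName": str(dump_dir / "crash.%p.dmp"),
    })
    return env


def collect_dumps(dump_dir):
    """List dump files that a later pipeline step could publish as artifacts."""
    return sorted(Path(dump_dir).glob("*.dmp"))
```

A build wrapper could then launch the real entrypoint with something like `subprocess.run(build_command, env=crash_dump_env("artifacts/dumps"))`, where `build_command` stands in for whatever script the repo uses, and call `collect_dumps` afterwards to decide what to upload.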

@hoyosjs
Member

hoyosjs commented Jul 26, 2024

I have been thinking about this, but I am not sure uploading dumps as artifacts is a good idea in internal builds. Those machines are filled with secrets, and the dump would need to go through a compliant pipeline. Perhaps for PRs + testing this is OK.
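As a rough illustration of the "PRs only" gating, a publish step could check the Azure DevOps predefined variables before uploading anything. This is only a sketch: `BUILD_REASON` and `SYSTEM_TEAMPROJECT` are the environment forms of `Build.Reason` and `System.TeamProject`, but the project name `public` and the function name are assumptions, not an existing Arcade policy.

```python
import os


def should_publish_dumps(env=None):
    """Return True only for pull request builds in the assumed public project.

    Hypothetical policy: internal and official builds never publish dumps,
    because their processes may hold secrets in memory.
    """
    env = os.environ if env is None else env
    project = env.get("SYSTEM_TEAMPROJECT", "").lower()
    reason = env.get("BUILD_REASON", "")
    # "public" is an assumed project name; adjust to the real organization layout.
    return project == "public" and reason == "PullRequest"
```

Defaulting to `False` when either variable is missing keeps the behavior conservative: a dump is uploaded only when the build is positively identified as a public PR validation run.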

@ericstj
Member

ericstj commented Jul 26, 2024

I think that's a concern for logs from the official build machines already. I'm not sure we can say that build outputs from official builds will never have secrets - we just need to make sure that they land in a place that's secure. I wonder if AzDo has a way to classify the outputs of the build to make some things more sensitive than others. Maybe some outputs can require some sort of JIT elevation to access.

I agree that we get some good coverage with CI and PR validation, but I don't want us to be shy about considering the log problem from official builds. There will always be problems unique to official builds.

@hoyosjs
Member

hoyosjs commented Jul 26, 2024

Secrets are usually env settings that get loaded in the msbuild processes that crash. Binlogs now get scrubbed, but I am hesitant given the history of dumps and secrets.
