
crossgen-comparison Linux arm checked fails with timeouts #1282

Closed
sandreenko opened this issue Jan 3, 2020 · 18 comments
Labels: area-Infrastructure-coreclr, untriaged

@sandreenko (Contributor) commented Jan 3, 2020

It happens in all runs; the timeout is 60 minutes. We have not seen this in the coreclr repo.

cc @echesakovMSFT @BruceForstall

Log example:

Starting: Run native crossgen and compare output to baseline (Unix)
==============================================================================
Task         : Command line
Description  : Run a command line script using Bash on Linux and macOS and cmd.exe on Windows
Version      : 2.151.2
Author       : Microsoft Corporation
Help         : https://docs.microsoft.com/azure/devops/pipelines/tasks/utility/command-line
==============================================================================
Generating script.
Script contents:
$BUILD_SOURCESDIRECTORY/eng/common/msbuild.sh $BUILD_SOURCESDIRECTORY/eng/common/helixpublish.proj /restore /t:Test /bl:$BUILD_SOURCESDIRECTORY/artifacts/log/$BuildConfig/SendToHelix.binlog
========================== Starting Command Output ===========================
/bin/bash --noprofile --norc /__w/_temp/7553b9eb-4204-4db6-9f64-3d08371fe862.sh
/__w/3/s/.dotnet/sdk/5.0.100-alpha1-015772/MSBuild.dll /nologo -distributedlogger:Microsoft.DotNet.Tools.MSBuild.MSBuildLogger,/__w/3/s/.dotnet/sdk/5.0.100-alpha1-015772/dotnet.dll*Microsoft.DotNet.Tools.MSBuild.MSBuildForwardingLogger,/__w/3/s/.dotnet/sdk/5.0.100-alpha1-015772/dotnet.dll -maxcpucount /m -verbosity:m /v:minimal /bl:/__w/3/s/artifacts/log/Checked/SendToHelix.binlog /clp:Summary /nr:true /p:TreatWarningsAsErrors=true /p:ContinuousIntegrationBuild=false /restore /t:Test /warnaserror /__w/3/s/eng/common/helixpublish.proj
  Restore completed in 599.36 ms for /__w/3/s/eng/common/helixpublish.proj.
  Restore completed in 2.35 ms for /__w/3/s/eng/common/helixpublish.proj.
  Starting Azure Pipelines Test Run (Ubuntu.1804.Arm32.Open)[email protected]/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-30f6673-20190814153226
  Uploading payloads for Job on (Ubuntu.1804.Arm32.Open)[email protected]/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-30f6673-20190814153226...
  Finished uploading payloads for Job on (Ubuntu.1804.Arm32.Open)[email protected]/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-30f6673-20190814153226...
  Sending Job to (Ubuntu.1804.Arm32.Open)[email protected]/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-30f6673-20190814153226...
  Sent Helix Job 40b0be17-33d6-41c3-bdce-fd02aba91a98
  Waiting for completion of job 40b0be17-33d6-41c3-bdce-fd02aba91a98
##[error]The operation was canceled.
Finishing: Run native crossgen and compare output to baseline (Unix)

https://dev.azure.com/dnceng/public/_build/results?buildId=471248

Dotnet-GitSync-Bot added the untriaged label Jan 3, 2020
@BruceForstall (Member)

Is this a regression, or has it always failed in runtime?

@sandreenko (Contributor, Author)

@jashook (Contributor) commented Jan 3, 2020

Generally this failure happens because we are running at capacity for the Helix queue. Given the number of machines that were added last month, most likely something has changed in which jobs are submitted and from where, or a lot of machines have gone offline.

@jashook (Contributor) commented Jan 3, 2020

> Is this a regression, or has it always failed in runtime?

This has not always failed, although the number of jobs submitted has probably increased this month over last month.

@sandreenko (Contributor, Author) commented Jan 3, 2020

Thanks, @jashook. We could make a quick surgical fix and stop running everything twice (under both the runtime and runtime-coreclr pipelines, by deleting the duplicates from https://github.com/dotnet/runtime/blob/master/eng/pipelines/runtime-official.yml and https://github.com/dotnet/runtime/blame/master/eng/pipelines/coreclr/pr.yml), or wait a few weeks until @safern replaces the existing pipelines.
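
For illustration only, a hedged sketch of what the duplication looks like in pipeline YAML; the template paths, parameter names, and platform value below are assumptions rather than the actual contents of runtime-official.yml or pr.yml. The point is that both pipeline definitions include an equivalent entry, so the job is submitted twice per change; deleting the entry from one of the two files is the surgical fix.

```yaml
# Hypothetical entry repeated in both pipeline definitions (paths and names assumed):
- template: /eng/pipelines/common/platform-matrix.yml                             # assumed shared matrix template
  parameters:
    jobTemplate: /eng/pipelines/coreclr/templates/crossgen-comparison-job.yml     # assumed job template
    buildConfig: checked
    platforms:
    - Linux_arm
```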

@jashook (Contributor) commented Jan 3, 2020

The wait times in the queue do not seem too high; if I remember right, these jobs ran close to the timeout. I would suggest upping the timeout by 15 to 30 minutes.
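
A minimal sketch of what that could look like, assuming the job is defined in Azure Pipelines YAML: the job name below is hypothetical, the script step is taken from the log above, and `timeoutInMinutes` is the standard Azure Pipelines job-level timeout (the 60-minute limit mentioned earlier is the value being hit).

```yaml
jobs:
- job: crossgen_comparison_Linux_arm_checked      # hypothetical job name
  timeoutInMinutes: 90                            # raised from the current 60-minute limit
  steps:
  - script: >-
      $(Build.SourcesDirectory)/eng/common/msbuild.sh
      $(Build.SourcesDirectory)/eng/common/helixpublish.proj
      /restore /t:Test
    displayName: Run native crossgen and compare output to baseline (Unix)
```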

@safern (Member) commented Jan 6, 2020

> or wait for a few weeks when @safern replaces existing pipelines.

I should be removing the duplication this week. I hope to have a PR before Wednesday.

trylek added a commit to trylek/runtime that referenced this issue Jan 13, 2020
Disable crossgen comparison runs that systematically fail in
CoreCLR outerloop runs.

Tracking issue: dotnet#1282

Thanks

Tomas
trylek added a commit that referenced this issue Jan 14, 2020
Disable crossgen comparison runs that systematically fail in
CoreCLR outerloop runs.

Tracking issue: #1282

Thanks

Tomas
@safern (Member) commented Jan 16, 2020

I just merged #1473, which removes the duplication of pipelines; the job will now only be run once per PR and once in CI. I will leave this open to monitor the outcome after this change and react if needed.

jaredpar added the blocking-clean-ci label Feb 25, 2020
@jaredpar (Member)

This is presently impacting about 9% of our builds:

| Build  | Kind      | Timeline Record                            |
|--------|-----------|--------------------------------------------|
| 534481 | PR #2107  | Test crossgen-comparison Linux arm checked |
| 534999 | PR #2107  | Test crossgen-comparison Linux arm checked |
| 535931 | PR #32402 | Test crossgen-comparison Linux arm checked |
| 535806 | PR #32794 | Test crossgen-comparison Linux arm checked |
| 534710 | PR #32762 | Test crossgen-comparison Linux arm checked |
| 534703 | PR #32330 | Test crossgen-comparison Linux arm checked |
| 534497 | PR #32499 | Test crossgen-comparison Linux arm checked |
| 534413 | PR #32626 | Test crossgen-comparison Linux arm checked |
| 534345 | Rolling   | Test crossgen-comparison Linux arm checked |

@BruceForstall (Member)

@jashook You thought before that this was failing due to overloaded Linux arm hardware, leading to the timeouts. Do you still believe that?

@jashook (Contributor) commented Feb 25, 2020

Yes, I think there is more pressure on the hardware now that Mono is also running on arm.

@echesakov (Contributor)

It looks like we missed a real regression in crossgen_comparison (#32951) due to our inability to run this test without timing out.

@jashook Can we increase timeouts for this job?

@jaredpar (Member) commented Mar 2, 2020

Removing blocking label now that mitigations are in place.

jaredpar removed the blocking-clean-ci label Mar 2, 2020
@ViktorHofer (Member)

Is this issue still actionable now that the mitigation is in place? Should we open another issue or link to an existing one which tracks the ARM hardware capacity issue?

jashook closed this as completed Mar 2, 2020
ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
@BruceForstall (Member)

@echesakovMSFT I noticed that the crossgen-comparison job is still commented out in ci.yml, so it's not running in the outerloop job. Should we fix that?

@echesakov (Contributor)

@BruceForstall I think it is okay.

We still have the crossgen-comparison job triggered in PRs (if they affect code under src/coreclr). One of the latest runs is green - https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-48830-merge-9b43a99dd41241aaa5/WorkItem/console.42781747.log?sv=2019-07-07&se=2021-03-23T08%3A59%3A24Z&sr=c&sp=rl&sig=Eu7gjhAYzP2ozwKcOMC%2FsAzV9caUqAe2qehEdfwvEjM%3D

In addition, @davidwrighton added a crossgen2-outerloop pipeline that I believe runs similar comparison jobs, but with crossgen2. In theory, changes that break JIT cross-bitness, cross-architecture, or cross-OS compatibility should be caught by these jobs. In reality, these jobs have been red for a while (e.g. https://dev.azure.com/dnceng/public/_build?definitionId=701). I opened #49077 to track the issue.

@davidwrighton (Member)

@echesakovMSFT I intend to get around to fixing some of those crossgen2 comparison issues soon.

@echesakov (Contributor)

@davidwrighton Thank you, David!
