Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang/timeout in tests for Browser wasm Release AllSubsets_Mono #40550

Closed
danmoseley opened this issue Aug 7, 2020 · 13 comments
Closed

Hang/timeout in tests for Browser wasm Release AllSubsets_Mono #40550

danmoseley opened this issue Aug 7, 2020 · 13 comments
Assignees
Labels
arch-wasm WebAssembly architecture area-Codegen-Interpreter-mono blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Milestone

Comments

@danmoseley
Copy link
Member

2020-08-07T16:57:55.4437411Z   Sending Job to Ubuntu.1804.Amd64.Open...
2020-08-07T17:02:10.6813852Z   Sent Helix Job e7fed000-94bb-45ab-950e-2403ad3753b5
2020-08-07T17:02:11.7013998Z   Waiting for completion of job e7fed000-94bb-45ab-950e-2403ad3753b5 on Ubuntu.1804.Amd64.Open
2020-08-07T17:30:35.4257384Z   Job e7fed000-94bb-45ab-950e-2403ad3753b5 on Ubuntu.1804.Amd64.Open is completed with 209 finished work items.
2020-08-07T17:30:36.1014417Z   Stopping Azure Pipelines Test Run -Browser-Release-wasm-Mono_Release-Ubuntu.1804.Amd64.Open
2020-08-07T17:30:36.3391733Z /__w/1/s/.packages/microsoft.dotnet.helix.sdk/5.0.0-beta.20374.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(76,5): error : Work item e7fed000-94bb-45ab-950e-2403ad3753b5/System.Runtime.Tests in job e7fed000-94bb-45ab-950e-2403ad3753b5 has failed. [/__w/1/s/src/libraries/sendtohelixhelp.proj]
2020-08-07T17:30:36.3393772Z /__w/1/s/.packages/microsoft.dotnet.helix.sdk/5.0.0-beta.20374.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(76,5): error : Failure log: https://helix.dot.net/api/2019-06-17/jobs/e7fed000-94bb-45ab-950e-2403ad3753b5/workitems/System.Runtime.Tests/console [/__w/1/s/src/libraries/sendtohelixhelp.proj]
2020-08-07T17:30:36.3483318Z ##[error].packages/microsoft.dotnet.helix.sdk/5.0.0-beta.20374.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(76,5): error : (NETCORE_ENGINEERING_TELEMETRY=Test) Work item e7fed000-94bb-45ab-950e-2403ad3753b5/System.Runtime.Tests in job e7fed000-94bb-45ab-950e-2403ad3753b5 has failed.
Failure log: https://helix.dot.net/api/2019-06-17/jobs/e7fed000-94bb-45ab-950e-2403ad3753b5/workitems/System.Runtime.Tests/console
2020-08-07T17:30:36.3509024Z 
2020-08-07T17:30:36.3509528Z Build FAILED.
2020-08-07T17:30:36.3509769Z 
2020-08-07T17:30:36.3511286Z /__w/1/s/.packages/microsoft.dotnet.helix.sdk/5.0.0-beta.20374.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(76,5): error : Work item e7fed000-94bb-45ab-950e-2403ad3753b5/System.Runtime.Tests in job e7fed000-94bb-45ab-950e-2403ad3753b5 has failed. [/__w/1/s/src/libraries/sendtohelixhelp.proj]
2020-08-07T17:30:36.3513185Z /__w/1/s/.packages/microsoft.dotnet.helix.sdk/5.0.0-beta.20374.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(76,5): error : Failure log: https://helix.dot.net/api/2019-06-17/jobs/e7fed000-94bb-45ab-950e-2403ad3753b5/workitems/System.Runtime.Tests/console [/__w/1/s/src/libraries/sendtohelixhelp.proj]
2020-08-07T17:30:36.3514081Z     0 Warning(s)
2020-08-07T17:30:36.3514365Z     1 Error(s)
2020-08-07T17:30:36.3514602Z 
2020-08-07T17:30:36.3514876Z Time Elapsed 00:32:49.13
2020-08-07T17:30:36.3803881Z ##[error]Build failed (exit code '1').
2020-08-07T17:30:36.3839022Z 
2020-08-07T17:30:36.3881917Z ##[error]Bash exited with code '1'.
2020-08-07T17:30:36.4332912Z ##[section]Finishing: Send to Helix

https://helix.dot.net/api/2019-06-17/jobs/e7fed000-94bb-45ab-950e-2403ad3753b5/workitems/System.Runtime.Tests/console
https://dev.azure.com/dnceng/public/_build/results?buildId=761619&view=logs&jobId=108d2c4a-8a62-5a58-8dad-8e1042acc93c&j=108d2c4a-8a62-5a58-8dad-8e1042acc93c&t=09a92b4c-3493-52a9-1172-0680b62c9e30

last lines of console

      
[17:18:21] dbug: test[0]
      [PASS] System.Tests.ArrayTests.CreateInstance_Advanced(elementType: typeof(int), lengths: [1], lowerBounds: [0], repeatedValue: 0)
      
Killed

I guess whatever the next test was (not logged) just hung and then it timed out 12 mins later? Can we make it log the test before it starts it, so we have a clue which?

@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Aug 7, 2020
@ghost
Copy link

ghost commented Aug 7, 2020

Tagging subscribers to this area: @BrzVlad
See info in area-owners.md if you want to be subscribed.

@danmoseley
Copy link
Member Author

cc @lewing

@lewing lewing added this to the 5.0.0 milestone Aug 7, 2020
@safern
Copy link
Member

safern commented Aug 7, 2020

I've noticed System.Runtime tests are huge and they take an average of 9-12 mins. The timeout in helix is 12 minutes. Maybe we should bump the timeout for wasm workitems?

@safern safern removed the untriaged New issue has not been triaged by the area owner label Aug 7, 2020
@safern
Copy link
Member

safern commented Aug 7, 2020

Actually looking at the timestamps it looks like a legit hang. It just ran for 3 mins.

[17:15:27] dbug: test[0]
      Arguments: --run,WasmTestRunner.dll,System.Runtime.Tests.dll,-notrait,category=IgnoreForCI,-notrait,category=OuterLoop,-notrait,category=failing

...

[17:18:21] dbug: test[0]
      [PASS] System.Tests.ArrayTests.CreateInstance_Advanced(elementType: typeof(int), lengths: [1], lowerBounds: [0], repeatedValue: 0)

@lewing
Copy link
Member

lewing commented Aug 7, 2020

cc @vargaz

@danmoseley
Copy link
Member Author

The main log has

2020-08-07T17:02:11.7013998Z   Waiting for completion of job e7fed000-94bb-45ab-950e-2403ad3753b5 on Ubuntu.1804.Amd64.Open
2020-08-07T17:30:35.4257384Z   Job e7fed000-94bb-45ab-950e-2403ad3753b5 on Ubuntu.1804.Amd64.Open is completed with 209 finished work items.

Hence my assumption is that between the 17:18 and 17:30 timestamp there was a test jammed. (?)

@lewing
Copy link
Member

lewing commented Aug 8, 2020

it is likely #40214

@danmoseley
Copy link
Member Author

danmoseley commented Aug 12, 2020

@lewing another CI hang this time in System.Collections.NonGeneric.Tests

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-40729-merge-e1632a8cfaf84b54b4/System.Collections.NonGeneric.Tests/console.ea2cd2a6.log?sv=2019-02-02&se=2020-09-01T19%3A56%3A19Z&sr=c&sp=rl&sig=2LL%2FZ2N0oTEL%2Bzz8AVt30UdsfanM%2BU6O%2FViivjmiE1k%3D

[20:13:48] dbug: test[0]
      [PASS] System.Collections.Tests.HashtableBasicTests.IDictionary_NonGeneric_IDictionaryEnumerator_Current_FromStartToFinish(count: 75)
      
[20:13:48] dbug: test[0]
      [PASS] System.Collections.Tests.HashtableBasicTests.IEnumerable_NonGeneric_Enumerator_Current_ReturnsSameValueOnRepeatedCalls(count: 0)
      
[20:13:48] dbug: test[0]
      [PASS] System.Collections.Tests.HashtableBasicTests.IEnumerable_NonGeneric_Enumerator_Current_ReturnsSameValueOnRepeatedCalls(count: 1)
      
[20:13:48] dbug: test[0]
      [PASS] System.Collections.Tests.HashtableBasicTests.IEnumerable_NonGeneric_Enumerator_Current_ReturnsSameValueOnRepeatedCalls(count: 75)
      
[20:13:48] dbug: test[0]
      [PASS] System.Collections.Tests.HashtableBasicTests.ICollection_NonGeneric_CopyTo_ArrayOfIncorrectValueType(count: 0)
      
Killed

There is no dump so I'm not sure how we can tell whether it's #40214.

was in #40729

@danmoseley danmoseley added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Aug 12, 2020
@lewing
Copy link
Member

lewing commented Aug 12, 2020

@danmosemsft we are actively working some wasm GC related corruption problems. Blazor will ship using the same GC mode it did in 3.2 which should not have these problems. We plan to keep the on demand GC mode enabled for the runtime tests.

@danmoseley
Copy link
Member Author

Thanks @lewing

I'm trying to improve my kusto querying powers, so I did

Jobs 
| where Started > todatetime("08/10/20")
| join WorkItems on JobId
| where QueueAlias == "ubuntu.1804.amd64.open"
| extend IsWasm = Properties contains "wasm"
| summarize count() by Status, IsWasm

says

       
  Status IsWasm count_
  Pass 0 188786
  Pass 1 51838
  Fail 0 159
  BadExit 0 2151
  Timeout 0 307
  Timeout 1 29
  Fail 1 41
  Error 1 5
  InfraRetry 0 97
  BadExit 1 421
  InfraRetry 1 7

which I think means that the hangs are not a super big problem.

@SamMonoRT
Copy link
Member

@danmosemsft @lewing - considering hangs are not a big issue, can we remove the blocking-clean-ci tag. We are working on GC improvements, but will ship as full GC disabled just like 3.2 release. We will keep chipping away at fixing issues found via tests, but not targeting 5.0 RC submissions unless we identify a very nasty issue. Moving this to 6.0 Milestone and removing the blocking-clean-ci tag maybe the right approach here.

@marek-safar marek-safar modified the milestones: 5.0.0, 6.0.0 Aug 24, 2020
@jaredpar
Copy link
Member

jaredpar commented Oct 6, 2020

Closing as I don't see any recent occurences of this issue.

@jaredpar jaredpar closed this as completed Oct 6, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-wasm WebAssembly architecture area-Codegen-Interpreter-mono blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Projects
None yet
Development

No branches or pull requests

10 participants