UI testing randomly getting stuck in CI (OSOE-464) #228
Comments
I couldn't reproduce this yet in 140 builds, see https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/workflows/build-and-test.yml?query=branch%3Aissue%2FOSOE-464. However, the currently added logging shows the details of UI test execution, even if the build is canceled due to a timeout, see here. So, I think we should merge the new logging, and wait for the issue to randomly arise again. At that point, we'll be able to determine the cause better. |
Notes: Until now, this happened exclusively with the NuGetTest build under Windows. This is a new case that might be the same: With the new, streaming test logging, this failed for the root solution and under Ubuntu (i.e. the polar opposite...). The log here shows that out of the 6 test projects, only tests for 2 were run:
I checked the following tests, and they indeed ran (and all passed) before the build timed out. There are actually 28 "Finishing execution of" runs for the supposedly 26 tests!
In the end, all tests have run from the project, just a couple of them more than once. The above list contains 24 tests + 2 Edge executions (normal until now) + 2 duplicated "Finishing execution" messages. Log: So, nothing strange here. It's just that nothing happens after the "Total time" line. Then we should have a "Test successful" or "Test failed". But, this never arrives, the script is stuck between this line and this one. I guess it doesn't get stuck on |
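The counting described above (finish messages vs. distinct tests vs. repeats) can be sketched as a small script. This is only a rough aid, assuming each finished test writes a line containing "Finishing execution of" followed by the fully qualified test name; the function name and the regular expression are hypothetical, and browser-specific runs (like the Edge executions) may need extra handling depending on how they appear in the log.

```python
import re
from collections import Counter

# Assumption: a finished test logs "Finishing execution of <Fully.Qualified.Name>".
FINISHED = re.compile(r"Finishing execution of ([\w.]+)")


def summarize_finished_tests(log_text):
    """Return (finish message count, distinct test count, {test: count} for repeats)."""
    # Strip a trailing period in case the name ends a sentence in the log.
    names = [name.rstrip(".") for name in FINISHED.findall(log_text)]
    counts = Counter(names)
    repeated = {name: n for name, n in counts.items() if n > 1}
    return len(names), len(counts), repeated


if __name__ == "__main__":
    # Made-up sample lines, not taken from a real build log.
    sample = "\n".join([
        "Finishing execution of Sample.Tests.FirstTest",
        "Finishing execution of Sample.Tests.SecondTest",
        "Finishing execution of Sample.Tests.FirstTest",
    ])
    print(summarize_finished_tests(sample))
```

Running it against a downloaded build log would show at a glance whether the extra finish messages come from duplicated runs rather than from tests that should not exist.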
BTW @dministro since this hanging seems to correlate with #186, does something ring a bell? |
Yes, I faced deadlocks in NuGetTests sometimes on TC, but never on GHA. I investigated it, and replaced all |
So you could repro by basically running I'm testing disabling node reuse under Lombiq/Open-Source-Orchard-Core-Extensions#273. We'll see if it reliably fixes the issue, but what's for sure is that for the root solution, the build takes about twice as long (~8 minutes instead of ~4)... If it helps, then we can check if removing any of the switches makes it faster but still retains the fix (or perhaps only leave the env var for the |
I just remembered about |
Yes exactly. |
OK, thanks. |
Happened again here. So, until now, disabled node reuse (or rather, shutting down build servers) and input redirection didn't help. |
Perhaps the problem will get some attention from Microsoft now? microsoft/vstest#2080 (comment) |
|
Apparently, the xUnit Timing out after starting executing tests in the project but no output from the actual tests: Getting stuck between tests: Interesting, and this makes a case for the "fix for sync-over-async in I'll wait for some more data to come in. |
@0liver you told about this one happening, thanks. Since this wasn't a NuGetTest build, nor did it use the |
This run got stuck between two tests. This is suggestive of the |
Now we wait to see if the |
And it doesn't, this build failed. It hung between two tests. So, the xUnit I'm out of ideas since I've tried everything that I could think of and that everyone suggested. Perhaps any other suggestions from you @BenedekFarkas @DAud-IcI @dministro @0liver? |
This has been beyond me for quite some time. So, unfortunately, no. |
|
The only thing that comes to my mind is running the tests locally in a loop until the error occurs, then attaching the debugger to the running process. It looks like the issue affects Windows runners, which means we can reproduce it on Windows (locally), so this simplifies it. What about the distribution between standard and larger runners? Is there a significant difference? |
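The loop suggested above could look roughly like this. It is a minimal sketch, not something from the repository: the function name, the timeout, and the run count are all assumptions. The hung process is deliberately left running when the timeout expires so that a debugger can be attached to it.

```python
import subprocess


def run_until_hang(command, timeout_seconds, max_runs):
    """Run `command` repeatedly until one run exceeds the timeout.

    Returns (run_number, process) for the first run that did not finish in
    time, leaving that process alive for debugging, or (None, None) if every
    run completed within the timeout.
    """
    for run_number in range(1, max_runs + 1):
        process = subprocess.Popen(command)
        try:
            process.wait(timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            # Not killed on purpose: attach a debugger using process.pid.
            return run_number, process
    return None, None
```

For example, `run_until_hang(["dotnet", "test"], timeout_seconds=1800, max_runs=50)` would mirror the 30-minute CI timeout mentioned in this issue; the run count is arbitrary. Since the hang seems to show up mostly on 2-core runners, a similarly constrained machine or VM is probably needed for this to reproduce.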
Thank you for your tips! I'll follow up with these if the issue still surfaces. I just noticed that the NuGetTest build that recently failed wasn't actually using the new code, since I didn't NuGet-publish and update the UI Testing Toolbox... I just did that now, it's in Yes, this is by far mostly an issue with the Windows NuGetTest build, which runs on the 2-core standard runners. It happens much more rarely, but sometimes it does, with the full solution builds, also under Ubuntu (I don't know of it ever happening for the NuGetTest build under Ubuntu though). So yeah, we should be able to reproduce if constrained to a 2-core CPU. |
I’ve had tests hanging in a root build just last night, on the dev branch (Windows 2core): https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/3700330664/jobs/6268653535#step:10:3968 And another root build timed out just now, on an Ubuntu-22.04-4core runner: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/3704714311/jobs/6277678502#step:10:63 |
Thank you. This makes it clear that the |
This will come in handy: https://www.meziantou.net/generating-a-dump-file-when-tests-hang-on-a-ci-machine.htm |
We'll definitely consider him as well during our yearly review of OSS projects/contributors to sponsor. |
@meziantou would you be open to investigating and fixing this |
So, what's the plan @dministro @sarahelsaig? Builds continue to randomly fail in |
There's a hang here with a huge hang dump @dministro: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/7776721615?pr=691 Download it before it gets cleaned up. |
BTW now I expect more of these, since with GitRunners shutting down we'll have to run issue branch Windows builds on standard runners too: Lombiq/GitHub-Actions#320 |
Nah, these are red herrings, and the build now runs to completion after changing |
Surprising: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/7791752953 The only running process: We will see. |
Looks like an actual hang: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/7877646651?pr=699 |
|
I see. I made that test inspect only a single page under https://github.com/Lombiq/UI-Testing-Toolbox/pull/342/files#diff-56d06a46558b30038bf02e6c2e823aa208dc42bd2cb8125aab592bf4a8ef67a9 as opposed to potentially the whole admin at random, so that shouldn't happen anymore. |
Can the output be streaming instead? |
I'm not sure. I'll check it in the upcoming days. |
Thanks! |
Awesome! |
So, I guess this can be closed, but please see my previous message. |
Not really, here are the execution times per method from
Maybe we can increase the timeout again. |
That failure was from just before I merged the latest PR, so probably not yet. |
Ah OK, thanks. Do we still need the |
Almost finished. Using
I would highlight this one: Do we have any historical data about the execution time of each method? |
Let's keep the current configuration for a while (1-2 months). I created an issue for this to keep track. Lombiq/Open-Source-Orchard-Core-Extensions#736 |
Here's the log of ShuttingDownIdleTenantsShouldWork output.txt Note these lines:
(the first timestamp is when the test's log was flushed to the output, the second timestamp matters) I.e., the test started together with all the others, then waited ~3 minutes for the common setup to be finished, by the test that happened to start that the first time ( So, this looks normal, though the setup itself was slow: it should finish within seconds, just the tenant setup finished within 5s. However, I'd chalk this up to a random slowness of the GitHub-hosted runner, which is a random Azure VM running Windows (which is like two times slower for these already). |
So, this doesn't seem to be fixed. @dministro let's continue the investigation here instead of Lombiq/Open-Source-Orchard-Core-Extensions#736 since this is where the rest of the stuff is. As mentioned under Lombiq/Open-Source-Orchard-Core-Extensions#80, Lombiq/GitHub-Actions#373 can help you diagnose this. This run here has the same issue, but with test diag logs: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/9980235741 This looks telling:
Looks related: microsoft/vstest#2952. Especially microsoft/vstest#2952 (comment) and the "infamous 100ms bug". That looks like this line. Setting the
With 60s it still hangs. So, it seems that this is an actual hang. The last message is always "Finishing execution of Lombiq.OSOCE.Tests.UI.Tests.ThemeTests.BehaviorBlogBaseThemeTests.ContentMenuItemShouldWorkCorrectly". @sarahelsaig check this out, it might help you too. |
See e.g. this build. The NuGetTest UI testing of OSOCE got stuck and thus timed out in 30 minutes, while all the other builds worked fine. Rerunning just that build fixed the issue. This problem started way before these recent examples and the Orchard Core 1.5 upgrade. Reruns resolve the issue every time.
Other examples unrelated to the 1.5 upgrade: here, here, here, here.
Troubleshooting this is made harder by it apparently not being possible to finish the test-dotnet action and upload artifacts when the build is canceled, see Lombiq/GitHub-Actions#77.
It seems this bug didn't solve itself in the end: Lombiq/Open-Source-Orchard-Core-Extensions#126
Jira issue
See comments for notes.
To be done:
Task.Delay() potentially having an effect. - No, this may also happen if there are no failing tests, see here.
WebApplicationFactory. See:
ui-test-parallelism: 0 for NuGetTest builds. Note though that this is already the case for root builds using the larger runners, and it still happened there under Ubuntu.
"maxParallelThreads": -1.
Stuck tests outside the scope of blame-hang