Add Retries to Button tests which can be flaky #2102

joperezr · 2023-07-13T17:42:02Z

We've seen that some of our Button tests can be flaky some times, so adding the RetryHelper we use for hardware tests should help reducing that noise. For review, it is easier to make sure you check the option of ignoring spaces which would show the real diff which is only adding RetryHelper to wrap the methods.

Microsoft Reviewers: Open in CodeFlow

raffaeler · 2023-07-13T17:57:26Z

Got it. Could you please post here the failure? Even if it is random, it could be useful to adopt a different strategy.
When I added the most recent tests for the button, I simply used the same strategy to minimize failures.

krwq · 2023-07-13T18:13:31Z

do we know what driver is it using? there should really be no flakiness there unless it's SysFs (there is 100ms delay there which should be good enough for effectively anything). IMO there is little value in retrying this vs removing

krwq · 2023-07-14T14:25:10Z

I did some quick skim through the code. There is absolutely no reason why this test would fail - it doesn't use GPIO at all.

IMO this looks like product bug. If we're not planning to fix this right now I'd disable test project for now and open issue against this.

joperezr · 2023-07-14T14:27:31Z

I'm fine with that too.

raffaeler · 2023-07-14T14:27:43Z

I did some quick skim through the code. There is absolutely no reason why this test would fail - it doesn't use GPIO at all.

IMO this looks like product bug. If we're not planning to fix this right now I'd disable test project for now and open issue against this.

This is why I need to know which are the failing tests and if the behavior is randomic.
Since I was the one who made the latest changes, I can take care of this, but please tell me what is happening.

krwq · 2023-07-14T14:29:42Z

I'm running them in a loop as we speak, you might wanna try the same - possibly slower machine is needed

krwq · 2023-07-14T14:32:54Z

run-until-failure D:\src\iot\.dotnet\dotnet.exe test --no-build (run dotnet test once before this)

run-until-failure.bat:

@echo off
:repeat
%*
IF %errorlevel% == 0 GOTO :repeat

krwq · 2023-07-14T14:42:31Z

Did we see this failure on macOS only? perhaps ticks are different there and our defaults are wrong? (just brainstorming without reading code)

krwq · 2023-07-14T20:27:21Z

I've run this in a loop for a while (probably around 2h) on Windows without success. At one point I even added regular for loop on each test to make it more likely for something to show up but no success 😞

raffaeler · 2023-07-16T08:49:49Z

I left the batch running for more than 1 day on Windows without seeing any failure.

Then I tried to run dotnet test on a RPi with .NET 7 installed and I see this failing when looking for .NET 3.1.0.
Not sure if the error depends on the environment or what else.
@krwq suggestions?

pi@pibuster32:~/iot/src/devices/Button/tests $ dotnet test
  Determining projects to restore...
  All projects are up-to-date for restore.
  System.Device.Gpio -> /home/pi/iot/artifacts/bin/System.Device.Gpio/Debug/netstandard2.0/System.Device.Gpio.dll
  System.Device.Model -> /home/pi/iot/src/devices/System.Device.Model/bin/Debug/netstandard2.0/System.Device.Model.dll
  Common -> /home/pi/iot/src/devices/Common/bin/Debug/netcoreapp3.1/Common.dll
  Button -> /home/pi/iot/src/devices/Button/bin/Debug/netcoreapp3.1/Button.dll
  Button.tests -> /home/pi/iot/src/devices/Button/tests/bin/Debug/netcoreapp3.1/Button.tests.dll
Test run for /home/pi/iot/src/devices/Button/tests/bin/Debug/netcoreapp3.1/Button.tests.dll (.NETCoreApp,Version=v3.1)
Microsoft (R) Test Execution Command Line Tool Version 17.6.3 (arm)
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...
A total of 1 test files matched the specified pattern.
Testhost process for source(s) '/home/pi/iot/src/devices/Button/tests/bin/Debug/netcoreapp3.1/Button.tests.dll' exited with error: You must install or update .NET to run this application.
App: /home/pi/.nuget/packages/microsoft.testplatform.testhost/16.9.4/lib/netcoreapp2.1/testhost.dll
Architecture: arm
Framework: 'Microsoft.NETCore.App', version '3.1.0' (arm)
.NET location: /home/pi/dotnet/
The following frameworks were found:
  7.0.9 at [/home/pi/dotnet/shared/Microsoft.NETCore.App]
Learn about framework resolution:
https://aka.ms/dotnet/app-launch-failed
To install missing framework, download:
https://aka.ms/dotnet-core-applaunch?framework=Microsoft.NETCore.App&framework_version=3.1.0&arch=arm&rid=raspbian.10-arm
. Please check the diagnostic logs for more information.

Test Run Aborted.
pi@pibuster32:~/iot/src/devices/Button/tests $

pgrawehr · 2023-07-16T12:10:07Z

@raffaeler Are you sure you're running the latest master version? Button.tests.dll has <TargetFramework>$(DefaultTestTfms)</TargetFramework>, and $(DefaultTestTfms) is defined as <DefaultTestTfms>net6.0</DefaultTestTfms> in src\devices\Directory.build.props, So I see no reason why it should be compiling against .NET core 3.1.

There was however an issue with an incorrect runtime selection by the test runner, caused by a bug in the build infrastructure. I'm not sure that is relevant here, though.

raffaeler · 2023-07-16T15:17:37Z

There was however an issue with an incorrect runtime selection by the test runner, caused by a bug in the build infrastructure. I'm not sure that is relevant here, though.

I saw you removed 3.1 from the very latest, but I assumed it should work with the version specified by global.json and this is not the case.

Now I did git pull and get the same with error, but with .NET 6:

Testhost process for source(s) '/home/pi/iot/src/devices/Button/tests/bin/Debug/net6.0/Button.tests.dll' exited with error: You must install or update .NET to run this application.
App: /home/pi/iot/src/devices/Button/tests/bin/Debug/net6.0/testhost.dll
Architecture: arm
Framework: 'Microsoft.NETCore.App', version '6.0.0' (arm)
.NET location: /home/pi/dotnet/
The following frameworks were found:
  7.0.9 at [/home/pi/dotnet/shared/Microsoft.NETCore.App]
Learn about framework resolution:
https://aka.ms/dotnet/app-launch-failed
To install missing framework, download:
https://aka.ms/dotnet-core-applaunch?framework=Microsoft.NETCore.App&framework_version=6.0.0&arch=arm&rid=raspbian.10-arm
. Please check the diagnostic logs for more information.

Test Run Aborted.

raffaeler · 2023-07-16T15:20:59Z

Now I modified the global.json to run on .NET 6 and pointed my global installation to .NET 6 as well.
The dotnet test now is successful. I will try to re-run indefinitely to see if it breaks over time

BTW I don't like this infrastructure, it's weird.

raffaeler · 2023-07-16T19:52:16Z

I will leave the tests running on the RPi all night long, but they are currently running for more than 4 hours and never failed.
I still can't get what was going wrong.
@joperezr can you give me some hint from the past logs?

raffaeler · 2023-07-17T09:30:57Z

The tests run for more than 2000 times on the Rpi without any failure.
Now I am stopping them and wait for some more info.

krwq · 2023-07-17T12:39:40Z

perhaps let's just disable them on macOS if this is macOS specific? I don't like doing this but I don't see a value of tests running in CI if we can't even get a log...

joperezr · 2023-07-17T18:48:34Z

Another option is to add more telemetry or logs so that we can understand what is going on and investigate. One way of doing that would be to have the test log some output to a file, and then to have a target that runs after running the tests which checks for this file and if it exists it prints it out. This would guarantee that we would get those logs even if the thread running the tests didn't have the console.

raffaeler · 2023-07-17T20:54:58Z

@joperezr If you mean logging the test output, isn't this by design in dotnet test?
I expect to see a failed assert or an exception, right?
And this should be in the pipeline logs.

If you mean something else, please explain how do you believe we/I should modify the tests.
TIA

pgrawehr · 2023-07-18T06:12:18Z

@raffaeler The problem is a bug in the test runner. For some reason (Jose knows more) - when the tests run on multiple CPU cores simultaneously - the output of only one CPU ends in the log, the other parts are lost somewhere. The probem would be mitigated if we found a way around this.

Much of the build scripting is hidden in the arcade scripts though, and I think we don't even have direct access to that, or it would need to be fixed there.

krwq · 2023-07-18T09:27:08Z

Most likely this bug: microsoft/vstest#2224 (please upvote)

joperezr · 2023-07-27T16:03:03Z

[Triage] We talked about this during triage and decided that we don't want to add retries as we want to ideally be able to reproduce the failure for now in order to be able to fix the test, so I'm going to go ahead and close this for now. Hopefully microsoft/vstest#2702 can get merged soon which would help us knowing which test is the culprit.

Add Retries to Button tests which can be flaky

6ab4b63

joperezr requested review from krwq, raffaeler and pgrawehr July 13, 2023 17:42

ghost added the area-device-bindings Device Bindings for audio, sensor, motor, and display hardware that can used with System.Device.Gpio label Jul 13, 2023

joperezr closed this Aug 16, 2023

github-actions bot locked and limited conversation to collaborators Dec 8, 2023

joperezr deleted the joperezr/RetryButtonTests branch June 13, 2024 19:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Retries to Button tests which can be flaky #2102

Add Retries to Button tests which can be flaky #2102

joperezr commented Jul 13, 2023 •

edited by ghost

Loading

raffaeler commented Jul 13, 2023

krwq commented Jul 13, 2023 •

edited

Loading

krwq commented Jul 14, 2023

joperezr commented Jul 14, 2023

raffaeler commented Jul 14, 2023

krwq commented Jul 14, 2023

krwq commented Jul 14, 2023 •

edited

Loading

krwq commented Jul 14, 2023

krwq commented Jul 14, 2023 •

edited

Loading

raffaeler commented Jul 16, 2023

pgrawehr commented Jul 16, 2023

raffaeler commented Jul 16, 2023

raffaeler commented Jul 16, 2023

raffaeler commented Jul 16, 2023

raffaeler commented Jul 17, 2023

krwq commented Jul 17, 2023 •

edited

Loading

joperezr commented Jul 17, 2023

raffaeler commented Jul 17, 2023

pgrawehr commented Jul 18, 2023

krwq commented Jul 18, 2023

joperezr commented Jul 27, 2023

Add Retries to Button tests which can be flaky #2102

Add Retries to Button tests which can be flaky #2102

Conversation

joperezr commented Jul 13, 2023 • edited by ghost Loading

Microsoft Reviewers: Open in CodeFlow

raffaeler commented Jul 13, 2023

krwq commented Jul 13, 2023 • edited Loading

krwq commented Jul 14, 2023

joperezr commented Jul 14, 2023

raffaeler commented Jul 14, 2023

krwq commented Jul 14, 2023

krwq commented Jul 14, 2023 • edited Loading

krwq commented Jul 14, 2023

krwq commented Jul 14, 2023 • edited Loading

raffaeler commented Jul 16, 2023

pgrawehr commented Jul 16, 2023

raffaeler commented Jul 16, 2023

raffaeler commented Jul 16, 2023

raffaeler commented Jul 16, 2023

raffaeler commented Jul 17, 2023

krwq commented Jul 17, 2023 • edited Loading

joperezr commented Jul 17, 2023

raffaeler commented Jul 17, 2023

pgrawehr commented Jul 18, 2023

krwq commented Jul 18, 2023

joperezr commented Jul 27, 2023

joperezr commented Jul 13, 2023 •

edited by ghost

Loading

krwq commented Jul 13, 2023 •

edited

Loading

krwq commented Jul 14, 2023 •

edited

Loading

krwq commented Jul 14, 2023 •

edited

Loading

krwq commented Jul 17, 2023 •

edited

Loading