Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Retries to Button tests which can be flaky #2102

Closed
wants to merge 1 commit into from

Conversation

joperezr
Copy link
Member

@joperezr joperezr commented Jul 13, 2023

We've seen that some of our Button tests can be flaky some times, so adding the RetryHelper we use for hardware tests should help reducing that noise. For review, it is easier to make sure you check the option of ignoring spaces which would show the real diff which is only adding RetryHelper to wrap the methods.

Microsoft Reviewers: Open in CodeFlow

@joperezr joperezr requested review from krwq, raffaeler and pgrawehr July 13, 2023 17:42
@ghost ghost added the area-device-bindings Device Bindings for audio, sensor, motor, and display hardware that can used with System.Device.Gpio label Jul 13, 2023
@raffaeler
Copy link
Contributor

Got it. Could you please post here the failure? Even if it is random, it could be useful to adopt a different strategy.
When I added the most recent tests for the button, I simply used the same strategy to minimize failures.

@krwq
Copy link
Member

krwq commented Jul 13, 2023

do we know what driver is it using? there should really be no flakiness there unless it's SysFs (there is 100ms delay there which should be good enough for effectively anything). IMO there is little value in retrying this vs removing

@krwq
Copy link
Member

krwq commented Jul 14, 2023

I did some quick skim through the code. There is absolutely no reason why this test would fail - it doesn't use GPIO at all.

IMO this looks like product bug. If we're not planning to fix this right now I'd disable test project for now and open issue against this.

@joperezr
Copy link
Member Author

I'm fine with that too.

@raffaeler
Copy link
Contributor

I did some quick skim through the code. There is absolutely no reason why this test would fail - it doesn't use GPIO at all.

IMO this looks like product bug. If we're not planning to fix this right now I'd disable test project for now and open issue against this.

This is why I need to know which are the failing tests and if the behavior is randomic.
Since I was the one who made the latest changes, I can take care of this, but please tell me what is happening.

@krwq
Copy link
Member

krwq commented Jul 14, 2023

I'm running them in a loop as we speak, you might wanna try the same - possibly slower machine is needed

@krwq
Copy link
Member

krwq commented Jul 14, 2023

run-until-failure D:\src\iot\.dotnet\dotnet.exe test --no-build (run dotnet test once before this)

run-until-failure.bat:

@echo off
:repeat
%*
IF %errorlevel% == 0 GOTO :repeat

@krwq
Copy link
Member

krwq commented Jul 14, 2023

Did we see this failure on macOS only? perhaps ticks are different there and our defaults are wrong? (just brainstorming without reading code)

@krwq
Copy link
Member

krwq commented Jul 14, 2023

I've run this in a loop for a while (probably around 2h) on Windows without success. At one point I even added regular for loop on each test to make it more likely for something to show up but no success 😞

@raffaeler
Copy link
Contributor

I left the batch running for more than 1 day on Windows without seeing any failure.

Then I tried to run dotnet test on a RPi with .NET 7 installed and I see this failing when looking for .NET 3.1.0.
Not sure if the error depends on the environment or what else.
@krwq suggestions?

pi@pibuster32:~/iot/src/devices/Button/tests $ dotnet test
  Determining projects to restore...
  All projects are up-to-date for restore.
  System.Device.Gpio -> /home/pi/iot/artifacts/bin/System.Device.Gpio/Debug/netstandard2.0/System.Device.Gpio.dll
  System.Device.Model -> /home/pi/iot/src/devices/System.Device.Model/bin/Debug/netstandard2.0/System.Device.Model.dll
  Common -> /home/pi/iot/src/devices/Common/bin/Debug/netcoreapp3.1/Common.dll
  Button -> /home/pi/iot/src/devices/Button/bin/Debug/netcoreapp3.1/Button.dll
  Button.tests -> /home/pi/iot/src/devices/Button/tests/bin/Debug/netcoreapp3.1/Button.tests.dll
Test run for /home/pi/iot/src/devices/Button/tests/bin/Debug/netcoreapp3.1/Button.tests.dll (.NETCoreApp,Version=v3.1)
Microsoft (R) Test Execution Command Line Tool Version 17.6.3 (arm)
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...
A total of 1 test files matched the specified pattern.
Testhost process for source(s) '/home/pi/iot/src/devices/Button/tests/bin/Debug/netcoreapp3.1/Button.tests.dll' exited with error: You must install or update .NET to run this application.
App: /home/pi/.nuget/packages/microsoft.testplatform.testhost/16.9.4/lib/netcoreapp2.1/testhost.dll
Architecture: arm
Framework: 'Microsoft.NETCore.App', version '3.1.0' (arm)
.NET location: /home/pi/dotnet/
The following frameworks were found:
  7.0.9 at [/home/pi/dotnet/shared/Microsoft.NETCore.App]
Learn about framework resolution:
https://aka.ms/dotnet/app-launch-failed
To install missing framework, download:
https://aka.ms/dotnet-core-applaunch?framework=Microsoft.NETCore.App&framework_version=3.1.0&arch=arm&rid=raspbian.10-arm
. Please check the diagnostic logs for more information.

Test Run Aborted.
pi@pibuster32:~/iot/src/devices/Button/tests $

@pgrawehr
Copy link
Contributor

@raffaeler Are you sure you're running the latest master version? Button.tests.dll has <TargetFramework>$(DefaultTestTfms)</TargetFramework>, and $(DefaultTestTfms) is defined as <DefaultTestTfms>net6.0</DefaultTestTfms> in src\devices\Directory.build.props, So I see no reason why it should be compiling against .NET core 3.1.

There was however an issue with an incorrect runtime selection by the test runner, caused by a bug in the build infrastructure. I'm not sure that is relevant here, though.

@raffaeler
Copy link
Contributor

There was however an issue with an incorrect runtime selection by the test runner, caused by a bug in the build infrastructure. I'm not sure that is relevant here, though.

I saw you removed 3.1 from the very latest, but I assumed it should work with the version specified by global.json and this is not the case.

Now I did git pull and get the same with error, but with .NET 6:

Testhost process for source(s) '/home/pi/iot/src/devices/Button/tests/bin/Debug/net6.0/Button.tests.dll' exited with error: You must install or update .NET to run this application.
App: /home/pi/iot/src/devices/Button/tests/bin/Debug/net6.0/testhost.dll
Architecture: arm
Framework: 'Microsoft.NETCore.App', version '6.0.0' (arm)
.NET location: /home/pi/dotnet/
The following frameworks were found:
  7.0.9 at [/home/pi/dotnet/shared/Microsoft.NETCore.App]
Learn about framework resolution:
https://aka.ms/dotnet/app-launch-failed
To install missing framework, download:
https://aka.ms/dotnet-core-applaunch?framework=Microsoft.NETCore.App&framework_version=6.0.0&arch=arm&rid=raspbian.10-arm
. Please check the diagnostic logs for more information.

Test Run Aborted.

@raffaeler
Copy link
Contributor

Now I modified the global.json to run on .NET 6 and pointed my global installation to .NET 6 as well.
The dotnet test now is successful. I will try to re-run indefinitely to see if it breaks over time

BTW I don't like this infrastructure, it's weird.

@raffaeler
Copy link
Contributor

I will leave the tests running on the RPi all night long, but they are currently running for more than 4 hours and never failed.
I still can't get what was going wrong.
@joperezr can you give me some hint from the past logs?

@raffaeler
Copy link
Contributor

The tests run for more than 2000 times on the Rpi without any failure.
Now I am stopping them and wait for some more info.

@krwq
Copy link
Member

krwq commented Jul 17, 2023

perhaps let's just disable them on macOS if this is macOS specific? I don't like doing this but I don't see a value of tests running in CI if we can't even get a log...

@joperezr
Copy link
Member Author

Another option is to add more telemetry or logs so that we can understand what is going on and investigate. One way of doing that would be to have the test log some output to a file, and then to have a target that runs after running the tests which checks for this file and if it exists it prints it out. This would guarantee that we would get those logs even if the thread running the tests didn't have the console.

@raffaeler
Copy link
Contributor

@joperezr If you mean logging the test output, isn't this by design in dotnet test?
I expect to see a failed assert or an exception, right?
And this should be in the pipeline logs.

If you mean something else, please explain how do you believe we/I should modify the tests.
TIA

@pgrawehr
Copy link
Contributor

@raffaeler The problem is a bug in the test runner. For some reason (Jose knows more) - when the tests run on multiple CPU cores simultaneously - the output of only one CPU ends in the log, the other parts are lost somewhere. The probem would be mitigated if we found a way around this.

Much of the build scripting is hidden in the arcade scripts though, and I think we don't even have direct access to that, or it would need to be fixed there.

@krwq
Copy link
Member

krwq commented Jul 18, 2023

Most likely this bug: microsoft/vstest#2224 (please upvote)

@joperezr
Copy link
Member Author

[Triage] We talked about this during triage and decided that we don't want to add retries as we want to ideally be able to reproduce the failure for now in order to be able to fix the test, so I'm going to go ahead and close this for now. Hopefully microsoft/vstest#2702 can get merged soon which would help us knowing which test is the culprit.

@joperezr joperezr closed this Aug 16, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2023
@joperezr joperezr deleted the joperezr/RetryButtonTests branch June 13, 2024 19:51
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-device-bindings Device Bindings for audio, sensor, motor, and display hardware that can used with System.Device.Gpio
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants