Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect network connectivity issue in Android devices #8263

Merged
merged 12 commits into from
Jan 13, 2022

Conversation

premun
Copy link
Member

@premun premun commented Dec 15, 2021

Context

We often see Android devices losing their WiFi connection and going offline. In this case, we are still able to install apps via the USB cable and run most of the tests.

Some of the test apps require internet though (such as System.Net.*.Tests) and no connectivity usually means the tests fail. From infrastructure point of view, we have no idea that there was this problem though.

This change

.. performs a network check from the device in case of any Android device failure and retries the work item on a different machine in case the WiFi connection is down.

We also send telemetry event so that we learn about the device without connectivity.

More details in #8232

@premun
Copy link
Member Author

premun commented Dec 15, 2021

@MattGal @akoeplinger question - do you think it would be better for the work item to request this check in advance? So that we would only do it for selected apps?

I am a bit afraid of the day when we the WiFi router in the Android room goes down again and we start retrying like crazy..

@MattGal
Copy link
Member

MattGal commented Dec 15, 2021

@MattGal @akoeplinger question - do you think it would be better for the work item to request this check in advance? So that we would only do it for selected apps?

I am a bit afraid of the day when we the WiFi router in the Android room goes down again and we start retrying like crazy..

This is a good point, also if we hit a certain scale of work items running on the same subnet long enough it may trigger throttling from Google's DNS server.

It would really be best if this could be a flag on xharness (i.e. '--network-precheck ' causing this), say caused by a property in the projects that need it... but that would probably require a bunch more work. I defer to @akoeplinger on that as the SME for how the runtime repo puts together the apps but I'll look around at how it's done today to get an idea myself too.

Copy link
Member

@MattGal MattGal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen any systemic problems with the network on these devices (usually a one off) but I wanted to share my notes looking at your PR. Also, I'm unqualified to comment on whether this DNS server is a good ping test, that may be worth following up with someone with long-term experience in the network testing space.

Either way, if this router you mention does die, we do want to react to it ASAP and the fallout from this change would certainly cause that to happen.

From looking at the sources, there is some easy means to plumb this through.

When globbing here: https://github.com/dotnet/runtime/blob/7f09104621614e8b63e49ce1387a872d84136e83/src/libraries/sendtohelixhelp.proj#L346-L357, it would be feasible to do a low-tech ("APK name contains System.Net") or fancier means of adding additional metadata that could be understood here:
https://github.com/dotnet/arcade/blob/main/src/Microsoft.DotNet.Helix/Sdk/CreateXHarnessAndroidWorkItems.cs#L17-L24, then if this metadata is present make this behavior conditional.

@premun
Copy link
Member Author

premun commented Dec 16, 2021

When globbing here: dotnet/runtime@7f09104/src/libraries/sendtohelixhelp.proj#L346-L357, it would be feasible to do a low-tech ("APK name contains System.Net") or fancier means of adding additional metadata

@MattGal yes, I had exactly this in mind - flag some XHarness MSBuild work items with "needs device network access".

I haven't seen any systemic problems with the network on these devices (usually a one off) but I wanted to share my notes looking at your PR.

There was a case last month when about 60 of them went down when a router failed. But it happened once, could be a very rare thing, yeah. My idea was that the telemetry we emit from here would cause Grafana to open issues for us and we could get on top of things fast. We need to do this regardless. Would that be enough of a remediation?

Also, I'm unqualified to comment on whether this DNS server is a good ping test, that may be worth following up with someone with long-term experience in the network testing space.

I will ask the networking team here in Prague what they think about this. Or we can then bring it up with DDFUN.

@premun
Copy link
Member Author

premun commented Jan 5, 2022

@MattGal I spoke to the networking team here and they said they ping www.microsoft.com in some of their tests so I switched to that from 8.8.8.8.

We will discuss next Wed how to approach the "when to verify connectivity" problem, such as the client specifying it's required for that given apk.

In the meantime, I need to start using some of this code in a separate change so I commented out the check to not run and would like to check this in. Can you have one more look please?

@premun
Copy link
Member Author

premun commented Jan 12, 2022

@MattGal I have fed the python script different kinds of fake data to mimic XHarness failures and got following logs (it actually hit a machine with leftover apps):

Analyzing android/test@arm64-v8a (78)
    Encountered PACKAGE_INSTALLATION_FAILURE. This is typically not a failure of the work item. We will try it again on another Helix agent
    If this occurs repeatedly, please check for architectural mismatch, e.g. requesting installation on arm64_v8a-only queue for x86 or x86_64 APKs.
    Removing installed apps after unsuccessful run
        Removing net.dot.JIT_Intrinsics
Success
        Removing net.dot.JIT_Regression
Success
        Removing net.dot.JIT_opt
Success
        Removing net.dot.baseservices_threading
Success
        Removing net.dot.JIT_Math
Success
        Removing net.dot.JIT_SIMD
Success
        Removing net.dot.JIT_Methodical
Success
        Removing net.dot.JIT_IL_Conformance
Success
        Removing net.dot.baseservices_mono
Success
        Removing net.dot.Loader_classloader
Success
    Encountered non-zero exit code. Checking network connectivity...
PING e13678.dscb.akamaiedge.net (104.71.213.161) 56(84) bytes of data.
64 bytes from a104-71-213-161.deploy.static.akamaitechnologies.com (104.71.213.161): icmp_seq=1 ttl=52 time=9.37 ms
64 bytes from a104-71-213-161.deploy.static.akamaitechnologies.com (104.71.213.161): icmp_seq=2 ttl=52 time=10.0 ms
64 bytes from a104-71-213-161.deploy.static.akamaitechnologies.com (104.71.213.161): icmp_seq=3 ttl=52 time=10.2 ms

--- e13678.dscb.akamaiedge.net ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 401ms
rtt min/avg/max/mdev = 9.377/9.869/10.211/0.374 ms
Requesting work item retry because an infrastructure issue was detected on this machine

I leave the code commented out as we will decide on a meeting today how we want to trigger it. Regardless, I created a TODO and will also depend on this change in an unrelated issue so I would like to merge it as-is.

@premun premun merged commit 7a7ba50 into dotnet:main Jan 13, 2022
@premun premun deleted the prvysoky/android-connectivity branch January 13, 2022 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants