-
Notifications
You must be signed in to change notification settings - Fork 361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect network connectivity issue in Android devices #8263
Conversation
@MattGal @akoeplinger question - do you think it would be better for the work item to request this check in advance? So that we would only do it for selected apps? I am a bit afraid of the day when we the WiFi router in the Android room goes down again and we start retrying like crazy.. |
This is a good point, also if we hit a certain scale of work items running on the same subnet long enough it may trigger throttling from Google's DNS server. It would really be best if this could be a flag on xharness (i.e. ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't seen any systemic problems with the network on these devices (usually a one off) but I wanted to share my notes looking at your PR. Also, I'm unqualified to comment on whether this DNS server is a good ping test, that may be worth following up with someone with long-term experience in the network testing space.
Either way, if this router you mention does die, we do want to react to it ASAP and the fallout from this change would certainly cause that to happen.
From looking at the sources, there is some easy means to plumb this through.
When globbing here: https://github.com/dotnet/runtime/blob/7f09104621614e8b63e49ce1387a872d84136e83/src/libraries/sendtohelixhelp.proj#L346-L357, it would be feasible to do a low-tech ("APK name contains System.Net") or fancier means of adding additional metadata that could be understood here:
https://github.com/dotnet/arcade/blob/main/src/Microsoft.DotNet.Helix/Sdk/CreateXHarnessAndroidWorkItems.cs#L17-L24, then if this metadata is present make this behavior conditional.
src/Microsoft.DotNet.Helix/Sdk/tools/xharness-runner/xharness-event-processor.py
Outdated
Show resolved
Hide resolved
src/Microsoft.DotNet.Helix/Sdk/tools/xharness-runner/xharness-event-processor.py
Show resolved
Hide resolved
src/Microsoft.DotNet.Helix/Sdk/tools/xharness-runner/xharness-event-processor.py
Outdated
Show resolved
Hide resolved
@MattGal yes, I had exactly this in mind - flag some XHarness MSBuild work items with "needs device network access".
There was a case last month when about 60 of them went down when a router failed. But it happened once, could be a very rare thing, yeah. My idea was that the telemetry we emit from here would cause Grafana to open issues for us and we could get on top of things fast. We need to do this regardless. Would that be enough of a remediation?
I will ask the networking team here in Prague what they think about this. Or we can then bring it up with DDFUN. |
@MattGal I spoke to the networking team here and they said they ping We will discuss next Wed how to approach the "when to verify connectivity" problem, such as the client specifying it's required for that given apk. In the meantime, I need to start using some of this code in a separate change so I commented out the check to not run and would like to check this in. Can you have one more look please? |
@MattGal I have fed the python script different kinds of fake data to mimic XHarness failures and got following logs (it actually hit a machine with leftover apps):
I leave the code commented out as we will decide on a meeting today how we want to trigger it. Regardless, I created a TODO and will also depend on this change in an unrelated issue so I would like to merge it as-is. |
Context
We often see Android devices losing their WiFi connection and going offline. In this case, we are still able to install apps via the USB cable and run most of the tests.
Some of the test apps require internet though (such as
System.Net.*.Tests
) and no connectivity usually means the tests fail. From infrastructure point of view, we have no idea that there was this problem though.This change
.. performs a network check from the device in case of any Android device failure and retries the work item on a different machine in case the WiFi connection is down.
We also send telemetry event so that we learn about the device without connectivity.
More details in #8232