System.Net.Sockets.Tests.SendPacketsAsync.SendPacketsElement_FileZeroCount_Success sometimes fails #60017
Comments
Tagging subscribers to this area: @dotnet/ncl
Issue Details: Runfo Creating Tracking Issue (data being generated)
@danmoseley This appears to be flaky. Could we get retry enabled for this test?
@karelz? (BTW, at what granularity do we enable retries? I had assumed it was e.g. "all networking tests".)
@antonfirsov just noting they're not all Windows 11.
Ah yes, I missed that there are a bunch of Win7 failures too. Interestingly, Win7 only started to fail after 10-06. @MattGal, are you aware of any OS updates or any other changes to the Win7 queues around that date?
Yes, we did change the image that day. Also, you can tag @dotnet/dnceng for greater visibility if needed, such as when I am out of the office. On that day we moved from using Server 2008 R2 images from the Azure Gallery to "homemade" Win7-SP1-Enterprise "N" SKU images, because the 2008 ones had been removed from the Azure Gallery, which blocks us from regenerating them. We will likely need to change one more time if and when we find the "long term, private support" version of Windows 7 images, but this change was unfortunately not one made by choice.
@antonfirsov should retries be helping here -- or perhaps it's failing on the retry?
I will say I'd rather not change things without an underlying theory of why the change makes sense, but there is a slightly different Windows 7 client (non-N) SKU now available in the Azure Gallery (initial attempts to use it failed, but the same failure occurred with the image we're now using, so it may be possible to try). Ideally I'd want to have confirmed using experimental VMs that this solves the problem first; that is, it needs to be expressible as "this OS needs to change because..."
I would not cry if we disabled this on Win7; I don't see much risk or value there. I think we should debug on Win11 and understand what is going on.
I believe there is some weird threading issue that only repros on the following queues:
If #63702 works out, we will have proof for this theory, and also a method to get clean runs on servicing branches, so I recommend merging that PR ASAP if the SendPackets tests succeed after a couple of CI re-runs. @dotnet/dnceng is there anything common to these Win7/Win11 queues? (Some hardware characteristics, like the number or type of CPUs, maybe?)
They're about as far apart code-wise as two Windowses can be. They do, however, all run on the same 2-core AMD EPYC Da_v4-series Azure VMs. You can use the repro machines DTL to get one yourself, since you are unlikely to have an EPYC sitting on your desk; some (rare) native-code cases, especially where SIMD or threading optimizations might occur, have shown slight variance between AMD and Intel.
This is what I suspected; it would prove the threading issue idea. Are there other queues that run on AMD EPYC Da_v4? (And is there some easy way to list them?)
Oops, I was unclear. EVERY Helix test queue uses D2a_v4 currently, with a few rare exceptions like for Android emulators (which can't run there) or ones that use D4as. So, while there could be an AMD EPYC threading difference, it would need to also not reproduce for Windows 8.1 (really Server 2k12 R2) nor Windows 10 on identical hardware.
Wild brainstorming -- perhaps there's a test that only runs on Win 10 (doesn't apply to Win7 and disabled for Win 11 for whatever reason) that when it runs somehow prevents the test failing?
I can reproduce this easily on my 2-core Windows 11 VM when running the whole set in a loop.
Perhaps the timeouts are too aggressive...?
@wfurt I wouldn't try to address this with timeouts; if I remember my previous investigation correctly, it did not help. I will come back to it soon and investigate further. Since we need a quick solution now: if you still have that repro config at hand, can you do me a favor and check whether the changes in #63702 help? If yes, can I get an approval on that PR, so we get clean CI?
Runfo Tracking Issue: System.Net.Sockets.Tests.SendPacketsAsync.SendPacketsElement_FileZeroCount_Success sometimes fails
Build Result Summary