Reliability issue on ARM64 Stage1 #86929
Comments
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
Issue Details
Stage1 Ampere Linux dashboard is showing drops in RPS with NativeAOT on ARM64. This seems to be exacerbated when enabling speed optimizations:
Similarly, socket errors:
Cc @VSadov
We usually wait for more data points before drawing conclusions. In this case it's just a single data point, so it could be an infra issue (that happens from time to time).
I see at least 3 data points - 1 for blended mode and 2 for speedopt. For comparison, we never saw this outside of NativeAOT.
This is correlated with the Bad Responses + Socket error chart below. We're getting low RPS because there's a problem with the response. I'm not saying it's a codegen issue - just that it seems to be exacerbated with speedopts.
I see. I only wanted to note that we usually give CI more time to produce more than one data point before we start investigating. The TE benchmarks are too volatile (plus rare infra failures) to compare only two data points (although the same is true for dotnet/performance microbenchmarks) 🙂 But it seems that at the current velocity we would need to wait a few weeks for that (or run locally to validate).
If I try running this benchmark (via crank), every few times there are socket errors and sometimes the app just crashes (reported as "Connection refused").
In the first chart - the first notch for regular Stage1AOT was when every platform had issues. Since then there has been only a single point when the regular run had issues. I guess it only tells us it is unlikely to be in the native runtime (including GC), since that is unaffected by I'll try running libraries tests with
@VSadov can you please share the exact crank query? I want to validate it's not caused by recent GDV changes.
I have run the libraries tests a few times with I guess we need to run the actual test on the actual machine to get the repro. Or obtain a crashdump from a lab run.
I have sent instructions. If GDV is your worry, it would be interesting to try running that with GDV disabled via some build setting, if that is possible. It could help determine whether this is GDV-specific. I've already tried things like server/workstation GC, disabling concurrent GC, and using conservative stackwalking. None of that made any difference. It is likely something with managed code or with how it is compiled.
If I do not pass
Looking at the history of the benchmarks, I think the standard results are fine. The question is
Haven't seen any failures since at least October. |