Test failure JIT/Methodical/xxobj/sizeof/sizeof64_Target_64Bit_and_arm_il_r/sizeof64_Target_64Bit_and_arm.cmd #81109
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details

Failed in run: runtime-coreclr outerloop 20230123.1

Failed tests:
Error message:
Stack trace:
@markples, it is blocking outerloop. PTAL with high priority.
@trylek, PTAL. We are seeing the same error message from different test cases: #11360 and #81118.
A meta-question about the testing here - when I look at the devops job - https://dev.azure.com/dnceng-public/public/_build/results?buildId=144935&view=logs&j=454a72be-7065-5174-fb75-00cb418aebaf&t=46d7521e-696b-5674-5766-9cffa0247852 - the log file for this test isn't listed. The test is there in the devops test results integration, and if I go to https://dataexplorer.azure.com/clusters/engsrvprod/databases/engineeringdata and manually find the logs I can see the failure there. Is there a problem with how test merging reports the overall job status back to devops? Separately, and more directly relevant: do we have definitive instructions on how to repro a crossgen failure like this, especially one with "R2R-CG2 windows arm Checked @ Windows.11.Arm64.Open" that mentions both arm and arm64? I have a bunch of things to try, but I don't know how long it will take until one works and I have a repro to work with.
Moving @AaronRobinsonMSFT's comment to the right issue: #81120 (comment)
@markples - reproducing Crossgen2 failures locally should typically amount to building stuff for arm and setting the environment variable |
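For illustration only, a local repro along those lines might look roughly like the sketch below. The exact build invocations and the environment variable name (elided in the comment above) are assumptions, not taken from this thread:

```cmd
rem Hypothetical repro sketch for a Crossgen2 (R2R-CG2) test failure targeting arm.
rem The build commands and the variable name below are assumptions, not confirmed above.
build.cmd clr+libs -arch arm -c Checked
src\tests\build.cmd arm Checked

rem Assumed variable that switches the generated test .cmd into Crossgen2 mode:
set RunCrossGen2=1

rem Then run the failing test script from the test build output, e.g.:
call path\to\sizeof64_Target_64Bit_and_arm.cmd
```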
@trylek, as @markples mentioned, the "Send tests to Helix" step does not show the log file for this failure:
|
@JulieLeeMSFT What is the connection to #11360? One of the same tests failed, but I don't see any mention of a failure mode like this one in it.
The error message is the same. #11360 (comment)
Thanks @trylek. This is basically the point of my comment, except that I'm not yet confident that I'm reproing the correct way outside of the default testing modes. For this I tried:
None of these reproed it. However, I'm a bit more confident now that my steps would have been sufficient and that this simply isn't going to repro; I'll add a separate response here for that. (If something above just seems wrong, I'd be happy to hear corrections for next time!)
I don't understand. The discussion in #11360 is about a registry key. I don't see the Internal CLR Error / AllocateNewArray / StringBuilder or the AllocException / _resolveToken pattern anywhere in there.
I do not fully understand how these are hooked up, but the Tests tab that you linked operates at the individual test level, whereas things like job failure are a single overall pass/fail. If we look at the Helix log, we see the text that was eventually copied to where you saw it. It might be as simple as text scanning of these logs, since the "running test" / "passed test" / "failed test" patterns are fairly simple, or it might be integration with the test harness. Regardless, we can see the script then goes wrong when it prints
I don't think we're going to get a repro for this. Looking at the outerloop failures, it passed at 5bd322b, failed at the very next commit a272954, and then passed several commits later at 9d44b9b. It also failed in the specific runtime-coreclr r2r pipeline, again just once and at the same commit. However, commit a272954 looks pretty safe.

Timestamps for job starts for the above:

Of course, it's not surprising that two failures in rolling jobs for the same commit happened at roughly the same time, but this feels like something happened at that time (a problem in a repo dependency seems unlikely; maybe a hardware problem). I created a fake PR back at that point and launched the r2r job. If this passes, I think we should just close this and see if it happens again. If it fails, then we'll try harder to repro at precisely that commit.
@JulieLeeMSFT - For the problem regarding "Send to Helix" not reporting the merged JIT/Methodical work item as failed, I think this is likely a bug in the generated merged test wrapper: it probably returns 0 when it should return a nonzero exit code if any of the component tests have failed. Adding @jkoritzinsky to confirm.

For the actual bug in sizeof64_Target_64Bit_and_arm.cmd, I'm trying to repro that locally. At first glance it looks like an OOM or an invalid allocation size. As you can easily imagine, running the component tests in-proc means that in exceptional cases the failure may only be reproducible when running the entire merged test set, not just a single test, as it may be caused by some interaction between the individual tests (e.g. excessive GC allocation or a worker thread not being shut down in an earlier test).
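As a purely hypothetical illustration of the exit-code point (the real merged wrapper is generated managed code, so this .cmd-style sketch only shows the idea of propagating a component failure rather than always returning 0):

```cmd
rem Hypothetical sketch of a batch wrapper: run component tests and propagate
rem any failure as a nonzero exit code so the work item itself shows as failed.
rem The component script names are placeholders.
set FAILED=0
for %%T in (componentA.cmd componentB.cmd) do (
    call %%T
    if errorlevel 1 set FAILED=1
)
rem Always exiting 0 here, regardless of FAILED, is what would hide component
rem failures from the "Send to Helix" work item status.
exit /b %FAILED%
```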
@trylek - There are a few more things worth noting:
Oh, you're right, architecture-conditional tests now need to be marked as external because the merged test wrapper generator doesn't have sufficient logic to deal with the variations; on top of that, it turns out that some of the conditional tests crash the JIT when built for a different architecture (even though the test would have been skipped at runtime). So if the test runs out-of-proc, I'm inclined to concur with your assessment that the problem is somewhere else. In particular, on arm64 it sometimes happens that a particular machine gets into a bad state, so we should check the names of the offending machines, which typically appear at the top of the Helix logs; that has helped us several times in the past to identify faulty machines and ask the dnceng team to take them out of rotation, reimage them, or similar.
Here's a list of a few recent unexplained arm/arm64 failures and machines (4 machine IDs, same OS):

coreclr windows arm Checked @ Windows.11.Arm64.Open
Console log: 'JIT.Regression.JitBlue' from job 0c80c0ec-a8da-4207-a792-72d84339586a workitem 9683ca79-3792-47ce-9a71-d8d72c6c7877 (windows.11.arm64.open) executed on machine a000XJH running Windows-10-10.0.22621-SP0

R2R-CG2 windows arm64 Checked @ Windows.11.Arm64.Open
Console log: 'PayloadGroup0' from job 9fd227ea-dc82-44d9-a46c-dfc2c402d758 workitem 8c216dcd-e1a1-480b-bcc8-6d394a92f9b3 (windows.11.arm64.open) executed on machine a000UAW running Windows-10-10.0.22621-SP0

R2R-CG2 windows arm Checked @ Windows.11.Arm64.Open

runtime-coreclr r2r
Console log: 'Methodical_r2' from job 14fe2e89-fac7-4f18-ac98-fe2998020e90 workitem e6185482-4265-4f91-a175-601d4d641aa7 (windows.11.arm64.open) executed on machine a000XZF running Windows-10-10.0.22621-SP0

runtime-coreclr outerloop
Console log: 'Methodical_r2' from job 120f293f-6f83-4e15-b8ce-53a1e9c364e4 workitem 84ad56c-22de-4bfd-80c2-24a78e759090 (windows.11.arm64.open) executed on machine a000YEX running Windows-10-10.0.22621-SP0
Hmm, that doesn't look machine-specific.
Oops, sorry - same OS (I misread the distinction between the workitem OS and the "executed on" OS when I saw two); correcting above.
Looks like another case: Interop\PInvoke\Delegate\DelegateTest\DelegateTest.cmd

R2R-CG2 windows arm64 Checked no_tiered_compilation @ Windows.11.Arm64.Open
Another: Interop\UnmanagedCallConv\UnmanagedCallConvTest\UnmanagedCallConvTest.cmd

R2R-CG2 windows arm64 Checked no_tiered_compilation @ Windows.11.Arm64.Open
Ping @jkoritzinsky.
We specifically have the merged wrapper succeed even if a test fails because we don't want the "Work Item Failure" entry for a test failure; we only want it when the job crashed or didn't run to completion. This is the expected behavior for Helix.
I think this is a change in behavior, isn't it? When I have the test monitor role, I look at the run list for a pipeline, and if recent instances have failed, I open them and look at the stages/jobs. Should I now be looking at the "tests" tab for specific tests and at the job list for things like job crashes?
The "tests" tab will have both crashes and test failures. Individual test failures will have the name of the test, and crashes that take down the whole work item will have a "WorkItemName Work Item Failure" item in addition to any recorded test failures. I've been using this workflow for years for both the runtime and libraries test trees. |
@jkoritzinsky - I think Julie's concern that I responded to stemmed from the fact that for non-merged tests you see this bit along the lines of "Helix work item failure blah blah blah and here is the log" in the "Send to Helix" phase of the runtime test run jobs; apparently we're not seeing the equivalent thing for the merged tests. If it's not about the exit code, what is causing the difference?
Another case: baseservices\invalid_operations\InvalidOperations\InvalidOperations.cmd

R2R-CG2 windows arm Checked @ Windows.11.Arm64.Open
Several new occurrences, e.g.: CoreMangLib\system\enum\EnumIConvertibleToUint16\EnumIConvertibleToUint16.cmd

R2R-CG2 windows arm64 Checked jitstressregs0x10 @ Windows.11.Arm64.Open
@trylek another 6 occurrences as listed in #83407. One of them:

runtime-coreclr outerloop 20230312.3

Failed test:
@jkotas commented on this yesterday in the issue thread. Apparently the problem is understood and caused by a race condition that has since been fixed, but we need to roll forward to SDK Preview 2 as the LKG version used for executing Crossgen2 to get this fixed completely. As the cause of the crash is now understood, I'll put up a PR fixing the primary cause of the exception - the missing reference to
@JulieLeeMSFT - my change made Crossgen2 throw much fewer exceptions, so we no longer see the occasional exception handling failures on arm64. According to my understanding of @jkotas' explanation, the underlying problem is a race condition in exception handling that should be fixed by rolling forward to SDK preview 2 as the LKG version used by the runtime repo. In my runtime repo clone from earlier today, dotnet --version still yields 8.0.100-preview.1.23115.2, so I suspect more work may be needed to fully fix this.
Here's a direct link to dotnet/runtime main global.json to easily check the sdk version: |
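A minimal sketch of how one might compare the pinned SDK against the SDK that resolves locally (assuming the commands are run from the root of the runtime repo clone):

```cmd
rem Minimal sketch: show the SDK version pinned in global.json and the SDK
rem version the local 'dotnet' actually resolves to for this clone.
findstr /i "version" global.json
dotnet --version
```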
@trylek - we appear to be on preview.3 now. In your comment "much fewer exceptions", did you mean that we're down to the expected number in the tests, or that there are still ones in the same flavor as the ones that you fixed? (In other words, did "more work" only mean waiting for preview.3 to appear so that we can close this now, or is there something else?) Thanks!
@markples - I believe this can be closed now. My original mitigation just reduced the number of exceptions internally thrown and caught during Crossgen2 compilation, and so reduced the repro rate of this non-deterministic race condition. As we're now at Preview 3, which contains the proper fix for the race condition according to JanK's explanation, we should be good now.