ARM64 and cross compile support #8236
Awesome, thank you! I'll defer to @jkotas for final word, but my take:
I am still in the process of setting up my new development PC, so I might remember some things wrong. IIRC the RyuJIT changes were primarily the implementation of the ARM64 unwinding information that the GC needs for the stack walk; it was simply missing in the version that was used at that time. I needed to build a new ObjectWriter, as the default one was not able to write ARM64 ELF code at all. Besides this I compiled a version that could write all the different binary formats for all the CPUs I need. I have some platform-independent changes for the CoreRT runtime itself, too - mostly the implementation of missing ARM64 assembler functions and getting the stack walk working. Unfortunately most of the other runtime changes are platform specific and therefore I cannot make them public. We might be able to talk about the Xbox One parts if Microsoft is interested in giving other Xbox developers access to them.

The cross-compiler support was done very simply: just using different RyuJIT DLLs depending on the target settings, and a separate project file per platform that contains the references to the platform-specific assemblies. Then I used a separate solution for every platform to link everything together into an executable.

Overall, in the end I didn't write that much code. It was mainly the process of finding the reasons why things were crashing, or even worse just not working at all. I expect I will still need some days to reintegrate everything into the current versions of CoreRT and the runtime and clean it up.
As @MichalStrehovsky said, any changes for the ARM64 port would be highly appreciated!
This can wait for later if there is more interest. We would need a setup like what Mono has to deal with the licensing: https://www.mono-project.com/docs/about-mono/supported-platforms/xbox-one/ or https://www.mono-project.com/docs/about-mono/supported-platforms/playstation4/
FYI: I started to commit the ARM64 parts to https://github.com/RalfKornmannEnvision/corert/tree/ARM64 |
Seems like I cannot get it working anymore. I added the missing generation of CFI unwind data to RyuJIT and that worked well. But many methods fail to compile with an error during register allocation. This might be caused by using a RyuJIT that is compiled for x64 but should generate ARM64 code. I understand that this is not a supported scenario, but it worked fine the last time. Or maybe it is because I am using the newest RyuJIT version from GitHub while CoreRT needs an older one. Last time I needed to go back because I got an error message telling me that the newest one has an incompatible interface; this time at least the interface seems not to have changed.
It doesn't look like there has been a breaking change at the tip of the runtime repo recently. I submitted #8242 to switch to the latest RyuJIT - let's see what the CI says to rule out a more general problem.
Using a compiled regular clrjit from the same runtime version works fine. I will try to build a Windows ARM64 version instead of the Unix ARM64 one I have been using so far. In the end I need the Unix version, but at least I can check whether it shows the same errors.
The CI on the pull request to update RyuJIT is pretty red so this looks like a more general problem. I'll build a debug version of RyuJIT and see if I can spot the problem. Try rewinding the runtime repo to commit 98b6284e020845c04bc7b1cefdcd01ffe7a4a8c0 (this is the RyuJIT that is being used in CoreRT right now). |
Thanks for the hint, I will give it a try.
The problem in the CI is likely a different problem. The JIT interface GUID didn't change, but runtime still changed how hardware intrinsics are reported and the tests that throw exceptions end up getting an infinite recursion because an intrinsic wasn't expanded by RyuJIT. It's likely unrelated to the register allocation problem you're seeing. |
Well, going back to the older version had at least some effect. There are now far fewer methods that fail to compile, and it's a different assert. To be more precise, it's now only one assert, while there were multiple ones before. The issue seems to be related to System.Numerics.Vector`1, as all methods that fail to compile have at least one argument of this type (the generic arguments vary). Examples:

Assertion failed '!regState->rsIsFloat' in 'System.SpanHelpers:LocateFirstFoundByte(System.Numerics.Vector`1[System.Byte]):int' during 'Linear scan register alloc' (IL size 73)
Assertion failed '!regState->rsIsFloat' in 'System.Numerics.Vector:BitwiseOr(System.Numerics.Vector

I will try to have a closer look, but to be honest I am not sure if I can find and fix this issue. Adding the missing CFI information is one thing, but this seems to be much deeper inside the whole JIT process.
Where is that assertion? Could it be this one? I see this code path is in an if/else that appears to be dealing with HFAs. It's quite likely we're not computing the HFA shape of Vector`1 here:

corert/src/ILCompiler.Compiler/src/Compiler/VectorOfTFieldLayoutAlgorithm.cs Lines 65 to 68 in 9d8bd29
It should probably be something along the lines of:

corert/src/ILCompiler.Compiler/src/Compiler/VectorFieldLayoutAlgorithm.cs Lines 92 to 102 in 9d8bd29
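For reference, a rough sketch of the kind of override being suggested - the enum values and target properties below are recalled from the CoreRT type system of that era and should be treated as assumptions, not as the actual fix:

```csharp
// Hypothetical sketch only: report Vector<T> as a vector aggregate (HVA) so the
// ARM64 ABI classification matches what RyuJIT's register allocator expects.
// The real reference is VectorFieldLayoutAlgorithm.cs (lines 92 to 102 above).
public override ValueTypeShapeCharacteristics ComputeValueTypeShapeCharacteristics(DefType type)
{
    // Vector<T> is 16 or 32 bytes depending on the target SIMD width;
    // anything else is not treated as a homogeneous aggregate at all.
    return type.Context.Target.MaximumSimdVectorLength switch
    {
        SimdVectorLength.Vector128Bit => ValueTypeShapeCharacteristics.Vector128Aggregate,
        SimdVectorLength.Vector256Bit => ValueTypeShapeCharacteristics.Vector256Aggregate,
        _ => ValueTypeShapeCharacteristics.None,
    };
}
```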
Yes, it's this one. It's on line 178 in my version, but it's the same code. Thanks for your help - I would probably never have found the connection between this assert and that part of the IL compiler. I will have a look and see if I can figure out what's wrong and whether I can fix it.
I think you can just copy-paste the implementation from VectorFieldLayoutAlgorithm.cs - it should do the right thing. Fingers crossed it helps, but it really feels like it should help taking the |
Thank you very much. Now it runs without any asserts to the point where it needs to create the object writer. That was the next thing on my list where I need to reintegrate my old changes.
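As an aside, a minimal program that touches the same Vector`1 code paths as the earlier asserts (span searches via SpanHelpers and Vector.BitwiseOr) could serve as a quick runtime smoke test once things link; this is purely illustrative and was not part of the original report:

```csharp
using System;
using System.Numerics;

class VectorSmokeTest
{
    static void Main()
    {
        // IndexOf over a byte span goes through SpanHelpers, which uses Vector<byte>
        // internally (e.g. LocateFirstFoundByte) when vectors are hardware accelerated.
        byte[] data = new byte[1024];
        data[777] = 42;
        int index = new ReadOnlySpan<byte>(data).IndexOf((byte)42);

        // Direct Vector<T> arithmetic, similar to the BitwiseOr assert above.
        var a = new Vector<byte>((byte)0x0F);
        var b = new Vector<byte>((byte)0xF0);
        Vector<byte> or = Vector.BitwiseOr(a, b);

        Console.WriteLine($"IndexOf: {index}, accelerated: {Vector.IsHardwareAccelerated}, or[0]: 0x{or[0]:X2}");
    }
}
```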
Reintegrating my changes into the object writer got me an ARM64 ELF file that seems to be OK. But my changes include an ugly hack, and as I want to commit these changes back, it should be fixed properly.
For this type of relocation the object writer sets VK_PLT as the kind when the object file format is ELF. This later triggers an assert when LLVM is doing some offset calculations, as it expects the kind of the symbol to be VK_None.
In the case of the exception handling the symbol is in the .data section and clearly not a code symbol. So PLT should not be set in this case. I assume it had not caused an issue so far because this assert is only triggered when generating an ARM64 object file. |
I wonder if we could just use IMAGE_REL_BASED_RELPTR32. It will probably then need an adjustment here:

corert/src/Native/Runtime/unix/UnixNativeCodeManager.cpp Lines 387 to 393 in 9d8bd29
That reloc already includes the 4 byte delta so I think we need to do the |
Yes, using IMAGE_REL_BASED_RELPTR32 for the exception handling related relocation will fix the assert in the object writer. I will take a look at the runtime part when I reach the point where it finally compiles and links for the Switch again. I think I am still some days away from this, as there were quite a few changes in the months I was assigned to other things. But as it was working once, I am still pretty confident I can get it working again.
Looks like I ran into another (little?) issue. I switched over to using the Linux ARM64 runtime assemblies. While the Windows ones use dynamic linking for the needed system API calls, for Unix static linking is used. That's fine, and the IL compiler clearly creates the interop wrappers for functions like SystemNative_Dup. While emitting the code to the object writer it also creates a symbol reference to SystemNative_Dup, but it never creates a symbol definition. The issue seems to be a behavior of the ELF file format for AArch64: as the symbol is not marked as global (function?), it's not in the generated library. It works fine for runtime functions like RhNewObject, as they are defined as global symbols. I spent some time debugging the code, and the only way to ensure that a symbol is global is this:
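The snippet referred to here was elided; reconstructed from the description that follows, the relevant ObjectWriter logic is roughly the following (the names and the exact call shape are assumptions, not the verbatim CoreRT code):

```csharp
// Only symbols that have an alternate name (from the NodeAliases dictionary) get an
// additional, globally visible definition; everything else stays a local symbol.
string alternateName = _nodeFactory.GetSymbolAlternateName(node);
if (alternateName != null)
{
    // Emits a second symbol definition at the same offset, marked as global,
    // which is what makes e.g. RhNewObject visible in the static library.
    EmitSymbolDef(alternateName, global: true);
}
```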
GetSymbolAlternateName just checks the NodeAliases dictionary. It's filled for all the external native functions in the core lib, but not for the ones in additional assemblies like System.Console. An easy solution might be to ensure that the P/Invokes from all referenced assemblies are added to the alias list, too. But I am not sure whether that would be too simplistic and would cause functions to be linked into the final executable even when they are not needed. Maybe the correct solution would be to use the InteropStubManager, as AddDependenciesDueToPInvoke seems to be called for every P/Invoke method that is used. I assume it works on other targets because creating the symbol reference is enough there.
Hm, for APIs like |
No, it's not a dumb question. It's more likely that I failed to describe the issue properly. I have the code for all the SystemNative_* functions compiled into another static library. When the object writer generates the static library file it includes a symbol table. The ARM64 ELF format requires that everything that is not part of the library itself, or that should be used by another library, is marked as global - and if it is a function, it needs a flag for this, too. If I dump the generated static library, the symbol table looks like this:
I needed to add some code to ensure that the function marker is set, as was already done for ARM. But the symbol table contains only functions that are reported as global via EmitSymbolDef. Therefore it has all the Rhp* functions but none of the SystemNative_* ones, as the IL compiler only makes EmitSymbolRef calls for them. Unfortunately the LLVM part of the object writer is not smart enough to detect that there is no code behind this symbol reference and convert it to a global symbol for linking. The only way around this would be to ensure that all external functions are not only reported via EmitSymbolRef but also as global symbols via EmitSymbolDef. This already happens for all the external functions in the CoreLib. It might be an unplanned side effect, as it happens during the rooting of all the methods that should be exported, even if these methods are later imported. While a local untyped symbol is generated for all the SystemNative_* functions during EmitSymbolRef, it seems to be dropped as there is nothing associated with it. Based on the description of the ELF format, such symbols should only be in the table when they are global.
Ah, I see - I didn't realize that. Would something like this work?
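The suggested snippet was elided; based on the follow-up comments it was along these lines (a sketch with assumed names, and per the later discussion it turned out not to be the right long-term fix):

```csharp
// Sketch of the suggestion: when a node is an extern symbol (e.g. SystemNative_Dup),
// also emit a symbol definition so the ELF writer records it instead of dropping the
// local, untyped reference that EmitSymbolRef alone produces.
if (node is ExternSymbolNode externSymbol)
{
    string name = externSymbol.GetMangledName(_nodeFactory.NameMangler);
    EmitSymbolDef(name, global: true);
}
```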
That might do. I wonder a bit whether instead of |
RhNewObject is, as far as I know, written in assembly. It's in the symbol table because there is code in the generated library that calls it. Thanks again - "target is ExternSymbol" was the hint that I needed. I think I will add a new EmitSymbolExtern, as this gives me the chance to mark the symbol as external, too. Not sure if this has an impact right now, but it might reduce future problems. A quick test showed that if I just call EmitSymbolDef when I detect an ExternSymbolNode, the SystemNative_* functions show up in the symbol table.
It's in C# - we reference it through the RuntimeExport name though because of how the runtime library was structured in .NET Native. That's why it shows up as an alternative name.

corert/src/Runtime.Base/src/System/Runtime/RuntimeExports.cs Lines 28 to 61 in 9d8bd29
Wouldn't that place a definition of the symbol at the spot where we're trying to generate a reference to it? I guess the linker could theoretically stomp over it if it sees the real definition elsewhere.

Glad you can make progress!
You are right, I confused it with RhpNewArray. It seems the real issue is somewhere else - there is some code missing in LLVM. Sorry for wasting your time.
For future reference, in case someone finds this later: ignore what I have written above about the symbol table entries for the native functions - I got myself totally confused. While a native function that is called from the generated library must show up in the symbol table, it should not appear as a global symbol. It must appear as Undefined, Unresolved, or whatever your object dumper calls it; in my case they are now shown as UND.

Another issue that I hadn't fixed in my old proof of concept (only hacked around) was that the ARM64 relocations do not always work with just an MCSymbolRefExpr as the target. If there is more than one subtype for a relocation type, the symbol reference needs to be wrapped in an additional AArch64MCExpr. Anyway, that might be obvious to someone who has more experience with LLVM than me.
A simple Hello World is finally working on the Switch again, including command line parameters and simple Environment properties like TickCount, ProcessorCount and SystemPageSize written with a format string via Console.WriteLine. Breakpoints and single-step debugging are working too, so I expect the DWARF line information is correct.
Could it be that the stack unwinding parts for ARM64/Unix that are already there were somewhat hacked into the system? I am asking because the code looks suspicious. The main connection between the CoreRT runtime and libunwind is done with the Registers_arm64_rt class. The strange thing here is that it contains a pointer to REGDISPLAY, which is the representation of the register set from the runtime, but also uses libunwind::Registers_arm64 as its base class. Therefore this class now contains two register sets, but only one is used. For AMD64 the connection is implemented by using a shim over REGDISPLAY. To me it looks like this is the way to go for ARM64, too - or at least to get rid of libunwind::Registers_arm64, as it adds members that are never used. While I was working on the POC some time ago I didn't care that much, but now that the code should be committed I'd like to fix this, too. Fixes need to be done anyway, as it does not work in its current form. Which is not surprising, as I assume I am the first one who has even reached the point where this code is executed at all.
Are you by any chance referring to things added in this pull request: #7504? Maybe the conversation in that pull request would be helpful. Otherwise I'll have to defer to Jan since I'm not up to speed on those parts. At some point I'll need to spend some time with the unwinder to learn how it works, but now is not a good time.
Yes, these are the changes. Based on the conversation it seems that the primary idea behind this commit was to get the runtime compiling on ARM64/Unix. I am pretty sure it was never run, as the code that generates the information the unwinder needs was not in RyuJIT. I only added it again last week, from my old POC code, to my current local version of RyuJIT.
Wow, great progress! The assert in RyuJIT is dotnet/runtime#38541 - it also reproed on x64 CoreCLR. The fix for that was to delete the assert, so it's an easy fix to port to your branch :). |
It was just removing the ++ completely, besides porting all the related assembler code to the other syntax, to get the exception handling working.
Unfortunately I ran into another issue when a further stack walk is required while the code is in an exception handling funclet. A rethrow or a garbage collection can cause this.

corert/src/Native/Runtime/StackFrameIterator.cpp Lines 823 to 830 in f542d97
m_RegDisplay.SP does not point to the first saved data but to the frame pointer instead. Therefore I am not sure whether the stack pointer needs to be corrected there before starting to copy the saved register values, or whether m_RegDisplay.SP should already point to the data instead of the frame pointer. For now I changed it to forward the pointer over the frame pointer and return address, and my test code now works on the Switch. But I might have broken it somewhere else where I cannot test.
Again this one points to the FramePointer and not the data. |
Where did this bad SP come from?
I spent some time stepping through the assembler code (generated by RyuJIT, and ported by me from the asm files). In the end it boils down to this section in the ALLOC_THROW_FRAME macro:
This stores the current SP in the PAL_LIMITED_CONTEXT that is later used to do the unwind. The macro itself is used by RhpThrowHwEx, RhpThrowEx and RhpRethrow, always as the first operation. I checked it for both RhpThrowEx and RhpRethrow; in both cases SP was pointing to the FP/LR pair from the caller and not to the actual stack data. With this in mind I took a look at the code that RyuJIT has produced:
After this, SP is not touched anymore until the epilog. I checked some other functions and it is always the same pattern.
Based on this I think that the SP is not bad at this point. It's just that RyuJIT builds code with the assumption that SP points to the FP/LR pair. Maybe this is different depending on the target system you build for. To test this I would need to get a cross compile from Windows AMD64 to Windows ARM64 working, but I have no Windows ARM64 system to actually test it on.
What would be the actual stack data?
It may often be the case, but there are certainly situations in RyuJIT generated code where SP does not point to the FP/LR pair.
On one side, the local variables etc. of the method that called RhpThrowEx or RhpRethrow. On the other side, the ALLOC_THROW_FRAME macro saves a PAL_LIMITED_CONTEXT. I double checked with the asm version of the code that was already there and could not find a difference in this regard. But as the ARM64 and ARM code in the unwind function looked very similar, I did some digging and I think I finally found the issue. While the ARM code stores the d registers first and then SP/PC of the faulting site
the ARM64 code does it the other way around
But in the unwind code both parts start with the D registers. My "fix" for ARM64 was therefore just ignoring the SP/PC that were stored first. The ARM unwind code seems to ignore them too, but after the D registers were copied.
I think I confused myself again here. While it was FP/LR when the throw was called, later during the unwind it points to SP/IP of the faulting site (see above).
I looked at the assembler code for throwing the exception for all other CPU architectures. In every case besides ARM64, the information about the faulting site is stored after the registers are saved. I don't know why this was changed for ARM64, but for all architectures the UnwindFuncletInvokeThunk expects that the registers are stored first.
Throw is done by I would expect the fix to be choice between fixing Maybe the problem is that a wrong |
Now I am feeling stupid that I missed the obvious pairing there. At least it got me one step ahead: I could confirm that the correct unwind method is called. The issue is related to the unwinding of the funclet based on the stored DWARF information, and as I generate that information it's most likely my fault. The register values that libunwind delivers are the values after the prolog of the funclet, but the UnwindFuncletInvokeThunk expects the SP value before the prolog has been executed. As in my case the funclet just pushes an additional stack frame, I am 16 bytes (2 registers) off. I assume that if the catch block has local variables the difference would be larger, as the prolog of the funclet needs to make more room. I need to recheck the whole generation of the unwind data, as something seems wrong there.

Edit: Now I am feeling extremely stupid. It was the generation of the unwind data. The special case when the CFA is not assigned to a register but still changes in the prolog was not handled correctly. Sorry for wasting your time.
My test project (13.5 MB of C# sources / 7.8 MB managed assembly) is running again. It runs stably and does many GC collections across all generations. One run is comparable to nearly 3 hours of real gameplay, and it seems I can run as many of these in a loop as I want. Unfortunately I couldn't find the exact results from the last run anymore, but from what I remember it got even faster. In any case the performance is significantly better than what we are able to achieve with Mono AOT.
It seems that I am close to the point where the general parts of the ARM64 support are complete. The two primary areas left are the intrinsics and the assembler code in System.Private.TypeLoader.Native. Are there any test cases that I can use to check the assembler code after it is ported? So far it seems my test programs don't need these features. It's the same for the ResolveVirtualFunction case in the ReadyToRunHelperNode: I have added the code, but it is untested as it was never triggered so far. There might be some more assembler helpers that are linked but not called, but as these are just direct ports from the asm version used for Windows they should work. Anything beyond this point should be platform specific, which I either cannot make public or someone else needs to do, as I don't have an ARM64-based desktop system to fill in the gaps for Linux and Windows. So far I have identified two areas that will need additional code for Windows and Linux:
Programs that don't need these features should hopefully work. |
You can skip System.Private.TypeLoader.Native - that code should only be needed to support universal shared generics and we don't have that in CoreRT (at least for now) - only .NET Native's code generator supports it - RyuJIT doesn't. We're actually scoping universal shared code out of the minimum viable product in the runtimelab repo move. (If you're interested in what universal shared code is - there's a very brief explanation here.) The rough plan for the move to the runtimelab repo is here: dotnet/runtimelab#4 The
Actually - it's possible ResolveVirtualFunction is only reachable from IL, not from C#. The delegate case might go through the delegate creation helper. If you can't hit it, placing a |
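For illustration, the delegate case mentioned above looks like this in C#; whether it actually reaches ResolveVirtualFunction or goes through the delegate-creation helper is exactly the open question, so treat it as a probe rather than a known trigger:

```csharp
using System;

class Base { public virtual string Who() => "base"; }
class Derived : Base { public override string Who() => "derived"; }

class ResolveVirtualFunctionProbe
{
    static void Main()
    {
        Base obj = new Derived();
        // Creating a delegate over a virtual method through a base-typed reference
        // requires resolving the virtual slot at delegate-creation time (ldvirtftn in IL).
        Func<string> probe = obj.Who;
        Console.WriteLine(probe()); // prints "derived"
    }
}
```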
While I was hoping I could do it a little bit faster, I think the general ARM64 code will be finished in the next few days. Anything beyond this should be Nintendo Switch specific and needs to be discussed separately if someone needs/wants it. Hence the question: do you prefer the whole ARM64 code in one pull request, or should it be done in multiple smaller ones?
That's great news! Thank you! I'll leave the number of pull requests up to you - but I think a single pull request should be manageable. If there's something unexpected, we can split that part off. |
Will do. The changes also include modifications to the ObjectWriter and the old LLVM version it uses. I replaced the llvm.patch file with a new one that includes all the old changes plus my additional ones, to keep the build process the same. Besides CoreRT, I unfortunately needed to touch two files in RyuJIT, too. They contain a fix for funclets (the stack frame was broken when running RyuJIT in CoreRT mode) and the generation of ARM64 unwind information for ELF-based systems, as there was only code to do this for Windows. I expect that we need to get these in first, as the CoreRT part will not work without the CFI data.
We don't have CI set up for ARM64 (and since the port is incomplete for Linux/Windows, it would probably not pass anyway) - I think the RyuJIT change can be submitted at the same time as the CoreRT change since the extent of CI validation will be limited to ensuring we don't regress x64. I assume you didn't have to change the interface between the JIT and the AOT compiler. We should get CI support for this "for free" once the move to the runtimelab repo is complete. The runtime build system, CI testing, and publishing already have ARM64 support in the dotnet/runtime repo, which dotnet/runtimelab is based on.
I think your changes take us very close to having full Linux ARM64 support. Finishing this up shouldn't be too much work for someone else. Thank you! |
No, I didn't change the interface, as there was already CFI support for AMD64 on Unix. While doing a final check to see if I have missed something, I noticed Dummies.asm in the ARM64 folder:

corert/src/Native/Runtime/arm64/Dummies.asm Lines 1 to 18 in 1c65791
I might be wrong, but I believe this was copied there by mistake from the ARM folder, as the code is not ARM64 and a 64-bit host should not need any helpers for these functions.
Yes, that file is not used for anything, even on .NET Native (where it originates). It got integrated from a prototype branch by accident I assume. We can just delete it. |
This should hopefully be the last question before I can make the pull request. I noticed that there is a GcProbe.asm but no empty GcProbe.S, as there is for most of the other assembler code I needed to convert. GcProbe.S is missing for all other platforms, too. Can I assume that it is currently not needed for *nix-based host systems and therefore skip it for now?
Yes, it seems this file is only included in the CMakeLists on Windows, and that's why we don't need to add an empty file to make the build happy. Seems like
I opened the pull request for the RyuJIT changes: dotnet/runtime#41023. The one for the CoreRT changes should follow tomorrow.
It might take a little bit longer, as at least one of my changes broke something for AMD64 on Linux/OSX. I suspect it's related to the stack unwinding; maybe it needs a more divergent code path for AMD64 vs ARM64 in the shared part. I excluded the critical commit with the modified stack unwind and will do another pull request when I have a common solution for AMD64 and ARM64. Everything else doesn't break anything for ARM64: #8271. Sorry it's pretty huge.
For reference, in case someone finds this in the future: because of various limitations of the special environment (game consoles) I need to get CoreRT working on, I was not able to implement or check some features. This list might not be complete, but it should be a good starting point for what still needs to be done:
The missing part of the unwind code: #8290. Now it doesn't break AMD64 anymore.
And with this finally in the master branch, I believe I have done everything I could to get CoreRT working on ARM64-based systems in general. Someone with an ARM64-based Linux system needs to fill in the still-missing parts. If someone besides me needs to run CoreRT on a game console, we will need to find a way to handle that case. If I find more issues that I can fix in a general way, I will submit further pull requests.
Thank you! Let's close this - we can open new issues for the leftover work. |
About a year ago I was working on an evaluation of using CoreRT to port the C# part of a PC/Windows-based game to the major game consoles (Xbox One, PlayStation 4 and Switch). The team already had it running with Mono but was not satisfied with the overall performance. Therefore I was tasked with checking whether the performance could be improved, or finding an alternative to Mono.

As getting CoreRT working on the Xbox was a very easy task, I was able to get the core module of our C# code running in a few hours. As it ran more than twice as fast compared to Mono, it was decided to go ahead. PlayStation was a little more work, but it worked after a few days and showed the same performance improvements. Unfortunately, when moving on to the Switch, I noticed that the ARM64 support was incomplete, and using ARM was not an option because of some limitations of the target system. As the team still wanted the performance improvements, I started to implement the missing parts. It took some time, but I was able to get our test case running, and again it was more than twice as fast as the Mono version. It does not use all .NET features, so there are still parts missing. But as our test case (around 5 MB of C# code) was able to run for hours with the precise GC, I assume it covers at least a big part.

Unfortunately priorities changed and this project was put on hold. After things changed again, it was decided to give the switch from Mono to CoreRT a second try. This time I also got permission to submit anything that does not touch the various NDAs. As you might assume, I needed to add code in different places, including the JIT itself. As everything was done on a now nearly one-year-old version, I need to reintegrate all these changes, but I'd like to make sure they can easily go back into the main code branch.

Therefore the question: which of these changes are you interested in, and what is your recommended workflow?