Skip to content
This repository has been archived by the owner on Nov 1, 2020. It is now read-only.

ARM64 and cross compile support #8236

Closed
RalfKornmannEnvision opened this issue Jul 17, 2020 · 61 comments
Closed

ARM64 and cross compile support #8236

RalfKornmannEnvision opened this issue Jul 17, 2020 · 61 comments

Comments

@RalfKornmannEnvision
Copy link
Contributor

About a year a year ago I was working on an evaluation to use CoreRT to port the C# part of PC/Windows based game to they major game consoles (XBox One; PlayStation 4 and Switch). The Team had it already running with Mono but was not satisfied with the overall performances. Therefore I was tasked with checking if the performances can be improved or finding an alternative to using Mono. 

As getting CoreRT working on the XBox was an very easy task I was able to get the core module of our C# code run in a few hours. As it runs more than twice as fast compared to Mono it was decided to go ahead. PlayStation was a little bit more work but it worked after a few days and showed the same performances improvements. Unfortunately while goingon to the Switch I noticed that the ARM64 support was incomplete and using ARM was not an option because of some limitations on the target system. As the team still wanted the performances improvements I started to implement the missing parts. It took some time but I was able to get our test case running and again it was more than twice as fast as the Mono version. It does not use all .Net features therefore there are still parts missing. But as our test test case (around 5MB of C# code) was able to run for hours with the precise GC I assume it is at least a big part.

Unfortunately priorities changed and this project was put on hold. After things have changed again it was decided to give a switch from Mono to CoreRT a second try. This time I although got the permission to submit anything that does not touch the various NDAs. As you might assume I needed to add code on different places including the jit itself. As anything was done on a now nearly one year old version I need to reintegrate all these changes again. But I'd like to make sure they can easily go back in the main code branch. 

Therefore the question on which of these changes you are interested in and what's your recommended work flow?

@MichalStrehovsky
Copy link
Member

Awesome, thank you!

I'll defer to @jkotas for final word, but my take:

  • We should take RyuJIT changes directly to dotnet/runtime master branch. That's where the rest of CoreRT RyuJIT changes live and I assume it's not an extensive change; just tweaks. We'll either move them later to be exclusive to runtimelab or just leave them there.
  • I would be happy to review and merge the ILCompiler changes here as you have them.
  • Same with whatever runtime changes you have.
  • When it comes to the cross compilation build system changes, we should just wait until CoreRT repo is moved over into runtimelab repo (Move native AOT experiment to dotnet/runtimelab runtimelab#4) - we're going to hook CoreRT build up into the CoreCLR build there and that already handles crosscompilation (it's probably still similar to how CoreRT crosscompiles, but CoreCLR has been evolving and CoreRT has been frozen in time for a while). If the changes you have are already good to merge, we could merge them, but crossbuild will have to be hooked up again in the runtimelab repo once we finish the move (so... probably no need to polish it if it's not polished).

@RalfKornmannEnvision
Copy link
Contributor Author

I am still in the process of setting up my new development PC so I might remember some things wrong.

IIRC the RyuJIT changes were primary the implementation of the ARM64 unwinding information that the GC needs for the stack walk. They were just missing in the version that was used at that time. 

I needed to build a new ObjectWriter as the default one was not able to write ARM64 for ELF code at all. Beside of this I compiled a version that could write all the different binary formats for all the CPUs I need.
The changes in the ILCompiler itself were mostly the parts were these small pieces of assembler code is generated to glue things together.

I have some platform independent changes for the coreRT runtime itself, too. Mostly the implementation of missing ARM64 assembler functions and getting the stackwalk working. Unfortunately most of the other runtime changes are platform specific and therefore I can not make them public. We might be able to talk about the XBox One parts if Microsoft is interested to give other XBox developers access to it.

The cross compiler support was done very simple. Just using different RyuJit dlls depending on the target settings and using an own project file for each platform that contains the links to the platforms specific assemblies. Than I just used an own solution for every platform to link anything together in an executable.

Overall in the end I didn't write that much code. It was mainly the process of finding the reasons why things were crashing or even worse just not working at all. I expect that I think I will still need some days to reintegrate everything to the current versions of CoreRT and the runtime and clean it up.

@jkotas
Copy link
Member

jkotas commented Jul 17, 2020

As @MichalStrehovsky said. Any changes for the ARM64 port would be highly appreciated!

other runtime changes are platform specific and therefore I can not make them public

This can wait for later if there is more interest. We would need to have setup like what Mono has to deal with the licensing: https://www.mono-project.com/docs/about-mono/supported-platforms/xbox-one/ or https://www.mono-project.com/docs/about-mono/supported-platforms/playstation4/

@RalfKornmannEnvision
Copy link
Contributor Author

FYI: I started to commit the ARM64 parts to https://github.com/RalfKornmannEnvision/corert/tree/ARM64

@RalfKornmannEnvision
Copy link
Contributor Author

Seems like I cannot get it working anymore. I added the missing generation of CFI unwind data to RyuJit and that worked well. But many methods fail to compile with an error during register allocation. This might be caused by using a RyuJit that is compiled for X64 but should generate ARM64 code. I understand that this is not a supported scenario but it worked fine the last time. Or maybe because I am using the newest RyuJit version from GitHub while CoreRT needs an older version. Last time I needed to go back but I got an error message that told me that the newset one has an incompatible interface. This time at least the interfaces seems not to have changed.

@MichalStrehovsky
Copy link
Member

It doesn't look like there has been a breaking change in the runtime repo recently. The tip of master should work. Note there was a RyuJIT change recently that broke the contract and didn't update the interface GUID in the past months, but if you use tip of both runtime and CoreRT, it should be fine.

I submitted #8242 to switch to the latest RyuJIT - let's see what the CI says to rule out a more general problem.

@RalfKornmannEnvision
Copy link
Contributor Author

Using a compiled regular clrjit from the same runtime version works fine. I will try to build a Windows ARM64 version instead of the Unix-ARM64 one I was using so far. In the end I need the Unix version but at least I can check if it will show the same errors. 

@MichalStrehovsky
Copy link
Member

The CI on the pull request to update RyuJIT is pretty red so this looks like a more general problem. I'll build a debug version of RyuJIT and see if I can spot the problem. Try rewinding the runtime repo to commit 98b6284e020845c04bc7b1cefdcd01ffe7a4a8c0 (this is the RyuJIT that is being used in CoreRT right now).

@RalfKornmannEnvision
Copy link
Contributor Author

Thank's for the hint I will give it a try

@MichalStrehovsky
Copy link
Member

The problem in the CI is likely a different problem. The JIT interface GUID didn't change, but runtime still changed how hardware intrinsics are reported and the tests that throw exceptions end up getting an infinite recursion because an intrinsic wasn't expanded by RyuJIT. It's likely unrelated to the register allocation problem you're seeing.

@RalfKornmannEnvision
Copy link
Contributor Author

RalfKornmannEnvision commented Jul 22, 2020

Well going back to the older version had at least some effect. There are now much less methods that fail to compile and it's an different assert. To be more precise it's now only one assert while there multiple once before. This issue seems to be somewhat related to System.Numerics.Vector`1 as all methods that are falling to compile have at least one argument of this type. The generic types are different.

Examples: 

Assertion failed '!regState->rsIsFloat' in 'System.SpanHelpers:LocateFirstFoundByte(System.Numerics.Vector`1[System.Byte]):int' during 'Linear scan register alloc' (IL size 73)

Assertion failed '!regState->rsIsFloat' in 'System.Numerics.Vector:BitwiseOr(System.Numerics.Vector1[System.UInt16],System.Numerics.Vector1[System.UInt16]):System.Numerics.Vector`1[System.UInt16]' during 'Linear scan register alloc' (IL size 13)

I will try to have a closer look. But to be honest I am not sure if I can find and fix this issue. Adding the missing CFI information is one thing but this seems to be much deeper inside the whole jit process. 

@MichalStrehovsky
Copy link
Member

Where is that assertion? Could it be this one?

https://github.com/dotnet/runtime/blob/6072e4d3a7a2a1493f514cdf4be75a3d56580e84/src/coreclr/src/jit/regalloc.cpp#L177

I see this code path is in an if/else that appears to be dealing with HFA. It's quite likely we're not computing the HFA shape of Vector<T> here:

public override ValueTypeShapeCharacteristics ComputeValueTypeShapeCharacteristics(DefType type)
{
return _fallbackAlgorithm.ComputeValueTypeShapeCharacteristics(type);
}

It should probably be something along the lines of:

if (type.Context.Target.Architecture == TargetArchitecture.ARM64 &&
type.Instantiation[0].IsPrimitiveNumeric)
{
return type.InstanceFieldSize.AsInt switch
{
8 => ValueTypeShapeCharacteristics.Vector64Aggregate,
16 => ValueTypeShapeCharacteristics.Vector128Aggregate,
_ => ValueTypeShapeCharacteristics.None
};
}
return ValueTypeShapeCharacteristics.None;

@RalfKornmannEnvision
Copy link
Contributor Author

Yes, it's this one. It's in line 178 in my version but it's the same code.

Thank's for your help as I would properly never found the connection between this assert and the code part in the IL Compiler. I will have a look if I can figure out what's wrong and see if I can fix it

@MichalStrehovsky
Copy link
Member

I think you can just copy-paste the implementation from VectorFieldLayoutAlgorithm.cs - it should do the right thing. Fingers crossed it helps, but it really feels like it should help taking the if (argDsc->lvIsHfaRegArg()) path in RyuJIT instead of the asserting one.

@RalfKornmannEnvision
Copy link
Contributor Author

Thank you very much. Now it runs without any asserts to the point were it needs to create the object writer. This was the next thing on my list were I need to reintegrate my old changes.

@RalfKornmannEnvision
Copy link
Contributor Author

Reintegrating my changes into the object writer got me a arm64 elf file that seems to be OK. But my changes include an ugly hack. As I want to commit this changes back it should be fixed properly. 
The issue is related to the IMAGE_REL_BASED_REL32 relocation type and exception handling. For non Windows system it is used to link to the type of exception that should be handled

RelocType rel = (_compilation.NodeFactory.Target.IsWindows) ?
                                RelocType.IMAGE_REL_BASED_ABSOLUTE :
                                RelocType.IMAGE_REL_BASED_REL32;

For this type of relocations the object writer sets VK_PLT as kind when the object file type is ELF. This triggers later an assert when LLVM is doing some offset calculations as it expect that the kind of the symbol is VK_None.
There is an comment in the object writer code that says

// PLT is valid only for code symbols,
// but there shouldn't be references to global data symbols

In the case of the exception handling the symbol is in the .data section and clearly not a code symbol. So PLT should not be set in this case. I assume it had not caused an issue so far because this assert is only triggered when generating an ARM64 object file.
How should we approch the fix for this?
Do we need a new relocation type here to differentiate between data and code symbols?
Or should I better just check if the target symbol is really in a text section before setting VK_PLT?

@MichalStrehovsky
Copy link
Member

I wonder if we could just use IMAGE_REL_BASED_RELPTR32 instead. We use RELPTR32 in most of the data structures, I assume because that one doesn't set VK_PLT. Weird that Linux x64 doesn't hit it. I wonder if it's because nobody runs debug builds of objwriter (I certainly didn't, but I also didn't do much objwriter work).

It will probably then need an adjustment here:

// Read target type
{
// @TODO: CORERT: Compress EHInfo using type table index scheme
// https://github.com/dotnet/corert/issues/972
Int32 typeRelAddr = *((PTR_Int32&)pEnumState->pEHInfo)++;
pEHClauseOut->m_pTargetType = dac_cast<PTR_VOID>(pEnumState->pEHInfo + typeRelAddr);
}

That reloc already includes the 4 byte delta so I think we need to do the ++ only after we finished the address computation (these calculation always make my head spin, so I just always set up a breakpoint on these, trace through, and adjust as needed).

@RalfKornmannEnvision
Copy link
Contributor Author

Yes, using IMAGE_REL_BASED_RELPTR32 for the Exception handling related relocation will fix the assert in the object writer. I will take a look at runtime part when I reached the point that it finally compiles and link for the Switch again. I think I am still some days away from this as there were quite some changes in the months I was assigned to do other things. But as it was working once I am still pretty confident I can get it working again.

@RalfKornmannEnvision
Copy link
Contributor Author

Looks like I run in another (little?) issue. I switched over to using the linux ARM64 runtime assemblies. While the windows ones uses dynamic linking for the needed system API calls for unix static linking is used. That's fine and the il compiler clearly creates the interop wrappers for functions like SystemNative_Dup. While emitting the code to the object writer it although creates a symbol reference to SystemNative_Dup. But it never creates a symbol definition.

The issue seems to be a behavior of the ELF file format for AARCH64. As the symbol is not marked as global (function?) it's not in the generated library. It works fine for runtime functions like RhNewObject as they are defined as global symbols.

I spend some time debugging the code and the only way to ensure that a symbol is global is this:

string alternateName = _nodeFactory.GetSymbolAlternateName(name);
                    if (alternateName != null)
                    {
                        _sb.Clear();
                        AppendExternCPrefix(_sb);
                        _sb.Append(alternateName);

                        EmitSymbolDef(_sb, global: true);
                    }

GetSymbolAlternateName just checks the NodeAliases dictionary. It's filled for all the external native functions in the core lib but not for the once in additional assemblies like System.Console.

An easy solution might be to ensure all pinvokes from all references assemblies are added to the alias list, too. But I am not sure if this would be to simple and would cause that functions will be linked in the final executable even if not needed.

Maybe the correct solution would be using the InteropStubManager as AddDependeciesDueToPInvoke seems to be called for every pinvoke method that is used

I assume that it works on other targets as creating the symbol reference is enough there.

@MichalStrehovsky
Copy link
Member

Hm, for APIs like SystemNative_Dup the AOT compiler doesn't actually generate definitions - the definition comes from the static libraries that we link against. We pick these up from the dotnet/runtime repo - did you build ARM64 version of these and passed them to the linker? Sorry if it's a dumb question.

@RalfKornmannEnvision
Copy link
Contributor Author

No, it's not a dumb question. It's more likely that I failed to describe the issue properly.

I have code for all the SystemNative_* compiled in another static library.

When the object writer generates the static libary file it includes a symbol table. The ARM64 ELF format requires that everything that is not part of the library itself or should be used by another library needs to be marked as global and if it is a function it needs a flag for this, too

If I dump the generated static libary the symbol table looks like tis:

0000000001517c0 g     F __managedcode	0000000000000000 RhpLdelemaRef
000000000014a3c0 g     F __managedcode	0000000000000000 RhpResolveInterfaceMethod
000000000014e2d0 g     F __managedcode	0000000000000000 RhpReversePInvokeBadTransition
00000000001515f0 g     F __managedcode	0000000000000000 RhpStelemRef
00000000002fc450 g     F __managedcode	0000000000000000 __managed__Main
000000000010a500 g       .rodata	0000000000000000 g_compilerEmbeddedSettingsBlob

I needed to add some code to ensure that the function marker is set as it was already done for ARM. But the symbol table contains only functions that are reported as global by EmitSymbolDef. Therefore it has all the Rhp* functions but none of the SystemNative_* once as the il compiler only makes EmitSymbolRef calls for them.

Unfortunately the LLVM part of the object writer is not smart enough to detect that there is no code for this symbol reference and convert it to a global symbol for linking. The only way around this would be to ensure that all external functions are not only reported via EmitSymbolRef but as a global symbol via EmitSymbolDef, too. This already happens for all the external functions in the CoreLib. This might be an unplanned side effect as it happens during the rooting of all the methods that should be exported even if these methods are later imported.

While there is a local untyped symbol for all the SystemNative_* functions generated during EmitSymbolRef it seems to be drooped as there is nothing associated to it. Based on the decription of the ELF format such symbols should only be in the table when they are global.

@MichalStrehovsky
Copy link
Member

Ah, I see. I didn't realize RhNewObject is written in C# - I thought it's one of the APIs we have in C, so I was wondering why it works there but doesn't for the libSystem.* libraries.

Would something like this work?

  • Add a bool global flag to EmitSymbolRef in ObjWriter and propagate that to where we do MCSymbol *T = OutContext->getOrCreateSymbol(SymbolName);. Add an if (global) { statement after that line that looks like the one in EmitSymbolDef.
  • On the managed side, update EmitSymbolReference in ObjectWriter to check bool global = target is ExternSymbol and pass that as the flag to EmitSymbolRef

That might do.

I wonder a bit whether instead of target is ExternSymbol we should just add a bool property on ISymbolNode, but maybe that's too much work. We might need to do that if more things than just ExternSymbol need it.

@RalfKornmannEnvision
Copy link
Contributor Author

RhNewObject is as far as I know written in Assembler. It's in the symbol table because there is code in the generated library that calls it.

Thanks again. "target is ExternSymbol" was the hint that I needed. I think I will add a new ComitSymbolExtern as this gives me the chance to mark the symbol as external, too. Not sure if this has an impact right now but it might reduce future problems.

But a quick test showed that if I just call EmitSymbolDef when I detect an ExtrnalSymbolNode the SystemNative_* functions show up in the symbol table

@MichalStrehovsky
Copy link
Member

RhNewObject is as far as I know written in Assembler. It's in the symbol table because there is code in the generated library that calls it.

It's in C# - we reference it through the RuntimeExport name though because of how the runtime library was structured in .NET Native. That's why it shows up as an alternative name.

[RuntimeExport("RhNewObject")]
public static unsafe object RhNewObject(EEType* pEEType)
{
// This is structured in a funny way because at the present state of things in CoreRT, the Debug.Assert
// below will call into the assert defined in the class library (and not the MRT version of it). The one
// in the class library is not low level enough to be callable when GC statics are not initialized yet.
// Feel free to restructure once that's not a problem.
#if DEBUG
bool isValid = !pEEType->IsGenericTypeDefinition &&
!pEEType->IsInterface &&
!pEEType->IsArray &&
!pEEType->IsString &&
!pEEType->IsByRefLike;
if (!isValid)
Debug.Assert(false);
#endif
#if FEATURE_64BIT_ALIGNMENT
if (pEEType->RequiresAlign8)
{
if (pEEType->IsValueType)
return InternalCalls.RhpNewFastMisalign(pEEType);
if (pEEType->IsFinalizable)
return InternalCalls.RhpNewFinalizableAlign8(pEEType);
return InternalCalls.RhpNewFastAlign8(pEEType);
}
else
#endif // FEATURE_64BIT_ALIGNMENT
{
if (pEEType->IsFinalizable)
return InternalCalls.RhpNewFinalizable(pEEType);
return InternalCalls.RhpNewFast(pEEType);
}
}

But a quick test showed that if I just call EmitSymbolDef when I detect an ExtrnalSymbolNode the SystemNative_* functions show up in the symbol table

Wouldn't that place a definition of the symbol at the spot where we're trying to generate a reference to it? I guess linker could theoretically stomp over it if it sees the definition in libSystem.* first, but it feels odd.

Glad you can make progress!

@RalfKornmannEnvision
Copy link
Contributor Author

RalfKornmannEnvision commented Jul 28, 2020

You are right I confused it with RhpNewArray.

It seems the real issue is somewhere else. There is some code missing in LLVM. Sorry for wasting your time

@RalfKornmannEnvision
Copy link
Contributor Author

For future references just in the case someone finds this in the future.

Ignore what I have written above about the symbol table entries for the native functions. I got myself totally confused. While a native function that is called from the generated library must shown up in the Symboltable it should not as a global symbol. It must be as Undefined, Unresolved or in whatever your obj dumper will show it. In my case they are now shown as UND
There is although (at least for ARM64) no need to add them manually. The symbol reference is enough.as long as it is used. The issue is that these missing functions were only used as part of relocation's. While I added the code in objwriter.cpp to emit these additional ARM64 specific relocation types I didn't remember that there were although additional LLVM code to convert the text based relocation type to the Id that is used later. Unfortunately without this piece of code LLVM just ignored all these relocation instead of generation an error or assert.

Another issue that I hadn't fixed in my old proof of concept (only hacked it away) was that the ARM64 relocation's not always work with just MCSymbolRefExpr as target. If there is more than one sub type for a relocation type the symbol references needs to be encapsulated in an additional AArch64MCExpr. Anyway that might be obvious for someone who has more experiences with LLVM than me.

@RalfKornmannEnvision
Copy link
Contributor Author

A simple Hello World is finally working on the Switch again. Including Command line parameters and simple Environment properties like TickCount, ProcessorCount, SystemPageSize written with format string Console.WriteLine. Breakpoints and single step debugging is working too therefore I expect the dwarf line information are correct.
Not sure how much of the CoreRT function this already covers but from the last time I know that the GC is not called for this. So I expect it is not working yet again.
I noticed that the ARM64 funclets are still broken when RyuJIT runs in CoreRT mode.

@RalfKornmannEnvision
Copy link
Contributor Author

Could it be that the stack unwinding parts for ARM64/unix that are already there are somewhat hacked into the system?

I am asking because the code looks suspicious. The main connection between the CoreRT runtime and libUnwind is done with the Registers_arm64_rt class. The strange thing here is that it contains a pointer to REGDISPLAY which is the representation of the register set from the runtime but although uses libunwind::Registers_arm64 as base class. Therefore this class contains now two registers sets but only one is used. 

For AMD64 the connection is implemented by using a shim over REGDISPLAY. For me it looks like that this is the way to go for ARM64, too. Or at least get rid of the libunwind::Registers_arm64  as it adds members that are never used.

While I was working on the POC some time ago I didn't care that much but now as the code should be committed I'd like to fix this, too. Fixes need to be done anyway as it does not work in the current form. Which is not surprising as I am assume I am the first one who even has reached the point that this code is executed at all.

@MichalStrehovsky
Copy link
Member

Are you by any chance referring to things added in this pull request: #7504

Maybe the conversation in that pull request would be helpful. Otherwise I'll have to defer to Jan since I'm not up to speed on those parts. At some point I'll need to spend some time with the unwinder to learn how it works, but now is not a good time.

@RalfKornmannEnvision
Copy link
Contributor Author

Are you by any chance referring to things added in this pull request: #7504

Maybe the conversation in that pull request would be helpful. Otherwise I'll have to defer to Jan since I'm not up to speed on those parts. At some point I'll need to spend some time with the unwinder to learn how it works, but now is not a good time.

Yes, this are the changes. Based on the conversation I seems that the primary idea behind this commit was to get the runtime compile on ARM64/Unix. I am pretty sure that it was never run as the code that generates the information that the unwind need were not in RyuJit. I just added it again last week from my old POC code my current local version of RyuJit.

@MichalStrehovsky
Copy link
Member

Wow, great progress!

The assert in RyuJIT is dotnet/runtime#38541 - it also reproed on x64 CoreCLR. The fix for that was to delete the assert, so it's an easy fix to port to your branch :).

@RalfKornmannEnvision
Copy link
Contributor Author

It will probably then need an adjustment here:

// Read target type
{
// @TODO: CORERT: Compress EHInfo using type table index scheme
// https://github.com/dotnet/corert/issues/972
Int32 typeRelAddr = *((PTR_Int32&)pEnumState->pEHInfo)++;
pEHClauseOut->m_pTargetType = dac_cast<PTR_VOID>(pEnumState->pEHInfo + typeRelAddr);
}

That reloc already includes the 4 byte delta so I think we need to do the ++ only after we finished the address computation (these calculation always make my head spin, so I just always set up a breakpoint on these, trace through, and adjust as needed).

It was just removing the ++ completely beside of porting all the related assembler code to the other syntax to get the exception handling working

@RalfKornmannEnvision
Copy link
Contributor Author

Unfortunately I run into another issue when another stack walk is required while the code is in a exception handling funclet. A rethrow or a garbage collection can cause this.
I was able to track the problem down to 

PTR_UInt64 d = (PTR_UInt64)(m_RegDisplay.SP);
for (int i = 0; i < 8; i++)
{
m_RegDisplay.D[i] = *d++;
}
SP = (PTR_UIntNative)d;

m_RegDisplay.SP  does not point to the first saved data but the frame pointer instead. 
This code was introduce a long time ago with d6c2c08#diff-ba6b29d9ecfa3205477eed3f5c7deea8

Therefore I am not sure if the stack pointer needs to be corrected there before starting to copy the saved register values or if m_RegDisplay.SP should already point to the data instead of the frame pointer.
To test my assumption I added a 
d += 2; 

to forward the pointer over the frame pointer and return address and my test code now works on the Switch. But I might have broken it somewhere else where I cannot test.
But if the SP should point to the data there is another issue in this function. 
Add the end the SP in m_RegDisplay is set to the new one

m_RegDisplay.SetSP((UIntNative)dac_cast<TADDR>(SP));

Again this one points to the FramePointer and not the data.
Therefore IMHO it need to the corrected either at the begin or the end. As the rest of the code seems to work fine if SP in RegDisplay points to the Framepointer I assume correcting it at the top is the right way but I am not sure.

@jkotas
Copy link
Member

jkotas commented Aug 6, 2020

m_RegDisplay.SP does not point to the first saved data but the frame pointer instead.

Where did this bad SP come from? It sounds like the problem happened even before this function got called.

@RalfKornmannEnvision
Copy link
Contributor Author

m_RegDisplay.SP does not point to the first saved data but the frame pointer instead.

Where did this bad SP come from? It sounds like the problem happened even before this function got called.

Spend some time stepping through the assembler code (generated by RyuJit and ported by me from the asm files)

In the end it boil down to this section in the ALLOC_THROW_FRAME macro

       mov x3, sp
 
        // Setup a PAL_LIMITED_CONTEXT on the stack {
        .if \exceptionType == HARDWARE_EXCEPTION
            sub sp,sp,#0x50
            stp x3, x1, [sp]   // x3 is the SP and x1 is the IP of the fault site 
            // TODO PROLOG_PUSH_MACHINE_FRAME
        .else
            PROLOG_STACK_ALLOC 0x50
            stp x3, lr, [sp]   // x3 is the SP and lr is the IP of the fault site 
        .endif

There is stores the current SP in the PAL_LIMITED_CONTEXT that is later used to do the unwind. The macro itself is used by RhpThrowHwEx, RhpThrowEx and RhpRethrow always as first operation. I checked it for both RhpThrowEx and RhpRethrow In any case SP was pointing to the FP/LR pair from the caller and not to the actual stack data. With this in mind I took a look at the code that Ryujit has produced. 
The prolog of the function that later calls RhpThrowEx is:

0000000000021430 <repro_Program__ExceptionTest>:
   21430:	a9b97bfd 	stp	x29, x30, [sp, #-112]!
   21434:	910003fd 	mov	x29, sp

After this sp is not touched anymore until the epilog. I checked some other functions and it is always the same pattern that is used.
I might be wrong but for me this looks like that Ryujit is generating a new stack frame in the way that SP points to the FP/LR pair that will be later used for the ret. As my function always throws an exception Ryujit was smart enough to not create the epilog and just added a brk after the throw. But I checked with another more simple function

000000000021530 <repro_Program___ctor>:
   21530:	a9be7bfd 	stp	x29, x30, [sp, #-32]!
   21534:	910003fd 	mov	x29, sp
   21538:	f9000fa0 	str	x0, [x29, #24]
   2153c:	f9400fa0 	ldr	x0, [x29, #24]
   21540:	94015498 	bl	767a0 <Object___ctor>
   21544:	d503201f 	nop
   21548:	d503201f 	nop
   2154c:	a8c27bfd 	ldp	x29, x30, [sp], #32
   21550:	d65f03c0 	ret

Based on this I think that the SP is not bad at this point. It's just that Ryujit builds code with the assumption that SP points to the FP/LR pair. Maybe this is different depending on the target system you build for. To test this I would need to get a cross compile from Windows AMD64 to Windows ARM64 working but I have no Windows ARM64 system to actually test it.

@jkotas
Copy link
Member

jkotas commented Aug 10, 2020

I checked it for both RhpThrowEx and RhpRethrow In any case SP was pointing to the FP/LR pair from the caller and not to the actual stack data

What would be the actual stack data?

It's just that Ryujit builds code with the assumption that SP points to the FP/LR pair.

It may often be the case, but there are certainly situations in RyuJIT generated core where SP does not point to the FP/LR pair.

@RalfKornmannEnvision
Copy link
Contributor Author

I checked it for both RhpThrowEx and RhpRethrow In any case SP was pointing to the FP/LR pair from the caller and not to the actual stack data

What would be the actual stack data?

On one side the local variables etc of the method that have called RhpThrowEx or RhpRethrow. On the other side the ALLOC_THROW_FRAME macro saves a Pal limited context. I double checked with the asm version of the code that was already there and I could not find a difference in this regard. 

But as the ARM64 and ARM code in the unwind function looked very similar I did some digging. I think I finally found the issue.

While the ARM code starts to store the d registers first and than SP/PC of the fault side

PROLOG_VPUSH {d8-d15}
PROLOG_PUSH "{r0,lr}"        // push {sp, pc} of fault site

the ARM64 code does it the other way around

        PROLOG_STACK_ALLOC 0x50
        stp x3, lr, [sp]   // x3 is the SP and lr is the IP of the fault site 
        .endif
        stp d8, d9, [sp, #0x10]
        stp d10, d11, [sp, #0x20]
        stp d12, d13, [sp, #0x30]
        stp d14, d15, [sp, #0x40]

But in they unwind code both parts start with the D registers. My "fix" for ARM64 was therefore just ignoring the SP/PC that were stored first. The ARM unwind code seems to ignore them too but after the D registers were copied.
I must confess that I am now very unsure what the actual issue is here? Either the data are stored in the wrong order and ARM64 should store them like the ARM code does. Or the order is correct and therefore the unwind needs to ignore the first two values like I did. Please advice.
I compiled and run the Exception test on the Switch and beside of the null reference exceptions it works with my "fix". The null reference does not work because of limited support for hardware exceptions but this should not mater here.

It's just that Ryujit builds code with the assumption that SP points to the FP/LR pair.

It may often be the case, but there are certainly situations in RyuJIT generated core where SP does not point to the FP/LR pair.

I think I confused myself again here. While it was FP/LP when the throw was called later during the unwind it point's to SP/IP of the fault side (s. above)

@jkotas
Copy link
Member

jkotas commented Aug 11, 2020

StackFrameIterator::UnwindThrowSiteThunk and StackFrameIterator::UnwindFuncletInvokeThunk have to be in sync with the assembly methods that they are unwinding. From you problem description, it sounds like that the asm code and these unwind methods are not in sync, but I do not see any obvious bug by just inspecting the code.

@RalfKornmannEnvision
Copy link
Contributor Author

I looked in the assembler code for throwing the exception for all other cpu architectures. In any case beside ARM64 the information about the faulting side is stored after the registers are saved. I don't know why this was changed for ARM64. But for all architectures the UnwindFuncletInvokeThunk expects that the registers are stored first. 
Therefore I either need to fix UnwindFuncletInvokeThunk or the write order during the throw. 
To fix UnwindFuncletInvokeThunk we need to skip over the first two values before start to copy the register values. And I think this line needs to be changed, too
SP += EQUALS_RETURN_ADDRESS(m_ControlPC, RhpCallCatchFunclet2) ? 4 : 2;
But I need to double check this.
Changing it during the throw would be not that complicated but I don't were the code is that uses the SP and IP of the fault side as it might need some fixes than. 

@jkotas
Copy link
Member

jkotas commented Aug 11, 2020

fix UnwindFuncletInvokeThunk or the write order during the throw.

UnwindFuncletInvokeThunk is for unwinding RhpCallCatchFunclet, RhpCallFinallyFunclet or RhpCallFilterFunclet.

Throw is done by RhpThrowHwEx, RhpThrowEx or RhpRethrow. UnwindThrowSiteThunk is for unwinding these methods.

I would expect the fix to be choice between fixing UnwindThrowSiteThunk or the write order during the throw.

Maybe the problem is that a wrong StackFrameIterator::Unwind* method is getting selected somehow?

@RalfKornmannEnvision
Copy link
Contributor Author

RalfKornmannEnvision commented Aug 11, 2020

Now I am felling stupid that I missed the obvious pairing there.

At least it got me one step ahead. I could confirm that the correct unwind methods is called. The issue is related to the unwinding of the funclet.based on the stored dwarf information. As I generate them it's most likely my fault. The register values that libunwind delivers are the values after the prolog of the funclet. But the UnwindFuncletInvokeThunk expect the SP value before the prolog has been executed. As in my case the funclet just pushes an additional stack frame I am 16 bytes (2 register) off. I assume that if the ctach blog has local variables the difference would be larger as the prolog of the funclet needs to make more room.

I need to recheck the whole generation of the unwind data as something seems wrong there. 

Edit: Now I felling extremely stupid. It was the generating of the unwind data. The special case when the CFA is not assigned to a register but still is changing in the prolog was not handled correctly. Sorry for wasting your time

@RalfKornmannEnvision
Copy link
Contributor Author

My test project (13.5MB C# sources / 7.8MB managed assembly) is running again.

It runs stable and doing many GC collects for all generations. One run is comparable to nearly 3 hours of real gameplay and it seems I can run as many of these in a loop as I want.

Unfortunately I couldn't find the exact results from the last run anymore but based on what I can remember it got even faster. In any case the performance is significant better than what we are able to achieve with Mono-AOT

@RalfKornmannEnvision
Copy link
Contributor Author

It seems that I am close to the point where the general parts of the ARM64 support are complete. The two primary areas left are the Intrinsics and the assembler code in System.Private.TypeLoader.Native. Are there any test cases that I can use to check the assembler code after it is ported? So far it seems my test programs don't need these features. It's the same for the ResolveVirtualFunction case in the ReadyToRunHelperNode. I have added the code but it is untested as it was never triggered so far. There might be some more of the assembler helpers that are linked but not called. But as these are just direct ports from the asm version used for Windows they should work.

Anything beyond this point should be platform specific. Which I either can not make public or someone else need to do as I don't have a ARM64 based desktop system to fill in the gaps for Linux and Windows.

So far I have identified two areas that will need additional code for Windows and Linux

  • The Thunkpool as the Switch needed a custom solution for this
  • Hardware based exceptions. Null reference as example

Programs that don't need these features should hopefully work.

@MichalStrehovsky
Copy link
Member

You can skip System.Private.TypeLoader.Native - that code should only be needed to support universal shared generics and we don't have that in CoreRT (at least for now) - only .NET Native's code generator supports it - RyuJIT doesn't. We're actually scoping universal share code out of the minimum viable product in the runtimelab repo move. (If you're interested in what's universal shared code - there's a very brief explanation here)

The rough plan for the move to the runtimelab repo is here: dotnet/runtimelab#4

The ResolveVirtualFunction code path should be reachable if you construct a delegate to a (non-generic) virtual method. Might need to make it non-obvious to RyuJIT so that it doesn't devirtualize (e.g. store an instance of something into a static field and then construct a delegate Func<string> toString = myStaticField.ToString; - the helper should be used to resolve the target of the virtual method so that subsequent delegate invocations are direct calls without virtual dispatch overhead). If it doesn't trigger it, try unoptimized build (or pass --noscan to ILC) - I don't remember if we still use the helper call in optimized builds for this case or if RyuJIT inlines the helper functionality for this.

@MichalStrehovsky
Copy link
Member

Actually - it's possible ResolveVirtualFunction is only reachable from IL, not from C#. The delegate case might go through the delegate creation helper.

If you can't hit it, placing a brk as the implementation (or just before the untested implementation if you already have it) might be the best course of action. IL is a corner case that can be fixed up when needed. We'll hit this when we start running the CoreCLR test bed.

@RalfKornmannEnvision
Copy link
Contributor Author

While I was hoping I could do it a little bit faster I think the general ARM64 code will be finished in the next few days. Anything beyond this should be Nintendo Switch specific and need to be discussed separately if someone needs/want it. Therefore the question. Do you prefer the whole ARM64 code in one pull request or should it be done in multiple smaller once?

@MichalStrehovsky
Copy link
Member

That's great news! Thank you!

I'll leave the number of pull requests up to you - but I think a single pull request should be manageable. If there's something unexpected, we can split that part off.

@RalfKornmannEnvision
Copy link
Contributor Author

Will do.

The changes although include modifications for the ObjectWriter and the old LLVM version it uses. I replaced llvm.patch file with a new one that includes all the old changes and my additional once to keep the build process the same, 

Beside of CoreRT I unfortunately needed to touch two files in RyuJIT, too. They include a fix for funclets (the stack frame was broken when running RyuJIT in CoreRT mode) and the generating of ARM64 unwind informations for ELF based systems as there was only code to do this for Windows. I expect that we need to get these in first as the CoreRT part will not work without the CFI data.
Disclaimer: I only tested everything on the Nintendo Switch by using a RyuJIT that was compiled to run on AMD64 Windows but generated code for ARM64 Unix. I expect that there will be some more additional work needed to get it working on a ARM64 Windows or Unix based system. Even more work for the new ARM64 based Apple systems. Unfortunately I don't have the hardware to do this as our current project doesn't target these systems.

@MichalStrehovsky
Copy link
Member

We don't have CI set up for ARM64 (and since the port is incomplete for Linux/Windows, it would probably not pass anyway) - I think the RyuJIT change can be submitted at the same time as the CoreRT change since the extent of CI validation will be limited to ensuring we don't regress x64. I assume you didn't have to change the interface between the JIT and the AOT compiler.

We should get CI support for this "for free" once the move to the runtimelab repo is complete. The runtime build system, CI testing, and publishing has ARM64 support in the dotnet/runtime repo, that dotnet/runtimelab is based on.

Unfortunately I don't have the hardware to do this as our current project doesn't target these systems.

I think your changes take us very close to having full Linux ARM64 support. Finishing this up shouldn't be too much work for someone else. Thank you!

@RalfKornmannEnvision
Copy link
Contributor Author

No, I didn't change the interface as there was already CFI support for AMD64 on Unix.

While I am in the process to do a final check to see if I have missed something I noticed Dummies.asm in the ARM64 folder

;; Licensed to the .NET Foundation under one or more agreements.
;; The .NET Foundation licenses this file to you under the MIT license.
#include "AsmMacros.h"
TEXTAREA
LEAF_ENTRY RhpLMod
DCW 0xdefe
bx lr
LEAF_END RhpLMod
LEAF_ENTRY RhpLMul
DCW 0xdefe
bx lr
LEAF_END RhpLMul
END

I might be wrong but I believe this was copied there by mistake from the ARM folder as the code is not ARM64 and a 64Bit host should not need any helper for these functions.

@MichalStrehovsky
Copy link
Member

Yes, that file is not used for anything, even on .NET Native (where it originates). It got integrated from a prototype branch by accident I assume. We can just delete it.

@RalfKornmannEnvision
Copy link
Contributor Author

This should hopefully the last question before I can make the pull request. 

I noticed that there is a GcProbe.asm but no empty GcProbe.S as for most of the other assembler code I needed to convert. The GcProbe.S is missing for all other platforms, too. Can I assume that it is currently not needed for *nix based host system and therefore skip it for now?

@MichalStrehovsky
Copy link
Member

Yes, seem like this file is only included in CMakeList on Windows and that's why we don't need to add an empty file to make the build happy. Seems like TARGET_UNIX is using RhpGcProbeHijack* helpers from portable.cpp and those just assert. I assume this is unreachable outside Windows.

@RalfKornmannEnvision
Copy link
Contributor Author

I opened the push request for the RyuJIT changes: dotnet/runtime#41023

The one for the CoreRT changes should follow tomorrow

@RalfKornmannEnvision
Copy link
Contributor Author

RalfKornmannEnvision commented Aug 20, 2020

Might take a little bit longer as at least one of my changes broke something for AMD64 on Linux/OSX.  I suspect it's related to the stack unwinding. Maybe it need a more diverse code path for AMD64 vs ARM64 in the shared part.

Excluded the cirtical submit with the modified stack unwind. I will do another pull request when I have a common solution for AMD64 and ARM64.

Anything else doesn't break anything for ARM64: #8271

Sorry it's pretty huge

@RalfKornmannEnvision
Copy link
Contributor Author

For reference if someone find this in the future

Because of various limitations based on the special environment (Game Consoles) I need to get CoreRT working I was not able to implement/check some features. This list might not be complete but should be a good starting point what still needs to be done:

  • Exceptions based on hardware signals like null reference, stack overflow etc
  • Thunk pool management. The Thunk code itself is working but there is not code that will managed the thunk pool as it is highly dependent on the host OS.
  • The intrinsic support is implemented but not tested. I might be wrong but it looked like that the support of the special ARM64 intrinsics will be part of .Net 5. Beside of this the function that checks what intrinsics are actually supported need to implemented for every host OS. But for the OS that .Net Core supports the code should be already in the regular runtime.
  • It might still be necessary to fix some of the GCToOSInterface and REDHAWK_PALEXPORT functions as I needed to replace this with a fully custom implementation. But with a little bit of luck they will just work fine.
  • As debugging is still somewhat broken for me I am not sure if there is an issue with the debug information itself or if it is just the debugger that have some issues with this foreign code I have injected by using RyuJIT as compiler instead of LLVM.

@RalfKornmannEnvision
Copy link
Contributor Author

The missing part of the unwind code: #8290

Now it doesn't break AMD64 anymore

@RalfKornmannEnvision
Copy link
Contributor Author

And with this finally in the master branch I believe I have done everything I could do to get CoreRT working on ARM64 based systems in general. Someone with an ARM64 based Linux system needs to fill in the still missing parts.

If someone else beside me have the need to run CoreRT on a Game Console we will need to find a way how to handle this case.

In the case I find some more issues that I can fix in a general way I will submit future pull requests.

@MichalStrehovsky
Copy link
Member

Thank you! Let's close this - we can open new issues for the leftover work.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants