-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Share UTF8 converters between coreclr and mono #85558
Conversation
@janvorli this is a draft PR, not ready for review yet.. |
@am11 ah, I am sorry for that, I was mislead by the fact that @marek-safar has added me as a reviewer. |
The fallback logic in coreclr and mono is a bit different. e.g. mono's eglib does not handle these error cases: https://github.com/dotnet/runtime/blob/25a4e045aa975656a96e3edd8380093177fe841d/src/coreclr/pal/tests/palsuite/locale_info/MultiByteToWideChar/test4/test4.cpp#L62-L114 Similarly, these Reflection.Emit tests are disabled on mono and now failing on coreclr: https://github.com/dotnet/runtime/blob/25a4e045aa975656a96e3edd8380093177fe841d/src/libraries/System.Reflection.Emit/tests/TypeBuilder/TypeBuilderDefineEvent.cs#L23-L25 I tried to implement the fallback logic in eglib but it turned out to be non-trivial given how we traverse the character stream. It requires look-ahead and look-behind mechanics which is currently missing in eglib. Accommodating the incomplete surrogate cases, where more than one character is missing e.g. With some refactoring in eglib, I think we can feature-complete it (such that it handles the list of known error cases). It can be done in a separate PR and we can continue unification work in this one. If we entertain the alternative; port coreclr's utf8.cpp to mono, we will lose utf32/ucs4 transformation functionality needed by mono public API for embedded apps
cc @jkotas, @lambdageek |
I assume that this fallback logic matches the fallback logic of managed UTF8 encoders/decoders. Is that correct? We have intentionally tried to match the behavior of managed and unmanaged UTF8 converters, so that we do not need to deal with behavior changes every time we move UTF8 conversion from unmanaged code to managed code.
I do not think we need utf32/ucs4 transformation in the minipal. It should stay Mono-only if these embedding APIs are still relevant. |
Yes, coreclr's utf8.cpp was initially based on corefx implementation: dotnet/coreclr#3809, but corefx implementation has evolved/diverged since (it is using vectorized APIs these days). Going by test results, we have kept the fallback logic in sync with managed implementation. There may be some corner cases which are not handled by utf8.cpp that managed implementation supports. mono's eglib implementation is based on public documentation of GNOME's glib, per the README: https://github.com/mono/eglib; it is not a copy of the original source https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gutf8.c. The original source seem to support some of these fallback cases. I haven't tested it if that is the case.
Sounds good. I will try to port utf8.cpp to mono (without C++ runtime dependency) and replace eglibc implementation; modulo the utf32 bits. |
25a4e04
to
fd9827e
Compare
15315df
to
1b49bf6
Compare
1b49bf6
to
3818b1e
Compare
This reverts commit f9845ac6f53dc95fb747eb21351dfa9412397217.
3f5d1d6
to
66b6ccd
Compare
66b6ccd
to
50b0549
Compare
cc @GrabYourPitchforks This PR is refactoring the unmanaged utf8 encoding/decoding so that it can be shared between all runtimes. The unmanaged utf8 encoding/decoding is used by the loader that assumes trusted input, it should have low security risk profile. The code matches the old managed implementation that you have redone some time ago. Please let us know if you have concerns about this change, or if you would like to see more to be done as a follow up. |
c279116
to
18c04bf
Compare
Depending on the input, PAL's MultiByteToWideChar / WideCharToMultiByte wrappers are now 20-22% faster. Perhaps we can vectorize this code by adding SIMD wrappers in minipal (e.g. once https://github.com/dotnet/runtime/pull/87865/files#diff-e0b84b85c730fa4b4a840dba229a4d630e88fcd33f17208eb35a905796e89f3c is in), similar to how managed code was updated. These APIs can are called on hot paths as well. |
I would rather spend time on pushing these hot paths to C# where they can take advantage of the vectorized managed implementation. |
@jkotas, e.g. this OS Thread Id: 0x7f4db74 (1)
Child SP IP Call Site
000000016FDFBB40 0000000101E40F3C libcoreclr.dylib!minipal_convert_utf16_to_utf8
000000016FDFBB40 0000000101E40138 libcoreclr.dylib!WideCharToMultiByte + 232
000000016FDFBB80 0000000101E37F20 libcoreclr.dylib!CreateFileW + 236
000000016FDFBD20 0000000101F165E4 libcoreclr.dylib!PEImage::TryOpenFile(bool) + 156
000000016FDFBD60 0000000101F93288 libcoreclr.dylib!BinderAcquirePEImage + 172
000000016FDFBDF0 000000010215FF80 libcoreclr.dylib!BINDER_SPACE::AssemblyBinderCommon::GetAssembly(SString&, int, BINDER_SPACE::Assembly**, BundleFileLocation) + 124
000000016FDFBE50 00000001021613AC libcoreclr.dylib!BINDER_SPACE::AssemblyBinderCommon::BindByTpaList(BINDER_SPACE::ApplicationContext*, BINDER_SPACE::AssemblyName*, bool, BINDER_SPACE::BindResult*) + 2400
000000016FDFC1F0 00000001021605A0 libcoreclr.dylib!BINDER_SPACE::AssemblyBinderCommon::BindLocked(BINDER_SPACE::ApplicationContext*, BINDER_SPACE::AssemblyName*, bool, bool, BINDER_SPACE::BindResult*) + 404
000000016FDFC260 000000010215F5CC libcoreclr.dylib!BINDER_SPACE::AssemblyBinderCommon::BindByName(BINDER_SPACE::ApplicationContext*, BINDER_SPACE::AssemblyName*, bool, bool, bool, BINDER_SPACE::BindResult*) + 208
000000016FDFC3E0 000000010215F084 libcoreclr.dylib!BINDER_SPACE::AssemblyBinderCommon::BindAssembly(AssemblyBinder*, BINDER_SPACE::AssemblyName*, bool, BINDER_SPACE::Assembly**) + 208
000000016FDFC680 00000001021662D8 libcoreclr.dylib!DefaultAssemblyBinder::BindUsingAssemblyName(BINDER_SPACE::AssemblyName*, BINDER_SPACE::Assembly**) + 96
000000016FDFC730 0000000101E7CBF0 libcoreclr.dylib!AssemblyBinder::BindAssemblyByName(AssemblyNameData*, BINDER_SPACE::Assembly**) + 100
000000016FDFC760 0000000101F92F30 libcoreclr.dylib!AssemblySpec::Bind(AppDomain*, BINDER_SPACE::Assembly**) + 796
000000016FDFCC40 0000000101E7118C libcoreclr.dylib!AppDomain::BindAssemblySpec(AssemblySpec*, int) + 644
000000016FDFD240 0000000101F14030 libcoreclr.dylib!PEAssembly::LoadAssembly(unsigned int) + 192
000000016FDFD2D0 0000000101E8A214 libcoreclr.dylib!Module::LoadAssemblyImpl(unsigned int) + 168
000000016FDFD370 0000000101E798A0 libcoreclr.dylib!Assembly::FindModuleByTypeRef(ModuleBase*, unsigned int, Loader::LoadFlag, int*) + 528
000000016FDFD3C0 0000000101E99888 libcoreclr.dylib!ClassLoader::LoadTypeDefOrRefThrowing(ModuleBase*, unsigned int, ClassLoader::NotFoundAction, ClassLoader::PermitUninstantiatedFlag, unsigned int, ClassLoadLevel) + 484
000000016FDFD490 0000000101F2CBD4 libcoreclr.dylib!SigPointer::GetTypeHandleThrowing(ModuleBase*, SigTypeContext const*, ClassLoader::LoadTypesFlag, ClassLoadLevel, int, Substitution const*, ZapSig::Context const*, MethodTable*, SigPointer::HandleRecursiveGenericsForFieldLayoutLoad*) const + 1448
000000016FDFD5A0 0000000101F62878 libcoreclr.dylib!ZapSig::DecodeType(Module*, ModuleBase*, unsigned char const*, ClassLoadLevel, unsigned char const**) + 100
000000016FDFD620 0000000101EEDB88 libcoreclr.dylib!LoadDynamicInfoEntry(Module*, unsigned int, unsigned long*, int) + 320
000000016FDFDB20 0000000101E8BF54 libcoreclr.dylib!Module::FixupNativeEntry(READYTORUN_IMPORT_SECTION*, unsigned long, unsigned long*, int) + 92
000000016FDFDB50 0000000101F67450 libcoreclr.dylib!Module::FixupDelayList(unsigned long long, int) + 204
000000016FDFDBD0 0000000101F67150 libcoreclr.dylib!ReadyToRunInfo::GetEntryPoint(MethodDesc*, PrepareCodeConfig*, int) + 872
000000016FDFDC80 0000000101F1EDDC libcoreclr.dylib!MethodDesc::GetPrecompiledR2RCode(PrepareCodeConfig*) + 60
000000016FDFDCC0 0000000101F1EA94 libcoreclr.dylib!MethodDesc::PrepareILBasedCode(PrepareCodeConfig*) + 172
000000016FDFDD20 0000000101EA1074 libcoreclr.dylib!CodeVersionManager::PublishVersionableCodeIfNecessary(MethodDesc*, CallerGCMode, bool*, bool*) + 1336
000000016FDFDE60 0000000101F22994 libcoreclr.dylib!MethodDesc::DoPrestub(MethodTable*, CallerGCMode) + 212
000000016FDFDF10 0000000101F224EC libcoreclr.dylib!PreStubWorker + 524
000000016FDFDF50 [PrestubMethodFrame: 000000016fdfdf50] Microsoft.AspNetCore.dll!Microsoft.AspNetCore.Builder.WebApplicationBuilder..ctor(Microsoft.AspNetCore.Builder.WebApplicationOptions, Boolean, System.Action`1<Microsoft.Extensions.Hosting.IHostBuilder>)
000000016FDFDFD0 00000001020F7874 libcoreclr.dylib!ThePreStub + 80
000000016FDFE100 000000010467582C Microsoft.AspNetCore.dll!Microsoft.AspNetCore.Builder.WebApplication.CreateSlimBuilder(System.String[]) + 108 [/_/src/DefaultBuilder/src/WebApplication.cs @ 122]
000000016FDFE120 00000001045619D8 api1.dll!Program.<Main>$(System.String[]) + 112 [/Users/am11/projects/api1/Program.cs @ 4]
FFFFFFFFFFFFFFFF 00000001045619D8
FFFFFFFFFFFFFFFF 00000001020F82B4 libcoreclr.dylib!CallDescrWorkerInternal + 132
000000016FDFE1D0 0000000101F75E94 libcoreclr.dylib!MethodDescCallSite::CallTargetWorker(unsigned long long const*, unsigned long long*, int) + 856
000000016FDFE440 0000000101E7AE8C libcoreclr.dylib!RunMain(MethodDesc*, short, int*, PtrArray**) + 632
000000016FDFE650 0000000101E7B1D8 libcoreclr.dylib!Assembly::ExecuteMainMethod(PtrArray**, int) + 260
000000016FDFE8E0 0000000101EA3218 libcoreclr.dylib!CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*) + 616
000000016FDFE9E0 0000000101E68658 libcoreclr.dylib!coreclr_execute_assembly + 204
000000016FDFEA40 0000000100005BC0 corerun!run(configuration const&) + 6696 at /Users/am11/projects/runtime/src/coreclr/hosts/corerun/corerun.cpp:0
000000016FDFF0C0 0000000100002EC8 corerun!main + 1372 at /Users/am11/projects/runtime/src/coreclr/hosts/corerun/corerun.cpp:0
000000016FDFF2B0 00000001968B3E50 dyld!start + 2544 To accomplish "managed UTF-8 conversion" here, should we make it so |
It should be possible to move the whole assembly loader to C#. We would need a special unmanaged path to load CoreLib to bootstrap, but the rest can be managed. It is too big to do it all at once. I think a good place to start with this refactoring would be (The UTF8 conversion under PEImage::TryOpenFile does not look perf sensitive. It is done once per loaded assembly. Assembly loading does a lot of expensive operations, non-vectorized UTF8 conversion as part of it is not a big deal.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this one. Thank you!
Fix #78490