Expose a Utf8String type. #17872

ghost · 2018-05-03T14:18:17Z

We want to start prototyping Utf8String in CoreFxLab
and for that, we'll need a min-bar System.Utf8String
class exposed from System.Private.CoreLib.

The MethodTable is tricked out exactly as System.String
is except that ComponentSize is 1 rather than 2.

Other than that, this is copy-paste from the
StringObject code with all the necessary offerings
needed to keep the build happy.
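
To make the ComponentSize difference concrete, here is a minimal sketch of how a per-element size feeds into the total object size. It assumes a simplified 64-bit model (8-byte object header, 8-byte MethodTable pointer, 4-byte length, one terminator element, pointer-size rounding). `TypeLayout`, `kUtf16Layout`, `kUtf8Layout`, `AlignUp8` and `ObjectSize` are hypothetical names for illustration, not identifiers from the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for the size data a runtime keeps per type.
struct TypeLayout {
    std::size_t baseSize;       // fixed part, including one terminator element
    std::size_t componentSize;  // bytes added per element
};

// Assumed 64-bit fixed part: 8 (object header) + 8 (MethodTable pointer) + 4 (length).
constexpr std::size_t kFixedBytes = 8 + 8 + 4;

// Same shape for both types; only the element width differs (2 vs. 1).
const TypeLayout kUtf16Layout = {kFixedBytes + 2, 2};
const TypeLayout kUtf8Layout  = {kFixedBytes + 1, 1};

// Round n up to the next multiple of 8 (the pointer size in this model).
inline std::size_t AlignUp8(std::size_t n) {
    return (n + 7) & ~static_cast<std::size_t>(7);
}

// Total size = fixed part + length * component size, rounded up.
inline std::size_t ObjectSize(const TypeLayout& type, std::size_t length) {
    assert(type.componentSize > 0);
    return AlignUp8(type.baseSize + length * type.componentSize);
}
```

Under these assumptions, 100 elements come to 224 bytes with a component size of 2 and 128 bytes with a component size of 1.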

jkotas · 2018-05-03T14:25:29Z

src/mscorlib/src/System/Utf8String.cs

+        // Do not reorder these fields. Must match layout of Utf8StringObject in object.h.
+        private int _length;
+        [CLSCompliant(false)]
+        public byte _firstByte; // TODO: Is public for experimentation in CoreFxLab. Will be private in its ultimate form.


Do you really need this to be public? It would be better to have a ref readonly byte GetPinnableReference() method; that is the proper public method we will want to have eventually.

It allows the code in CoreFxLab to be written the way instance members of Utf8String itself would be.

Though if GetPinnableReference() adds no overhead, I'm fine with it. This PR is to unblock @GrabYourPitchforks, so I'll let him make the call here.

Sure, use GetPinnableReference instead if you wish. The only overhead I can think of is that an implicit null check may be emitted, but the JIT is generally good about eliding these where possible. If we find stray null checks we can open JIT bugs.

jkotas · 2018-05-03T14:28:51Z

What do you expect to get by having this in CoreLib for prototyping?

For prototyping, you can just have a class that wraps byte[] array, no need to have anything in corelib.

jkotas · 2018-05-03T14:33:45Z

src/vm/ecalllist.h:1391:40: error: use of undeclared identifier 'gUtf8StringFuncs'; did you mean 'gStringFuncs'

ghost · 2018-05-03T14:36:51Z

cc @GrabYourPitchforks

The benefit is to be able to benchmark and demo using an implementation that will have the same memory usage characteristics as what we'd ship.

stephentoub · 2018-05-03T14:43:19Z

If it's a prototype, why not keep it in a separate branch rather than master?

ghost · 2018-05-03T14:44:09Z

Is it possible to consume a CoreCLR from a dev branch into CoreFxLab?

stephentoub · 2018-05-03T14:45:51Z

Is it possible to consume a CoreCLR from a dev branch into CoreFxLab?

I don't know if that's currently possible, but you can certainly do so locally. My opinion is master should be for stuff we're on track to ship, not prototypes.

jkotas · 2018-05-03T14:48:45Z

The benefit is to be able to benchmark and demo using an implementation that will have the same memory usage characteristics

You would not really be able to use this for benchmarking because the allocator is slow and the JIT optimizations are missing.

ghost · 2018-05-03T15:04:43Z

Btw, while we're debating where to merge all this in (we need more stakeholders in the room for that, like the people who will be doing the main work on Utf8String), I'd still like to get eyes on the code itself. This GC-related stuff is tricky.

jkotas · 2018-05-03T16:09:40Z

This GC-related stuff is tricky.

What have you done to test it?

jkotas · 2018-05-03T16:10:48Z

src/classlibnative/bcltype/utf8stringnative.cpp

+
+// Compile the string functionality with these pragma flags (equivalent of the command line /Ox flag)
+// Compiling this functionality differently gives us significant throughout gain in some cases.
+#if defined(_MSC_VER) && defined(_TARGET_X86_)


This is copy&paste that does not make sense given what's in this file.

jkotas · 2018-05-03T16:11:15Z

src/classlibnative/bcltype/utf8stringnative.cpp

+// The .NET Foundation licenses this file to you under the MIT license.
+// See the LICENSE file in the project root for more information.
+//
+// File: StringNative.cpp


Does not match the actual name. Better to just delete it; it is a useless comment.

jkotas · 2018-05-03T16:11:27Z

src/classlibnative/bcltype/utf8stringnative.h

+// The .NET Foundation licenses this file to you under the MIT license.
+// See the LICENSE file in the project root for more information.
+//
+// File: StringNative.h


jkotas · 2018-05-03T16:13:56Z

src/vm/object.h

+    //   DWORD m_OptionalPadding (this is an optional field and will appear based on need)
+
+public:
+    VOID    SetUtf8StringLength(DWORD len) { LIMITED_METHOD_CONTRACT; _ASSERTE(len >= 0); m_StringLength = len; }


This should be called just SetLength, I think. The managed side has Length property, not Utf8Length property.

ghost · 2018-05-03T16:14:46Z

What have you done to test it?

I stepped through the allocation code, made sure the sizes passed to the GC are OK, and tested multiple Utf8String allocations with GCs and random allocations triggered in between, checking that the Utf8String moved in memory but kept the same character content.

I also made sure MethodTable::IsString still returns true for System.String but not for Utf8String.

jkotas · 2018-05-03T16:15:03Z

src/vm/object.h

+    // GC will see a Utf8StringObject like this:
+    //   DWORD m_StringLength
+    //   BYTE  m_Characters[0]
+    //   DWORD m_OptionalPadding (this is an optional field and will appear based on need)


This comment about m_OptionalPadding does not make sense to me. It looks like a leftover from the time when System.String was done differently.

ghost · 2018-05-03T16:22:25Z

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

jkotas · 2018-05-03T16:29:39Z

src/vm/vars.hpp

@@ -315,7 +315,9 @@ class REF : public OBJECTREF
 #define ObjectToOBJECTREF(obj)     (OBJECTREF(obj))
 #define OBJECTREFToObject(objref)  ((objref).operator-> ())
 #define ObjectToSTRINGREF(obj)     (STRINGREF(obj))
+#define ObjectToUTF8STRINGREF(obj) (UTF8STRINGREF(obj))


This would be only needed if we were to have manually managed implementations for UTF8 string. I do not expect we are going to have any. (Except for the fast allocator - that won't need it.)

I agree - hoping we won't need this. Ideally we'd be able to keep almost all code related to this type fully managed, with the exception of allocation / GC / p/invoke.

jkotas · 2018-05-03T17:14:19Z

src/vm/gchelpers.cpp

+    if (cchStringLength > 0x7FFFFFDF)
+        ThrowOutOfMemory();
+
+    SIZE_T ObjectSize = PtrAlign(Utf8StringObject::GetSize(cchStringLength));


I think this allocates more than required.

I have noticed that we may have the same problem in regular string. For example:
new string('a', 5) allocates a 0x28-byte object on x64 today, but 0x20 bytes should be enough:

8 syncblock
8 vtable
4 size
5*2 content
2 zero terminator

Do you know why it is the case?

I have checked CoreRT. CoreRT allocates 0x20 bytes in this case, so there is definitely something pretty broken.

Isn't there some padding between size and content?

There is padding for regular arrays (so that all arrays have same layout even when elements are 8-byte aligned). There is no padding for strings. The content starts right after length. The extra unnecessary bytes are at the end.

I remember folks went to great lengths to ensure that strings do not pay the extra 4 bytes during the initial 64-bit ports. It looks like it has regressed.

It's coming from sizeof(Utf8StringObject) - there should be a pragma pack 4 around that declaration.

Presumably the same for StringObject, though I want to keep anything like that separate from this PR. We haven't yet agreed to merge this into master.

#17876 has the fix for the String overallocation problem.
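
The 0x28 versus 0x20 figures in this thread can be reproduced with a small sketch. It models one plausible way a sizeof-based computation picks up extra bytes: the struct's tail padding is counted, and the terminator is then added on top. This is an assumption for illustration, not the actual CoreCLR code; `FakeStringObject`, `PaddedSize`, `TightSize` and `PtrAlign8` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of a native string layout; field names are illustrative.
struct FakeStringObject {
    std::uint64_t methodTable;  // stands in for an 8-byte MethodTable pointer
    std::uint32_t length;
    char16_t      firstChar;
    // On mainstream ABIs the compiler pads the tail here, so sizeof is 16, not 14.
};
static_assert(sizeof(FakeStringObject) == 16, "sketch assumes tail padding to 16 bytes");

constexpr std::size_t kHeader = 8;  // assumed sync block slot preceding the object

// Round n up to the next multiple of 8.
constexpr std::size_t PtrAlign8(std::size_t n) {
    return (n + 7) & ~static_cast<std::size_t>(7);
}

// sizeof-based: tail padding is counted, then terminator and content are added.
constexpr std::size_t PaddedSize(std::size_t chars) {
    return PtrAlign8(kHeader + sizeof(FakeStringObject)
                     + sizeof(char16_t) + chars * sizeof(char16_t));
}

// Field by field: 8 header + 8 MethodTable + 4 length + content + terminator.
constexpr std::size_t TightSize(std::size_t chars) {
    return PtrAlign8(kHeader + 8 + 4 + (chars + 1) * sizeof(char16_t));
}
```

For five UTF-16 characters, `PaddedSize(5)` gives 0x28 and `TightSize(5)` gives 0x20, matching the two numbers quoted above.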

GrabYourPitchforks · 2018-05-03T19:22:25Z

src/mscorlib/src/System/Utf8String.cs

+namespace System
+{
+    // This is an experimental type and not referenced from CoreFx but needs to exists and be public so we can prototype in CoreFxLab.
+    public sealed class Utf8String


As we add interfaces (IEnumerable, IComparable, ...) to this during development, will we need to touch the VM code at all? That is, does the VM need to know about every possible member that sits on this type?

Not necessarily about interfaces. But the VM and JIT will certainly need to know more about this type to get decent performance. It goes back to my question on what you expect to get by adding this stub to CoreLib.

@jkotas You're right in that we're not going to see full performance benchmarks until JIT support comes online, but this does get us at least partway there. Even when not JIT-optimized, the supplied allocation routine will still be faster and allocate less overall heap memory than calling two different ctors for a container + separate array.

Could just be a struct wrapper with array as its single field?

@benaadams We considered this for prototype purposes. But the end result is to have the final type be a class, and any data we'd collect in the meantime from a struct stand-in would be suspect.

the supplied allocation routine will still be faster

That is not correct. The allocation routine in this PR is a super slow naive implementation. I just tried:

class MyUtf8String
{
    byte[] _payload;

    static public MyUtf8String FastAllocate(int length)
    {
        var ret = new MyUtf8String();
        ret._payload = new byte[length + 1];
        return ret;
    }
}

Run Utf8String.FastAllocate vs. MyUtf8String.FastAllocate in a loop. MyUtf8String loop is faster.

I am with Stephen that it would be best for rough early prototype code like this to stay in a branch and not be merged into master.

FWIW, Span<T> was also developed in a branch and only integrated to master once it was sufficiently along and we agreed to ship it for real: #7886.
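
The memory side of this debate (one object with inline bytes versus a wrapper object plus a separate byte array) can be estimated with a rough sketch. All figures assume a simplified 64-bit managed-heap model: 16 bytes of per-object overhead, an 8-byte reference field in the wrapper, and a 4-byte array length followed by 4 bytes of padding. `InlineBytes`, `WrapperBytes` and `Align8` are hypothetical names. The sketch only counts bytes and says nothing about allocation speed, which is the point contested above.

```cpp
#include <cassert>
#include <cstddef>

// Assumed per-object overhead: 8-byte header + 8-byte type pointer.
constexpr std::size_t kObjectOverhead = 16;

// Round n up to the next multiple of 8.
constexpr std::size_t Align8(std::size_t n) {
    return (n + 7) & ~static_cast<std::size_t>(7);
}

// One object: 4-byte length, then the bytes and a terminator stored inline.
constexpr std::size_t InlineBytes(std::size_t length) {
    return Align8(kObjectOverhead + 4 + length + 1);
}

// Wrapper object holding a single 8-byte reference to its payload array.
constexpr std::size_t WrapperObjectBytes() {
    return Align8(kObjectOverhead + 8);
}

// Separate byte array of length + 1: 4-byte length plus 4 bytes of padding.
constexpr std::size_t PayloadArrayBytes(std::size_t length) {
    return Align8(kObjectOverhead + 8 + length + 1);
}

// Two allocations in total for the wrapper design.
constexpr std::size_t WrapperBytes(std::size_t length) {
    return WrapperObjectBytes() + PayloadArrayBytes(length);
}
```

Under these assumptions a 5-byte payload costs 32 bytes inline and 56 bytes with the wrapper, a fixed difference that matters most for short strings.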

GrabYourPitchforks · 2018-05-03T19:28:18Z

src/vm/gchelpers.cpp

+
+    orObject = (Utf8StringObject *)Alloc(ObjectSize, FALSE, FALSE);
+
+    // Object is zero-init already


This includes the null terminator being zero-inited, I suppose?
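
The idea behind the question is that if the allocator returns zeroed memory and only the first `length` bytes are written afterwards, the trailing byte already acts as the terminator. A minimal sketch of that reasoning, using a value-initialized `std::vector` as a stand-in for a zeroing allocator; `AllocateZeroedUtf8` is a hypothetical helper, not code from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper that models an allocator handing back zeroed memory.
inline std::vector<std::uint8_t> AllocateZeroedUtf8(const char* bytes, std::size_t length) {
    // length + 1 value-initialized (zero) bytes; the extra byte is the terminator.
    std::vector<std::uint8_t> buffer(length + 1);
    if (length != 0) {
        // Only the first `length` bytes are written; the last byte stays zero.
        std::memcpy(buffer.data(), bytes, length);
    }
    return buffer;
}
```

No explicit write of the terminator is needed in this model, which is the property the code comment in the diff relies on.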

davidfowl · 2018-05-04T07:41:32Z

src/vm/methodtablebuilder.cpp

+            DWORD baseSize = ObjSizeOf(Utf8StringObject) + sizeof(BYTE);
+            pMT->SetBaseSize(baseSize); // NULL character included
+
+            GetHalfBakedClass()->SetBaseSizePadding(baseSize - bmtFP->NumInstanceFieldBytes);


GetHalfBakedClass: is this a real method? 🤣

~~method~~ function (it's c++) 😉

stephentoub · 2018-05-04T14:01:18Z

Is it possible to consume a CoreCLR from a dev branch into CoreFxLab?

I seem to remember we had things previously set up to be able to do this. @mmitche, @weshaggard, had you helped with that, publishing NuGet packages from another coreclr branch so they could be consumed into a dev branch in either corefx or corefxlab?

ghost · 2018-06-08T14:43:48Z

@dotnet-bot test CentOS7.1 x64 Debug Innerloop Build

ghost · 2018-06-08T16:57:58Z

@dotnet-bot test CentOS7.1 x64 Checked Innerloop Build and Test
@dotnet-bot test CentOS7.1 x64 Debug Innerloop Build

ghost · 2018-06-12T16:37:59Z

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

ghost · 2018-06-25T15:30:42Z

@dotnet-bot test Linux-musl x64 Debug Build

ghost · 2018-06-28T20:12:11Z

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

ghost · 2018-06-29T15:53:49Z

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

ghost · 2018-07-09T22:43:55Z

Will target feature branch. Opening a new PR for that.

ghost self-assigned this May 3, 2018

ghost requested review from jkotas and GrabYourPitchforks May 3, 2018 14:18

jkotas reviewed May 3, 2018

Fix Unix build breaks

fabbe5d

jkotas reviewed May 3, 2018

Incorporate PR feedback

f8639fa

jkotas reviewed May 3, 2018

GrabYourPitchforks reviewed May 3, 2018

atsushikan added 2 commits May 3, 2018 13:14

Fix allocation size & remove the manual managed code trappings

bcbbbb9

GetPinnableReference and fix x86 build

e4f9a43

davidfowl reviewed May 4, 2018

Merge with master

d026198

RussKeldorph added the area-System.Runtime label Jun 9, 2018

atsushikan added 2 commits June 11, 2018 07:35

Merge with master

493c27f

Merge with master

5eb53cc

atsushikan added 9 commits June 13, 2018 07:45

Merge with master

6cf128f

Merge with master

dd11e73

Merge with master

82ec9cb

Merge with master

4a01601

Merge with master

4ff38a1

Merge with master

fab65e9

Merge with master

d6258a4

Merge with master

21f47d3

Merge with master

4d60318

atsushikan added 3 commits June 26, 2018 07:32

Merge with master

39a4e2e

Merge with master

279a4c5

Merge with master

b247f32

Merge with master

cfe461b

atsushikan added 4 commits July 2, 2018 07:39

Merge with master

2f4ad81

Merge with master

11af9eb

Merge with master

00e7d26

Merge with master

5e9651f

ghost closed this Jul 9, 2018

ghost deleted the utf8 branch July 11, 2018 14:14

This pull request was closed.


		orObject = (Utf8StringObject *)Alloc(ObjectSize, FALSE, FALSE);

		// Object is zero-init already

Expose a Utf8String type. #17872

Expose a Utf8String type. #17872

Conversation

ghost commented May 3, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkotas commented May 3, 2018

jkotas commented May 3, 2018

ghost commented May 3, 2018

stephentoub commented May 3, 2018

ghost commented May 3, 2018

stephentoub commented May 3, 2018

jkotas commented May 3, 2018

ghost commented May 3, 2018

jkotas commented May 3, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost commented May 3, 2018

Choose a reason for hiding this comment

ghost commented May 3, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkotas May 3, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub commented May 4, 2018

ghost commented Jun 8, 2018

ghost commented Jun 8, 2018

ghost commented Jun 12, 2018

ghost commented Jun 25, 2018

ghost commented Jun 28, 2018

ghost commented Jun 29, 2018

ghost commented Jul 9, 2018

jkotas May 3, 2018 •

edited

Loading