Scenarios and Design Philosophy - UTF-8 string support #2368
Comments
What is the reason to avoid validation if it's very cheap?
What is the use case for that? I've heard "reverse a string" being used as an interview question, but when is it actually useful?
Reversing a string is not the only use for reverse enumeration; it's useful if you're parsing a text file or text-based format where you know you only need data at the end. That said, in this case, it would probably be more useful to application code to enumerate the graphemes instead of the scalars... but that could be said about nearly all of the exposed APIs, and doing that starts with being able to enumerate the scalars.
To add data-persistence to the above scenarios: it would be useful if EF Core used the distinction between `String` and `Utf8String`.
@cdorst Ignoring that these methods would need to exist in the framework or in packages first for EF Core to have an opinion about them, whether that mapping is safe would depend on the database's collation and encoding, so it couldn't be assumed to be universally safe.
Thank you for the reply @JesperTreetop! That makes sense about being mindful of the db's collation/encoding & not expecting it to be universally safe.
@JesperTreetop @cdorst The typical example I give for reverse enumeration is a
If it speeds up developing this spec, I think this part should be left out. First reason being: 99% of apps are only copying and concatenating strings, and don't care about any text processing. Second reason being: of the 1% of other apps, not many of them are doing it correctly, and the ones that are doing it correctly are all duplicating code. It's probably a larger topic to discover which APIs could be made to help here, such as iterating/indexing by code point and grapheme cluster.
Reversing a string is useful when converting an integer to a string. Reversing a string is maybe useful for plain text search. Reverse is sometimes useful for file path canonicalization. Notice that reverse can often be encoding-ignorant, if the special characters being searched for are single-byte. Also, the integer-to-string case, assuming 0-9, can just be byte-wise. For some reason the reversing algorithms come to mind first, but the non-reversing algorithms are slightly more efficient.
Seeing the great work being done here (i.e. https://github.com/dotnet/corefx/issues/30503), I'm hoping that a no-copy, zero-allocation alternative/complement to `Utf8String` is still being considered.
Indeed, some sort of way to work directly on a span of UTF-8 data would help. On a related note, if we do go down this route we'd have at least three string-like types: `String`, `Utf8String`, and a span-based representation. It seems like it would be good to have some uniform way which would allow performing string operations while hiding the particular underlying implementation (a bit like how spans can be constructed over `byte[]`, `byte*`, ...).
(This is a living document. Expect edits.)
Scenarios
High-performance networking stacks
(This includes `HttpClient`, ASP.NET's Kestrel, and similar APIs.)

High-performance networking stacks want to be able to read incoming data directly into a buffer and perform text processing operations over it. This buffer is most likely going to be a `byte[]` or similar, and the operations they'll want to perform run the gamut from text tokenization (looking for HTTP headers) to full parsing (e.g., interpreting a sequence of bytes as a human-readable date).

Importantly, these stacks don't want to incur the cost of transcoding or of allocating lots of little string instances just to call methods like `String.Split` or `Int32.Parse`. We need to provide an API set that has near-parity (where sensical) with the existing `String` APIs but which can operate over arbitrary spans of UTF-8 data.

Finally, because these buffers are generally going to be represented as `byte[]`, we need to ensure that callers can perform UTF-8 operations over instances of this type without falling back to unsafe code or bouncing off the `Unsafe` or `MemoryMarshal` classes. If the developer needs to wrap the `byte[]` in a different type before calling the UTF-8 APIs, the wrapping logic should be (a) constant-time, (b) non-allocating, and (c) a single line of code at most.
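As a rough illustration of the desired call pattern, today's span extension methods and `Utf8Parser` already cover a slice of this scenario (tokenizing and parsing without materializing any `String` instances); the proposal would extend this to a much broader `String`-like API surface:

```csharp
using System;
using System.Buffers.Text;
using System.Text;

class HeaderParsingSketch
{
    static void Main()
    {
        // Stand-in for data read straight off the wire into a byte buffer.
        ReadOnlySpan<byte> buffer = Encoding.UTF8.GetBytes("Content-Length: 42\r\n");

        // Tokenize: find the delimiter and slice, with no String allocations.
        int colon = buffer.IndexOf((byte)':');
        ReadOnlySpan<byte> value = buffer.Slice(colon + 1).TrimStart((byte)' ');

        // Parse the numeric value directly from the UTF-8 bytes.
        if (Utf8Parser.TryParse(value, out int contentLength, out _))
        {
            Console.WriteLine(contentLength); // 42
        }
    }
}
```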
Interop with Go / Swift / Python
We want developers to be able to port their applications from other frameworks and have them run on .NET. Currently this may require developers to keep transcoding concerns top-of-mind due to the difference in how they ingest data from other languages (sometimes UTF-8) and how .NET's `String` type expects that data to be represented (UTF-16).

For example, consider a scenario where a Go developer persists a string (perhaps representing a JSON payload) to a file, and later wants to consume that file from .NET code. .NET's `File.ReadAllText` and `File.WriteAllText` APIs use UTF-8 by default, so the transcoding step is hidden from the developer and they have a pleasant development experience with these APIs returning `String`.

Unfortunately such goodness does not extend to other concepts in the framework. One example of a problem area is p/invoking into a library that expects string data in UTF-8 format. Using such a library is certainly achievable, but we force developers to have subject matter expertise both in Unicode transcoding concepts and in interop concepts in order to fulfill this scenario.

Another example is the existence of some frequently used concepts that other languages have (consider Go's `rune` or Swift's `Character`) that we hide behind complex APIs. If we made these APIs more approachable, developers would have greater confidence migrating this kind of code to our platform.

Cheap slicing
(This is similar to the high-performance networking stack scenario but deserves its own callout due to the prevalence of such code.)
It is extraordinarily common for developers to call `String`'s `Substring`, `Split`, and `Trim` methods in order to get substrings back from the original string. Our data show that the majority of applications call at least one of the aforementioned APIs. These APIs are particularly prevalent in parsing code paths.
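The allocation difference is easy to see with today's UTF-16 APIs; the span-based form below is the shape the proposed UTF-8 APIs would mirror:

```csharp
using System;

class SlicingSketch
{
    static void Main()
    {
        string line = "name = Jane Doe ";

        // Substring and Trim each allocate a new String instance.
        string allocated = line.Substring(line.IndexOf('=') + 1).Trim();

        // Slicing a span over the same data allocates nothing.
        ReadOnlySpan<char> sliced = line.AsSpan(line.IndexOf('=') + 1).Trim();

        Console.WriteLine(allocated);         // "Jane Doe"
        Console.WriteLine(sliced.ToString()); // "Jane Doe" (copied here only for display)
    }
}
```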
More compact memory representation
Evidence shows that most of the data present in `String` instances is ASCII. This is due to a number of factors, including the prevalence of Latin-based text on the web. Even in applications that cater to Chinese audiences and other speakers whose languages don't use Latin-based characters, we find that things like OS identifiers are predominantly English. Since UTF-8 ASCII text takes half the memory size of UTF-16 ASCII text, it stands to reason that changing the internal representation of this data can provide significant savings. (Our internal tests over first-party code have borne out this theory.)
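A quick way to see the size difference for a typical ASCII-only payload:

```csharp
using System;
using System.Text;

class SizeComparison
{
    static void Main()
    {
        string osIdentifier = "Microsoft Windows NT 10.0.17763.0"; // 33 ASCII characters

        Console.WriteLine(Encoding.Unicode.GetByteCount(osIdentifier)); // 66 bytes as UTF-16
        Console.WriteLine(Encoding.UTF8.GetByteCount(osIdentifier));    // 33 bytes as UTF-8
    }
}
```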
Reducing the total allocation cost of strings has other beneficial side effects. Oracle's own experiments showed that when they added the compact string feature to Java 9, the reduced memory footprint allowed them to reduce the frequency of garbage collection events, and the time spent in each event decreased (reference, PDF link).
We're not ready to change the internal representation of our `String` type, but the `Utf8String` feature does allow developers to use a new API that follows familiar behaviors and patterns while also reducing overall load on the managed runtime.

Serialization / deserialization
Some serializers (JSON, XML) are defined to produce textual data instead of binary data. However, in some situations we know the output of these serializers is going to be written to the wire, and it would improve our runtime performance if we're able to write to the expected wire format (generally UTF-8) directly rather than go through a serialize-then-transcode process.
APIs like `StreamWriter` also fall under this general umbrella. The most common encoding to provide to the `StreamWriter` constructor is UTF-8 (this is also the default encoding), which means that any input passed to any of the writer's `Write` APIs needs first to be converted to a `String`, then converted to binary, then copied to the output buffer. This is the case even for string literals passed to the APIs! We should instead ensure that the string data is already in the correct encoding format before it's passed to the APIs, and at that point we have a simple memcpy operation.
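A sketch of the difference; only the `StreamWriter` path exists today, and the pre-encoded `byte[]` stands in for a hypothetical UTF-8 string literal:

```csharp
using System;
using System.IO;
using System.Text;

class WriterSketch
{
    static void Main()
    {
        using var stream = new MemoryStream();

        // Today: the UTF-16 literal is transcoded to UTF-8 on every Write call.
        using (var writer = new StreamWriter(stream, Encoding.UTF8, bufferSize: 1024, leaveOpen: true))
        {
            writer.Write("{\"status\":\"ok\"}");
        }

        // If the payload were already UTF-8 at rest (e.g. a UTF-8 string literal),
        // the write could reduce to a straight memory copy.
        byte[] utf8Payload = Encoding.UTF8.GetBytes("{\"status\":\"ok\"}");
        stream.Write(utf8Payload, 0, utf8Payload.Length);
    }
}
```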
Other candidates for optimization here include ASP.NET's Razor web pages, which operate analogously to `StreamWriter`. Large chunks of the data written to these pages are literal strings, and having first-class support for a UTF-8 string type would help them avoid unnecessary runtime transcoding.
Interop with code and services
We've been approached by first-party and third-party clients who are consuming SDKs which expect all text data to be in UTF-8 format. That is, they're removing their dependencies on consuming `wchar_t*`, instead relying on `utf8char_t*` (aliased as `char*`). Giving developers the option to use `Utf8String` instead of `String` in their code minimizes the overhead of p/invoke and data exchange between the managed and unmanaged worlds.
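Today's interop path already allows opting into UTF-8 marshalling, but the developer has to know the right attribute up front, and the marshaller still transcodes on every call. The library and entry-point names below are hypothetical:

```csharp
using System.Runtime.InteropServices;

static class NativeSdk
{
    // "nativesdk" and "sdk_set_display_name" are placeholder names for illustration.
    // The marshaller transcodes the UTF-16 String argument to UTF-8 on every call;
    // a UTF-8 string argument could be passed through without conversion.
    [DllImport("nativesdk")]
    internal static extern void sdk_set_display_name(
        [MarshalAs(UnmanagedType.LPUTF8Str)] string displayName);
}
```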
ML.NET
The ML.NET team currently has the following scenario.

1. Text data is read into `std::wstring` instances. (memcpy + transcode)
2. The `wchar_t*` is fetched and copied into a managed `String` instance. (memcpy)
3. `DvText` is used to cheaply slice and parse the string. This is used for tokenization and general primitive type parsing.
4. The parsed substrings are copied into `std::wstring` instances. (memcpy of substrings)
5. The `std::wstring` instances are combined into a single binary payload in a format expected by Python. (memcpy + transcode)

We can eliminate the transcoding step and provide a type which supports cheap slicing, which should cut down significantly on the overhead the team is seeing in their scenarios. Additionally, we have the opportunity to eliminate the initial memcpy step entirely if that also proves to be a bottleneck.
Design philosophy
Usage, usability, and behaviors
We find ourselves with two conflicting goals. The first goal is performance above all else: fill a buffer with inbound network data, `reinterpret_cast` it as UTF-8 data, and operate on it. Network protocol stacks are the big consumer here. This can be achieved by providing UTF-8 manipulation methods which operate directly on spans, which has the added benefit of allowing the consumer to remain in full control of all memory allocations.

The second goal is to provide a friendly, usable API surface on an object which represents UTF-8 data. This helps with application migration, data exchange, and using UTF-8 SDKs. Most (but not all) comparable languages represent strings as immutable objects with their own dedicated backing memory. For example, in the Go language, the `string([]byte) -> string` API converts a byte array to a string, but it does so by making a copy of the underlying data. The returned object has an independent lifetime from the original input array. (In Swift and C++, strings are copy-by-value.)

This implies that for usability's sake we should have a `Utf8String` type which mimics the behavior developers have come to expect from `String`: it's an immutable object which holds on to its own copy of the data, and you're able to retrieve the underlying read-only span from the object. This provides something of a universal exchange type, as APIs which need to hydrate a standalone instance of data can return an instance of this type, and developers can always go from `Utf8String` to span if they need access to the more powerful span-only APIs.

Since we have an immutable reference type, we can also make certain optimizations like ensuring a null terminator (important for p/invoke scenarios) or repurposing flags in the object header to improve the performance of string inspection or manipulation operations.
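A minimal sketch of that behavior, shown with a stand-in class rather than the proposed type: the constructor copies its input, and callers can always get back to a read-only span.

```csharp
using System;

public sealed class Utf8StringSketch
{
    private readonly byte[] _data;

    public Utf8StringSketch(ReadOnlySpan<byte> utf8)
    {
        // Copy the input so the instance's lifetime is independent of the
        // caller's buffer, mirroring Go's string([]byte) conversion.
        _data = utf8.ToArray();
    }

    // Callers can always drop back down to the span-based APIs.
    public ReadOnlySpan<byte> AsSpan() => _data;

    public int Length => _data.Length;
}
```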
Ideally at some point in the future we can have full globalization support for UTF-8 sequences, including culture-aware sorting and case conversion routines. This will likely require a sizeable change to the globalization APIs, so it's possible that such a feature would be several versions out. We should at minimum support limited globalization-related operations on UTF-8 sequences, including `Ordinal` and `OrdinalIgnoreCase` comparisons, `ToUpperInvariant` and friends, and allowing the invariant culture to be passed to `ToUtf8String` APIs.

Performance
`Utf8String` should have similar complexity characteristics to `String`: constant-time indexing, linear-time allocation and searching, etc. For marshaling, we may wish to consider similar optimizations as exist on `String`, e.g., stack-copying small objects rather than pinning the object in the managed heap. It is not a goal to provide constant-time indexing of scalar values or graphemes within either a `String` or a `Utf8String`.

While `Utf8String` is useful for representing incoming UTF-8 data without the need for transcoding, it does still incur the cost of an allocation per instance. As part of this work we may want to consider making `StringSlice` or `Utf8StringSlice` first-class types in the framework. One could imagine these types as being thin wrappers (perhaps aliases?) for `ReadOnlyMemory<char>` and `ReadOnlyMemory<Char8>` along with most (but not all) of the instance methods on `String` and `Utf8String`.
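One possible shape for such a slice type, sketched here over `ReadOnlyMemory<byte>` because `Char8` doesn't exist yet; the name and members are illustrative only:

```csharp
using System;

public readonly struct Utf8StringSliceSketch
{
    private readonly ReadOnlyMemory<byte> _utf8;

    public Utf8StringSliceSketch(ReadOnlyMemory<byte> utf8) => _utf8 = utf8;

    public int Length => _utf8.Length;            // length in UTF-8 code units
    public ReadOnlySpan<byte> Span => _utf8.Span; // zero-cost view of the data

    // Slicing is constant-time and allocation-free, unlike String.Substring.
    public Utf8StringSliceSketch Slice(int start, int length)
        => new Utf8StringSliceSketch(_utf8.Slice(start, length));
}
```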
Security and validation
UTF-8 processing has traditionally been a source of security vulnerabilities for applications and frameworks. There are subtleties with data processing that commonly lead to buffer overflows or exceptions in unexpected places.
.NET applications have historically not been subject to these same vulnerabilities because of our internal representation of strings as UTF-16. It's generally difficult for ill-formed UTF-16 sequences to make their way into the system because client-submitted data on the wire is normally in UTF-8 format, and the conversion process from UTF-8 to UTF-16 will naturally replace invalid sequences with a replacement character. When vulnerabilities have been found, the culprit has generally been serializers like JSON readers which blindly splat "\uXXXX" sequences into a `String` rather than go through a proper encode / decode routine.

UTF-8 is much more prone to misuse due to the fact that remote client input is already expected to be in UTF-8 format. Since there's no need for transcoding, there's a greater temptation to reinterpret-cast the provided data directly into a UTF-8 container without running it through a verifier. This behavior generally leads to problems like those mentioned in the earlier CVE link. Therefore, as much as possible, we should strive to ensure that `Utf8String` instances represent well-formed UTF-8 data, where well-formed is defined in The Unicode Standard, Chapter 3, Table 3-7.
Any `Utf8String` factory (where "factory" is anything that returns a `Utf8String`, including constructors) should perform validation on its inputs, replacing ill-formed sequences with the replacement character `U+FFFD`. The validation logic should be compatible with the `UTF8Encoding` class used by the full framework.

There are a handful of exceptions to this rule. Some callers may know that the input data is already well-formed, perhaps because it has been loaded from a trusted source (like a resource string) or because it has already been validated. There must be "no-validate" equivalents of the factories to allow the caller to avoid the performance hit.
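For reference, both behaviors can be observed with today's encoding APIs, which is roughly what a validating factory would do internally (illustrative only):

```csharp
using System;
using System.Text;

class ValidationSketch
{
    static void Main()
    {
        byte[] illFormed = { 0x61, 0xC0, 0xAF, 0x62 }; // 'a', ill-formed bytes, 'b'

        // Replacement behavior: ill-formed sequences decode as U+FFFD.
        string replaced = Encoding.UTF8.GetString(illFormed);
        Console.WriteLine(replaced.IndexOf('\uFFFD') >= 0); // True

        // Strict behavior: a throwing decoder rejects ill-formed input outright.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetString(illFormed);
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("not well-formed UTF-8");
        }
    }
}
```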
In a nutshell, though we do not require `Utf8String` instances to be well-formed, our APIs should encourage this as much as possible. APIs which operate on `Utf8String` instances must be prepared to handle ill-formed input and should behave predictably. For example, the enumerator over `Utf8String.AsScalars()` must have well-defined behavior for all possible inputs.

One interesting consideration for validation by default is `Substring` and related APIs. While it's true that this could theoretically be used to split a `Utf8String` in the middle of a multibyte sequence, in practice developers tend to use this API in a safe fashion. Consider the following two examples.
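(Both sketches below assume the proposed `Utf8String` API shape; the data-producing helpers and the `IndexOf` overload shown are hypothetical placeholders.)

```csharp
// Example 1: substring starting at an index found by searching for an ASCII
// delimiter. The delimiter is a whole scalar, so the split lands on a scalar boundary.
Utf8String header = GetHeaderValue();               // hypothetical source of data
int index = header.IndexOf(':');
Utf8String value = header.Substring(index + 1);

// Example 2: stripping a well-formed literal prefix whose length is known.
Utf8String path = GetRequestPath();                  // hypothetical source of data
Utf8String remainder = path.Substring("/api/".Length);
```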
In both cases, the string is split at a proper scalar boundary due to the fact that the target string is well-formed. And since the target string is almost always a literal (or itself a `Utf8String`, which we encourage to be well-formed), the split string will likewise be well-formed. Since this represents the typical use case of `Substring`, we can optimistically avoid validation on this and related calls. (Though if we wanted to perform validation, that's easy enough to do very cheaply.)
Validation and inspection
We should expose APIs that allow developers to gather useful information about UTF-8 sequences (not just `Utf8String` instances), including validation, transcoding, and enumeration of these sequences. There are three kinds of enumeration that are useful for both UTF-16 strings and UTF-8 strings.

- Code unit enumeration (`Char8` or `char`) - Provides access to the raw bit data of the string.
- Scalar enumeration (`UnicodeScalar`) - Provides access to the decoded data of the string. Can be used for transcoding purposes or to make ordinal comparisons between strings of different representations.
- Grapheme enumeration - Provides access to the display characters (text elements) of the string.

The APIs we provide should be powerful and low-level enough for developers to build their own higher-level APIs on top, adding value where those developers see fit. As a concrete example, we needn't provide an API which says "the next scalar in the input string is CYRILLIC SMALL LETTER IOTIFIED A". But we should have an API which allows the developer to see that the next scalar in the input string is `U+A657`, allowing the developer to build their own higher-level API which then maps `U+A657` to "CYRILLIC SMALL LETTER IOTIFIED A" (see code chart PDF).
Open question: should the framework provide a text element / grapheme enumerator? Or does it perhaps fall into the "separate component provides this facility using our lower-level APIs as implementation details" category?

Code units, `byte`, `Char8`, and `UnicodeScalar`
First, some quick definitions:
Code unit: The fundamental data type of a string. The code unit for UTF-16 text is a 16-bit integral type (`char`, distinct from `ushort` and `short`). The code unit for UTF-8 text is an 8-bit integral type (tentatively `Char8`, distinct from `byte` and `sbyte`).

Code point: Any value in the Unicode codespace (`U+0000..U+10FFFF`). Not all code points have representations in UTF text; for example, the code points `U+D800..U+DFFF` are reserved exclusively for UTF-16 text and make sense only when combined to form a scalar value.

Scalar value: Any value in the range [`U+0000..U+D7FF`], inclusive; or [`U+E000..U+10FFFF`], inclusive. In other words, the set of all code points minus the set of all UTF-16 surrogate code points. All scalar values have unique representations in UTF-8, UTF-16, and UTF-32 text. Well-formed UTF-8, UTF-16, and UTF-32 text is defined as a sequence of scalar values which have been properly encoded (into one or more code units per scalar value) per the UTF being targeted.

Grapheme: A single display character which may be composed of one or more scalar values. For example, the "woman firefighter" emoji is a single grapheme which consists of the three-scalar sequence [ `U+1F469` (woman), `U+200D` (zero width joiner), `U+1F692` (fire engine) ]. A more layman's way to think of a grapheme is in the context of a text editor: if the user hits the backspace key, what symbols would they reasonably expect to disappear from the screen? (We don't consider in-box grapheme support in this proposal.)
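The "woman firefighter" example makes these definitions concrete: one grapheme, three scalar values, and a different number of code units depending on the encoding.

```csharp
using System;
using System.Text;

class CodeUnitCounts
{
    static void Main()
    {
        // U+1F469 (woman) + U+200D (zero width joiner) + U+1F692 (fire engine)
        string womanFirefighter = "\U0001F469\u200D\U0001F692";

        Console.WriteLine(womanFirefighter.Length);                      // 5 UTF-16 code units
        Console.WriteLine(Encoding.UTF8.GetByteCount(womanFirefighter)); // 11 UTF-8 code units
    }
}
```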
High-performance scenarios require that we add APIs which operate on UTF-8 text in the form of spans. The simplest way to do this is to add UTF-8 extension methods on `ReadOnlySpan<byte>`, but this comes with two big problems. First, containers of `byte` are primarily used as an exchange for binary blob data. Any extension methods on `ReadOnlySpan<byte>` will show up both for spans that contain UTF-8 text and for spans that contain binary data. Additionally, managed type systems tend to draw a strong distinction between objects which contain different types of data (and developers generally depend on compile-time type checks to catch mistakes), and we don't want to subvert the type system in this manner.
Our solution to this is to introduce an integral `Char8` type to represent the code unit of UTF-8 textual data. Just as `ReadOnlySpan<char>` (represents a UTF-16 string) is distinct from `ReadOnlySpan<ushort>` (represents integral data), `ReadOnlySpan<Char8>` (represents a UTF-8 string) is distinct from `ReadOnlySpan<byte>` (represents binary data). Any span-based extension methods we create will take `ReadOnlySpan<Char8>` as the `this` parameter.

To support this concept, we'll also need to add `AsUtf8(this ReadOnlySpan<byte>) : ReadOnlySpan<Char8>` and `AsBytes(this ReadOnlySpan<Char8>) : ReadOnlySpan<byte>` extension methods. This does technically allow subverting the type system in the sense that it allows reinterpreting textual data as binary data and vice versa, but it has the advantage that the call site makes very clear that the developer is changing the representation of the data between text and binary.
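Assuming `Char8` ends up as a one-byte struct, these reinterpreting extensions could plausibly be built on `MemoryMarshal.Cast`, which keeps them constant-time and allocation-free. A sketch with a placeholder `Char8`:

```csharp
using System;
using System.Runtime.InteropServices;

public readonly struct Char8   // placeholder for the proposed 1-byte code unit type
{
    private readonly byte _value;
    public Char8(byte value) => _value = value;
}

public static class Utf8SpanExtensions
{
    // Reinterprets binary data as UTF-8 text; no copy, no allocation.
    public static ReadOnlySpan<Char8> AsUtf8(this ReadOnlySpan<byte> bytes)
        => MemoryMarshal.Cast<byte, Char8>(bytes);

    // Reinterprets UTF-8 text as binary data; no copy, no allocation.
    public static ReadOnlySpan<byte> AsBytes(this ReadOnlySpan<Char8> text)
        => MemoryMarshal.Cast<Char8, byte>(text);
}
```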
Finally, we introduce a `UnicodeScalar` type that can contain any valid Unicode scalar value. Instances of this type can be converted to any representation (UTF-8 / UTF-16 / UTF-32). Overloads of `Utf8String.IndexOf` and `Utf8String.Contains` take instances of `UnicodeScalar` in place of `Char8` because a single `Char8` in isolation can really only represent an ASCII character. In addition, we'll provide APIs for enumerating `UnicodeScalar` instances from a UTF-8 sequence both forward and reverse. (We can extend this support to enumerating scalars from UTF-16 sequences in the future if demand exists.)
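For a concrete feel of scalar enumeration over UTF-8, `System.Text.Rune` (a later .NET addition that closely matches the `UnicodeScalar` concept described here) can walk a buffer forward; `Rune.DecodeLastFromUtf8` covers the reverse direction:

```csharp
using System;
using System.Text;

class RuneEnumerationSketch
{
    static void Main()
    {
        ReadOnlySpan<byte> utf8 = Encoding.UTF8.GetBytes("a\uA657\U0001F692");

        while (!utf8.IsEmpty)
        {
            // Decode the next scalar value and advance past its code units.
            Rune.DecodeFromUtf8(utf8, out Rune scalar, out int bytesConsumed);
            Console.WriteLine($"U+{scalar.Value:X4}");
            utf8 = utf8.Slice(bytesConsumed);
        }
        // Output: U+0061, U+A657, U+1F692
    }
}
```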