-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Breaking change proposal: OrdinalIgnoreCase string comparison, ToUpperInvariant, and ToLowerInvariant to use ICU on all platforms #30960
Comments
What would be the size impact? |
@stephentoub To System.Private.CoreLib.dll? Around 3 KB to carry the data. Edit: This number is derived from adding 4 casing tables to |
cc @tarekgh Just for my own education, why do we carry tables at all rather than calling into the OS/ICU? Performance? Or code that must run before ICU loaded? |
There are some Unicode data Windows doesn't expose. so we had to carry such tables. When we fully switch to use ICU, we can look at not carrying these table but I don't think we can do that today as we need to run on OS versions not having ICU. |
How will sort keys work?
Will this be incompatible with |
@danmosemsft Want to make Windows 10, 1703 a minbar for .NET Core 5? :) That exposes the APIs we'd need. @scalablecory |
GetSortKey doesn't allow passing CompareOptions.OrdinalIgnoreCase. It throws. |
Here's a bit more on this, to help fill in some missing details. The Unicode Standard defines four types of "simple" case conversion for every code point: toLower, toUpper, toTitle, and toCaseFold. toLower and toUpper are fairly self-explanatory. I'll skip toTitle since it's not particularly relevant to this discussion. toCaseFold is interesting for reasons we'll see shortly. First, a quick definition. Simple case conversion simply means that a single code point case-maps to a single code point. This is contrast to full case conversion, where a single code point might case-map to multiple code points. In .NET, our ToUpper and ToLower routines perform simple case conversion, as demonstrated below: // 'ffi' is U+FB03 LATIN SMALL LIGATURE FFI
string toUpperSimple = "ffi".ToUpperInvariant();
string toUpperFull = "ffi".ToUpperInvariantFull(); // assume such a method exists
Console.WriteLine(toUpperSimple); // prints "ffi" (the single-code point string)
Console.WriteLine(toUpperFull); // prints "FFI" (the three-code point string) Unicode defines no single code point which represents a "large" version of the FFI ligature. So the ToUpperInvariant framework API no-ops on the input (since it uses simple case mapping). If we had a theoretical API which instead performed full case mapping, it would return the three-character string "FFI", as shown above. This is also reflected in Unicode's CLDR data files via the Uppercase_Mapping field in the table at https://unicode.org/cldr/utility/character.jsp?a=FB03. The big benefit of "simple" case mapping is that under UTF-16 or UTF-32, the output string has a length which exactly matches the input string's length. For example, an 8- (In UTF-8, even under "simple" case mapping, the byte length of the output string isn't necessarily equal to the byte length of the input string. The details of this aren't particularly interesting to this discussion, but I wanted to call it out here for completeness.) Now, to case folding. For various reasons, certain code points do not map to their logical uppercase or lowercase versions. For example, even though the Unicode Standard defines an uppercase To work around this, Unicode defines a case mapping called case fold mapping, which is a not-quite-uppercase / not-quite-lowercase mapping that can be applied to every scalar value within a string to generate a new string that can be compared in a case-insensitive fashion. That is, Unicode defines simple case-insensitive equality as "does So, in summary, here's what the various Framework APIs are intended to mean, along with how this proposal will affect them:
|
Am I remembering correctly that matching Windows behavior was sold as beneficial for OrdinalIgnoreCase to match how the OS compares file names and other symbols like user names? Is that still relevant if so? |
Where is this code? I am not able to find it.
I think we should change
Note that we carry this data in Unix PAL already: https://github.com/dotnet/coreclr/blob/master/src/pal/src/locale/unicodedata.cpp. There is opportunity to dedup this (on Unix at least). |
@nguerrera Somewhat, but this is becoming a relic of a time where everything takes place on a single machine. Consider what I mentioned earlier where the Windows operating system updates their internal casing information with each release. In today's world your application is almost certainly talking to a machine on a network somewhere, and it's probably the case that the machine on the other end of the line is running a different operating system from you. It's almost certainly possible to craft two strings a and b where your machine and the remote machine you're talking to disagree on whether a and b are equal under an
@jkotas It's implemented by the Win32 function
Depends on the file system. From discussions with @JeremyKuhne, NTFS can operate in a case-sensitive fashion, so it's already possible today to have two files whose names differ only in case. I'll dig up further information. |
This is not how Win32 function CompareStringOrdinal is implemented internally. To avoid confusion, it would be better to just say that this just p/Invokes Win32 on Windows and ICU elsewhere.
There are still many scenarios that take place on a single machine. For example, one can have a local disk cache - that is a perfectly reasonable thing to do even with cloud - and the implementation of the cache may depend on the CompareStringOrdinal to be in sync with the local NTFS file system for security. |
This is the thing that I don't think is correct. Say that an adversary can choose the filenames to be stored on the file system or the values of the primary keys to be stored in a database. I happen to know that you're running OS version vCurrent, but OS version vNext is releasing imminently. So I craft two strings which under vCurrent compare as not equal under a case-insensitive comparison, you happily commit them to storage, then you update to vNext and the strings suddenly start colliding. This is already a potential point of failure in today's world, even without the proposal here. |
BTW "no" is also an appropriate response to this proposal. :) Part of this proposal was to speed up |
Just trying to come up with a hypothetical worst-case example. If the on-disk cache was flushed on OS updates, this would work fine today. |
Usually the file systems snaps their own casing table to be isolated from the running system to ensure if you moved the disk from one system and mounted to other system will continue work regardless of the OS version.
Does the OS will use its own casing to flush the cache created by .NET? |
From a security perspective, as much as everybody hates to admit it, the absolute safest thing to do is to ensure that unique identifiers (user names, file names sans extension, etc.) consist only of the characters
The rules above are demonstrative and not intended to be exhaustive. Due to the Unicode stability policy, since a code point's case folding is permanently locked at the time of its introduction to the standard, blocking code points such as those listed above would be sufficient to prevent introducing data which might not collide now but which has potential to collide in the future. For applications which use |
Moving to ICU is great plan! As PowerShell Core repo maintainer I am thinking about remoting scenarios. Today users can connect from any OS to any OS (Windows, Linux, MacOs). If PowerShell will be based on .Net Core 5.0 and ICU on all OSes all remote scenarios will be "consistent by design"! It is very important because PowerShell is for management that assumes sometimes manipulating sensitive data. I guess not all projects will be ready to move to ICU so it would be nice to have opt-in on early preview and opt-out later to switch back to NLS. It is important for testing too. In PowerShell repo we could run tests for ICU and then for NLS to ensure that they pass both and our users is not break. I am interesting in using SCF in PowerShell Core for performance reasons (possible scenarios: identifier and file names comparisons, log searching...). Preserving string lengths and using table-based mapping allow us to achieve performance comparable to Ordinal. I already made SCF implementation in CoreFXlab repo. I started with 3-level table mapping as @tarekgh adviced. I found this very slow and switch to 2-level table with good results (near Ordinal!). Of course, the size of the table is no longer 3 Kb but more, but if performance is important, we could accept this. /cc @joeyaiello @SteveL-MSFT @daxian-dbw @mklement0 for information. |
Cc @mjsabby |
cc @timmydo As long as we're converging on ICU and can have it be on Windows, that sounds good. |
Follow-up question to anybody interested: for the specific APIs mentioned here ( That is, it is acceptable to say ".NET 5's implementation of these specific methods uses data from the Unicode Standard 12.1" (and later versions of the framework would be pegged to later standards)? Or is there a requirement to always use the latest available data available from the local system? There are obviously trade-offs involved, but I'd like to gauge the community's pulse on this. |
I am inclining to this option more. I am think in the near future we should fully use ICU for all needed functionality across all OS's. Also, there will be some consistency between the native and managed apps. |
What are we going to do about the Windows version that do not include ICU? I doubt that we are going to drop the support for those in near future. |
We are talking to Windows team who are going to publish a NuGet packages for ICU. These packages can be used on Windows down level versions. Also, we are going to fallback to NLS in case we couldn't find any ICU on the system. I am going to write some doc with more details and I'll share it as soon as I have it ready. |
Will InvariantCulture be the same as locale "C"/"POSIX" ? |
I am not sure what you mean by be the same, in 3.0 we already map "C" locale to InvariantCulture. "C" locale behavior is not desirable at all and that is why we did this mapping. |
@tarekgh Sorry, my question was about Windows. |
Yes, "C" on Windows is mapped to Invariant too (in net core 3.0). |
For reference, "applications" that require stable mappings, carry them themselves, even For example, NTFS stores an upcase table on volumes when it formats them. And btw this kinda contradicts: https://github.com/dotnet/corefx/issues/41333#issuecomment-535566914 You'd have to read the upcase table from the relevant volume. (and indicated in https://github.com/dotnet/corefx/issues/41333#issuecomment-535601504) |
@GrabYourPitchforks does it still make sense to make Invariant Mode OrdinalIgnoreCase not look at non-ASCII given it is in managed code? Given that we're now carrying the C# data? Should this proposal be updated? I would love to have our application run in Invariant mode but still be able to do OrdinalIgnoreCase. |
@mjsabby We're not yet carrying the C# data. We're also still doing case comparisons incorrectly, even when invariant mode is not enabled, since we're always doing a toupper conversion rather than a case folding conversion. In an ideal world, this is what I'd want to happen:
|
I would to say, it is better to define ordinal case by the casing provided in UnicodeData.txt and not to change that. In the future, we can expose case folding casing/comparison APIs. The apps will have the flexibility to choose the way want the comparison to be performed. The reason here is case folding can be more complicated and sometime users want to choose to do case folding using specific culture which will need normalization at that time. We should think about the case folding from more wider angle and not constrain us to ordinal operations. |
You're right, but I was thinking that "OrdinalIgnoreCase" comparisons would always use simple / non-culture-aware folding. This is already the recommendation from Unicode regarding how to perform caseless matching (see the case mapping FAQ and Unicode Ch. 5, Sec. 5.18). |
The only concern I have with that is Ordinal by name indicating 1:1 mapping/comparison. Case folding will not guarantee that. This is why I am preferring new APIs for case folding. Anyway, we'll visit this in later releases and we can come up with the best proposal together :-) |
I'd say that following to Unicode standard is better than splitting API while we have a switch to old - this last will confuse users and open a way to have applications with mixing old and new case folding that is my main concern. |
hi @GrabYourPitchforks , Do you have any recommendations for someone in this situation? In our situation, we used the ToUpperInvariant() with training a machine learning model. Now when we use the same model in a Linux env, when we run ToUpperInvariant(), we get different response for the same input character which impacts how the model behaves. Our model training and hosting envs are mixed mode, Windows in training, Linux and Windows in hosting. What can we do to port the Windows mapping to Linux? Can we use the case folding table in this case? Example:
On Linux/WSL
What we would like is the Linux call to also return the same char as Windows. |
@talktovishal I assume you are not using .NET 5.0. right? if so, would it be good if you can migrate to .NET 5.0 which I expect you will get the exact behavior when running on Windows or Linux. |
hi @tarekgh , Thanks for your response. Your observation is correct. We haven't migrated to .Net 5.0 yet. We have plans to do so. Do you have any recommendations for the models that are already trained with .net Fx 4.*? What do we do in that case? I want the same mappings from Windows to work in Linux. How can I achieve that for the short term while we migrate and re-train our models? |
@talktovishal I don't think there anyway guarantee to get the NETFX casing behavior when running on Linux except if you generate your own casing table from Windows and then manually doing the casing operation using the generated casing table. if you want, I can try to provide some code doing that if this is feasible option to you and how urgent you'll need that? |
@tarekgh if you can provide sample code, that would be great. We have a deadline in the end of next week and hence we are trying to figure out what are our options. Really appreciate your help and prompt response. |
@talktovishal I'll try to provide you the code in next couple of days. |
@talktovishal please try the code https://gist.github.com/tarekgh/de1a1b29b03d048ad580c0a338c44fbb and let me know if you have any question. Note, I did some quick testing on it so I hope you can do some more testing. The behavior should match the full framework behavior. usage would be something like string uppercasedString = "abcd".NetFxToUpperInvariant();
char uppercasedChar = 'a'.NetFxToUpperInvariant(); |
Wow, this is great @tarekgh ! Can you please give a summary of the approach? I will def do some testing at my end. |
I just run |
FYI, I tested your code and it appears to provide consistent results across:
I ran both your provided |
I tried the gist in .NET 6 and .NET 3.1 and found that strings "և" and "fifl" did not convert to upper case properly (I was expecting "ԵՒ" and "FIFL", respectively). Here's a repo that replicates the issue. (shows .NET 6. can provide .NET 3.1 repo if relevant) Should I expect this to work? cc @tarekgh |
@StachuDotNet the gist you pointed at was created to mimic the behavior of the casing when running on the .NET Framework and not to be used for .NET 5.0 or 6.0. The result given by the gist code is exactly the result when using ToUpperInvariant on the .NET Framework. So, what you are getting is the expected result. Also, the file Unicode standard data UnicodeData.txt is not showing these characters have casing form.
Please note, casing operation is one-to-one mapping in the .NET. |
On my system, Sample code using System.Runtime.CompilerServices;
Console.OutputEncoding = System.Text.Encoding.UTF8;
const string small = "ß";
const string capital = "ẞ";
string toUpper = small.ToUpperInvariant();
string toLower = capital.ToLowerInvariant();
Print(small);
Print(toUpper);
Print(capital);
Print(toLower);
static void Print(string s, [CallerArgumentExpression(nameof(s))] string? paramName = null)
{
Console.WriteLine($"{paramName,-10}: {s} ({(int)s.First()})");
} Output:
Can somebody explain this? |
Look at the Unicode Data file you'll see following two entries:
You will see the small letter |
Interesting. Thanks! I was mostly concerned about my results not matching the results described in the original post. Did they change the behavior some time after 2019? |
Proposal
.NET Core provides APIs to compare strings for ordinal case-insensitive equality (such as via
StringComparer.OrdinalIgnoreCase
). The current implementation of this API is to callToUpperInvariant
on each string, then compare the resulting uppercase strings for bitwise equality..NET Core also provides methods to convert
char
s,Rune
s, andstring
s to uppercase or lowercase using the "invariant" culture (ToUpperInvariant
/ToLowerInvariant
). The current implementation of this API is to p/invoke NLS on Windows or ICU on non-Windows.I propose changing the logic so that .NET Core carries its own copy of the ICU "simple" casing tables, and we consult our copies of those tables on all operating systems. This would affect string comparison only when using an
OrdinalIgnoreCase
comparison, and it would affect string casing only when usingCultureInfo.InvariantCulture
.Justification
Today, when processing UTF-8 data in a case-insensitive manner (such as via
Equals(..., OrdinalIgnoreCase)
), we must first transcode the data to UTF-16 so that it can go through the normal p/invoke routines. This transcoding and p/invoke adds unnecessary overhead. With this proposal, we'd be able to consult our own local copies of the casing table, which eliminates much of this overhead and streamlines the comparison process. This performance boost should also be applicable to existing UTF-16 APIs such asstring.ToUpperInvariant
andstring.ToLowerInvariant
since we'd be able to optimize those calls.As mentioned earlier, today's casing tables involve p/invoking NLS or ICU, depending on platform. This means that comparison / invariant casing APIs could provide different results on different operating systems. Even within the same operating system family, the casing tables can change based on OS version. (Windows 10 1703 has different casing tables than Windows 10 1903, for instance.)
Here are some concrete examples demonstrating the problems:
With this proposal, the code above will behave the same across all OSes. They would follow what is today's non-Windows behavior. They'd be locked to whatever version of the Unicode data we include in the product as part of the
CharUnicodeInfo
class. This data changes each release to reflect recent modifications to the Unicode Standard. As of this writing, the data contained within theCharUnicodeInfo
class follows the Unicode Standard 11.0.0.Breaking change discussion
Affected APIs:
string
/char
/Rune
equality methods or hash code generation routines which takeStringComparison.OrdinalIgnoreCase
as a parameter. All other comparisons are unchanged.string
/char
/Rune
case changing methods whenCultureInfo.InvariantCulture
is provided. All other cultures are unchanged.ReadOnlySpan<char>
which provide equivalent functionality to the above.StringComparer.OrdinalIgnoreCase
. All otherStringComparer
instances are unchanged.CultureInfo.InvariantCulture.TextInfo
.If
GlobalizationMode.Invariant
is specified, the behavior will be the same as it is today, where non-ASCII characters remain unchanged.Applications which depend on
OrdinalIgnoreCase
equality being stable may be affected by this proposed change. That is, if an application relies on"ß"
and"ẞ"
being not equal under anOrdinalIgnoreCase
comparer, that application is likely to experience odd behavior in the future.In general, applications cannot rely on such behavior anyway, because as previously mentioned the operating system historically has updated casing tables under the covers without the application getting a say. For example, after installing a new Windows version, a comparison which previously returned false might start returning true:
Furthermore, the string equality and case mapping information might be different between a web frontend application and the database it's using for backend storage. So performing such checks at the application level was never 100% reliable to begin with.
There is a potential oddity with this proposal: depending on operating system, two strings which compare as equal using
OrdinalIgnoreCase
might compare as not equal usingInvariantCultureIgnoreCase
. For example:I don't expect this to trip up most applications because I don't believe it to be common for an application to compare a string pair using two different comparers, but it is worth pointing out as a curious edge case.
This may also lead to a discrepancy between managed code which uses
StringComparison.OrdinalIgnoreCase
and unmanaged code (including within the runtime) which usesCompareStringOrdinal
on Windows. I cannot think offhand of any components which do this, but we need to be mindful that such a discrepancy might occur.The text was updated successfully, but these errors were encountered: