StringComparer.CurrentCultureIgnoreCase does not ignore case when LANG=C.UTF-8 #27376

Closed
rjmholt opened this issue Sep 12, 2018 · 13 comments

rjmholt commented Sep 12, 2018

See also: PowerShell/PowerShell#7761

For a small repro:

  • Run dotnet new console
  • Put the following in Program.cs:
    using System;
    using System.Collections;
    
    namespace LangExample
    {
        class Program
        {
            static void Main(string[] args)
            {
                if (StringComparer.CurrentCultureIgnoreCase.Compare("String", "string") != 0)
                {
                    throw new Exception("Hashtable key comparison is case-sensitive");
                }
    
                Console.WriteLine("Working as expected");
            }
        }
    }
  • Run dotnet run and get output:
    Working as expected
    
  • Run LANG=C.UTF-8 dotnet run and get output:
    
    Unhandled Exception: System.Exception: Hashtable key comparison is case-sensitive
       at LangExample.Program.Main(String[] args) in /home/rob/Documents/Dev/sandbox/LangExample/Program.cs:line 18
    

Expected Behaviour (with LANG=C.UTF-8):

StringComparer.CurrentCultureIgnoreCase.Compare("String", "string") == 0

Actual Behaviour (with LANG=C.UTF-8):

StringComparer.CurrentCultureIgnoreCase.Compare("String", "string") != 0

Runtime information:

> dotnet --version
2.1.401

> uname -a
Linux chronos 4.15.0-33-generic dotnet/runtime#13856-Ubuntu SMP Wed Aug 15 16:00:05 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

> cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"

Given that C.UTF-8 is a Debian locale, this is probably also the case on other Debian-based Linux distros.
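To narrow down which comparers are affected, a small probe along the following lines may help. It is only a sketch: the `ComparerProbe` class and `TreatsAsEqual` helper are hypothetical names introduced here, and the culture-dependent results are deliberately not annotated because they depend on the LANG value and the ICU data on the machine.

```csharp
using System;
using System.Globalization;

public static class ComparerProbe
{
    // Hypothetical helper: true when the comparer treats the two strings as equal.
    public static bool TreatsAsEqual(StringComparer comparer, string a, string b)
    {
        return comparer.Compare(a, b) == 0;
    }

    public static void Main()
    {
        // Name of the culture the runtime resolved from the environment
        // (an empty name means the invariant culture).
        Console.WriteLine("CurrentCulture: '" + CultureInfo.CurrentCulture.Name + "'");

        // Culture-dependent comparers: results may vary with LANG.
        Console.WriteLine("CurrentCultureIgnoreCase:   " +
            TreatsAsEqual(StringComparer.CurrentCultureIgnoreCase, "String", "string"));
        Console.WriteLine("InvariantCultureIgnoreCase: " +
            TreatsAsEqual(StringComparer.InvariantCultureIgnoreCase, "String", "string"));

        // Ordinal comparer: does not consult the current culture.
        Console.WriteLine("OrdinalIgnoreCase:          " +
            TreatsAsEqual(StringComparer.OrdinalIgnoreCase, "String", "string"));
    }
}
```

Running it once with plain `dotnet run` and once with `LANG=C.UTF-8 dotnet run` should show which lines change between the two environments.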

Member

tarekgh commented Sep 12, 2018

The collation behavior of the "C" and "POSIX" locales is case-sensitive. We encourage not using these locales if you are performing any sorting operations. See https://github.com/dotnet/corefx/issues/28611 for more info.

The code comment at https://github.com/dotnet/coreclr/blob/master/src/corefx/System.Globalization.Native/locale.cpp#L128 is about POSIX, but it applies to C too.

The PostgreSQL documentation at https://www.postgresql.org/docs/10/static/collation.html says the same thing:

23.2.2.1. Standard Collations
On all platforms, the collations named default, C, and POSIX are available. Additional collations may be available depending on operating system support. The default collation selects the LC_COLLATE and LC_CTYPE values specified at database creation time. The C and POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “A” through “Z” are treated as letters, and sorting is done strictly by character code byte values.

@tarekgh tarekgh closed this as completed Sep 12, 2018
Member

tarekgh commented Oct 8, 2018

dotnet/docs#8179

Member

wfurt commented Mar 2, 2019

This still seems wrong, @tarekgh. I can see that LANG could affect the default comparison, but I would expect functions that are explicitly asked to "IgnoreCase" to do a case-insensitive comparison.

Member

tarekgh commented Mar 3, 2019

@wfurt did you read the comment https://github.com/dotnet/corefx/issues/32250#issuecomment-420749205? It has the explanation. If you don't like the behavior, switch away from POSIX and C.
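One way to switch away from POSIX and C without touching the environment might be to pick a culture explicitly inside the process. The sketch below assumes that "en-US" culture data is available to the runtime (for example, that ICU is installed and globalization invariant mode is off); `CultureOverride` and `CompareIgnoreCaseIn` are hypothetical names used only for illustration.

```csharp
using System;
using System.Globalization;

public static class CultureOverride
{
    // Hypothetical helper: case-insensitive comparison under a named culture,
    // independent of whatever culture the process inherited from LANG.
    public static int CompareIgnoreCaseIn(string cultureName, string a, string b)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return string.Compare(a, b, true, culture);
    }

    public static void Main()
    {
        // Option 1: pass the culture explicitly at the call site.
        Console.WriteLine(CompareIgnoreCaseIn("en-US", "String", "string"));

        // Option 2: replace the current culture for this thread, so that
        // StringComparer.CurrentCultureIgnoreCase uses it from here on.
        CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("en-US");
        Console.WriteLine(
            StringComparer.CurrentCultureIgnoreCase.Compare("String", "string"));
    }
}
```

Whether overriding the culture in code is preferable to setting a different LANG for the process depends on who controls the deployment environment.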

Member

wfurt commented Mar 4, 2019

Yes, I did read through it, and I think the argument is not correct.
I also tested the native libc implementation, and strcasecmp() keeps comparing in a case-insensitive manner regardless of LANG.
I did not find any documentation suggesting that CurrentCultureIgnoreCase may or may not perform case-insensitive matching.
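For code that only needs a LANG-independent, case-insensitive match, similar in spirit to the strcasecmp() behavior observed above, the ordinal comparers are probably the closest fit, since StringComparer.OrdinalIgnoreCase does not consult the current culture. A minimal sketch follows; the `OrdinalLookup` class and `CreateCaseInsensitiveMap` method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class OrdinalLookup
{
    // Hypothetical factory: a dictionary whose keys are matched
    // case-insensitively without consulting the current culture.
    public static Dictionary<string, int> CreateCaseInsensitiveMap()
    {
        return new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        Dictionary<string, int> map = CreateCaseInsensitiveMap();
        map["String"] = 1;

        // The key is found despite the different casing.
        Console.WriteLine(map.ContainsKey("string")); // True

        // Direct comparison with the same comparer.
        Console.WriteLine(StringComparer.OrdinalIgnoreCase.Compare("String", "string")); // 0
    }
}
```

Ordinal comparison is not a substitute for linguistic sorting, so this fits lookups and equality checks better than user-visible ordering.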

Member

wfurt commented Mar 4, 2019

https://wiki.musl-libc.org/functional-differences-from-glibc.html
musl-based Linux distributions such as Alpine are different from other distributions. The "normal" ways to generate or change the locale do not quite work.
https://bugs.alpinelinux.org/issues/7374

@danmoseley danmoseley reopened this Jun 19, 2019
@danmoseley
Member

@tarekgh what are your thoughts about @wfurt's comment above?

I'm guessing most .NET code out there relies somewhere on case-insensitive comparison, so it seems quite odd if WSL is truly configured such that this is impossible. Should they set a different LANG? (Or maybe this is different in the new WSL2.)

@danmoseley
Member

I see @tarekgh already updated docs which is good 😄 dotnet/docs#8179

Just curious about @wfurt's point.

Member

tarekgh commented Jun 19, 2019

To summarize/clarify,

When the culture is set to "C", ICU maps it to en_us_posix, and the POSIX locale definitely doesn't support case-insensitive comparisons. We just behave as ICU does. So .NET Core's behavior matches ICU 100%, and ICU is the main globalization component on Linux. In short, we are not defining how the "C" locale works; we just do whatever ICU does.

Now, if we need to do something here, we could change the mapping of "C" to something other than the POSIX locale. This is something of a breaking change, as we have no idea whether people already depend on this behavior. The POSIX locale behaves this way because people wanted to compare strings as binary (regardless of whether a case-sensitivity option is used).

Let me know what you think.

CC @jkotas @stephentoub

@danmoseley
Member

I see. In regular distros/shells, do they typically have different choices for LANG? I wonder why WSL does this, it seems like a large gotcha.

Member

tarekgh commented Jun 19, 2019

I can follow up with the WSL team about that, but in the Linux world, think of "C" as the invariant culture.

Member

wfurt commented Jun 19, 2019

Alpine is also like this. There really is no locale other than C. (see #962)

@danmoseley
Member

I am happy to close this and defer to experts. It would be great to circle back with WSL team's response. Thanks guys.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 15, 2020