
Non breaking space and breaking space #288

Closed
pergardebrink opened this issue Aug 20, 2014 · 12 comments · Fixed by #656

Comments

@pergardebrink

I'm new to both version 0.1.1 and 1.0.0-alpha and have never used jQuery Globalize before, so I might have misunderstood something, but I have a small issue with version 1.0.0-alpha5 (I plan to move to it from 0.1.1 once it's stable).

In the previous version (0.1.1), if I run the following code:

Globalize.culture('sv-SE'); // globalize.culture.sv-SE.js is loaded
var number = Globalize.parseFloat("123 456,78"); // Gives 123456.78 as expected
var number2 = Globalize.parseFloat("123" + String.fromCharCode(160) + "456,78"); // Gives 123456.78 also as expected

But if I run the following code in 1.0.0-alpha5:

Globalize.locale('sv'); // numbers module, cldr loaded and main/sv/numbers.json is also loaded
var number = Globalize.parseNumber("123 456,78"); // NaN
var number2 = Globalize.parseNumber("123" + String.fromCharCode(160) + "456,78"); // Gives 123456.78

Since an end user probably (definitely) won't type the space as a non-breaking space, any conversion will fail. If I change the group property in numbers.json to a breaking space, then of course the parse will work, but then the value provided by my server won't parse, since I use .NET to format my number with the Swedish culture:
(C#)

var number = 123456.78;
var culture = CultureInfo.CreateSpecificCulture("sv-SE");
number.ToString("N", culture); // Gives 123 456,78 with a non breaking space as a thousand separator
@rxaviers
Member

Hi @pergardebrink, thanks for your clear description.

As you have pointed out, Globalize deduces the grouping separator symbol from the CLDR content. Therefore, all it "knows" comes from that data set. If the sv grouping separator is defined as character 160 (non-breaking space), that's what it's going to use.

On Globalize, we make sure this will always be true:

var sv = Globalize("sv");
sv.parseNumber(sv.formatNumber(123456.78)) === 123456.78; // true

We don't have any specific rules/conditions in the parser code like "if the grouping separator is 160, also try 32". As of now, I think the current behavior is correct.

@scottgonzalez, @jzaefferer, @srl295 any ideas?

Anyway, if you want to allow users to input 32 (breaking space) as an alternative grouping separator, which I agree makes sense in your case, this could be used:

var sanitizedInput = "123 456,78".replace( /\x20/g, "\xa0" ); // 0x20 is the hex for 32 (space), 0xa0 is the hex for 160 (no-break space). A global regex is needed: a string argument would replace only the first occurrence.
sv.parseNumber( sanitizedInput );

TR35 defines this: (link)

For the sign, decimal separator, percent, and per mille, use a set of
all possible characters that can serve those functions. For example, the
decimal separator set could include all of [.,']. (The actual set of
characters can be derived from the number symbols in the By-Type charts
[ByType], which list all of the values in CLDR.) To disambiguate, the
decimal separator for the locale must be removed from the "ignore" set,
and the grouping separator for the locale must be removed from the
decimal separator set. The same principle applies to all sets and
symbols: any symbol must appear in at most one set.

Although we don't fully implement these heuristics in Globalize (it doesn't parse the number string using all loaded grouping separators, only the locale's own), note that even implementing that would not solve your problem, because no language defines 32 (breaking space) as a grouping separator.

@pergardebrink
Author

Yes, I'll probably have to use some sort of sanitization as you suggest. My application will use the culture that the user specifies (from a list of all .NET-supported cultures), so there are probably cultures other than Swedish that specify a non-breaking space as the grouping character.

(The reason I opened this issue was that version 0.1.1 allowed me to use both non-breaking and breaking spaces, and I was curious whether that was a bug or not.)

Thanks for the quick reply!

@rxaviers
Member

Let's wait for input from the people cc'ed above. I'm open to suggestions, but I very much dislike including if/else specifics or exceptions with hardcoded content.


@pergardebrink
Author

I've been thinking and reading since yesterday, and I think that Globalize really should support both non-breaking and breaking spaces, even if the CLDR specifies a non-breaking space as the grouping character.

I think that most developers unfamiliar with cultures that use a space as a grouping separator won't know this until they are hit by the first bug report from a Swedish or French end user (or one from any other locale that uses it).

I've found some info on unicode.org suggesting that you should use a more "lenient parsing": if the grouping character is a non-breaking space, all whitespace characters should match.
http://unicode.org/reports/tr35/#Loose_Matching

  • Normalize to NFKC; thus no-break space will map to space; half-width katakana will map to full-width.
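The lenient-parsing idea quoted above can be sketched as follows. This is an illustrative helper, not Globalize's API: it normalizes both the input and the locale's grouping separator to NFKC before stripping groups, so U+00A0 and U+0020 match each other. It assumes a comma decimal separator, as in sv.

```javascript
// Hedged sketch of "lenient parsing" via NFKC (not Globalize code).
// Both the input and the grouping separator are normalized, so a typed
// U+0020 SPACE matches a CLDR-defined U+00A0 NO-BREAK SPACE.
function lenientParse(input, groupSeparator) {
  var nInput = input.normalize("NFKC");
  var nGroup = groupSeparator.normalize("NFKC");
  // Strip all grouping separators, then swap the decimal comma for a dot
  // (assumes a single comma decimal separator, as in sv).
  var cleaned = nInput.split(nGroup).join("").replace(",", ".");
  return parseFloat(cleaned);
}

lenientParse("123\u00a0456,78", "\u00a0"); // 123456.78
lenientParse("123 456,78", "\u00a0");      // 123456.78, both normalize to " "
```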

@rxaviers
Member

Excellent. So, let's do it.

@rxaviers
Member

The documentation led me to the questions below. I have sent them to the CLDR mailing list and will update here as I get replies.

If anyone knows the answers, please just let me know.

7.2 Loose Matching
Loose matching ignores attributes of the strings being compared that are not important to matching. It involves the following steps:

  • Remove "." from currency symbols and other fields used for matching, and also from the input string unless:
    • "." is in the decimal set, and
    • its position in the input string is immediately before a decimal digit
  • Ignore all format characters: in particular, ignore the RLM and LRM used to control BIDI formatting.

Where do I find a list of all format characters?

  • Ignore all characters in [:Zs:] unless they occur between letters. (In the heuristics below, even those between letters are ignored except to delimit fields)

Where do I find a list of all [:Zs:] characters?

  • Map all characters in [:Dash:] to U+002D HYPHEN-MINUS

Where do I find a list of all [:Dash:] characters?

  • Use the data in the element to map equivalent characters (for example, curly to straight apostrophes). Other apostrophe-like characters should also be treated as equivalent, especially if the character actually used in a format may be unavailable on some keyboards. For example:
    • U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead as U+2018 LEFT SINGLE QUOTATION MARK (‘).
    • U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
    • U+05F3 HEBREW PUNCTUATION GERESH (‎׳) might be typed instead as U+0027 APOSTROPHE.

Except for the U+05F3 example, the other two cannot be found in http://www.unicode.org/repos/cldr-aux/json/25/supplemental/characterFallbacks.json. Are both of them the "other apostrophe-like characters"? Where do I find a complete list of apostrophe-like characters? Do mappings follow an algorithm, an algebraic formula, or a lookup table?

On http://unicode.org/reports/tr35/tr35-info.html#Supplemental_Character_Fallback_Data, there's:

There is more than one possible fallback: the recommended usage is that when a character value is not in the desired repertoire the following process is used, whereby the first value that is wholly in the desired repertoire is used.

  • toNFC(value)
  • other canonically equivalent sequences, if there are any
  • the explicit substitutes value (in order)
  • toNFKC(value)

Does it mean that when the character being looked up is not found, the above process should be followed? Where do I find the definitions of toNFC(), toNFKC(), canonical equivalence, and explicit substitutes?

  • Apply mappings particular to the domain (i.e., for dates or for numbers, discussed in more detail below)

Where?

  • Apply case folding (possibly including language-specific mappings such as Turkish i)

Where do I find more information about it?

  • Normalize to NFKC; thus no-break space will map to space; half-width katakana will map to full-width.

Are both mappings (no-break space and half-width katakana) all there is to it, or are there other NFKC normalizations that should be done? Where do I find a complete list of what should be done? Do mappings follow an algorithm, an algebraic formula, or a lookup table?

Loose matching involves (logically) applying the above transform to both the input text and to each of the field elements used in matching, before applying the specific heuristics below. For example, if the input number text is " - NA f. 1,000.00", then it is mapped to "-naf1,000.00" before processing. The currency signs are also transformed, so "NA f." is converted to "naf" for purposes of matching. As with other Unicode algorithms, this is a logical statement of the process; actual implementations can optimize, such as by applying the transform incrementally during matching.

"NA f." is the currency symbol for ANG (Netherlands Antillean guilder, aka Netherlands Antilles Florin according to wikipedia). nl-CW and nl-SX defines ANG symbol as NAf.. All other locales define it as ANG.

Following the above recommendation (to map NA f. to naf), how is an implementation supposed to know that naf is ANG? Where do I find a mapping between naf and ANG?

@arschmitz

@rxaviers Ping me about this; I have some experience with NFC and NFKC, as well as JS implementations of them.

@rxaviers
Member

For the record, @arschmitz has worked with Unicode normalization in his arschmitz/jquery-pr project, where he used walling/unorm/.../unorm.js for NFC and the other normalizations.

@rxaviers
Member

Also, I have received answers from CLDR mailing list: https://gist.github.com/rxaviers/76762da0ea8d3335f263

@rxaviers
Member

The ES6 String.prototype.normalize seems to be the way to go (for NFC and NFKC), using the unorm.js shim as a polyfill in the meantime.

// Comparing U+00A0 no-break space (160) with U+0020 space (32):
"\u00a0" === "\u0020"; // false
"\u00a0".normalize("NFKC") === "\u0020".normalize("NFKC"); // true

The problem is that unorm.js currently embeds the normalization lookup data, making it 36.6 KB (minified + gzipped), which is 10x bigger than Globalize and its number module together. While that's not a problem for a backend application, it may be far too much for the frontend. Out of curiosity, stripping the embedded data out of unorm.js brings it down to 2.0 KB (minified + gzipped).
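One way to avoid shipping unorm.js to every client is to prefer the native method when present. This is a minimal sketch of that wiring, not Globalize's actual code; it assumes `unorm` has been loaded as a global polyfill in environments without ES6 normalize.

```javascript
// Sketch: prefer native ES6 String.prototype.normalize, fall back to the
// unorm shim (assumed to be loaded as a global) on older engines.
function toNFKC(s) {
  if (typeof String.prototype.normalize === "function") {
    return s.normalize("NFKC");
  }
  // unorm exposes nfc/nfd/nfkc/nfkd functions.
  return unorm.nfkc(s);
}

// U+00A0 NO-BREAK SPACE maps to U+0020 SPACE under NFKC:
toNFKC("123\u00a0456,78") === "123 456,78"; // true
```

With this shape, the lookup data only needs to be downloaded by the minority of clients whose engines lack `String.prototype.normalize`.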

@arschmitz

@rxaviers Ah cool, ES6 String.prototype.normalize has actually landed in Chrome and Firefox now; it hadn't yet when I wrote arschmitz/jquery-pr. That means I can actually remove unorm.js now, since jquery-pr is a Chrome extension :)

@rxaviers
Member

Closed in favor of the broader scope #292 (Loose Matching).

ashensis pushed a commit to ashensis/globalize that referenced this issue Mar 17, 2016
ashensis pushed a commit to ashensis/globalize that referenced this issue Mar 17, 2016
rxaviers added a commit that referenced this issue Nov 28, 2016
- Correctly handles prefix and suffix literals; #353;
- Loose Matching: This implementation is now much closer to UTS#35 7.1.2 Loose
  Matching http://unicode.org/reports/tr35/#Loose_Matching and fixes all
  reported cases that are related to it, including #288;
- Regression: Drop scientific notation parsing support, which wasn't documented
  anyway and shall be implemented by #533.

Ref #292
Fixes #353

Fixes #46
Fixes #288
Fixes #443
Fixes #457
Fixes #492
Fixes #587
Fixes #644
rxaviers added a commit that referenced this issue Dec 13, 2016