-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[browser][non-icu] HybridGlobalization
checking for prefix/suffix
#84920
Conversation
// unless we have < 2 empty chars at the beginning | ||
if (segment.length === 1 || (segment.length > 1 && "".localeCompare(segment[1].segment, undefined) !== 0)) | ||
{ | ||
segment.shift(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we pass currentIndex: number
and increment it rather than mutate the array with .shift()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could, I will try to edit the logic.
|
||
export function segment_string_locale_sensitive(string: string, locale: string | undefined) : Intl.SegmentData[] | ||
{ | ||
const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would appreciate comment about why we are segmenting the string into "user perceived letters" instead of comparing the whole string ? Is that because of the need to know the match index ?
Would it be possible it to assume that most calls return false (no match) and optimize for it by comparing unsegmented string first ?
Is creation of Segmenter
object expensive ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because we need locale-sesitive comparison. We cannot guess how many chars does n letters consist of. Consider the case:
s1 = "A\u0300" // 2 chars, 1 letter
s2 = "\u00C0" // 1 char, 1 letter
segmenter recognises both as 1 letter and we can easily compare them. If we were to compare s1[0] === s2[0] then we would get false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we normalize both strings instead ?
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" }); | ||
return Array.from(segmenter.segment(string)); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
locale.split
on each call to compare_strings
is probably expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added performance tests (for the whole operation, not only segmentation) and it seems it's not that bad.
measurement | time ICU | time Hybrid |
---|---|---|
String, CompareInfo IsPrefix | 5.3330ms | 6.0471ms |
String, CompareInfo IsSuffix | 0.0091ms | 0.0093ms |
String, String StartsWith | 5.5313ms | 6.0296ms |
String, String EndsWith | 0.0094ms | 0.0102ms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will have to re-investigate it, the numbers are a bit too similar in both cases. I expected bigger differences, something might be wrong with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked. There was an error connected with prefix testing but it does not change the range of ICU vs HG difference. Updated numbers:
measurement | time ICU | time Hybrid | time CoreCLR |
---|---|---|---|
String, CompareInfo IsPrefix | 5.3665ms | 5.6897ms | 4.300ms |
String, CompareInfo IsSuffix | 10.1670ms | 10.7293ms | 6.700ms |
String, String StartsWith | 5.4105ms | 6.4558ms | 4.900ms |
String, String EndsWith | 10.4519ms | 12.8267ms | 6.700ms |
Closing. The PR will get re-designed to improve the performance and re-posted. We will need to drop support for |
Implements a chunk of web-api based globalization. Is a part of HybridGlobalization feature and contributes to #79989.
Old, icu-based private API:
GlobalizationNative_StartsWith
,GlobalizationNative_EndsWith
New, non-icu private API:
Interop.JsGlobalization.StartsWith
,Interop.JsGlobalization.EndsWith
Affected public API (see: tests in CompareInfoTests.IsPrefix.cs, CompareInfoTests.IsSuffix):
This implementation hovers with 64kB strings. Average processing times for 16kB is ~200ms.
All changes in behavior are listed in docs\design\features\hybrid-globalization.md.
cc @SamMonoRT