Data analysts should be able to use Text.contains to check for substring using various matcher techniques. #3285
Conversation
test/Tests/src/Data/Text_Spec.enso
## TODO what do we do with that?? Since the standard decomposition
   splits 'ś' into 's+{accent}', 'ś'.contains 's', but I don't think
   this is the expected behaviour...
"Cześć".contains 's' . should_be_true
'Czes\u{301}c\u{301}'.contains 's' . should_be_true
I've found a non-trivial and quite problematic edge case: since we perform the normalization, the accented letter is represented by the unaccented letter + accent, thus if we just normalize and then do the naive Java contains, it finds the s (which is just a part of the representation of this grapheme). I don't think this is something we want, because logically s is not contained in ś (although visually, in a way, it is).
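This edge case can be reproduced with plain JDK classes (a sketch for illustration, not the Enso implementation): after NFD normalization, 'ś' becomes 's' plus a combining acute accent, so a naive String.contains finds the bare 's'.

```java
import java.text.Normalizer;

public class NormalizationEdgeCase {
    public static void main(String[] args) {
        String composed = "\u015B"; // 'ś' as a single code point

        // NFD decomposes it into 's' + U+0301 (combining acute accent).
        String nfd = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(nfd.length());            // 2
        System.out.println(composed.contains("s"));  // false: no bare 's' code unit
        System.out.println(nfd.contains("s"));       // true: naive contains finds the
                                                     // 's' inside the decomposed 'ś'
    }
}
```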
I've added an analogous test and there it works as expected (i.e. ś does not contain s in regex matching mode). That's an argument for making this work correctly in exact matching mode - we want to be consistent.
That's also reassuring, because if Regex didn't support this properly and we still wanted that property, it could have been very hard to 'fix' the Regex implementation.
EDIT: I was wrong. Regex does work in the direction "ś" . contains 's\u{301}' and also correctly handles 's\u{301}' . contains 'ś'. But it actually does return True for 's\u{301}' . contains 's' - contrary to what we'd expect. At this point I'm not sure what to do...
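Leaving Enso's engine aside, the same asymmetry can be reproduced with JDK regexes. This is only a sketch: Java's Pattern.CANON_EQ flag is assumed here as a stand-in for a canonical-equivalence-aware engine.

```java
import java.util.regex.Pattern;

public class RegexCanonEq {
    static boolean find(String pattern, String input) {
        return Pattern.compile(pattern, Pattern.CANON_EQ).matcher(input).find();
    }

    public static void main(String[] args) {
        String composed = "\u015B";     // 'ś' as one code point
        String decomposed = "s\u0301";  // 's' + combining acute accent

        // With CANON_EQ the two spellings of 'ś' match each other...
        System.out.println(find(composed, decomposed));   // true
        System.out.println(find(decomposed, composed));   // true

        // ...but a bare 's' is still found inside the decomposed form,
        // while it is (correctly) not found in the composed form.
        System.out.println(find("s", decomposed));        // true
        System.out.println(find("s", composed));          // false
    }
}
```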
One solution could be to write the contains search manually using the BreakIterator, ensuring that it looks at whole grapheme clusters correctly. This is likely going to be slightly slower than what we have now (an ICU Normalizer2 preprocessing step + Java contains), but it may be the only way to retain correctness.
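A grapheme-aware contains along those lines could look roughly like this. This is a sketch using the JDK's BreakIterator rather than ICU4J, with per-cluster NFC normalization; the names are made up for illustration.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class GraphemeContains {
    // Split a string into grapheme clusters using BreakIterator.
    static List<String> graphemes(String s) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(s.substring(start, end));
        }
        return out;
    }

    // A contains that compares whole grapheme clusters, normalizing each
    // cluster so that 'ś' and 's' + combining accent compare equal.
    static boolean graphemeContains(String haystack, String needle) {
        List<String> h = graphemes(haystack), n = graphemes(needle);
        outer:
        for (int i = 0; i + n.size() <= h.size(); i++) {
            for (int j = 0; j < n.size(); j++) {
                String a = Normalizer.normalize(h.get(i + j), Normalizer.Form.NFC);
                String b = Normalizer.normalize(n.get(j), Normalizer.Form.NFC);
                if (!a.equals(b)) continue outer;
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(graphemeContains("Czes\u0301c\u0301", "s"));       // false
        System.out.println(graphemeContains("Czes\u0301c\u0301", "\u015B"));  // true
        System.out.println(graphemeContains("Cze\u015B\u0107", "s\u0301"));   // true
    }
}
```

This naive search is quadratic in the worst case, which matches the concern above that a hand-rolled version would be slower than the normalize-then-contains approach.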
...
After a long and deep dive into the ICU4J API, I've found StringSearch, which should do what we need in an efficient manner. Will try it out.
Interestingly, it allows setting a locale - I have no idea how changing the locale can influence the search in general - but noting this as something we may want to explore (although I'd set up a separate chore task for it instead of digging into it right now - but that's up for discussion).
Comparison of the Normalizer+Java.contains vs StringSearch implementations (full results sheet):
Only the first two rows are relevant - the Regex implementation did not change, so any differences there are only due to measurement uncertainty. We can see that unfortunately StringSearch is 2-4x slower. I don't think we can get a better solution that handles the edge cases correctly, though - it is unlikely that we can get something both correct and at the same time faster than the ICU implementation. Also, a significant part of this cost is likely due to the additional logic needed to correctly handle the edge cases - which is simply unavoidable if we want this (more complex) behaviour.
"Straße" . contains "ss" . should_be_false
"Strasse" . contains "ß" . should_be_false
"Straße" . contains "ss" (Text_Matcher Case_Insensitive.new) . should_be_true
"Strasse" . contains "ß" (Text_Matcher Case_Insensitive.new) . should_be_true
Documenting this slightly peculiar case - due to how we handle case-insensitive operations (tolower+toupper), and given the fact that the uppercase variant of ß is SS, ß and ss get collated in case-insensitive mode.
More generally (also shown in tests here, just in a different place), currently in Enso: "ß".equals_ignore_case "ss" == true.
Not sure if this is good or bad:
- It seems bad, because the difference that got collated is not exactly a case difference.
- OTOH, it seems natural that these two symbols mean the same thing so under a less strict equality they may be equated.
However, in Java "ß".equalsIgnoreCase("ss") == false.
Moreover, it's really a different kind of difference - scharfes S is more like a ligature, i.e. in a similar spirit maybe æ should also get collated with ae, etc.
So I'd lean more in the direction of trying to get rid of this collation - but I'm not exactly sure how to do this efficiently - the ICU normalizer we use for equals_ignore_case supports case folding, but does not accept a locale. It seems the only way to handle casing with a locale is through the to_lower_case and to_upper_case methods. Interestingly, how does Java get away with this? Because its equalsIgnoreCase processes the text character-by-character (not even by grapheme clusters!) and, since the proper upper-case of ß is SS, which takes two characters, Character.toUpperCase simply ignores it and returns ß unchanged (because it is incapable of returning two characters). So Java gets this right because it handles characters at a lower level which is too limited to even encounter this issue.
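The Java behaviour described above is easy to verify with the JDK alone (a quick sketch, all standard library calls):

```java
import java.util.Locale;

public class SharpS {
    public static void main(String[] args) {
        // The full case mapping of 'ß' really is the two-character "SS"...
        System.out.println("\u00DF".toUpperCase(Locale.ROOT));       // SS

        // ...but the single-char API cannot return two chars, so it
        // returns 'ß' unchanged.
        System.out.println(Character.toUpperCase('\u00DF'));         // ß

        // equalsIgnoreCase works char-by-char via Character.toUpperCase /
        // toLowerCase, so it never sees the ß -> SS expansion.
        System.out.println("\u00DF".equalsIgnoreCase("ss"));         // false
        System.out.println("\u00DF".equalsIgnoreCase("SS"));         // false
    }
}
```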
Quick solutions that come to mind:
1. Use ICU's case folding, which is not locale-aware, possibly adding an if for the Turkish locale, which seems to be toggleable in ICU (maybe that's the only difference between all locales, so we don't need the others?).
2. Use the BreakIterator and implement this manually.
(2) is likely going to be slower, so we probably don't want that (although we may need a benchmark to be sure). (1) could be incorrect, which would be bad - unless the only locale with different case handling really is Turkish; that's possible, but we'd need to research it - possibly ask a linguist.
I think it may make the most sense to create a separate task to explore this, especially to check whether (1) is viable, as it would be our best shot. For now I'd just live with this collation - but I'm open to discussion if this should be resolved before merging.
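For context on why the Turkish toggle matters: Turkish is the classic locale-sensitive casing special case - its dotted/dotless 'i' mappings differ from every other common locale, which is observable even with plain JDK string casing (a sketch unrelated to ICU's folding API):

```java
import java.util.Locale;

public class TurkishCasing {
    public static void main(String[] args) {
        Locale tr = Locale.forLanguageTag("tr");

        // In the root locale, 'i' and 'I' round-trip as expected.
        System.out.println("i".toUpperCase(Locale.ROOT));  // I
        System.out.println("I".toLowerCase(Locale.ROOT));  // i

        // In Turkish, uppercasing 'i' yields dotted capital I (U+0130),
        // and lowercasing 'I' yields dotless small i (U+0131).
        System.out.println("i".toUpperCase(tr));           // İ
        System.out.println("I".toLowerCase(tr));           // ı
    }
}
```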
Turns out Swift's caseInsensitiveCompare also compares scharfes S as equal to ss, so I guess we can keep the current behaviour for now.
We'd probably need someone who knows German linguistics very well to tell whether collating these two in case-insensitive mode makes sense or not.
test/Tests/src/Data/Text_Spec.enso
@@ -256,6 +256,7 @@ spec =
"Cześć" . contains 's\u{301}' Regex_Matcher.new . should_be_true
'Czes\u{301}c\u{301}' . contains 's\u{301}' Regex_Matcher.new . should_be_true
'Czes\u{301}c\u{301}' . contains 'ś' Regex_Matcher.new . should_be_true
'Czes\u{301}c\u{301}' . contains 's' Regex_Matcher.new . should_be_false
Unfortunately, this test fails...
So Regex only works well with Unicode normalization to some extent - it does correctly find ś in s\u{301} and vice versa, and it does correctly not find s in ś. But it incorrectly (according to what I'd expect) finds s in s\u{301}. This is quite inconsistent. Maybe it actually should be reported as a bug in the Regex implementation - we already got one bug accepted there, so maybe we could get through with this one too - I'm not exactly sure whether this will be considered a bug, but the behaviour is not consistent - I don't think the results should depend on whether the string is normalized or not.
Not sure if simple workarounds exist for this - we could normalize the text before passing it to the engine, but normalization splits ś into s\u{301} (IIRC), so that would make it even worse (but at least consistent, irrespective of whether the input was normalized).
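The normalize-first workaround can be sketched with JDK regexes. As the comment predicts, it makes both spellings behave identically, at the cost of always finding the bare 's' (the helper name below is made up for illustration):

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class NormalizeThenMatch {
    // Hypothetical workaround: NFD-normalize the input before matching.
    static boolean findNormalized(String pattern, String input) {
        String nfd = Normalizer.normalize(input, Normalizer.Form.NFD);
        return Pattern.compile(pattern).matcher(nfd).find();
    }

    public static void main(String[] args) {
        String composed = "\u015B";     // 'ś'
        String decomposed = "s\u0301";  // 's' + combining acute accent

        // Without normalization, the result depends on the input's form:
        System.out.println(Pattern.compile("s").matcher(composed).find());    // false
        System.out.println(Pattern.compile("s").matcher(decomposed).find());  // true

        // With normalization, both forms give the same (arguably wrong) answer:
        System.out.println(findNormalized("s", composed));    // true
        System.out.println(findNormalized("s", decomposed));  // true
    }
}
```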
Turns out Swift has a similar problem - actually it handles these cases even worse than our Regex.
This seems to be a widely-known issue with Regex implementations - see https://www.regular-expressions.info/unicode.html - with no known implementations doing better in this case.
I will document this nuance in the contains docstring and add unit tests showing it, so that we are aware of it, but I expect we can't do much more than that.
Pull Request Description
Important Notes
Checklist
Please include the following checklist in your PR:
./run dist
and ./run watch.