Should the Segmenter types accept a locale? #3284
Comments
General preference for #3 Do we plan to provide these locale-ish APIs in the near term? I actually think future |
General preference for #2. I'd prefer the default constructor names to be consistent in behavior as much as makes sense. If we believe Segmenter constructors will want to take a locale just like all others, let's keep the names for those constructors.
Why hasn't this ICU4C rule been merged into, or requested for, UAX #29? Does ICU4C plan to file an issue to get it merged into UAX #29? Once this change is merged into UAX #29, then #3.
UAX #29 in general doesn't really want to include locale-specific stuff because it wants to leave that up to CLDR.
Suffix suggestions:
Since we will want to take locales as parameters (even for segmenters where that isn't implemented yet), IMO we should make that the "normal" case.
From discussion with @aethanyc @makotokato @Manishearth @nordzilla: It is an enhancement to consume CLDR root.xml tailorings, but not necessarily a bug. We would like to see it done in a timely fashion. Conclusion: use |
For the record we didn't use |
These seem to be lists of abbreviations that contain a period that doesn't end a sentence. How bad would it be to merge the lists and use the merged lists across languages?
It's a bit sad that treating letter, colon, letter as having a word break opportunity after the colon is a case of giving computer syntax needs precedence over natural-language needs. If accommodating computer syntaxes wasn't given priority, the Finnish/Swedish requirement of not treating letter, colon, letter as containing a word break opportunity could be hoisted to root.
This seems to be about the ASCII semicolon having sentence-ending question mark semantics. Could this be accommodated in the root by triggering the rule when the most recent letter is from the Greek script?
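To make the Greek-script idea concrete, here is a minimal sketch of a toy sentence splitter in which `;` ends a sentence only when the most recent letter seen is Greek. It is not UAX #29 and not ICU4X code: `is_greek` and `toy_sentences` are hypothetical names, and the script check is a rough approximation based on the Greek and Greek Extended blocks.

```rust
/// Rough script check (assumption for this sketch): the Greek and Coptic
/// block plus the Greek Extended block. A real implementation would use
/// the Unicode Script property instead of block ranges.
fn is_greek(c: char) -> bool {
    matches!(c, '\u{0370}'..='\u{03FF}' | '\u{1F00}'..='\u{1FFF}')
}

/// Toy sentence splitter: `.`, `!` and `?` always end a sentence, and `;`
/// ends one only if the most recent letter was Greek.
fn toy_sentences(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut last_letter_greek = false;
    for (i, c) in text.char_indices() {
        if c.is_alphabetic() {
            last_letter_greek = is_greek(c);
        }
        let ends = matches!(c, '.' | '!' | '?') || (c == ';' && last_letter_greek);
        if ends {
            let end = i + c.len_utf8();
            out.push(&text[start..end]);
            start = end;
        }
    }
    // Keep any trailing text that has no terminator.
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}
```

With this rule, a semicolon after Latin-script text stays inside the sentence, while one after Greek text closes it, without consulting a locale.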
I think German needs this tailoring as well. I don't know why Finnish and Swedish do, but in German a colon is commonly used to form gender-neutral nouns, like What's the current process for updating the tailorings? ICU or CLDR? |
@robertbastian this was recently discussed in the CLDR design meeting. CLDR issue: https://unicode-org.atlassian.net/browse/CLDR-15910 / PAG issue: https://github.com/unicode-org/properties/issues/187 (internal). There's thought that this should actually be made to apply to all languages, since colons without spaces on either side are not really a thing in regular text anyway, and if the space has been removed there's a good chance it's on purpose.
That's exactly what they are. See https://www.unicode.org/reports/tr35/tr35-general.html#Segmentation_Exceptions for more details. The lists for one language may not be applicable to others. You could probably calculate a list that's likely to be generally useful, though it might be less useful for any particular language.
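For readers unfamiliar with how such exception lists are applied, here is a minimal sketch: a toy sentence breaker that proposes a break after every period followed by whitespace (or end of text) and then drops the break if the text up to that period ends with a listed abbreviation. The function name and the sample list are assumptions for illustration; this is not the CLDR data format or the ICU implementation.

```rust
/// Returns byte offsets of toy sentence breaks in `text`.
/// `suppressions` is a hypothetical per-language list such as ["Mr.", "Dr."].
fn toy_sentence_breaks(text: &str, suppressions: &[&str]) -> Vec<usize> {
    let mut breaks = Vec::new();
    for (i, c) in text.char_indices() {
        if c != '.' {
            continue;
        }
        let end = i + 1; // '.' is one byte in UTF-8
        // Candidate break: period followed by whitespace or end of text.
        let followed_ok = text[end..].chars().next().map_or(true, char::is_whitespace);
        if !followed_ok {
            continue;
        }
        // Suppress the break if the text ends with a listed abbreviation
        // that starts at a word boundary.
        let suppressed = suppressions.iter().any(|abbr| {
            text[..end].ends_with(abbr)
                && text[..end - abbr.len()]
                    .chars()
                    .next_back()
                    .map_or(true, |p| !p.is_alphanumeric())
        });
        if !suppressed {
            breaks.push(end);
        }
    }
    breaks
}
```

Merging lists across languages would amount to passing a larger `suppressions` slice, which is where an abbreviation that is valid in one language could wrongly suppress a real sentence end in another.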
I think https://unicode-org.atlassian.net/browse/CLDR-15910 should be reverted on the root level so that we don't need tailorings to accommodate natural languages.
For Finnish, the use case is marking where a sufficiently unusual word body (e.g. acronym) ends and the case suffix starts. For example, English Henri’s would be Henrin in Finnish but English ICU4X’s would be ICU4X:n in Finnish. The use case for Swedish seems to be also about applying suffixes (though not case suffixes) to sufficiently unusual word bodies. (Consider an analog of English Londoner, but with the suffix applied to e.g. a sports team acronym.)
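The difference between the two behaviors can be shown with a minimal sketch. This is a toy splitter, not the UAX #29 word break rules and not ICU4X code; `ColonRule` and `toy_word_segments` are hypothetical names made up for the illustration.

```rust
/// Hypothetical switch for this sketch; not an ICU4X type.
#[derive(Clone, Copy, PartialEq)]
enum ColonRule {
    /// A colon between letters is its own segment (break on both sides).
    BreakAround,
    /// A colon between letters or digits stays inside the word
    /// (the Finnish/Swedish need).
    KeepTogether,
}

/// Splits `text` into runs of alphanumeric characters, with every other
/// character emitted as its own segment.
fn toy_word_segments(text: &str, rule: ColonRule) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut out = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        // A colon joins the word only when tailored and flanked by
        // alphanumeric characters on both sides.
        let joins = c.is_alphanumeric()
            || (c == ':'
                && rule == ColonRule::KeepTogether
                && i > 0
                && i + 1 < chars.len()
                && chars[i - 1].is_alphanumeric()
                && chars[i + 1].is_alphanumeric());
        if joins {
            current.push(c);
        } else {
            if !current.is_empty() {
                out.push(std::mem::take(&mut current));
            }
            out.push(c.to_string());
        }
    }
    if !current.is_empty() {
        out.push(current);
    }
    out
}
```

Under `KeepTogether`, ICU4X:n is a single segment; under `BreakAround`, it splits into ICU4X, the colon, and n, so double-click selection or word counting would separate the suffix from the word body.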
Okay, for 2.0 purposes, which of the four segmenters requires a locale parameter?
The "language of the text" would be more appropriate to provide in the terminal |
Not yet. Please put it into the API. I was planning a work item to move this forward. This is, for example, for languages that want to keep "ch" together.
On the flip side, putting this in the API really requires making ECMA-402 have a way to explicitly ask for root and to default to root. Some users getting a different definition of extended grapheme clusters based on the browser UI locale would likely be bad, after developers having assumed for years that extended grapheme clusters are a Unicode-level concept and not a locale-level concept. Also, it would be bad to have to assume that English is always going to be the untailored language and to teach every developer to ask for a grapheme segmenter for English in order to get behavior on a similar level of stability that one would expect of e.g. Swift strings.
What languages do you mean and why do they want to keep "ch" together for the kind of purposes that extended grapheme clusters are used for, such as denying the selection of only "c" or only "h"? Czech treats "ch" as a collation unit, but do users of the language expect not to be able to select "c" and "h" individually?
The conclusions from the discussion of this issue with the CLDR design group:
My suggested path forward for this issue, then, is to add an options bag to the WordSegmenter, LineSegmenter, and SentenceSegmenter constructors with an optional |
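As a rough illustration of the options-bag shape, here is a minimal sketch in which the locale is optional and its absence means root (untailored) behavior. All names below (`ToySegmenterOptions`, `content_locale`, `ToyRules`, `select_rules`) are assumptions made for this sketch and are not the ICU4X API. The exact-string locale match is also a simplification; real code would go through locale fallback.

```rust
/// Hypothetical options bag; the field name is an assumption.
#[derive(Debug, Default, Clone, PartialEq)]
struct ToySegmenterOptions {
    /// `None` means "use root (untailored) rules".
    content_locale: Option<String>,
}

/// Which rule set the toy constructor would load.
#[derive(Debug, PartialEq)]
enum ToyRules {
    Root,
    Tailored(String),
}

/// Picks tailored rules only when the caller supplied a locale that has a
/// tailoring; everything else falls back to root.
fn select_rules(options: &ToySegmenterOptions, tailored: &[&str]) -> ToyRules {
    match options.content_locale.as_deref() {
        Some(loc) if tailored.contains(&loc) => ToyRules::Tailored(loc.to_string()),
        _ => ToyRules::Root,
    }
}
```

The point of the shape is that `ToySegmenterOptions::default()` gives locale-invariant behavior without the caller having to name a stand-in locale such as English.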
I'm moving this back into 1.5 because the constructor can be drafted and bikeshed ahead of time, and then in 2.0 we can do the minimal change of making the new constructor the default one.
This makes no sense and contradicts the long-standing requests (https://unicode-org.atlassian.net/browse/CLDR-2992, which I am working on scheduling). I would have joined, but I did not realize this was coming up today. Perusing the notes, it's not clear that the previous requirements and recent discussion from the segmentation summary last year were included here.
Based on additional discussion in the email thread, I would like to move forward with the recommendation in #3284 (comment), with the additional understanding that we may add support for locale-based grapheme segmentation in the future if CLDR adds data for this, but it might take the form of another (fifth) segmenter type. Concretely:
Looking at the ICU4C brkitr rule files for word and sentence, the UAX #29 property assignments differ between locales, but the rules themselves seem to be the same. So if we modify datagen (with a few changes to the TOML data file), we can generate rule data per locale.
Add LocaleData parameter for word/sentence segmenter. This is part of #3284. ICU4C has some language-specific break rules for the word and sentence segmenters, so this fix adds per-locale rules to ICU4X. It adds a LocaleData argument to all constructors. Also, since the locale differences are small and affect only 2 pieces of data, I add an override table data marker for the machine state property.
If we support |
Currently, we have the optional Content Locale on @makotokato will create a pull request to replace Once that PR lands, we can close this issue. |
In the API review, @markusicu pointed out that ICU takes a locale in the segmenter, and the locale affects the behavior in certain cases, such as those in the data files below:
Why don't we support these in ICU4X Segmenter, and should we add them?
For 1.2 purposes, we have a few choices:
_invariant to the constructor names, so that in the future try_new_auto_invariant() creates the locale-invariant segmenter and try_new_auto(locale!("el")) creates the locale-specific segmenter
Thoughts?
@aethanyc @makotokato @Manishearth