Extended Latin and Viet subsets missing many characters #6

Open
jvgaultney opened this issue Feb 2, 2023 · 12 comments

Comments

@jvgaultney

This is the fourth place I've submitted this issue in the last few months, as there is still no progress. See also google/fonts#5385 google/fonts#3756 googlefonts/lang#30

A large number of extended Latin and Vietnamese characters are not displaying properly. These characters are being displayed with fallback fonts even if the characters are supported in the fonts.

In the following screenshots LPR = local path-referenced font, GF = Google Font with subset=latin-ext,cyrillic-ext,vietnamese, FLO = our own internal font server. Screen shots are from current Chrome on Win 10.

Three specific examples:

  1. Vietnamese text properly renders the Vietnamese diacritic forms when lang='vi' is set. However, certain combinations with dot below are using fallback fonts.
    Character string in example: Ấấ Ầầ Ẩẩ Ẫẫ Ắắ Ằằ Ẳẳ Ẵẵ Ếế Ềề Ểể Ễễ Ốố Ồồ Ổổ Ỗỗ Phải áp dụng chế độ giáo dục miễn phí, ít nhất là ở bậc tiểu học và giáo dục cơ sở

[Screenshot: Vietnamese example]

  2. Extended Latin does not seem to include some important diacritics, such as U+0329, and again fallback fonts are used. Example from the Yoruba-language UDHR.
    Character string in example: E̩nì kò̩ò̩kan ló ní è̩tó̩ láti kó̩ è̩kó̩. Ó kéré tán, è̩kó̩ gbo̩dò̩ jé̩ ò̩fé̩ ní àwo̩n ilé‐è̩kó̩ alákò̩ó̩bè̩rè̩. E̩kó̩ ní ilé‐è̩kó̩ alákò̩ó̩bè̩rè̩ yìí sì gbo̩dò̩ jé̩ dandan. A gbo̩dò̩ pèsè è̩kó̩ is̩é̩‐o̩wó̩, àti ti ìmò̩‐è̩ro̩ fún àwo̩n ènìyàn lápapò̩. Àn fàní tó dó̩gba ní ilé‐è̩kó̩ gíga gbo̩dò̩ wà ní àró̩wó̩tó gbogbo e̩ni tó bá tó̩ sí.

[Screenshot: Yoruba example]

  3. Many common diacritics, like ogonek, are not displaying properly.
    Character string in example: ọ o̧ ǫ ô o˞ o̝̠̣ ô͑ n f i fi f l fl ˥ ˦ ˧ ˨ ˩ ˥˥ ˥˦ ˥˧ ˥˨ ˥˩ ˥˨˥ ˥˨˦ ˥˨˧ ˥˨˨ ˥˨˩

[Screenshot: diacritics example]

@simoncozens
Contributor

Very few of the U+03XX combining marks appear in any of the Google Fonts glyphsets, so they will all be stripped out of fonts served via GF. We could make piecemeal PRs adding combining marks into the Latin and Vietnamese and extended Latin and whatever various other script sets use them, but it feels really yucky; it's clearly symptomatic of a larger problem. However, the engineering team sees a lot of benefit in subsetting fonts, so I'm not sure how to solve that larger problem.
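
To illustrate the stripping described above, here is a minimal sketch that lists which U+03XX combining marks a piece of text needs but a subset does not contain. The function name and the subset contents are hypothetical and chosen for illustration; they are not the actual Google Fonts glyphset definitions.

```python
import unicodedata

def marks_missing_from_subset(text, subset_codepoints):
    """List the U+03XX combining marks that `text` needs but the subset lacks."""
    # Decompose first so precomposed letters expose their combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    needed = {ord(ch) for ch in decomposed if 0x0300 <= ord(ch) <= 0x036F}
    return [f"U+{cp:04X}" for cp in sorted(needed - set(subset_codepoints))]

# Hypothetical subset (an assumption): Basic Latin plus grave and acute only.
HYPOTHETICAL_SUBSET = set(range(0x0020, 0x007F)) | {0x0300, 0x0301}

# A Yoruba word written with explicit escapes to avoid normalization ambiguity.
word = "e\u0300\u0329ko\u0301\u0329"
print(marks_missing_from_subset(word, HYPOTHETICAL_SUBSET))  # ['U+0329']
```

Any mark reported here would be absent from the served subset, so the browser would have to look for it in another font.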

@simoncozens
Contributor

(See also #7. There are a huge number of fonts on GF which offer these combining marks, but they can't be used.)

@jvgaultney
Author

Well that's a non-answer. We know it's not working, and that the combining marks are not getting included, and that it's one symptom of a larger systemic problem with GF.

However, we just need something that works, even if it feels yucky to you. Even if only the more common combining diacritics were added, it would make GF useful for many more languages. The lack of basic Vietnamese support is really embarrassing when the fix is trivial.

@thlinard

This is a screenshot of Roboto on https://fonts.google.com/specimen/Roboto?subset=vietnamese&noto.script=Latn (sample in Vietnamese):
[Screenshot: Roboto]

Same situation for every font with Vietnamese support (.notdef displayed for ịửỡ in standard sample text).

@garretrieger
Contributor

FYI I made an update for this issue in googlefonts/glyphsets#102. Since this affects many families it may take a bit to get the fix rolled out to each family. For now I've already updated Noto Sans, Andika, Charis SIL, and Gentium Plus with the fixed subset definitions.

@thlinard

thlinard commented Jun 19, 2023

> FYI I made an update for this issue in googlefonts/glyphsets#102. Since this affects many families it may take a bit to get the fix rolled out to each family. For now I've already updated Noto Sans, Andika, Charis SIL, and Gentium Plus with the fixed subset definitions.

Hi @garretrieger

The fix is incomplete:

Example with Andika, from the API:

[Screenshot: Andika API]

Andika downloaded and displayed on desktop:

[Screenshot: Andika Desktop]

Displaying other fonts is still problematic:

[Screenshot: Roboto]

@garretrieger
Contributor

garretrieger commented Jun 19, 2023

We had to partially roll back some of the fixes due to google/fonts#6245. The problem is that the combining marks are present in the latin, latin extended, and vietnamese subsets. Selecting the subset to load/use for a particular occurrence of a combining mark is up to the browser, and sometimes it doesn't use the right one.

We're experimenting with different subset definitions + unicode range setups to try and find something that works for all cases, but this is difficult. You end up fixing one case, but causing another to break.
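
The overlap problem can be pictured with a deliberately simplified model: each codepoint is assigned to the first subset whose ranges contain it. The subset ranges and the first-match rule below are assumptions made for illustration; they are not the real Google Fonts definitions, and real browsers use more involved font-matching logic.

```python
# Hypothetical subsets as (name, inclusive codepoint ranges). These are
# illustrative assumptions, not the real Google Fonts subset definitions.
SUBSETS = [
    ("vietnamese", [(0x0300, 0x0303), (0x0309, 0x0309), (0x0323, 0x0323), (0x1EA0, 0x1EF9)]),
    ("latin-ext", [(0x0100, 0x024F), (0x0300, 0x0303), (0x0323, 0x0323)]),
    ("latin", [(0x0000, 0x00FF), (0x0300, 0x0303)]),
]

def pick_subset(cp, subsets=SUBSETS):
    """Return the first subset (in list order) whose ranges contain cp, else None."""
    for name, ranges in subsets:
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None

def subsets_for(text):
    """Map each codepoint of text to the subset this simplified model would pick."""
    return [(f"U+{ord(ch):04X}", pick_subset(ord(ch))) for ch in text]

# i + COMBINING DOT BELOW: the base and the mark land in different subsets.
print(subsets_for("\u0069\u0323"))
# [('U+0069', 'latin'), ('U+0323', 'vietnamese')]
```

When a base letter and its mark resolve to different subset files like this, mark positioning may not work as it would within a single font, which is one way a fix for one case can break another.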

I'm currently assembling a test suite that tries to cover as many of the different cases as possible, so we can evaluate potential fixes and make sure we don't regress anything.

Could you provide the specific codepoint sequences you used for the above iuo case? I'll add it to the test suite.

For Roboto, we haven't pushed updated subset definitions yet and likely won't until it's upgraded to the variable version. Unfortunately, the way the layout rules are set up in the static version of Roboto causes its subset sizes to increase massively when the additional combining marks are introduced. This issue has been fixed in the upcoming variable version of the font.

@thlinard

thlinard commented Jun 19, 2023

Thanks for the information.

For the sequences, I simply copied the problematic characters in the sample text from "Select preview text > Asia > Vietnamese", i.e.:

ị (0069 LATIN SMALL LETTER I + 0323 COMBINING DOT BELOW)
ĩ (0069 LATIN SMALL LETTER I + 0303 COMBINING TILDE)
ỉ (0069 LATIN SMALL LETTER I + 0309 COMBINING HOOK ABOVE)
ắ (0103 LATIN SMALL LETTER A WITH BREVE + 0301 COMBINING ACUTE ACCENT)
ẫ (00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX + 0303 COMBINING TILDE)
ụ (0075 LATIN SMALL LETTER U + 0323 COMBINING DOT BELOW)
ử (01B0 LATIN SMALL LETTER U WITH HORN + 0309 COMBINING HOOK ABOVE)
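
Listings like the one above can be produced with Python's standard `unicodedata` module. The sketch below uses a hypothetical helper name, `describe`, and also shows that the same visible character can be encoded as several different codepoint sequences, which matters when collecting test cases.

```python
import unicodedata

def describe(text):
    """Return 'XXXX NAME' for every codepoint in text, exactly as encoded."""
    return [f"{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

# The last sequence from the list, written with escapes so it is unambiguous.
seq = "\u01B0\u0309"
for line in describe(seq):
    print(line)
# 01B0 LATIN SMALL LETTER U WITH HORN
# 0309 COMBINING HOOK ABOVE

# The same text in fully decomposed (NFD) and fully composed (NFC) form.
nfd = [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", seq)]
nfc = [f"{ord(c):04X}" for c in unicodedata.normalize("NFC", seq)]
print(nfd)  # ['0075', '031B', '0309']
print(nfc)  # ['1EED']
```

Each of these three encodings may exercise a different set of subset codepoints, so a test suite would plausibly want all of them.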

Results vary from font to font. For example, on Lora, a VF, the results are good in Italic, bad in Roman:

[Screenshot: Lora]

@garretrieger
Contributor

I've been trying to reproduce your Andika example and haven't been able to: https://codepen.io/garretrieger/pen/XWyKaZq

What browser are you using?

@garretrieger
Contributor

This is what I get for that example:
Screenshot 2023-06-19 at 6 04 01 PM

@moyogo

moyogo commented Jun 20, 2023

@garretrieger U+031B is used in ử (0075 031B 0309) but it is not in the vietnamese set in https://fonts.googleapis.com/css?family=Andika. Chrome shows the example correctly but Safari and Firefox do not.

Firefox:
Screenshot 2023-06-20 at 06 40 54
Safari:
Screenshot 2023-06-20 at 06 41 07

There also seem to be others missing: googlefonts/glyphsets#110 (comment)
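
A check like this one can be scripted by parsing a `unicode-range` value and testing a codepoint against it. The sketch below is a minimal parser written for illustration; the function names are hypothetical and the sample descriptor value is made up, not copied from the Andika stylesheet.

```python
import re

TOKEN = re.compile(r"[Uu]\+([0-9A-Fa-f?]{1,6})(?:-([0-9A-Fa-f]{1,6}))?")

def parse_unicode_range(value):
    """Parse a CSS unicode-range value into inclusive (lo, hi) tuples."""
    ranges = []
    for token in value.split(","):
        match = TOKEN.fullmatch(token.strip())
        if not match:
            raise ValueError(f"bad unicode-range token: {token!r}")
        start, end = match.groups()
        if "?" in start:
            if end is not None:
                raise ValueError(f"wildcard cannot be combined with a range: {token!r}")
            # Each '?' stands for any hex digit: U+1E?? means U+1E00-1EFF.
            lo, hi = int(start.replace("?", "0"), 16), int(start.replace("?", "F"), 16)
        else:
            lo = int(start, 16)
            hi = int(end, 16) if end else lo
        ranges.append((lo, hi))
    return ranges

def covers(ranges, cp):
    """True if codepoint cp falls inside any of the parsed ranges."""
    return any(lo <= cp <= hi for lo, hi in ranges)

# Made-up descriptor value for illustration only.
sample = parse_unicode_range("U+0102-0103, U+0300-0301, U+0303-0304, U+0308-0309, U+0323")
print(covers(sample, 0x0309))  # True
print(covers(sample, 0x031B))  # False
```

Running `covers` over every codepoint of a decomposed test string shows which marks a given subset declaration would leave to a fallback font.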

@thlinard

> What browser are you using?

Firefox 114.0.1 on macOS 13.4.
