CJK char frequency #36
Added to the test above a first example of what happens when there are no common character range PBFs available (78 range requests).
@mikemorris answering some of your questions: "we'll need to build separate common glyphs for korean and japanese?"
From http://en.wikipedia.org/wiki/CJK_characters: the Korean and Japanese writing systems often include Chinese characters for a kind of "old-school" writing, and then have their own character/alphabet systems for the majority of normal use. For example, in Korea newspaper headlines often have some CJK characters, while the article itself will be in the Hangul character set.

Ideally we do not need to build separate common CJK glyph range PBFs for other languages that use CJK. We would only need to do this if the frequency distribution of CJK character usage in Korean or Japanese is very different from Chinese. Let's stick with a single set of common CJK for now.

Korean: Hangul

There is the separate issue of whether Korean Hangul (a very different character set, and also a very large portion of Unicode) needs a common glyph frequency analysis. It is a big character range in Unicode (http://jrgraphix.net/r/Unicode/AC00-D7AF), though not nearly as huge as CJK.

I did a quick analysis of this using a Seoul OSM extract, and recall from the results that a 256-512 set of common glyphs would likely have a very good impact. I haven't run an analysis that covers North/South Korea, and I think that would be the next step here.

Japanese: Hiragana + Katakana

I am less familiar with Japanese, but these are two character sets that do not cause as many headaches as CJK/Hangul: http://jrgraphix.net/r/Unicode/3040-309F, http://jrgraphix.net/r/Unicode/30A0-30FF
Emailed [email protected] asking for a Korea (North and South) sub-region extract at http://download.geofabrik.de/asia.html. Is this something @joto could help out with?
@mikemorris - per chat, use the Overpass API to download extracts. You can create a download URL with a bbox and curl it. @ajashton says there's also http://overpass-turbo.eu/, where you can use the wizard to easily build complex queries.
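A sketch of what that bbox download could look like from Node, using the public overpass-api.de map endpoint; the Seoul bbox and output filename here are only illustrative:

```js
// Fetch an OSM extract for a bounding box from the Overpass API
// and write it to disk. Bbox order is west,south,east,north.
var https = require('https');
var fs = require('fs');

var bbox = '126.76,37.42,127.18,37.70'; // rough Seoul bbox, illustrative
var url = 'https://overpass-api.de/api/map?bbox=' + bbox;

https.get(url, function (res) {
  res.pipe(fs.createWriteStream('seoul.osm'));
});
```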
Thanks @lxbarth, got extracts of North Korea and South Korea to analyze now.
http://extract.bbbike.org/ is another resource for creating extracts, recommended by the folks at Geofabrik.
Per chat with @nickidlugash, should we possibly do another analysis for traditional Chinese characters used in Taiwan, Hong Kong and Macau?
Start by adding more tile fixtures for these areas and running an analysis of what the request profile looks like using the existing common-cjk ranges. If these fare badly, or worse than the current tests, then yes, I think you should look into finding out why.
Committed results of a full analysis of North and South Korea. North Korea has very limited OSM coverage, but the total number of glyphs in use ended up being just 445. It also appears that none of the glyphs in the Hangul Compatibility Jamo block are used.
Should we expand the OSM CJK common analysis to include all the CJK ranges? |
@yhahn These look significantly worse than mainland China to me; I think we'll be needing a separate common set for Hangul.
Adding all Hangul Unicode ranges:
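For reference, the Hangul blocks involved, as a sketch (range values are from the Unicode charts; the variable name is mine):

```js
// Unicode blocks covering Hangul, per the Unicode charts.
var HANGUL_RANGES = [
  [0x1100, 0x11FF], // Hangul Jamo
  [0x3130, 0x318F], // Hangul Compatibility Jamo
  [0xA960, 0xA97F], // Hangul Jamo Extended-A
  [0xAC00, 0xD7AF], // Hangul Syllables
  [0xD7B0, 0xD7FF]  // Hangul Jamo Extended-B
];
```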
Need to add some tile fixtures for Japan, but here are the current results. Switching primary sorting to Unicode index instead of frequency adds 3-5 ranges to each, so these results sort on index only when frequencies are equal. The extraneous ranges (CJK Symbols, Bopomofo for Taiwan, etc.) are pretty well arranged within the Unicode spec, so adding them to the CJK common set just ended up bloating it and causing more ranges to be loaded unnecessarily.
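The sort described here might look like the following sketch, where freq is a codepoint-to-count map built from the fixtures (names are mine):

```js
// Sort codepoints by descending frequency; fall back to ascending
// Unicode index only when two frequencies are equal.
function sortByFrequency(codepoints, freq) {
  return codepoints.slice().sort(function (a, b) {
    var byFreq = (freq[b] || 0) - (freq[a] || 0);
    return byFreq !== 0 ? byFreq : a - b;
  });
}
```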
Compressing into a single common set for each yields this:
Trimming the common sets:
Best attempt so far at splitting the common set.
The only real way to fix Taiwan is to not trim at all.
Added test fixtures for Japan:
Interesting that there appear to be requests for characters in the High Surrogates and Low Surrogates ranges, which all appear as � to me.
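One plausible (unconfirmed) source of those requests: iterating a JavaScript string by UTF-16 code unit splits astral-plane characters, such as rare CJK Extension B ideographs, into two surrogate halves, and each half then looks like a standalone "character" in the D800-DFFF blocks. A quick sketch of the difference:

```js
var s = '𠀀'; // U+20000, first CJK Unified Ideographs Extension B character

// Per UTF-16 code unit: two surrogate halves, neither a valid character.
for (var i = 0; i < s.length; i++) {
  console.log(s.charCodeAt(i).toString(16)); // "d840", then "dc00"
}

// Per code point (ES2015): the actual character.
console.log(s.codePointAt(0).toString(16)); // "20000"
```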
TypeKit went with "dynamic subsetting", allowing requests for any number and combination of glyphs rather than predefined blocks.
http://blog.typekit.com/2015/06/15/announcing-east-asian-web-font-support/ It also sounds like this was a really easy issue for them to solve:
Super interesting read, thanks for posting!
Hi @mikemorris, what is the current schedule for this? It is in great demand for me. And is there any way to take part?
@yhahn @mikemorris Thank you for your answer. Could you tell me the exact details? Developing webGL.js (JavaScript) in web pages?
TL;DR
Notes from weekend analysis (source committed here: https://github.com/mapbox/fontserver/tree/char-spec/spec).
Background
Prep
- spec/fixtures/: I've added 12 z14 vector tiles of major chinese cities.
- cjk-osm.json: a dump of character frequencies from the name tags in those OSM fixtures.
- cjk-modern.json: a comparison frequency list of modern Chinese character usage.
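A minimal sketch of the frequency dump step, assuming the name strings have already been pulled out of the decoded tiles (function and variable names are mine, not the script's):

```js
// Build a codepoint -> occurrence-count map from name tags,
// roughly the shape of what gets dumped to cjk-osm.json.
function frequencies(names) {
  var freq = {};
  names.forEach(function (name) {
    for (var i = 0; i < name.length; i++) {
      var cp = name.charCodeAt(i);
      freq[cp] = (freq[cp] || 0) + 1;
    }
  });
  return freq;
}
```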
Approach
Assuming any character that falls within the 4096 most frequently used characters can be grabbed from a set of common CJK PBFs, we can analyze the range count for our 12 vector tiles:
The scripts assume we split the cjk-common PBF into chunks of 256. It looks like you will want most, if not all, of these 4096 characters as a baseline, always, all the time (if you run the script with only 1 or 2 tiles you will often end up with 12-16 of the common ranges). You basically end up grabbing the common characters no matter what.
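A minimal sketch of that counting logic, assuming common is a Set holding the top-4096 codepoints and codepoints holds the characters found in one tile's name tags (again, names are mine):

```js
// Count how many 256-glyph range PBFs a tile still needs once the
// common set is factored out as a single always-loaded pack.
function countRanges(codepoints, common) {
  var needed = {};
  codepoints.forEach(function (cp) {
    if (!common.has(cp)) needed[Math.floor(cp / 256)] = true;
  });
  return Object.keys(needed).length; // excludes the cjk-common pack itself
}
```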
So, combining the common glyphs into a single pack and eliminating the non-CJK ranges from the list to reduce noise, we're down to:
Conclusion + questions
Overall, having a cjk-common glyph pack that takes precedence when a character falls into a fixed list of the n most frequent characters (I picked 4096 based on a few runs; 3000 is not enough, 5000 has diminishing returns) looks like a good approach. You'll have the most commonly used characters loaded and cached, and will fire off requests as normal for other ranges as you hit less common characters.
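In client terms, the precedence rule would be something like this sketch (the PBF naming is hypothetical; per the note below, the real endpoints would need to be specced and versioned):

```js
// Resolve which glyph PBF to request for a codepoint: the common pack
// wins if the character is in the fixed top-4096 list, otherwise fall
// back to the normal 256-glyph range PBF.
function pbfFor(cp, commonSet) {
  if (commonSet.has(cp)) return 'cjk-common.pbf';
  var start = Math.floor(cp / 256) * 256;
  return start + '-' + (start + 255) + '.pbf';
}
```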
The discrepancy between OSM's top characters and other analyses' top characters is worth some study, though. Note that once we set this list, changing it is very painful: it will affect any implementations using these endpoints, and likely means we need to spec and version any endpoints around this strictly.
Next actions
cc @mikemorris @kkaefer @ansis @nickidlugash