CJK char frequency #36

yhahn · 2014-06-01T16:46:08Z

TL;DR

Good news: there are definitely commonly used Chinese characters that we can bundle together into special glyph PBFs.
Bad news: this amounts to about 3000-4000 characters, which means a ~2-3MB overhead, and it's not likely we can reduce this overhead much.
Other good news: this approach will likely work for Hangul (Korean), which has similar size/range waste issues, though not as bad as CJK.

Notes from weekend analysis (source committed here https://github.com/mapbox/fontserver/tree/char-spec/spec).

Background

There are a lot of Chinese characters -- "Chinese characters are theoretically an open set" -- and a modern dictionary puts the number at 100,000+.
The problem we're dealing with is pretty well discussed especially in relation to learning chinese. Most discussions put the "99%" use character count at around 3000-4000 chars (e.g. http://www.commonchinesecharacters.com/Lists/MostCommon2500ChineseCharacters).

Prep

Sample of text that we need to render. In spec/fixtures/ I've added 12 z14 vector tiles of major chinese cities.
4096 most frequent chinese characters, in all name tags from OSM. This is dumped to cjk-osm.json.
4096 most frequent chinese characters, from an academic analysis of modern chinese character frequency (http://lingua.mtsu.edu/chinese-computing/statistics/). This is dumped to cjk-modern.json.
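
A minimal sketch of how a frequency list like cjk-osm.json could be derived from a set of name tags. This is not the committed script; the function name `topCjkChars` and the sample input are made up for illustration, and it only counts code points in the CJK Unified Ideographs block (U+4E00-U+9FFF, i.e. 19968-40959).

```javascript
// Hypothetical helper: count CJK Unified Ideographs in a list of name strings
// and return the n most frequent code points, most frequent first.
function topCjkChars(names, n) {
  const counts = new Map();
  for (const name of names) {
    for (const ch of name) { // for...of iterates by code point, not UTF-16 unit
      const code = ch.codePointAt(0);
      if (code >= 0x4e00 && code <= 0x9fff) {
        counts.set(code, (counts.get(code) || 0) + 1);
      }
    }
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1]) // highest count first
    .slice(0, n)
    .map(([code]) => code);
}

// Made-up sample input, not real OSM data.
const sampleNames = ['北京市', '北京站', '南京'];
console.log(topCjkChars(sampleNames, 2).map((c) => String.fromCodePoint(c)));
```

For the real lists, n would be 4096 and the input would be every name tag in the extract.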

Approach

Assuming any character that falls within the 4096 most frequent characters can be grabbed from a set of CJK common PBFs, we can analyze the range count for our 12 vector tiles:

none (78 ranges)
[ '0-255',
  '19968-20223',
  '20224-20479',
  '20736-20991',
  '20992-21247',
  '21248-21503',
  '21504-21759',
  '21760-22015',
  '22016-22271',
  '22272-22527',
  '22528-22783',
  '22784-23039',
  '23040-23295',
  '23296-23551',
  '23552-23807',
  '23808-24063',
  '24064-24319',
  '24320-24575',
  '24576-24831',
  '24832-25087',
  '25088-25343',
  '25344-25599',
  '256-511',
  '25600-25855',
  '25856-26111',
  '26112-26367',
  '26368-26623',
  '26624-26879',
  '26880-27135',
  '27136-27391',
  '27392-27647',
  '27648-27903',
  '27904-28159',
  '28160-28415',
  '28416-28671',
  '28672-28927',
  '28928-29183',
  '29184-29439',
  '29440-29695',
  '29696-29951',
  '29952-30207',
  '30208-30463',
  '30464-30719',
  '30720-30975',
  '30976-31231',
  '31232-31487',
  '31488-31743',
  '31744-31999',
  '32000-32255',
  '32256-32511',
  '32512-32767',
  '32768-33023',
  '33280-33535',
  '33536-33791',
  '33792-34047',
  '34048-34303',
  '34304-34559',
  '34560-34815',
  '34816-35071',
  '35072-35327',
  '35584-35839',
  '35840-36095',
  '36096-36351',
  '36608-36863',
  '36864-37119',
  '37120-37375',
  '37888-38143',
  '38144-38399',
  '38400-38655',
  '38656-38911',
  '38912-39167',
  '39168-39423',
  '39424-39679',
  '39936-40191',
  '40448-40703',
  '40704-40959',
  '8192-8447',
  '8448-8703' ]

osm (26 ranges)
[ '0-255',
  '20224-20479',
  '20992-21247',
  '256-511',
  '28416-28671',
  '29440-29695',
  '29952-30207',
  '33536-33791',
  '38912-39167',
  '8192-8447',
  '8448-8703',
  'cjk-common-0',
  'cjk-common-1',
  'cjk-common-10',
  'cjk-common-11',
  'cjk-common-12',
  'cjk-common-13',
  'cjk-common-14',
  'cjk-common-2',
  'cjk-common-3',
  'cjk-common-4',
  'cjk-common-5',
  'cjk-common-6',
  'cjk-common-7',
  'cjk-common-8',
  'cjk-common-9' ]

modern (34 ranges)
[ '0-255',
  '22272-22527',
  '256-511',
  '26880-27135',
  '27136-27391',
  '27648-27903',
  '27904-28159',
  '28416-28671',
  '29184-29439',
  '29952-30207',
  '30976-31231',
  '32512-32767',
  '33280-33535',
  '33536-33791',
  '34304-34559',
  '39424-39679',
  '8192-8447',
  '8448-8703',
  'cjk-common-0',
  'cjk-common-1',
  'cjk-common-10',
  'cjk-common-11',
  'cjk-common-12',
  'cjk-common-13',
  'cjk-common-14',
  'cjk-common-15',
  'cjk-common-2',
  'cjk-common-3',
  'cjk-common-4',
  'cjk-common-5',
  'cjk-common-6',
  'cjk-common-7',
  'cjk-common-8',
  'cjk-common-9' ]

The scripts assume we split up the cjk-common PBF into chunks of 256. It looks like you will want most if not all of these 4096 characters as a baseline, always, all the time (if you run the script with 1 or 2 tiles alone you will often end up with 12-16 of the common ranges). You basically end up grabbing the common characters no matter what.
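
Roughly, the range counting described above can be sketched like this. It is an illustration under assumptions, not the script in the repo: `rangeFor` and `rangesForText` are hypothetical names, and the common list is assumed to be an array of code points ordered by frequency. A character in the common list maps to `cjk-common-<chunk>`; anything else maps to its normal 256-codepoint range.

```javascript
// Name of the PBF a code point would be requested from.
// commonIndex: Map of code point -> position in the frequency-ordered common list.
function rangeFor(code, commonIndex, chunkSize) {
  if (commonIndex.has(code)) {
    return 'cjk-common-' + Math.floor(commonIndex.get(code) / chunkSize);
  }
  const start = Math.floor(code / 256) * 256;
  return start + '-' + (start + 255);
}

// Distinct ranges needed to render `text`, sorted as strings.
function rangesForText(text, common, chunkSize = 256) {
  const commonIndex = new Map(common.map((code, i) => [code, i]));
  const ranges = new Set();
  for (const ch of text) {
    ranges.add(rangeFor(ch.codePointAt(0), commonIndex, chunkSize));
  }
  return Array.from(ranges).sort();
}

// Made-up input: two "common" characters and a label with one uncommon one.
const commonList = [0x4eac, 0x5317];
const label = String.fromCodePoint(0x5317, 0x4eac, 0x20, 0x7ad9);
console.log(rangesForText(label, commonList));
// [ '0-255', '31232-31487', 'cjk-common-0' ]
```

With 4096 common characters and chunks of 256 this yields at most 16 common ranges (cjk-common-0 through cjk-common-15).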

So combining the common glyphs into a single pack and eliminating the non cjk ranges from the list to reduce noise, we're down to:

osm (8 ranges)
[ '20224-20479',
  '20992-21247',
  '28416-28671',
  '29440-29695',
  '29952-30207',
  '33536-33791',
  '38912-39167',
  'cjk-common' ]

modern (15 ranges)
[ '22272-22527',
  '26880-27135',
  '27136-27391',
  '27648-27903',
  '27904-28159',
  '28416-28671',
  '29184-29439',
  '29952-30207',
  '30976-31231',
  '32512-32767',
  '33280-33535',
  '33536-33791',
  '34304-34559',
  '39424-39679',
  'cjk-common' ]

Conclusion + questions

Overall having a cjk-common glyph that takes precedence if a character falls into a fixed list of n (I picked 4096 based on a few runs -- 3000 not enough, 5000 has diminishing returns) looks like a good approach. You'll have most commonly used characters loaded and cached and will fire off requests as normal for other ranges as you hit less common characters.

The discrepancy between OSM's top characters and other analysis' top characters is worth some study though. Note that once we set this list changing it is very painful. It will affect any implementations using these endpoints and likely means we need to spec and version any endpoints around this strictly.

"Place language" differs from normal usage significantly in english, german, etc. This is likely (?) the case for Chinese as well and could lead to this discrepancy.
OSM data quality in China is an unknown to me. Setting a character frequency list based on the current state of OSM could be a bad idea without some overall confidence that current OSM data in China is representative of Chinese-language maps as a whole and of future OSM/map data.

Next actions

Similar analysis for Hangul
Branch of fontserver that can generate a PBF from a charlist rather than a start/end range
Branch of llmr that uses a cjk charlist to request the common CJK PBF and falls back to normal range glyphs otherwise
Test IRL
Consult/further research on chinese char freq questions

cc @mikemorris @kkaefer @ansis @nickidlugash


yhahn · 2014-06-01T17:08:45Z

Added to the test above a first example of what happens when there are no common character range PBFs around (78 range requests).

yhahn · 2014-06-19T16:30:33Z

@mikemorris answering some of your questions:

we'll need to build separate common glyphs for korean and japanese?

These languages all have a shared characteristic: Their writing systems all completely or partly use Chinese characters — Hànzì in Chinese, kanji in Japanese, hanja in Korean, and Chữ Nôm in Vietnamese. Chinese is written in Chinese characters only and requires approximately 4,000 characters for general literacy although there are up to 40,000 characters for reasonably complete coverage.

From http://en.wikipedia.org/wiki/CJK_characters

Basically: Korean and Japanese writing systems often include Chinese characters for kind of "old-school" writing, and then have their own character/alphabet systems for the majority of normal use. For example, in Korea newspaper headlines often have some CJK characters, and then the article will be in the Hangul character set.

Ideally we do not need to build separate common CJK glyph range PBFs for other languages that use CJK. We would only need to do this if the frequency distribution of character usage in Korean or Japanese of CJK characters is very different from Chinese. Let's stick with a single set of common CJK for now.

Korean: Hangul

There is the separate issue of whether Korean Hangul (a very different character set and also very large portion of unicode) needs a common glyph freq analysis. It is a big character range in unicode (http://jrgraphix.net/r/Unicode/AC00-D7AF) though not nearly as huge as CJK:

> parseInt('d7af',16) - parseInt('ac00',16);
11183

I did a quick analysis of this using a seoul OSM extract and recall from the results that likely a 256-512 set of common glyphs would have a very good impact. I haven't run an analysis that covers North/South Korea and that is what I think would be the next step here.

Japanese: Hiragana + Katakana

I am less familiar with Japanese, but these are two character sets that do not cause as many headaches as CJK/Hangul: http://jrgraphix.net/r/Unicode/3040-309F, http://jrgraphix.net/r/Unicode/30A0-30FF

> parseInt('30ff',16) - parseInt('3040',16);
191
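
To put the block sizes side by side, here is a small sketch that computes, for each Unicode block mentioned in this thread, the inclusive number of code points and how many 256-codepoint ranges the block spans. Note that the parseInt subtractions above give end minus start, so the inclusive counts are one higher (11184 and 192). The `blocks` table and `blockStats` name are just for illustration.

```javascript
// Unicode block boundaries for the scripts discussed in this thread.
const blocks = [
  { name: 'CJK Unified Ideographs', start: 0x4e00, end: 0x9fff },
  { name: 'Hangul Syllables', start: 0xac00, end: 0xd7af },
  { name: 'Hiragana', start: 0x3040, end: 0x309f },
  { name: 'Katakana', start: 0x30a0, end: 0x30ff },
];

// For each block: inclusive code point count and number of 256-wide
// glyph ranges (0-255, 256-511, ...) that the block touches.
const blockStats = blocks.map((b) => ({
  name: b.name,
  size: b.end - b.start + 1,
  ranges: Math.floor(b.end / 256) - Math.floor(b.start / 256) + 1,
}));

for (const s of blockStats) {
  console.log(s.name, s.size, s.ranges);
}
// CJK Unified Ideographs 20992 82
// Hangul Syllables 11184 44
// Hiragana 96 1
// Katakana 96 1
```

Hiragana and Katakana both sit inside the single 12288-12543 range, which is why they are much less of a problem than CJK or Hangul.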

mikemorris · 2014-06-19T17:36:30Z

Emailed [email protected] asking for a Korea (North and South) sub region extract at http://download.geofabrik.de/asia.html, is this something @joto could help out with?

lxbarth · 2014-06-19T18:58:21Z

@mikemorris - per chat, use the Overpass API to download extracts. You can create a download URL with bbox and curl it.

@ajashton says there's also http://overpass-turbo.eu/ where you can use the wizard to easily make complex queries, eg place=city in "North Korea".
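
For reference, a sketch of building such a download URL. The helper name `overpassUrl` and the bbox values are made up, and the query only fetches named nodes to keep it short (ways and relations carry name tags too). It assumes the public interpreter endpoint at overpass-api.de; Overpass QL bounding boxes are given as (south, west, north, east).

```javascript
// Hypothetical helper: build an Overpass API URL that returns all nodes
// with a name tag inside a bounding box.
function overpassUrl(bbox, endpoint = 'https://overpass-api.de/api/interpreter') {
  const { south, west, north, east } = bbox;
  const query = `[out:xml];node["name"](${south},${west},${north},${east});out;`;
  return endpoint + '?data=' + encodeURIComponent(query);
}

// Illustrative bbox only; not the extent actually used for the analysis.
console.log(overpassUrl({ south: 37, west: 126, north: 38, east: 128 }));
```

The printed URL can be passed to curl (quoted) to save the extract to a file.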

mikemorris · 2014-06-19T22:47:32Z

Thanks @lxbarth, got extracts of North Korea and South Korea to analyze now.

mikemorris · 2014-06-20T01:58:37Z

http://extract.bbbike.org/ is another resource for creating extracts, recommended by the folks at geofabrik

mikemorris · 2014-06-23T17:59:42Z

Per chat with @nickidlugash, should we possibly do another analysis for traditional Chinese characters used in Taiwan, Hong Kong and Macau?

yhahn · 2014-06-23T18:02:30Z

Start by adding more tile fixtures for these areas to

https://github.com/mapbox/node-fontnik/tree/char-spec/spec/fixtures

and running analysis of what the request profile looks like using the existing common-cjk ranges. If these fare badly/worse than the current tests, then yes, I think you should look into finding out why.

mikemorris · 2014-06-23T19:15:04Z

[For Hangul], a 256-512 set of common glyphs would have a very good impact

Committed results of a full analysis of North and South Korea. North Korea has very limited OSM coverage, but the entirety of glyphs in use ended up being just 445. It also appears that none of the glyphs in the Hangul Compatibility Jamo are used.

mikemorris · 2014-06-23T19:28:54Z

Should we expand the OSM CJK common analysis to include all the CJK ranges?

mikemorris · 2014-06-24T16:15:59Z

If these fare badly/worse than the current tests, then yes, I think you should look into finding out why.

@yhahn These look significantly worse than mainland China to me, I think we'll be needing a cjk-traditional common set as well.

mikemorris · 2014-06-24T17:28:34Z

About 30% of simplified Chinese characters match simplified kanji (see shinjitai).[28] This makes it easier for people who know simplified characters to be able to read and understand Japanese kanji. For example, the character 国 (country) is written the same way in Japanese (国) although in traditional Chinese it is 國. However, those who understand traditional Chinese will understand a much greater proportion of Japanese Kanji, as the current standard Japanese character set is much more similar to traditional Chinese.
https://en.wikipedia.org/wiki/Debate_on_traditional_and_simplified_Chinese_characters#Aesthetics

mikemorris · 2014-06-25T15:31:03Z

Adding all Hangul Unicode ranges:

mikemorris · 2014-06-26T00:04:03Z

Need to add some tile fixtures for Japan, but here are current results. Switching primary sorting to Unicode index instead of frequency adds 3-5 ranges to each, so these results only sort on index if frequency is equal. The extraneous ranges (CJK Symbols, Bopomofo for Taiwan, etc) are pretty well arranged within the Unicode spec, so adding them to the CJK common set just ended up bloating it and causing more ranges to be loaded unnecessarily.
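
To make the two sort orders concrete, a small sketch with made-up tallies. The comparator names are hypothetical; the first one sorts by frequency and only falls back to the Unicode index when counts are equal, the second sorts purely by index.

```javascript
// Made-up tallies: code point plus how often it appeared.
const tallies = [
  { code: 0x5317, count: 3 },
  { code: 0x4eac, count: 2 },
  { code: 0x5357, count: 2 },
];

// Frequency first (descending); Unicode index only breaks ties.
const byFrequencyThenIndex = (a, b) => b.count - a.count || a.code - b.code;

// Unicode index as the primary (and only) key.
const byIndex = (a, b) => a.code - b.code;

const codes = (list) => list.map((e) => e.code.toString(16));
console.log(codes(tallies.slice().sort(byFrequencyThenIndex))); // [ '5317', '4eac', '5357' ]
console.log(codes(tallies.slice().sort(byIndex)));              // [ '4eac', '5317', '5357' ]
```

The order matters once the list is cut into fixed-size chunks, since it decides which characters end up sharing a chunk.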

china
none (78 ranges)
cjk-osm (26 ranges)
cjk-modern (34 ranges)
hangul-osm (78 ranges)

taiwan
none (90 ranges)
cjk-osm (33 ranges)
cjk-modern (106 ranges)
hangul-osm (90 ranges)

hong-kong
none (86 ranges)
cjk-osm (25 ranges)
cjk-modern (102 ranges)
hangul-osm (86 ranges)

macau
none (82 ranges)
cjk-osm (23 ranges)
cjk-modern (79 ranges)
hangul-osm (82 ranges)

north-korea
none (34 ranges)
cjk-osm (34 ranges)
cjk-modern (34 ranges)
hangul-osm (5 ranges)

south-korea
none (45 ranges)
cjk-osm (45 ranges)
cjk-modern (45 ranges)
hangul-osm (8 ranges)

mikemorris · 2014-06-26T00:11:28Z

Compressing into a single common set for each yields this:

china
none (78 ranges)
cjk-osm (5 ranges)
cjk-modern (34 ranges)
hangul-osm (78 ranges)

taiwan
none (90 ranges)
cjk-osm (9 ranges)
cjk-modern (106 ranges)
hangul-osm (90 ranges)

hong-kong
none (86 ranges)
cjk-osm (5 ranges)
cjk-modern (102 ranges)
hangul-osm (86 ranges)

macau
none (82 ranges)
cjk-osm (3 ranges)
cjk-modern (79 ranges)
hangul-osm (82 ranges)

north-korea
none (34 ranges)
cjk-osm (34 ranges)
cjk-modern (34 ranges)
hangul-osm (2 ranges)

south-korea
none (45 ranges)
cjk-osm (45 ranges)
cjk-modern (45 ranges)
hangul-osm (4 ranges)

mikemorris · 2014-06-26T16:36:01Z

Trimming cjk-common from 6405 to 4096 hits Taiwan REALLY hard, with a moderate impact on the rest. This could possibly be alleviated with a cjk-traditional-common set for Taiwan, Hong Kong and Macau.

china cjk-osm (15 ranges)
taiwan cjk-osm (71 ranges)
hong-kong cjk-osm (31 ranges)
macau cjk-osm (9 ranges)

Trimming hangul-common from 1110 to 1024 is manageable, but still adds a few ranges.

north-korea hangul-osm (2 ranges)
south-korea hangul-osm (7 ranges)

mikemorris · 2014-06-26T20:12:38Z

Best attempt so far at splitting cjk-common isn't all that impressive, feels like I'm just spinning wheels here and that even the best solution here is still incredibly fragile.

In this test, cjk-osm is built only from the China OSM extract, and cjk-extended-osm is built from China, Taiwan and Japan extracts deduped against cjk-osm.

cjk-combined-osm is the union of cjk-osm and cjk-extended-osm.
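
In set terms, the construction looks roughly like this. The helper names `dedupeAgainst` and `union` and the tiny input arrays are illustrative only; the real lists hold thousands of code points.

```javascript
// Entries of `candidates` that are not already in `base`, order preserved.
function dedupeAgainst(candidates, base) {
  const seen = new Set(base);
  return candidates.filter((code) => !seen.has(code));
}

// All distinct entries of a followed by b, order preserved.
function union(a, b) {
  return Array.from(new Set([...a, ...b]));
}

// Made-up stand-ins for the real lists.
const cjkOsm = [0x4eac, 0x5317, 0x56fd];          // built from China only
const widerExtracts = [0x5317, 0x570b, 0x6771];   // China + Taiwan + Japan
const cjkExtended = dedupeAgainst(widerExtracts, cjkOsm);
const cjkCombined = union(cjkOsm, cjkExtended);

console.log(cjkExtended.length, cjkCombined.length); // 2 5
```

Because the extended list is deduped against the base list, the combined size is simply the sum of the two.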

range sizes
cjk-osm 4096
cjk-extended-osm 2048
cjk-combined-osm 6144
hangul-osm 1024

china
cjk-osm (10 ranges)
cjk-extended-osm (79 ranges)
cjk-combined-osm (7 ranges)

taiwan
cjk-osm (85 ranges)
cjk-extended-osm (91 ranges)
cjk-combined-osm (17 ranges)

hong-kong
cjk-osm (20 ranges)
cjk-extended-osm (87 ranges)
cjk-combined-osm (8 ranges)

macau
cjk-osm (6 ranges)
cjk-extended-osm (83 ranges)
cjk-combined-osm (4 ranges)

north-korea
hangul-osm (2 ranges)

south-korea
hangul-osm (7 ranges)

mikemorris · 2014-06-26T20:17:48Z

Only real way to fix Taiwan is to not trim at all.

range sizes
cjk-osm 4096
cjk-extended-osm 2309
cjk-combined-osm 6405

china
cjk-osm (10 ranges)
cjk-extended-osm (79 ranges)
cjk-combined-osm (6 ranges)

taiwan
cjk-osm (85 ranges)
cjk-extended-osm (91 ranges)
cjk-combined-osm (10 ranges)

hong-kong
cjk-osm (20 ranges)
cjk-extended-osm (87 ranges)
cjk-combined-osm (6 ranges)

macau
cjk-osm (6 ranges)
cjk-extended-osm (83 ranges)
cjk-combined-osm (4 ranges)

mikemorris · 2014-06-26T21:41:25Z

Added test fixtures for Japan:

japan
none (95 ranges)
cjk-osm (14 ranges)
cjk-modern (111 ranges)
hangul-osm (95 ranges)
ranges [ '0-255',
  '1024-1279',
  '12288-12543',
  '13312-13567',
  '256-511',
  '55296-55551',
  '57088-57343',
  '64000-64255',
  '65280-65535',
  '8192-8447',
  '8448-8703',
  '8704-8959',
  '9472-9727',
  'cjk-common' ]

Interesting that there appear to be requests for characters in the High Surrogates and Low Surrogates ranges, which all appear as � to me.
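
One possible explanation (an assumption, not verified against the fixtures): if the analysis walks strings by UTF-16 code unit, any character outside the Basic Multilingual Plane is seen as two surrogate halves instead of one code point. The sketch below uses U+20BB7, a CJK Extension B ideograph, purely as an example; its halves happen to land in the two surrogate ranges listed above. `rangeOf` is a hypothetical helper.

```javascript
// 256-wide glyph range name for a numeric code.
const rangeOf = (code) => {
  const start = Math.floor(code / 256) * 256;
  return start + '-' + (start + 255);
};

const text = String.fromCodePoint(0x20bb7); // one character, two UTF-16 units

// Walking by code unit yields the high and low surrogate separately.
const unitRanges = [];
for (let i = 0; i < text.length; i++) {
  unitRanges.push(rangeOf(text.charCodeAt(i)));
}

// Walking by code point yields the real character.
const pointRanges = Array.from(text, (ch) => rangeOf(ch.codePointAt(0)));

console.log(unitRanges);  // [ '55296-55551', '57088-57343' ]
console.log(pointRanges); // [ '133888-134143' ]
```

If that is what is happening, iterating by code point would replace the two surrogate requests with a single request for the character's real range.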

mikemorris · 2015-08-05T16:07:48Z

TypeKit went with "dynamic subsetting", allowing requests for any number and combination of glyphs rather than predefined blocks.

Instead of redownloading an entirely new font, we can now simply request the additional glyphs, and perform the update right in your browser. Need one glyph? We can do that! And when you need another, no need to download the first again.

http://blog.typekit.com/2015/06/15/announcing-east-asian-web-font-support/
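
A toy sketch of the client-side bookkeeping this kind of scheme implies: remember which glyphs are already loaded and ask only for the ones still missing. This is not Typekit's implementation; the `GlyphTracker` class and its methods are invented for illustration, and the actual network request is left out.

```javascript
// Tracks which code points have been loaded so repeat requests can be skipped.
class GlyphTracker {
  constructor() {
    this.loaded = new Set();
  }

  // Distinct code points in `text` that are not loaded yet, ascending.
  missing(text) {
    const need = new Set();
    for (const ch of text) {
      const code = ch.codePointAt(0);
      if (!this.loaded.has(code)) need.add(code);
    }
    return Array.from(need).sort((a, b) => a - b);
  }

  // Record code points once their glyphs have arrived.
  markLoaded(codes) {
    for (const code of codes) this.loaded.add(code);
  }
}

const tracker = new GlyphTracker();
const first = tracker.missing('ab');   // both glyphs needed
tracker.markLoaded(first);
const second = tracker.missing('abc'); // only 'c' is new
console.log(first, second);
```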

It also sounds like this was a really easy issue for them to solve:

After many years — working across four teams, on three continents, and in five time zones — we are proud to announce that we’ve extended Typekit’s web font service to support Chinese, Japanese and Korean fonts.

kkaefer · 2015-08-06T09:46:55Z

Super interesting read, thanks for posting!

huangyingjie · 2016-11-04T06:05:59Z

Hi @mikemorris, what is the current schedule for this? It's something I need badly. Is there any way to take part?

glzcc · 2016-12-08T09:27:14Z

I downloaded the file. How do I use it?

jayantchens · 2017-06-22T03:29:30Z

@yhahn @mikemorris Thank you for your answer. Could you tell me the exact details? Developing webGL.js (JavaScript) in web pages?

yhahn mentioned this issue Jun 1, 2014

reduce glyph downloads mapbox/mapbox-gl-js#395

Closed

yhahn added a commit that referenced this issue Jun 1, 2014

Refactor to accept an array of character codes as input. Refs #36.

9a88b55

yhahn mentioned this issue Jun 1, 2014

Char array #37

Merged

This was referenced Jun 2, 2014

unicode-range glyph PBFs #23

Closed

Base font glyph + uncommon glyph system #17

Closed

yhahn mentioned this issue Jun 4, 2014

Try different glyph range scenarios mapbox/mapbox-gl-js#388

Closed

mikemorris self-assigned this Jun 18, 2014

mikemorris mentioned this issue Jun 3, 2015

Glyph bitmap overflow, disappearing labels on rotation mapbox/mapbox-gl-native#1681

Closed

mikemorris added the question label Jul 7, 2015

mikemorris mentioned this issue Jul 26, 2016

Testing glyph rendering capacity for CJK fonts mapbox/mapbox-gl-test-suite#126

Closed

2 tasks

mikemorris removed their assignment Oct 19, 2016

mourner mentioned this issue Oct 26, 2016

Improve loading times for maps with CJK text mapbox/mapbox-gl-js#3466

Closed

mourner mentioned this issue Dec 6, 2016

Map loading times are very slow for areas with CJK labels mapbox/mapbox-gl-js#3748

Closed

aswamina added the jira-sync-pending label Aug 19, 2022

svc-jira-github-workato bot added jira-sync-complete and removed jira-sync-pending labels Aug 26, 2022
