-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implements Traditional Chinese Characters to Pinyin Support #16
Implements Traditional Chinese Characters to Pinyin Support #16
Conversation
Generated a new dictionary file via the Unihan database, using the kMandarin Reading field as pinyin phonetic keys
Hi Yun-yu Shen (@ujamer), Many thanks for your PR. The file you updated indeed seems to work well for both Simplified as well as Traditional Chinese, so that certainly is a big improvement. I do not care too much for the downside you mention, as if we update the plugin, it will only affect the slugs going forward, so the chance on conflicts is minimal and does not outweigh the advantage of now being able to use the plugin for both Traditional and Simplified Chinese. What I do care about is the non-existent characters. Not only do they don't show in the browser, I also don't get them to show in my text-editor(s). Would there be an issue removing all those you think? |
Hm. The only issue I can think of would be for those users who are using more complete sets of Chinese fonts, these characters would in fact render correctly on their machines, but the plugin would then fail to transform them into pinyin. I think this mainly occurs when the user is dealing with centuries old Chinese documents, poetry, and such. I don't mind removing them, since it's unlikely that I'd personally hit that particular use case, but it's something to keep in mind. |
Understood, let's take our chances on that.
Yes, please do. After that I can merge your PR :) |
Removed all unrendered characters that are missing from system fonts. (Tested on macOS Catalina)
Done. I had to remove the characters manually, because when I tried to filter them with apple's system fonts, (ie, Heiti and PingFang) my script ended up removing many rare characters that are actually rendered in Firefox. A bit of a head-scratcher on that one. I suppose this means that things might be different under Windows as well, depending on whats available in the system fonts. Let me know if I missed anything. |
Many thanks for your initiative Yun-yu Shen! |
No problem. Thanks for the plugin! |
* date: September 16, 2020 * dictionary update: UniHan database using the kMandarin Reading field as pinyin phonetic keys (https://unicode.org/charts/unihan.html); with many thanks to [Yun-yu Shen @ujamer](#16) * edits in readme files and on Settings page
* date: September 16, 2020 * dictionary update: UniHan database using the kMandarin Reading field as pinyin phonetic keys (https://unicode.org/charts/unihan.html); with many thanks to [Yun-yu Shen @ujamer](#16) * edits in readme files and on Settings page
Hey there,
Are you accepting PRs at this time? I've got a website here that requires Traditional Chinese support, and when I tried your plugin, a lot characters are lost on the slug output. So I went ahead and updated the dictionary file with Traditional Chinese characters. I generated the new dictionary file from the Unicode Consortium's Unihan database, with kMandarin Reading field as the pinyin keys. I've tested the new data set on my site, and it appears to work well.
The downside of this PR is that some of the characters fell under different pinyin keys, and it's probably because the source of the original data set used an alternate pronunciation compared to the Unihan database; thus the new slugs would give different pinyin for some characters compared to the old slugs. I suppose we can maintain "backwards compatibility" by searching for misplaced characters in the new dictionary; but I've found that most, if not all of these differences are context sensitive, and the correctness of the pinyin depends on how it's used in a particular phrase. Given that we don't have pinyin conversion on a phrase level here, and only on a character level, it's really difficult to gauge the correctness for either data set. Unicode.org does claim that kMandarin is the best pronunciation to use with Mandarin, so it's got that going for it.
There is also a ton of rarely used character in the Unihan data set, many of which my browser won't display at all, so the file size is about 5 times larger than before. I'm not sure about the performance impact of the extra character set, but from my limited testing, I don't feel much of a difference with my test strings.
Let me know if you have any concerns.