Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements Traditional Chinese Characters to Pinyin Support #16

Conversation

ujamer
Copy link
Contributor

@ujamer ujamer commented Sep 11, 2020

Hey there,

Are you accepting PRs at this time? I've got a website here that requires Traditional Chinese support, and when I tried your plugin, a lot characters are lost on the slug output. So I went ahead and updated the dictionary file with Traditional Chinese characters. I generated the new dictionary file from the Unicode Consortium's Unihan database, with kMandarin Reading field as the pinyin keys. I've tested the new data set on my site, and it appears to work well.

The downside of this PR is that some of the characters fell under different pinyin keys, and it's probably because the source of the original data set used an alternate pronunciation compared to the Unihan database; thus the new slugs would give different pinyin for some characters compared to the old slugs. I suppose we can maintain "backwards compatibility" by searching for misplaced characters in the new dictionary; but I've found that most, if not all of these differences are context sensitive, and the correctness of the pinyin depends on how it's used in a particular phrase. Given that we don't have pinyin conversion on a phrase level here, and only on a character level, it's really difficult to gauge the correctness for either data set. Unicode.org does claim that kMandarin is the best pronunciation to use with Mandarin, so it's got that going for it.

There is also a ton of rarely used character in the Unihan data set, many of which my browser won't display at all, so the file size is about 5 times larger than before. I'm not sure about the performance impact of the extra character set, but from my limited testing, I don't feel much of a difference with my test strings.

Let me know if you have any concerns.

Generated a new dictionary file via the Unihan database, using the kMandarin Reading field as pinyin phonetic keys
@senlin
Copy link
Owner

senlin commented Sep 14, 2020

Hi Yun-yu Shen (@ujamer),

Many thanks for your PR.

The file you updated indeed seems to work well for both Simplified as well as Traditional Chinese, so that certainly is a big improvement.

I do not care too much for the downside you mention, as if we update the plugin, it will only affect the slugs going forward, so the chance on conflicts is minimal and does not outweigh the advantage of now being able to use the plugin for both Traditional and Simplified Chinese.

What I do care about is the non-existent characters. Not only do they don't show in the browser, I also don't get them to show in my text-editor(s).

Would there be an issue removing all those you think?

@ujamer
Copy link
Contributor Author

ujamer commented Sep 15, 2020

Hm. The only issue I can think of would be for those users who are using more complete sets of Chinese fonts, these characters would in fact render correctly on their machines, but the plugin would then fail to transform them into pinyin. I think this mainly occurs when the user is dealing with centuries old Chinese documents, poetry, and such.

I don't mind removing them, since it's unlikely that I'd personally hit that particular use case, but it's something to keep in mind.

@senlin
Copy link
Owner

senlin commented Sep 15, 2020

I think this mainly occurs when the user is dealing with centuries old Chinese documents, poetry, and such.

Understood, let's take our chances on that.

I don't mind removing them

Yes, please do. After that I can merge your PR :)

Removed all unrendered characters that are missing from system fonts. (Tested on macOS Catalina)
@ujamer
Copy link
Contributor Author

ujamer commented Sep 15, 2020

Done.

I had to remove the characters manually, because when I tried to filter them with apple's system fonts, (ie, Heiti and PingFang) my script ended up removing many rare characters that are actually rendered in Firefox. A bit of a head-scratcher on that one. I suppose this means that things might be different under Windows as well, depending on whats available in the system fonts. Let me know if I missed anything.

@senlin senlin merged commit 63faa87 into senlin:master Sep 16, 2020
@senlin
Copy link
Owner

senlin commented Sep 16, 2020

Many thanks for your initiative Yun-yu Shen!

@ujamer
Copy link
Contributor Author

ujamer commented Sep 16, 2020

No problem. Thanks for the plugin!

senlin added a commit that referenced this pull request Sep 16, 2020
* date: September 16, 2020
* dictionary update: UniHan database using the kMandarin Reading field as pinyin phonetic keys (https://unicode.org/charts/unihan.html); with many thanks to [Yun-yu Shen @ujamer](#16)
* edits in readme files and on Settings page
senlin added a commit that referenced this pull request Sep 16, 2020
* date: September 16, 2020
* dictionary update: UniHan database using the kMandarin Reading field as pinyin phonetic keys (https://unicode.org/charts/unihan.html); with many thanks to [Yun-yu Shen @ujamer](#16)
* edits in readme files and on Settings page
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants