Better zh_TW and zh_CN conversion #18

Closed
tddschn opened this issue May 6, 2024 · 9 comments · Fixed by #20

@tddschn

tddschn commented May 6, 2024

Thank you for making this! As a native Hokkien speaker I find it very professionally done.

However, conversion between zh_TW and zh_* (to_traditional & to_simplified) should take context into account, meaning the word and the sentence a character appears in. Simple char-to-char mapping can be problematic in some cases.

https://github.com/BYVoid/OpenCC This library seems to be better at handling the subtleties of conversion.

@andreihar andreihar self-assigned this May 7, 2024
@andreihar
Owner

Thank you very much for your kind words!

Indeed, the current Simplified-to-Traditional converter doesn't handle cases where a single simplified char maps to multiple traditional chars. I've modified both the conversion dataset and the codebase. When tested on the Taibun dataset, the accuracy improved by 10% (2.17% higher than OpenCC's conversion), and currently it's 32% more efficient than OpenCC's conversion.

I'll think about how to further boost efficiency and I plan to release the new version by week's end at the latest.

I deeply appreciate your valuable feedback!
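
To make the one-to-many problem discussed above concrete, here is a minimal sketch of word-aware conversion using greedy longest-match lookup with a per-character fallback. It is only an illustration under stated assumptions, not Taibun's or OpenCC's actual implementation: `CHAR_MAP`, `WORD_MAP`, and `to_traditional_sketch` are hypothetical names, and the tiny tables are hand-written examples rather than entries from any real dataset.

```python
# Hypothetical toy tables for illustration only (not Taibun's dataset).
# Each simplified character here has more than one possible traditional form,
# so the per-character table can only hold one default choice.
CHAR_MAP = {"头": "頭", "发": "發", "后": "後", "干": "乾"}

# Word-level entries override the per-character default when context demands it.
WORD_MAP = {"头发": "頭髮", "皇后": "皇后", "干部": "幹部"}


def naive_char_convert(text, char_map=CHAR_MAP):
    """Plain char-to-char mapping: ignores the word a character sits in."""
    return "".join(char_map.get(ch, ch) for ch in text)


def to_traditional_sketch(text, word_map=WORD_MAP, char_map=CHAR_MAP):
    """Greedy longest-match: prefer known words, fall back to single chars."""
    max_len = max((len(word) for word in word_map), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate word starting at position i first.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in word_map:
                out.append(word_map[chunk])
                i += n
                break
        else:
            # No multi-character word matched: map one character and move on.
            out.append(char_map.get(text[i], text[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for sample in ["头发", "发展", "皇后", "以后"]:
        print(sample, naive_char_convert(sample), to_traditional_sketch(sample))
```

With these toy tables, the naive mapping turns 头发 into 頭發, while the word-aware version yields 頭髮. Words absent from the hypothetical `WORD_MAP`, such as 发展, still fall back to the per-character default.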

@tddschn
Author

tddschn commented May 8, 2024

Thank you Andrei!

How do you measure efficiency? Is it the execution time of the function?

@andreihar
Owner

Yes, I measure the time it takes to convert all items in words.json from Simplified to Traditional. The converter I've developed is designed to handle only the characters found in words.json rather than all Chinese characters, which accounts for its faster execution.
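
A timing comparison of this kind could be sketched roughly as follows. This is an assumption-laden illustration, not the benchmark actually used: `time_conversion` is a hypothetical helper, the converter is passed in as any callable, and the sample list stands in for the contents of words.json, whose structure is not shown in this thread.

```python
import time


def time_conversion(convert, items, repeats=5):
    """Return the best wall-clock time in seconds to convert every item once.

    `convert` is any callable taking a string; `items` is an iterable of
    strings. Taking the minimum over several repeats reduces timing noise.
    """
    items = list(items)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in items:
            convert(item)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Placeholder data and converter; swap in the real word list and the
    # converters being compared.
    sample_items = ["头发", "发展", "皇后"]
    elapsed = time_conversion(lambda text: text, sample_items)
    print(f"best of 5 runs: {elapsed:.6f} s")
```

Running the same helper over two converters with the same item list gives directly comparable timings.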

@andreihar andreihar linked a pull request May 10, 2024 that will close this issue
@tddschn
Author

tddschn commented May 10, 2024

@andreihar Thank you Andrei!

I made a simple Gradio app to make it easier for non-technical people to use Taibun (https://huggingface.co/spaces/tddschn/taibun-converter). Do you think you could include it in your README?

@andreihar
Owner

Sorry for the late reply! It seems GitHub doesn't notify about messages in closed issues.

The live demo of Taibun can currently be accessed via this link: https://taibun.vercel.app/. I plan to change domains for all my web projects very soon, which is why I don't link to it in the README yet. I hope I'll get to it in the near future.

@tddschn
Author

tddschn commented May 22, 2024 via email

@tddschn
Author

tddschn commented May 22, 2024 via email

@andreihar
Owner

I currently live in the Metro Vancouver area, so I have quite a lot of Taiwanese friends. Besides that, the main grammar resource I use is Taiwanese Grammar: A Concise Reference by Philip T. Lin. It's written in English and explains many grammar points by comparing them with both English and Mandarin grammar, which makes the Taiwanese language very easy to understand.

When it comes to Written Taiwanese, pretty much nobody knows it since in schools Taiwanese is taught primarily as a spoken language. When I ask my friends to translate something into Taiwanese, they will usually use iTaigi and the Taiwanese Ministry of Education Dictionary to find Chinese characters for Taiwanese words.

@tddschn
Author

tddschn commented May 31, 2024

Thank you Andrei!


2 participants