
Add support for other languages #26

Open
ryanphung opened this issue May 16, 2021 · 1 comment

ryanphung commented May 16, 2021

Hi @mechatroner, great work, I really love your extension. At the moment I'm evaluating whether I could extend it to support my learning of Chinese. I realize that parsing and highlighting Chinese are a bit different compared to English; a few things I'm contemplating:

  • The concepts of characters, words and idioms are slightly different. For example, in English "g", "o", "o", "d" are characters, "good" is a word, and "good morning" is an idiom. In Chinese, "早" is both a character and a word, "早上" is a word, and "早上好" is an idiom. So I figure this may influence what goes into the dictionary file and what goes into the idiom file. I tried to find documentation on how you decide what is considered a word and what is considered an idiom, but I couldn't find it. Do you have some documentation on this?
  • What is a "rare lemma"? I see that you use this in your algorithm, but I can't figure out whether it will apply to Chinese as well.
  • Chinese words are also not separated by spaces, so I think I might need to change how the tokenizer works to make it work for Chinese.
  • etc.

Hopefully I'll be able to figure these out! Meanwhile, a shoutout for your great work! Hope to see you around, @mechatroner.

mechatroner (Owner) commented

Hi @ryanphung!
Thanks for the feedback!

how you divide what is considered word and what is considered idiom

The rule is very simple: if an entry contains multiple tokens (i.e. it has one or more whitespace characters), it is an idiom; otherwise it is a word.
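
For illustration, here is a minimal TypeScript sketch of that rule (the function name `classifyEntry` is hypothetical, not from the extension's code); it also shows why whitespace-based classification breaks down for Chinese:

```typescript
// Classify a dictionary entry as a "word" or an "idiom" using the rule above:
// multiple whitespace-separated tokens => idiom, single token => word.
type EntryKind = "word" | "idiom";

function classifyEntry(entry: string): EntryKind {
  const tokens = entry.trim().split(/\s+/).filter(t => t.length > 0);
  return tokens.length > 1 ? "idiom" : "word";
}

console.log(classifyEntry("good"));         // "word"
console.log(classifyEntry("good morning")); // "idiom"
console.log(classifyEntry("早上好"));        // "word" (no whitespace, even though it functions as an idiom in Chinese)
```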

What is "rare lemma"

The threshold is set by the user; the extension itself just keeps a list of words sorted by frequency, which is a very standard and well-known method in computational linguistics. You can find many such lists on the web, built from different text corpora: https://en.wikipedia.org/wiki/Word_lists_by_frequency
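
A rough sketch of the frequency-rank idea, assuming a lemma counts as "rare" when its rank in a frequency-sorted list falls beyond a user-chosen threshold (the list format and threshold handling here are assumptions, not the extension's actual implementation):

```typescript
// Build a lemma -> rank index from a list that is already sorted by
// descending frequency (rank 1 = most frequent).
function buildRankIndex(frequencySortedLemmas: string[]): Map<string, number> {
  const rank = new Map<string, number>();
  frequencySortedLemmas.forEach((lemma, i) => rank.set(lemma.toLowerCase(), i + 1));
  return rank;
}

// A lemma is considered rare if it is unknown or ranked beyond the threshold.
function isRareLemma(lemma: string, rank: Map<string, number>, threshold: number): boolean {
  const r = rank.get(lemma.toLowerCase());
  return r === undefined || r > threshold;
}

// Usage: with a threshold of 5000, anything outside the 5000 most frequent
// lemmas would be treated as rare.
const rankIndex = buildRankIndex(["the", "be", "to", "of", "and" /* ... */]);
console.log(isRareLemma("serendipity", rankIndex, 5000)); // true
```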

Chinese words are also not separated by spaces, so I think I might need to change how the tokenizer works to make it work for Chinese.

You are probably right; I guess there would be a lot of work there. Maybe you would have to completely rethink how the extension operates to make it useful for learning Chinese. To be honest, I don't know Chinese at all, so I can't provide any guidance on that matter.
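
One possible starting point for the tokenizer change, not from the extension itself: the standard Intl.Segmenter API (available in modern browsers and Node.js 16+) can split Chinese text into word-like segments without relying on whitespace.

```typescript
// Segment Chinese text into word-like tokens using Intl.Segmenter.
const segmenter = new Intl.Segmenter("zh", { granularity: "word" });

function tokenizeChinese(text: string): string[] {
  return Array.from(segmenter.segment(text))
    .filter(s => s.isWordLike)   // drop punctuation and spaces
    .map(s => s.segment);
}

console.log(tokenizeChinese("早上好，今天天气很好。"));
// The exact segmentation depends on the runtime's ICU data.
```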

I wish you luck with your efforts!
