
Add support for other languages #26

Open
ryanphung opened this issue May 16, 2021 · 1 comment

ryanphung commented May 16, 2021

Hi @mechatroner, great work, I really love your extension. At the moment I'm evaluating whether I could extend it to support my learning of Chinese. I realize that parsing and highlighting Chinese are a bit different compared to English; a few things I'm contemplating:

  • The concepts of characters, words and idioms are slightly different. For example, in English "g", "o", "o", "d" are characters, "good" is a word, and "good morning" is an idiom. In Chinese, "早" is both a character and a word, "早上" is a word, and "早上好" is an idiom. So I figure this may influence what goes into the dictionary file and what goes into the idiom file. I tried to find documentation on how you decide what is considered a word and what is considered an idiom, but I couldn't find it. Do you have some documentation on this?
  • What is a "rare lemma"? I see that you use this in your algorithm, but I can't figure out whether it will apply to Chinese as well.
  • Chinese words are also not separated by spaces, so I think I might need to change how the tokenizer works to make it work for Chinese.
  • etc.

Hopefully I'll be able to figure these out! Meanwhile, a shoutout for your great work! Hope to see you around, @mechatroner.

mechatroner (Owner) commented

Hi @ryanphung!
Thanks for the feedback!

how you divide what is considered word and what is considered idiom

The rule is very simple: if an entry contains multiple tokens (i.e. it has one or more whitespace characters), it is an idiom; otherwise it is a word.
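
For illustration, here is a minimal TypeScript sketch of that rule (the function name `classifyEntry` is hypothetical, not from the extension's code); it also shows why whitespace-based classification breaks down for Chinese:

```typescript
// Classify a dictionary entry as a "word" or an "idiom" using the rule above:
// multiple whitespace-separated tokens => idiom, single token => word.
type EntryKind = "word" | "idiom";

function classifyEntry(entry: string): EntryKind {
  const tokens = entry.trim().split(/\s+/).filter(t => t.length > 0);
  return tokens.length > 1 ? "idiom" : "word";
}

console.log(classifyEntry("good"));         // "word"
console.log(classifyEntry("good morning")); // "idiom"
console.log(classifyEntry("早上好"));        // "word" (no whitespace, even though it functions as an idiom in Chinese)
```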

What is "rare lemma"

The threshold is set by the user; the extension itself just keeps a list of words sorted by frequency, which is a very standard and well-known method in computational linguistics. You can find many such lists on the web, built from different text corpora: https://en.wikipedia.org/wiki/Word_lists_by_frequency
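
A rough sketch of the frequency-rank idea, assuming a lemma counts as "rare" when its rank in a frequency-sorted list falls beyond a user-chosen threshold (the list format and threshold handling here are assumptions, not the extension's actual implementation):

```typescript
// Build a lemma -> rank index from a list that is already sorted by
// descending frequency (rank 1 = most frequent).
function buildRankIndex(frequencySortedLemmas: string[]): Map<string, number> {
  const rank = new Map<string, number>();
  frequencySortedLemmas.forEach((lemma, i) => rank.set(lemma.toLowerCase(), i + 1));
  return rank;
}

// A lemma is considered rare if it is unknown or ranked beyond the threshold.
function isRareLemma(lemma: string, rank: Map<string, number>, threshold: number): boolean {
  const r = rank.get(lemma.toLowerCase());
  return r === undefined || r > threshold;
}

// Usage: with a threshold of 5000, anything outside the 5000 most frequent
// lemmas would be treated as rare.
const rankIndex = buildRankIndex(["the", "be", "to", "of", "and" /* ... */]);
console.log(isRareLemma("serendipity", rankIndex, 5000)); // true
```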

Chinese words are also not separated by spaces, so I think I might need to change how the tokenizer works to make it work for Chinese.

You are probably right; I guess there would be a lot of work there. Maybe you would have to completely rethink how the extension operates to make it useful for learning Chinese. To be honest, I don't know Chinese at all, so I can't provide any guidance on that matter.
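
One possible starting point for the tokenizer change, not from the extension itself: the standard Intl.Segmenter API (available in modern browsers and Node.js 16+) can split Chinese text into word-like segments without relying on whitespace.

```typescript
// Segment Chinese text into word-like tokens using Intl.Segmenter.
const segmenter = new Intl.Segmenter("zh", { granularity: "word" });

function tokenizeChinese(text: string): string[] {
  return Array.from(segmenter.segment(text))
    .filter(s => s.isWordLike)   // drop punctuation and spaces
    .map(s => s.segment);
}

console.log(tokenizeChinese("早上好，今天天气很好。"));
// The exact segmentation depends on the runtime's ICU data.
```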

I wish you luck with your efforts!
