Hi @mechatroner, great work, I really love your extension. At the moment I'm evaluating whether I could extend it to support my learning of Chinese. I realize that parsing and highlighting Chinese are a bit different compared to English; here are a few things I'm contemplating:
The concepts of characters, words, and idioms are slightly different. For example, in English "g", "o", "o", "d" are characters, "good" is a word, and "good morning" is an idiom. In Chinese, "早" is both a character and a word, "早上" is a word, and "早上好" is an idiom. So I figure this may influence what goes into the dictionary file and what goes into the idiom file. I'm trying to find documentation on how you decide what counts as a word and what counts as an idiom, but I couldn't find it. Do you have some documentation on this?
What is "rare lemma"? I see that you use this in your algorithm but I can't figure out if this will apply to Chinese as well.
Chinese words are also not separated by spaces, so I think I might need to change how the tokenizer works to make it work for Chinese.
etc.
Hopefully I'll be able to figure them out! Meanwhile, shoutout for your great work! Hope to see you around, @mechatroner.
how you divide what is considered word and what is considered idiom
The rule is very simple: if an entry contains multiple tokens (separated by one or more whitespace characters), it is an idiom; otherwise it is a word.
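In other words (a rough Python sketch for illustration only, not the extension's actual code):

```python
def classify(entry: str) -> str:
    # Split on runs of whitespace; more than one token means the entry is an idiom.
    tokens = entry.split()
    return "idiom" if len(tokens) > 1 else "word"

print(classify("good"))          # word
print(classify("good morning"))  # idiom
```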
What is "rare lemma"
The threshold is set by the user; the extension itself just keeps a list of words sorted by frequency, which is a very standard and well-known method in computational linguistics. You can find a lot of such lists on the web, built from different text corpora: https://en.wikipedia.org/wiki/Word_lists_by_frequency
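Roughly, the idea looks like this (a sketch with made-up names, not the actual implementation):

```python
def is_rare(lemma: str, freq_ranked_words: list[str], user_threshold: int) -> bool:
    # freq_ranked_words is sorted from most to least frequent; rank 0 = most common.
    try:
        rank = freq_ranked_words.index(lemma)
    except ValueError:
        return True  # not in the frequency list at all -> treat it as rare
    return rank >= user_threshold

# With a threshold of 3, anything below the top three counts as rare.
ranked = ["the", "be", "to", "of", "serendipity"]
print(is_rare("the", ranked, 3))          # False
print(is_rare("serendipity", ranked, 3))  # True
```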
Chinese words are also not separated by spaces, so I think I might need to change how the tokenizer works to make it work for Chinese.
You are probably right; I guess there is a lot of work there. Maybe you would have to completely rethink how the extension operates to make it useful for learning Chinese. To be honest, I don't know Chinese at all, so I can't provide any guidance on that matter.
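One direction that might be worth looking at (just a guess on my side, I haven't tried it) is delegating segmentation to an existing Chinese word-segmentation library such as jieba and feeding the resulting tokens into the rest of the pipeline:

```python
import jieba  # third-party library: pip install jieba

text = "早上好，我在学习中文。"
tokens = jieba.lcut(text)  # segments the sentence into word-level tokens
print(tokens)
```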
I wish you luck with your efforts!