Word2Vec scan_vocab() pruning method #2024
Comments
CC: @gojomo
The reason is efficiency. Each pruning pass is costly. Moreover, due to Zipf's law, in your version the dictionary would fill up very quickly again, leading to lots of pruning invocations (the vocab size would hover around max_vocab_size). In the original, Zipf's law ensures that the number of tokens removed grows exponentially (frequency * rank = ~constant), meaning the time between two pruning invocations remains more or less constant. The Gensim mailing list is a better place for such questions and discussions.
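For anyone who wants to poke at this trade-off, here is a rough, self-contained toy comparison of the two policies. It is not gensim code; the corpus size, cap, and Zipf exponent are arbitrary, and count_sweeps is just an illustrative helper that counts how many full sweeps over the dictionary each policy ends up doing:

```python
import numpy as np
from collections import defaultdict

def count_sweeps(tokens, max_vocab_size, escalate_floor):
    """Count full sweeps over the dictionary under one of the two pruning policies."""
    vocab = defaultdict(int)
    min_reduce, sweeps = 1, 0
    for token in tokens:
        vocab[token] += 1
        if len(vocab) <= max_vocab_size:
            continue
        if escalate_floor:
            # current behaviour: one sweep, then permanently raise the floor
            for w in [w for w, c in vocab.items() if c <= min_reduce]:
                del vocab[w]
            min_reduce += 1
            sweeps += 1
        else:
            # proposed alternative: restart at 1 and raise only until under the cap
            floor = 1
            while len(vocab) > max_vocab_size:
                for w in [w for w, c in vocab.items() if c <= floor]:
                    del vocab[w]
                floor += 1
                sweeps += 1
    return sweeps

rng = np.random.default_rng(0)
stream = rng.zipf(1.2, size=300_000)  # Zipf-distributed token ids
for escalate in (True, False):
    label = "escalating floor" if escalate else "restart from 1"
    print(label, count_sweeps(stream, 30_000, escalate))
```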
I agree that the escalating-floor pruning method is very crude. Not only does it disadvantage later-arriving words, but each prune tends to eliminate 90% of all words. So when a prune comes close to the end, you can wind up with a final vocab size much smaller than your stated maximum. So it's giving up a bunch of correctness (in which words should survive) for the efficiency of pruning less often. However, I believe it follows the same algorithm as the code in the original Google word2vec.c code that the gensim implementation was modeled after. Some users might prefer spending more time on more-frequent, smaller prunes to get a more correct survivor set, but all such choices involve tradeoffs. If using this mechanism, best practice is probably to set a max_vocab_size quite a bit larger than the final vocabulary size you actually want.
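A usage sketch of that advice, assuming the standard gensim Word2Vec constructor; the toy corpus and the specific numbers are placeholders:

```python
from gensim.models import Word2Vec

sentences = [["alpha", "beta", "gamma", "delta"]] * 50  # placeholder corpus

model = Word2Vec(
    sentences=sentences,
    max_vocab_size=4_000_000,  # RAM cap during the vocab scan; set it well above
                               # the vocabulary you actually want to keep, so the
                               # crude escalating prunes bite less
    min_count=5,               # the usual final trim by frequency
    # newer gensim versions also offer max_final_vocab=... for a target final size
)
```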
I'd much prefer to have Bounter properly integrated (PR #1962) than come up with more ad-hoc schemes.
Not really an issue, but I wondered why it is done that way.
In the word2vec scan_vocab() method the min_reduce count is increased after every pruning (link to code). New tokens that appear late or are evenly spread across the dataset will very likely be pruned out that way. Wouldn't it be better to restart the pruning from 1 every time pruning is needed and increase it by 1 until vocab_size < max_vocab_size?
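For context, the behaviour in question boils down to roughly the following (an illustrative sketch, not gensim's actual scan_vocab/prune_vocab code; the exact comparison against the floor may differ):

```python
from collections import defaultdict

def scan_vocab_sketch(tokens, max_vocab_size):
    vocab = defaultdict(int)
    min_reduce = 1
    for token in tokens:
        vocab[token] += 1
        if len(vocab) > max_vocab_size:
            # prune everything at or below the current floor ...
            for word in [w for w, c in vocab.items() if c <= min_reduce]:
                del vocab[word]
            # ... and the floor only ever goes up, so words that start
            # appearing late never get a chance to accumulate counts
            min_reduce += 1
    return vocab
```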