
Word2Vec scan_vocab() pruning method #2024

Closed · villmow opened this issue Apr 10, 2018 · 4 comments
Labels: need info (Not enough information for reproduce an issue, need more info from author)

villmow commented Apr 10, 2018

Not really an issue, but I wondered why it is done that way.

In the Word2Vec scan_vocab() method, the min_reduce count is increased after every pruning (link to code). New tokens that appear late in the dataset, or are spread evenly across it, are therefore very likely to be pruned out.

Wouldn't it be better to restart min_reduce at 1 every time pruning is needed, and increase it by 1 until vocab_size < max_vocab_size?
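
For context, here is a minimal sketch of the two strategies being compared, with made-up function names. It paraphrases the escalating-floor logic rather than quoting scan_vocab() verbatim, and it leans on gensim's utils.prune_vocab helper, which drops low-count entries from the dict in place given a frequency floor:

```python
from collections import defaultdict

from gensim import utils  # utils.prune_vocab(vocab, min_reduce) drops low-count entries in place


def scan_with_escalating_floor(sentences, max_vocab_size):
    """Current behaviour (paraphrased): the pruning floor only ever goes up."""
    vocab = defaultdict(int)
    min_reduce = 1
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        if max_vocab_size and len(vocab) > max_vocab_size:
            utils.prune_vocab(vocab, min_reduce)  # one pass at the current floor
            min_reduce += 1                       # floor is never reset, so later prunes cut deeper
    return vocab


def scan_with_restarting_floor(sentences, max_vocab_size):
    """Variant asked about: restart the floor at 1 and raise it only as far as needed."""
    vocab = defaultdict(int)
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        min_reduce = 1
        while max_vocab_size and len(vocab) > max_vocab_size:
            utils.prune_vocab(vocab, min_reduce)
            min_reduce += 1
    return vocab
```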

menshikh-iv added the need info label (Not enough information for reproduce an issue, need more info from author) on Apr 10, 2018
menshikh-iv (Contributor) commented:

CC: @gojomo

piskvorky (Owner) commented Apr 10, 2018

The reason is efficiency. Each pruning pass is costly.

Moreover, due to Zipf's law, in your version the dictionary would fill up very quickly again, leading to lots of pruning invocations (the vocab size would hover around max_vocab_size all the time, with min_reduce not changing much).

In the original, Zipf's law ensures that the number of tokens removed grows exponentially (frequency * rank = ~constant), meaning the time between two pruning invocations remains more or less constant.
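
To see the efficiency argument concretely, here is a toy, self-contained simulation (not gensim code; all names are made up) that feeds a roughly Zipfian token stream through both strategies and counts pruning passes:

```python
import random
from collections import Counter


def zipf_stream(n_tokens, n_types, seed=0):
    """Draw a token stream whose type frequencies roughly follow Zipf's law (p(rank) ~ 1/rank)."""
    rng = random.Random(seed)
    ranks = list(range(1, n_types + 1))
    weights = [1.0 / r for r in ranks]
    return rng.choices(ranks, weights=weights, k=n_tokens)


def simulate(tokens, max_vocab, escalating):
    """Return (pruning passes needed, final vocab size) for one of the two strategies."""
    vocab = Counter()
    min_reduce = 1
    passes = 0
    for tok in tokens:
        vocab[tok] += 1
        if len(vocab) > max_vocab:
            if not escalating:
                min_reduce = 1  # proposed variant: restart the floor every time
            while len(vocab) > max_vocab:  # prune until we are back under the cap
                passes += 1
                vocab = Counter({t: c for t, c in vocab.items() if c >= min_reduce})
                min_reduce += 1  # escalating variant keeps this raised floor for the next prune
    return passes, len(vocab)


tokens = zipf_stream(n_tokens=100_000, n_types=50_000)
print("escalating floor:", simulate(tokens, max_vocab=2_000, escalating=True))
print("restart from 1: ", simulate(tokens, max_vocab=2_000, escalating=False))
```

Per the argument above, the restart-from-1 run should need far more pruning passes (the vocab keeps hovering at the cap), while the escalating floor prunes rarely at the cost of a smaller, earlier-biased survivor set.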

The Gensim mailing list is a better place for such questions and discussions.

gojomo (Collaborator) commented Apr 11, 2018

I agree that the escalating-floor pruning method is very crude. Not only does it disadvantage later-arriving words, but each prune tends to eliminate ~90% of all words. So when a prune comes close to the end, you can wind up with a final vocab size much smaller than your stated maximum.

So it's giving up a bunch of correctness (in which words should survive) for the efficiency of pruning less often. However, I believe it follows the same algorithm as the original Google word2vec.c code that the gensim implementation was modeled after. Some users might prefer spending more time on more frequent, smaller prunes to get a more correct survivor set, but all such choices involve tradeoffs.

If using this mechanism, best practice is probably to set a max_vocab much larger than your intended final size - so you still get a cap on total RAM usage, but the arbitrariness around the prune boundary, and the overpruning, are less likely to affect your final vocab. See also the newer parameter max_final_vocab, which sets an actual target for the final size, rather than just a cap-that-triggers-pruning for the sake of bounding memory usage.
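
For anyone landing here later, a short usage sketch of the two knobs (assuming a gensim release recent enough to expose max_final_vocab; the tiny corpus is only a placeholder):

```python
from gensim.models import Word2Vec

sentences = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
]

# Cap-that-triggers-pruning: bounds RAM during the vocab scan, but the surviving
# vocab can end up much smaller than the cap because each prune is so coarse,
# so the cap is set well above the intended final size.
model_cap = Word2Vec(sentences, max_vocab_size=2_000_000, min_count=1)

# Target for the final size: gensim picks an effective min_count so that the
# retained vocabulary does not exceed max_final_vocab.
model_target = Word2Vec(sentences, max_final_vocab=100_000, min_count=1)
```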

piskvorky (Owner) commented:

I'd much prefer to have Bounter properly integrated (PR #1962) than come up with more ad-hoc schemes.
