
Word2Vec scan_vocab() pruning method #2024

Closed · villmow opened this issue Apr 10, 2018 · 4 comments
Labels: need info (Not enough information for reproduce an issue, need more info from author)

villmow commented Apr 10, 2018

Not really an issue, but I wondered why it is done that way.

In the Word2Vec scan_vocab() method, the min_reduce count is increased after every pruning (link to code). New tokens that appear late in the dataset, or are spread evenly across it, are therefore very likely to be pruned out.

Wouldn't it be better to restart min_reduce at 1 every time pruning is needed, and increase it by 1 until vocab_size < max_vocab_size?
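
For context, here is a minimal sketch of the two strategies being compared, with made-up function names. It paraphrases the escalating-floor logic rather than quoting scan_vocab() verbatim, and it leans on gensim's utils.prune_vocab helper, which drops low-count entries from the dict in place given a frequency floor:

```python
from collections import defaultdict

from gensim import utils  # utils.prune_vocab(vocab, min_reduce) drops low-count entries in place


def scan_with_escalating_floor(sentences, max_vocab_size):
    """Current behaviour (paraphrased): the pruning floor only ever goes up."""
    vocab = defaultdict(int)
    min_reduce = 1
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        if max_vocab_size and len(vocab) > max_vocab_size:
            utils.prune_vocab(vocab, min_reduce)  # one pass at the current floor
            min_reduce += 1                       # floor is never reset, so later prunes cut deeper
    return vocab


def scan_with_restarting_floor(sentences, max_vocab_size):
    """Variant asked about: restart the floor at 1 and raise it only as far as needed."""
    vocab = defaultdict(int)
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        min_reduce = 1
        while max_vocab_size and len(vocab) > max_vocab_size:
            utils.prune_vocab(vocab, min_reduce)
            min_reduce += 1
    return vocab
```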

menshikh-iv added the need info label (Not enough information for reproduce an issue, need more info from author) on Apr 10, 2018
menshikh-iv (Contributor) commented:

CC: @gojomo

piskvorky (Owner) commented Apr 10, 2018

The reason is efficiency. Each pruning pass is costly.

Moreover, due to Zipf's law, in your version the dictionary would fill up very quickly again, leading to lots of pruning invocations (the vocab size would hover around max_vocab_size all the time, with min_reduce not changing much).

In the original, Zipf's law ensures that the number of tokens removed grows exponentially (frequency * rank = ~constant), meaning the time between two pruning invocations remains more or less constant.
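
To see the efficiency argument concretely, here is a toy, self-contained simulation (not gensim code; all names are made up) that feeds a roughly Zipfian token stream through both strategies and counts pruning passes:

```python
import random
from collections import Counter


def zipf_stream(n_tokens, n_types, seed=0):
    """Draw a token stream whose type frequencies roughly follow Zipf's law (p(rank) ~ 1/rank)."""
    rng = random.Random(seed)
    ranks = list(range(1, n_types + 1))
    weights = [1.0 / r for r in ranks]
    return rng.choices(ranks, weights=weights, k=n_tokens)


def simulate(tokens, max_vocab, escalating):
    """Return (pruning passes needed, final vocab size) for one of the two strategies."""
    vocab = Counter()
    min_reduce = 1
    passes = 0
    for tok in tokens:
        vocab[tok] += 1
        if len(vocab) > max_vocab:
            if not escalating:
                min_reduce = 1  # proposed variant: restart the floor every time
            while len(vocab) > max_vocab:  # prune until we are back under the cap
                passes += 1
                vocab = Counter({t: c for t, c in vocab.items() if c >= min_reduce})
                min_reduce += 1  # escalating variant keeps this raised floor for the next prune
    return passes, len(vocab)


tokens = zipf_stream(n_tokens=100_000, n_types=50_000)
print("escalating floor:", simulate(tokens, max_vocab=2_000, escalating=True))
print("restart from 1: ", simulate(tokens, max_vocab=2_000, escalating=False))
```

Per the argument above, the restart-from-1 run should need far more pruning passes (the vocab keeps hovering at the cap), while the escalating floor prunes rarely at the cost of a smaller, earlier-biased survivor set.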

The Gensim mailing list is a better place for such questions and discussions.

gojomo (Collaborator) commented Apr 11, 2018

I agree that the escalating-floor pruning method is very crude. Not only does it disadvantage later-arriving words, but each prune tends to eliminate ~90% of all words. So when a prune comes close to the end, you can wind up with a final vocab size much smaller than your stated maximum.

So it's giving up a bunch of correctness (in which words should survive) for the efficiency of pruning less often. However, I believe it follows the same algorithm as the original Google word2vec.c code that the gensim implementation was modeled after. Some users might prefer spending more time on more frequent, smaller prunes to get a more correct survivor set, but all such choices involve tradeoffs.

If using this mechanism, best practice is probably to set a max_vocab much larger than your intended final size - so you still get a cap on total RAM usage, but the arbitrariness around the prune boundary, and the overpruning, are less likely to affect your final vocab. See also the newer parameter max_final_vocab, which sets an actual target for the final size, rather than just a cap-that-triggers-pruning for the sake of bounding memory usage.
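
For anyone landing here later, a short usage sketch of the two knobs (assuming a gensim release recent enough to expose max_final_vocab; the tiny corpus is only a placeholder):

```python
from gensim.models import Word2Vec

sentences = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
]

# Cap-that-triggers-pruning: bounds RAM during the vocab scan, but the surviving
# vocab can end up much smaller than the cap because each prune is so coarse,
# so the cap is set well above the intended final size.
model_cap = Word2Vec(sentences, max_vocab_size=2_000_000, min_count=1)

# Target for the final size: gensim picks an effective min_count so that the
# retained vocabulary does not exceed max_final_vocab.
model_target = Word2Vec(sentences, max_final_vocab=100_000, min_count=1)
```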

piskvorky (Owner) commented:

I'd much prefer to have Bounter properly integrated (PR #1962) than come up with more ad-hoc schemes.
