Off-by-one counting error in TFIDF #2375

piskvorky · 2019-02-06T15:17:25Z

Building a TFIDF model outputs this in the log:

INFO - 2019-02-06 16:09:54,670 - resulting dictionary: Dictionary(10000 unique tokens: [u'writings', u'foul', u'prefix', u'woods', u'hanging']...)
INFO - 2019-02-06 16:10:01,718 - collecting document frequencies
INFO - 2019-02-06 16:10:01,718 - PROGRESS: processing document #0
INFO - 2019-02-06 16:10:02,162 - calculating IDF weights for 1701 documents and 9999 features (2220559 matrix non-zeros)

Note the "9999 features", even though the dictionary has 10000 tokens. This happens with any dictionary and any number of features: the actual values don't matter, the reported count is always off by one.

This seems to be a bug introduced in this refactoring.

Not sure whether this is an isolated problem, or whether other places are affected by similar "optimizations".

* Fix the off-by-one bug in the TFIDF model. Fixes #2375. Use len to compute the number of features. Since the ids are zero-indexed, using max causes an off-by-one bug.
* Use the maximum token identifier to compute the number of TFIDF features
* Tweak the number of features computation
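The zero-indexing point in the commit message can be illustrated with a small standalone sketch. This is not the gensim code: the `document_frequencies` helper, the toy corpus, and the variable names are all hypothetical, and the only assumption taken from the thread is that token ids start at 0.

```python
from collections import Counter


def document_frequencies(corpus):
    """Count, for each token id, how many documents contain it (hypothetical helper)."""
    dfs = Counter()
    for bow in corpus:
        for token_id, _count in bow:
            dfs[token_id] += 1
    return dfs


# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
# Token ids are zero-indexed, so ids 0..3 mean four distinct features.
corpus = [
    [(0, 1), (1, 2)],
    [(1, 1), (2, 1)],
    [(0, 3), (3, 1)],
]

dfs = document_frequencies(corpus)

# Taking the largest id as the feature count is one too low.
buggy_num_features = max(dfs, default=-1)

# Adding one to the largest id accounts for id 0.
fixed_num_features = max(dfs, default=-1) + 1

# len() agrees here, but only because every id in 0..3 occurs in the corpus.
len_num_features = len(dfs)

print(buggy_num_features, fixed_num_features, len_num_features)
```

In this toy case `len(dfs)` and `max(dfs) + 1` coincide. They would differ if some id never appeared in any document, which may be why the commits above move between the two approaches.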

piskvorky added the bug label Feb 6, 2019

AMR-KELEG mentioned this issue Feb 22, 2019

Fix the off-by-one bug in the TFIDF model. #2392

Merged

mpenkov closed this as completed in #2392 Apr 28, 2019
