Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Off-by-one counting error in TFIDF #2375

Closed
piskvorky opened this issue Feb 6, 2019 · 0 comments · Fixed by #2392
Closed

Off-by-one counting error in TFIDF #2375

piskvorky opened this issue Feb 6, 2019 · 0 comments · Fixed by #2392
Labels
bug Issue described a bug

Comments

@piskvorky
Copy link
Owner

piskvorky commented Feb 6, 2019

Building a TFIDF model outputs this in the log:

INFO - 2019-02-06 16:09:54,670 - resulting dictionary: Dictionary(10000 unique tokens: [u'writings', u'foul', u'prefix', u'woods', u'hanging']...)
INFO - 2019-02-06 16:10:01,718 - collecting document frequencies
INFO - 2019-02-06 16:10:01,718 - PROGRESS: processing document #0
INFO - 2019-02-06 16:10:02,162 - calculating IDF weights for 1701 documents and 9999 features (2220559 matrix non-zeros)

Note the "9999 features", despite the dictionary having 10000 features. (any dictionary and any number of features -- the actual values don't matter, it's always off-by-one).

This seems to be a bug introduced in this refactoring.

Not sure if this is an isolated problem or what other places are affected by similar "optimizations".

@piskvorky piskvorky added the bug Issue described a bug label Feb 6, 2019
mpenkov pushed a commit that referenced this issue Apr 28, 2019
* Fix the off-by-one bug in the TFIDF model.

Fixes #2375.
Use len to compute the number of features.
Since the ids are zero-indexed, Using max causes an off-by-one bug.

* Use the maximum token identifier to compute the number of TFIDF features

* Tweak the number of features computation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant