INFO - 2019-02-06 16:09:54,670 - resulting dictionary: Dictionary(10000 unique tokens: [u'writings', u'foul', u'prefix', u'woods', u'hanging']...)
INFO - 2019-02-06 16:10:01,718 - collecting document frequencies
INFO - 2019-02-06 16:10:01,718 - PROGRESS: processing document #0
INFO - 2019-02-06 16:10:02,162 - calculating IDF weights for 1701 documents and 9999 features (2220559 matrix non-zeros)
Note the "9999 features", despite the dictionary having 10000 tokens. (Any dictionary and any number of features will do; the actual values don't matter, the count is always off by one.)
* Fix the off-by-one bug in the TFIDF model.
Fixes #2375.
Use len to compute the number of features. Since the ids are zero-indexed, using max causes an off-by-one bug.
* Use the maximum token identifier to compute the number of TFIDF features
* Tweak the number of features computation
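The difference between the two approaches can be illustrated with a minimal sketch (the `ids` range below is a stand-in for a gensim `Dictionary`'s zero-indexed token ids, not the actual gensim code):

```python
# Token ids in a gensim Dictionary are zero-indexed, so a dictionary
# with 10000 tokens has ids 0..9999.
ids = range(10000)  # stand-in for the dictionary's token ids

# Using the maximum id as the feature count under-reports by one.
num_features_buggy = max(ids)   # 9999: off by one
# Using the number of ids (equivalently, max id + 1) is correct.
num_features_fixed = len(ids)   # 10000: correct

print(num_features_buggy, num_features_fixed)
```

This is why the log reports "9999 features" for a 10000-token dictionary: the feature count was taken from the largest id rather than from the number of ids.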
Building a TFIDF model outputs the log shown above; note the "9999 features", despite the dictionary having 10000 tokens. (Any dictionary and any number of features will do; the actual values don't matter, the count is always off by one.)
This seems to be a bug introduced in this refactoring.
I'm not sure whether this is an isolated problem or whether other places are affected by similar "optimizations".