#1959: Fix negative sampling floating-point error for gensim.models.PoincareModel. Fix #1917
Conversation
@jayantj I agree about the test (let's skip it, because this can only be reproduced with a really large model, which we have no chance of testing in the CI service). The underlying reason is really interesting, nice catch!
gensim/models/poincare.py (outdated)

```diff
@@ -202,8 +202,12 @@ def _get_candidate_negatives(self):
         """
 
         if self._negatives_buffer.num_items() < self.negative:
             # Note: np.random.choice much slower than random.sample for large populations, possible bottleneck
             uniform_numbers = self._np_random.random_sample(self._negatives_buffer_size)
+            # self._node_probabilities_cumsum sometimes doesn't have 1 as the last value due to floating point error.
```
I think it would be a good idea to add a link to the issue directly in the comment.
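For context, here is a small standalone illustration (not gensim code; the vocabulary size and weights are made up) of the floating-point behaviour that the new code comment in the diff above refers to: with millions of nodes, the cumulative sum of the sampling probabilities can end slightly below 1.0, and `np.searchsorted` then returns an out-of-range index for an unlucky uniform draw.

```python
import numpy as np

# Made-up unnormalised weights standing in for a wikipedia-sized vocabulary.
weights = np.random.randint(1, 100, size=5000000).astype(np.float64)
probs = weights / weights.sum()
cumsum = probs.cumsum()

print(repr(cumsum[-1]))  # typically not exactly 1.0 due to accumulated error

# If the last cumsum value lands below 1.0, a uniform draw from [0.0, 1.0)
# can exceed it, and searchsorted then returns len(cumsum) - an invalid index.
unlucky_draw = 1.0 - 1e-16
if unlucky_draw > cumsum[-1]:
    print(np.searchsorted(cumsum, unlucky_draw), len(cumsum))  # equal: out of range
```

The scaling introduced by this PR (see the description below) is what keeps such a draw inside the table.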
I investigated the failing test - it seems rather strange and looks to have to do with pickle. The poincare model saved in py3.5 wasn't being loaded correctly in py2.7, with this error -

```python
In [1]: from gensim.models.poincare import PoincareModel

In [2]: PoincareModel.load('gensim/test/test_data/poincare_test_3.4.0')
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-2-1d8d6b315418> in <module>()
----> 1 PoincareModel.load('gensim/test/test_data/poincare_test_3.4.0')

/home/jayant/Projects/gensim/gensim/gensim/models/poincare.pyc in load(cls, *args, **kwargs)
    344     def load(cls, *args, **kwargs):
    345         """Load model from disk, inherited from :class:`~gensim.utils.SaveLoad`."""
--> 346         model = super(PoincareModel, cls).load(*args, **kwargs)
    347         model._init_node_probabilities()
    348         return model

/home/jayant/Projects/gensim/gensim/gensim/utils.pyc in load(cls, fname, mmap)
    423         compress, subname = SaveLoad._adapt_by_suffix(fname)
    424
--> 425         obj = unpickle(fname)
    426         obj._load_specials(fname, mmap, compress, subname)
    427         logger.info("loaded %s", fname)

/home/jayant/Projects/gensim/gensim/gensim/utils.pyc in unpickle(fname)
   1332             return _pickle.load(f, encoding='latin1')
   1333         else:
-> 1334             return _pickle.loads(f.read())
   1335
   1336

UnpicklingError: state is not a dictionary
```

However, it works fine when I use pickle instead of cPickle -

```python
In [3]: import pickle, cPickle

In [4]: pickle.load(open('gensim/test/test_data/poincare_test_3.4.0', 'rb'))
Out[4]: <gensim.models.poincare.PoincareModel at 0x7f8e8da57dd0>

In [5]: cPickle.load(open('gensim/test/test_data/poincare_test_3.4.0', 'rb'))
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-5-d18194368f07> in <module>()
----> 1 cPickle.load(open('gensim/test/test_data/poincare_test_3.4.0', 'rb'))

UnpicklingError: state is not a dictionary
```

For now, I've trained the model used in the tests with py2.7, which seems to work fine across all python versions. We should probably investigate this in a subsequent issue. Attaching the model here - poincare_test_3.4.0.zip
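A minimal sketch of how such a test model can be regenerated, as described above (the relations and hyperparameters here are illustrative, not the actual fixture contents; the real attached model was presumably trained on the repository's test relations):

```python
# Run under Python 2.7; the resulting file should also load under Python 3.x.
from gensim.models.poincare import PoincareModel

# Tiny made-up set of hypernym relations, just enough to train a small model.
relations = [
    ('kangaroo', 'marsupial'), ('kangaroo', 'mammal'),
    ('gib', 'cat'), ('cat', 'mammal'), ('mammal', 'animal'),
]

model = PoincareModel(relations, size=2, negative=2)
model.train(epochs=10)
model.save('gensim/test/test_data/poincare_test_3.4.0')

# Sanity check: the saved model should round-trip through load() without errors.
loaded = PoincareModel.load('gensim/test/test_data/poincare_test_3.4.0')
print(loaded)
```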
Fixes #1917

The bug resulted from the fact that the probability of each node for negative sampling is subject to floating-point error (even with float128), which increases with the number of nodes - which explains why we hadn't encountered it before training on wikipedia.

Since the individual probabilities are subject to some error, the cumulative sum can be != 1 (an error of around `1e-10` in the case of en wikipedia), and thus `np.searchsorted` for a randomly chosen value in `[0.0, 1.0)` can return `last index of cumsum table + 1` when the random value is greater than the last value in the cumsum table.

To avoid this, I'm multiplying the randomly chosen values in `[0.0, 1.0)` by the last value in the cumulative sum table, to ensure they lie within the cumsum range (extremely close to `[0.0, 1.0]` in most cases).

This is rather hard to write a simple test for, so I'm skipping it. Let me know what you think @menshikh-iv
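To make the fix concrete, here is a small self-contained sketch (the function name and arguments are illustrative, not gensim's internal API) of the scaling described above:

```python
import numpy as np

def sample_negatives(node_probabilities_cumsum, n_samples, rng=np.random):
    """Draw n_samples node indices according to the cumulative distribution."""
    uniform_numbers = rng.random_sample(n_samples)  # values in [0.0, 1.0)
    # Without this scaling, a draw larger than cumsum[-1] (possible when the
    # cumsum ends slightly below 1.0 due to floating-point error) would make
    # searchsorted return len(cumsum), an invalid node index.
    uniform_numbers *= node_probabilities_cumsum[-1]
    return np.searchsorted(node_probabilities_cumsum, uniform_numbers)

# Usage on a toy distribution: all returned indices stay in range.
probs = np.full(10, 0.1)
indices = sample_negatives(probs.cumsum(), n_samples=5)
assert (indices < len(probs)).all()
```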