#1959: Fix negative sampling floating-point error for gensim.models.PoincareModel. Fix #1917
Conversation
@jayantj I agree about the test (let's skip it, because this can only be reproduced with a really large model, which we have no chance of testing in the CI service). The underlying reason is really interesting, nice catch!
gensim/models/poincare.py (outdated)

```diff
@@ -202,8 +202,12 @@ def _get_candidate_negatives(self):
         """
 
         if self._negatives_buffer.num_items() < self.negative:
             # Note: np.random.choice much slower than random.sample for large populations, possible bottleneck
             uniform_numbers = self._np_random.random_sample(self._negatives_buffer_size)
+            # self._node_probabilities_cumsum sometimes doesn't have 1 as the last value due to floating point error.
```
I think it would be a good idea to add a link to the issue directly in the comment.
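For context, here is a small standalone illustration (not gensim code; the vocabulary size and weights are made up) of the floating-point behaviour that the new code comment in the diff above refers to: with millions of nodes, the cumulative sum of the sampling probabilities can end slightly below 1.0, and `np.searchsorted` then returns an out-of-range index for an unlucky uniform draw.

```python
import numpy as np

# Made-up unnormalised weights standing in for a wikipedia-sized vocabulary.
weights = np.random.randint(1, 100, size=5000000).astype(np.float64)
probs = weights / weights.sum()
cumsum = probs.cumsum()

print(repr(cumsum[-1]))  # typically not exactly 1.0 due to accumulated error

# If the last cumsum value lands below 1.0, a uniform draw from [0.0, 1.0)
# can exceed it, and searchsorted then returns len(cumsum) - an invalid index.
unlucky_draw = 1.0 - 1e-16
if unlucky_draw > cumsum[-1]:
    print(np.searchsorted(cumsum, unlucky_draw), len(cumsum))  # equal: out of range
```

The scaling introduced by this PR (see the description below) is what keeps such a draw inside the table.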
I investigated the failing test - it seems rather strange and looks to have to do with pickle. The poincare model saved in py3.5 wasn't being loaded correctly in py2.7, with this error -

```python
In [1]: from gensim.models.poincare import PoincareModel

In [2]: PoincareModel.load('gensim/test/test_data/poincare_test_3.4.0')
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-2-1d8d6b315418> in <module>()
----> 1 PoincareModel.load('gensim/test/test_data/poincare_test_3.4.0')

/home/jayant/Projects/gensim/gensim/gensim/models/poincare.pyc in load(cls, *args, **kwargs)
    344     def load(cls, *args, **kwargs):
    345         """Load model from disk, inherited from :class:`~gensim.utils.SaveLoad`."""
--> 346         model = super(PoincareModel, cls).load(*args, **kwargs)
    347         model._init_node_probabilities()
    348         return model

/home/jayant/Projects/gensim/gensim/gensim/utils.pyc in load(cls, fname, mmap)
    423         compress, subname = SaveLoad._adapt_by_suffix(fname)
    424
--> 425         obj = unpickle(fname)
    426         obj._load_specials(fname, mmap, compress, subname)
    427         logger.info("loaded %s", fname)

/home/jayant/Projects/gensim/gensim/gensim/utils.pyc in unpickle(fname)
   1332             return _pickle.load(f, encoding='latin1')
   1333         else:
-> 1334             return _pickle.loads(f.read())
   1335
   1336

UnpicklingError: state is not a dictionary
```

However, it works fine when I use pickle instead of cPickle -

```python
In [3]: import pickle, cPickle

In [4]: pickle.load(open('gensim/test/test_data/poincare_test_3.4.0', 'rb'))
Out[4]: <gensim.models.poincare.PoincareModel at 0x7f8e8da57dd0>

In [5]: cPickle.load(open('gensim/test/test_data/poincare_test_3.4.0', 'rb'))
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-5-d18194368f07> in <module>()
----> 1 cPickle.load(open('gensim/test/test_data/poincare_test_3.4.0', 'rb'))

UnpicklingError: state is not a dictionary
```

For now, I've trained the model used in the tests with py2.7, which seems to work fine across all python versions. We should probably investigate this in a subsequent issue. Attaching the model here - poincare_test_3.4.0.zip
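A minimal sketch of how such a test model can be regenerated, as described above (the relations and hyperparameters here are illustrative, not the actual fixture contents; the real attached model was presumably trained on the repository's test relations):

```python
# Run under Python 2.7; the resulting file should also load under Python 3.x.
from gensim.models.poincare import PoincareModel

# Tiny made-up set of hypernym relations, just enough to train a small model.
relations = [
    ('kangaroo', 'marsupial'), ('kangaroo', 'mammal'),
    ('gib', 'cat'), ('cat', 'mammal'), ('mammal', 'animal'),
]

model = PoincareModel(relations, size=2, negative=2)
model.train(epochs=10)
model.save('gensim/test/test_data/poincare_test_3.4.0')

# Sanity check: the saved model should round-trip through load() without errors.
loaded = PoincareModel.load('gensim/test/test_data/poincare_test_3.4.0')
print(loaded)
```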
Fixes #1917

The bug resulted from the fact that the probability of each node for negative sampling is subject to floating-point error (even with float128), which increases with the number of nodes - which explains why we hadn't encountered it before training on wikipedia.

Since the individual probabilities are subject to some error, the cumulative sum can be != 1 (an error of around `1e-10` in the case of en wikipedia), and thus `np.searchsorted` for a randomly chosen value in `[0.0, 1.0)` can return `last index of cumsum table + 1` when the random value is greater than the last value in the cumsum table.

To avoid this, I'm multiplying the randomly chosen values in `[0.0, 1.0)` by the last value in the cumulative sum table, to ensure they lie within the cumsum range (extremely close to `[0.0, 1.0]` in most cases).

This is rather hard to write a simple test for, so I'm skipping it. Let me know what you think @menshikh-iv
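To make the fix concrete, here is a small self-contained sketch (the function name and arguments are illustrative, not gensim's internal API) of the scaling described above:

```python
import numpy as np

def sample_negatives(node_probabilities_cumsum, n_samples, rng=np.random):
    """Draw n_samples node indices according to the cumulative distribution."""
    uniform_numbers = rng.random_sample(n_samples)  # values in [0.0, 1.0)
    # Without this scaling, a draw larger than cumsum[-1] (possible when the
    # cumsum ends slightly below 1.0 due to floating-point error) would make
    # searchsorted return len(cumsum), an invalid node index.
    uniform_numbers *= node_probabilities_cumsum[-1]
    return np.searchsorted(node_probabilities_cumsum, uniform_numbers)

# Usage on a toy distribution: all returned indices stay in range.
probs = np.full(10, 0.1)
indices = sample_negatives(probs.cumsum(), n_samples=5)
assert (indices < len(probs)).all()
```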