Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce Phraser memory usage (drop frequencies) #2208

Merged
merged 12 commits into from
Jan 11, 2019
7 changes: 5 additions & 2 deletions gensim/models/phrases.py
Original file line number Diff line number Diff line change
Expand Up @@ -805,7 +805,7 @@ def __init__(self, phrases_model):
for bigram, score in phrases_model.export_phrases(corpus, self.delimiter, as_tuples=True):
if bigram in self.phrasegrams:
logger.info('Phraser repeat %s', bigram)
self.phrasegrams[bigram] = (phrases_model.vocab[self.delimiter.join(bigram)], score)
self.phrasegrams[bigram] = score
count += 1
if not count % 50000:
logger.info('Phraser added %i phrasegrams', count)
Expand Down Expand Up @@ -848,7 +848,10 @@ def score_item(self, worda, wordb, components, scorer):

"""
try:
return self.phrasegrams[tuple(components)][1]
if list(self.phrasegrams.values())[0].__class__ is tuple:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. this check is very tricky to just check that value is instance of tuple. Python has isinstance builtin function for that.
  2. creation of list of values is not easy operation. It would be better to avoid this.
Suggested change
if list(self.phrasegrams.values())[0].__class__ is tuple:
score = self.phrasegrams[tuple(components)]
return score[-1] if isinstance(score, tuple) else score

Copy link
Contributor Author

@jenishah jenishah Nov 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isinstance has to be applied to dictionary value, which requires using list

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jenishah yes, and as you can see in suggested change, we are getting it and saving in score variable. We don't need a concrete value of dict, so we can just get that value which will be returned later.

return self.phrasegrams[tuple(components)][-1]
else:
return self.phrasegrams[tuple(components)]
except KeyError:
return -1

Expand Down
Binary file added gensim/test/test_data/phraser_model_3dot6
Binary file not shown.
11 changes: 10 additions & 1 deletion gensim/test/test_phrases.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,6 @@
import unittest

import six

import numpy as np

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the blank please. We separate third party modules from internal modules by a blank line (another visual block).

The blank line between six and numpy shouldn't be there though :)

from gensim.utils import to_unicode
Expand Down Expand Up @@ -646,6 +645,16 @@ def testEncoding(self):
self.assertTrue(isinstance(transformed, six.text_type))


class TestPhraserModelCompatibilty(unittest.TestCase):

def testCompatibilty(self):
bigram_loaded = Phraser.load(datapath("phraser_model_3dot6"))
test_sentences = [u'trees', u'graph', u'minors']
prev_ver = bigram_loaded[test_sentences]
expected_res = ['trees_graph', 'minors']
self.assertEqual(prev_ver, expected_res)


if __name__ == '__main__':
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)
unittest.main()