-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1533 fix and 1464 1423 comments #1573
1533 fix and 1464 1423 comments #1573
Conversation
One unfortunate omission here is a lack of a test of pickling a Phrases and a Phraser object, to make sure pluggable scoring works properly after unpickling. I don't know enough about pickling to know what the risks here are, or if it is even worth testing this. I'm worried that if a Phraser (or Phrases) object from an older version of gensim is loaded, it won't work now since older Phraser objects won't have an assigned scoring function and |
@michaelwsherman thanks, much appreciated! The best place to handle backward compatibility is inside You can see that approach in action in word2vec here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L1408 |
I'm failing a test in The fix for this likely involves some custom pickling. My (brief) research has led me to the I can maybe embark on these fixes, but I'd like to know I'm on the right track before I go down this rabbit hole. It's also likely going to be a bit (a few weeks) for this fix, as I'll probably need a couple of days to figure out how this is done (mainly for the pickling part).
|
@michaelwsherman that doesn't sound right. Functions are picklable no problem, they just need to be named functions (e.g. The traceback suggests a named function, so I suspect the problem lies elsewhere. |
@piskvorky Sorry, you're right, and thank you for your quick response. This is a good learning experience for me. I think it is because the methods are static methods in the I'm going to make this change, add support for the custom scoring to the scikit interface, and add tests for using a custom scorer (including persistence) to the scikit interface, add the save/load support for backwards compatibility, and throw together some tests for that as well. These changes may take a little bit. I'm not going to add support for backwards compatibility through the scikit interface, as I expect that persistence via pickle across versions is not supported. Tell me if this is incorrect, and I'll start looking into the possibility of it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions + code style comments.
gensim/models/phrases.py
Outdated
@@ -177,9 +177,9 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0, | |||
# to still run the check of scoring function parameters in the next code block | |||
if type(scoring) is str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(scoring, basestring)
more standard and future-proof.
from math import log | ||
from inspect import getargspec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used in line 188 (in the commit your comments are on) to check for the proper parameters in the pluggable scoring function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I see it now. What is that check for though? Python is duck-typed by convention, so "type checks" are best postponed until truly needed (something breaks).
What is the rationale for this pre-emptive type check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly to save the stress that would result from improperly specifying a scoring function when initializing the phrases object. I know Python will do the type checking when the scoring function is called, but that won't happen until export_phrases or getitem is called. The "normal" workflow for the Phrases object is to just specify sentences on load, or to use add_vocab. Only after that does the scoring function get called.
I could easily see a user specifying a bad scoring method and then making the vocab dictionary from their large corpus. Only after significant time extracting vocab from a corpus do they then discover that something is wrong with how they specified scoring. At this point you could manually specify a correct scoring function, but that requires you to set it directly. Users also wouldn't have an easy bailout in the form of use one of the scorer string settings, since those are only checked when the Phrases object is created--the user would have to figure out how to specify those built in scorers which would mean opening up the code. This seems a bit user unfriendly, I feel it is friendlier to just do the type checking on initialization even if it is less Pythonic.
This could be fixed with a set_scorer method that takes the string or function input, but that seems a bit more awkward than just doing this type check.
There's also an issue with wanting to raise an informative exception when the scoring function is called in getitem or export_phrases and the types don't match, but that means adding a try/except into the main scoring loop and that seems awkward as well. I think its better to just do that try/except once when the object is initialized.
But I defer to your judgement on this--what do you think is best?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I see your argument (that checking early a little more convenient).
I'm not sure if it's worth it, but don't care much either way. I'll defer to @menshikh-iv :)
gensim/models/phrases.py
Outdated
# len_vocab and min_count set so functools.partial works | ||
@staticmethod | ||
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab=0.0, min_count=0.0): | ||
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count): | ||
return (bigram_count - min_count) / worda_count / wordb_count * len_vocab |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad indent.
gensim/models/phrases.py
Outdated
def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count=0.0): | ||
|
||
# normalized PMI, requires corpus size | ||
def npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count): | ||
pa = worda_count / corpus_word_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad indent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about these, very sloppy on my part.
gensim/models/word2vec.py
Outdated
self.input_files = [self.source] # force code compatibility with list of files | ||
elif os.path.isdir(self.source): | ||
self.source = os.path.join(self.source, '') # ensures os-specific slash at end of path | ||
logging.debug('reading directory ' + self.source) | ||
logger.warning('reading directory %s', self.source) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a mistake, changing to info.
gensim/models/word2vec.py
Outdated
@@ -1563,7 +1563,7 @@ def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None): | |||
""" | |||
`source` should be a path to a directory (as a string) where all files can be opened by the | |||
LineSentence class. Each file will be read up to | |||
`limit` lines (or no clipped if limit is None, the default). | |||
`limit` lines (or not clipped if limit is None, the default). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs are not clear -- does the "will process all files in a directory" work recursively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not. Maybe wishlist? I've clarified the docs.
gensim/models/word2vec.py
Outdated
@@ -1577,23 +1577,23 @@ def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None): | |||
self.limit = limit | |||
|
|||
if os.path.isfile(self.source): | |||
logging.warning('single file read, better to use models.word2vec.LineSentence') | |||
logger.warning('single file read, better to use models.word2vec.LineSentence') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the class API contract supports it, this is no warning (maybe debug, at most).
If it's outside the API contract, this is an error and we should raise an exception, not log a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified this message a bit, made it debug.
gensim/models/phrases.py
Outdated
@@ -177,6 +176,13 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0, | |||
# set scoring based on string | |||
# intentially override the value of the scoring parameter rather than set self.scoring here, | |||
# to still run the check of scoring function parameters in the next code block | |||
|
|||
# for python 2 and 3 compatibility. basestring is used to check if scoring is a string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there :) We use six
in gensim for py2/py3 compatibility, so isinstance
on six.string_types
is probably what we want.
This should be ready to go now, the requested changes have been made (or discussed) and everything is up to date with develop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michaelwsherman sorry for waiting, looks good for me:+1:
Let's wait for merge #1568, after - need to resolve conflicts here and merge.
Nice work @michaelwsherman 🔥 |
…iskvorky#1573) * initial commit of fixes in comments of piskvorky#1423 * removed unnecessary space in logger * added support for custom Phrases scorers * fixed Phrases.__getitem__ to support pluggable scoring piskvorky#1533 * travisCI style fixes * fixed __next__() to next() for python 3 compatibilyt * misc fixes * spacing fixes for style * custom scorer support in sklearn api * Phrases scikit interface tests for pluggable scoring * missing line breaks * style, clarity, and robustness fixes requested by @piskvorky * check in Phrases init to make sure scorer is pickleable * backwards scoring compatibility when loading a Phrases class * removal of pickle testing objects in Phrases init * switched to six for python 2/3 compatibility * fix docstring
Fix for #1533 , Phrases.getitem now support custom scoring. Pluggable scoring via a function parameter is Phrases is now supported.
Made fixes in comments in PRs #1464 and #1423, except did not explicitly cast floats in the pre-defined scoring methods. Rather, float casting is now done before calling the scoring method.