Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scoring function in Phrases model is hardcoded #1635

Closed
hristo-vrigazov opened this issue Oct 18, 2017 · 6 comments
Closed

Scoring function in Phrases model is hardcoded #1635

hristo-vrigazov opened this issue Oct 18, 2017 · 6 comments
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills

Comments

@hristo-vrigazov
Copy link

The Phrases model is based on word counting and bigram counting and it can process sentences by a given scoring function, which can be supplied via the construtor of Phrases (the parameter scoring). However, the field for scoring function is used only in the export_phrases method. Have a look here:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L269
and here:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L284

count_a = float(vocab[word_a])
count_b = float(vocab[word_b])
count_ab = float(vocab[bigram_word])
score = scoring_function(count_a, count_b, count_ab)

However, in the __getitem__ method, the scoring uses the default scoring always https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L334 :

pa = float(vocab[word_a])
pb = float(vocab[word_b])
pab = float(vocab[bigram_word])
score = (pab - min_count) / pa / pb * len(vocab)

This looks like a bug to me (we are always using the default scoring, even if we explicitly stated npmi in the constructor). Is it okay if I open a pull request fixing this one?

@gojomo
Copy link
Collaborator

gojomo commented Oct 19, 2017

@michaelwsherman thoughts?

@menshikh-iv
Copy link
Contributor

This problem fixed in #1573 from @michaelwsherman
Before merge #1573 I think we should merge #1568 -> resolve conflicts in #1573, wdyt @gojomo?

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 19, 2017
@gojomo
Copy link
Collaborator

gojomo commented Oct 19, 2017

I have no opinion on which should merge first.

@michaelwsherman
Copy link
Contributor

I'm of the opinion that #1573 should be merged first :), but I'll wait for #1568 -- just be aware that it could easily be a few weeks after #1568 until I merge the code--looks like it will be a meaty merge. If it's important that the merge happen quickly then maybe #1573 should merge first since #1568 looks more active right now. But happy to make whatever work if y'all don't mind the wait.

@menshikh-iv
Copy link
Contributor

Ok, let's merge #1573 first, thanks for clarification @michaelwsherman.

menshikh-iv pushed a commit that referenced this issue Oct 24, 2017
* initial commit of fixes in comments of #1423

* removed unnecessary space in logger

* added support for custom Phrases scorers

* fixed Phrases.__getitem__ to support pluggable scoring #1533

* travisCI style fixes

* fixed __next__() to next() for python 3 compatibilyt

* misc fixes

* spacing fixes for style

* custom scorer support in sklearn api

* Phrases scikit interface tests for pluggable scoring

* missing line breaks

* style, clarity, and robustness fixes requested by @piskvorky

* check in Phrases init to make sure scorer is pickleable

* backwards scoring compatibility when loading a Phrases class

* removal of pickle testing objects in Phrases init

* switched to six for python 2/3 compatibility

* fix docstring
@menshikh-iv
Copy link
Contributor

Resolved in #1573

horpto pushed a commit to horpto/gensim that referenced this issue Oct 28, 2017
…iskvorky#1573)

* initial commit of fixes in comments of piskvorky#1423

* removed unnecessary space in logger

* added support for custom Phrases scorers

* fixed Phrases.__getitem__ to support pluggable scoring piskvorky#1533

* travisCI style fixes

* fixed __next__() to next() for python 3 compatibilyt

* misc fixes

* spacing fixes for style

* custom scorer support in sklearn api

* Phrases scikit interface tests for pluggable scoring

* missing line breaks

* style, clarity, and robustness fixes requested by @piskvorky

* check in Phrases init to make sure scorer is pickleable

* backwards scoring compatibility when loading a Phrases class

* removal of pickle testing objects in Phrases init

* switched to six for python 2/3 compatibility

* fix docstring
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills
Projects
None yet
Development

No branches or pull requests

4 participants