LsiModel: Only log top words that actually exist in the dictionary (#3091)

* lsimodel: Only log top words that actually exist in <id2word>

In some pathological cases, we might try to log the top N words, even
though we haven't seen N words yet.  In these cases, we can just exit
the loop early.
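
A minimal sketch of the guard described above, using only the standard library. The helper name `top_terms` and the sample data are hypothetical, and a plain `dict` stands in for `id2word`; this is an illustration of the idea, not the gensim implementation.

```python
import math


def top_terms(weights, id2word, topn=10):
    """Return up to `topn` (word, normalized weight) pairs.

    Hypothetical helper: ids that are missing from `id2word` are skipped,
    so the result may hold fewer than `topn` entries.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    # Rank term ids by absolute weight, largest first, and keep the top `topn`.
    most = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)[:topn]
    # Only emit pairs for ids the dictionary actually knows about.
    return [(id2word[i], weights[i] / norm) for i in most if i in id2word]


weights = [0.6, -0.8, 0.0, 0.0]        # four term slots in the topic vector
id2word = {0: "human", 1: "computer"}  # but only two words have been seen
result = top_terms(weights, id2word, topn=4)
```

Here `result` holds two pairs even though four were requested. Without the `if i in id2word` filter, the lookup `id2word[2]` would raise `KeyError` for this plain `dict`.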

Closes #3090.

* utils: Implement FakeDict::__contains__()

In 8f8cb49, I added a check of the form

    val in self.id2word

When testing, `id2word` is actually an instance of `FakeDict`, which
doesn't implement `__contains__()` (so Python falls back to calling
`__getitem__()`[1]).  The tests failed as a result[2].

[1] https://docs.python.org/3.6/reference/datamodel.html#object.__contains__
[2] https://github.com/RaRe-Technologies/gensim/runs/2197137529
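
The fallback can be reproduced in isolation. The sketch below assumes a class whose `__getitem__()` matches the one shown in the `gensim/utils.py` hunk of this commit; the class names `WithoutContains` and `WithContains` and the helper `membership` are hypothetical and exist only for this illustration.

```python
class WithoutContains:
    """Hypothetical stand-in that defines __getitem__() but not __contains__()."""

    def __init__(self, num_terms):
        self.num_terms = num_terms

    def __getitem__(self, val):
        if 0 <= val < self.num_terms:
            return str(val)
        raise ValueError("internal id out of bounds (%s, expected <0..%s))" % (val, self.num_terms))


class WithContains(WithoutContains):
    """Same class plus an explicit membership test."""

    def __contains__(self, val):
        return 0 <= val < self.num_terms


def membership(mapping, val):
    """Evaluate `val in mapping`, reporting a ValueError instead of raising it."""
    try:
        return val in mapping
    except ValueError:
        return "ValueError"


# Without __contains__(), `in` walks __getitem__(0), __getitem__(1), ... and
# compares each returned string with the integer `val`. Nothing matches, and
# the out-of-bounds ValueError (not IndexError) escapes the membership test.
before = membership(WithoutContains(3), 1)

# With __contains__(), membership is a plain range check.
after_inside = membership(WithContains(3), 1)
after_outside = membership(WithContains(3), 5)
```

Under these assumptions, `before` is the string `"ValueError"` even for an id that is in range, while `after_inside` is `True` and `after_outside` is `False`.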

* Update lsimodel.py

* Update CHANGELOG.md

Co-authored-by: Michael Penkov <[email protected]>
kmurphy4 and mpenkov authored Apr 9, 2021
1 parent a7d01fb commit 840df94
Showing 3 changed files with 10 additions and 1 deletion.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,10 @@
 Changes
 =======

+## Unreleased
+
+- LsiModel: Only log top words that actually exist in the dictionary (PR [#3091](https://github.com/RaRe-Technologies/gensim/pull/3091), [@kmurphy4](https://github.com/kmurphy4))
+
 ## 4.0.1, 2021-04-01

 Bugfix release to address issues with Wheels on Windows:
4 changes: 3 additions & 1 deletion gensim/models/lsimodel.py
@@ -670,7 +670,9 @@ def show_topic(self, topicno, topn=10):
         c = np.asarray(self.projection.u.T[topicno, :]).flatten()
         norm = np.sqrt(np.sum(np.dot(c, c)))
         most = matutils.argsort(np.abs(c), topn, reverse=True)
-        return [(self.id2word[val], 1.0 * c[val] / norm) for val in most]
+
+        # Output only (word, score) pairs for `val`s that are within `self.id2word`. See #3090 for details.
+        return [(self.id2word[val], 1.0 * c[val] / norm) for val in most if val in self.id2word]

     def show_topics(self, num_topics=-1, num_words=10, log=False, formatted=True):
         """Get the most significant topics.
3 changes: 3 additions & 0 deletions gensim/utils.py
@@ -835,6 +835,9 @@ def __getitem__(self, val):
             return str(val)
         raise ValueError("internal id out of bounds (%s, expected <0..%s))" % (val, self.num_terms))

+    def __contains__(self, val):
+        return 0 <= val < self.num_terms
+
     def iteritems(self):
         """Iterate over all keys and values.