Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gensim.models.BaseKeyedVectors.add_entity method for fill KeyedVectors in manual way. Fix #1942 #1957

13 changes: 13 additions & 0 deletions gensim/models/keyedvectors.py
Original file line number Diff line number Diff line change
Expand Up @@ -154,6 +154,19 @@ def get_vector(self, entity):
else:
raise KeyError("'%s' not in vocabulary" % entity)

def add(self, entity, weights):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's about re-using this function in (this is duplication right now from https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L182)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, need to add tests to check this functionality:

  • load some kv, add more vectors in a manual way and check that this added fine
  • create empty kv, fill it manually and check that all fine

Copy link
Contributor Author

@persiyanov persiyanov Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv

About reusing this function:

It's a bit difficult because in this function vstack is used to append new word vector to self.vectors, while add_word in utils_any2vec creates vectors array at first (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L180) and then it just inserts vectors into it (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L197).

While it's possible to follow DRY here, the interface of BaseKeyedVectors.add() method will be more complicated (or I can change the logic in utils_any2vec -- not to create vectors = np.zeros(...) but append each word to the array, but it could decrease the performance of load_word2vec_format function).

If some of these two options is okay, I'll implement it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, thanks for suggestion, let's stay it as is.

"""Accept an entity specified by string tag and vector weights as 1D numpy array with shape (`vector_size`,).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use numpy-style docstrings

If `entity` is already in vocabulary, the call of method has no effect.
"""
entity_id = len(self.vocab)
if entity in self.vocab:
logger.warning("duplicate entity '%s' in vocab, keeping old vector", entity)
return

self.vocab[entity] = Vocab(index=entity_id, count=1)
self.vectors = vstack((self.vectors, weights))
self.index2entity.append(entity)

def __getitem__(self, entities):
"""
Accept a single entity (string tag) or list of entities as input.
Expand Down