Add `gensim.models.BaseKeyedVectors.add_entity` method for fill `KeyedVectors` in manual way. Fix #1942 #1957
Merged: menshikh-iv merged 13 commits into piskvorky:develop from persiyanov:feature/add-word-method-to-keyed-vectors on Mar 20, 2018
Commits
99bcf44
Introduce BaseKeyedVectors.add(...) method
06955c4
make default count=1
089d346
add test on add_word method
f428571
Merge branch 'develop' into feature/add-word-method-to-keyed-vectors
0aff584
address @menshikh-iv comments
f6e5e79
fix test_keyedvectors after removing add_word alias
d4b0ffe
add __setitem__, add bulk entities processing + some tests on new fun…
912d462
addressing @menshikh-iv comments on docstrings
3611320
Merge branch 'develop' into feature/add-word-method-to-keyed-vectors
437a142
addressing @gojomo comments
737cd36
addressing nitpicks
070fbed
make self.vectors = np.zeros((0, vector_size)) by default
2294c07
fix pep8
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -154,6 +154,19 @@ def get_vector(self, entity): | |
else: | ||
raise KeyError("'%s' not in vocabulary" % entity) | ||
|
||
def add(self, entity, weights): | ||
"""Accept an entity specified by string tag and vector weights as 1D numpy array with shape (`vector_size`,). | ||
Review comment: Please use numpy-style docstrings.
||
If `entity` is already in vocabulary, the call of method has no effect. | ||
""" | ||
entity_id = len(self.vocab) | ||
if entity in self.vocab: | ||
logger.warning("duplicate entity '%s' in vocab, keeping old vector", entity) | ||
return | ||
|
||
self.vocab[entity] = Vocab(index=entity_id, count=1) | ||
self.vectors = vstack((self.vectors, weights)) | ||
self.index2entity.append(entity) | ||
|
||
def __getitem__(self, entities): | ||
""" | ||
Accept a single entity (string tag) or list of entities as input. | ||
|
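For readers who want to try the behaviour in the hunk above outside of gensim, here is a minimal sketch. `TinyKeyedVectors` and the `Vocab` namedtuple are hypothetical stand-ins, not gensim classes; the zero-row `vectors` default follows the commit message "make self.vectors = np.zeros((0, vector_size)) by default", and the shape check is an extra assumption that is not part of the diff. The docstring uses the numpy style requested in review.

```python
import logging
from collections import namedtuple

import numpy as np

logger = logging.getLogger(__name__)

# Hypothetical stand-in for gensim's Vocab; only the two fields used in the diff.
Vocab = namedtuple("Vocab", ["index", "count"])


class TinyKeyedVectors:
    """Hypothetical minimal container; NOT gensim's BaseKeyedVectors."""

    def __init__(self, vector_size):
        self.vector_size = vector_size
        self.vocab = {}
        self.index2entity = []
        # Zero-row matrix, so the very first vstack call already works.
        self.vectors = np.zeros((0, vector_size))

    def add(self, entity, weights):
        """Append one entity and its vector.

        Parameters
        ----------
        entity : str
            Tag of the entity to add.
        weights : numpy.ndarray
            1D array with shape (`vector_size`,).

        """
        if entity in self.vocab:
            # Same policy as in the diff: keep the old vector, do nothing.
            logger.warning("duplicate entity '%s' in vocab, keeping old vector", entity)
            return
        weights = np.asarray(weights)
        # Assumption: this shape check is added for the sketch, it is not in the diff.
        if weights.shape != (self.vector_size,):
            raise ValueError("expected shape (%d,)" % self.vector_size)
        self.vocab[entity] = Vocab(index=len(self.vocab), count=1)
        self.vectors = np.vstack((self.vectors, weights))
        self.index2entity.append(entity)


kv = TinyKeyedVectors(vector_size=3)
kv.add("cat", np.array([1.0, 2.0, 3.0]))
kv.add("cat", np.zeros(3))  # duplicate: logged and ignored
kv.add("dog", np.ones(3))
```

After these calls the container holds two rows, and `kv.vocab["dog"].index` points at the second one.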
What about reusing this function? Right now this duplicates https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L182.
Also, we need to add tests to check this functionality:
@menshikh-iv
About reusing this function:
It's a bit difficult because this function uses `vstack` to append a new word vector to `self.vectors`, while `add_word` in utils_any2vec creates the `vectors` array first (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L180) and then just inserts vectors into it (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L197).
While it's possible to follow DRY here, the interface of the `BaseKeyedVectors.add()` method would become more complicated. Alternatively, I can change the logic in `utils_any2vec` -- not create `vectors = np.zeros(...)` but append each word to the array -- but that could decrease the performance of the `load_word2vec_format` function.
If either of these two options is okay, I'll implement it.
Aha, thanks for the explanation, let's leave it as is.
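The trade-off discussed in this thread can be illustrated with a short sketch. The two helper functions below are hypothetical and are not taken from gensim; they only contrast the two filling strategies mentioned above. Appending with `vstack` allocates a new array and copies all existing rows on every call, so filling n rows costs time quadratic in n, whereas preallocating with `np.zeros` and assigning rows in place copies each row once.

```python
import numpy as np


def fill_by_vstack(rows, vector_size):
    """Hypothetical helper: grow the matrix one row at a time."""
    vectors = np.zeros((0, vector_size))
    for row in rows:
        # Each call builds a new array and copies every existing row.
        vectors = np.vstack((vectors, row))
    return vectors


def fill_preallocated(rows, vector_size):
    """Hypothetical helper: allocate once, then write rows in place."""
    vectors = np.zeros((len(rows), vector_size))
    for i, row in enumerate(rows):
        vectors[i] = row
    return vectors


rows = [np.full(4, float(i)) for i in range(5)]
grown = fill_by_vstack(rows, vector_size=4)
prealloc = fill_preallocated(rows, vector_size=4)
```

Both helpers produce the same matrix; they differ only in how much copying happens along the way, which is why preallocation suits bulk loading and `vstack` is acceptable for occasional manual additions.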