
[WIP] Updating Gensim's Word2vec-Keras integration #7

Merged

Conversation

chinmayapancholi13
Collaborator

This PR adds support for integrating Keras with Gensim's word2vec model, using the get_embedding_layer function added to Gensim in PR#1248.
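In essence, such an integration turns the trained word vectors into the weight matrix of a Keras Embedding layer. A minimal numpy sketch of the matrix involved (toy vocabulary and random vectors standing in for a trained model; this is an illustration, not the real API):

```python
import numpy as np

# Toy stand-in for trained word vectors (hypothetical vocabulary; a real
# model would come from gensim Word2Vec training).
vocab = {"physics": 0, "mathematics": 1, "theology": 2}
vecsize = 4
rng = np.random.default_rng(0)
vectors = {word: rng.normal(size=vecsize) for word in vocab}

# Row i of the weight matrix holds the vector of the word with index i.
# This matrix is what would initialize a Keras Embedding layer, e.g.
# keras.layers.Embedding(len(vocab), vecsize, weights=[weights], trainable=False)
weights = np.zeros((len(vocab), vecsize))
for word, idx in vocab.items():
    weights[idx] = vectors[word]
```

Freezing the layer (`trainable=False`) keeps the pretrained vectors fixed while the rest of the network trains on top of them.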

@chinmayapancholi13
Collaborator Author

Example usage of the code added in this PR:

```python
import shorttext
from gensim.models import word2vec as w2v

# train a word2vec model on a line-per-sentence text file
input_text_file = "word_vectors_training_data.txt"
input_data = w2v.LineSentence(input_text_file)
w2v_model = w2v.Word2Vec(input_data, size=300)
wv = w2v_model.wv  # keep the KeyedVectors; avoid shadowing the module alias

# train the CNN classifier on the bundled subject-keywords dataset
trainclassdict = shorttext.data.subjectkeywords()
kmodel = shorttext.classifiers.frameworks.CNNWord2Vec(len(trainclassdict.keys()), wv)
classifier = shorttext.classifiers.VarNNWord2VecClassifier(wv)
classifier.train(trainclassdict, kmodel)

print(classifier.score('artificial intelligence'))
```

Output:

```
{'mathematics': 0.66653651, 'physics': 0.14110285, 'theology': 0.19236061}
```

@chinmayapancholi13 chinmayapancholi13 changed the title [WIP] Word2vec keras integration [WIP] Updating Gensim's Word2vec-Keras integration Jun 8, 2017
@chinmayapancholi13
Collaborator Author

@stephenhky I have refactored the code in the latest commit by removing some redundant files and classes. I have also added a boolean variable with_gensim (default value True) which indicates whether the word embeddings used in the Keras model come from Gensim's Word2Vec model. This boolean variable is used in the classes CNNWordEmbed and VarNNEmbeddedVecClassifier.
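One plausible shape of such a branch, sketched with a hypothetical helper function (only the with_gensim name comes from the PR):

```python
import numpy as np

def build_embedding_weights(with_gensim, pretrained=None,
                            vocab_size=100, vecsize=300):
    """Sketch of the with_gensim branch (function name hypothetical):
    reuse pretrained gensim vectors when True, otherwise start from
    small random weights that training will update."""
    if with_gensim:
        return np.asarray(pretrained)
    return np.random.uniform(-0.05, 0.05, size=(vocab_size, vecsize))

pretrained = np.ones((10, 300))  # stand-in for gensim word vectors
frozen = build_embedding_weights(True, pretrained=pretrained)
fresh = build_embedding_weights(False, vocab_size=5, vecsize=8)
```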

```python
from keras.preprocessing.sequence import pad_sequences

@cio.compactio({'classifier': 'nnlibvec'}, 'nnlibvec', ['_classlabels.txt', '.json', '.h5'])
class VarNNWord2VecClassifier:
```
Owner


I know VarNNEmbedVecClassification was a lousy name for the original class, and you have named this new class after it, but maybe you want to refactor it to another name. Also, since gensim supports various kinds of embedded vectors, I'd prefer a name that is not restricted to word2vec.

Do you want to come up with a new name?

```
:param wvmodel: Word2Vec model
:param vecsize: length of the embedded vectors in the model (Default: 300)
:param maxlen: maximum number of words in a sentence (Default: 15)
:type wvmodel: gensim.models.word2vec.Word2Vec
```
Owner


The type needs to be changed to `gensim.models.keyedvectors.KeyedVectors`.

```
:param classdict: training data
:return: a tuple of three, containing a list of class labels, matrix of embedded word vectors, and corresponding outputs
:type classdict: dict
:rtype: (list, numpy.ndarray, list)
```
Owner


The second element of the return type needs to be changed accordingly.

```python
# convert classdict to training input vectors
self.classlabels, x_train, y_train = self.convert_trainingdata_matrix(classdict)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train)
```
Owner


Is there a particular reason for choosing the tokenizer provided by Keras?
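For reference, the Keras Tokenizer essentially builds a frequency-ranked word-to-index map. A dependency-free sketch of the same behavior (helper names hypothetical, mirroring `fit_on_texts` and `texts_to_sequences`):

```python
from collections import Counter

def fit_on_texts(texts):
    # Like keras Tokenizer.fit_on_texts: index words by descending
    # frequency, starting at 1 (index 0 is reserved for padding).
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_sequences(word_index, texts):
    # Like Tokenizer.texts_to_sequences: words not in the index are dropped.
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

word_index = fit_on_texts(["machine learning", "machine intelligence"])
sequences = texts_to_sequences(word_index, ["machine intelligence"])
```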

@stephenhky
Owner

Thanks for making the code reusable and backward compatible.

Right now, with this implementation, a trained model is saved without recording whether with_gensim is True or not. Similarly, this information is missing when a saved model is loaded. But it is important, so somehow there has to be a way to specify it during save and load.
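One way to close this gap would be to persist the flag in the JSON metadata saved alongside the model. A sketch with hypothetical function and field names (only the with_gensim name comes from the PR):

```python
import io
import json

def save_metadata(fileobj, classlabels, with_gensim):
    # Persist with_gensim next to the class labels so that loading can
    # rebuild the right architecture (field names hypothetical).
    json.dump({"classlabels": classlabels, "with_gensim": with_gensim}, fileobj)

def load_metadata(fileobj):
    meta = json.load(fileobj)
    return meta["classlabels"], meta["with_gensim"]

# round-trip through an in-memory file
buf = io.StringIO()
save_metadata(buf, ["mathematics", "physics", "theology"], True)
buf.seek(0)
labels, with_gensim = load_metadata(buf)
```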

@stephenhky
Owner

Let me authorize this pull request first, but remember that the I/O changes still need to be done.

@stephenhky stephenhky merged commit 311d41e into stephenhky:master Jun 27, 2017
@chinmayapancholi13
Collaborator Author

@stephenhky Sure. I'll update the code to include the modifications to the save and load functions as well.

@stephenhky
Owner

Thanks. Also do the same thing for DoubleCNNWordEmbed and CLSTMWordEmbed.

@chinmayapancholi13
Collaborator Author

Sure. :)

@stephenhky
Owner

What I mean is: whatever changes you have made to CNNWordEmbed, apply similar changes to DoubleCNNWordEmbed and CLSTMWordEmbed.
