
[WIP] Updating Gensim's Word2vec-Keras integration #7

Merged

Conversation

chinmayapancholi13
Collaborator

This PR adds support for integrating Keras with Gensim's word2vec model, using the get_embedding_layer function added to Gensim in PR#1248.
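In essence, such an integration turns the trained word vectors into the weight matrix of a Keras Embedding layer. A minimal numpy sketch of the matrix involved (toy vocabulary and random vectors standing in for a trained model; this is an illustration, not the real API):

```python
import numpy as np

# Toy stand-in for trained word vectors (hypothetical vocabulary; a real
# model would come from gensim Word2Vec training).
vocab = {"physics": 0, "mathematics": 1, "theology": 2}
vecsize = 4
rng = np.random.default_rng(0)
vectors = {word: rng.normal(size=vecsize) for word in vocab}

# Row i of the weight matrix holds the vector of the word with index i.
# This matrix is what would initialize a Keras Embedding layer, e.g.
# keras.layers.Embedding(len(vocab), vecsize, weights=[weights], trainable=False)
weights = np.zeros((len(vocab), vecsize))
for word, idx in vocab.items():
    weights[idx] = vectors[word]
```

Freezing the layer (`trainable=False`) keeps the pretrained vectors fixed while the rest of the network trains on top of them.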

@chinmayapancholi13
Collaborator Author

Example usage of the code added in this PR:

```python
import shorttext
from gensim.models import word2vec as w2v

# train a word2vec model on a line-per-sentence text file
input_text_file = "word_vectors_training_data.txt"
input_data = w2v.LineSentence(input_text_file)
w2v_model = w2v.Word2Vec(input_data, size=300)
wv = w2v_model.wv  # keep the KeyedVectors; avoid shadowing the module alias

# train the CNN classifier on the bundled subject-keywords dataset
trainclassdict = shorttext.data.subjectkeywords()
kmodel = shorttext.classifiers.frameworks.CNNWord2Vec(len(trainclassdict.keys()), wv)
classifier = shorttext.classifiers.VarNNWord2VecClassifier(wv)
classifier.train(trainclassdict, kmodel)

print(classifier.score('artificial intelligence'))
```

Output:

```
{'mathematics': 0.66653651, 'physics': 0.14110285, 'theology': 0.19236061}
```

@chinmayapancholi13 chinmayapancholi13 changed the title [WIP] Word2vec keras integration [WIP] Updating Gensim's Word2vec-Keras integration Jun 8, 2017
@chinmayapancholi13
Collaborator Author

@stephenhky I have refactored the code in the latest commit by removing some redundant files and classes. I have also added a boolean variable with_gensim (default value True) which indicates whether the word embeddings used in the Keras model come from Gensim's Word2Vec model. This boolean variable is used in the classes CNNWordEmbed and VarNNEmbeddedVecClassifier.
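One plausible shape of such a branch, sketched with a hypothetical helper function (only the with_gensim name comes from the PR):

```python
import numpy as np

def build_embedding_weights(with_gensim, pretrained=None,
                            vocab_size=100, vecsize=300):
    """Sketch of the with_gensim branch (function name hypothetical):
    reuse pretrained gensim vectors when True, otherwise start from
    small random weights that training will update."""
    if with_gensim:
        return np.asarray(pretrained)
    return np.random.uniform(-0.05, 0.05, size=(vocab_size, vecsize))

pretrained = np.ones((10, 300))  # stand-in for gensim word vectors
frozen = build_embedding_weights(True, pretrained=pretrained)
fresh = build_embedding_weights(False, vocab_size=5, vecsize=8)
```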

```python
from keras.preprocessing.sequence import pad_sequences

@cio.compactio({'classifier': 'nnlibvec'}, 'nnlibvec', ['_classlabels.txt', '.json', '.h5'])
class VarNNWord2VecClassifier:
```
Owner


I know VarNNEmbedVecClassification was a lousy name for the original class, and you have named this new class after it, but maybe you want to refactor it to another name. Also, since gensim supports various kinds of embedded vectors, I'd prefer a name that is not restricted to word2vec.

Do you want to come up with a new name?

```
:param wvmodel: Word2Vec model
:param vecsize: length of the embedded vectors in the model (Default: 300)
:param maxlen: maximum number of words in a sentence (Default: 15)
:type wvmodel: gensim.models.word2vec.Word2Vec
```
Owner


The type needs to be changed to `gensim.models.keyedvectors.KeyedVectors`.

```
:param classdict: training data
:return: a tuple of three, containing a list of class labels, matrix of embedded word vectors, and corresponding outputs
:type classdict: dict
:rtype: (list, numpy.ndarray, list)
```
Owner


The second element of the return type needs to be changed accordingly.

```python
# convert classdict to training input vectors
self.classlabels, x_train, y_train = self.convert_trainingdata_matrix(classdict)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train)
```
Owner


Is there a particular reason for choosing the tokenizer provided by Keras?
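For reference, the Keras Tokenizer essentially builds a frequency-ranked word-to-index map. A dependency-free sketch of the same behavior (helper names hypothetical, mirroring `fit_on_texts` and `texts_to_sequences`):

```python
from collections import Counter

def fit_on_texts(texts):
    # Like keras Tokenizer.fit_on_texts: index words by descending
    # frequency, starting at 1 (index 0 is reserved for padding).
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_sequences(word_index, texts):
    # Like Tokenizer.texts_to_sequences: words not in the index are dropped.
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

word_index = fit_on_texts(["machine learning", "machine intelligence"])
sequences = texts_to_sequences(word_index, ["machine intelligence"])
```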

@stephenhky
Owner

Thanks for making the code reusable and backward compatible.

Right now, with this implementation, a trained model is saved without recording whether with_gensim is True or not. Similarly, this information is missing when a saved model is loaded. But it is important, so somehow there has to be a way to specify it during save and load.
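One way to close this gap would be to persist the flag in the JSON metadata saved alongside the model. A sketch with hypothetical function and field names (only the with_gensim name comes from the PR):

```python
import io
import json

def save_metadata(fileobj, classlabels, with_gensim):
    # Persist with_gensim next to the class labels so that loading can
    # rebuild the right architecture (field names hypothetical).
    json.dump({"classlabels": classlabels, "with_gensim": with_gensim}, fileobj)

def load_metadata(fileobj):
    meta = json.load(fileobj)
    return meta["classlabels"], meta["with_gensim"]

# round-trip through an in-memory file
buf = io.StringIO()
save_metadata(buf, ["mathematics", "physics", "theology"], True)
buf.seek(0)
labels, with_gensim = load_metadata(buf)
```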

@stephenhky
Owner

Let me authorize this pull request first, but remember that the I/O changes still need to be done.

@stephenhky stephenhky merged commit 311d41e into stephenhky:master Jun 27, 2017
@chinmayapancholi13
Collaborator Author

@stephenhky Sure. I'll update the code to include the modifications to the save and load functions as well.

@stephenhky
Owner

Thanks. Also do the same thing for DoubleCNNWordEmbed and CLSTMWordEmbed.

@chinmayapancholi13
Collaborator Author

Sure. :)

@stephenhky
Owner

What I mean is: whatever changes you have made to CNNWordEmbed, apply similar changes to DoubleCNNWordEmbed and CLSTMWordEmbed.
