[WIP] Sklearn wrapper for RandomProjections Model #1395

Merged
Changes from 7 commits
27 commits
0c5bcb0
created new file for rpmodel_sklearn_wrapper
chinmayapancholi13 Jun 6, 2017
0810428
updated get_params, set_params functions
chinmayapancholi13 Jun 6, 2017
d67f047
correction in calling init function
chinmayapancholi13 Jun 7, 2017
a9ce401
added fit, transform, partial_fit function
chinmayapancholi13 Jun 7, 2017
05ad743
added tests for Rp model's sklearn wrapper
chinmayapancholi13 Jun 7, 2017
f1b9c4a
minor correction in docstring in LDA and LSI models
chinmayapancholi13 Jun 7, 2017
8696e54
added newline before class definition (PEP8)
chinmayapancholi13 Jun 8, 2017
fe2f947
removed 'corpus' from 'init' and set 'corpus' in 'fit'
chinmayapancholi13 Jun 8, 2017
7317173
updated docstring for 'fit' function
chinmayapancholi13 Jun 8, 2017
692be88
refactored code to use 'self.model'
chinmayapancholi13 Jun 13, 2017
a2ec746
code style changes
chinmayapancholi13 Jun 13, 2017
954715e
refactored wrapper and tests
chinmayapancholi13 Jun 14, 2017
6c3b819
removed 'self.corpus' attribute and refactored slightly
chinmayapancholi13 Jun 14, 2017
aee04ff
updated 'self.__model' to 'self.gensim_model'
chinmayapancholi13 Jun 15, 2017
a73dacc
updated test data
chinmayapancholi13 Jun 15, 2017
da602d9
updated 'fit' and 'transform' methods
chinmayapancholi13 Jun 15, 2017
c1087ac
updated 'testTransform' test
chinmayapancholi13 Jun 15, 2017
00f5336
PEP8 change
chinmayapancholi13 Jun 15, 2017
376959d
updated 'testTransform' test
chinmayapancholi13 Jun 15, 2017
9c888d6
added 'NotFittedError' in 'transform' function
chinmayapancholi13 Jun 16, 2017
373c36c
added 'testPersistence' and 'testModelNotFitted' tests
chinmayapancholi13 Jun 16, 2017
f3c3601
added input 'docs' description in 'transform' function
chinmayapancholi13 Jun 16, 2017
ab90b68
added 'testPipeline' test
chinmayapancholi13 Jun 16, 2017
928c7f2
replaced 'text_lda' variable with 'text_rp'
chinmayapancholi13 Jun 18, 2017
cf13c9a
updated 'testPersistence' test
chinmayapancholi13 Jun 19, 2017
cde12f2
set fixed seed in 'testPipeline' test
chinmayapancholi13 Jun 19, 2017
26cd2df
Merge branch 'develop' into rp_wrapper_scikitlearn
menshikh-iv Jun 20, 2017
@@ -76,7 +76,7 @@ def set_params(self, **parameters):
     def fit(self, X, y=None):
         """
         For fitting corpus into the class object.
-        Calls gensim.model.LdaModel:
+        Calls gensim.models.LdaModel:
         >>> gensim.models.LdaModel(corpus=corpus, num_topics=num_topics, id2word=id2word, passes=passes, update_every=update_every, alpha=alpha, iterations=iterations, eta=eta, random_state=random_state)
         """
         if sparse.issparse(X):
@@ -58,7 +58,7 @@ def set_params(self, **parameters):
     def fit(self, X, y=None):
         """
         For fitting corpus into the class object.
-        Calls gensim.model.LsiModel:
+        Calls gensim.models.LsiModel:
         >>>gensim.models.LsiModel(corpus=corpus, num_topics=num_topics, id2word=id2word, chunksize=chunksize, decay=decay, onepass=onepass, power_iters=power_iters, extra_samples=extra_samples)
Owner

This documentation line doesn't seem to help -- what are these undefined variables like id2word, chunksize etc?

Contributor Author

These params (id2word, chunksize etc) are associated with the LSI model. This change is in the file sklearn_wrapper_gensim_lsimodel.py. Since this change was so small (literally one word in a docstring), I added this change in this PR (PR concerning RP model wrapper) itself.
There is also a similar change for LDA model here. Should I remove these changes from this PR?

         """
         if sparse.issparse(X):
57 changes: 57 additions & 0 deletions gensim/sklearn_integration/sklearn_wrapper_gensim_rpmodel.py
@@ -0,0 +1,57 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Copyright (C) 2011 Radim Rehurek <[email protected]>
+# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
+#
Owner

Code style: remove #, insert blank line.

Contributor Author

Thanks! Updated now.

+"""
+Scikit learn interface for gensim for easy use of gensim with scikit-learn
+Follows scikit-learn API conventions
+"""
+from gensim import models
Owner

Blank line before imports.

Also, block the imports: built-in first, 3rd party second, local package imports last.

Contributor Author

Thanks! Updated now.

+from gensim.sklearn_integration import base_sklearn_wrapper
+from sklearn.base import TransformerMixin, BaseEstimator
+
+
+class SklearnWrapperRpModel(models.RpModel, base_sklearn_wrapper.BaseSklearnWrapper, TransformerMixin, BaseEstimator):
+    """
+    Base RP module
+    """
+
+    def __init__(self, corpus, id2word=None, num_topics=300):
Contributor

Please remove the 'corpus' argument; the corpus should be passed only to the 'fit' method.

+        """
+        Sklearn wrapper for RP model. Class derived from gensim.models.RpModel.
+        """
+        self.corpus = corpus
+        self.id2word = id2word
+        self.num_topics = num_topics
+
+    def get_params(self, deep=True):
+        """
+        Returns all parameters as dictionary.
+        """
+        return {"corpus": self.corpus, "id2word": self.id2word, "num_topics": self.num_topics}
+
+    def set_params(self, **parameters):
+        """
+        Set all parameters.
+        """
+        super(SklearnWrapperRpModel, self).set_params(**parameters)
+
+    def fit(self, X, y=None):
+        """
+        For fitting corpus into class object.
+        Calls gensim.models.RpModel
Contributor

Replace the docstring with: "Fit the model according to the given training data."

+        >>>gensim.models.RpModel(corpus=self.corpus, id2word=self.id2word, num_topics=self.num_topics)
+        """
+        super(SklearnWrapperRpModel, self).__init__(corpus=self.corpus, id2word=self.id2word, num_topics=self.num_topics)
+
+    def transform(self, doc):
+        """
+        Take document/corpus as input.
+        Return RP representation of the input document/corpus.
+        """
+        return self[doc]
+
+    def partial_fit(self, X):
+        raise NotImplementedError("'partial_fit' has not been implemented for the RandomProjections model")
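
The review comments on this file ask that 'corpus' be dropped from the constructor and passed only to 'fit'. The following is a minimal, standard-library-only sketch of that convention, not the code that was merged and not gensim's RpModel implementation. The class name `RandomProjectionSketch`, the `seed` parameter, the `projection_` attribute, the local `NotFittedError` class, and the ±1/sqrt(num_topics) projection matrix are all assumptions made for illustration.

```python
import random


class NotFittedError(Exception):
    """Hypothetical stand-in for sklearn.exceptions.NotFittedError."""


class RandomProjectionSketch:
    """Illustrative transformer: hyperparameters in __init__, data only in fit."""

    def __init__(self, num_topics=300, seed=None):
        # Only hyperparameters are stored here; no training data.
        self.num_topics = num_topics
        self.seed = seed
        self.projection_ = None  # populated by fit()

    def get_params(self, deep=True):
        # Returns constructor arguments only, so no corpus appears here.
        return {"num_topics": self.num_topics, "seed": self.seed}

    def set_params(self, **parameters):
        for name, value in parameters.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        """Fit the model according to the given training data.

        X is a list of bag-of-words documents, each a list of (term_id, weight).
        """
        num_terms = 1 + max((term_id for doc in X for term_id, _ in doc), default=-1)
        rng = random.Random(self.seed)
        scale = self.num_topics ** -0.5
        # One row per output dimension, entries drawn from {-scale, +scale}.
        self.projection_ = [
            [rng.choice((-scale, scale)) for _ in range(num_terms)]
            for _ in range(self.num_topics)
        ]
        return self

    def transform(self, docs):
        if self.projection_ is None:
            raise NotFittedError("call fit() before transform()")
        # Term ids not seen during fit are ignored.
        return [
            [
                sum(row[term_id] * weight for term_id, weight in doc if term_id < len(row))
                for row in self.projection_
            ]
            for doc in docs
        ]


corpus = [[(0, 1.0), (1, 1.0)], [(1, 2.0), (2, 1.0)]]
model = RandomProjectionSketch(num_topics=2, seed=13).fit(corpus)
vectors = model.transform(corpus)
```

Keeping the corpus out of `__init__` means `get_params` returns only cheap, copyable settings, which is what scikit-learn utilities such as `clone` and `Pipeline` rely on.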
23 changes: 23 additions & 0 deletions gensim/test/test_sklearn_integration.py
@@ -16,6 +16,7 @@

 from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklearnWrapperLdaModel
 from gensim.sklearn_integration.sklearn_wrapper_gensim_lsimodel import SklearnWrapperLsiModel
+from gensim.sklearn_integration.sklearn_wrapper_gensim_rpmodel import SklearnWrapperRpModel
 from gensim.corpora import Dictionary
 from gensim import matutils

@@ -192,5 +193,27 @@ def testSetGetParams(self):
             self.assertEqual(model_params[key], param_dict[key])


+class TestSklearnRpModelWrapper(unittest.TestCase):
+    def setUp(self):
+        numpy.random.seed(13)
+        self.model = SklearnWrapperRpModel(corpus, num_topics=2)
+        self.model.fit(corpus)
+
+    def testTransform(self):
+        # transform one document
+        doc = list(self.model.corpus)[0]
+        transformed_doc = self.model.transform(doc)
+        vec = matutils.sparse2full(transformed_doc, 2)  # convert to dense vector, for easier equality tests
+
+        expected_vec = numpy.array([-0.70710677, 0.70710677])
+        self.assertTrue(numpy.allclose(vec, expected_vec))  # transformed entries must be equal up to sign
+
+    def testSetGetParams(self):
+        # updating only one param
+        self.model.set_params(num_topics=3)
+        model_params = self.model.get_params()
+        self.assertEqual(model_params["num_topics"], 3)
+
+
 if __name__ == '__main__':
     unittest.main()
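
The inline comment in 'testTransform' says the transformed entries must be equal up to sign, whereas `numpy.allclose(vec, expected_vec)` compares signed values. If a sign-insensitive check were actually intended, one possible reading is an entry-by-entry comparison of absolute values. The sketch below assumes that reading; the helper name `allclose_up_to_sign` and its tolerances are hypothetical and not part of the PR.

```python
import math


def allclose_up_to_sign(actual, expected, rel_tol=1e-6, abs_tol=1e-8):
    """Return True if the vectors match entry by entry when signs are ignored."""
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(abs(a), abs(e), rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )


expected_vec = [-0.70710677, 0.70710677]
# Signs flipped relative to expected_vec, magnitudes identical.
same_magnitudes = allclose_up_to_sign([0.70710677, -0.70710677], expected_vec)
# First entry has a different magnitude.
different_magnitudes = allclose_up_to_sign([0.5, 0.70710677], expected_vec)
```

With a fixed seed, as in 'setUp', the signed comparison used in the test is stricter and arguably preferable; the helper only matters if the random signs are not pinned down.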