Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] scikit_learn wrapper for LSI Model in Gensim #1244

Merged
merged 8 commits into from
Mar 28, 2017

Conversation

chinmayapancholi13
Copy link
Contributor

This PR adds a scikit-learn wrapper for LSI Model in Gensim.

self.model.fit(corpus)

def testPrintTopic(self):
topic = self.model.print_topics(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems redundant. Please justify including it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmylk My rationale for having this test was basically to serve as a sanity check after setUp() prepares the test fixture. Since such a test was present among the unit tests for the sklearn wrapper written for LDA model as well, I decided to include it here too. Please let me know if we should drop this test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rename as testModelSanity then

cache = pickle.loads(uncompressed_content)
data = cache
id2word=Dictionary(map(lambda x : x.split(), data.data))
corpus = [id2word.doc2bow(i.split()) for i in data.data]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The transformation of a text into gensim bow format corpus should have a sklearn-wrapper around so that the pipeline can take in text, not bow. See more about Gensim corpus in this [tutorial(https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb) Please either add it in this PR or create a wishlist issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please respond to this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmylk My sincerest apologies for the delayed response. I'd be happy to submit a PR to resolve this issue. I did not add code for this in the current PR because some of the code for the wrapper for LdaModel might also have to be refactored. So maintaining modularity and proposing a solution in a different PR seemed to be a better choice in this case.

"""

def __init__(self, corpus=None, num_topics=200, id2word=None, chunksize=20000,
decay=1.0, onepass=True, power_iters=P2_EXTRA_ITERS, extra_samples=P2_EXTRA_DIMS):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make defaults explicits. what is the reasons for using constant variables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmylk Thanks for pointing this out. I'll make the default values explicit here.

@tmylk
Copy link
Contributor

tmylk commented Mar 28, 2017

@chinmayapancholi13
Copy link
Contributor Author

@tmylk Sure. Adding LsiModel wrapper example to the IPython notebook.

@chinmayapancholi13
Copy link
Contributor Author

@tmylk For LsiModel, I have added an example which uses pipeline.Pipeline. I was initially planning to add examples like the isolated usage (without scikit_learn) of the wrapper and usage with only CountVectorizer. But such cases would be exactly congruent to the examples given for LdaModel and would lead to unnecessary repetition. Please let me know if we should add such examples in the notebook.

@tmylk tmylk merged commit a83e61b into piskvorky:develop Mar 28, 2017
@tmylk
Copy link
Contributor

tmylk commented Mar 28, 2017

Thanks for the new wrapper!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants