Better support for training word2vec models on a corpus consisting of multiple files #1364

Closed
michaelwsherman opened this issue May 24, 2017 · 6 comments
Labels
wishlist Feature request

Comments

@michaelwsherman
Contributor

models.word2vec.LineSentence offers a simple generator that takes either a file object or a path to a single file and reads each line as a sentence whose tokens are separated by whitespace. This assumes that the corpus to be trained on is all in a single file, which may not be the case. And adding additional files to a word2vec model after it's been trained is very complex, requiring multiple calls and manual management of learning rates.

Rather than force the user to write their own generator, to concatenate their files, or to manually manage learning rates across multiple files, it would be desirable if word2vec could take a directory reference or a list of files directly, through LineSentence or a similar generator.

I have code for a generator that takes all files in a directory and makes a generator suitable for training a word2vec model (or any other model compatible with LineSentence). If desired, I can create a pull request with this generator. I can also (attempt to) change LineSentence to detect if it has been passed a file-like object, a file path, or a directory path (and also possibly a list of files) and then generate lists of tokens appropriately, and then submit the pull request.
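
For illustration, a minimal sketch of what such a directory-based corpus reader might look like. This is not the code from the eventual pull request: the class name `DirectoryLineSentences` is hypothetical, and the sketch assumes every regular file in the directory is plain UTF-8 text.

```python
import os


class DirectoryLineSentences:
    """Hypothetical sketch: yield each line of every file in a directory
    as a list of whitespace-separated tokens."""

    def __init__(self, directory):
        self.directory = directory

    def __iter__(self):
        # Sort the names so the iteration order is deterministic.
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            # Skip subdirectories and anything else that is not a regular file.
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf8") as handle:
                for line in handle:
                    tokens = line.split()
                    # Skip blank lines rather than yielding empty sentences.
                    if tokens:
                        yield tokens
```

Writing it as a class with `__iter__`, rather than a one-shot generator function, means the corpus can be iterated more than once, which matters because training makes several passes over the data.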

Please let me know if this is desired. Thank you.

@menshikh-iv
Contributor

Hi @michaelwsherman, I think this will be useful. Please submit your PR.

@menshikh-iv menshikh-iv added the wishlist Feature request label May 25, 2017
@michaelwsherman
Contributor Author

Ok, give me a week or two, I've never submitted anything to open source before and I want to make sure I don't create extra work for others by messing something up. But it will come.

Right now it's a separate method that just assumes all files in a directory are text files. Should I add some improvements (gzip support? merging it into the LineSentence method and adding input detection?) or just start with what I have?

@menshikh-iv
Contributor

It's better to start with what you already have; we can discuss other features after your PR.

@gojomo
Collaborator

gojomo commented May 25, 2017

When the work is in a single git branch, you can create the PR, get feedback, and continue to update the branch/PR with improvements – so always OK to start with something rough. (You won't mess anything up unless/until it's reviewed and accepted, and you can mark it as 'in progress'/'for review' to indicate it's not yet ready.)

You should use the smart_open package for file-opening - it will automatically detect gzip name-extensions and do the right thing.
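
To make the idea of extension-based detection concrete, here is a rough stdlib-only sketch. It is not smart_open's implementation or API: `open_text` is a hypothetical helper that only looks at the `.gz` and `.bz2` suffixes and assumes UTF-8 text.

```python
import bz2
import gzip


def open_text(path):
    """Hypothetical helper: choose an opener from the file name's extension."""
    if path.endswith(".gz"):
        # gzip.open in text mode decompresses and decodes on the fly.
        return gzip.open(path, "rt", encoding="utf8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf8")
    # Anything else is treated as uncompressed text.
    return open(path, "r", encoding="utf8")
```

A directory reader could call a helper like this in place of the built-in `open`, so compressed and uncompressed files can sit side by side in the same corpus directory.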

But it's ok to start with whatever you've got, and see what suggestions for extension/refactoring/etc come up once that's concretely viewable via github!

michaelwsherman pushed a commit to bloomberg/gensim that referenced this issue Jun 16, 2017
added method models.word2vec.LineSentencePath

method to read an entire directory's files in the same style as
models.word2vec.LineSentence
michaelwsherman pushed a commit to bloomberg/gensim that referenced this issue Jun 16, 2017
initial attempt at test, including files. test just splits the
lee_background.cor file into two parts and puts them in a directory,
then makes sure they match the unsplit file as loaded by
word2vec.LineSentence
@michaelwsherman
Contributor Author

Pull request at #1423 . Would love to hear any feedback--first time OSS contributor.

menshikh-iv pushed a commit that referenced this issue Jul 18, 2017
…1364) (#1423)

* issue #1364 first commit, corpus from a directory

added method models.word2vec.LineSentencePath

method to read an entire directory's files in the same style as
models.word2vec.LineSentence

* test for word2vec.LineSentencePath issue #1364

initial attempt at test, including files. test just splits the
lee_background.cor file into two parts and puts them in a directory,
then makes sure they match the unsplit file as loaded by
word2vec.LineSentence

* better handling of input for LineSentencePath

no longer sensitive to an input without a trailing os-specific slash

* LineSentencePath renamed PathLineSentences

in word2vec.py . Test updated as well

* LineSentencePath rename to PathLineSentences

in models.word2vec . Tests also updated

* fix whitespace style error

had only 1 space before an inline comment, flagged by travis CI build

* updated PathLineSentences test and test data

Removed LineSentencePath directory, created PathLineSentences
lee corpus duplicates were in LineSentencePath, was wasting space
made new small corpus to test PathLineSentences, put in directory
changed test to read both files manually, combine, and compare to
PathLineSentences (rather than having a separate single file to match
the entire contents of the PathLineSentences test_data directory)

* word2vec.PathLineSentences single file support

changed PathLineSentences to support a single file in addition to a
directory, raises a warning to use LineSentence when a single file is
given as a parameter. added corresponding test.

* fixing style issues

* fix style issue
@menshikh-iv
Contributor

Resolved in #1423
