Better support for training word2vec models on a corpus consisting of multiple files #1364

michaelwsherman · 2017-05-24T23:07:23Z

models.word2vec.LineSentence offers a simple generator that takes either a file object or a path to a single file and reads each line as a sentence, with tokens separated by whitespace. This assumes that the entire training corpus is in a single file, which may not be the case. Adding further files to a word2vec model after it's been trained is very complex, requiring multiple calls and manual management of learning rates.

Rather than forcing the user to write their own generator, concatenate their files, or manually manage learning rates across multiple files, it would be desirable for word2vec to accept a directory reference or a list of files directly, through LineSentence or a similar generator.

I have code for a generator that takes all files in a directory and makes a generator suitable for training a word2vec model (or any other model compatible with LineSentence). If desired, I can create a pull request with this generator. I can also (attempt to) change LineSentence to detect if it has been passed a file-like object, a file path, or a directory path (and also possibly a list of files) and then generate lists of tokens appropriately, and then submit the pull request.
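To make the proposal concrete, here is a minimal sketch of such a directory reader. It is not the code referred to above: the class name `DirectorySentences` is hypothetical, it uses only the standard library, and it assumes every regular file in the directory is UTF-8 text. It is written as a class with `__iter__` rather than a plain generator function so that it can be iterated more than once, which multi-pass training needs.

```python
import os


class DirectorySentences:
    """Hypothetical helper: iterate over every line of every file in a directory.

    Each non-empty line is yielded as a list of whitespace-separated tokens.
    """

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # Sort the names so the reading order is the same on every pass.
        for name in sorted(os.listdir(self.dir_path)):
            full_path = os.path.join(self.dir_path, name)
            if not os.path.isfile(full_path):
                continue  # skip subdirectories and other non-files
            with open(full_path, encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:  # ignore blank lines
                        yield tokens
```

An instance of this class could then be handed to a model in the same place a LineSentence object would go, assuming the model only needs an iterable of token lists.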

Please let me know if this is desired. Thank you.

menshikh-iv · 2017-05-25T06:40:01Z

Hi @michaelwsherman, I think this will be useful. Please submit your PR.

michaelwsherman · 2017-05-25T14:28:09Z

Ok, give me a week or two. I've never submitted anything to open source before, and I want to make sure I don't create extra work for others by messing something up. But it will come.

Right now it's a separate method that just assumes all files in a directory are text files. Should I add some improvements (gzip support? merging it into the LineSentence method and adding input detection?) or just start with what I have?

menshikh-iv · 2017-05-25T14:48:47Z

It's better to start with what you already have; we can discuss other features after your PR.

gojomo · 2017-05-25T21:19:35Z

When the work is in a single git branch, you can create the PR, get feedback, and continue to update the branch/PR with improvements – so always OK to start with something rough. (You won't mess anything up unless/until it's reviewed and accepted, and you can mark it as 'in progress'/'for review' to indicate it's not yet ready.)

You should use the smart_open package for file-opening - it will automatically detect gzip name-extensions and do the right thing.
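As a rough illustration of what extension-based detection means, the sketch below does it by hand with the standard library. This is a stand-in and not smart_open's API: the helper name `open_text` is hypothetical, it handles only the `.gz` suffix, and it assumes UTF-8 text.

```python
import gzip


def open_text(path):
    """Hypothetical helper: open a file as text, decompressing it if the name ends in .gz."""
    if path.endswith(".gz"):
        # "rt" makes gzip.open return decoded text lines instead of raw bytes.
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

A directory reader could call a helper like this in place of the built-in `open`, so that plain and gzipped files can sit side by side in the same corpus directory.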

But it's ok to start with whatever you've got, and see what suggestions for extension/refactoring/etc come up once that's concretely viewable via github!

added method models.word2vec.LineSentencePath method to read an entire directory's files in the same style as models.word2vec.LineSentence

initial attempt at test, including files. test just splits the lee_background.cor file into two parts and puts them in a directory, then makes sure they match the unsplit file as loaded by word2vec.LineSentence

michaelwsherman · 2017-06-16T22:24:24Z

Pull request at #1423. Would love to hear any feedback; I'm a first-time OSS contributor.

…1364) (#1423)

* issue #1364 first commit, corpus from a directory
  added method models.word2vec.LineSentencePath method to read an entire directory's files in the same style as models.word2vec.LineSentence
* test for word2vec.LineSentencePath issue #1364
  initial attempt at test, including files. test just splits the lee_background.cor file into two parts and puts them in a directory, then makes sure they match the unsplit file as loaded by word2vec.LineSentence
* better handling of input for LineSentencePath
  no longer sensitive to an input without a trailing os-specific slash
* LineSentencePath renamed PathLineSentences in word2vec.py. Test updated as well
* LineSentencePath rename to PathLineSentences in models.word2vec. Tests also updated
* fix whitespace style error
  had only 1 space before an inline comment, flagged by travis CI build
* updated PathLineSentences test and test data
  Removed LineSentencePath directory, created PathLineSentences
  lee corpus duplicates were in LineSentencePath, was wasting space
  made new small corpus to test PathLineSentences, put in directory
  changed test to read both files manually, combine, and compare to PathLineSentences (rather than having a separate single file to match the entire contents of the PathLineSentences test_data directory)
* word2vec.PathLineSentences single file support
  changed PathLineSentences to support a single file in addition to a directory, raises a warning to use LineSentence when a single file is given as a parameter. added corresponding test.
* fixing style issues
* fix style issue
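The input handling described in these commits (directory or single file, a warning for the single-file case, no dependence on a trailing slash) might look roughly like the sketch below. It is an illustration only and not the merged code: the function name `resolve_input_files` and the warning text are hypothetical.

```python
import logging
import os

logger = logging.getLogger(__name__)


def resolve_input_files(source):
    """Hypothetical helper: return the sorted list of files that `source` refers to."""
    if os.path.isfile(source):
        # A single file works, but a single-file reader would be the simpler choice.
        logger.warning("single file given as source; consider a single-file reader instead")
        return [source]
    if os.path.isdir(source):
        # os.path.join inserts a separator only when one is missing, so
        # "corpus" and "corpus/" produce identical paths.
        names = sorted(os.listdir(source))
        paths = [os.path.join(source, name) for name in names]
        return [path for path in paths if os.path.isfile(path)]
    raise ValueError("source must be an existing file or directory: %r" % (source,))
```

Building paths with `os.path.join` rather than string concatenation is what removes the sensitivity to the trailing separator.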

menshikh-iv · 2017-07-18T09:15:40Z

Resolved in #1423

menshikh-iv added the wishlist Feature request label May 25, 2017

menshikh-iv closed this as completed Jul 18, 2017
