Better support for training word2vec models on a corpus consisting of multiple files #1364
Comments
Hi @michaelwsherman, I think this will be useful. Please submit your PR.
Ok, give me a week or two. I've never submitted anything to open source before, and I want to make sure I don't create extra work for others by messing something up. But it will come. Right now it's a separate method that just assumes all files in a directory are text files. Should I add some improvements (gzip support? merging it into the LineSentence method and adding input detection?) or just start with what I have?
It's better to start with what you already have; we can discuss other features after your PR.
When the work is in a single git branch, you can create the PR, get feedback, and continue to update the branch/PR with improvements – so always OK to start with something rough. (You won't mess anything up unless/until it's reviewed and accepted, and you can mark it as 'in progress'/'for review' to indicate it's not yet ready.) You should use the But it's ok to start with whatever you've got, and see what suggestions for extension/refactoring/etc come up once that's concretely viewable via GitHub!
added models.word2vec.LineSentencePath method to read an entire directory's files in the same style as models.word2vec.LineSentence
initial attempt at test, including files. The test just splits the lee_background.cor file into two parts and puts them in a directory, then makes sure they match the unsplit file as loaded by word2vec.LineSentence
Pull request at #1423. Would love to hear any feedback; I'm a first-time OSS contributor.
…1364) (#1423)
* issue #1364 first commit, corpus from a directory
  added method models.word2vec.LineSentencePath method to read an entire directory's files in the same style as models.word2vec.LineSentence
* test for word2vec.LineSentencePath issue #1364
  initial attempt at test, including files. test just splits the lee_background.cor file into two parts and puts them in a directory, then makes sure they match the unsplit file as loaded by word2vec.LineSentence
* better handling of input for LineSentencePath
  no longer sensitive to an input without a trailing os-specific slash
* LineSentencePath renamed PathLineSentences in word2vec.py. Test updated as well
* LineSentencePath rename to PathLineSentences in models.word2vec. Tests also updated
* fix whitespace style error
  had only 1 space before an inline comment, flagged by travis CI build
* updated PathLineSentences test and test data
  Removed LineSentencePath directory, created PathLineSentences
  lee corpus duplicates were in LineSentencePath, was wasting space
  made new small corpus to test PathLineSentences, put in directory
  changed test to read both files manually, combine, and compare to PathLineSentences (rather than having a separate single file to match the entire contents of the PathLineSentences test_data directory)
* word2vec.PathLineSentences single file support
  changed PathLineSentences to support a single file in addition to a directory, raises a warning to use LineSentence when a single file is given as a parameter. added corresponding test.
* fixing style issues
* fix style issue
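For readers who only want the gist of the behaviour those commits describe (directory or single file, a warning for the single-file case, no sensitivity to a trailing slash), here is a rough stdlib-only sketch. It is not the code merged in #1423: the class name `DirLineSentences`, the UTF-8 assumption, the sorted file order, and the skipping of blank lines are all assumptions made for illustration.

```python
import os
import warnings


class DirLineSentences:
    """Hypothetical sketch of a directory-aware line reader (not gensim's code)."""

    def __init__(self, source):
        # normpath drops a trailing separator, so "corpus" and "corpus/" behave alike.
        source = os.path.normpath(source)
        if os.path.isfile(source):
            # Assumed wording; the thread only says a warning is raised.
            warnings.warn("single file given; LineSentence is the simpler choice")
            self.files = [source]
        elif os.path.isdir(source):
            # Sorted for a deterministic order; subdirectories are skipped.
            self.files = sorted(
                os.path.join(source, name)
                for name in os.listdir(source)
                if os.path.isfile(os.path.join(source, name))
            )
        else:
            raise ValueError("source must be a file or directory path: %r" % source)

    def __iter__(self):
        # Files are reopened on every pass, so the corpus can be read repeatedly.
        for path in self.files:
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:  # skip blank lines (an assumption of this sketch)
                        yield tokens
```

Because `__iter__` reopens the files each time, the object can be iterated more than once, which matters for training that makes several passes over the corpus.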
Resolved in #1423.
models.word2vec.LineSentence offers a simple generator that takes either a file object or a path to a single file and reads each line as a sentence whose tokens are separated by whitespace. This assumes that the entire training corpus is in a single file, which may not be the case. Adding further files to a word2vec model after it has been trained is very complex, requiring multiple calls and manual management of learning rates.
Rather than forcing the user to write their own generator, concatenate their files, or manually manage learning rates across multiple files, it would be desirable for word2vec to accept a directory reference or a list of files directly, through LineSentence or a similar generator.
I have code for a generator that reads all files in a directory and yields sentences suitable for training a word2vec model (or any other model compatible with LineSentence). If desired, I can create a pull request with this generator. I can also (attempt to) change LineSentence to detect whether it has been passed a file-like object, a file path, or a directory path (and possibly a list of files), generate lists of tokens accordingly, and then submit the pull request.
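To make the input-detection idea concrete, something along these lines is what I have in mind. This is only a sketch under assumptions, not my actual code and not gensim's API: the function name `iter_sentences` is made up, files are assumed to be UTF-8 text, file-like objects are assumed to be opened in text mode, and directory entries are read in sorted order.

```python
import os


def iter_sentences(source):
    """Yield token lists from a file-like object, a file path,
    a directory path, or an iterable of any of these (hypothetical helper)."""
    if hasattr(source, "read"):
        # File-like object: read lines from its current position.
        for line in source:
            yield line.split()
    elif isinstance(source, (str, os.PathLike)):
        if os.path.isdir(source):
            # Directory: visit regular files in sorted order, skip subdirectories.
            for name in sorted(os.listdir(source)):
                path = os.path.join(source, name)
                if os.path.isfile(path):
                    yield from iter_sentences(path)
        else:
            # Single file path: open it and reuse the file-like branch.
            with open(source, encoding="utf-8") as handle:
                yield from iter_sentences(handle)
    else:
        # Anything else is treated as an iterable of sources, e.g. a list of paths.
        for item in source:
            yield from iter_sentences(item)
```

Note that a generator function like this is exhausted after one pass. Training that needs several passes over the corpus would want it wrapped in a class whose `__iter__` calls it afresh each time.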
Please let me know if this is desired. Thank you.