Add Seq2seq example #63
Merged
Changes from 46 commits (62 commits in total)
All commits are by keisukefukuda:

- `81413a1` minor fix
- `7cadc00` Merged https://github.com/pfnet/chainer/pull/2555
- `553e121` Cache pre-processed input data
- `6025ec0` Changed report trigger to 'epoch'
- `74fc0ca` merge master
- `11af9b8` recovered seq2seq example after merging master
- `9dc41f6` Merge branch 'master' into seq2seq
- `b935372` Merge branch 'master' into seq2seq
- `3f115ce` passes flake8 and autopep8 check
- `9cfac7e` Merge branch 'master' into seq2seq
- `5e90bcc` Merge branch 'master' into seq2seq
- `5a298d3` Merge branch 'master' into seq2seq
- `147f35c` fix to work with Chainer2
- `0759bc1` added BleuEvaluator
- `648d124` fix flake8
- `2e4c4d8` added updated seq2seq.py and europal.py from the latest chainer repos…
- `dc38939` reflected the latest commits to Chainer's main repository
- `7dc4e81` Merge branch 'master' into seq2seq
- `a37114a` edit code so make it easier to compare with the original seq2seq.py
- `6050513` Renamed variables to be similar to the original seq2seq example
- `0006530` Removed get_epoch_trigger
- `c8080b0` fix flake8
- `0d179f7` multi-node evaluator
- `d8b4942` fix flake8/pep8
- `72f513e` minor fix
- `badeaa4` Merge branch 'master' into seq2seq
- `3f6749e` WIP
- `bb155d9` Merge branch 'master' into seq2seq
- `5d72622` some minor fixes
- `278d43d` minor fix
- `2fc93d6` added BLEU evaluator & minor options
- `db773b0` Merge branch 'seq2seq' of github.com:pfnet/chainermn into seq2seq
- `df9c099` removed 'train=False' argument
- `b9ca386` minor fix
- `61fe839` Merge branch 'master' into seq2seq
- `876637c` Merge branch 'master' into seq2seq
- `2550e58` minor fixes
- `a8e64a6` Merge branch 'master' into seq2seq
- `55f6130` added README.md in examples/seq2seq directory
- `dac5bc7` Merge branch 'seq2seq' of github.com:pfnet/chainermn into seq2seq
- `6212243` added create_optimizer() function
- `7476c1d` Merge branch 'better-err-msg-in-large-scatter' into seq2seq
- `5f9c40e` added a support DataSizeError in scatter_dataset (based on PR#111)
- `f243bf9` Merge branch 'master' into seq2seq
- `944d507` Merge branch 'master' into seq2seq
- `c6e4076` minor fix
- `f2ee572` removed redundant import
- `3fa6d29` added comments, removed debug print and fix pep8
- `51fefe1` added README
- `ad0106b` removed unused code
- `695477d` minor fix
- `72cf8a1` Merge branch 'better-err-msg-in-large-scatter' into seq2seq
- `9d44a1e` fix pep8
- `a45f8bc` Merge branch 'better-err-msg-in-large-scatter' into seq2seq
- `dc56485` Removed unnecessary file
- `4b924d6` added _get_num_split() and _slices() to calculate best partitioning o…
- `d5774c1` Merge branch 'better-err-msg-in-large-scatter' into seq2seq
- `0566d77` fix pep8
- `fa74e3e` added a docstring
- `db86083` bugfix
- `1fd9fb4` Merge branch 'better-err-msg-in-large-scatter' into seq2seq
- `fa5bd26` Removed unnecessary file
@@ -0,0 +1,32 @@

# ChainerMN seq2seq example

A sample implementation of the seq2seq model.

## Data download and setup

First, go to http://www.statmt.org/wmt15/translation-task.html#download and download the necessary datasets.
Let's assume you are in a working directory called `$WMT_DIR`.

```
$ cd $WMT_DIR
$ wget http://www.statmt.org/wmt10/training-giga-fren.tar
$ wget http://www.statmt.org/wmt15/dev-v2.tgz
$ tar -xf training-giga-fren.tar
$ tar -xf dev-v2.tgz
$ ls
dev/  dev-v2.tgz  giga-fren.release2.fixed.en.gz  giga-fren.release2.fixed.fr.gz  training-giga-fren.tar
```

Next, you need to install the required packages.

```
$ pip install nltk progressbar2
```

## Run

```bash
$ cd $CHAINERMN
```

> **Review comment:** I think we need a little more lines to run the script 😉
>
> **Reply:** fixed!
@@ -0,0 +1,82 @@

```python
from __future__ import unicode_literals

import collections
import gzip
import io
import os
import re

import numpy
import progressbar


split_pattern = re.compile(r'([.,!?"\':;)(])')
digit_pattern = re.compile(r'\d')


def split_sentence(s):
    s = s.lower()
    s = s.replace('\u2019', "'")
    s = digit_pattern.sub('0', s)
    words = []
    for word in s.strip().split():
        words.extend(split_pattern.split(word))
    words = [w for w in words if w]
    return words


def open_file(path):
    if path.endswith('.gz'):
        # 'encoding' must be passed as a keyword argument; the third
        # positional argument of gzip.open() is compresslevel, so the
        # original gzip.open(path, 'rt', 'utf-8') silently ignored it.
        return gzip.open(path, 'rt', encoding='utf-8')
    else:
        # Find gzipped version of the file
        gz = path + '.gz'
        if os.path.exists(gz):
            return open_file(gz)
        else:
            return io.open(path, encoding='utf-8', errors='ignore')


def count_lines(path):
    with open_file(path) as f:
        return sum(1 for _ in f)


def read_file(path):
    n_lines = count_lines(path)
    bar = progressbar.ProgressBar()
    with open_file(path) as f:
        for line in bar(f, max_value=n_lines):
            words = split_sentence(line)
            yield words


def count_words(path):
    counts = collections.Counter()
    for words in read_file(path):
        for word in words:
            counts[word] += 1

    vocab = [word for (word, _) in counts.most_common(40000)]
    return vocab


def make_dataset(path, vocab):
    word_id = {word: index for index, word in enumerate(vocab)}
    dataset = []
    token_count = 0
    unknown_count = 0
    for words in read_file(path):
        array = make_array(word_id, words)
        dataset.append(array)
        token_count += array.size
        unknown_count += (array == 1).sum()
    print('# of tokens: %d' % token_count)
    print('# of unknown: %d (%.2f %%)'
          % (unknown_count, 100. * unknown_count / token_count))
    return dataset


def make_array(word_id, words):
    ids = [word_id.get(word, 1) for word in words]
    return numpy.array(ids, 'i')
```
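To illustrate what the tokenizer produces, here is a small sketch using `split_sentence` and `make_array` as defined in the file above (the regex patterns are copied verbatim; the vocabulary here is hypothetical, and the assumption that ids 0 and 1 are reserved for special tokens is mine, suggested only by the default unknown id of 1 in `word_id.get(word, 1)`):

```python
import re

import numpy

# Patterns copied verbatim from the preprocessing script above.
split_pattern = re.compile(r'([.,!?"\':;)(])')
digit_pattern = re.compile(r'\d')


def split_sentence(s):
    # Lowercase, normalize curly apostrophes, collapse every digit to '0',
    # then split punctuation off each whitespace-separated token.
    s = s.lower()
    s = s.replace('\u2019', "'")
    s = digit_pattern.sub('0', s)
    words = []
    for word in s.strip().split():
        words.extend(split_pattern.split(word))
    return [w for w in words if w]


def make_array(word_id, words):
    # Unknown words fall back to id 1, as in word_id.get(word, 1) above.
    return numpy.array([word_id.get(w, 1) for w in words], 'i')


tokens = split_sentence("Hello, World! It costs 42 dollars.")
print(tokens)
# → ['hello', ',', 'world', '!', 'it', 'costs', '00', 'dollars', '.']

# Hypothetical vocabulary mapping; ids 0 and 1 assumed reserved.
word_id = {'hello': 2, ',': 3, 'world': 4, '!': 5}
print(make_array(word_id, tokens).tolist())
# → [2, 3, 4, 5, 1, 1, 1, 1, 1]
```

Note how punctuation becomes its own token and every digit is normalized to `0`, which keeps the vocabulary small for the WMT corpus.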
@@ -0,0 +1,88 @@

```python
from __future__ import unicode_literals

import collections
import gzip
import io
import os
import re

import numpy
import progressbar


split_pattern = re.compile(r'([.,!?"\':;)(])')
digit_pattern = re.compile(r'\d')


def split_sentence(s):
    s = s.lower()
    s = s.replace('\u2019', "'")
    s = digit_pattern.sub('0', s)
    words = []
    for word in s.strip().split():
        words.extend(split_pattern.split(word))
    words = [w for w in words if w]
    return words


def open_file(path):
    if path.endswith('.gz'):
        return gzip.open(path, 'rt', encoding='utf-8')
    else:
        # Find gzipped version of the file
        gz = path + '.gz'
        if os.path.exists(gz):
            return open_file(gz)
        else:
            return io.open(path, encoding='utf-8', errors='ignore')


def count_lines(path):
    print(path)
    with open_file(path) as f:
        return sum(1 for _ in f)


def read_file(path):
    n_lines = count_lines(path)
    bar = progressbar.ProgressBar()
    with open_file(path) as f:
        for line in bar(f, max_value=n_lines):
            words = split_sentence(line)
            yield words


def count_words(path):
    counts = collections.Counter()
    for words in read_file(path):
        for word in words:
            counts[word] += 1

    vocab = [word for (word, _) in counts.most_common(40000)]
    return vocab


def make_dataset(path, vocab):
    word_id = {word: index for index, word in enumerate(vocab)}
    dataset = []
    token_count = 0
    unknown_count = 0
    for words in read_file(path):
        array = make_array(word_id, words)
        dataset.append(array)
        token_count += array.size
        unknown_count += (array == 1).sum()
    print('# of tokens: %d' % token_count)
    print('# of unknown: %d (%.2f %%)'
          % (unknown_count, 100. * unknown_count / token_count))
    return dataset


def make_array(word_id, words):
    ids = [word_id.get(word, 1) for word in words]
    return numpy.array(ids, 'i')


if __name__ == '__main__':
    vocab = count_words('wmt/giga-fren.release2.fixed.en')
    make_dataset('wmt/giga-fren.release2.fixed.en', vocab)
```
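The `__main__` block above first builds a vocabulary with `count_words()` and then converts the corpus to id arrays with `make_dataset()`. The same end-to-end logic can be sketched on a tiny in-memory corpus (a simplification of mine: file reading and the progress bar are dropped, and `most_common(4)` stands in for the script's `most_common(40000)`):

```python
import collections

import numpy

# Toy corpus standing in for the sentences yielded by read_file().
corpus = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['the', 'cat', 'ran'],
]

# Mirrors count_words(): keep the most frequent words.
counts = collections.Counter(w for words in corpus for w in words)
vocab = [word for (word, _) in counts.most_common(4)]
print(vocab)  # → ['the', 'cat', 'sat', 'dog']

# Mirrors make_dataset(): any word missing from the vocabulary maps to id 1.
word_id = {word: index for index, word in enumerate(vocab)}
dataset = [numpy.array([word_id.get(w, 1) for w in words], 'i')
           for words in corpus]

token_count = sum(a.size for a in dataset)
unknown_count = sum(int((a == 1).sum()) for a in dataset)
print('# of tokens: %d' % token_count)  # → # of tokens: 9
print('# of unknown: %d (%.2f %%)'
      % (unknown_count, 100. * unknown_count / token_count))
# Caveat: (array == 1) also matches the vocabulary word at index 1
# ('cat' here), so the reported unknown count is 3 even though only
# 'ran' is truly out of vocabulary. make_dataset() above shares this
# quirk whenever ids 0 and 1 are not reserved for special tokens.
```

The caveat in the last comment is worth noting when reading the script's "# of unknown" statistics.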
> **Review comment** (on the README wording): `An sample` → `A sample`