Process the shards using multiple processes in prepare_train_data #813

Merged 6 commits on May 25, 2020

hmashlah
Contributor

Process the shards using multiple processes in prepare_train_data.

I tested the change by running the following commands and checking the outputs:

python -m sockeye.prepare_data --source source --target target --output prepared_data --max-processes 10 --min-num-shards 20

python -m sockeye.prepare_data --source source --target target --output prepared_data
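For readers unfamiliar with the pattern, here is a minimal sketch of processing shards with a bounded pool of worker processes. It is not the code from this pull request: `process_shard` and `process_shards` are hypothetical names, and the per-shard work (counting sentence pairs) is an assumed stand-in for the real loading and saving step.

```python
import multiprocessing
from typing import List, Tuple


def process_shard(shard_idx: int, source_path: str, target_path: str) -> Tuple[int, int]:
    """Hypothetical per-shard worker: counts the sentence pairs in one shard."""
    with open(source_path) as src, open(target_path) as trg:
        num_pairs = sum(1 for _ in zip(src, trg))
    return shard_idx, num_pairs


def process_shards(shards: List[Tuple[str, str]], max_processes: int = 1) -> List[Tuple[int, int]]:
    """Process (source, target) shard pairs using up to max_processes workers."""
    args = [(idx, src, trg) for idx, (src, trg) in enumerate(shards)]
    if max_processes == 1:
        # Serial path: no worker processes are started.
        return [process_shard(*a) for a in args]
    with multiprocessing.Pool(processes=max_processes) as pool:
        # starmap preserves input order, so results line up with shard indices.
        return pool.starmap(process_shard, args)
```

On platforms that start workers with the spawn method, the call to `process_shards` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.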

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]'
    until you can check this box).
  • Unit tests pass (pytest)
  • Were system tests modified? If so did you run these at least 5 times to account for the variation across runs?
  • System tests pass (pytest test/system)
  • Passed code style checking (./style-check.sh)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Contributor

@fhieber fhieber left a comment

Thanks, nice change! Only a few comments.

sockeye/data_io.py (outdated, resolved)
sockeye/data_io.py (outdated, resolved)
sockeye/data_io.py (outdated, resolved)
sockeye/data_io.py (outdated, resolved)
sockeye/data_io.py (outdated, resolved)
test/unit/test_arguments.py (outdated, resolved)
Contributor

fhieber commented May 23, 2020

Could you also bump the minor version (sockeye/__init__.py) and update the Changelog? Thank you!

Contributor

@fhieber fhieber left a comment

Thanks for iterating! How about making the max_processes flag in data_io.py a proper int with default 1 instead of None?
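
To illustrate the difference the suggestion above is pointing at, here is a minimal sketch. The function names are hypothetical and this is not the actual signature in data_io.py; it only contrasts an optional flag with a plain int that defaults to 1.

```python
from typing import Optional


def num_workers_optional(max_processes: Optional[int] = None) -> int:
    """With an Optional flag, every caller has to handle the None case."""
    return 1 if max_processes is None else max_processes


def num_workers_int(max_processes: int = 1) -> int:
    """With a plain int defaulting to 1, the value can be used directly."""
    if max_processes < 1:
        raise ValueError("max_processes must be >= 1")
    return max_processes
```

With the int default, "one process" is an ordinary value rather than a special case, so downstream code needs no None check.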

CHANGELOG.md (outdated, resolved)
sockeye/data_io.py (resolved)
shard_sources: List[str], shard_target: str,
shard_stats: 'DataStatistics', output_prefix: str, keep_tmp_shard_files: bool):
"""
Load a shard source/target data files into an NDArrays and then save it to desk.
Contributor

Suggested change
- Load a shard source/target data files into an NDArrays and then save it to desk.
+ Load shard source and target data files into NDArrays and save to disk.

sockeye/data_io.py (outdated, resolved)
sockeye/data_io.py (outdated, resolved)
test/unit/test_vocab.py (outdated, resolved)
@hmashlah hmashlah force-pushed the sockeye_2_github branch from 5356b15 to c422dd8 Compare May 25, 2020 14:19
@fhieber fhieber merged commit b1b0973 into awslabs:sockeye_2 May 25, 2020
3 participants