Process alignments in chunks #739

Open
Tracked by #453
eu9ene opened this issue Jul 16, 2024 · 2 comments
Labels
cost & perf Speeding up and lowering cost for the pipeline

Comments

@eu9ene
Collaborator

eu9ene commented Jul 16, 2024

It would be cheaper to split the alignment step into chunks and run them on smaller preemptible machines rather than on one big standard instance. The downside is that this increases the complexity of the graph and makes it harder to maintain.

@eu9ene eu9ene added the cost & perf Speeding up and lowering cost for the pipeline label Jul 16, 2024
@eu9ene
Collaborator Author

eu9ene commented Jul 19, 2024

Another approach would be to process the corpus in chunks on one machine and, if the machine is preempted, continue from the last unprocessed chunk. This can work, but it takes longer than parallelization. One more thing to take into account: we need to calculate priors on a large part of the original parallel corpus before we can start processing anything in smaller chunks. I think a 100M sample might be sufficient.
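
To make the resume-after-preemption idea concrete, here is a minimal sketch in Python. It is not the pipeline's actual code: `process_in_chunks`, `iter_chunks`, the `align_chunk` callback and the `chunk_00000.aln` file naming are all hypothetical names chosen for illustration, and computing priors on a sample beforehand is left out. The only assumption is that each chunk's output is written to its own file, so a restarted run can skip chunks that already finished.

```python
import os
from itertools import islice
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most chunk_size lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(
    lines: Iterable[str],
    chunk_size: int,
    out_dir: str,
    align_chunk: Callable[[List[str]], List[str]],  # hypothetical aligner callback
) -> List[int]:
    """Process chunks in order, skipping those whose output already exists.

    Returns the indices of the chunks processed during this call.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    processed = []
    for index, chunk in enumerate(iter_chunks(lines, chunk_size)):
        final_path = out_path / f"chunk_{index:05d}.aln"
        if final_path.exists():
            # Finished before a previous preemption; nothing to redo.
            continue
        aligned = align_chunk(chunk)
        # Write to a temporary file and rename it, so that a preemption
        # in the middle of a write never leaves a partial file that looks complete.
        tmp_path = final_path.with_suffix(".tmp")
        with open(tmp_path, "w", encoding="utf-8") as f:
            for line in aligned:
                f.write(line + "\n")
        os.replace(tmp_path, final_path)
        processed.append(index)
    return processed
```

The sketch assumes the input is read in the same order with the same `chunk_size` on every run; otherwise chunk indices would not line up with the files already on disk.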

@eu9ene
Collaborator Author

eu9ene commented Jul 31, 2024

Chunking on one machine has already been implemented in #763.
