Process alignments in chunks #739

eu9ene · 2024-07-16T21:22:12Z

It will be cheaper to split the alignment step into chunks and run them on smaller preemptible machines rather than on one big standard instance. The downside is that this increases the complexity of the task graph and makes it harder to maintain.

eu9ene · 2024-07-19T16:37:25Z

Another approach would be to process the corpus in chunks on one machine and, if the machine is preempted, continue from the last unprocessed chunk. This can work, but it takes longer than parallelization. One more thing to take into account: we need to calculate priors on a large part of the original parallel corpus before we can start processing anything in smaller chunks. I think a 100M sample might be sufficient.
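A minimal sketch of the resume-after-preemption idea, not the actual pipeline code. The names `iter_chunks`, `process_resumable`, and `align_chunk` are hypothetical, and `align_chunk` stands in for whatever aligner call is used (assumed to already have the priors computed on the large sample). Each chunk writes its own output file, so a restarted task skips chunks that are already finished:

```python
from itertools import islice
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most chunk_size lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def process_resumable(
    lines: Iterable[str],
    out_dir: Path,
    chunk_size: int,
    align_chunk: Callable[[List[str]], List[str]],  # hypothetical aligner hook
) -> int:
    """Process chunks in order, skipping any whose output already exists.

    Returns the number of chunks processed in this run.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = 0
    for index, chunk in enumerate(iter_chunks(lines, chunk_size)):
        out_path = out_dir / f"chunk.{index:05d}.out"
        if out_path.exists():
            continue  # finished in an earlier run, before the preemption
        # Write to a temporary file first so that a preemption in the middle
        # of a write never leaves a partial file under the final name.
        tmp_path = out_dir / f"chunk.{index:05d}.tmp"
        result = align_chunk(chunk)
        tmp_path.write_text("".join(f"{line}\n" for line in result), encoding="utf-8")
        tmp_path.replace(out_path)
        processed += 1
    return processed
```

The input is still read from the start on every restart; only the expensive per-chunk work is skipped. The chunk size must stay the same between runs, otherwise the chunk indices no longer refer to the same lines.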

eu9ene · 2024-07-31T21:49:48Z

Chunking on one machine has already been implemented in #763.

eu9ene added the "cost & perf" label (Speeding up and lowering cost for the pipeline) Jul 16, 2024

eu9ene mentioned this issue Jul 16, 2024

[meta] Cost efficiency #453

Open
