Improve implementation of alignments #507
Labels: cost & perf (Speeding up and lowering cost for the pipeline), enhancement (New feature or request), quality (Improving robustness and translation quality)
Issues with the current implementation:
We use naive whitespace tokenization because it's what OpusTrainer requires. This likely produces lower-quality alignments because punctuation is not separated from words, and it also makes the eflomal vocabulary very large. Ideally, we should find a way to switch to Moses tokenization, which separates punctuation.
Processing Moses-tokenized text is likely also more efficient and faster due to the smaller vocabulary.
Because we don't call str.split() explicitly, double spaces may cause discrepancies between the tokenization used for alignments and what OpusTrainer does.
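A minimal sketch of the difference, using a simple regex as a stand-in for Moses-style punctuation splitting (the real Moses tokenizer has many more rules for abbreviations, apostrophes, etc.):

```python
import re

def naive_tokenize(line: str) -> list[str]:
    # Whitespace-only tokenization, roughly what is currently used.
    return line.split()

def moses_like_tokenize(line: str) -> list[str]:
    # Rough stand-in for Moses tokenization: split punctuation off words,
    # so "world!" becomes two tokens instead of one vocabulary entry.
    return re.findall(r"\w+|[^\w\s]", line)

line = "Hello, world! It works."
print(naive_tokenize(line))       # ['Hello,', 'world!', 'It', 'works.']
print(moses_like_tokenize(line))  # ['Hello', ',', 'world', '!', 'It', 'works', '.']
```

With naive tokenization every word+punctuation combination ("world!", "world,", "world") is a distinct vocabulary entry, which is one reason the eflomal vocabulary grows so large.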
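To illustrate the double-space issue: str.split() with no argument collapses runs of whitespace, while splitting on a single space emits an empty token, shifting every subsequent token index (a hypothetical example, not code from the pipeline):

```python
line = "a  b c"  # note the double space

# No argument: runs of whitespace are collapsed.
print(line.split())     # ['a', 'b', 'c']

# Explicit single-space separator: an empty token appears,
# so 'b' is at index 2 instead of 1, and any alignment pair
# like "1-1" now points at a different token.
print(line.split(" "))  # ['a', '', 'b', 'c']
```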
We can see some warnings while training:
[Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
These are likely related to the whitespace differences, but this requires further investigation. There are not too many of them.
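A hypothetical validation pass could count such lines before training. This sketch assumes the common eflomal/fast_align "i-j" pair format, where i indexes source tokens and j target tokens (the function name is illustrative, not from the pipeline):

```python
def is_in_bounds(src: str, trg: str, alignment: str) -> bool:
    # Returns False if any "i-j" pair points past the end of the
    # whitespace-tokenized source or target sentence.
    n_src, n_trg = len(src.split()), len(trg.split())
    for pair in alignment.split():
        i, j = (int(x) for x in pair.split("-"))
        if i >= n_src or j >= n_trg:
            return False
    return True

print(is_in_bounds("a b c", "x y", "0-0 2-1"))  # True
# A double space collapses to 2 tokens, so index 2 is out of bounds:
print(is_in_bounds("a  b", "x y", "2-1"))       # False
```

Counting how many pairs fail this check, and on which lines, would confirm whether the warnings really come from whitespace discrepancies.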