Improve implementation of alignments #507
Labels: cost & perf (Speeding up and lowering cost for the pipeline), enhancement (New feature or request), quality (Improving robustness and translation quality)
Issues with the current implementation:
We use naive whitespace tokenization because it's what OpusTrainer requires. This likely produces lower-quality alignments because punctuation is not separated from words, and it also makes the eflomal vocabulary very large. Ideally, we should find a way to switch to Moses tokenization, which separates punctuation.
Processing Moses-tokenized text is likely also more efficient and faster due to the smaller vocabulary.
Because we don't call str.split() explicitly, double spaces may cause discrepancies between the tokenization used for alignments and what OpusTrainer does.
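A minimal sketch of the difference, using a simple regex as a stand-in for Moses-style punctuation splitting (the real Moses tokenizer has many more rules for abbreviations, apostrophes, etc.):

```python
import re

def naive_tokenize(line: str) -> list[str]:
    # Whitespace-only tokenization, roughly what is currently used.
    return line.split()

def moses_like_tokenize(line: str) -> list[str]:
    # Rough stand-in for Moses tokenization: split punctuation off words,
    # so "world!" becomes two tokens instead of one vocabulary entry.
    return re.findall(r"\w+|[^\w\s]", line)

line = "Hello, world! It works."
print(naive_tokenize(line))       # ['Hello,', 'world!', 'It', 'works.']
print(moses_like_tokenize(line))  # ['Hello', ',', 'world', '!', 'It', 'works', '.']
```

With naive tokenization every word+punctuation combination ("world!", "world,", "world") is a distinct vocabulary entry, which is one reason the eflomal vocabulary grows so large.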
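To illustrate the double-space issue: str.split() with no argument collapses runs of whitespace, while splitting on a single space emits an empty token, shifting every subsequent token index (a hypothetical example, not code from the pipeline):

```python
line = "a  b c"  # note the double space

# No argument: runs of whitespace are collapsed.
print(line.split())     # ['a', 'b', 'c']

# Explicit single-space separator: an empty token appears,
# so 'b' is at index 2 instead of 1, and any alignment pair
# like "1-1" now points at a different token.
print(line.split(" "))  # ['a', '', 'b', 'c']
```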
We can see some warnings while training:
[Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
These are likely related to the whitespace differences, but this requires further investigation. There are not too many of them.
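A hypothetical validation pass could count such lines before training. This sketch assumes the common eflomal/fast_align "i-j" pair format, where i indexes source tokens and j target tokens (the function name is illustrative, not from the pipeline):

```python
def is_in_bounds(src: str, trg: str, alignment: str) -> bool:
    # Returns False if any "i-j" pair points past the end of the
    # whitespace-tokenized source or target sentence.
    n_src, n_trg = len(src.split()), len(trg.split())
    for pair in alignment.split():
        i, j = (int(x) for x in pair.split("-"))
        if i >= n_src or j >= n_trg:
            return False
    return True

print(is_in_bounds("a b c", "x y", "0-0 2-1"))  # True
# A double space collapses to 2 tokens, so index 2 is out of bounds:
print(is_in_bounds("a  b", "x y", "2-1"))       # False
```

Counting how many pairs fail this check, and on which lines, would confirm whether the warnings really come from whitespace discrepancies.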