Improve implementation of alignments #507

Closed
Tracked by #216
eu9ene opened this issue Apr 1, 2024 · 0 comments
Labels: cost & perf (Speeding up and lowering cost for the pipeline), enhancement (New feature or request), quality (Improving robustness and translation quality)


eu9ene commented Apr 1, 2024

Issues with the current implementation:

  • We use naive tokenization because that is what OpusTrainer requires. This may produce lower-quality alignments because punctuation is not taken into account, and it also makes the eflomal vocabulary very large. Ideally, we should switch to Moses tokenization, which also separates punctuation.

  • Processing Moses-tokenized text is also likely to be faster and more efficient because of the smaller vocabulary.

  • Because we don't call str.split() explicitly, stray double spaces may cause the tokenization used for alignments to differ from what OpusTrainer does.

  • During training we see some warnings of the form `[Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')`. They are likely related to the whitespace differences, but this requires further investigation. There are not many of them.
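
To make the whitespace concern concrete, here is a minimal Python sketch. It assumes alignments are stored as Pharaoh-style `i-j` pairs and that one side of the pipeline tokenizes with `split(" ")` while the other uses `split()`. Which component does which is not confirmed in this issue, and the helper names (`parse_alignment`, `out_of_bound_pairs`, `normalize_spaces`) are hypothetical, not part of OpusTrainer or eflomal.

```python
def parse_alignment(line: str) -> list[tuple[int, int]]:
    """Parse Pharaoh-style pairs such as '0-0 2-1' into (src, trg) index tuples."""
    pairs = []
    for pair in line.split():
        src_idx, trg_idx = pair.split("-")
        pairs.append((int(src_idx), int(trg_idx)))
    return pairs


def out_of_bound_pairs(src_tokens, trg_tokens, pairs):
    """Return the pairs whose indices do not fit the given token lists."""
    return [
        (s, t) for s, t in pairs
        if s >= len(src_tokens) or t >= len(trg_tokens)
    ]


def normalize_spaces(text: str) -> str:
    """Collapse runs of whitespace so split(' ') and split() agree."""
    return " ".join(text.split())


src = "Hello  world !"  # note the double space
trg = "Hallo Welt !"

# Assumption: the alignment indices were computed against split(" "),
# which keeps an empty token for the double space.
alignment = parse_alignment("0-0 2-1 3-2")

by_single_space = src.split(" ")  # ['Hello', '', 'world', '!']
by_whitespace = src.split()       # ['Hello', 'world', '!']

# Consistent tokenization: every index is valid.
print(out_of_bound_pairs(by_single_space, trg.split(), alignment))
# Mismatched tokenization: source index 3 no longer exists.
print(out_of_bound_pairs(by_whitespace, trg.split(), alignment))
```

Normalizing whitespace before both alignment and training (for example with `normalize_spaces`) is one way to rule this mismatch out. Whether it accounts for the warnings seen in practice would still need to be checked against real data.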

eu9ene added the enhancement and quality labels on Apr 1, 2024
eu9ene self-assigned this on Jun 11, 2024
eu9ene added the cost & perf label on Jun 13, 2024
eu9ene closed this as completed on Jul 9, 2024