OpusTrainer can produce incorrect alignments, breaking student training with guided alignments #469
Comments
I think I'm hitting a similar problem, although my setup is quite different from yours. I used eflomal to align space-tokenized (in fact, not tokenized at all) text, and OpusTrainer is then supposed to retokenize it using a SentencePiece model. I wonder whether this is something new or the same bug you encountered with our old alignment scheme and OpusTrainer with no modifiers.
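To make the retokenization step concrete, here is a minimal sketch of remapping word-level alignments onto subword indices. It assumes each word-level pair is expanded to every pair of subwords the two words cover, which is one common choice and not necessarily what OpusTrainer does. The function names and the example pieces are hypothetical, and the pieces are written by hand rather than produced by a real SentencePiece model.

```python
from itertools import accumulate


def word_to_piece_spans(pieces_per_word):
    """For each word, return the range of subword indices it covers."""
    ends = list(accumulate(len(pieces) for pieces in pieces_per_word))
    starts = [0] + ends[:-1]
    return [range(start, end) for start, end in zip(starts, ends)]


def remap_alignment(pairs, src_pieces_per_word, trg_pieces_per_word):
    """Expand word-level (src, trg) pairs to subword-level pairs.

    Assumption: a word pair maps to the cross product of the subwords
    of both words. Other strategies (first piece only, diagonal) exist.
    """
    src_spans = word_to_piece_spans(src_pieces_per_word)
    trg_spans = word_to_piece_spans(trg_pieces_per_word)
    remapped = set()
    for src_word, trg_word in pairs:
        for src_piece in src_spans[src_word]:
            for trg_piece in trg_spans[trg_word]:
                remapped.add((src_piece, trg_piece))
    return sorted(remapped)


# Hand-written example pieces, standing in for SentencePiece output.
src_pieces = [["▁Hel", "lo"], ["▁world"]]
trg_pieces = [["▁Labas"], ["▁pa", "saul", "i"]]
subword_pairs = remap_alignment([(0, 0), (1, 1)], src_pieces, trg_pieces)
print(" ".join(f"{s}-{t}" for s, t in subword_pairs))
```

Whatever expansion strategy is used, every remapped index has to stay below the subword count of its sentence, so an off-by-one in the span computation is enough to produce alignments that point past the end of a sentence.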
|
Ok, it failed with only the Tags modifier enabled in my branch. We can't remove it in the inline noise branch because it's supposed to remap the alignments. https://firefox-ci-tc.services.mozilla.com/tasks/dTCAmrqKQgCg3e6hSEiKVA/runs/0/logs/public/logs/live.log
datasets:
  original: /home/ubuntu/tasks/task_171105412365522/fetches/corpus.lten.tsv # Original parallel corpus
stages:
  - train
train:
  - original 1.0
  - until original inf # General training until marian early stops
modifiers:
#- UpperCase: 0.07 # Apply randomly to 7% of sentences
#- TitleCase: 0.05
#- Typos: 0.05
## inserts new noise sentences
#- Noise: 0.0005
#  min_word_length: 2 # Minimum word length for each word in the noisy sentence
#  max_word_length: 5 # Maximum word length for each word in the noisy sentence
#  max_words: 6 # Maximum number of words in each noisy sentence
# generates inline noise (emojis etc.) matching position in source and target using alignments
# spm_vocab argument: retokenize alignments from spaces to SentencePiece subwords and feed to Marian
# Tags modifier has to be the last one to retokenize the alignments
- Tags: 0.005
  augment: 1
  spm_vocab: /home/ubuntu/tasks/task_171105412365522/fetches/vocab.spm
seed: 1111
# parallel sentences + token alignments
num_fields: 3
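One sanity check that may help narrow this down is to scan the three-field lines (source, target, alignments) for alignment pairs that point outside either sentence. The sketch below assumes tab-separated fields, whitespace-separated tokens, and alignments written as `i-j` pairs; the helper names are hypothetical and not part of OpusTrainer. It checks the space-tokenized input only; checking the retokenized output would need the SentencePiece token counts instead.

```python
def parse_alignment(field):
    """Parse an alignment field such as '0-0 1-2' into integer pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in field.split()]


def out_of_range_pairs(line):
    """Return the alignment pairs that index past the end of a sentence.

    Assumes a line of the form: source <TAB> target <TAB> alignments,
    with tokens separated by whitespace.
    """
    src, trg, align = line.rstrip("\n").split("\t")
    n_src, n_trg = len(src.split()), len(trg.split())
    return [
        (s, t)
        for s, t in parse_alignment(align)
        if not (0 <= s < n_src and 0 <= t < n_trg)
    ]


good_line = "Hello world\tLabas pasauli\t0-0 1-1"
bad_line = "Hello world\tLabas\t0-0 1-1"  # target has only one token
print(out_of_range_pairs(good_line))
print(out_of_range_pairs(bad_line))
```

Running the same check on the modifier's output, with subword counts in place of whitespace token counts, would show whether the remapping step itself introduces the invalid pairs.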
|
I've found at least one bug in the implementation:
hplt-project/OpusTrainer#53