OpusTrainer can produce incorrect alignments, breaking student training with guided alignments #469

Closed
gregtatum opened this issue Mar 1, 2024 · 2 comments · Fixed by #491
Labels: bug (Something is broken or not correct)

Comments

@gregtatum
Member

I've found at least one bug in the implementation:

hplt-project/OpusTrainer#53

@eu9ene
Collaborator

eu9ene commented Mar 18, 2024

I think I'm hitting a similar problem, although my setup is quite different from yours. I used eflomal to align whitespace-split (that is, otherwise untokenized) text, and OpusTrainer is then supposed to retokenize it using a SentencePiece model. I wonder whether this is something new or the same bug you encountered with our old alignment scheme and OpusTrainer with no modifiers.

https://firefox-ci-tc.services.mozilla.com/tasks/e4bBbBRbSZmTDtKsPJZMkQ/runs/0/logs/public/logs/live.log

[task 2024-03-16T14:21:13.879Z] [2024-03-16 14:21:13] [memory] Reserving 32 MB, device gpu0
[task 2024-03-16T14:21:13.883Z] [2024-03-16 14:21:13] Ep. 1 : Up. 1 : Sen. 760 : Cost 0.87012815 : Time 431.98s : 158.72 words/s : gNorm 3.8305 : L.r. 1.8750e-08
[task 2024-03-16T14:21:14.166Z] [2024-03-16 14:21:14] Ep. 1 : Up. 2 : Sen. 10,520 : Cost 4.55021906 : Time 0.28s : 448754.57 words/s : gNorm 4.8969 : L.r. 3.7500e-08
[task 2024-03-16T14:21:14.475Z] [2024-03-16 14:21:14] Ep. 1 : Up. 3 : Sen. 16,928 : Cost 3.01045871 : Time 0.31s : 435164.87 words/s : gNorm 4.6959 : L.r. 5.6250e-08
[task 2024-03-16T14:21:14.747Z] [2024-03-16 14:21:14] Ep. 1 : Up. 4 : Sen. 18,480 : Cost 1.29863644 : Time 0.27s : 331105.01 words/s : gNorm 4.3090 : L.r. 7.5000e-08
[task 2024-03-16T14:21:15.065Z] [2024-03-16 14:21:15] Ep. 1 : Up. 5 : Sen. 21,950 : Cost 2.17084908 : Time 0.32s : 469510.57 words/s : gNorm 4.5838 : L.r. 9.3750e-08
[task 2024-03-16T14:21:15.376Z] [2024-03-16 14:21:15] Ep. 1 : Up. 6 : Sen. 23,502 : Cost 1.19824779 : Time 0.31s : 354549.51 words/s : gNorm 4.3286 : L.r. 1.1250e-07
[task 2024-03-16T14:21:15.616Z] [2024-03-16 14:21:15] Ep. 1 : Up. 7 : Sen. 25,734 : Cost 1.84143591 : Time 0.24s : 306548.30 words/s : gNorm 4.2291 : L.r. 1.3125e-07
[task 2024-03-16T14:21:15.869Z] [2024-03-16 14:21:15] Ep. 1 : Up. 8 : Sen. 26,542 : Cost 0.93719184 : Time 0.25s : 271700.20 words/s : gNorm 4.0934 : L.r. 1.5000e-07
[task 2024-03-16T14:21:16.090Z] [2024-03-16 14:21:16] Ep. 1 : Up. 9 : Sen. 28,278 : Cost 1.63907695 : Time 0.22s : 298939.48 words/s : gNorm 4.0213 : L.r. 1.6875e-07
[task 2024-03-16T14:21:16.369Z] [2024-03-16 14:21:16] Ep. 1 : Up. 10 : Sen. 29,142 : Cost 0.90442389 : Time 0.28s : 288550.01 words/s : gNorm 4.0063 : L.r. 1.8750e-07
[task 2024-03-16T14:32:03.387Z] [2024-03-16 14:32:03] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:32:03.393Z] [2024-03-16 14:32:03] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:32:58.725Z] [2024-03-16 14:32:58] Ep. 1 : Up. 1000 : Sen. 3,819,068 : Cost 1.48244727 : Time 702.36s : 144022.15 words/s : gNorm 1.3209 : L.r. 1.8750e-05
[task 2024-03-16T14:43:24.199Z] [2024-03-16 14:43:24] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:43:24.205Z] [2024-03-16 14:43:24] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:47:07.294Z] [2024-03-16 14:47:07] Ep. 1 : Up. 2000 : Sen. 7,592,507 : Cost 1.09335911 : Time 848.57s : 121878.65 words/s : gNorm 1.4593 : L.r. 3.7500e-05
[task 2024-03-16T14:47:47.260Z] [2024-03-16 14:47:47] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:47:47.266Z] [2024-03-16 14:47:47] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:49:55.660Z] [2024-03-16 14:49:55] Error: Segmentation fault
[task 2024-03-16T14:49:55.660Z] [2024-03-16 14:49:55] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /builds/worker/fetches/marian-source/src/common/logging.cpp:130
[task 2024-03-16T14:49:55.673Z] 
[task 2024-03-16T14:49:55.673Z] [CALL STACK]
[task 2024-03-16T14:49:55.673Z] [0x5616e0584fc5]                                                       + 0x519fc5
[task 2024-03-16T14:49:55.673Z] [0x5616e058520f]                                                       + 0x51a20f
[task 2024-03-16T14:49:55.673Z] [0x7f8bca042520]                                                       + 0x42520
[task 2024-03-16T14:49:55.673Z] [0x5616e065e0f8]    marian::data::CorpusBase::  addAlignmentsToBatch  (std::shared_ptr<marian::data::CorpusBatch>,  std::vector<marian::data::SentenceTuple,std::allocator<marian::data::SentenceTuple>> const&) + 0x438
[task 2024-03-16T14:49:55.673Z] [0x5616e0673562]    marian::data::Corpus::  toBatch  (std::vector<marian::data::SentenceTuple,std::allocator<marian::data::SentenceTuple>> const&) + 0x1252
[task 2024-03-16T14:49:55.674Z] [0x5616e055ada4]    marian::data::BatchGenerator<marian::data::CorpusBase>::  fetchBatches  () + 0x1204
[task 2024-03-16T14:49:55.674Z] [0x5616e055bab3]    marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1}::  operator()  () const + 0x33
[task 2024-03-16T14:49:55.674Z] [0x5616e055ca21]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::_M_run()::{lambda()#1},std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>>::  _M_invoke  (std::_Any_data const&) + 0x51
[task 2024-03-16T14:49:55.674Z] [0x5616e04861fd]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x2d
[task 2024-03-16T14:49:55.674Z] [0x7f8bca099ee8]                                                       + 0x99ee8
[task 2024-03-16T14:49:55.674Z] [0x5616e048ad70]    std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::  _M_run  () + 0xf0
[task 2024-03-16T14:49:55.674Z] [0x5616e048bcd5]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x1a5
[task 2024-03-16T14:49:55.674Z] [0x7f8bca4dc253]                                                       + 0xdc253
[task 2024-03-16T14:49:55.674Z] [0x7f8bca094ac3]                                                       + 0x94ac3
[task 2024-03-16T14:49:55.674Z] [0x7f8bca126850]                                                       + 0x126850
[task 2024-03-16T14:49:55.674Z] 
[task 2024-03-16T14:49:56.803Z] [2024-03-16 14:49:56] [Trainer] [INFO] trainer stopped reading input
[fetches 2024-03-16T14:50:00.102Z] removing /home/ubuntu/tasks/task_171059821904953/fetches
[fetches 2024-03-16T14:50:02.054Z] finished
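
The "Out-of-bound alignment pairs" warnings in the log suggest that some alignment indices point past the end of a sentence. As a rough way to look for such lines in the corpus before training, here is a minimal sketch. It assumes each line has three tab-separated fields (source, target, alignments), that alignments are written as space-separated `i-j` pairs, and that tokens are split on whitespace. The function names are hypothetical and are not part of OpusTrainer or Marian.

```python
from typing import List, Tuple


def parse_alignments(field: str) -> List[Tuple[int, int]]:
    """Parse '0-0 1-2' style pairs into (source_index, target_index) tuples."""
    pairs = []
    for item in field.split():
        src, trg = item.split("-")
        pairs.append((int(src), int(trg)))
    return pairs


def out_of_bound_pairs(line: str) -> List[Tuple[int, int]]:
    """Return the alignment pairs that fall outside either sentence.

    Assumes whitespace tokenization, which may not match what the
    trainer actually uses after retokenization.
    """
    src, trg, aln = line.rstrip("\n").split("\t")
    n_src = len(src.split())
    n_trg = len(trg.split())
    return [
        (s, t)
        for s, t in parse_alignments(aln)
        if not (0 <= s < n_src and 0 <= t < n_trg)
    ]


if __name__ == "__main__":
    # The second target sentence has one token, so the pair 1-1 is out of bounds.
    for example in ["a b\tc d\t0-0 1-1", "a b\tc\t0-0 1-1"]:
        print(repr(example), out_of_bound_pairs(example))
```

A check like this only covers the input alignments. If the input passes and the warnings still appear, the indices are more likely being broken during retokenization or by a modifier.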

@eu9ene eu9ene added the bug Something is broken or not correct label Mar 18, 2024
@eu9ene
Collaborator

eu9ene commented Mar 25, 2024

OK, training also failed with only the Tags modifier enabled in my branch. We can't remove it in the inline noise branch because it's supposed to remap the alignments. https://firefox-ci-tc.services.mozilla.com/tasks/dTCAmrqKQgCg3e6hSEiKVA/runs/0/logs/public/logs/live.log

datasets:
  original: /home/ubuntu/tasks/task_171105412365522/fetches/corpus.lten.tsv # Original parallel corpus

stages:
  - train

train:
  - original 1.0
  - until original inf # General training until marian early stops

modifiers:
#- UpperCase: 0.07 # Apply randomly to 7% of sentences
#- TitleCase: 0.05
#- Typos: 0.05
## inserts new noise sentences
#- Noise: 0.0005
#  min_word_length: 2 # Minimum word length for each word in the noisy sentence
#  max_word_length: 5 # Maximum word length for each word in the noisy sentence
#  max_words: 6 # Maximum number of words in each noisy sentence
# generates inline noise (emojis etc.) matching position in source and target using alignments
# spm_vocab argument: retokenize alignments from spaces to Sentencepiece subwords and feed to Marian
# Tags modifier has to be the last one to retokenize the alignments
- Tags: 0.005
  augment: 1
  spm_vocab: /home/ubuntu/tasks/task_171105412365522/fetches/vocab.spm

seed: 1111
# parallel sentences + token alignments
num_fields: 3
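
To make the remapping step mentioned in the config comments more concrete, here is a minimal sketch of one way word-level alignments could be expanded to subword-level ones. It is not OpusTrainer's actual implementation. It assumes each whitespace-separated word is tokenized independently and that every subword of an aligned source word is aligned to every subword of the aligned target word. `tokenize` is a hypothetical callable standing in for a SentencePiece model, and the toy tokenizer below simply splits words into two-character chunks.

```python
from typing import Callable, List, Sequence, Tuple

Tokenizer = Callable[[str], List[str]]


def word_to_subword_spans(words: Sequence[str], tokenize: Tokenizer) -> List[range]:
    """For each word, return the range of subword indices it occupies."""
    spans = []
    start = 0
    for word in words:
        n = len(tokenize(word))
        spans.append(range(start, start + n))
        start += n
    return spans


def remap_alignments(
    src_words: Sequence[str],
    trg_words: Sequence[str],
    pairs: Sequence[Tuple[int, int]],
    tokenize: Tokenizer,
) -> List[Tuple[int, int]]:
    """Expand word-level pairs to subword-level pairs (all-to-all per word pair)."""
    src_spans = word_to_subword_spans(src_words, tokenize)
    trg_spans = word_to_subword_spans(trg_words, tokenize)
    return [
        (s, t)
        for i, j in pairs
        for s in src_spans[i]
        for t in trg_spans[j]
    ]


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in for a subword model: two-character chunks."""
    return [word[k:k + 2] for k in range(0, len(word), 2)]


if __name__ == "__main__":
    remapped = remap_alignments(
        ["hello", "hi"], ["ab", "cdef"], [(0, 0), (1, 1)], toy_tokenize
    )
    print(remapped)
```

In a scheme like this, every remapped index stays within the subword length of its sentence as long as the same tokenization produces both the spans and the text sent to Marian. A mismatch between the two would be one possible source of out-of-bound pairs.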
