OpusTrainer can produce incorrect alignments, breaking student training with guided alignments #469

gregtatum · 2024-03-01T15:40:05Z

I've found at least one bug in the implementation:

eu9ene · 2024-03-18T22:31:53Z

I think I'm hitting a similar problem, though my setup is quite different from yours. I used eflomal to align whitespace-split text (in effect, not tokenized at all), and OpusTrainer is then supposed to retokenize it using a SentencePiece model. I wonder whether this is something new or the same bug you encountered with our old alignment scheme and OpusTrainer with no modifiers.

https://firefox-ci-tc.services.mozilla.com/tasks/e4bBbBRbSZmTDtKsPJZMkQ/runs/0/logs/public/logs/live.log

[task 2024-03-16T14:21:13.879Z] [2024-03-16 14:21:13] [memory] Reserving 32 MB, device gpu0
[task 2024-03-16T14:21:13.883Z] [2024-03-16 14:21:13] Ep. 1 : Up. 1 : Sen. 760 : Cost 0.87012815 : Time 431.98s : 158.72 words/s : gNorm 3.8305 : L.r. 1.8750e-08
[task 2024-03-16T14:21:14.166Z] [2024-03-16 14:21:14] Ep. 1 : Up. 2 : Sen. 10,520 : Cost 4.55021906 : Time 0.28s : 448754.57 words/s : gNorm 4.8969 : L.r. 3.7500e-08
[task 2024-03-16T14:21:14.475Z] [2024-03-16 14:21:14] Ep. 1 : Up. 3 : Sen. 16,928 : Cost 3.01045871 : Time 0.31s : 435164.87 words/s : gNorm 4.6959 : L.r. 5.6250e-08
[task 2024-03-16T14:21:14.747Z] [2024-03-16 14:21:14] Ep. 1 : Up. 4 : Sen. 18,480 : Cost 1.29863644 : Time 0.27s : 331105.01 words/s : gNorm 4.3090 : L.r. 7.5000e-08
[task 2024-03-16T14:21:15.065Z] [2024-03-16 14:21:15] Ep. 1 : Up. 5 : Sen. 21,950 : Cost 2.17084908 : Time 0.32s : 469510.57 words/s : gNorm 4.5838 : L.r. 9.3750e-08
[task 2024-03-16T14:21:15.376Z] [2024-03-16 14:21:15] Ep. 1 : Up. 6 : Sen. 23,502 : Cost 1.19824779 : Time 0.31s : 354549.51 words/s : gNorm 4.3286 : L.r. 1.1250e-07
[task 2024-03-16T14:21:15.616Z] [2024-03-16 14:21:15] Ep. 1 : Up. 7 : Sen. 25,734 : Cost 1.84143591 : Time 0.24s : 306548.30 words/s : gNorm 4.2291 : L.r. 1.3125e-07
[task 2024-03-16T14:21:15.869Z] [2024-03-16 14:21:15] Ep. 1 : Up. 8 : Sen. 26,542 : Cost 0.93719184 : Time 0.25s : 271700.20 words/s : gNorm 4.0934 : L.r. 1.5000e-07
[task 2024-03-16T14:21:16.090Z] [2024-03-16 14:21:16] Ep. 1 : Up. 9 : Sen. 28,278 : Cost 1.63907695 : Time 0.22s : 298939.48 words/s : gNorm 4.0213 : L.r. 1.6875e-07
[task 2024-03-16T14:21:16.369Z] [2024-03-16 14:21:16] Ep. 1 : Up. 10 : Sen. 29,142 : Cost 0.90442389 : Time 0.28s : 288550.01 words/s : gNorm 4.0063 : L.r. 1.8750e-07
[task 2024-03-16T14:32:03.387Z] [2024-03-16 14:32:03] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:32:03.393Z] [2024-03-16 14:32:03] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:32:58.725Z] [2024-03-16 14:32:58] Ep. 1 : Up. 1000 : Sen. 3,819,068 : Cost 1.48244727 : Time 702.36s : 144022.15 words/s : gNorm 1.3209 : L.r. 1.8750e-05
[task 2024-03-16T14:43:24.199Z] [2024-03-16 14:43:24] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:43:24.205Z] [2024-03-16 14:43:24] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:47:07.294Z] [2024-03-16 14:47:07] Ep. 1 : Up. 2000 : Sen. 7,592,507 : Cost 1.09335911 : Time 848.57s : 121878.65 words/s : gNorm 1.4593 : L.r. 3.7500e-05
[task 2024-03-16T14:47:47.260Z] [2024-03-16 14:47:47] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:47:47.266Z] [2024-03-16 14:47:47] [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
[task 2024-03-16T14:49:55.660Z] [2024-03-16 14:49:55] Error: Segmentation fault
[task 2024-03-16T14:49:55.660Z] [2024-03-16 14:49:55] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /builds/worker/fetches/marian-source/src/common/logging.cpp:130
[task 2024-03-16T14:49:55.673Z] 
[task 2024-03-16T14:49:55.673Z] [CALL STACK]
[task 2024-03-16T14:49:55.673Z] [0x5616e0584fc5]                                                       + 0x519fc5
[task 2024-03-16T14:49:55.673Z] [0x5616e058520f]                                                       + 0x51a20f
[task 2024-03-16T14:49:55.673Z] [0x7f8bca042520]                                                       + 0x42520
[task 2024-03-16T14:49:55.673Z] [0x5616e065e0f8]    marian::data::CorpusBase::  addAlignmentsToBatch  (std::shared_ptr<marian::data::CorpusBatch>,  std::vector<marian::data::SentenceTuple,std::allocator<marian::data::SentenceTuple>> const&) + 0x438
[task 2024-03-16T14:49:55.673Z] [0x5616e0673562]    marian::data::Corpus::  toBatch  (std::vector<marian::data::SentenceTuple,std::allocator<marian::data::SentenceTuple>> const&) + 0x1252
[task 2024-03-16T14:49:55.674Z] [0x5616e055ada4]    marian::data::BatchGenerator<marian::data::CorpusBase>::  fetchBatches  () + 0x1204
[task 2024-03-16T14:49:55.674Z] [0x5616e055bab3]    marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1}::  operator()  () const + 0x33
[task 2024-03-16T14:49:55.674Z] [0x5616e055ca21]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::_M_run()::{lambda()#1},std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>>::  _M_invoke  (std::_Any_data const&) + 0x51
[task 2024-03-16T14:49:55.674Z] [0x5616e04861fd]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x2d
[task 2024-03-16T14:49:55.674Z] [0x7f8bca099ee8]                                                       + 0x99ee8
[task 2024-03-16T14:49:55.674Z] [0x5616e048ad70]    std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::  _M_run  () + 0xf0
[task 2024-03-16T14:49:55.674Z] [0x5616e048bcd5]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x1a5
[task 2024-03-16T14:49:55.674Z] [0x7f8bca4dc253]                                                       + 0xdc253
[task 2024-03-16T14:49:55.674Z] [0x7f8bca094ac3]                                                       + 0x94ac3
[task 2024-03-16T14:49:55.674Z] [0x7f8bca126850]                                                       + 0x126850
[task 2024-03-16T14:49:55.674Z] 
[task 2024-03-16T14:49:56.803Z] [2024-03-16 14:49:56] [Trainer] [INFO] trainer stopped reading input
[fetches 2024-03-16T14:50:00.102Z] removing /home/ubuntu/tasks/task_171059821904953/fetches
[fetches 2024-03-16T14:50:02.054Z] finished
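
The "Out-of-bound alignment pairs" warnings in the log suggest that some alignment indices point past the end of the source or target token list. A minimal sketch of that kind of check follows, assuming each corpus line is `source<TAB>target<TAB>alignment` with Pharaoh-style `i-j` pairs and whitespace-split tokens. The function names are hypothetical, and this is not OpusTrainer's or Marian's actual validation code; Marian counts tokens after its own vocabulary is applied, so a line can pass this check and still be out of bounds there.

```python
def parse_alignment(field: str) -> list[tuple[int, int]]:
    """Parse a Pharaoh-style alignment string such as "0-0 1-2"."""
    pairs = []
    for item in field.split():
        src, trg = item.split("-")
        pairs.append((int(src), int(trg)))
    return pairs


def out_of_bound_pairs(src_tokens, trg_tokens, pairs):
    """Return the pairs whose index falls outside either token list."""
    return [
        (s, t)
        for s, t in pairs
        if not (0 <= s < len(src_tokens) and 0 <= t < len(trg_tokens))
    ]


def check_line(line: str) -> list[tuple[int, int]]:
    """Check one three-field TSV line, splitting sentences on whitespace."""
    src, trg, aln = line.rstrip("\n").split("\t")
    return out_of_bound_pairs(src.split(), trg.split(), parse_alignment(aln))
```

Running `check_line` over the corpus before training would flag lines whose word-level alignments are already inconsistent, which helps separate bad input data from errors introduced during retokenization.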

eu9ene · 2024-03-25T23:07:23Z

OK, it failed with only the Tags modifier enabled in my branch. We can't remove it in the inline noise branch because it's supposed to remap the alignments. https://firefox-ci-tc.services.mozilla.com/tasks/dTCAmrqKQgCg3e6hSEiKVA/runs/0/logs/public/logs/live.log

datasets:
  original: /home/ubuntu/tasks/task_171105412365522/fetches/corpus.lten.tsv # Original parallel corpus

stages:
  - train

train:
  - original 1.0
  - until original inf # General training until marian early stops

modifiers:
#- UpperCase: 0.07 # Apply randomly to 7% of sentences
#- TitleCase: 0.05
#- Typos: 0.05
## inserts new noise sentences
#- Noise: 0.0005
#  min_word_length: 2 # Minimum word length for each word in the noisy sentence
#  max_word_length: 5 # Maximum word length for each word in the noisy sentence
#  max_words: 6 # Maximum number of words in each noisy sentence
# generates inline noise (emojis etc.) matching position in source and target using alignments
# spm_vocab argument: retokenize alignments from spaces to Sentencepiece subwords and feed to Marian
# Tags modifier has to be the last one to retokenize the alignments
- Tags: 0.005
  augment: 1
  spm_vocab: /home/ubuntu/tasks/task_171105412365522/fetches/vocab.spm

seed: 1111
# parallel sentences + token alignments
num_fields: 3
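
The config comments say the alignments are retokenized from whitespace words to SentencePiece subwords. One common way to do such a remapping is to expand each word-level pair to every combination of the subword pieces of the two aligned words. The sketch below illustrates that idea only; it is an assumption, not the Tags modifier's actual implementation. The function name is hypothetical, and the per-word pieces in the usage note are written by hand rather than produced by a real SentencePiece model.

```python
def remap_alignment(src_pieces, trg_pieces, word_pairs):
    """Expand word-level alignment pairs to subword-level pairs.

    src_pieces / trg_pieces: one list of subword pieces per word.
    word_pairs: (source_word_index, target_word_index) tuples.
    """

    def spans(pieces_per_word):
        # Map each word index to the range of subword indices it covers.
        result, start = [], 0
        for pieces in pieces_per_word:
            result.append(range(start, start + len(pieces)))
            start += len(pieces)
        return result

    src_spans = spans(src_pieces)
    trg_spans = spans(trg_pieces)

    subword_pairs = []
    for s, t in word_pairs:
        # An IndexError here means the word-level pair was already out of bounds.
        for i in src_spans[s]:
            for j in trg_spans[t]:
                subword_pairs.append((i, j))
    return subword_pairs
```

For example, with `src_pieces = [["▁Hel", "lo"], ["▁world"]]`, `trg_pieces = [["▁Sveiki"], ["▁pa", "saul", "i"]]` and `word_pairs = [(0, 0), (1, 1)]`, the function returns `[(0, 0), (1, 0), (2, 1), (2, 2), (2, 3)]`. Every resulting index stays within the subword sequence length as long as the word-level pairs were in bounds and the per-word piece lists match the final tokenization.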

eu9ene added the bug (Something is broken or not correct) label Mar 18, 2024

eu9ene mentioned this issue Mar 26, 2024

Update Marian to v1.12.14 2d067afb 2024-02-16 #491

Merged

eu9ene self-assigned this Mar 26, 2024

eu9ene closed this as completed in #491 Mar 28, 2024
