Validate alignments - Segmentation fault with addAlignmentsToBatch #450

Closed
Tracked by #369
gregtatum opened this issue Feb 15, 2024 · 6 comments

Comments

@gregtatum
Member

gregtatum commented Feb 15, 2024

In the en-ca training #384 I'm getting a crash in Marian from the alignments.

Taskcluster Log

Error: Segmentation fault
Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /builds/worker/fetches/marian-source/src/common/logging.cpp:130

marian::data::CorpusBase::addAlignmentsToBatch(std::shared_ptr<marian::data::CorpusBatch>, std::vector<marian::data::SentenceTuple,std::allocator<marian::data::SentenceTuple>> const&) + 0x438
marian::data::Corpus::toBatch(std::vector<marian::data::SentenceTuple,std::allocator<marian::data::SentenceTuple>> const&) + 0x1252
marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatches() + 0x1204
marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1}::operator()() const + 0x33
std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::_M_run()::{lambda()#1},std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>>::_M_invoke(std::_Any_data const&) + 0x51
std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 0x2d
std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::_M_run() + 0xf0
std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::_M_run() + 0x1a5

In this message from Jorg: https://groups.google.com/g/marian-nmt/c/PjA-rQJ3Oio

"To answer my own post: it was indeed a bug in my alignment. I had some links outside of the range of tokens in one language."

So I suspect there is an alignment that is broken somehow in our code. We should validate the alignments. I'll investigate.
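For reference, here is a minimal sketch of what such a validation could look like. It assumes the alignments are in the usual `src-trg` index-pair format (e.g. `0-0 1-2`), one line per sentence pair, and that tokens are whitespace-separated in the same tokenization the aligner saw. The function name `find_invalid_links` is hypothetical and not part of the pipeline or of Marian.

```python
def find_invalid_links(src_line: str, trg_line: str, aln_line: str) -> list[tuple[int, int]]:
    """Return the alignment links that point outside either sentence's token range.

    Assumes whitespace-separated tokens and "srcIdx-trgIdx" links with
    0-based indices. A malformed link raises ValueError.
    """
    src_len = len(src_line.split())
    trg_len = len(trg_line.split())

    invalid = []
    for link in aln_line.split():
        src_idx, trg_idx = (int(part) for part in link.split("-"))
        # A link is valid only if both indices address an existing token.
        if not (0 <= src_idx < src_len and 0 <= trg_idx < trg_len):
            invalid.append((src_idx, trg_idx))
    return invalid


# The source has 3 tokens and the target has 2, so target index 2 is out of range.
print(find_invalid_links("a b c", "x y", "0-0 2-2"))  # [(2, 2)]
```

Running a check like this over the whole corpus before training would flag the kind of out-of-range link Jorg describes, rather than leaving it to crash inside `addAlignmentsToBatch`.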

@gregtatum gregtatum added the bug Something is broken or not correct label Feb 15, 2024
@gregtatum gregtatum self-assigned this Feb 15, 2024
@gregtatum
Member Author

gregtatum commented Feb 21, 2024

Edit: I first thought the pruned lexical list was broken, but I just didn't understand the format.

@gregtatum
Member Author

I've validated the correctness of the generated alignments.

https://firefox-ci-tc.services.mozilla.com/tasks/b-7CDsKNQ_Cn7wf3RcxDXw#artifacts

@gregtatum
Member Author

This is still happening with OpusTrainer even with no augmentation. I removed it and used Marian directly in my training and worked around this, but we should fix this to support augmentation in teachers.

@marco-c
Collaborator

marco-c commented Mar 29, 2024

Was this fixed by #491?

@eu9ene
Collaborator

eu9ene commented Apr 1, 2024

Yes, the students are training now and we don't see this error after the Marian update.

@eu9ene eu9ene closed this as completed Apr 1, 2024
nordzilla pushed a commit that referenced this issue Sep 19, 2024
* Enables model ensembles

Adds the ability to use ensembles of models. This supports ensembles of
binary- or npz-format models, as well as mixtures of both.

When all models in the ensembles are of binary format, the load from
memory path is used. Otherwise, they are loaded via the file system.
Enable log-level debug for output related to this.

* Fix formatting

* Fix WASM bindings for MemoryBundle

For now, this does not support ensembles.

* Remove shared_ptr wrapping the AlignedMemory of models.

* Fix formatting