Out of memory on shuffling huge datasets #21

Closed
eu9ene opened this issue Aug 26, 2021 · 6 comments
Labels: bug (Something is broken or not correct)

Comments

Collaborator

eu9ene commented Aug 26, 2021

300M dataset, 128 GB RAM

The workaround is to shuffle the dataset after the merge step, disable --shuffle-in-ram, and use --shuffle batches.
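
For concreteness, a rough sketch of that workaround (file names are made up, and note that plain `shuf` itself needs enough RAM to hold the whole corpus, so a disk-based shuffle may still be needed for the largest datasets):

```bash
# Shuffle the merged corpus once on disk before training.
# Keep source and target lines aligned by pasting them into a single TSV first.
paste corpus.src corpus.trg | shuf > corpus.shuf.tsv
cut -f1 corpus.shuf.tsv > corpus.shuf.src
cut -f2 corpus.shuf.tsv > corpus.shuf.trg

# Then train without --shuffle-in-ram and with --shuffle batches,
# so Marian shuffles only batches instead of the whole corpus in RAM.
```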

eu9ene self-assigned this on Aug 26, 2021
eu9ene removed their assignment on Jan 19, 2022
Collaborator Author

eu9ene commented Feb 10, 2022

This might be a bug in Marian. Memory shouldn't grow after --shuffle-in-ram is removed, and we should use the --shuffle data mode. It was discussed in #70 (comment)

eu9ene changed the title from "Student training is out of memory on huge datasets" to "Out of memory on shuffling huge datasets" on Jun 10, 2022
Collaborator Author

eu9ene commented Jun 10, 2022

Training teachers with --shuffle batches leads to training curves like the one below. Maybe other factors are at play here.

[Screenshot: training curves, Jun 10, 2022]

Collaborator Author

eu9ene commented Jun 10, 2022

Related Marian issue: marian-nmt/marian-dev#148

@XapaJIaMnu
Contributor

--sqlite should help, but I've found it slow in practice.

Contributor

jelmervdl commented Jun 11, 2022

I suspect that running out of memory, even when --shuffle-in-ram is not used, comes from here:

https://github.com/marian-nmt/marian-dev/blob/042ed8f2e23557d0cdb956aea7d79be8c817e0b0/src/data/corpus.cpp#L227-L241

Assuming that's actually the cause, we could replace it with a two-pass shuffle (a rough sketch follows the list below):

  1. Read the unshuffled dataset, and write each line to one of N temp files, chosen randomly for each line.
     How large N needs to be can probably be determined by looking at how large the input file is and how much memory is available for shuffling. It might be trickier to estimate if the input file is gzipped.
  2. Shuffle each of the temp files as is done now: read it into memory, do std::shuffle.
  3. Concatenate the temp files into the final shuffled temp file.
     Or implement some reader class that takes ownership of the temp files and reads from them consecutively as if they were one.
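
A minimal sketch of that two-pass approach in Python (not Marian code; the function name, paths, and bucket count are made up, and it assumes each bucket fits comfortably in memory):

```python
# Minimal sketch of the two-pass shuffle described above (not Marian code).
# Pass 1 scatters lines across N temp files at random; pass 2 shuffles each
# temp file in memory and concatenates the results into the output file.
import random
import tempfile

def two_pass_shuffle(in_path, out_path, n_buckets=64, seed=1234):
    rng = random.Random(seed)

    # Pass 1: write each input line to one of n_buckets temp files, chosen at random.
    buckets = [tempfile.TemporaryFile(mode="w+", encoding="utf-8")
               for _ in range(n_buckets)]
    with open(in_path, encoding="utf-8") as src:
        for line in src:
            buckets[rng.randrange(n_buckets)].write(line)

    # Pass 2: shuffle each bucket in memory (it holds roughly 1/n_buckets of the
    # corpus) and append it to the output, then discard the temp file.
    with open(out_path, "w", encoding="utf-8") as dst:
        for bucket in buckets:
            bucket.seek(0)
            lines = bucket.readlines()
            rng.shuffle(lines)
            dst.writelines(lines)
            bucket.close()
```

Each temp file ends up with about 1/N of the corpus, so N can be picked from the corpus size and the memory available for shuffling, as in step 1 above.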

Edit: or do it like this

Edit: as for why --shuffle batches performs worse: in the training loop the corpus is shuffled repeatedly (the batchGenerator->prepare() call). I don't know how often this happens in practice, but I can imagine that without that shuffle the order isn't random enough.

Collaborator Author

eu9ene commented May 8, 2024

I haven't seen this for some time, and I assume it's fixed by using OpusTrainer.

eu9ene closed this as completed on May 8, 2024