Out of memory on shuffling huge datasets #21

Closed
eu9ene opened this issue Aug 26, 2021 · 6 comments
Labels: bug (Something is broken or not correct)

Comments

Collaborator

eu9ene commented Aug 26, 2021

300M dataset, 128 GB RAM

The workaround is to shuffle the dataset after the merge step, disable --shuffle-in-ram, and use --shuffle batches.
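
For concreteness, a rough sketch of that workaround (file names are made up, and note that plain `shuf` itself needs enough RAM to hold the whole corpus, so a disk-based shuffle may still be needed for the largest datasets):

```bash
# Shuffle the merged corpus once on disk before training.
# Keep source and target lines aligned by pasting them into a single TSV first.
paste corpus.src corpus.trg | shuf > corpus.shuf.tsv
cut -f1 corpus.shuf.tsv > corpus.shuf.src
cut -f2 corpus.shuf.tsv > corpus.shuf.trg

# Then train without --shuffle-in-ram and with --shuffle batches,
# so Marian shuffles only batches instead of the whole corpus in RAM.
```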

eu9ene self-assigned this on Aug 26, 2021
eu9ene removed their assignment on Jan 19, 2022
Collaborator Author

eu9ene commented Feb 10, 2022

This might be a bug in Marian. Memory shouldn't grow after --shuffle-in-ram is removed, and we should use the --shuffle data mode. It was discussed in #70 (comment)

eu9ene changed the title from "Student training is out of memory on huge datasets" to "Out of memory on shuffling huge datasets" on Jun 10, 2022
Collaborator Author

eu9ene commented Jun 10, 2022

Training teachers with --shuffle batches leads to training curves like the one below. Maybe other factors are at play here.

[Screenshot: training curves, Jun 10, 2022]

Collaborator Author

eu9ene commented Jun 10, 2022

Related Marian issue: marian-nmt/marian-dev#148

@XapaJIaMnu
Contributor

--sqlite should help, but I've found it slow in practice.

Contributor

jelmervdl commented Jun 11, 2022

I suspect that running out of memory, even when --shuffle-in-ram is not used, comes from here:

https://github.com/marian-nmt/marian-dev/blob/042ed8f2e23557d0cdb956aea7d79be8c817e0b0/src/data/corpus.cpp#L227-L241

Assuming that's actually the cause, we could replace it with a two-pass shuffle (a rough sketch follows the list below):

  1. Read the unshuffled dataset, and write each line to one of N temp files, chosen randomly for each line.
     How large N needs to be can probably be determined by looking at how large the input file is and how much memory is available for shuffling. It might be trickier to estimate if the input file is gzipped.
  2. Shuffle each of the temp files as is done now: read it into memory, do std::shuffle.
  3. Concatenate the temp files into the final shuffled temp file.
     Or implement some reader class that takes ownership of the temp files and reads from them consecutively as if they were one.
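
A minimal sketch of that two-pass approach in Python (not Marian code; the function name, paths, and bucket count are made up, and it assumes each bucket fits comfortably in memory):

```python
# Minimal sketch of the two-pass shuffle described above (not Marian code).
# Pass 1 scatters lines across N temp files at random; pass 2 shuffles each
# temp file in memory and concatenates the results into the output file.
import random
import tempfile

def two_pass_shuffle(in_path, out_path, n_buckets=64, seed=1234):
    rng = random.Random(seed)

    # Pass 1: write each input line to one of n_buckets temp files, chosen at random.
    buckets = [tempfile.TemporaryFile(mode="w+", encoding="utf-8")
               for _ in range(n_buckets)]
    with open(in_path, encoding="utf-8") as src:
        for line in src:
            buckets[rng.randrange(n_buckets)].write(line)

    # Pass 2: shuffle each bucket in memory (it holds roughly 1/n_buckets of the
    # corpus) and append it to the output, then discard the temp file.
    with open(out_path, "w", encoding="utf-8") as dst:
        for bucket in buckets:
            bucket.seek(0)
            lines = bucket.readlines()
            rng.shuffle(lines)
            dst.writelines(lines)
            bucket.close()
```

Each temp file ends up with about 1/N of the corpus, so N can be picked from the corpus size and the memory available for shuffling, as in step 1 above.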

Edit: or do it like this

Edit: as for why --shuffle batches performs worse: in the training loop the corpus is shuffled repeatedly (the batchGenerator->prepare() call). I don't know how often this happens in practice, but I can imagine that without that shuffle the order isn't random enough.

Collaborator Author

eu9ene commented May 8, 2024

I haven't seen this for some time, and I assume it's fixed by using OpusTrainer.

eu9ene closed this as completed on May 8, 2024