
Ability to encode files to tokens separately from fine-tuning? #19

Closed
bob80333 opened this issue Apr 22, 2019 · 5 comments

Comments

@bob80333

I have a ~160MB dataset that fine-tunes fine in Google Colab, but a ~180MB dataset causes the runtime to crash while loading the dataset because it exhausts the available RAM. However, during fine-tuning I noticed that ~6GB of VRAM and ~10GB of RAM are still free.

My dataset was originally many smaller files that were combined. If I could encode each of those files into tokens separately, combine the encoded results, and skip the encoding step when loading the dataset, I think I could use larger datasets without running out of RAM.

I did notice that line 27 of load_dataset.py seems to be able to load pre-encoded files.
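Roughly what I have in mind, as a sketch (I'm assuming the encoder module from the GPT-2 repo's src/ directory with a get_encoder(model_name) signature, and that a .npz of token arrays is what load_dataset.py expects; the paths and file names are placeholders from my setup):

```python
# Sketch: encode each small text file to GPT-2 token IDs, then save them all
# as one compressed .npz so the expensive BPE step never has to run on Colab.
import glob
import numpy as np
import encoder  # the encoder module from the GPT-2 repo's src/ directory

enc = encoder.get_encoder("117M")  # assumes the model files live under models/117M

chunks = []
for path in sorted(glob.glob("my_dataset/*.txt")):
    with open(path, "r", encoding="utf-8") as f:
        chunks.append(np.array(enc.encode(f.read()), dtype=np.int32))

# One token array per source file; loading this .npz later skips re-encoding.
np.savez_compressed("my_dataset_encoded.npz", *chunks)
```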

@bob80333
Author

Update: I was able to use nsheppard's encode.py to encode a 544MB text dataset into a 160MB tokenized dataset, which loads in seconds on Google Colab, whereas the raw text would previously crash it. The tokenization process did use >20GB of RAM on my machine and read over 37GB from disk, so I think there is probably some room for optimization there.

In any case, I am now successfully training on the ~544MB dataset (137,968,501 tokens), which previously crashed Google Colab.
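(The token count is just from loading the .npz back and summing the chunk lengths, along these lines, using the placeholder file name from my earlier sketch:)

```python
import numpy as np

# Quick sanity check: total token count across all chunks in the encoded dataset.
with np.load("my_dataset_encoded.npz") as npz:
    print(sum(len(npz[name]) for name in npz.files))
```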

@minimaxir
Owner

Huh, TIL. I wasn't sure what encode.py was in that repo; pre-encoding makes sense! Thanks for the technical breakdown!

I should probably port that to this repo.

@bob80333
Author

Yeah, I can't claim full credit for that; I read about it in this blog post on retraining GPT-2 for poetry. It had some interesting ideas for improving output, like using beam search or tree search instead of just greedy search. I did find a TensorFlow beam search decoder, but I couldn't find any information on how to use it.

@minimaxir
Owner

Added with gpt2.encode_dataset().
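Rough intended usage, as a sketch; the parameter names and defaults here are from memory, so check the README for the exact signature:

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="117M")  # the BPE encoder files ship with the model

# One-time, RAM-heavy step: turn the raw text into token IDs and save them.
gpt2.encode_dataset("my_dataset.txt", out_path="my_dataset_encoded.npz",
                    model_name="117M")

# Fine-tune directly on the pre-encoded .npz, skipping the encoding pass at load time.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "my_dataset_encoded.npz", model_name="117M", steps=1000)
```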

@saippuakauppias

Maybe someone could try using https://github.com/Blosc/bcolz?
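For example, something like this sketch, with the encoded-dataset file name carried over from the earlier comments as a placeholder:

```python
import bcolz
import numpy as np

# Store the encoded tokens as a compressed, chunked on-disk array so the whole
# dataset never has to sit uncompressed in RAM at once.
with np.load("my_dataset_encoded.npz") as npz:
    tokens = np.concatenate([npz[name] for name in npz.files])

carr = bcolz.carray(tokens, rootdir="tokens.bcolz", mode="w")
carr.flush()

# Later: reopen and slice without decompressing everything up front.
carr = bcolz.open("tokens.bcolz", mode="r")
print(len(carr), carr[:10])
```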
