Ability to encode files to tokens separately from fine-tuning? #19
Comments
Update: I was able to use nsheppard's encode.py to encode a 544MB dataset into a tokenized 160MB dataset, which loads in seconds on Google Colab and would previously have crashed it. The tokenization process did use more than 20GB of RAM on my computer and read over 37GB from disk, so I think there is probably some room for optimization there. However, I am currently successfully training on a ~544MB dataset with 137,968,501 tokens that previously crashed Google Colab.
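For anyone who wants to reproduce that pre-encoding step, here is a rough sketch in the style of nsheppard's encode.py. It assumes the `encoder` and `load_dataset` modules from the GPT-2 codebase are importable and that the 117M model files are in the usual `models/` directory; the file names are just examples.

```python
# Sketch of pre-encoding a text dataset into a compressed .npz of token chunks,
# in the style of nsheppard's encode.py. Module names and paths are assumptions.
import numpy as np

import encoder                      # src/encoder.py from the GPT-2 codebase
from load_dataset import load_dataset

enc = encoder.get_encoder("117M")   # load the BPE vocabulary for the model
chunks = load_dataset(enc, "dataset.txt", combine=50000)  # list of token arrays

# Save the token chunks as a compressed .npz so later runs can skip re-encoding.
np.savez_compressed("dataset_encoded.npz", *chunks)
```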
Huh, TIL. I wasn't aware of that script; I should probably port that to this repo.
Yeah, I can't claim full credit for that; I read about it in this blog post on retraining GPT-2 for poetry, which had some interesting ideas for improving output, like using beam search or tree search instead of just a greedy search. I did find a TensorFlow beam search decoder, but I couldn't find any information on how to use it.
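To illustrate the idea (this is not that TensorFlow decoder, just a minimal, model-agnostic sketch), beam search keeps the `beam_width` best partial sequences at each step instead of only the single most likely one. `next_token_logprobs` here is a hypothetical callback that returns log-probabilities for the next token given a prefix; greedy search is simply the `beam_width=1` special case.

```python
# Minimal beam search sketch. `next_token_logprobs(seq)` is a hypothetical
# callback returning {token_id: log_prob} for the next token given `seq`.
import heapq

def beam_search(next_token_logprobs, prefix, beam_width=5, max_steps=40, eos_id=50256):
    # 50256 is GPT-2's <|endoftext|> token. Each beam is (cumulative log-prob, tokens).
    beams = [(0.0, list(prefix))]
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos_id:          # finished sequences pass through unchanged
                candidates.append((score, seq))
                continue
            for token_id, logp in next_token_logprobs(seq).items():
                candidates.append((score + logp, seq + [token_id]))
        # Keep only the beam_width highest-scoring sequences.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]
```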
Added with
Maybe someone will try to use https://github.com/Blosc/bcolz?
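Purely to illustrate that suggestion (the repo does not use bcolz, and the file and directory names below are made up), a pre-encoded token array could be stored compressed on disk with bcolz roughly like this:

```python
# Rough sketch: store an already-encoded token array with bcolz so it sits
# compressed on disk and can be read back without holding raw text in RAM.
import bcolz
import numpy as np

tokens = np.load("dataset_encoded.npz")["arr_0"]       # one pre-encoded chunk

carr = bcolz.carray(tokens, rootdir="tokens.bcolz", mode="w")
carr.flush()                                           # write compressed chunks to disk

reloaded = bcolz.open("tokens.bcolz", mode="r")[:]     # back to a plain numpy array
```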
I have a dataset of ~160MB fine-tuning in Google Colab, but a dataset of ~180MB causes the runtime to crash while loading the dataset, because it uses all available RAM. However, while fine-tuning, I noticed that about 6GB of VRAM and ~10GB of RAM are still available.
My dataset was originally many smaller files that were combined. If I could encode each of these separately into tokens, then combine the encoded results and skip the encoding step when loading the dataset, I think I could use larger datasets while avoiding running out of RAM.
I did notice that line 27 of load_dataset.py seems to be able to load pre-encoded files.
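As a rough sketch of that idea, assuming the same `encoder` and `load_dataset` modules as in nsheppard's encode.py (the model name and file names are just examples), encoding each file separately and merging the token chunks into a single .npz could look like this:

```python
# Sketch: encode many small text files one at a time, then merge the token
# chunks into a single compressed .npz, so the BPE step never has to hold the
# whole corpus in memory at once. Module names and paths are assumptions.
import glob
import numpy as np

import encoder
from load_dataset import load_dataset

enc = encoder.get_encoder("117M")

all_chunks = []
for path in sorted(glob.glob("corpus/*.txt")):
    # Each call only holds one file's raw text and tokens in memory.
    all_chunks.extend(load_dataset(enc, path, combine=50000))

np.savez_compressed("corpus_encoded.npz", *all_chunks)

# Because load_dataset() takes the .npz branch for pre-encoded files, later
# training runs can load the tokens directly instead of re-running the encoder.
chunks = load_dataset(enc, "corpus_encoded.npz", combine=50000)
```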