
Ability to encode files to tokens separately from fine-tuning? #19

Closed
bob80333 opened this issue Apr 22, 2019 · 5 comments

Comments

@bob80333

I have a ~160MB dataset that fine-tunes fine in Google Colab, but a ~180MB dataset causes the runtime to crash while loading the dataset because it exhausts the available RAM. However, during fine-tuning I noticed that ~6GB of VRAM and ~10GB of RAM are still free.

My dataset was originally many smaller files that were combined. If I could encode each of those files into tokens separately, combine the encoded results, and skip the encoding step when loading the dataset, I think I could use larger datasets without running out of RAM.

I did notice that line 27 of load_dataset.py seems to be able to load pre-encoded files.
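Roughly what I have in mind, as a sketch (I'm assuming the encoder module from the GPT-2 repo's src/ directory with a get_encoder(model_name) signature, and that a .npz of token arrays is what load_dataset.py expects; the paths and file names are placeholders from my setup):

```python
# Sketch: encode each small text file to GPT-2 token IDs, then save them all
# as one compressed .npz so the expensive BPE step never has to run on Colab.
import glob
import numpy as np
import encoder  # the encoder module from the GPT-2 repo's src/ directory

enc = encoder.get_encoder("117M")  # assumes the model files live under models/117M

chunks = []
for path in sorted(glob.glob("my_dataset/*.txt")):
    with open(path, "r", encoding="utf-8") as f:
        chunks.append(np.array(enc.encode(f.read()), dtype=np.int32))

# One token array per source file; loading this .npz later skips re-encoding.
np.savez_compressed("my_dataset_encoded.npz", *chunks)
```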

@bob80333
Author

Update: I was able to use nsheppard's encode.py to encode a 544MB text dataset into a 160MB tokenized dataset, which loads in seconds on Google Colab, whereas the raw text would previously crash it. The tokenization process did use >20GB of RAM on my machine and read over 37GB from disk, so I think there is probably some room for optimization there.

In any case, I am now successfully training on the ~544MB dataset (137,968,501 tokens), which previously crashed Google Colab.
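(The token count is just from loading the .npz back and summing the chunk lengths, along these lines, using the placeholder file name from my earlier sketch:)

```python
import numpy as np

# Quick sanity check: total token count across all chunks in the encoded dataset.
with np.load("my_dataset_encoded.npz") as npz:
    print(sum(len(npz[name]) for name in npz.files))
```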

@minimaxir
Owner

Huh, TIL. I wasn't sure what encode.py was in that repo; pre-encoding makes sense! Thanks for the technical breakdown!

I should probably port that to this repo.

@bob80333
Author

Yeah, I can't claim full credit for that; I read about it in this blog post on retraining GPT-2 for poetry. It had some interesting ideas for improving output, like using beam search or tree search instead of just greedy search. I did find a TensorFlow beam search decoder, but I couldn't find any information on how to use it.

@minimaxir
Owner

Added with gpt2.encode_dataset().
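Rough intended usage, as a sketch; the parameter names and defaults here are from memory, so check the README for the exact signature:

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="117M")  # the BPE encoder files ship with the model

# One-time, RAM-heavy step: turn the raw text into token IDs and save them.
gpt2.encode_dataset("my_dataset.txt", out_path="my_dataset_encoded.npz",
                    model_name="117M")

# Fine-tune directly on the pre-encoded .npz, skipping the encoding pass at load time.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "my_dataset_encoded.npz", model_name="117M", steps=1000)
```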

@saippuakauppias

Maybe someone could try using https://github.com/Blosc/bcolz?
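For example, something like this sketch, with the encoded-dataset file name carried over from the earlier comments as a placeholder:

```python
import bcolz
import numpy as np

# Store the encoded tokens as a compressed, chunked on-disk array so the whole
# dataset never has to sit uncompressed in RAM at once.
with np.load("my_dataset_encoded.npz") as npz:
    tokens = np.concatenate([npz[name] for name in npz.files])

carr = bcolz.carray(tokens, rootdir="tokens.bcolz", mode="w")
carr.flush()

# Later: reopen and slice without decompressing everything up front.
carr = bcolz.open("tokens.bcolz", mode="r")
print(len(carr), carr[:10])
```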
