training is very slow for DALLE #82
I use the pretrained dVAE model released by OpenAI.
I did a hyperparameter sweep optimizing for loss (not runtime). Nevertheless, you'll find the average runtime for an A100 (a pretty good GPU) for about 12 runs (1200 iterations each) here. As you can see, a lower depth will give you lower runtimes, but you may take a hit in loss as well. These are all done with a batch size of twelve. Are you using the latest release? Your experience of needing 300 (is that a typo?! how long did that take?) epochs to get anywhere doesn't match mine at all. I generally achieve a result within 28 epochs training on a shuffle of my dataset containing about a million image-text pairs. How large is CUBS? The size of your dataset may be loosely correlated, but if you're shuffling it I would hope you get reasonable reconstructions by at least 30,000 iterations. I realize that's a lot compared to other things but, well, you're running CLIP and the dVAE, all the while training a transformer. It's going to take a while if you're not on the latest GPU.
@afiaka87 As for my training, CUBS is a fine-grained dataset for bird classification, which has about 8000 images for training and 2500 for testing. Typically, I set depth=8 and heads=2 because my GPUs are not powerful enough for training a very large model. Moreover, I resize the images to 64x64 to make training faster. With this setup, an epoch costs about 3 minutes. Thanks for sharing the optimized parameters. I will try them and post the results if I succeed.
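Resizing to 64x64 helps so much because the transformer's image-token sequence shrinks quadratically with the image side length. As a rough illustration (a toy calculation, assuming a dVAE that downsamples each spatial side by a factor of 8, as OpenAI's released model does for 256x256 inputs):

```python
# Toy calculation: how many image tokens the transformer must model,
# assuming the dVAE downsamples each spatial side by 8x
# (OpenAI's released dVAE maps 256x256 pixels to a 32x32 token grid).
def image_tokens(image_size: int, downsample: int = 8) -> int:
    side = image_size // downsample
    return side * side

for size in (64, 128, 256):
    n = image_tokens(size)
    # self-attention cost grows roughly with the square of the sequence length
    print(f"{size}x{size} image -> {n} tokens (attention ~ {n * n} pairs)")
```

At 64x64 the image sequence is 16x shorter than at 256x256, which goes a long way toward explaining the 3-minute epochs.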
Are you aware of the `reversible` parameter?
Yeah, I set the 'reversible' parameter. Your suggestions are very helpful. I believe I can get some meaningful results in several days.
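For reference, the `reversible` flag trades compute for memory: layer inputs are recomputed from layer outputs during the backward pass instead of being stored. Here is a minimal pure-Python sketch of the reversible coupling idea (toy scalar functions stand in for the attention and feed-forward sub-blocks; this is an illustration of the technique, not dalle-pytorch's actual implementation):

```python
import math

# Toy reversible coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).
# F and G stand in for a layer's attention / feed-forward sub-blocks.
def F(x):
    return math.tanh(x)

def G(x):
    return 0.5 * math.sin(x)

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    # Inputs are recoverable from outputs, so activations need not be
    # kept in memory between the forward and backward passes.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

y1, y2 = forward(0.3, -1.2)
x1, x2 = inverse(y1, y2)
print(x1, x2)  # recovers 0.3, -1.2 (up to float rounding)
```

The saved activation memory lets you fit deeper models or larger batches on the same GPU, at the cost of roughly one extra forward computation per layer in the backward pass.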
Fantastic! Let me know if you find useful parameters for your configuration! |
@smallflyingpig We now have the VQGAN working! It's not a panacea, but it runs significantly faster and with significantly less memory. Please check the README for how to use it. I've confirmed this usage in another closed issue:
I am training the DALLE model with an 8-layer transformer (lr=3e-4) on the CUB dataset, but training is very slow: the loss has barely decreased after about 300 epochs. Are there any tricks for training the model? Has anyone succeeded in getting a well-trained DALLE model?