Re-implementation of DALL-E.
This repository is loosely based on the original DALL-E paper from OpenAI. Instead of a GPT-2/GPT-3-style autoregressive transformer decoder, it uses the MEGABYTE-based model from lucidrains.
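The VQ-VAE half of this pipeline turns image features into discrete codebook indices via nearest-neighbour lookup, and those indices are what the autoregressive model predicts. A minimal NumPy sketch of that quantization step (the codebook size, dimensions, and random values are illustrative; a real VQ-VAE learns the codebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative codebook: K entries of dimension D (a real VQ-VAE learns these).
K, D = 8, 4
codebook = rng.normal(size=(K, D))

def quantize(z):
    """Map each D-dim vector in z, shape (N, D), to its nearest codebook index."""
    # Squared L2 distance from every input vector to every codebook entry.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # discrete codes, shape (N,)

def decode(codes):
    """Look the indices back up; this embedding is what feeds the VQ-VAE decoder."""
    return codebook[codes]

z = rng.normal(size=(5, D))
codes = quantize(z)
z_q = decode(codes)
print(codes.shape, z_q.shape)  # (5,) (5, 4)
```

The discrete `codes` are the tokens the transformer models; `decode` recovers continuous vectors for the image decoder.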
- Use a VQ-VAE to encode and decode images.
- Ingest text tokens and predict VQ-VAE codes.
- Use the MEGABYTE model (which also allows a massive context length).
- Encode text at the character level for now.
- Autoregressively predict VQ-VAE codes from the text tokens.
- CIFAR-10 results were poor. This was initially attributed to the VQ-VAE handling images below 64x64 badly, so training switched to Tiny ImageNet. (NOTE: the real cause was a data-processing issue, nothing to do with CIFAR-10 itself.)
- Validate the Tiny ImageNet captions and images (so they match up).
- Labels needed to be sorted in the same order as the trainloader/testloader.
- Overfit the DALL-E model on a single caption-image pair.
- Overfit the DALL-E model on one batch of caption-image pairs.
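The text side of the steps above can be as simple as character-level tokenization, with image codes appended after the text tokens so the model predicts them autoregressively. A hypothetical sketch of that sequence layout (the vocab sizes and offset scheme are assumptions for illustration, not necessarily this repo's actual scheme):

```python
# Hypothetical training-sequence layout: text characters first, then
# VQ-VAE code indices, so the model learns to predict codes from text.
TEXT_VOCAB = 256          # raw byte values for character-level text
CODEBOOK_SIZE = 512       # illustrative VQ-VAE codebook size

def encode_chars(caption: str) -> list[int]:
    """Character-level text encoding: one token per UTF-8 byte."""
    return list(caption.encode("utf-8"))

def build_sequence(caption: str, image_codes: list[int]) -> list[int]:
    """Concatenate text tokens with image codes shifted past the text vocab."""
    # Offset the image codes so the two vocabularies do not collide.
    return encode_chars(caption) + [TEXT_VOCAB + c for c in image_codes]

seq = build_sequence("a red car", [3, 17, 511])
print(seq[-3:])  # [259, 273, 767]
```

The combined vocabulary is then `TEXT_VOCAB + CODEBOOK_SIZE`, and only the image-code positions need to contribute to the generation loss.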