Re-implementation of DALL-E.
This repository is loosely based on the original DALL-E paper from OpenAI. Instead of a GPT-2/GPT-3-style autoregressive transformer decoder, it uses the MEGABYTE-based model from lucidrains.
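The VQ-VAE half of this pipeline turns image features into discrete codebook indices via nearest-neighbour lookup, and those indices are what the autoregressive model predicts. A minimal NumPy sketch of that quantization step (the codebook size, dimensions, and random values are illustrative; a real VQ-VAE learns the codebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative codebook: K entries of dimension D (a real VQ-VAE learns these).
K, D = 8, 4
codebook = rng.normal(size=(K, D))

def quantize(z):
    """Map each D-dim vector in z, shape (N, D), to its nearest codebook index."""
    # Squared L2 distance from every input vector to every codebook entry.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # discrete codes, shape (N,)

def decode(codes):
    """Look the indices back up; this embedding is what feeds the VQ-VAE decoder."""
    return codebook[codes]

z = rng.normal(size=(5, D))
codes = quantize(z)
z_q = decode(codes)
print(codes.shape, z_q.shape)  # (5,) (5, 4)
```

The discrete `codes` are the tokens the transformer models; `decode` recovers continuous vectors for the image decoder.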
- Use a VQ-VAE to encode and decode images.
- Ingest text tokens and predict VQ-VAE codes.
- Use the MEGABYTE model (which also allows a massive context length).
- Encode text at the character level for now.
- Autoregressively predict VQ-VAE codes from the text tokens.
- CIFAR-10 results were poor. This was initially attributed to the VQ-VAE handling images below 64x64 badly, so training switched to Tiny ImageNet. (NOTE: the real cause was a data-processing issue, nothing to do with CIFAR-10 itself.)
- Validate the Tiny ImageNet captions and images (so they match up).
- Labels needed to be sorted in the same order as the trainloader/testloader.
- Overfit the DALL-E model on a single caption-image pair.
- Overfit the DALL-E model on one batch of caption-image pairs.
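The text side of the steps above can be as simple as character-level tokenization, with image codes appended after the text tokens so the model predicts them autoregressively. A hypothetical sketch of that sequence layout (the vocab sizes and offset scheme are assumptions for illustration, not necessarily this repo's actual scheme):

```python
# Hypothetical training-sequence layout: text characters first, then
# VQ-VAE code indices, so the model learns to predict codes from text.
TEXT_VOCAB = 256          # raw byte values for character-level text
CODEBOOK_SIZE = 512       # illustrative VQ-VAE codebook size

def encode_chars(caption: str) -> list[int]:
    """Character-level text encoding: one token per UTF-8 byte."""
    return list(caption.encode("utf-8"))

def build_sequence(caption: str, image_codes: list[int]) -> list[int]:
    """Concatenate text tokens with image codes shifted past the text vocab."""
    # Offset the image codes so the two vocabularies do not collide.
    return encode_chars(caption) + [TEXT_VOCAB + c for c in image_codes]

seq = build_sequence("a red car", [3, 17, 511])
print(seq[-3:])  # [259, 273, 767]
```

The combined vocabulary is then `TEXT_VOCAB + CODEBOOK_SIZE`, and only the image-code positions need to contribute to the generation loss.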