OpenAIVae implementation #74
@CDitzel that's exactly my understanding Carsten :) |
All right. So then there is no Gumbel-Softmax or reparameterization trick at work at all, and also no explicit KL loss to consider, since there are no constraints placed on the "codebook" to begin with. That would explain why the first author of the paper struggled to explain the loss formula he used in his own paper when Andrej and Justin asked him about it on Clubhouse. |
@CDitzel hmm, I think gumbel softmax was still used. can you point me to where in the clubhouse conversation they discuss the kl loss? |
Of course, Phil, happy to be of service. It's timestamped to the correct position. Hm, you are right, they do mention Gumbel-Softmax in the paper, but I am wondering how, when, and where it actually comes in. I tried to fine-tune the provided OpenAI models after downloading them with your script, but I didn't manage to set "requires_grad" to True. Do you happen to know whether it is possible to prohibit that, and if so, whether OpenAI did that to their model? |
The Gumbel-Softmax would be used on the encoder output, just before it is sent back to the decoder (whereas the hard version uses a one-hot). Nope, you can't prohibit that! Once you have the weights, the sky is the limit. |
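For concreteness, here is a minimal sketch of the two variants mentioned above, using PyTorch's built-in F.gumbel_softmax. The shapes (8192 codes on a 32x32 latent grid) are illustrative, not taken from this thread:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: 8192-way logits on a 32x32 latent grid.
logits = torch.randn(1, 8192, 32, 32)

# Soft (relaxed) sample: a probability vector per latent position.
soft = F.gumbel_softmax(logits, tau=1.0, hard=False, dim=1)

# Hard (straight-through) sample: a one-hot per position in the forward pass,
# with gradients taken through the soft relaxation.
hard = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=1)
```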
But it's hard trying to fine-tune the provided encoder and decoder when I still haven't completely understood either their data-range mapping procedure or the specifics of their loss function, particularly their averaging procedure and their mean/std output. Trying a simple L1 loss and backpropping through their networks yields NaN pretty soon... |
I found that the Gumbel-Softmax significantly impairs the reconstruction quality. Leaving it out and just doing the tensor contraction reduces to OpenAI's provided implementation. However, with their code I can't seem to get satisfying reconstructions from tokens via the one_hot encoding scheme. Crazy... |
@CDitzel
> "I found that the Gumbel-Softmax significantly impairs the reconstruction quality. Leaving it out and just doing the tensor contraction reduces to OpenAI's provided implementation. However, with their code I can't seem to get satisfying reconstructions from tokens via the one_hot encoding scheme. Crazy..."

I find that this notebook solves all the problems and makes it much easier to get a grasp on it. I feel it's a waste of time trying to train a not-very-deep "discrete" VAE of around 90 MB when it took a tech giant something like three weeks on a million TPUs to get a pretrained model which they didn't even release. The VAE they did release is practically just an encoder and a decoder, and there are plenty of those. The real magic of DALL-E is not the VAE but CLIP itself, and behind CLIP is the Vision Transformer, ViT. I don't see ViT mentioned in this notebook at all; it focuses exclusively on the VAE, which is literally just a middleman between CLIP and the model. CLIP is only mentioned in passing, and nothing is said about what model you should use or what its architecture is. Just my feeling. I spent days trying to get this to do anything, and I feel like those hours were in VAEn. Is this all just to get a graph? Where are the pics, lol? Where's the hedgehog violin?

DALL-E = a normalized latent space to start, text, tokenizer, mapper + **ViT**.

Correct me if I'm wrong, but from my very amateur understanding, having already generated hundreds of DALL-E-like images in practice, this is my two cents: the central player in this implementation is clip.load('ViT-B/32', jit=True). The Vision Transformer in this, ViT-B-32.pt at 343 MB, is far larger than the "discrete" tiny VAE and far more robust; the pretrained ViT model is the real performer. It's trained on thousands of ImageNet images, and this is where the magic happens. As far as encoding and decoding of text to tokens, and pixels to tensor values, any language model will do. You can use simple.tokenizer, you can use BERT or GPT-2, you can use a VAE; it's a fairly linear translation from a very small dictionary that frankly could be bigger, as far as I'm concerned. Convert text to tokens, then send those tokens to a vision transformer model so it knows what categories it should force-hallucinate into existence from the latent space; reward the mapper for increasing the similarity to the categories, and penalize the mapper if the similarity falls, until the mapper gets its p*xels together.

The latest version of this notebook even dropped the dall_e encoder completely, and it didn't change the result except that it uses less memory. The encoder is super inefficient in compute. The perceptor (the clip ViT model) is just as good at it, knows better than it does, and is pretrained on a lot of the images you'll see in the Multimodal Neurons distill.pub article. The VAE encoder seems redundant when you already have simple.tokenizer in the CLIP model doing the same thing. What you don't have is a visual perceiver that tells the pixel mapper how well it's doing, until it learns that moving the latent-space pixels in certain directions achieves less loss and more similarity to the text input than moving them in incorrect directions. Using the pretrained ViT, with a proper setup of the temperature/tau, learning rate (0.1 works fine), zero_grad, ncols, mean, std, and clamping of min and max, it achieves similar results, e.g. images of the type "Hedgehogs made out of Legos", and whatever else you want. |
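For readers trying to follow the loop being described, here is a rough, hedged sketch of CLIP-guided optimization (reward the "mapper" for similarity to the text prompt). The variable names, the direct pixel parameterization, and the loop settings are assumptions for illustration, not the notebook's actual code; a real notebook would optimize latents fed through a decoder, and CLIP's input normalization is skipped here for brevity:

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptor, preprocess = clip.load("ViT-B/32", jit=False)  # jit=False so we can backprop
perceptor = perceptor.eval()

text = clip.tokenize(["hedgehogs made out of legos"]).to(device)
with torch.no_grad():
    text_features = perceptor.encode_text(text)

# Hypothetical "mapper": here just raw pixels; a real notebook would optimize
# latents and decode them into an image instead.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.1)

for step in range(300):
    # NOTE: CLIP's mean/std input normalization is omitted for brevity.
    image_features = perceptor.encode_image(image.clamp(0, 1))
    # Reward similarity between the generated image and the text prompt.
    loss = -torch.cosine_similarity(image_features, text_features, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```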
This is DALL-E: it's simply CLIP, within which is embedded Google's Vision Transformer model (or a similar model trained on a HUGE dataset), plus a simple tokenizer and decoder. The way forward is to train a Vision Transformer model using CLIP, with or without the involvement of a tiny VAE (I would think a very deep VAE would be better), on as many images as possible. CLIP handles the labeling of them; that's what it's for! CLIP CAN see why kids love Cinnamon Toast Crunch. The ViT itself has a built-in encoder and decoder, attention, etc., and maps the pixels. This attention is all you need. If you have a pretrained ViT, and the ViT knows what it's looking at, and CLIP knows how concepts relate, and they relay the right amount of reward and penalty to get the mapper to correct itself, then you get the image you want to see. If you only have a VAE that only knows how to encode and decode from text to token and token to array, it's still not good for much if it doesn't even know what it's looking at or ever see the results.

So far, based on this tutorial, https://www.kaggle.com/abhinand05/vision-transformer-vit-tutorial-baseline#, it looks like ViT-H-14 and ViT-L-16 are the best, achieving up to 99.74% accuracy between image and text predictions or vice versa. The notebook I linked possibly builds the 'vae' architecture into the 'perceptor' state_dict, but I'm not sure about that; take a look at it if you haven't seen it. I would really love to try loading ViT-L-16 or ViT-H-14 into perceptor/clip, but so far I can't figure out how to configure the .pt with the correct keys like "version", so it rejects my attempts at loading custom models. |
ViT > VAE lol |
@CDitzel I'm not an expert on the subject, but my understanding is that the DALL-E paper authors didn't use "vanilla" VQ-VAE (which requires the explicit codebook you were looking for). Everywhere in the paper they refer to a "dVAE", for instance:
It makes me think that they have 32x32x8192 logits as the encoder output, which get sampled/argmax'ed to obtain a 32x32x1 token tensor - this is consistent with the code and weights for the VAE that they've shared publicly.
I don't see any issue in applying a KL loss to the logits I've mentioned above.
How exactly did you arrive at this conclusion? |
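For reference, this is roughly how the publicly released dall_e code is driven (paraphrased from memory of OpenAI's usage example; treat the file names and helper signatures as approximate): the encoder emits 8192-way logits on a 32x32 grid, which are argmax'ed to tokens and re-expanded to one-hots for the decoder, with no explicit codebook matmul in between.

```python
import torch
import torch.nn.functional as F
from dall_e import load_model, map_pixels, unmap_pixels

device = torch.device("cpu")
enc = load_model("encoder.pkl", device)  # placeholder paths for the released weights
dec = load_model("decoder.pkl", device)

x = map_pixels(torch.rand(1, 3, 256, 256))   # dummy image mapped into the expected pixel range
z_logits = enc(x)                            # [1, 8192, 32, 32]: logits, no explicit codebook
tokens = torch.argmax(z_logits, dim=1)       # [1, 32, 32]: discrete token ids
z = F.one_hot(tokens, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()
x_rec = unmap_pixels(torch.sigmoid(dec(z).float()[:, :3]))
```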
Oh well, my thoughts are playing tricks on me, something which happens when I concern myself with one topic exclusively for too long xD I got a working Gumbel-Softmax dVAE, but I still have trouble training it on a larger data set, i.e. 50k samples. Would you be interested in talking about the specific architecture in more detail? Best regards from Germany to SPb (my girlfriend is from there ;)) |
Haha, thanks :) I'm currently working on a small-scale reimplementation of the paper for fun; I'll probably share the results as soon as I'm sure they are correct (surprisingly, I haven't seen anything that is close enough to the paper yet). Until then I'm open to discussing the details - this might help to "fill in the blanks" |
I claim that DALL-E has no working model, or else they wouldn't be able to stop making more and more examples. How could they just set it down and not update the site for months? How!!! |
I have to agree with you on this one. OpenAI's policy of reproducible research and being honest/open about their work has been questionable, to say the least. I am going to post the VAE Gumbel-Softmax snippet here soon as a reference, so that we can talk about it. First I have to get coffee and breakfast though xD |
All right, so this is what I have come up with so far. It closely resembles Lucid's implementation, but parameterizes the Gumbel-Softmax with the distance of the encoder output (logits) to the codebook vectors (as described in this paper) and akin to VQ-VAE, in contrast to Lucid's implementation, which uses the logits directly as input to the Gumbel. Phil's (and Karpathy's) implementation never worked for me when I rightfully included the KL loss, i.e. a KL loss > 0. With this implementation the KL loss can be included, as it should be, with a uniform prior. However, the results on a larger data set are still underwhelming and not really satisfying in terms of reconstruction quality. Maybe someone can take a look at it and assess the correctness of this implementation?
|
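The snippet referred to above is not reproduced in this thread, so here is a hedged sketch of the approach as described: a Gumbel-Softmax whose logits are the negative squared distances between the encoder output and the codebook vectors, plus a KL term against a uniform prior over the codes. All hyperparameters, names, and shapes are illustrative, not the original code.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class DistanceGumbelQuantizer(nn.Module):
    """Sketch: Gumbel-Softmax parameterized by the negative squared distance
    between encoder outputs and codebook vectors, with a KL term against a
    uniform prior over the codes. Sizes are illustrative."""

    def __init__(self, num_tokens=8192, dim=256):
        super().__init__()
        self.num_tokens = num_tokens
        self.codebook = nn.Embedding(num_tokens, dim)

    def forward(self, z_e, tau=1.0):
        # z_e: [B, C, H, W] encoder output
        b, c, h, w = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, c)              # [B*H*W, C]

        # Squared Euclidean distance to every codebook vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))              # [B*H*W, K]
        logits = -dist                                             # closer code -> larger logit

        # Relaxed assignment and weighted sum over the codebook
        soft_one_hot = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)
        z_q = (soft_one_hot @ self.codebook.weight).reshape(b, h, w, c).permute(0, 3, 1, 2)

        # KL(q || uniform) on the noise-free posterior over codes
        qy = F.softmax(logits, dim=-1)
        kl = (qy * (qy.clamp(min=1e-10).log() + math.log(self.num_tokens))).sum(-1).mean()
        return z_q, kl
```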
Yes, totally. And I don't see the words "white background" in the prompt.
On Wed, Apr 7, 2021, 10:11 AM afiaka87 wrote:
[image: <https://user-images.githubusercontent.com/37323518/113177250-4e474d00-921b-11eb-935d-23d4eee0a394.png>]
vqvae > tiny-vae (big memory), aka dall-e, the model that never was
Do you have a method for implementing the "image mask" feature? I've done extensive scraping of that blog post in particular (check the discussions tab for a scrape of all 1.1 million image-text pairs) and they aren't always clear about when they're using that feature. It's possible they're using a mask containing white pixels at the sides and a transparent square cropped in the center.
|
@CDitzel |
But logits are not between 0 and 1 either; could you please elaborate on "quite different from the scale of gumbel distribution"? |
@enhuiz When we call F.gumbel_softmax in PyTorch, it adds Gumbel(0, 1) noise to the provided logits. More details are here |
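For clarity, this is roughly what PyTorch's F.gumbel_softmax does internally (paraphrased; see the PyTorch source for the exact implementation):

```python
import torch

def gumbel_softmax_sketch(logits, tau=1.0, dim=-1):
    # Roughly what F.gumbel_softmax does: sample Gumbel(0, 1) noise, add it to
    # the raw logits, divide by the temperature, then take a softmax.
    gumbels = -torch.empty_like(logits).exponential_().log()  # ~ Gumbel(0, 1)
    return ((logits + gumbels) / tau).softmax(dim)
```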
But just passing the encoder outputs/logits doesn't make any guarantees about the value ranges either, does it? |
@sidml Thanks for the reply. I agree that the logits should have a similar range to g_i (i.e., roughly from -2 to 6), but since the latent space is learnable, I guess the scale of the distance can be learned to suit the scale of g_i. |
Have a look at this notebook: https://github.com/shaabhishek/gumbel-softmax-pytorch/blob/master/Gumbel-softmax%20visualization.ipynb - here they normalize the logits before passing them to Gumbel-Softmax. Apparently that is not necessary for the built-in PyTorch implementation. On another note, can someone explain the top row of Figure 1 in the original paper to me? It says "expectation", but how is that calculated, and how does it even depend on the temperature parameter? |
@CDitzel I believe they divide the logits by temperature before sampling from categorical distribution in Figure 1 of the paper. |
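One way to read that top row is as a Monte Carlo estimate of the expected relaxed sample at each temperature, which makes the temperature dependence visible. A small sketch under that assumption, with an illustrative three-class distribution:

```python
import torch
import torch.nn.functional as F

logits = torch.log(torch.tensor([0.6, 0.3, 0.1]))  # illustrative 3-class distribution

for tau in (0.1, 0.5, 1.0, 10.0):
    # Monte Carlo estimate of E[y] for relaxed samples at this temperature.
    samples = F.gumbel_softmax(logits.expand(10000, 3), tau=tau, hard=False)
    print(tau, samples.mean(dim=0))
# Low tau: samples are near one-hot, so the mean approaches the categorical
# probabilities; high tau: each sample is near uniform, so the mean flattens out.
```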
For the test procedure, might some changes be needed here?
By the way, if you want a weighted sum, why not use a plain softmax directly rather than Gumbel-Softmax?
|
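To illustrate the question above: a plain softmax gives a deterministic weighted sum over the codebook, while Gumbel-Softmax gives a stochastic relaxation whose samples approach one-hots as the temperature is annealed toward zero. A minimal sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 8192)       # illustrative per-position logits over the vocabulary
codebook = torch.randn(8192, 256)   # illustrative codebook

# Plain softmax: a deterministic weighted sum over codebook vectors.
z_soft = F.softmax(logits, dim=-1) @ codebook

# Gumbel-Softmax: a stochastic relaxation; annealing tau toward 0 drives each
# sample toward a one-hot pick, which a plain softmax never does.
z_gumbel = F.gumbel_softmax(logits, tau=0.5, hard=False, dim=-1) @ codebook
```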
I also tried all the existing Gumbel quantizers, but they all perform worse than the normal quantizer. However, I recently noticed from OpenAI's code for DALL-E that the encoder outputs probabilities which are taken directly as the input to the decoder, without using any codebook embedding. This leads me to the following dummy Gumbel quantizer, and I notice that with some hyper-parameter tweaking it even works better than, or on par with, previous Gumbel quantizer implementations:
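The quantizer itself is not shown in this dump; the following is a hedged reconstruction of the idea described (no codebook embedding at all, the relaxed one-hot goes straight to the decoder). Names and sizes are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F


class DummyGumbelQuantizer(nn.Module):
    """Sketch of the idea above: no codebook matmul at all. The relaxed one-hot
    over the vocabulary is passed straight to the decoder, whose first layer
    maps vocab_size channels back down to hidden channels."""

    def __init__(self, hidden_dim=256, vocab_size=8192):
        super().__init__()
        self.to_logits = nn.Conv2d(hidden_dim, vocab_size, 1)

    def forward(self, z_e, tau=1.0, hard=False):
        logits = self.to_logits(z_e)                                   # [B, K, H, W]
        soft_one_hot = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=1)
        return soft_one_hot                                            # fed directly to the decoder
```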
I think this means that something is wrong with the KL loss because it is almost useless |
Do I see it correctly that the code fragments provided by OpenAI, and the way you wired them up in the vae.py file, mean that there is no actual codebook in the form of an explicit nn.Parameter or nn.Embedding, and that the very first layer of the decoder serves as the vocabulary?
That would explain why I couldn't find any.
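A short sketch of why that reading makes sense: a 1x1 convolution applied to one-hot token channels is equivalent to an embedding lookup into its weight matrix, so the decoder's first layer can indeed play the role of the codebook. The layer and shapes below are stand-ins, not OpenAI's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 8192, 128
first_layer = nn.Conv2d(vocab_size, hidden, 1)   # stand-in for the decoder's first conv

tokens = torch.randint(0, vocab_size, (1, 32, 32))
one_hot = F.one_hot(tokens, vocab_size).permute(0, 3, 1, 2).float()
out = first_layer(one_hot)                       # [1, hidden, 32, 32]

# A 1x1 conv over one-hot channels just selects one column of its weight matrix
# per position, i.e. it behaves like an embedding table / codebook lookup.
table = first_layer.weight.squeeze(-1).squeeze(-1).t()                # [vocab_size, hidden]
manual = table[tokens].permute(0, 3, 1, 2) + first_layer.bias.view(1, -1, 1, 1)
print(torch.allclose(out, manual, atol=1e-5))                         # True
```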