I have computed histograms and confusion matrix on validation dataset #18
Hi, thanks for sharing your analysis. Here are a few comments/thoughts:
Hi @clennan, thanks for your answer. The reason I did a random shuffle of the predictions is that I was confused by the fact that the histogram resembles the target distribution of scores very closely (I see the same for my MobileNet V2 model), yet the correlation is just 0.6, as in the Google paper. A random shuffle, of course, removes the correlation, but it does not change the confusion matrix much. I guess the confusion matrix is how an end user sees the rankings, and it does not look very good. I did strong oversampling of the tails (elements with scores <4 and >7), but that did not change much for me: the model merely starts to assign low and high scores, still with the same corr=0.6. I computed the Spearman rank correlation for your model, SRCC=0.59, which is close to Google's. For my model, after long training with manual tweaks of the learning rate and batch size, I get SRCC=0.64 on the same validation dataset as yours. I also regularized the model by adding the moments of the histogram to the loss, and it helped the model converge faster (batch size=16, 1 epoch pretrain, 4 epochs lr=1e-4, 12 epochs lr=5e-6, 1 epoch lr=5e-7), e.g. my loss looks like
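Purely as an illustration of the idea of adding histogram moments to the loss (not the exact loss referred to above), a squared-EMD loss with extra mean/std matching terms might look roughly like this in Keras; `alpha` and `beta` are hypothetical weights:

```python
from keras import backend as K

def emd_with_moments_loss(y_true, y_pred, alpha=0.1, beta=0.1):
    """Squared EMD between the 10-bin score distributions, plus penalties on
    the difference of the first two histogram moments (mean and std)."""
    # squared earth mover's distance on the cumulative distributions
    cdf_true = K.cumsum(y_true, axis=-1)
    cdf_pred = K.cumsum(y_pred, axis=-1)
    emd = K.sqrt(K.mean(K.square(cdf_true - cdf_pred), axis=-1))

    # histogram moments, with the scores 1..10 as bin centres
    scores = K.cast(K.arange(1, 11), K.floatx())
    mean_true = K.sum(y_true * scores, axis=-1)
    mean_pred = K.sum(y_pred * scores, axis=-1)
    std_true = K.sqrt(K.sum(y_true * K.square(scores - K.expand_dims(mean_true)), axis=-1))
    std_pred = K.sqrt(K.sum(y_pred * K.square(scores - K.expand_dims(mean_pred)), axis=-1))

    return emd + alpha * K.abs(mean_true - mean_pred) + beta * K.abs(std_true - std_pred)
```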
Also, since we use a pretrained ImageNet model, I subtracted the ImageNet mean and divided by the ImageNet variance (the values can be googled) for the images, instead of just rescaling them to a fixed range.
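A minimal NumPy sketch of that normalization, assuming the widely published ImageNet statistics (an illustration only, not code from either repository):

```python
import numpy as np

# widely published ImageNet statistics (RGB order), for pixels scaled to [0, 1]
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_imagenet(img_uint8):
    """Subtract the ImageNet mean and divide by the ImageNet std, per channel."""
    x = img_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def rescale_only(img_uint8):
    """Plain rescaling to [-1, 1] (Inception/'tf'-style preprocessing), for comparison."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0
```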
Cool that you managed to improve the performance! Is the SRCC=0.64 with MobileNet or MobileNet V2? Would you be interested in contributing the model to the repository? :) I think the oversampling of tail samples already provided an improvement: even though the rank correlations do not improve much, the absolute values do, which makes the classifier more useful in practice (e.g. when working with thresholds). If you want to improve the model further, my suggestion is to focus on a specific domain (e.g. hotels, buildings, or people) and collect domain-specific training data. It becomes a much easier classification problem then.
@clennan Thanks, I will think about the version I want to publish. This is the MobileNet V2 network. The current version with SRCC=0.64 has no oversampling; as I said, it did not help in improving the correlation. Regarding preprocessing: I was initially using
Ah, interesting, I didn't know that MobileNetV2 uses Inception preprocessing instead of VGG preprocessing (which normalizes with ImageNet parameters) like MobileNet (https://github.com/keras-team/keras-applications/blob/df0e26c73951e107a84969944305f492a9abe6d7/keras_applications/imagenet_utils.py#L157). Let me know when you have decided on the model version you would like to publish :)
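For reference, the linked `imagenet_utils.preprocess_input` exposes the different schemes through its `mode` argument; a small sketch, assuming a Keras 2.2-era install where the wrapper forwards `mode`:

```python
import numpy as np
from keras.applications.imagenet_utils import preprocess_input

x = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype('float32')

x_caffe = preprocess_input(x.copy(), mode='caffe')  # RGB->BGR, subtract ImageNet means ("VGG" style)
x_tf = preprocess_input(x.copy(), mode='tf')        # scale to [-1, 1] ("Inception" style)
x_torch = preprocess_input(x.copy(), mode='torch')  # scale to [0, 1], then ImageNet mean/std
```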
Hi @clennan, here I have tried to put all my thoughts on AVA and TID2013 together and provided pretrained models: https://github.com/hcl14/AVA-and-TID2013-image-quality-assessment I refer to your repository frequently, as our models work very similarly, in my opinion.
Hi, a couple of comments on your analysis for the aesthetic model. Discussing performance is always difficult as expectations differ so much, so this is just my opinion :)
@clennan Thanks very much for the answer! Well, I'm not blaming either the model architecture or your use case; rather, I suspect the AVA dataset itself is imbalanced, as oversampling of underrepresented elements did not work for me. I suppose those elements do not represent the dependence between scores and image content well. It is also suspicious that the AVA model trained on patches shows almost the same correlation as the one trained on full images, which might indicate that it mostly grasps aesthetic features like color balance, etc. As for the TID2013 dataset, both my and your models sometimes show something like "reversed" scores for JPEG compression, when more compressed images get a higher score. I suppose this is because of the small number of examples in TID2013; I also found that using cosine similarity on the feature layer can sometimes be more accurate. I initially had the idea of approaching the variance problem by attaching another knowledge domain, such as word2vec. I have created a dataset of text descriptions for all AVA images using the available TensorFlow im2txt model. I've added this dataset to my repository, along with a simple model which tries to combine human knowledge about objects, obtained from word embeddings, with the MobileNetV2 model for AVA images. It does not show better performance yet (the correlation is still a little over 0.6), but one can try to play around with this approach.
Interesting idea. Maybe it would make sense to take the image embedding straight from the im2txt model, add it to the image embedding from the Nima model, and then use dense layers to predict aesthetic scores?! The im2txt embedding that is used to generate the descriptions might include information that helps to classify aesthetics, similar to your approach with word vectors, but it might be the more direct way to incorporate this information into Nima.
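A rough Keras sketch of that fusion idea; the embedding sizes, layer widths, and the plain cross-entropy loss are placeholders (in practice the real im2txt/NIMA embedding dimensions and the EMD loss would be used):

```python
from keras.layers import Input, Dense, Dropout, concatenate
from keras.models import Model

# hypothetical embedding sizes: 1280 for MobileNetV2 global pooling,
# 512 for an im2txt-style image embedding
nima_emb = Input(shape=(1280,), name='nima_image_embedding')
im2txt_emb = Input(shape=(512,), name='im2txt_image_embedding')

x = concatenate([nima_emb, im2txt_emb])
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
score_dist = Dense(10, activation='softmax', name='score_distribution')(x)

model = Model(inputs=[nima_emb, im2txt_emb], outputs=score_dist)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```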
Yes, I thought about network embedding, but
My idea is that we can tap into human associations for those pictures using another, relevant knowledge domain: word2vec context vectors. Suppose that some objects like "piece of paper", "letters" or "humans with flowers instead of their heads" (example images from the original paper) in an image description refer to something either not aesthetic (writing) or controversial (an unusual combination of objects) and drag the human opinion score down / increase the variance; then this dependency may in theory be captured by pretrained word embeddings such as GoogleNews, and generalized more or less to other objects thanks to the cosine similarity of embeddings for contextually related words. Of course, maybe you are right and a special embedding could be trained somehow, but I don't have good ideas at the moment.
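A small gensim sketch of that point, assuming the pretrained GoogleNews vectors have been downloaded locally (the file name below is the usual one; adjust the path as needed):

```python
from gensim.models import KeyedVectors

# pretrained GoogleNews embeddings (large binary file, downloaded separately)
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                        binary=True)

# cosine similarity between contextually related words
print(w2v.similarity('letter', 'paper'))
print(w2v.similarity('flower', 'head'))
```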
I am attempting to train a model on AVA myself and faced very low quality of predictions. Digging further, I found the predictions to be very strange, so I started to investigate pretrained models and ask for help.
Let me present my analysis using your model as the example, since it performs very similarly to mine.
First, let's take a look at a histogram of predicted mean scores and standard deviations (Fig. 9 of the original paper):
Here is yours, built on the validation dataset you provided:
Mine looks similar as well. From the histogram one could conclude that the model does not output scores >7 and <3, but that otherwise it mirrors the real distribution well.
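A minimal sketch of how such a histogram can be produced, assuming `mean_true` and `mean_pred` are 1-D arrays of ground-truth and predicted mean scores (hypothetical names):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_mean_score_histograms(mean_true, mean_pred):
    """Overlay histograms of ground-truth and predicted mean scores."""
    bins = np.linspace(1, 10, 50)
    plt.hist(mean_true, bins=bins, alpha=0.5, label='ground truth')
    plt.hist(mean_pred, bins=bins, alpha=0.5, label='predicted')
    plt.xlabel('mean score')
    plt.ylabel('number of images')
    plt.legend()
    plt.show()
```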
However, when I started to check images for my model manually, I found that the scores seem very scattered and often look totally inadequate.
Let's compute the accuracy of the predicted scores as `Mean(1 - |m' - m| / m)`, where `m'` is the predicted score and `m` is the ground truth. For your model it is 91.1%, for my model it is 90.8%. That seems good so far, but we need to remember that for a score of 5 it means a mean error of 5 - 0.9*5 = 0.5, and the error is even bigger in the tails of the dataset distribution. The standard deviation of the differences between scores is 0.37. But let's compute the confusion matrix (mine is similar to yours):
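A minimal NumPy sketch of this metric (the full code is in the gist linked below); `m_true` and `m_pred` are hypothetical arrays of ground-truth and predicted mean scores:

```python
import numpy as np

def score_accuracy(m_true, m_pred):
    """Mean(1 - |m' - m| / m) over the validation set."""
    return np.mean(1.0 - np.abs(m_pred - m_true) / m_true)

def diff_std(m_true, m_pred):
    """Standard deviation of the differences between predicted and true scores."""
    return np.std(m_pred - m_true)
```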
And for comparison, let's randomly shuffle the predictions:
We can see that score 5 still dominates because of its abundance in the dataset. The other rows are not much different from the model's predictions. The accuracy is 0.86. Also, I want to mention that when I tried to create a balanced validation set, with the score ranges <4, 4-7 and >7 represented equally, I got 83% accuracy and 0.5 std of the differences, which is equal to the result above.
I've shared all the code in a gist:
https://gist.github.com/hcl14/d641f82922ce11cee0164b16e6786dfb
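As a minimal sketch of the comparison done there (scikit-learn assumed; `m_true`/`m_pred` again stand for arrays of ground-truth and predicted mean scores):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def score_confusion_matrices(m_true, m_pred):
    """Confusion matrix over rounded scores, plus a shuffled-prediction baseline."""
    labels = np.arange(1, 11)
    y_true = np.round(m_true).astype(int)
    cm_model = confusion_matrix(y_true, np.round(m_pred).astype(int), labels=labels)

    # baseline: destroy any image/prediction correspondence by shuffling
    m_shuffled = np.random.permutation(m_pred)
    cm_shuffled = confusion_matrix(y_true, np.round(m_shuffled).astype(int), labels=labels)
    return cm_model, cm_shuffled
```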
Also, here are the correlation coefficients for the score predictions:
The paper also reports values of 0.5-0.6.
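Both coefficients (linear correlation and Spearman rank correlation) can be computed with SciPy; a short sketch with the same hypothetical `m_true`/`m_pred` arrays:

```python
from scipy.stats import pearsonr, spearmanr

def score_correlations(m_true, m_pred):
    """Linear (LCC) and Spearman rank (SRCC) correlation of mean scores."""
    lcc, _ = pearsonr(m_true, m_pred)
    srcc, _ = spearmanr(m_true, m_pred)
    return lcc, srcc
```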
It would be great to hear some insights on this. Is it overfitting caused by the Adam optimizer, and did we really need to optimize via SGD with lr=1e-7 as in the paper?