
I have computed histograms and confusion matrix on validation dataset #18

Closed
hcl14 opened this issue Dec 22, 2018 · 11 comments
Labels: question (Further information is requested)

Comments

hcl14 commented Dec 22, 2018

I am attempting to train a model on AVA myself and ran into very low prediction quality. Digging further, I found the predictions to be very strange, so I started investigating the pretrained models and am asking for help.

Let me present my analysis using your model as an example, since its performance is very similar to mine.

First, let's take a look at the histogram of predicted mean scores and standard deviations (Fig. 9 of the original paper):

[Fig. 9 from the original paper: histogram of mean scores and standard deviations]

Here is yours, built on the validation dataset you provided:

[histogram of predicted mean scores and standard deviations on the provided validation dataset]

Mine looks similar. From the histogram one might conclude that the model does not output scores > 7 or < 3, but otherwise mirrors the real distribution well.

However, when I started checking images for my model manually, I found that the scores seemed very scattered and often looked completely inadequate.

Let's compute the accuracy of predicted scores as Mean(1 - |m' - m| / m), where m' is the predicted score and m the ground truth. For your model it is 91.1%, for mine 90.8%. That seems good so far, but remember that for a score of 5 this corresponds to a mean error of 5 - 0.9*5 = 0.5, and the error is even larger at the tails of the dataset distribution. The standard deviation of the score differences is 0.37.

But let's compute the confusion matrix (mine is similar to yours):


Got labels: 25548
[[   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0   18  100   84    9    0    0    0    0    0]
 [   1   21  518 3027 1832  178    2    0    0    0]
 [   0    7  144 3472 9527 2660  129    0    0    0]
 [   0    0    3  147 1683 1786  198    2    0    0]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]]
accuracy:0.9111294033307482
standard deviation of score differences:0.3676722622236499

And for comparison, let's randomly shuffle the predictions:

random shuffle:
[[   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    5   53  109   40    4    0    0    0]
 [   0    9  188 1440 2874  992   76    0    0    0]
 [   1   31  460 4201 8144 2903  197    2    0    0]
 [   0    6  112 1036 1924  689   52    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    0]]
accuracy:0.8577966426127535
standard deviation of score differences:0.5579391415989003

We can see that score 5 still dominates because of its abundance in the dataset. The other rows are not much different from the model's predictions. Accuracy is 0.86. I also want to mention that when I tried to create a balanced validation set where scores < 4, 4-7 and > 7 are represented equally, I got 83% accuracy and a 0.5 std of differences, which is comparable to the shuffled result above.

I've shared all the code in a gist:
https://gist.github.com/hcl14/d641f82922ce11cee0164b16e6786dfb
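
In essence, the metrics above are computed along these lines (a simplified sketch, not the exact gist code; y_true and y_pred are the ground-truth and predicted mean scores):

import numpy as np
from sklearn.metrics import confusion_matrix

def score_report(y_true, y_pred):
    # accuracy as defined above: Mean(1 - |m' - m| / m)
    accuracy = np.mean(1.0 - np.abs(y_pred - y_true) / y_true)
    std_diff = np.std(y_pred - y_true)
    # round mean scores to integer bins 1..10 for a 10x10 confusion matrix
    cm = confusion_matrix(np.round(y_true).astype(int),
                          np.round(y_pred).astype(int),
                          labels=list(range(1, 11)))
    return accuracy, std_diff, cm

# shuffled baseline for comparison (see below)
# score_report(y_true, np.random.permutation(y_pred))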

Here are also the correlation coefficients for the score predictions:

Pearson: (0.6129669517144579, 0.0)
Spearman: SpearmanrResult(correlation=0.5949598193491837, pvalue=0.0)

The paper also reports values of 0.5-0.6.
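
These come straight from scipy, roughly like this:

from scipy.stats import pearsonr, spearmanr

# y_true, y_pred: ground-truth and predicted mean scores on the validation set
print('Pearson:', pearsonr(y_true, y_pred))
print('Spearman:', spearmanr(y_true, y_pred))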

It would be great to hear some insights on this. Is it overfitting caused by the Adam optimizer, and did we really need to optimize with SGD at lr = 1e-7 as in the paper?

clennan (Collaborator) commented Dec 24, 2018

Hi, thanks for sharing your analysis. Here are a few comments/thoughts:

  1. Given that the AVA dataset has so many samples with mean scores in the 4 to 6 range, any model trained on this dataset will struggle to predict mean scores outside of this range. So I would start assessing the model not on absolute predictions but more on relative ones, i.e. rankings.

  2. We trained our internal model on an in-house hotel dataset which is more evenly distributed across mean scores, and the model is better at predicting scores < 4 and > 7. So maybe you could try to balance the training data, similarly to what you have done for the validation data, and re-train your model (a rough sketch of one way to do this follows this list).

  3. Regarding your confusion matrix analysis, not sure whether shuffling predictions is a valid approach. Maybe try to compare the predictions against random scores?!

  4. I don’t think the model is overfitted - we used a slightly higher learning rate than the one reported in the paper for the convolutional weights, but I also don’t know whether they used this learning rate on all models, or maybe just the best performing one. In terms of EMD, VGG16 and Inception-v2 are the winning models, and from my experience it is a much more subtle issue to fine-tune these architectures than MobileNet, thus requiring a more conservative learning rate regime.
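
On point 2, a very rough sketch of balancing by mean-score bucket (assuming a pandas DataFrame df with a mean_score column; the bucket edges and target count are placeholders, not our actual pipeline):

import pandas as pd

# hypothetical DataFrame with one row per training image and its ground-truth mean score
# df = pd.DataFrame({'image_id': [...], 'mean_score': [...]})

buckets = pd.cut(df['mean_score'], bins=[0, 4, 7, 10], labels=['low', 'mid', 'high'])
target = 5000  # placeholder: desired number of samples per bucket

balanced_df = (
    df.groupby(buckets, group_keys=False)
      # oversample sparse buckets (with replacement), downsample the dominant one
      .apply(lambda g: g.sample(n=target, replace=len(g) < target, random_state=42))
)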

clennan added the question label on Dec 24, 2018
hcl14 (Author) commented Dec 24, 2018

Hi @clennan, thanks for your answer. The reason I randomly shuffled the predictions is that I was confused by the fact that the histogram resembles the target score distribution very closely (I see the same for my MobileNet V2 model), while the correlation is just 0.6, as in the Google paper. Random shuffling, of course, removes the correlation, but it does not change the confusion matrix much. I guess the confusion matrix is how an end user perceives the rankings, and it does not look very good.

I did aggressively oversample the tails (samples with scores < 4 and > 7), but that did not change much for me; the model just starts to assign low and high scores, still with the same correlation of 0.6.

I reported the Spearman rank correlation for your model above, SRCC = 0.59, which is close to Google's. For my model, after long training with manual tweaks of the learning rate and batch size, I get SRCC = 0.64 on the same validation dataset as yours.

I also added the first and second moments of the score histogram to the loss as a regularizer, and it helped the model converge faster (batch size = 16, 1 epoch of pretraining, 4 epochs at lr = 1e-4, 12 epochs at lr = 5e-6, 1 epoch at lr = 5e-7). My loss looks like this:

# assumes TensorFlow 1.x with standalone Keras
import numpy as np
import tensorflow as tf
from keras import backend as K

# score bins 1..10 as a column vector; squares (and cubes, unused here) for higher moments
distribution_elements_row = tf.constant(np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]), dtype='float32', name='marks')
distribution_elements = K.expand_dims(distribution_elements_row, -1)
distribution_elements_square = K.square(distribution_elements)
distribution_elements_cube = K.pow(distribution_elements, 3)


# RMSE between the first moments (means) of the true and predicted score distributions
def first_moment(y_true, y_pred):
    means_true = K.dot(y_true, distribution_elements)
    means_pred = K.dot(y_pred, distribution_elements)
    return K.sqrt(K.mean(K.square(means_true - means_pred)))


# RMSE between the second central moments (variances) of the true and predicted score distributions
def second_moment(y_true, y_pred):
    means_true = K.dot(y_true, distribution_elements)
    means_pred = K.dot(y_pred, distribution_elements)

    second_true = K.dot(y_true, distribution_elements_square)
    second_pred = K.dot(y_pred, distribution_elements_square)

    # Var(x) = E(x^2) - (E(x))^2
    second_true = second_true - K.square(means_true)
    second_pred = second_pred - K.square(means_pred)

    return K.sqrt(K.mean(K.square(second_true - second_pred)))


# NIMA EMD loss, from https://github.com/titu1994/neural-image-assessment/blob/master/train_mobilenet.py
def earth_mover_loss(y_true, y_pred):
    cdf_true = K.cumsum(y_true, axis=-1)
    cdf_pred = K.cumsum(y_pred, axis=-1)
    emd = K.sqrt(K.mean(K.square(cdf_true - cdf_pred), axis=-1))
    return K.mean(emd)


def my_loss(y_true, y_pred):
    # The moments part is approximately 1.2 for a fitted model (bin_acc > 0.75).
    # The mean EMD per batch is around 0.05-0.07 with the vanilla EMD loss.
    # So this loss is 3.5-4.8, where the EMD term contributes ~3.5 and the moments ~1.0.
    # Regularizing makes the mean fit faster: you need ~15 epochs instead of 40-50.
    return 50 * earth_mover_loss(y_true, y_pred) + 2 * first_moment(y_true, y_pred) + second_moment(y_true, y_pred)
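
For completeness, using it is just a matter of compiling with it (the optimizer and learning rate here are placeholders, not my exact schedule):

from keras.optimizers import Adam

# 'model' is the NIMA-style network being fine-tuned
model.compile(optimizer=Adam(lr=1e-4), loss=my_loss)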

hcl14 (Author) commented Dec 24, 2018

Also, since we use a model pretrained on ImageNet, I subtracted the ImageNet mean and divided by the ImageNet standard deviation (the values can be googled) for the images, instead of just scaling them to [-1, 1] as the vanilla .preprocess_input() function does.
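
Concretely, something like this (using the commonly quoted ImageNet channel statistics; double-check the exact values for your pipeline):

import numpy as np

# commonly quoted ImageNet per-channel statistics for RGB images scaled to [0, 1]
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_imagenet(x):
    # x: float array of RGB values in [0, 255], shape (..., 3)
    x = x.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD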

clennan (Collaborator) commented Dec 24, 2018

Cool that you managed to improve the performance. Is the SRCC = 0.64 with MobileNet or MobileNetV2? Would you be interested in contributing the model to the repository? :)

I think the oversampling of tail samples already provides an improvement: even though the rank correlations do not improve much, the absolute values do, which makes the classifier more useful in practice (e.g. when working with thresholds).

If you want to improve the model further, my suggestion is to focus on a specific domain (e.g. hotels, buildings, or people) and collect domain-specific training data. It becomes a much easier classification problem then.

And .preprocess_input() does not only scale to [-1, 1], it also does the mean subtraction for the colour channels.

hcl14 (Author) commented Dec 26, 2018

@clennan Thanks, I will think about which version I want to publish. It is a MobileNetV2 network. The current version with SRCC = 0.64 uses no oversampling; as I said, it did not help improve the correlation.

Regarding keras.applications.mobilenetv2.preprocess_input:

I was initially using https://github.com/keras-team/keras-applications/blob/master/keras_applications/mobilenet_v2.py:

def preprocess_input(x, **kwargs):
    """Preprocesses a numpy array encoding a batch of images.
    This function applies the "Inception" preprocessing which converts
    the RGB values from [0, 255] to [-1, 1]. Note that this preprocessing
    function is different from `imagenet_utils.preprocess_input()`.
    # Arguments
        x: a 4D numpy array consists of RGB values within [0, 255].
    # Returns
        Preprocessed array.
    """
    x /= 128.
    x -= 1.
    return x.astype(np.float32)

clennan (Collaborator) commented Jan 2, 2019

Ah, interesting, I didn't know that MobileNetV2 uses Inception preprocessing instead of VGG preprocessing (which normalizes with ImageNet parameters) like MobileNet (https://github.com/keras-team/keras-applications/blob/df0e26c73951e107a84969944305f492a9abe6d7/keras_applications/imagenet_utils.py#L157).

Let me know when you have decided on the model version you would like to publish :)

hcl14 (Author) commented Jan 2, 2019

Hi @clennan, here I have tried to put all my thoughts on AVA and TID2013 together and have provided pretrained models:

https://github.com/hcl14/AVA-and-TID2013-image-quality-assessment

I refer to your repository frequently, as our models behave very similarly, in my opinion.

clennan (Collaborator) commented Jan 3, 2019

Hi, a couple of comments on your analysis for the aesthetic model. Discussing performance is always difficult as expectations differ so much, so this is just my opinion :)

  • if a rank correlation coefficient of 0.6 is garbage for you, then that's understandable, but I disagree with your statement that it is inevitable to get garbage results from such a model. The aesthetic model tries to predict something very abstract and subjective, and it is trained on a very diverse dataset (AVA). To me this seems like an impossible task, way harder than any image recognition task, so my expectations were low to begin with. And what you consider garbage was actually not garbage for us. We tested some hotel images on the model trained solely on AVA and were surprised how well it ranked them, see here e.g. bedrooms:

[example ranking of hotel bedroom images]

  • as with any ML model, the most important performance driver is the data you feed it - so once we narrowed down the domain (hotel images) and put together our own dataset, the model's performance improved and we were seeing rank correlations around 0.77. So we took inspiration from a weakly performing general aesthetic model and made it provide business value for a specific domain. My advice is: don't spend so much time optimising learning rates etc., but find an application domain for your ML model (e.g. hotels, cars) and label your own dataset.

hcl14 (Author) commented Jan 3, 2019

@clennan Thanks very much for the answer! Well, I am actually not blaming either the model architecture or your use case, but I do suppose the AVA dataset itself is imbalanced, since oversampling the underrepresented samples did not work for me - I suspect those samples do not represent the dependence between scores and image content well. It is also suspicious that an AVA model trained on patches shows almost the same correlation as one trained on full images, which might indicate that it mostly grasps low-level aesthetic features like color balance, etc. As for the TID2013 dataset, both my model and yours sometimes show 'reversed' scores for JPEG compression, where more compressed images get higher scores. I suppose this is because of the small number of examples in TID2013; I also found that using cosine similarity on the feature layer can sometimes be more accurate.

I initially had the idea of approaching the variance problem by attaching another knowledge domain, such as word2vec. I created a dataset of text descriptions for all AVA images using the available TensorFlow im2txt model. I've added this dataset to my repository, along with a simple model that tries to combine human knowledge about objects obtained from word embeddings with the MobileNetV2 model for AVA images. It does not show better performance yet (the correlation is still a little over 0.6), but one can play around with this approach.

clennan (Collaborator) commented Jan 4, 2019

Interesting idea - maybe it would make sense to take the image embedding straight from the im2txt model, combine it with the image embedding from the Nima model, and then use dense layers to predict the aesthetic scores?! The im2txt embedding that is used to generate the descriptions might include information that helps to classify aesthetics, similar to your approach with word vectors, but it might be a more direct way to incorporate this information into Nima.
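
Very roughly, something like this (the embedding sizes and layer widths are made up, and the two inputs stand for precomputed feature vectors from Nima and im2txt):

from keras.layers import Input, Concatenate, Dense
from keras.models import Model

# placeholder embedding sizes, purely illustrative
nima_features = Input(shape=(1280,), name='nima_features')      # e.g. global-pooled MobileNetV2 features
im2txt_features = Input(shape=(512,), name='im2txt_features')   # e.g. im2txt image embedding

x = Concatenate()([nima_features, im2txt_features])
x = Dense(256, activation='relu')(x)
score_distribution = Dense(10, activation='softmax')(x)         # 10 score bins, as in Nima

combined = Model([nima_features, im2txt_features], score_distribution)
# combined.compile(optimizer='adam', loss=earth_mover_loss)     # EMD loss as in the snippet above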

hcl14 (Author) commented Jan 4, 2019

Yes, I thought about using the network's embedding, but:

  1. from what I understood about how im2txt was trained, a significant part of the performance can already be gained by training the LSTM over a frozen network with ImageNet weights (i.e. the features are unchanged),
  2. the LSTM in im2txt uses a predefined vocabulary (from the COCO dataset), assigning those words to image features (without word2vec or any extra knowledge).

My idea is that we can dig into human associations for those pictures using another, relevant knowledge domain - word2vec context vectors. Suppose that objects such as "piece of paper", "letters" or "humans with flowers instead of their heads" (example images from the original paper) appearing in an image description refer to something either unaesthetic (writing) or controversial (an unusual combination of objects) and drag the human opinion score down or increase its variance - then this dependency could in theory be captured by pretrained word embeddings such as GoogleNews, and more or less generalized to other objects because of the cosine similarity of embeddings for contextually related words.
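
As a toy illustration of that last point (the vectors here are random stand-ins; in practice they would come from pretrained GoogleNews word2vec embeddings):

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.RandomState(0)
vec_letters = rng.randn(300)                      # stand-in for the vector of "letters"
vec_paper = vec_letters + 0.3 * rng.randn(300)    # a contextually related word lies nearby

# high similarity suggests a score bias learned for one word can transfer to related words
print(cosine_similarity(vec_letters, vec_paper))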

Of course, maybe you are right and a special embedding can be trained somehow, but I don't have good ideas at the moment.
