The main problem we faced here is overfitting and methods to avoid it. Regularization is defined as follows:
“Any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”
We talked about several Regularization techniques:
The main idea here is to limit the model complexity by adding a parameter norm penalty, denoted as Ω(θ), to the objective function J: $ \tilde{J}(\theta;X, y) = J(\theta;X, y) + \alpha\Omega(\theta)$.
Importantly, the parameter vector θ here represents the weights only and not the biases.
In L² parameter regularization (also known as weight decay) the penalty is $\Omega(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2}w^Tw$, and the regularized objective becomes
$ \tilde{J}(w;X, y) = \frac{\alpha}{2}w^Tw + J(w;X,y) $
Here, α serves as the regularization strength. It is a hyperparameter, and its value is tuned (typically on a validation set) to obtain good results.
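As a minimal sketch (NumPy; the synthetic data, the plain squared-error loss, and the learning rate are assumptions made purely for illustration), the L² penalty and its gradient can be added to the unregularized objective like this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # toy design matrix (synthetic data)
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

alpha = 0.1                                   # regularization strength (hyperparameter)
w = np.zeros(5)

def l2_regularized_loss_and_grad(w, X, y, alpha):
    """Squared-error loss J plus the L2 penalty (alpha/2) * w^T w (sketch)."""
    residual = X @ w - y
    loss = 0.5 * np.mean(residual ** 2) + 0.5 * alpha * w @ w
    grad = X.T @ residual / len(y) + alpha * w     # the penalty adds alpha * w to the gradient
    return loss, grad

# one gradient-descent step as a usage example
loss, grad = l2_regularized_loss_and_grad(w, X, y, alpha)
w -= 0.01 * grad
```

The extra α·w term in the gradient is what shrinks (“decays”) the weights towards zero at every step.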
In L¹ parameter regularization the penalty is the sum of the absolute values of the weights, $\Omega(w) = \|w\|_1$, and the regularized objective becomes
$ \tilde{J}(w;X, y) = \alpha \|w\|_1 + J(w;X,y)$
Unlike the L² penalty, the L¹ penalty tends to drive some weights exactly to zero, producing sparse solutions that can also be used for feature selection.
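For comparison, a minimal sketch of the L¹-regularized objective under the same kind of toy setup (using sign(w) as a subgradient of |w| at w = 0 is an assumption made for simplicity):

```python
import numpy as np

def l1_regularized_loss_and_grad(w, X, y, alpha):
    """Squared-error loss J plus the L1 penalty alpha * ||w||_1 (sketch)."""
    residual = X @ w - y
    loss = 0.5 * np.mean(residual ** 2) + alpha * np.sum(np.abs(w))
    grad = X.T @ residual / len(y) + alpha * np.sign(w)   # sign(w) as a subgradient at w = 0
    return loss, grad
```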
Data Augmentation addresses the problem of limited or insufficiently diverse data so that the neural network can still learn effectively. We can think of Data Augmentation as adding noise to our dataset: the idea is to create new training examples when the data supply is limited.
Following are the most popular Data Augmentation techniques (a small sketch is given after the list):
- Flip
- Rotation
- Scale
- Crop
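A minimal sketch of how such augmentations could be applied to an image stored as a NumPy array (the crop fraction, the restriction to 90° rotations, and the synthetic image are assumptions for illustration; real pipelines typically use a library such as torchvision or albumentations):

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped, rotated and cropped copy of an (H, W, C) image."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                          # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))     # rotation by a multiple of 90 degrees
    h, w = out.shape[:2]
    top = int(rng.integers(0, h // 8 + 1))             # random crop, up to 1/8 of the border
    left = int(rng.integers(0, w // 8 + 1))
    return out[top:top + h - h // 8, left:left + w - w // 8, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                        # toy image (synthetic data)
augmented = augment(image, rng)
```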
For some models, adding noise with extremely small variance to the inputs is equivalent to imposing a penalty on the norm of the weights. Similarly, we can also add noise to the weights themselves. There are several interpretations of this; one is that adding noise to the weights is a stochastic implementation of Bayesian inference over the weights: the weights are treated as unknown, and a probability distribution models this uncertainty. Because it makes learning more stable, it is considered a type of Regularization.
For example, consider the linear regression case: for each input vector x, we learn a mapping ŷ(x) so as to reduce the mean squared error.
Now suppose we add zero-mean Gaussian random noise ε to the weights. In that case, we are still interested in learning a mapping that reduces the mean squared error.
Including noise in the weights is then equivalent to adding a regularization term Ω(θ), which pushes the model towards solutions that are insensitive to small perturbations of the weights. This helps stabilise training.
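A minimal sketch of noise injection on the weights in the linear-regression setting above (the noise standard deviation, learning rate, and synthetic data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # toy inputs (synthetic data)
y = X @ rng.normal(size=5)                   # toy targets

w = np.zeros(5)
lr, noise_std = 0.01, 0.01                   # illustrative values

for step in range(1000):
    eps = rng.normal(scale=noise_std, size=w.shape)  # zero-mean Gaussian noise on the weights
    residual = X @ (w + eps) - y                     # forward pass uses the perturbed weights
    grad = X.T @ residual / len(y)                   # gradient of the mean squared error
    w -= lr * grad                                   # update the unperturbed weights
```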
Overfitting shows up as a decreasing training error together with an increasing validation error once the model has enough representational capacity.
In such a scenario, the best thing to do is to go back to the earlier point with the lowest validation error. To check for improvement, we track the validation metric after each epoch and keep saving the best parameter configuration found so far. When training ends, the best saved parameters are returned.
This is exactly what Early Stopping does: when there is no improvement in the validation error for a fixed number of iterations, the algorithm is terminated. Reducing the number of steps used to fit the model effectively reduces its capacity.
Compared with weight decay: there, the weight-decay coefficient has to be tuned manually, and a badly chosen coefficient can suppress the weights too strongly and lead the model to a poor (local) minimum.
With Early Stopping, no such manual tuning of a coefficient is needed. Under a quadratic approximation of the objective, Early Stopping is considered equivalent to L² regularization (weight decay), with the number of training steps playing the role of the regularization strength.
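A minimal sketch of the early-stopping bookkeeping (the `model`, `train_one_epoch`, and `validation_error` callables, the patience value, and the epoch budget are placeholders, not part of the original notes):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=100, patience=10):
    """Track the validation metric each epoch, keep the best parameters,
    and stop once there has been no improvement for `patience` epochs."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)                 # one epoch of training
        error = validation_error(model)        # evaluate on the held-out validation set
        if error < best_error:
            best_error = error
            best_model = copy.deepcopy(model)  # save the best configuration so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # no improvement for `patience` epochs
    return best_model, best_error
```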
Parameter Tying and Parameter Sharing are another way of incorporating prior knowledge into the training process.
Parameter Tying expresses that certain parameters, taken from two different models (say A and B), should be close to each other.
This is done by extending their loss functions with the additive Regularization term $\Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2$.
With Parameter Sharing we force sets of parameters in one (or multiple) models to be equal. This is used heavily, for example, in CNN training, where the feature detectors of a layer are applied across the whole input using the same shared parameter set.
The advantage is a massive reduction in memory (which makes it possible to train larger models) and another way of encoding prior knowledge.
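A minimal sketch of the tying penalty stated above, written for two weight vectors (the function name and the toy inputs are assumptions; the quadratic form is the standard $\|w^{(A)} - w^{(B)}\|_2^2$ choice):

```python
import numpy as np

def tying_penalty_and_grads(w_a, w_b, alpha):
    """Penalty alpha * ||w_a - w_b||_2^2 and its gradients w.r.t. both parameter sets."""
    diff = w_a - w_b
    penalty = alpha * np.sum(diff ** 2)
    return penalty, 2.0 * alpha * diff, -2.0 * alpha * diff  # d/dw_a and d/dw_b

# usage: add `penalty` to the combined loss and the gradient terms to each model's gradient
penalty, g_a, g_b = tying_penalty_and_grads(np.ones(3), np.zeros(3), alpha=0.1)
```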
Here we improve generalization by (approximately) training the ensemble of all sub-networks of a NN at low computational cost.
A subnet is a subgraph of the original NN connecting (at least some) input neurons to the output. When training a subnet, its weights are shared with the original net.
Since there are far too many subnets to train them all, we sample a subnet by including each non-output neuron independently with some probability (typically 0.8 for input units and 0.5 for hidden units).
The training routine is then: sample a subnet, train the subnet, adjust the weights in the original net, and iterate.
Once training is done, one can predict new data points by the usual forward propagation, but with each outgoing weight of a unit multiplied by the probability with which that unit was included during training (the weight-scaling inference rule).
In conclusion, Dropout provides a cheap Regularization method (implementable in $\mathcal{O}(n)$) that forces the model to generalize, since it is trained with subnets of various topologies and smaller capacity.
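A minimal sketch of dropout on a layer of hidden activations (the keep probability of 0.5 and the toy activations are assumptions; the test-time scaling corresponds to the weight-scaling inference rule mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, keep_prob, train):
    """Drop each hidden unit independently with probability 1 - keep_prob during
    training; at prediction time scale the activations by keep_prob instead."""
    if train:
        mask = rng.random(h.shape) < keep_prob   # sample a subnetwork
        return h * mask
    return h * keep_prob                         # approximate the ensemble prediction

h = rng.normal(size=(4, 8))                      # toy batch of hidden activations
h_train = dropout_forward(h, keep_prob=0.5, train=True)
h_test = dropout_forward(h, keep_prob=0.5, train=False)
```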
Motivation: there are many models capable of reaching (or even exceeding) human performance on specific tasks (such as Chess or Go). But do they also acquire a human-like understanding of the game?
No, they don't. This can be seen very well in object recognition tasks: often a little noise applied to the input image (barely perceptible to a human) leads to a huge change in the prediction. Such inputs are called Adversarial examples.
By training a model on many Adversarial examples, one tries to force the model to implement a prediction-stable plateau around each of the training data points, because prior knowledge tells us that two different objects in an image differ by more in pixel values than a little noise could inject.
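One common way to construct such examples is the fast gradient sign method (FGSM), $x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$. The sketch below applies it to a toy logistic-regression model so that the input gradient can be written in closed form; the model, data, and ε value are assumptions for illustration:

```python
import numpy as np

def fgsm_example(x, y, w, b, epsilon):
    """Fast gradient sign method for a logistic-regression model (sketch):
    perturb x by epsilon in the direction that increases the cross-entropy loss."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # predicted probability of class 1
    grad_x = (p - y) * w                     # gradient of the loss w.r.t. the input
    return x + epsilon * np.sign(grad_x)

rng = np.random.default_rng(0)
x = rng.random(10)                           # toy "image" flattened to a vector
w, b = rng.normal(size=10), 0.0              # toy model parameters
x_adv = fgsm_example(x, y=1.0, w=w, b=b, epsilon=0.01)
```

Adversarial training then mixes such perturbed inputs, together with their original labels, into the training set.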
Q: When is Data Augmentation not appropriate?
A: It is not appropriate when the transformation would change the correct class. You can see this e.g. on slide 7: if we rotate a six by 180°, it becomes impossible to distinguish the digits 6 and 9.
Q: How do I decide how many supervised and unsupervised examples to take to train a semi-supervised model?
A: This question cannot be answered in general; it depends on the model. You have to try different proportions of supervised and unsupervised examples and take the proportion that yields good results.
A: We mainly use the geometric mean. It usually improves the performance.
Q: How does Dropout compare to bagging?
A: They are similar in many ways: in both we infer from an ensemble of models. Some differences: in the case of bagging, all models are independent and each is trained till convergence. In the case of dropout, only a small fraction of the sub-networks is actually trained.
Q: What are the advantages and disadvantages of label smoothing?
A: Label smoothing prevents the pursuit of hard probabilities without discouraging correct classification. The disadvantage of label smoothing is that it slightly lowers the quality of the training data.
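As a small illustration of the mechanism (the ε/(k−1) formulation is one common variant; the class count and ε value are assumptions):

```python
import numpy as np

def smooth_labels(one_hot, epsilon):
    """Replace the hard 0/1 targets of a (batch, k) one-hot matrix with
    1 - epsilon and epsilon / (k - 1) respectively."""
    k = one_hot.shape[1]
    return one_hot * (1.0 - epsilon) + (1.0 - one_hot) * epsilon / (k - 1)

labels = np.eye(4)[[0, 2, 3]]                # toy one-hot labels for k = 4 classes
smoothed = smooth_labels(labels, epsilon=0.1)
```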
Q: What is the significant cost of choosing hyperparameters automatically via early stopping? Is there any additional cost?
A: The only significant cost is running the validation set evaluation periodically during training. An additional cost is the need to maintain a copy of the best parameters.
A: Parameters that correspond to directions of significant curvature tend to learn early relative to parameters corresponding to directions of less curvature.
Q: What is an advantage of early stopping over weight decay?
A: Early stopping automatically determines the correct amount of regularization, while weight decay requires many training experiments with different values of its hyperparameter.
A: When it is added to the hidden units.
Q: When is dropout less effective?
A: It is less effective when extremely few labeled training examples are available. Also, unsupervised feature learning can gain an advantage over dropout when additional unlabeled data is available.
Q: Why is the main power of dropout related to hidden layers?
A: It comes from the fact that the masking noise is applied to the hidden units. This can be seen as a form of highly intelligent, adaptive destruction of the information content of the input rather than destruction of the raw values of the input.
Q: During adversarial training, how do we find the examples where we add a little noise and the output is quite different?
A: In general, there is no clear answer yet on how to list all adversarial examples exactly or how exactly they work. Excessive linearity is one proposed explanation, but it has also been criticized in some of the literature.