An autoencoder consists of two parts.
- An encoder $$f(x)$$ that maps some input representation $$x$$ to a hidden, latent representation $$h$$
- A decoder $$g(h)$$ that reconstructs the input $$x$$ from the hidden representation $$h$$
We usually add some constraints to the hidden layer - for example, by restricting its dimension. An undercomplete autoencoder is one where the hidden layer $$h$$ has a smaller dimension than the input $$x$$, forcing the encoder to compress the input.
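A minimal sketch of an undercomplete autoencoder in PyTorch (the layer sizes, the `code_dim` bottleneck, and the use of linear layers with ReLU are illustrative choices, not something prescribed by these notes):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    # Undercomplete autoencoder: the code h has lower dimension than x.
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder f(x): maps the input x to the latent code h.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )
        # Decoder g(h): reconstructs x from the latent code h.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        h = self.encoder(x)       # h = f(x)
        x_hat = self.decoder(h)   # x_hat = g(h), the reconstruction of x
        return x_hat, h
```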
By adding in some sort of regularization/prior, we encourage the autoencoder to learn a distributed representation of the data.
We train by minimizing the loss $$L(x, g(f(x))) + \Omega(h)$$, where $$L$$ is the reconstruction error and $$\Omega$$ is a regularization penalty on the hidden code.
The autoencoder is forced to balance these two terms - minimizing the regularization cost as well as the reconstruction cost. In doing so, it learns a hidden representation of the data that has interesting properties.
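A sketch of one training step with this two-term objective (the mean-squared error reconstruction, the penalty weight `lam`, and the placeholder `omega` are my own assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def omega(h):
    # Placeholder regularizer on the code; here an L1 penalty that encourages sparsity.
    return h.abs().sum(dim=1).mean()

def training_step(model, optimizer, x, lam=1e-3):
    x_hat, h = model(x)
    recon = F.mse_loss(x_hat, x)     # reconstruction cost L(x, g(f(x)))
    loss = recon + lam * omega(h)    # plus the regularization cost Omega(h)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```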
As always, we can train this model using MLE by maximizing $$\log p_{\text{model}}(h, x) = \log p_{\text{model}}(h) + \log p_{\text{model}}(x \vert h)$$, treating $$h$$ as a latent variable.
We can think of these two terms as the regularization and reconstruction cost respectively.
For example, for sparse autoencoders, $$\Omega(h) = \lambda \sum_i \vert h_i \vert$$, which corresponds to placing a Laplace prior $$p(h_i) = \frac{\lambda}{2} e^{-\lambda \vert h_i \vert}$$ on the code.
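Taking the negative log of that Laplace prior (a one-line derivation added for completeness) recovers the absolute-value penalty:

$$-\log p(h) = -\sum_i \log \frac{\lambda}{2} e^{-\lambda \vert h_i \vert} = \lambda \sum_i \vert h_i \vert - \sum_i \log \frac{\lambda}{2}$$

The second term is a constant that does not depend on $$h$$, so it has no effect on the minimization.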
For a denoising autoencoder (DAE), the input is corrupted before being encoded.
Here we want to minimize the loss function $$L(x, g(f(\tilde{x})))$$, where $$\tilde{x}$$ is a copy of $$x$$ that has been corrupted by some noise process.
This corresponds to minimizing $$-\mathbb{E}_{x \sim \hat{p}_{\text{data}}(x)} \mathbb{E}_{\tilde{x} \sim C(\tilde{x} \vert x)} \log p_{\text{decoder}}(x \vert h = f(\tilde{x}))$$, where $$C(\tilde{x} \vert x)$$ is the corruption process that produces the noisy sample $$\tilde{x}$$ from $$x$$.
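A sketch of one denoising training step, assuming additive Gaussian corruption and a squared-error reconstruction loss (both are assumptions; any corruption process $$C(\tilde{x} \vert x)$$ would do):

```python
import torch
import torch.nn.functional as F

def dae_step(model, optimizer, x, noise_std=0.3):
    x_tilde = x + noise_std * torch.randn_like(x)  # corrupted sample drawn from C(x_tilde | x)
    x_hat, _ = model(x_tilde)                      # reconstruct from the corrupted input
    loss = F.mse_loss(x_hat, x)                    # the target is the *clean* x, not x_tilde
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```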
Alternatively, we can use score matching as an alternative to maximizing the log-likelihood - it learns the gradient field $$\nabla_x \log p(x)$$ instead.
Denoising autoencoders (DAEs) learn the vector field $$g(f(x)) - x$$, which estimates this gradient field and points from corrupted points back toward the data manifold.
With a loss of this form, we could train models to colorize pictures, deblur images, etc., treating the grayscale or blurred image as the corrupted input $$\tilde{x}$$.
We aim to learn the structure of the manifold through the distributed representation learned by the autoencoder.
The manifold has tangent planes (similar to tangent lines). These tangent planes describe the directions in which we can move while staying on the manifold.
Autoencoder training involves a compromise between two forces:
- We need to learn a representation $$h$$ such that we can reconstruct $$x$$
- We need to satisfy the constraint/regularization penalty.
As such, we learn only the variations along the manifold - we need these because, in the case of a denoising autoencoder, we must map corrupted points back onto the manifold.
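One way to inspect these tangent directions (my own sketch, not part of the notes) is to take the Jacobian of the decoder at a code $$h = f(x)$$; its columns are directions in input space that move the reconstruction along the learned manifold:

```python
import torch
from torch.autograd.functional import jacobian

def tangent_directions(model, x):
    # x is a single example of shape (1, input_dim); reuses the Autoencoder sketch above.
    h = model.encoder(x)                           # code of shape (1, code_dim)
    J = jacobian(lambda z: model.decoder(z), h)    # shape (1, input_dim, 1, code_dim)
    J = J.squeeze(0).squeeze(1)                    # (input_dim, code_dim)
    # Each column is a tangent direction of the manifold at the reconstruction g(h).
    return J
```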
Most early research was based on non-parametric nearest-neighbor approaches - we solve a local linear system to get "pancakes" (locally linear patches), tying these pancakes together to form a global system.
However, if the manifold varies a lot, then we need a lot of "pancakes" to capture this variation - so instead we look towards deep learning to solve these problems.
Contractive autoencoders trained with sigmoidal units create a sort of binary code in their hidden layer.
The contractive penalty $$\Omega(h) = \lambda \Vert \frac{\partial f(x)}{\partial x} \Vert_F^2$$ treats $$f(x)$$ locally as a linear operator (its Jacobian). The term contractive comes from the fact that the penalty encourages $$f$$ to map a neighborhood of input points to a smaller (contracted) neighborhood of hidden points.
Usually, only a small number of hidden units have large derivatives - these correspond to movement along the manifold, as they capture most of the local variation in the data.
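A sketch of this contractive penalty and loss, reusing the `Autoencoder` module above (computing the full encoder Jacobian as below is simple but expensive; it is an illustration, not an efficient implementation):

```python
import torch
import torch.nn.functional as F
from torch.autograd.functional import jacobian

def contractive_penalty(model, x):
    # Jacobian of the encoder for a single example x of shape (1, input_dim).
    # create_graph=True so the penalty can be backpropagated into the encoder weights.
    J = jacobian(lambda v: model.encoder(v), x, create_graph=True)  # (1, code_dim, 1, input_dim)
    return (J ** 2).sum()  # squared Frobenius norm of df(x)/dx

def cae_loss(model, x, lam=0.1):
    x_hat, _ = model(x)
    recon = F.mse_loss(x_hat, x)                          # reconstruction cost
    return recon + lam * contractive_penalty(model, x)    # plus the contractive term
```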
One thing we can do to avoid learning trivial transformations - such as the encoder shrinking its outputs while the decoder scales them back up - is to tie the decoder weights to the (transposed) encoder weights.
Predictive sparse decomposition (PSD) is a hybrid of sparse coding and parametric autoencoders, where we try to minimize $$\Vert x - g(h) \Vert^2 + \lambda \vert h \vert_1 + \gamma \Vert h - f(x) \Vert^2$$ with respect to both the code $$h$$ and the parameters of $$f$$ and $$g$$.
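A rough sketch of how this hybrid objective can be trained - PSD-style training alternates between optimizing the code $$h$$ and the network parameters. The inner step counts, learning rates, and the use of means instead of sums are my own simplifications:

```python
import torch
import torch.nn.functional as F

def psd_step(model, optimizer, x, lam=0.1, gamma=1.0, inner_steps=10, inner_lr=0.1):
    # Inner loop: treat the code h as a free variable for this batch and optimize it,
    # starting from the encoder's prediction f(x).
    h_pred = model.encoder(x).detach()
    h = h_pred.clone().requires_grad_(True)
    h_opt = torch.optim.SGD([h], lr=inner_lr)
    for _ in range(inner_steps):
        obj = (F.mse_loss(model.decoder(h), x)     # ||x - g(h)||^2
               + lam * h.abs().mean()              # lambda * |h|_1
               + gamma * F.mse_loss(h, h_pred))    # gamma * ||h - f(x)||^2
        h_opt.zero_grad()
        obj.backward()
        h_opt.step()
    # Outer step: update the encoder and decoder parameters with the optimized code held fixed.
    h = h.detach()
    optimizer.zero_grad()
    loss = F.mse_loss(model.decoder(h), x) + gamma * F.mse_loss(model.encoder(x), h)
    loss.backward()
    optimizer.step()
    return loss.item()
```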
We can use autoencoders to perform a variety of tasks - such as dimensionality reduction.
If we learn a binary code, we can use it for a form of semantic hashing, where the Hamming distance between two codes measures the similarity of the corresponding inputs.
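A small sketch of that hashing idea (binarizing by passing the code through a sigmoid and thresholding at 0.5 is an illustrative assumption):

```python
import torch

def binary_code(model, x):
    # Squash the code into (0, 1) and threshold to get a binary hash.
    h = torch.sigmoid(model.encoder(x))
    return (h > 0.5).to(torch.uint8)

def hamming_distance(code_a, code_b):
    # Number of differing bits between two binary codes.
    return (code_a != code_b).sum(dim=-1)
```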