Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion
All of these approaches, however, appear to build on the same principle, which we may summarize as follows:
• Training a deep network to directly optimize only the supervised objective of interest (for example, the log probability of correct classification) by gradient descent, starting from randomly initialized parameters, does not work very well.
• What works much better is to initially use a local unsupervised criterion to (pre)train each layer in turn, with the goal of learning to produce a useful higher-level representation from the lower-level representation output by the previous layer. From this starting point on, gradient descent on the supervised objective leads to much better solutions in terms of generalization performance.
One natural criterion that we may expect any good representation to meet, at least to some degree, is to retain a significant amount of information about the input. It can be expressed in information-theoretic terms as maximizing the mutual information I(X ; Y ) between an input random variable X and its higher level representation Y . This is the infomax principle put forward by Linsker (1989).
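Spelled out with a standard information-theoretic identity (not part of the original text), the infomax objective reads:

```latex
% Mutual information between the input X and its representation Y
% (Linsker's infomax objective):
I(X;Y) \;=\; H(X) - H(X \mid Y)
       \;=\; \mathbb{E}_{p(X,Y)}\!\left[ \log \frac{p(X,Y)}{p(X)\,p(Y)} \right]
```

Since the entropy H(X) is fixed by the data distribution, maximizing I(X;Y) over the parameters of the encoder amounts to minimizing the conditional entropy H(X | Y), i.e., making the input as predictable as possible from its representation.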
When using an affine encoder and decoder without any nonlinearity and a squared error loss, the autoencoder essentially performs principal component analysis (PCA), as shown by Baldi and Hornik (1989).
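This equivalence is easy to verify numerically. The following is a minimal NumPy sketch (toy data, tied weights, all variable names ours): gradient descent on the squared reconstruction error of a linear autoencoder drives its reconstruction error down to that of the optimal rank-k PCA projection.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 centered points in 5 dims whose variance lies mostly in a 2-D subspace
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))
X -= X.mean(axis=0)

k = 2
# Linear autoencoder with tied weights and no nonlinearity: x_hat = (x W^T) W
W = 0.1 * rng.normal(size=(k, 5))
lr = 0.01
for _ in range(5000):
    H = X @ W.T                                # linear encoding
    E = H @ W - X                              # reconstruction error
    grad = (H.T @ E + W @ E.T @ X) / len(X)    # gradient of the squared error w.r.t. W
    W -= lr * grad

# PCA reconstruction onto the top-k principal subspace (via SVD)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = X @ Vt[:k].T @ Vt[:k]

err_ae = np.mean((X @ W.T @ W - X) ** 2)
err_pca = np.mean((X_pca - X) ** 2)
print(err_ae, err_pca)   # the two errors nearly coincide
```

Note that the autoencoder need not recover the principal directions themselves, only the principal subspace: any rotation of the encoding weights within that subspace yields the same reconstruction.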
Here we propose and explore a very different strategy. Rather than constrain the representation, we change the reconstruction criterion for an objective that is both more challenging and more interesting: cleaning partially corrupted input, or in short denoising. In doing so, we modify the implicit definition of a good representation into the following: “a good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input”. Two underlying ideas are implicit in this approach:
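The denoising criterion can be sketched concretely. The toy NumPy implementation below (our own illustrative code, not the paper's; masking noise, sigmoid layers, squared error, all hyperparameters ours) corrupts each input by zeroing a random fraction of its entries, encodes the corrupted version, and trains the decoder to reconstruct the clean input:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, p, rng):
    """Masking noise: zero out a random fraction p of the input entries."""
    return x * (rng.random(x.shape) >= p)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data with redundant structure, so zeroed entries can be filled in from context:
# each example is one of 4 binary prototypes
d, n, h, p = 8, 500, 16, 0.3
prototypes = (rng.random((4, d)) > 0.5).astype(float)
X = prototypes[rng.integers(0, 4, size=n)]

# One-hidden-layer denoising autoencoder
W = 0.1 * rng.normal(size=(d, h)); b = np.zeros(h)   # encoder
V = 0.1 * rng.normal(size=(h, d)); c = np.zeros(d)   # decoder

lr = 0.5
losses = []
for step in range(2000):
    Xc = corrupt(X, p, rng)            # corrupted input x~ (fresh corruption each step)
    H = sigmoid(Xc @ W + b)            # representation y = f(x~)
    R = sigmoid(H @ V + c)             # reconstruction z
    E = R - X                          # error against the CLEAN input, not x~
    losses.append(np.mean(E ** 2))
    dZ = E * R * (1 - R) / n           # backprop through the squared error and output sigmoid
    dH = (dZ @ V.T) * H * (1 - H)
    V -= lr * H.T @ dZ; c -= lr * dZ.sum(axis=0)
    W -= lr * Xc.T @ dH; b -= lr * dH.sum(axis=0)

print(losses[0], losses[-1])   # reconstruction error of the clean input drops
```

The crucial detail is that the target of the reconstruction is the uncorrupted x while the encoder only ever sees the corrupted x~; the hidden layer is therefore pushed to capture dependencies between input dimensions rather than to copy its input.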