# Citation

Memisevic, Roland. “Gradient-Based Learning of Higher-Order Image Features.” *Proceedings of the IEEE International Conference on Computer Vision* (November 2011): 1591–1598. doi:10.1109/ICCV.2011.6126419.

# Abstract

Recent work on unsupervised feature learning has shown that learning on polynomial expansions of input patches, such as on pair-wise products of pixel intensities, can improve the performance of feature learners and extend their applicability to spatio-temporal problems, such as human action recognition or learning of image transformations. Learning of such higher order features, however, has been much more difficult than standard dictionary learning, because of the high dimensionality and because standard learning criteria are not applicable. here, we show how one can cast the problem of learning higher-order features as the problem of learning a parametric family of manifolds. This allows us to apply a variant of a de-noising auto-encoder network to learn higher-order features using simple gradient based optimization. Our experiments show that the approach can outperform existing higher-order models, while training and inference are exact, fast, and simple.

# Finds

- A Python/theano implementation of the model is available at http://www.cs.toronto.edu/-rfm/code/rae/index.html
- A good description of auto-encoders and de-noising auto-encoders

# Quotes & Notes

**Re: learning relationship between two images**

An extension of feature learning that has received a lot of attention recently, is the learning of relations between pixel intensities, rather than of pixel intensities themselves [18], [12], [25], [19]. For this end, one can extend the bi-partite graph of a standard sparse coding model with a tri-partite graph that connects hidden variables with two images. Hidden units then turn into “mapping” units that model structure in the relationship between two images rather than static structure within a single image.

**Re: Why I want to use an auto-encoder instead of an RBM**

Unfortunately, for learning one has to invert these projections in order to compute objective functions and gradients. Naive training thus leads to a computational complexity that remains quadratic in the number of input components. More importantly, existing methods have to rely on sampling-based schemes, such as Hybrid Monte Carlo [24] or various modifications of contrastive divergence learning ([9]) to deal with the presence of three-way cliques [29].

**Re: advantages of the model**

Potential advantages of this model are that (a) low-dimensional pre-projections or multi-layer versions of the model can be defined naturally, (b) by using back-propagation, it is not necessary to manually calculate gradients, and one can use modern code-generation methods to transparently parallelize code (for example, [2]), (c) the model makes no difference between learning of covariance-features and learning of transformations, (d) covariance-features can be mixed with standard features by simply adding connections that are not gated, (e)

in order to deal with binary, real-valued and other types of observables one can simply use the appropriate activation/cost-functions in the final layer of the network,such as squared error for real-valued data and log-loss for binary data.

**Re: question about applicability to RTL and learning a common feature space**

When dealing with image pairs, which are related through transformations, one can think of the outputs as being confined to a conditional appearance manifold.

**If we’re assuming that instance samples from one MDP are related through some transformation (i.e. a mapping function), then can we cast the problem of learning the mapping function as learning a family of manifolds, parameterized by the inputs of the source MDP?**

**Re: loss-functions**

For binary or multinomial y, we minimize cross-entropy loss (negative log-probability)

**Re: the result of this model (i.e. the gated or conditional)**

The model defines a “conditional manifold” over y as a

function x. This is in contrast to [18], for example, who define a conditional distribution.The model is an instance of a higher-order neural network, i.e. a network whose units compute products of incoming variables, not just weighted

sums [7].

**Re: training the model symmetrically**

A simple way to achieve this is by defining the overall objective function as the sum of the two asymmetric objectives

Using the symmetric objective can be thought of as the non-probabilistic analog of modeling a joint distribution over x and y as opposed to modeling a conditional.

# Further Research

**Higher-order neural network: **C. L. Giles and T. Maxwell. Learning, invariance, and generalization in high-order neural networks. Appl. Opt., 26(23):4972–4978, Dec 1987