ReduNet [1]
This work attempts to provide a plausible theoretical framework that aims to interpret modern deep (convolutional) networks from the principles of data compression and discriminative representation.
Motivation The design of deep networks is often based on years of trial and error; the networks are then trained via backpropagation and deployed as a "black box". The practice lacks rigorous mathematical principles, modeling, and analysis. This raises a fundamental question that the paper aims to address: how can we develop a principled mathematical framework for better understanding and designing deep networks?
A new theoretical framework The paper develops a new theoretical framework for understanding deep networks around the following two questions:
- Objective of Representation Learning: What intrinsic structures of the data should we learn, and how should we represent such structures? What is a principled objective function for learning a good representation of such structures, instead of choosing heuristically or arbitrarily?
- Architecture of Deep Networks: Can we justify the structures of modern deep networks from such a principle? In particular, can the networks' layered architecture and operators (linear or nonlinear) all be derived from this objective, rather than designed heuristically and evaluated empirically?
The paper's answers to the two questions are:
- A principled objective for a deep network is to learn a low-dimensional linear discriminative representation of the data. The optimality of such a representation can be evaluated by a principled measure from (lossy) data compression, known as rate reduction.
- Deep networks can be naturally interpreted as optimization schemes for maximizing this measure.
Not only does this framework offer new perspectives for understanding and interpreting modern deep networks, it also provides new insights that can potentially change and improve the practice of deep networks.
The Principle of Maximal Coding Rate Reduction
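Whether the given data $X$ of a mixed distribution $\mathcal{D}$ can be effectively classified depends on how separable (or discriminative) the component distributions $\mathcal{D}_j$ are (or can be made).
We assume the distribution $\mathcal{D}_j$ of each class is supported on a low-dimensional submanifold $\mathcal{M}_j$ of dimension $d_j \ll D$, so that the distribution $\mathcal{D}$ of $x$ is supported on the mixture of those submanifolds, $\mathcal{M} = \cup_{j=1}^{k} \mathcal{M}_j$, in the high-dimensional ambient space $\mathbb{R}^D$.
With the manifold assumption, we want to learn a mapping $z = f(x, \theta)$ that maps each submanifold $\mathcal{M}_j \subset \mathbb{R}^D$ to a linear subspace $\mathcal{S}_j \subset \mathbb{R}^n$, such that the learned representation has the following properties: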
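- Within-Class Compressible: Features of samples from the same class should be relatively correlated, in the sense that they belong to a low-dimensional linear subspace.
- Between-Class Discriminative: Features of samples from different classes should be highly uncorrelated and belong to different low-dimensional linear subspaces.
- Diverse Representation: The dimension (or variance) of the features of each class should be as large as possible, as long as they stay uncorrelated from the other classes.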
To learn a discriminative linear representation of the intrinsic low-dimensional structures in high-dimensional data, the paper proposes an information-theoretic measure known as rate reduction: the difference between the coding rate of the whole dataset and the sum of the coding rates of the individual classes. The representation is learned by maximizing this difference.
Measure of Compactness for Linear Representation
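Rate distortion measures the "compactness" of a random distribution: Given a random variable $z$ and a prescribed precision $\epsilon > 0$, the rate distortion $R(z, \epsilon)$ is the minimal number of binary bits needed to encode $z$ such that the expected decoding error is less than $\epsilon$, i.e., the decoded $\hat{z}$ satisfies $\mathbb{E}[\|z - \hat{z}\|_2] \le \epsilon$.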
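Rate distortion for finite samples on a subspace. The compactness of the learned features as a whole, $Z = [z^1, \dots, z^m] \in \mathbb{R}^{n \times m}$, can be measured in terms of the average coding length per sample (when the sample size $m$ is large), a.k.a. the coding rate subject to the distortion $\epsilon$:

$$R(Z, \epsilon) \doteq \frac{1}{2} \log\det\left(I + \frac{n}{m\epsilon^2} Z Z^\top\right).$$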
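As a rough illustration of the coding rate for finite samples, the sketch below assumes the formula $R(Z, \epsilon) = \frac{1}{2}\log\det(I + \frac{n}{m\epsilon^2} Z Z^\top)$ from the ReduNet paper [1], with the features stored as the columns of an $n \times m$ array. The function name `coding_rate` and the default `eps` are illustrative choices, not part of the paper, and the natural logarithm is used, so the result is in nats rather than bits.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Coding rate of features Z (n x m, one sample per column), in nats."""
    n, m = Z.shape
    scaled_cov = (n / (m * eps**2)) * (Z @ Z.T)
    # slogdet avoids overflow/underflow that log(det(...)) can hit for larger n.
    _, logdet = np.linalg.slogdet(np.eye(n) + scaled_cov)
    return 0.5 * logdet

rng = np.random.default_rng(0)
n, m = 8, 200

# Features spread over all of R^n, normalized to the unit sphere.
Z_full = rng.standard_normal((n, m))
Z_full /= np.linalg.norm(Z_full, axis=0, keepdims=True)

# Unit-norm features confined to a single direction (a 1-dimensional subspace).
u = np.zeros(n)
u[0] = 1.0
Z_line = np.outer(u, np.sign(rng.standard_normal(m)))

print("rate, full space:", coding_rate(Z_full))
print("rate, single line:", coding_rate(Z_line))
```

With the same per-sample norm, the features that fill the ambient space need a higher rate than the features lying on a line, which is the sense in which the coding rate measures compactness.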
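Rate distortion of samples on a mixture of subspaces. We may partition the data (representation) $Z \in \mathbb{R}^{n \times m}$ into multiple subsets, $Z = Z_1 \cup \dots \cup Z_k$, each lying in one low-dimensional subspace. Let $\Pi = \{\Pi^j \in \mathbb{R}^{m \times m}\}_{j=1}^{k}$ be a set of diagonal matrices whose diagonal entries encode the membership of the $m$ samples in the $k$ classes. The average number of bits per sample (the coding rate) with respect to this partition is

$$R_c(Z, \epsilon \mid \Pi) \doteq \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi^j)}{2m} \log\det\left(I + \frac{n}{\operatorname{tr}(\Pi^j)\,\epsilon^2} Z \Pi^j Z^\top\right).$$

When $Z$ is given, $R_c(Z, \epsilon \mid \Pi)$ is a concave function of $\Pi$.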
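The class-wise coding rate can be sketched in the same way. The code below assumes the formula $R_c(Z, \epsilon \mid \Pi) = \sum_j \frac{\operatorname{tr}(\Pi^j)}{2m} \log\det(I + \frac{n}{\operatorname{tr}(\Pi^j)\epsilon^2} Z \Pi^j Z^\top)$ from the ReduNet paper [1] and, as a simplifying assumption, hard memberships given by an integer label vector, so that $\operatorname{tr}(\Pi^j) = m_j$ and $Z \Pi^j Z^\top = Z_j Z_j^\top$. The name `class_coding_rate` is hypothetical.

```python
import numpy as np

def class_coding_rate(Z, labels, eps=0.5):
    """Sum of per-class coding rates, weighted by class frequency (nats).

    Z is n x m (one sample per column); labels is a length-m integer array.
    """
    n, m = Z.shape
    total = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]          # features of class c
        mc = Zc.shape[1]                # tr(Pi^j) for a hard partition
        _, logdet = np.linalg.slogdet(np.eye(n) + (n / (mc * eps**2)) * (Zc @ Zc.T))
        total += (mc / (2.0 * m)) * logdet
    return total

# Two classes in R^4: five copies of e1 and five copies of e2.
n = 4
Z = np.zeros((n, 10))
Z[0, :5] = 1.0
Z[1, 5:] = 1.0
labels = np.array([0] * 5 + [1] * 5)

print(class_coding_rate(Z, labels, eps=1.0))
```

For this toy input each class contributes $\frac{5}{20}\log(1 + 4) $, so the total equals $\frac{1}{2}\log 5$.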
Principle of Maximal Coding Rate Reduction
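The learned features should follow the basic rule that similarity contracts and dissimilarity contrasts. To be more precise, a good (linear) discriminative representation $Z$ of $X$ is one that, given a partition $\Pi$ of $Z$, achieves a large difference between the coding rate $R(Z, \epsilon)$ of the whole and the coding rate $R_c(Z, \epsilon \mid \Pi)$ of all the subsets:

$$\Delta R(Z, \Pi, \epsilon) \doteq R(Z, \epsilon) - R_c(Z, \epsilon \mid \Pi).$$

If we choose our feature mapping to be $z = f(x, \theta)$ (say, modeled by a deep neural network), the overall process of feature representation and the resulting rate reduction with respect to a given partition $\Pi$ can be summarized as $X \xrightarrow{f(x, \theta)} Z(\theta) \xrightarrow{\Pi,\, \epsilon} \Delta R(Z(\theta), \Pi, \epsilon)$.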
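To make "similarity contracts and dissimilarity contrasts" concrete, here is a minimal sketch of the rate reduction for a hard partition, assuming the coding-rate formulas of the ReduNet paper [1] and using the identity $R_c = \sum_j \frac{m_j}{m} R(Z_j, \epsilon)$ that holds when memberships are 0/1. The helper names `_rate` and `rate_reduction` are hypothetical.

```python
import numpy as np

def _rate(Z, eps):
    """R(Z, eps) = 1/2 * logdet(I + n/(m*eps^2) * Z Z^T), in nats."""
    n, m = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (m * eps**2)) * (Z @ Z.T))
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    """Delta R = rate of the whole minus frequency-weighted rates of the classes."""
    m = Z.shape[1]
    r_whole = _rate(Z, eps)
    r_parts = sum(
        (np.sum(labels == c) / m) * _rate(Z[:, labels == c], eps)
        for c in np.unique(labels)
    )
    return r_whole - r_parts

labels = np.array([0] * 5 + [1] * 5)

# Case A: the two classes lie on orthogonal directions e1 and e2 in R^4.
Z_orth = np.zeros((4, 10))
Z_orth[0, :5] = 1.0
Z_orth[1, 5:] = 1.0

# Case B: both classes lie on the same direction e1.
Z_same = np.zeros((4, 10))
Z_same[0, :] = 1.0

dr_orth = rate_reduction(Z_orth, labels, eps=1.0)
dr_same = rate_reduction(Z_same, labels, eps=1.0)
print(dr_orth, dr_same)
```

In this toy case the classes that share one direction give zero rate reduction, while the classes on orthogonal directions give $\log 3 - \frac{1}{2}\log 5 > 0$: separating the classes expands the whole while each part stays compact.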
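Normalization. To make the amount of reduction comparable between different representations, we need to normalize the scale of the learned features, either by requiring the Frobenius norm of each class $Z_j \in \mathbb{R}^{n \times m_j}$ to scale with the number of features in it, $\|Z_j\|_F^2 = m_j$, or by normalizing each feature to lie on the unit sphere, $z^i \in \mathbb{S}^{n-1}$.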
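The two normalization options described in the ReduNet paper [1], a per-class Frobenius-norm constraint $\|Z_j\|_F^2 = m_j$ and a per-sample unit-norm constraint, might be implemented as below. This is only a sketch: the function names, the label-vector encoding of classes, and the small floor used to avoid division by zero are assumptions made for illustration.

```python
import numpy as np

def normalize_to_sphere(Z, floor=1e-12):
    """Scale every column (sample) of Z to unit Euclidean norm."""
    norms = np.linalg.norm(Z, axis=0, keepdims=True)
    return Z / np.maximum(norms, floor)

def normalize_class_frobenius(Z, labels, floor=1e-12):
    """Rescale each class block Z_j so that ||Z_j||_F^2 equals its sample count m_j."""
    Z = np.array(Z, dtype=float)        # work on a float copy
    for c in np.unique(labels):
        mask = labels == c
        fro = np.linalg.norm(Z[:, mask])            # Frobenius norm of the block
        Z[:, mask] *= np.sqrt(mask.sum()) / max(fro, floor)
    return Z

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 12))
labels = np.repeat([0, 1, 2], [3, 4, 5])

Z_sphere = normalize_to_sphere(Z)
Z_fro = normalize_class_frobenius(Z, labels)
print(np.linalg.norm(Z_sphere, axis=0))
print([np.linalg.norm(Z_fro[:, labels == c]) ** 2 for c in range(3)])
```

Unit-norm samples automatically satisfy the per-class constraint, since a block of $m_j$ unit vectors has squared Frobenius norm $m_j$; the converse does not hold.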
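Once the representations can be compared fairly, our goal becomes to learn a set of features $Z(\theta) = f(X, \theta)$ and their partition $\Pi$ (if not given in advance) such that they maximize the reduction between the coding rate of all features and the sum of the coding rates of the features with respect to their classes:

$$\max_{\theta,\, \Pi}\ \Delta R\big(Z(\theta), \Pi, \epsilon\big) = R(Z(\theta), \epsilon) - R_c(Z(\theta), \epsilon \mid \Pi), \quad \text{s.t.}\ \ \|Z_j(\theta)\|_F^2 = m_j,\ \ \Pi \in \Omega.$$

This is referred to as the principle of maximal coding rate reduction (MCR$^2$).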
Properties of the Rate Reduction Function
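The MCR$^2$ principle is very general: it can be applied to representations $Z$ of any distribution with any attributes $\Pi$, as long as the rates $R$ and $R_c$ for the distribution can be accurately and efficiently evaluated.
The optimal representation $Z^*$ and partition $\Pi^*$ should have some interesting geometric and statistical properties. For the special case of linear subspaces, the paper characterizes the optimal representation as follows.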
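Theorem 1 (Informal Statement) Suppose $Z^* = Z_1^* \cup \dots \cup Z_k^*$ is the optimal solution that maximizes the rate reduction, with the rank of each class satisfying $\operatorname{rank}(Z_j^*) \le d_j$. Then:
- Between-class Discriminative: As long as the ambient space is adequately large ($n \ge \sum_{j=1}^{k} d_j$), the subspaces are all orthogonal to each other, i.e. $(Z_i^*)^\top Z_j^* = 0$ for $i \ne j$.
- Maximally Diverse Representation: As long as the coding precision is adequately high, i.e., $\epsilon^4 < \min_j \left\{ \frac{m_j}{m} \frac{n^2}{d_j^2} \right\}$, each subspace achieves its maximal dimension, i.e. $\operatorname{rank}(Z_j^*) = d_j$. In addition, the largest $d_j - 1$ singular values of $Z_j^*$ are equal.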
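The conditions of the theorem can be checked numerically for a given configuration. The sketch below assumes the statement as given in the ReduNet paper [1]: the ambient dimension should satisfy $n \ge \sum_j d_j$, the precision should satisfy $\epsilon^4 < \min_j \{\frac{m_j}{m}\frac{n^2}{d_j^2}\}$, and orthogonality means $(Z_i)^\top Z_j = 0$ for $i \ne j$. All function names and the tolerance are hypothetical.

```python
import numpy as np

def ambient_space_large_enough(subspace_dims, n):
    """Check n >= sum_j d_j."""
    return n >= sum(subspace_dims)

def eps_upper_bound(class_sizes, subspace_dims, n):
    """Largest eps allowed by eps^4 < min_j (m_j / m) * (n / d_j)^2 (strict bound)."""
    m = sum(class_sizes)
    bound = min((mj / m) * (n / dj) ** 2 for mj, dj in zip(class_sizes, subspace_dims))
    return bound ** 0.25

def are_mutually_orthogonal(class_features, tol=1e-8):
    """Check Z_i^T Z_j = 0 (up to tol) for every pair of distinct classes."""
    k = len(class_features)
    for i in range(k):
        for j in range(i + 1, k):
            if np.abs(class_features[i].T @ class_features[j]).max() > tol:
                return False
    return True

n = 10
eps_max = eps_upper_bound(class_sizes=[50, 50], subspace_dims=[2, 3], n=n)
print("eps must be below", eps_max)
print("ambient space ok:", ambient_space_large_enough([2, 3], n))

# Class 1 spans (e1, e2); class 2 lies along e3.
Z1 = np.zeros((n, 4))
Z1[0, :] = 1.0
Z1[1, :] = [1.0, -1.0, 1.0, -1.0]
Z2 = np.zeros((n, 3))
Z2[2, :] = 1.0
print("orthogonal:", are_mutually_orthogonal([Z1, Z2]))
```

These checks only test the hypotheses and the claimed orthogonality for given features; they do not verify that the features are actually optimal for the rate reduction objective.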
References
[1] Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright, and Yi Ma. ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction. arXiv preprint arXiv:2105.10446, 2021.