0%

notes on "ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction"

ReduNet [1]

This work attempts to provide a plausible theoretical framework that aims to interpret modern deep (convolutional) networks from the principles of data compression and discriminative representation.

Motivation The design of deep networks are often based on years of trial and error, then trained via back propagation, and then deployed as a "black box". It lacks of rigorous mathematical principles, modeling and analysis. It naturally raises a fundamental question that we aim to address in this paper: how to develop a principled mathematical framework for better understanding and design of deep networks?

A new theoretical framework The paper develop a new theoretical framework for understanding deep networks around the following two questions:

  • Objective of Representation Learning : What intrinsic structures of the data should we learn, and how should we represent such structures? What is a principled objective function for learning a good representation of such structures, instead of choosing heuristically or arbitrarily?
  • Architecture of Deep Networks : Can we justify the structures of modern deep networks from such a principle? In particular, can the networks' layered architecture and operators (linear or nonlinear) all be derived from this objective, rather than designed heuristically and evaluated empirically?

The paper's answer to the two questions are:

  • A principled objective for a deep network is to learn a low-dimensional linear discriminative representation of the data. The optimality of such a representation can be evaluated by a principled measure from (lossy) data compression, known as rate reduction.
  • Deep networks can be naturally interpreted as optimization schemes for maximizing this measure.

Not only does this framework offer new perspectives to understand and interpret modern deep networks, they also provide new insights that can potentially change and improve the practice of deep networks.

The Principle of Maximal Coding Rate Reduction

Whether the given data of a mixed distribution can be effectively classified depends on how separable (or discriminative) the component distribution are. One popular working assumption is that the distribution of each class has relatively low-dimensional intrinsic structures.

We assume the distribution of each class has a support on a low-dimensional submanifold, say with dimension , and the distribution of is supported on the mixture of those submanifolds, , in the high-dimensional ambient space .

With the manifold assumption, we want to learn a mapping that maps each of the submanifolds to a linear subspace . We require the learned representation to have the following properties, called a linear discriminative representation (LDR):

  • Within-Class Compressible
  • Between-Class Discriminative
  • Diverse Representation

In this work, to learn a discriminative linear representation for intrinsic low-dimensional structures from high-dimensional data, they propose an information-theoretic measure that maximizes the coding rate difference between the whole dataset and the sum of each individual class, known as rate reduction.

Measure of Compactness for Linear Representation

Rate distortion measures the "compactness" of a random distribution: Given a random variable and a prescribed precision , the rate distortion is the minimal number of binary bits needed to encode such that the expected decoding error is less than , i.e., the decoded satisfied .

Rate distortion for finite samples on a subspace. The compactness of learned features as a whole can be measured in terms of the average coding length per sample (as the sample size m is large), a.k.a. the coding rate subject to the distortion :

Rate distortion of samples on a mixture of subspaces. We may partition the data (representation) into multiple subsets: with each containing samples in one low-dimensional subspace. Let be a set of diagonal matrices whose diagonal entries encode the membership of the samples in the classes. The average number of bits per sample (the coding rate) is

When is given, is a concave function of .

Principle of Maximal Coding Rate Reduction

The learned features should follow the basic rule that similarity contracts and dissimilarity contrasts. To be more precise, a good (linear) discriminative representation of is one such that, given a partition of , achieves a large difference between the coding rate for the whole and that for all the subsets:

If we choose our feature mapping to be (a dnn), the overall process of the feature representation and the resulting rate reduction w.r.t. certain partition can be illustrated by the following diagram:

Normalization. To make the amount of reduction comparable between different representations, we need to normalize the scale of the learned features, either by imposing the Frobenius norm of each class to scale with the number of features in or by normalizing each feature to be on the unit sphere: .

Once the representations can be compared fairly, our goal becomes to learn a set of features and their partition such that they maximize the reduction between the coding rate of all features and that of the sum of features w.r.t. their classes:

This is refered to as the principle of maximal coding rate reduction (MC).

Properties of the Rate Reduction Function

The MC principle is very general and can be applied to representations of any distributions with any attributes as long as the rates and for the distributions can be accurately and efficiently evaluated.

The optimal representation has the following desired properties:

Theorem 1 (Informal Statement) Suppose is the optimal solution that maximizes the rate reduction with the rates and . Assume that the optimal solution satisfies . We have:

  • Between-class Discriminative: As long as the ambient space is adequately large , the subspaces are all orthogonal to each other, i.e. for
  • Maximally Diverse Representation: As long as the coding precision is adequately high, i.e., , each subspace achieves its maximal dimension, i.e. . In addition, the largest singular values of are equal.

References

[1] @article{chan2020redunet, title={ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction}, author={Chan, Kwan Ho Ryan and Yu, Yaodong and You, Chong and Qi, Haozhi and Wright, John and Ma, Yi}, journal={arXiv preprint arXiv:2105.10446}, year={2021} }

[2]