Stein Variational Gradient Descent: Stein's Method

Stein's Method

Stein's method is a general theoretical tool for bounding the distance between two probability distributions: it quantifies the error incurred when one distribution is approximated by another, in a variety of metrics.

Foundation of Stein's Method

(The following content is taken almost verbatim from [4].)

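Let $Z$ be a random variable on $\mathbb{R}$ with differentiable probability density $p$. Our main emphasis here is on the standard normal density, but the argument works more generally. Now let $\mathcal{F}$ be the set of differentiable real functions $f$ such that $f(x)\,p(x) \to 0$ as $|x| \to \infty$. Then clearly

$$\mathbb{E}\{f'(Z) + \psi(Z) f(Z)\} = \int_{-\infty}^{\infty} (fp)'(x)\,dx = 0,$$

with $\psi(x) = p'(x)/p(x)$: for the standard normal density, $\psi(x) = -x$. Hence any function $h$ of the form $h = f' + \psi f$ with $f \in \mathcal{F}$ automatically has $\mathbb{E}h(Z) = 0$. Conversely, given any continuous $h$ with $\mathbb{E}h(Z) = 0$, the function $h$ can be written in the form $h = f' + \psi f$ with $f \in \mathcal{F}$. (Remark: for the standard normal density, the first statement is the first part of Stein's Lemma, stated later in this post.)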
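The identity $\mathbb{E}\{f'(Z) + \psi(Z) f(Z)\} = 0$, with $\psi = p'/p$, can be checked numerically. The sketch below is only an illustration: the test function `f`, the integration ranges, and the choice of the logistic density as a second example are assumptions made here and do not come from [4].

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_psi(x):
    # psi = p'/p for the standard normal density
    return -x

def logistic_pdf(x):
    # Standard logistic density e^{-x} / (1 + e^{-x})^2, in a stable form
    return 0.25 / math.cosh(0.5 * x) ** 2

def logistic_psi(x):
    # psi = p'/p = -tanh(x / 2) for the standard logistic density
    return -math.tanh(0.5 * x)

def stein_expectation(f, f_prime, pdf, psi, lo, hi, n=40000):
    """Trapezoidal approximation of E{f'(Z) + psi(Z) f(Z)} for Z with density pdf."""
    step = (hi - lo) / n

    def integrand(x):
        return (f_prime(x) + psi(x) * f(x)) * pdf(x)

    total = 0.5 * (integrand(lo) + integrand(hi))
    for i in range(1, n):
        total += integrand(lo + i * step)
    return total * step

# Hypothetical test function; f(x) p(x) -> 0 in both tails for both densities.
def f(x):
    return math.sin(x) + x * x

def f_prime(x):
    return math.cos(x) + 2.0 * x

normal_value = stein_expectation(f, f_prime, normal_pdf, normal_psi, -12.0, 12.0)
logistic_value = stein_expectation(f, f_prime, logistic_pdf, logistic_psi, -60.0, 60.0)
print(normal_value, logistic_value)  # both should be close to zero
```

Both expectations vanish up to quadrature error, which illustrates that the argument does not depend on the density being normal.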

One example:

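example: begin

If $h = f' + \psi f$, then $(fp)' = hp$, so that

$$f(x) = \frac{1}{p(x)} \int_{-\infty}^{x} h(t)\,p(t)\,dt,$$

and this argument can be reversed: $f$ so defined, for an arbitrary $h$ for which $\mathbb{E}h(Z) = 0$, is such that $f' + \psi f = h$. Hence we can take $f_h$ to be defined by

$$f_h(x) = \frac{1}{p(x)} \int_{-\infty}^{x} \{h(t) - \mathbb{E}h(Z)\}\,p(t)\,dt,$$

noting that it is then directly checked that $f_h' + \psi f_h = h - \mathbb{E}h(Z)$.

How to get the last equation? Multiply the definition of $f_h$ by $p(x)$ and differentiate both sides.

the left: $(f_h p)'(x) = f_h'(x)\,p(x) + f_h(x)\,p'(x) = p(x)\,\{f_h'(x) + \psi(x) f_h(x)\}$

the right: $\{h(x) - \mathbb{E}h(Z)\}\,p(x)$

Dividing both sides by $p(x)$ gives the identity. Evaluating it at a random variable $W$ and taking expectations allows us to write

$$\mathbb{E}h(W) - \mathbb{E}h(Z) = \mathbb{E}\{f_h'(W) + \psi(W) f_h(W)\}$$

for any random variable $W$ for which the expectations exist; in particular, for standard normal approximation,

$$\mathbb{E}h(W) - \mathbb{E}h(Z) = \mathbb{E}\{f_h'(W) - W f_h(W)\}.$$

example: end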
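The construction of $f_h$ can be illustrated numerically for the standard normal density, where $f_h(x) = \frac{1}{p(x)} \int_{-\infty}^{x} \{h(t) - \mathbb{E}h(Z)\}\,p(t)\,dt$ should satisfy $f_h'(x) - x f_h(x) = h(x) - \mathbb{E}h(Z)$. The sketch below assumes the hypothetical test function $h(x) = x^2$, for which $\mathbb{E}h(Z) = 1$ and the integral has the closed form $f_h(x) = -x$. The truncation point and step sizes are illustrative choices.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def trapezoid(g, lo, hi, n=20000):
    step = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, n):
        total += g(lo + i * step)
    return total * step

def stein_solution(h, mean_h, x, lower=-12.0):
    """f_h(x) = (1 / p(x)) * int_{-inf}^{x} (h(t) - E h(Z)) p(t) dt, standard normal p.

    The lower limit -inf is truncated at `lower`; this is an assumption that
    is harmless here because the normal tail beyond -12 is negligible.
    """
    integral = trapezoid(lambda t: (h(t) - mean_h) * normal_pdf(t), lower, x)
    return integral / normal_pdf(x)

def h(x):
    return x * x

MEAN_H = 1.0  # E[Z^2] = 1 for Z ~ N(0, 1)

def stein_residual(x, eps=1e-4):
    """f_h'(x) - x f_h(x) - (h(x) - E h(Z)), with f_h' by central differences."""
    f_plus = stein_solution(h, MEAN_H, x + eps)
    f_minus = stein_solution(h, MEAN_H, x - eps)
    f_mid = stein_solution(h, MEAN_H, x)
    return (f_plus - f_minus) / (2.0 * eps) - x * f_mid - (h(x) - MEAN_H)

for x in (-1.5, 0.5, 1.0):
    # f_h(x) should be close to -x and the residual close to zero
    print(x, stein_solution(h, MEAN_H, x), stein_residual(x))
```

For this $h$ the identity reads $-1 + x^2 = x^2 - 1$, so the printed residuals reflect only quadrature and finite-difference error.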

Important Conclusion!! Taking the supremum of the left-hand side over test functions $h \in \mathcal{H}$ gives a measure of the distance between the distributions of $W$ and $Z$. The distance can in turn be computed by taking the supremum of the right-hand side over $f_h$, $h \in \mathcal{H}$. ... In many circumstances, expressing the difference $\mathbb{E}h(W) - \mathbb{E}h(Z)$ in the entirely equivalent form $\mathbb{E}\{f_h'(W) + \psi(W) f_h(W)\}$ makes it easier to bound.

Error Bounded [5] Let be the set of all that are twice continuously differentiable, and and for all . Let be a standard normal random variable and be any random variable. Then

Stein's Lemma, Identity and Operator

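Stein's Lemma Define the functional operator $\mathcal{A}$ by

$$\mathcal{A}f(x) = f'(x) - x f(x).$$

  • If $Z$ has the standard normal distribution, then $\mathbb{E}\,\mathcal{A}f(Z) = 0$ for all absolutely continuous $f$ with $\mathbb{E}|f'(Z)| < \infty$.
  • If for some random variable $W$, $\mathbb{E}\,\mathcal{A}f(W) = 0$ for all absolutely continuous functions $f$ with $\|f'\|_\infty < \infty$, then $W$ has the standard normal distribution.

The operator $\mathcal{A}$ is referred to as a characterizing operator of the standard normal distribution.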
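To see the characterizing property at work, one can evaluate $\mathbb{E}[f'(W) - W f(W)]$ for a normal and a non-normal $W$. The sketch below is an illustration under assumptions made here: it uses $f = \sin$ (which has a bounded derivative) and compares the standard normal with the uniform distribution on $[-\sqrt{3}, \sqrt{3}]$, which also has mean 0 and variance 1. For that uniform distribution a direct integration gives the value $\cos(\sqrt{3})$.

```python
import math

def expect(g, pdf, lo, hi, n=20000):
    """Trapezoidal approximation of E[g(W)] for W with density pdf on [lo, hi]."""
    step = (hi - lo) / n
    total = 0.5 * (g(lo) * pdf(lo) + g(hi) * pdf(hi))
    for i in range(1, n):
        x = lo + i * step
        total += g(x) * pdf(x)
    return total * step

def stein_operator(f, f_prime):
    """Return the function x -> f'(x) - x f(x)."""
    return lambda x: f_prime(x) - x * f(x)

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

A = math.sqrt(3.0)  # Uniform(-A, A) has mean 0 and variance 1

def uniform_pdf(x):
    return 1.0 / (2.0 * A)

af = stein_operator(math.sin, math.cos)  # hypothetical test function f = sin

normal_value = expect(af, normal_pdf, -12.0, 12.0)
uniform_value = expect(af, uniform_pdf, -A, A)
print(normal_value)   # close to zero
print(uniform_value)  # close to cos(sqrt(3)), which is nonzero
```

A single test function with a nonzero expectation is enough to rule out normality, even though the first two moments of the uniform example match those of the standard normal.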

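(Stein) score function Assume that $\mathcal{X}$ is a subset of $\mathbb{R}^d$ and $p(x)$ a continuously differentiable (also called smooth) density whose support is $\mathcal{X}$. The (Stein) score function of $p$ is defined as

$$s_p(x) = \nabla_x \log p(x) = \frac{\nabla_x p(x)}{p(x)}.$$

We say that a function $f: \mathcal{X} \to \mathbb{R}$ is in the Stein class of $p$ if $f$ is smooth and satisfies

$$\int_{x \in \mathcal{X}} \nabla_x \big(f(x)\,p(x)\big)\,dx = 0.$$

When $\mathcal{X} = \mathbb{R}^d$, this condition holds if

$$\lim_{\|x\| \to \infty} f(x)\,p(x) = 0.$$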
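As a small worked example (added for illustration and not taken from the references), assume $p$ is the univariate Gaussian density $\mathcal{N}(\mu, \sigma^2)$. Its score function follows directly from the log-density:

$$
\begin{aligned}
\log p(x) &= -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2), \\
s_p(x) &= \frac{d}{dx}\log p(x) = -\frac{x-\mu}{\sigma^2}.
\end{aligned}
$$

With $\mu = 0$ and $\sigma = 1$ this gives $s_p(x) = -x$, the same function that appears as $\psi(x) = p'(x)/p(x)$ for the standard normal density. The normalizing constant drops out when differentiating, so the score can be computed from an unnormalized density.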

Stein's Operator The Stein operator of $p$ is a linear operator acting on the Stein class of $p$, defined as

$$\mathcal{A}_p f(x) = s_p(x)\,f(x) + \nabla_x f(x).$$

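Stein's Identity Assume $p(x)$ is a smooth density supported on $\mathcal{X}$, then

$$\mathbb{E}_p[\mathcal{A}_p f(x)] = \mathbb{E}_p[s_p(x)\,f(x) + \nabla_x f(x)] = 0$$

for any $f$ that is in the Stein class of $p$.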

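Lemma [6] Assume $p(x)$ and $q(x)$ are smooth densities supported on $\mathcal{X}$ and $f(x)$ is in the Stein class of $p$, we have

$$\mathbb{E}_p[\mathcal{A}_q f(x)] = \mathbb{E}_p\big[(s_q(x) - s_p(x))\,f(x)\big].$$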
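The lemma, $\mathbb{E}_p[s_q(x) f(x) + \nabla_x f(x)] = \mathbb{E}_p[(s_q(x) - s_p(x)) f(x)]$, can be checked in one dimension by quadrature. The sketch below assumes $p = \mathcal{N}(0, 1)$, a hypothetical $q = \mathcal{N}(1, 4)$, and the test function $f(x) = \sin x + \cos x$. None of these choices come from [6]. For this setup both sides evaluate to $e^{-1/2}$.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def expect_p(g, lo=-12.0, hi=12.0, n=20000):
    """Trapezoidal approximation of E_p[g(x)] for p = N(0, 1)."""
    step = (hi - lo) / n
    total = 0.5 * (g(lo) * normal_pdf(lo) + g(hi) * normal_pdf(hi))
    for i in range(1, n):
        x = lo + i * step
        total += g(x) * normal_pdf(x)
    return total * step

MU_Q, SIGMA_Q = 1.0, 2.0  # hypothetical q = N(1, 4)

def score_p(x):
    return -x

def score_q(x):
    return -(x - MU_Q) / SIGMA_Q ** 2

def f(x):
    return math.sin(x) + math.cos(x)

def f_prime(x):
    return math.cos(x) - math.sin(x)

# Left side: E_p[A_q f] = E_p[s_q f + f']
lhs = expect_p(lambda x: score_q(x) * f(x) + f_prime(x))
# Right side: E_p[(s_q - s_p) f]
rhs = expect_p(lambda x: (score_q(x) - score_p(x)) * f(x))
print(lhs, rhs)  # the two values should agree
```

The two sides differ by $\mathbb{E}_p[\mathcal{A}_p f]$, which is zero by Stein's identity, so the agreement is another numerical view of that identity.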

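Lemma Stein's identity (again..)
Roughly speaking, the discrepancy measures below rely on the basic fact that two smooth densities $p(x)$ and $q(x)$ supported on $\mathbb{R}$ are identical if and only if

$$\mathbb{E}_p[s_q(x)\,f(x) + \nabla_x f(x)] = 0$$

for smooth functions $f(x)$ with proper zero-boundary conditions, where $s_q(x) = \nabla_x \log q(x) = \nabla_x q(x)/q(x)$ is called the (Stein) score function of $q(x)$. The equation is known as Stein's identity.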

Stein Discrepancy Measure and Kernelized Stein Discrepancy (KSD)

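The Stein discrepancy measure between $p$ and $q$ is

$$\mathbb{S}(p, q) = \max_{f \in \mathcal{F}} \big(\mathbb{E}_p[s_q(x)\,f(x) + \nabla_x f(x)]\big)^2,$$

where $\mathcal{F}$ is a set of smooth functions that satisfies Stein's identity and is also rich enough to ensure $\mathbb{S}(p, q) > 0$ whenever $p \neq q$.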

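The Kernelized Stein Discrepancy (KSD) between distributions $p$ and $q$ is defined as

$$\mathbb{S}(p, q) = \mathbb{E}_{x, x' \sim p}\big[\delta_{q,p}(x)^\top k(x, x')\,\delta_{q,p}(x')\big],$$

where $\delta_{q,p}(x) = s_q(x) - s_p(x)$ is the score difference between $p$ and $q$, and $x, x'$ are i.i.d. draws from $p(x)$.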

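Proposition Define $g_{p,q}(x) = p(x)\,(s_q(x) - s_p(x))$. Assume $k(x, x')$ is integrally strictly positive definite, and $p$, $q$ are continuous densities with $\|g_{p,q}\|_2^2 < \infty$; then $\mathbb{S}(p, q) \geq 0$, and $\mathbb{S}(p, q) = 0$ if and only if $p = q$. (This establishes $\mathbb{S}(p, q)$ as a valid discrepancy measure.)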

Definition A kernel $k(x, x')$ is said to be in the Stein class of $p$ if $k(x, x')$ has continuous second-order partial derivatives, and both $k(x, \cdot)$ and $k(\cdot, x)$ are in the Stein class of $p$ for any fixed $x$.

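Theorem Assume $p$ and $q$ are smooth densities and $k(x, x')$ is in the Stein class of $p$. Define

$$u_q(x, x') = s_q(x)^\top k(x, x')\,s_q(x') + s_q(x)^\top \nabla_{x'} k(x, x') + \nabla_x k(x, x')^\top s_q(x') + \operatorname{trace}\big(\nabla_{x, x'} k(x, x')\big),$$

then,

$$\mathbb{S}(p, q) = \mathbb{E}_{x, x' \sim p}[u_q(x, x')].$$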
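To make the function $u_q(x, x') = s_q(x)^\top k(x, x') s_q(x') + s_q(x)^\top \nabla_{x'} k(x, x') + \nabla_x k(x, x')^\top s_q(x') + \operatorname{trace}(\nabla_{x,x'} k(x, x'))$ concrete, the sketch below writes it out in one dimension. It assumes an RBF kernel with a hypothetical bandwidth of 1 and takes $q = \mathcal{N}(0, 1)$. It then evaluates $\mathbb{E}_{x,x' \sim p}[u_q(x, x')]$ on a grid for $p = \mathcal{N}(\mu, 1)$. For this particular setup a direct Gaussian integral gives the value $\mu^2/\sqrt{3}$, which the code uses as a sanity check.

```python
import math

BANDWIDTH = 1.0  # hypothetical RBF bandwidth

def rbf(x, y):
    return math.exp(-0.5 * (x - y) ** 2 / BANDWIDTH ** 2)

def u_q(x, y, score_q):
    """u_q(x, x') in one dimension for the RBF kernel."""
    k = rbf(x, y)
    diff = (x - y) / BANDWIDTH ** 2
    dk_dx = -diff * k                              # d/dx  k(x, x')
    dk_dy = diff * k                               # d/dx' k(x, x')
    d2k = (1.0 / BANDWIDTH ** 2 - diff ** 2) * k   # d^2/(dx dx') k(x, x')
    sx, sy = score_q(x), score_q(y)
    return sx * k * sy + sx * dk_dy + dk_dx * sy + d2k

def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def ksd_by_quadrature(mean_p, score_q, half_width=10.0, n=200):
    """E_{x,x' ~ p}[u_q(x, x')] for p = N(mean_p, 1), via a 2-D Riemann sum."""
    step = 2.0 * half_width / n
    grid = [mean_p - half_width + i * step for i in range(n + 1)]
    weights = [normal_pdf(x, mean_p) * step for x in grid]
    total = 0.0
    for x, wx in zip(grid, weights):
        for y, wy in zip(grid, weights):
            total += wx * wy * u_q(x, y, score_q)
    return total

def score_q(x):
    return -x  # q = N(0, 1)

same = ksd_by_quadrature(0.0, score_q)     # p = q
shifted = ksd_by_quadrature(2.0, score_q)  # p = N(2, 1)
print(same)     # close to zero
print(shifted)  # close to 4 / sqrt(3)
```

Only the score of $q$ enters `u_q`, so the discrepancy can be evaluated without knowing the density or score of $p$.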

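Theorem Assume $k(x, x')$ is a positive definite kernel in the Stein class of $p$, with positive eigenvalues $\{\lambda_j\}$ and eigenfunctions $\{e_j(x)\}$, then $u_q(x, x')$ is also a positive definite kernel, and can be rewritten into

$$u_q(x, x') = \sum_j \lambda_j\,[\mathcal{A}_q e_j(x)]^\top [\mathcal{A}_q e_j(x')],$$

where $\mathcal{A}_q e_j(x) = s_q(x)\,e_j(x) + \nabla_x e_j(x)$ is the Stein operator acting on $e_j$. In addition,

$$\mathbb{S}(p, q) = \sum_j \lambda_j\,\big\|\mathbb{E}_{x \sim p}[\mathcal{A}_q e_j(x)]\big\|_2^2.$$

Note that although the $e_j$ are orthonormal, the $\mathcal{A}_q e_j$ are no longer orthonormal in general.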

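Theorem Let $\mathcal{H}$ be the RKHS related to a positive definite kernel $k(x, x')$ in the Stein class of $p$. Denote by $\boldsymbol{\beta}(x') = \mathbb{E}_{x \sim p}[\mathcal{A}_q k_{x'}(x)]$, where $k_{x'}(\cdot) = k(\cdot, x')$, then

$$\mathbb{S}(p, q) = \|\boldsymbol{\beta}\|_{\mathcal{H}^d}^2.$$

Further, we have $\langle f, \boldsymbol{\beta} \rangle_{\mathcal{H}^d} = \mathbb{E}_{x \sim p}[\operatorname{trace}(\mathcal{A}_q f)]$ for $f \in \mathcal{H}^d$, where for a vector-valued $f$ the operator acts as $\mathcal{A}_q f(x) = s_q(x)\,f(x)^\top + \nabla_x f(x)$, and hence

$$\sqrt{\mathbb{S}(p, q)} = \max_{f \in \mathcal{H}^d} \Big\{ \mathbb{E}_{x \sim p}[\operatorname{trace}(\mathcal{A}_q f)] \ \ \text{s.t.} \ \ \|f\|_{\mathcal{H}^d} \leq 1 \Big\},$$

where the maximum is achieved when $f = \boldsymbol{\beta} / \|\boldsymbol{\beta}\|_{\mathcal{H}^d}$.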

How to Use KSD? (Goodness-of-fit Testing Based on KSD)

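The expectation form $\mathbb{S}(p, q) = \mathbb{E}_{x, x' \sim p}[u_q(x, x')]$ allows efficient estimation of $\mathbb{S}(p, q)$ in practice. Given an i.i.d. sample $\{x_i\}_{i=1}^n \sim p(x)$ ($p$ is unknown) and the score function $s_q(x)$, we can estimate $\mathbb{S}(p, q)$ by

$$\hat{\mathbb{S}}_u(p, q) = \frac{1}{n(n-1)} \sum_{1 \leq i \neq j \leq n} u_q(x_i, x_j),$$

where $\hat{\mathbb{S}}_u(p, q)$ is a U-statistic. $\mathbb{S}(p, q)$ can also be estimated using a V-statistic of the form $\frac{1}{n^2} \sum_{i,j=1}^{n} u_q(x_i, x_j)$, which provides a biased estimator.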
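The two estimators can be sketched in a few lines. The example below is a minimal illustration under assumptions made here: one-dimensional data, an RBF kernel with a hypothetical bandwidth of 1, the target $q = \mathcal{N}(0, 1)$ with score $s_q(x) = -x$, and samples drawn with a fixed seed. The U-statistic averages $u_q(x_i, x_j)$ over pairs with $i \neq j$, and the V-statistic also includes the diagonal terms.

```python
import math
import random

BANDWIDTH = 1.0  # hypothetical RBF bandwidth

def u_q(x, y, score_q):
    """u_q(x, x') in one dimension for the RBF kernel."""
    k = math.exp(-0.5 * (x - y) ** 2 / BANDWIDTH ** 2)
    diff = (x - y) / BANDWIDTH ** 2
    sx, sy = score_q(x), score_q(y)
    # s_q(x) k s_q(x') + s_q(x) d/dx' k + d/dx k s_q(x') + d^2/(dx dx') k
    return (sx * k * sy + sx * diff * k - diff * k * sy
            + (1.0 / BANDWIDTH ** 2 - diff ** 2) * k)

def ksd_u_statistic(samples, score_q):
    """Unbiased estimate: average of u_q over all pairs with i != j."""
    n = len(samples)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += u_q(samples[i], samples[j], score_q)
    return total / (n * (n - 1))

def ksd_v_statistic(samples, score_q):
    """Biased estimate: average of u_q over all pairs, diagonal included."""
    n = len(samples)
    total = sum(u_q(x, y, score_q) for x in samples for y in samples)
    return total / (n * n)

def score_q(x):
    return -x  # score of the target q = N(0, 1)

rng = random.Random(0)
null_samples = [rng.gauss(0.0, 1.0) for _ in range(200)]     # drawn from q
shifted_samples = [rng.gauss(2.0, 1.0) for _ in range(200)]  # drawn from N(2, 1)

u_null = ksd_u_statistic(null_samples, score_q)
u_shifted = ksd_u_statistic(shifted_samples, score_q)
v_null = ksd_v_statistic(null_samples, score_q)
v_shifted = ksd_v_statistic(shifted_samples, score_q)
print(u_null, v_null)        # U-statistic near zero when the sample comes from q
print(u_shifted, v_shifted)  # clearly positive when the sample comes from N(2, 1)
```

Both estimators cost $O(n^2)$ evaluations of $u_q$. Turning the statistic into a goodness-of-fit test additionally requires a null distribution or threshold, which this sketch does not cover.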

References

[1] Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conf. on Uncertainty in Artificial Intelligence, 2016.

[2] Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

[3] Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2370–2378, 2016.

[4] Barbour, Andrew D., and Louis H. Y. Chen. "Stein's (magic) method." arXiv preprint arXiv:1411.1179 (2014).

[5] Reinert, Gesine. "A short introduction to Stein’s method." Lecture Notes (2011).

[6] Ley, Christophe, and Yvik Swan. "Stein's density approach and information inequalities." Electronic Communications in Probability 18 (2013).