Stein's method is a general theoretical tool for obtaining bounds on
distances between distributions: it quantifies the error in the
approximation of one distribution by another in a variety of metrics.
Foundation of Stein's Method
(The following content is almost directly copied from [4].)
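Let $Z$ be a random variable on $\mathbb{R}$ with differentiable
probability density $p$. Our main emphasis here is on the standard
normal density, but the argument works more generally. Now let $\mathcal{F}$ be the set of differentiable
real functions $f$ such that $f(x)p(x) \to 0$ as $|x| \to \infty$. Then
clearly
$$\mathbb{E}\{f'(Z) + \psi(Z) f(Z)\} = \int_{-\infty}^{\infty} (fp)'(x)\,dx = 0$$
with $\psi(x) = p'(x)/p(x)$; for the standard
normal density, $\psi(x) = -x$. Hence
any function $h$ of the form $h = f' + \psi f$ with $f \in \mathcal{F}$ automatically has $\mathbb{E}h(Z) = 0$. Conversely, given
any continuous $h$ with $\mathbb{E}h(Z) = 0$,
the function $h$ can
be written in the form $h = f' + \psi f$ with $f \in \mathcal{F}$.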
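(Remark: for the standard normal density, where $\psi(x) = -x$, this is exactly the identity $\mathbb{E}\{f'(Z) - Z f(Z)\} = 0$ of Stein's Lemma, stated later in these notes.)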
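The identity $\mathbb{E}\{f'(Z) + \psi(Z) f(Z)\} = 0$ with $\psi = p'/p$ can be sanity-checked numerically. The following is a minimal sketch under assumptions that are not taken from [4]: the density is the standard logistic (chosen only as a non-normal illustration, for which $\psi(x) = -\tanh(x/2)$), the test function is $f = \sin$, and the helper names are hypothetical.

```python
import math

def logistic_pdf(x):
    """Standard logistic density p(x) = e^{-x} / (1 + e^{-x})^2 (illustrative choice)."""
    e = math.exp(-abs(x))  # p is symmetric, so using |x| avoids overflow
    return e / (1.0 + e) ** 2

def logistic_psi(x):
    """psi(x) = p'(x) / p(x), which equals -tanh(x / 2) for the logistic density."""
    return -math.tanh(x / 2.0)

def expectation(g, pdf, lo=-50.0, hi=50.0, n=20_000):
    """Approximate E[g(Z)] = integral of g(x) * pdf(x) dx with the trapezoidal rule."""
    step = (hi - lo) / n
    total = 0.5 * (g(lo) * pdf(lo) + g(hi) * pdf(hi))
    for i in range(1, n):
        x = lo + i * step
        total += g(x) * pdf(x)
    return total * step

f, f_prime = math.sin, math.cos  # f(x) p(x) -> 0 as |x| -> infinity

# E{f'(Z) + psi(Z) f(Z)} should vanish, while E f'(Z) on its own does not.
stein_mean = expectation(lambda x: f_prime(x) + logistic_psi(x) * f(x), logistic_pdf)
plain_mean = expectation(f_prime, logistic_pdf)
print(f"E[f' + psi f] = {stein_mean:.2e},  E[f'] = {plain_mean:.4f}")
```

The correction term $\psi f$ is what makes the expectation vanish; dropping it leaves a clearly nonzero value.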
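One example: if $h = f' + \psi f$, then $(fp)' = hp$,
and this argument can be reversed: $f$ so defined, for an arbitrary $h$ for which $\mathbb{E}h(Z) = 0$,
is such that $f \in \mathcal{F}$ and $f' + \psi f = h$.
Hence, for a general test function $h$, we can take $f_h$ to be
defined by
$$f_h(w)\,p(w) = \int_{-\infty}^{w} \{h(x) - \mathbb{E}h(Z)\}\,p(x)\,dx,$$
noting that it is then directly checked that $f_h \in \mathcal{F}$. This allows us to
write
$$\mathbb{E}h(X) - \mathbb{E}h(Z) = \mathbb{E}\{f_h'(X) + \psi(X) f_h(X)\}$$
for any random variable $X$ for
which the expectations exist; in particular, for standard normal
approximation,
$$\mathbb{E}h(X) - \mathbb{E}h(Z) = \mathbb{E}\{f_h'(X) - X f_h(X)\}.$$
How to get these equations? Differentiate both sides of the definition of $f_h$ with respect to $w$:
the left: $(f_h p)'(w) = p(w)\{f_h'(w) + \psi(w) f_h(w)\}$;
the right: $\{h(w) - \mathbb{E}h(Z)\}\,p(w)$.
Dividing by $p(w)$ gives $f_h'(w) + \psi(w) f_h(w) = h(w) - \mathbb{E}h(Z)$; setting $w = X$ and taking expectations yields the claim.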
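As a concrete instance of the construction $f_h(w)\,p(w) = \int_{-\infty}^{w}\{h(x) - \mathbb{E}h(Z)\}\,p(x)\,dx$ for the standard normal density: with $h(x) = x^2$ one can check by hand that $f_h(w) = -w$, so that $f_h'(w) - w f_h(w) = w^2 - 1 = h(w) - \mathbb{E}h(Z)$. The sketch below recovers this numerically. The choice of $h$, the truncation of the integral at $-12$, and the function names are illustrative assumptions.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(g, lo, hi, n=4000):
    """Composite Simpson rule on [lo, hi] with n (even) subintervals."""
    step = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * step)
    return total * step / 3.0

def stein_solution(h, w, lower=-12.0):
    """f_h(w) = (1 / phi(w)) * integral_{-inf}^{w} (h(x) - E h(Z)) phi(x) dx.

    The infinite lower limit is truncated at `lower`, where phi is negligible.
    """
    mean_h = simpson(lambda x: h(x) * phi(x), lower, -lower)
    integral = simpson(lambda x: (h(x) - mean_h) * phi(x), lower, w)
    return integral / phi(w)

h = lambda x: x * x  # test function with E h(Z) = 1
points = [-1.5, -0.3, 0.0, 0.8, 2.0]
values = [stein_solution(h, w) for w in points]

for w, v in zip(points, values):
    print(f"w = {w:5.2f}   f_h(w) = {v:9.6f}   (closed form -w = {-w:5.2f})")
```

The same `stein_solution` helper can be called with other test functions $h$, as long as $h$ is integrable against the normal density.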
Important Conclusion!! Taking the supremum
of the left hand side over test functions $h \in \mathcal{H}$ gives a measure of the distance between the
distributions of $X$ and $Z$. The distance can in turn be computed
by taking the supremum of the right side over $\{f_h : h \in \mathcal{H}\}$. ... In many
circumstances, expressing the difference $\mathbb{E}h(X) - \mathbb{E}h(Z)$ in the
entirely equivalent form $\mathbb{E}\{f_h'(X) + \psi(X) f_h(X)\}$ makes it easier to
bound.
Error Bounded[5] Let
be the set of all that
are twice continuously differentiable, and and for all . Let be a standard normal random variable
and be any random variable.
Then
Stein's Lemma, Identity and Operator
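Stein's Lemma. Define the functional
operator $\mathcal{A}$ by
$$\mathcal{A}f(x) = f'(x) - x f(x).$$
If $Z$ has the standard normal
distribution, then $\mathbb{E}\,\mathcal{A}f(Z) = 0$ for all absolutely continuous $f$ with $\mathbb{E}|f'(Z)| < \infty$.
If for some random variable $W$,
$\mathbb{E}\,\mathcal{A}f(W) = 0$ for
all absolutely continuous functions $f$ with $\|f'\|_\infty < \infty$, then $W$ has the standard normal
distribution.
The operator $\mathcal{A}$ is
referred to as a characterizing operator of the standard normal
distribution.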
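Both directions of Stein's Lemma can be illustrated numerically. The sketch below evaluates $\mathbb{E}\{f'(W) - W f(W)\}$ for $f = \sin$ under the standard normal and under a uniform distribution on $[-\sqrt{3}, \sqrt{3}]$, which also has mean 0 and variance 1. The test function, the uniform comparison distribution, and the helper names are assumptions made for illustration only.

```python
import math

def stein_operator(f, f_prime):
    """Return the function x -> (A f)(x) = f'(x) - x f(x)."""
    return lambda x: f_prime(x) - x * f(x)

def mean_under_density(g, pdf, lo, hi, n=20_000):
    """Approximate the integral of g(x) * pdf(x) over [lo, hi] (midpoint rule)."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        total += g(x) * pdf(x)
    return total * step

a = math.sqrt(3.0)
normal_pdf = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
uniform_pdf = lambda x: 1.0 / (2.0 * a)  # uniform on [-a, a]: mean 0, variance 1

Af = stein_operator(math.sin, math.cos)
normal_value = mean_under_density(Af, normal_pdf, -10.0, 10.0)
uniform_value = mean_under_density(Af, uniform_pdf, -a, a)

print(f"E[Af(Z)] under N(0,1):      {normal_value:.2e}")
print(f"E[Af(W)] under the uniform: {uniform_value:.4f}")
```

Matching the first two moments is not enough: the uniform law gives a nonzero value (a direct calculation gives $\cos\sqrt{3}$), so this single test function already separates it from the normal.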
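(Stein) score function. Assume that $\mathcal{X}$ is a subset of $\mathbb{R}^d$ and $p(x)$ is a continuously differentiable (also
called smooth) density whose support is $\mathcal{X}$. The (Stein) score function
of $p$ is defined as
$$s_p(x) = \nabla_x \log p(x) = \frac{\nabla_x p(x)}{p(x)}.$$
We say that a function $f : \mathcal{X} \to \mathbb{R}$ is in the Stein
class of $p$ if $f$ is smooth and satisfies
$$\int_{x \in \mathcal{X}} \nabla_x \big(f(x)\,p(x)\big)\,dx = 0.$$
When $\mathcal{X} = \mathbb{R}^d$, this condition holds if
$$\lim_{\|x\| \to \infty} f(x)\,p(x) = 0.$$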
Stein's Operator. The Stein operator of
$p$ is a linear operator acting on
the Stein class of $p$, defined
as
$$\mathcal{A}_p f(x) = s_p(x) f(x) + \nabla_x f(x).$$
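Stein's Identity. Assume $p(x)$ is a smooth density supported on
$\mathcal{X}$, then
$$\mathbb{E}_{x \sim p}[\mathcal{A}_p f(x)] = \mathbb{E}_{x \sim p}[s_p(x) f(x) + \nabla_x f(x)] = 0$$
for any $f$ that is in the Stein
class of $p$.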
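Lemma [6]. Assume $p(x)$ and $q(x)$ are smooth densities supported on
$\mathcal{X}$ and $f(x)$ is in the Stein class of $p$, we have
$$\mathbb{E}_{x \sim p}[\mathcal{A}_q f(x)] = \mathbb{E}_{x \sim p}\big[(s_q(x) - s_p(x))\, f(x)^\top\big].$$
This follows by subtracting Stein's identity $\mathbb{E}_{x \sim p}[\mathcal{A}_p f(x)] = 0$ from the left-hand side.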
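A one-dimensional numerical check of the relation $\mathbb{E}_p[\mathcal{A}_q f] = \mathbb{E}_p[(s_q - s_p) f]$ is sketched below. It assumes $p = N(0, 1)$, $q = N(1, 2^2)$ and $f(x) = \sin x + \cos x$; these choices and all function names are illustrative and not taken from [6]. For this particular setup both sides work out to $e^{-1/2}$.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of N(mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normal_score(x, mean=0.0, std=1.0):
    """Score function d/dx log density of N(mean, std^2)."""
    return -(x - mean) / std ** 2

def expect_p(g, lo=-10.0, hi=10.0, n=20_000):
    """E_p[g(x)] for p = N(0, 1), approximated with the midpoint rule."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        total += g(x) * normal_pdf(x)
    return total * step

mu_q, std_q = 1.0, 2.0                       # parameters of q (assumed)
s_p = lambda x: normal_score(x)              # score of p = N(0, 1)
s_q = lambda x: normal_score(x, mu_q, std_q) # score of q = N(1, 4)
f = lambda x: math.sin(x) + math.cos(x)
f_prime = lambda x: math.cos(x) - math.sin(x)

lhs = expect_p(lambda x: s_q(x) * f(x) + f_prime(x))  # E_p[A_q f]
rhs = expect_p(lambda x: (s_q(x) - s_p(x)) * f(x))    # E_p[(s_q - s_p) f]
print(f"E_p[A_q f] = {lhs:.6f},  E_p[(s_q - s_p) f] = {rhs:.6f}")
```

The right-hand side only involves the score difference, which is why this lemma is the starting point for the kernelized discrepancy defined further on.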
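Lemma. Stein's identity (again..)
Roughly speaking, it relies on the basic fact that two smooth densities
$p(x)$ and $q(x)$ supported on $\mathbb{R}$ are
identical if and only if
$$\mathbb{E}_{x \sim p}[s_q(x) f(x) + \nabla_x f(x)] = 0$$
for smooth functions $f(x)$ with
proper zero-boundary conditions, where $s_q(x) = \nabla_x \log q(x) = \nabla_x q(x) / q(x)$ is called the
(Stein) score function of $q(x)$. The
equation is known as Stein's identity.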
Stein Discrepancy Measure and Kernelized Stein Discrepancy (KSD)
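The Stein discrepancy measure between $p$ and $q$ is
$$\mathbb{S}(p, q) = \max_{f \in \mathcal{F}} \Big(\mathbb{E}_{x \sim p}[s_q(x) f(x) + \nabla_x f(x)]\Big)^2,$$
where $\mathcal{F}$ is a set of
smooth functions that satisfies Stein's identity and is also rich enough to ensure $\mathbb{S}(p, q) > 0$ whenever $p \neq q$.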
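The Kernelized Stein Discrepancy (KSD) between distributions
$p$ and $q$ is defined as
$$\mathbb{S}(p, q) = \mathbb{E}_{x, x' \sim p}\big[\delta_{q,p}(x)^\top k(x, x')\, \delta_{q,p}(x')\big],$$
where $\delta_{q,p}(x) = s_q(x) - s_p(x)$ is the score difference between $p$ and $q$, and $x, x'$ are i.i.d. draws from $p(x)$.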
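Proposition. Define $g_{p,q}(x) = p(x)\,(s_q(x) - s_p(x))$. Assume
$k(x, x')$ is integrally
strictly positive definite, and $p$, $q$ are continuous densities with $\|g_{p,q}\|_2^2 < \infty$, we have $\mathbb{S}(p, q) \geq 0$ and $\mathbb{S}(p, q) = 0$ if and only if $p = q$. (This establishes $\mathbb{S}(p, q)$ as a valid discrepancy
measure.)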
Definition. A kernel $k(x, x')$ is said to be in the Stein
class of $p$ if $k(x, x')$ has continuous second
order partial derivatives, and both $k(x, \cdot)$ and $k(\cdot, x)$ are
in the Stein class of $p$ for any
fixed $x$.
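Theorem. Assume $p$ and $q$ are smooth densities and $k(x, x')$ is in the Stein class of
$p$. Define
$$u_q(x, x') = s_q(x)^\top k(x, x')\, s_q(x') + s_q(x)^\top \nabla_{x'} k(x, x') + \nabla_x k(x, x')^\top s_q(x') + \operatorname{trace}\big(\nabla_{x, x'} k(x, x')\big),$$
then,
$$\mathbb{S}(p, q) = \mathbb{E}_{x, x' \sim p}[u_q(x, x')].$$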
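The equivalence between the score-difference form of KSD and the $u_q$ form can be checked in one dimension. The sketch below assumes $p = N(0, 1)$, $q = N(\mu, 1)$ and an RBF kernel $k(x, x') = \exp(-(x - x')^2 / (2h^2))$ with $h = 1$; in that case the score difference is the constant $\mu$ and a direct calculation gives $\mathbb{S}(p, q) = \mu^2 / \sqrt{3}$. The kernel, the two densities, and the helper names are illustrative assumptions.

```python
import math

def phi(x):
    """Standard normal density (the density p in this sketch)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def rbf(x, y, h=1.0):
    """RBF kernel k(x, x') with bandwidth h."""
    return math.exp(-((x - y) ** 2) / (2.0 * h * h))

def u_q(x, y, score_q, h=1.0):
    """u_q(x, x') in one dimension for the RBF kernel."""
    k = rbf(x, y, h)
    d = x - y
    dk_dx = -d / h ** 2 * k                        # derivative of k in x
    dk_dy = d / h ** 2 * k                         # derivative of k in x'
    d2k = (1.0 / h ** 2 - d * d / h ** 4) * k      # mixed second derivative
    sx, sy = score_q(x), score_q(y)
    return sx * k * sy + sx * dk_dy + dk_dx * sy + d2k

def expect_pp(g, lo=-8.0, hi=8.0, n=160):
    """E[g(x, x')] for x, x' i.i.d. N(0, 1), via a sum on a uniform grid.

    The integrand is negligible at the grid edges, so endpoint weights
    do not matter here.
    """
    step = (hi - lo) / n
    grid = [lo + i * step for i in range(n + 1)]
    weights = [phi(x) * step for x in grid]
    total = 0.0
    for xi, wi in zip(grid, weights):
        for xj, wj in zip(grid, weights):
            total += wi * wj * g(xi, xj)
    return total

mu = 1.0
score_p = lambda x: -x          # score of p = N(0, 1)
score_q = lambda x: -(x - mu)   # score of q = N(mu, 1)

ksd_u = expect_pp(lambda x, y: u_q(x, y, score_q))
ksd_delta = expect_pp(
    lambda x, y: (score_q(x) - score_p(x)) * rbf(x, y) * (score_q(y) - score_p(y))
)
print(f"E[u_q] = {ksd_u:.6f},  E[delta k delta] = {ksd_delta:.6f}")
```

The practical point of the $u_q$ form is visible in the code: `u_q` needs only the score of $q$, whereas the score-difference form also needs the score of $p$, which is usually unknown.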
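Theorem. Assume $k(x, x')$ is a positive definite
kernel in the Stein class of $p$,
with positive eigenvalues $\{\lambda_j\}$ and
eigenfunctions $\{e_j(x)\}$, then $u_q(x, x')$ is also a positive
definite kernel, and can be rewritten into
$$u_q(x, x') = \sum_j \lambda_j\, [\mathcal{A}_q e_j(x)]^\top [\mathcal{A}_q e_j(x')],$$
where $\mathcal{A}_q e_j(x) = s_q(x) e_j(x) + \nabla_x e_j(x)$ is the Stein operator acting on $e_j$. In addition,
$$\mathbb{S}(p, q) = \sum_j \lambda_j\, \big\|\mathbb{E}_{x \sim p}[\mathcal{A}_q e_j(x)]\big\|_2^2.$$
Note that although the $\{e_j\}$ are orthonormal, the
$\{\mathcal{A}_q e_j(x)\}$ are no longer orthonormal in general.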
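Theorem. Let $\mathcal{H}$ be the RKHS related to a
positive definite kernel $k(x, x')$ in the Stein class of $p$. Denote by $\beta(x') = \mathbb{E}_{x \sim p}[\mathcal{A}_q k_{x'}(x)]$, then
$$\mathbb{S}(p, q) = \|\beta\|_{\mathcal{H}^d}^2.$$
Further, we have $\langle f, \beta \rangle_{\mathcal{H}^d} = \mathbb{E}_{x \sim p}[\operatorname{trace}(\mathcal{A}_q f)]$ for $f \in \mathcal{H}^d$, and hence
$$\sqrt{\mathbb{S}(p, q)} = \max_{f \in \mathcal{H}^d} \Big\{ \mathbb{E}_{x \sim p}[\operatorname{trace}(\mathcal{A}_q f)] \;\; \text{s.t.} \;\; \|f\|_{\mathcal{H}^d} \leq 1 \Big\},$$
where the maximum is achieved when $f = \beta / \|\beta\|_{\mathcal{H}^d}$.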
How to Use KSD? (Goodness-of-fit Testing Based on KSD)
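The form $\mathbb{S}(p, q) = \mathbb{E}_{x, x' \sim p}[u_q(x, x')]$ allows efficient estimation of $\mathbb{S}(p, q)$ in practice. Given an i.i.d.
sample $\{x_i\}_{i=1}^n$ drawn from $p$
($p$ is unknown) and the score
function $s_q(x)$, we can estimate $\mathbb{S}(p, q)$
by
$$\hat{\mathbb{S}}_u(p, q) = \frac{1}{n(n-1)} \sum_{1 \leq i \neq j \leq n} u_q(x_i, x_j),$$
where
$\hat{\mathbb{S}}_u(p, q)$ is a U-statistic. $\mathbb{S}(p, q)$ can also be estimated using a V-statistic of the form $\frac{1}{n^2} \sum_{i,j=1}^{n} u_q(x_i, x_j)$, which provides a biased estimator.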
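The two estimators can be sketched in a few lines. The example assumes one-dimensional data, an RBF kernel with bandwidth $h = 1$, samples drawn from $N(0, 1)$, and two candidate models $q$: the matching $N(0, 1)$ and the shifted $N(1, 1)$. These settings, the sample size, the seed, and the function names are all illustrative assumptions; a full goodness-of-fit test would additionally need a rejection threshold, which is not covered here.

```python
import math
import random

def u_q(x, y, score_q, h=1.0):
    """u_q(x, x') in one dimension for the RBF kernel exp(-(x - x')^2 / (2 h^2))."""
    d = x - y
    k = math.exp(-d * d / (2.0 * h * h))
    dk_dx = -d / h ** 2 * k
    dk_dy = d / h ** 2 * k
    d2k = (1.0 / h ** 2 - d * d / h ** 4) * k
    sx, sy = score_q(x), score_q(y)
    return sx * k * sy + sx * dk_dy + dk_dx * sy + d2k

def ksd_u_statistic(xs, score_q, h=1.0):
    """Unbiased U-statistic: average of u_q over all pairs with i != j."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += u_q(xs[i], xs[j], score_q, h)  # u_q is symmetric
    return 2.0 * total / (n * (n - 1))

def ksd_v_statistic(xs, score_q, h=1.0):
    """Biased V-statistic: average of u_q over all pairs, including i == j."""
    n = len(xs)
    total = sum(u_q(x, y, score_q, h) for x in xs for y in xs)
    return total / (n * n)

rng = random.Random(0)                          # fixed seed for reproducibility
xs = [rng.gauss(0.0, 1.0) for _ in range(400)]  # sample from p = N(0, 1)

score_matched = lambda x: -x          # q = N(0, 1), same as p
score_shifted = lambda x: -(x - 1.0)  # q = N(1, 1), differs from p

matched_u = ksd_u_statistic(xs, score_matched)
shifted_u = ksd_u_statistic(xs, score_shifted)
matched_v = ksd_v_statistic(xs, score_matched)
print(f"U-stat (q = p): {matched_u:.4f}   U-stat (q shifted): {shifted_u:.4f}")
print(f"V-stat (q = p): {matched_v:.4f}")
```

When $q = p$ the U-statistic should fluctuate around zero, while for the shifted model it should land near the population value $1/\sqrt{3}$ for this setup. The V-statistic is never negative because $u_q$ is a positive definite kernel, which is the source of its bias. Both estimators cost $O(n^2)$ kernel evaluations.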
References
[1] Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conf. on Uncertainty in Artificial Intelligence, 2016.
[2] Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.
[3] Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2370–2378, 2016.
[4] Barbour, A. D. and Chen, L. H. Y. Stein's (magic) method. arXiv preprint arXiv:1411.1179, 2014.
[5] Reinert, G. A short introduction to Stein's method. Lecture Notes, 2011.
[6] Ley, C. and Swan, Y. Stein's density approach and information inequalities. Electronic Communications in Probability 18, 2013.