Bayesian Inference

What is Bayesian Inference?

Statistical inference is the process of deducing properties about a population or probability distribution from data. Bayesian inference is therefore the same process carried out using Bayes' theorem. [1]

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. [2]
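As a minimal sketch of this updating process, the snippet below tracks the probability of two competing hypotheses about a coin as flips arrive one at a time. The hypotheses ("fair" and "biased"), their heads probabilities, the equal starting belief, and the flip sequence are all assumptions chosen for illustration; they do not come from the references.

```python
def update(prior, likelihoods):
    """One Bayes update: posterior[h] is proportional to prior[h] * likelihood[h]."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    evidence = sum(unnormalized.values())  # normalising constant
    return {h: w / evidence for h, w in unnormalized.items()}

# Assumed (hypothetical) probability of heads under each hypothesis.
p_heads = {"fair": 0.5, "biased": 0.8}
belief = {"fair": 0.5, "biased": 0.5}  # assumed equal prior belief

for flip in "HHTH":  # assumed observations: H = heads, T = tails
    likelihoods = {h: (p if flip == "H" else 1.0 - p) for h, p in p_heads.items()}
    belief = update(belief, likelihoods)
    print(flip, {h: round(b, 3) for h, b in belief.items()})
```

Each pass through the loop uses the previous posterior as the new prior, which is what "updating as more evidence becomes available" amounts to in this toy setting.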

Statistical inference consists of learning about what we do not observe from what we do observe. Bayesian inference is the process of producing statistical inferences from a Bayesian point of view. [3]

Formal Description of Bayesian Inference [2]

Definitions

$x$, a data point in general. This may in fact be a vector of values.
$\theta$, the parameter of the data point's distribution, i.e., $x \sim p(x \mid \theta)$. This may in fact be a vector of parameters.
$\alpha$, the hyperparameter of the parameter distribution, i.e., $\theta \sim p(\theta \mid \alpha)$. This may in fact be a vector of hyperparameters.
$\mathbf{X}$ is the sample, a set of $n$ observed data points, i.e., $x_1, \ldots, x_n$.
$\tilde{x}$, a new data point whose distribution is to be predicted.

Bayesian Inference

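The prior distribution is the distribution of the parameters $\theta$ before any data is observed, i.e. $p(\theta \mid \alpha)$, where $\alpha$ denotes the hyperparameters. The prior distribution might not be easily determined; in such a case, one possibility may be to use the Jeffreys prior to obtain a prior distribution before updating it with newer observations.
The sampling distribution is the distribution of the observed data $\mathbf{X}$ conditional on its parameters, i.e. $p(\mathbf{X} \mid \theta)$. This is also termed the likelihood, especially when viewed as a function of the parameters, sometimes written $L(\theta \mid \mathbf{X}) = p(\mathbf{X} \mid \theta)$.
The marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalized over the parameters, i.e. $p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta) \, p(\theta \mid \alpha) \, d\theta$.
The posterior distribution is the distribution of the parameters after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference:

$$p(\theta \mid \mathbf{X}, \alpha) = \frac{p(\mathbf{X} \mid \theta) \, p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta) \, p(\theta \mid \alpha)$$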
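To make Bayes' rule concrete, here is a small grid-approximation sketch for a single parameter. It assumes a coin-flip (Bernoulli) model with a flat prior on theta and hypothetical data of 7 heads in 10 flips; none of these choices come from the references. The sum over the grid plays the role of the marginal likelihood.

```python
def grid_posterior(heads, flips, grid_size=1001):
    """Approximate p(theta | X) on an evenly spaced grid over [0, 1]."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # assumed flat prior
    likelihood = [t**heads * (1.0 - t) ** (flips - heads) for t in thetas]
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)  # stands in for the evidence (up to the grid spacing)
    return thetas, [u / total for u in unnormalized]

# Hypothetical data: 7 heads in 10 flips.
thetas, posterior = grid_posterior(heads=7, flips=10)
posterior_mean = sum(t * w for t, w in zip(thetas, posterior))
print("posterior mean of theta:", round(posterior_mean, 4))
```

For this particular model the posterior is known in closed form (a Beta distribution with mean (7 + 1) / (10 + 2)), so the grid result can be checked against it. Grid approximation is only practical for very low-dimensional parameters.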

Bayesian prediction

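The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior:

$$p(\tilde{x} \mid \mathbf{X}, \alpha) = \int p(\tilde{x} \mid \theta) \, p(\theta \mid \mathbf{X}, \alpha) \, d\theta$$

The prior predictive distribution is the distribution of a new data point, marginalized over the prior:

$$p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta) \, p(\theta \mid \alpha) \, d\theta$$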
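Both predictive distributions are averages of the data model over a distribution on the parameters, so they can be estimated by sampling. The sketch below assumes a Bernoulli model with a Beta(a, b) prior, for which the posterior is Beta(a + heads, b + tails); the hyperparameters and the data are hypothetical values chosen for illustration.

```python
import random

def predictive_prob_heads(a, b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(new flip = heads) when theta ~ Beta(a, b).

    Since P(heads | theta) = theta, the predictive probability is the
    average of theta over draws from the Beta distribution.
    """
    rng = random.Random(seed)
    return sum(rng.betavariate(a, b) for _ in range(draws)) / draws

a, b = 2.0, 2.0       # assumed prior hyperparameters
heads, flips = 7, 10  # assumed observed sample

prior_pred = predictive_prob_heads(a, b)
post_pred = predictive_prob_heads(a + heads, b + flips - heads)
print("prior predictive P(heads):    ", round(prior_pred, 3))
print("posterior predictive P(heads):", round(post_pred, 3))
```

In this conjugate case the exact answers are a / (a + b) and (a + heads) / (a + b + flips), which gives a quick sanity check on the estimates.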

Computational difficulties

Bayes' theorem tells us that computing the posterior requires three terms: a prior, a likelihood, and the evidence. The first two can be expressed easily because they are part of the assumed model. The third term, the normalisation factor, must be computed by integrating the product of the likelihood and the prior over all parameter values: $p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta) \, p(\theta \mid \alpha) \, d\theta$.
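For a single parameter the normalisation factor can be computed by ordinary numerical integration. The sketch below does this with the trapezoid rule, assuming a coin-flip likelihood, a flat prior on [0, 1], and hypothetical data of 7 heads in 10 flips, and compares the result with the closed form available for this special case.

```python
import math

def evidence_trapezoid(heads, flips, steps=10_000):
    """Trapezoid-rule estimate of the integral of likelihood * prior over [0, 1]."""
    def integrand(theta):
        # Likelihood of one specific flip sequence times an assumed flat prior (= 1).
        return theta**heads * (1.0 - theta) ** (flips - heads)

    h = 1.0 / steps
    inner = sum(integrand(i * h) for i in range(1, steps))
    return h * (0.5 * integrand(0.0) + inner + 0.5 * integrand(1.0))

def evidence_exact(heads, flips):
    """Closed form for this model: the Beta function B(heads + 1, flips - heads + 1)."""
    tails = flips - heads
    return math.factorial(heads) * math.factorial(tails) / math.factorial(flips + 1)

numeric = evidence_trapezoid(7, 10)
exact = evidence_exact(7, 10)
print(numeric, exact)
```

The same grid strategy with m points per dimension needs m**d evaluations for d parameters, which is why it stops being practical once the model has more than a few parameters.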

Methods

The integral can become intractable in higher dimensions. The two most widely used approaches are Markov chain Monte Carlo (MCMC, sampling-based) and variational inference (optimization-based).

MCMC

In MCMC, we first construct an ergodic Markov chain on $\theta$ whose stationary distribution is the posterior $p(\theta \mid \mathbf{X}, \alpha)$. Then we sample from the chain to collect samples from the stationary distribution. Finally, we approximate the posterior with an empirical estimate constructed from (a subset of) the collected samples. [5]

Monte Carlo sampling methods draw independent samples from a distribution. However, they are inefficient and may be intractable for high-dimensional probabilistic models. Markov chain Monte Carlo provides an alternative approach to randomly sampling a high-dimensional probability distribution, in which the next sample depends on the current sample.

A Markov chain is a systematic method for generating a sequence of random variables in which the current value depends probabilistically on the value of the previous variable.
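A tiny simulation may help fix the idea. The sketch below assumes a hypothetical two-state chain with made-up transition probabilities and records how often each state is visited; over a long run these frequencies approach the chain's stationary distribution.

```python
import random

# Hypothetical transition probabilities:
# TRANSITIONS[i][j] = P(next state = j | current state = i).
TRANSITIONS = [[0.9, 0.1],
               [0.5, 0.5]]

def simulate_chain(steps, start=0, seed=0):
    """Run the chain and return the fraction of steps spent in each state."""
    rng = random.Random(seed)
    state, visits = start, [0, 0]
    for _ in range(steps):
        # The next state depends only on the current state.
        state = 0 if rng.random() < TRANSITIONS[state][0] else 1
        visits[state] += 1
    return [v / steps for v in visits]

frequencies = simulate_chain(200_000)
print(frequencies)
```

For these transition probabilities the stationary distribution solves 0.1 * pi_0 = 0.5 * pi_1, giving pi = (5/6, 1/6), so the visit frequencies should settle near those values regardless of the starting state.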

A Markov chain is a special type of stochastic process, a mathematical object that characterizes sequences of random variables. Of particular interest are the dynamics and the limiting behavior of the sequence.

MCMC algorithms are sensitive to their starting point and often require a warm-up (burn-in) phase to move toward a fruitful part of the search space. The samples drawn during this phase are discarded, and only the subsequent samples are collected.

Gibbs Sampling

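Gibbs sampling updates one variable at a time by drawing it from its conditional distribution given the current values of all the other variables. It is appropriate for those probabilistic models where this conditional distribution can be calculated and sampled from.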
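A minimal sketch of Gibbs sampling, assuming a target that is not taken from the references: a standard bivariate normal with correlation rho. For this target each conditional is itself normal, x | y ~ N(rho * y, 1 - rho^2) and symmetrically for y | x, so both can be sampled directly.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=1_000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho**2)  # standard deviation of each conditional
    x, y = 0.0, 0.0                    # arbitrary starting point
    samples = []
    for i in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_sd)  # draw x given the current y
        y = rng.gauss(rho * x, cond_sd)  # draw y given the new x
        if i >= burn_in:                 # discard warm-up draws
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=50_000)
mean_x = sum(x for x, _ in samples) / len(samples)
mean_xy = sum(x * y for x, y in samples) / len(samples)
print("mean of x:", round(mean_x, 3), " mean of x*y:", round(mean_xy, 3))
```

Because both variables have zero mean and unit variance under the target, the average of x * y estimates the correlation and should come out close to the chosen rho.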

Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm is appropriate for those probabilistic models where we cannot directly sample the so-called next-state probability distribution. Unlike Gibbs sampling, the algorithm does not assume that we can generate next-state samples from a particular target distribution. Instead, it draws a candidate from a surrogate or proposal probability distribution (sometimes called the kernel) and then applies an acceptance criterion that decides whether the candidate is accepted into the chain or discarded.
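The propose-then-accept loop can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the target is the unnormalized posterior of a coin's heads probability after a hypothetical 7 heads in 10 flips with a flat prior, and the proposal is a Gaussian random walk. Because that proposal is symmetric, the acceptance ratio reduces to the ratio of target densities (the plain Metropolis special case).

```python
import math
import random

def log_unnormalized_posterior(theta, heads=7, flips=10):
    """Log of likelihood * flat prior; the evidence is never needed."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero density outside (0, 1)
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)

def metropolis(n_samples, step=0.2, burn_in=2_000, seed=0):
    rng = random.Random(seed)
    theta = 0.5  # arbitrary starting point
    log_p = log_unnormalized_posterior(theta)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = theta + rng.gauss(0.0, step)  # symmetric random-walk proposal
        log_p_new = log_unnormalized_posterior(proposal)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            theta, log_p = proposal, log_p_new
        if i >= burn_in:  # discard warm-up draws
            samples.append(theta)  # a rejected proposal repeats the current value
    return samples

samples = metropolis(50_000)
posterior_mean = sum(samples) / len(samples)
print("estimated posterior mean:", round(posterior_mean, 3))
```

Only ratios of the target density appear in the acceptance step, so the intractable normalisation factor cancels. For this toy target the exact posterior mean is (7 + 1) / (10 + 2), which the estimate should approach.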

Variational Inference

Variational inference is widely used to approximate posterior densities for Bayesian models, as an alternative strategy to MCMC sampling. Compared to MCMC, variational inference tends to be faster and easier to scale to large data.
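The optimization view can be summarized in one identity, written here in the notation used in the formal description above, with $q(\theta)$ denoting a candidate approximation chosen from some tractable family (the symbol $q$ is introduced here for illustration). The log evidence splits into the evidence lower bound (ELBO) and the Kullback-Leibler divergence from $q$ to the posterior:

```latex
\begin{aligned}
\log p(\mathbf{X} \mid \alpha)
  &= \mathrm{ELBO}(q)
   + \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid \mathbf{X}, \alpha)\big), \\
\mathrm{ELBO}(q)
  &= \mathbb{E}_{q}\big[\log p(\mathbf{X}, \theta \mid \alpha)\big]
   - \mathbb{E}_{q}\big[\log q(\theta)\big].
\end{aligned}
```

Since the left-hand side does not depend on $q$, maximizing the ELBO over the chosen family is equivalent to minimizing the KL divergence to the posterior, and the ELBO involves only the joint density, so the intractable evidence never has to be computed.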

References

[1] https://towardsdatascience.com/probability-concepts-explained-bayesian-inference-for-parameter-estimation-90e8930e5348

[2] https://en.wikipedia.org/wiki/Bayesian_inference

[3] https://towardsdatascience.com/bayesian-inference-problem-mcmc-and-variational-inference-25a8aa9bce29

[4] Speagle, J. S. (2019). A conceptual introduction to Markov chain Monte Carlo methods. arXiv preprint arXiv:1909.12313.

[5] Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859-877.