Bayesian Inference
What is Bayesian Inference?
Statistical inference is the process of deducing properties about a population or probability distribution from data. Bayesian inference is therefore just the process of deducing properties about a population or probability distribution from data using Bayes' therem. [1]
Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. [2]
Statistical inference consists in learning about what we do not observe based on what we observe. Bayesian inference is the process of producing statistical inference taking a Bayesian point of view. [3]
Formal Description of Bayesian Inference [2]
Definitions
, a data point in general. This may in fact be a vector of values. , the parameter of the data point's distribution, i.e., . This may in fact be a vector of parameters. , the hyperparameter of the parameter distribution, i.e., . This may in fact be a vector of hyperparameters. is the sample, a set of observed data points, i.e., . , a new data point whose distribution is to be predicted.
Bayesian Inference
- The prior distribution is the distribution of the
parameters before any data is observed, i.e.
. The prior distribution might not be easily determined; in such a case, one possibility may be to use the Jeffreys prior to obtain a prior distribution before updating it with newer observations. - The sampling distribution is the distribution of
the observed data conditional on its parameters, i.e.
. This is also termed the likelihood, especially when viewed as a function of the parameters, sometimes written . - The marginal likelihood (sometimes also termed the
evidence) is the distribution of the observed data
marginalized over the parameters, i.e.
. - The posterior distribution is the distribution of the parameters after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference:
Bayesian prediction
- The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior:
- The proir predictive distribution is the distribution of a new data point, marginalized over the prior:
Computational difficulties
The Bayes theorem tells us that the computation of the posterior
requires three terms: a prior, a likelihood and evidence. The first two
can be expressed easily as they are part of the assumed model. However,
the third term, that is normalisation factor, requires to be computed
Methods
The integral can become intractable in higher dimensions. Two most used approached are Markov Chain Monte Carlo ( simpling-based) and Variational Inference methods (from optimization perspectiver).
MCMC
In MCMC, we first construct an ergodic Markov chain
on
Monte carlo sampling methods are able to draw independent samples from distribution. While, they are not effective and may be intractable for high-dimensinal probabilistic models. Markov Chain Monte Carlo provides an alternate approach to random sampling a high-dimensional probability distribution where the next sample is dependent upon the current sample.
Markov chain is a systematic method for generating a sequence of random variables where the current value is probabilistically dependent on the value of the prior variable.
A Markov chain is a special type of stochastic process, which deals with characterization of sequences of random variables. Special interest is paid to the dynamic and the limiting behaviors of the sequence.
MCMC algorithms are sensitive to their starting point, and often require a warm-up phase or burn-in phase to move in towards a fruitful part of the search space, after which prior samples can be discarded and useful samples can be collected.
Gibbs Sampling
Gibbs Sampling is appropriate for those probabilistic models where this conditional probability can be calculated.
Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm is appropriate for those probabilistic models where we cannot directly sample the so-called next state probability distribution, . Unlike the Gibbs chain, the algorithm doesnot assume that we can generate next state samples from a particular target distribution. Instead, the Metropolis-Hastings algorithm involves using a surrogate or proposal probability distribution that is sampled (sometimes called the kernel), then an acceptance criterion that decides whether the new sample is accepted into the chain or discarded.
Varational Inference
Variational inference is widely used to approximate posterior densities for Bayesian models, an alternative strategy to MCMC sampling. Compared to MCMC, variational inference tends to be faster and easier to scale to large data.
References
[1] https://towardsdatascience.com/probability-concepts-explained-bayesian-inference-for-parameter-estimation-90e8930e5348
[2] https://en.wikipedia.org/wiki/Bayesian_inference
[3] https://towardsdatascience.com/bayesian-inference-problem-mcmc-and-variational-inference-25a8aa9bce29
[4] @article{speagle2019conceptual, title={A conceptual introduction to markov chain monte carlo methods}, author={Speagle, Joshua S}, journal={arXiv preprint arXiv:1909.12313}, year={2019} }
[5] @article{blei2017variational, title={Variational inference: A review for statisticians}, author={Blei, David M and Kucukelbir, Alp and McAuliffe, Jon D}, journal={Journal of the American statistical Association}, volume={112}, number={518}, pages={859--877}, year={2017}, publisher={Taylor & Francis} }