Monte Carlo Method and Markov Chain

MCMC Review (1)

Posted by Fan Gong on Apr 25, 2019

Recently I got a new project that needs to use MCMC. That reminds me of the time when I first learned it in school. I remember discussing it with Piaoge in Butler Library back then. How time flies! So this article aims to review what I have learned and make sure I understand everything correctly.

To understand what MCMC is, I would like to separate the two 'MC's and look at them one by one. At first I wanted to summarize this topic in one article, but later I realized that this is a very good chance for me to review the Markov Chain, the Monte Carlo Method, as well as Bayesian Inference. So in the end I decided to split it into the following articles:

  • Intro to Monte Carlo Method and Markov Chain
  • Metropolis-Hastings Sampling
  • Gibbs Sampling
  • MCMC Application with Bayesian Inference

Notice that this review of the Monte Carlo method and Markov Chains is oriented toward MCMC, so keep in mind that our final goal is to sample from a target distribution.

1. Monte Carlo Method

Definition

Generally speaking, the Monte Carlo method relies on repeated random sampling to obtain numerical results.

Monte Carlo methods vary, but tend to follow a particular pattern:

  • Define a domain of possible inputs
  • Generate inputs randomly from a probability distribution over the domain
  • Perform a deterministic computation on the inputs
  • Aggregate the results (e.g., take the expectation)

Application

A classic use of the Monte Carlo method is to approximate intractable integrals, such as the denominator of Bayes' formula: \[\int p(x|z)p(z)dz = E_{p(z)}[p(x|z)] \text{ , where } x \text{ is already known}\]
The Monte Carlo method finds an approximate value for this integral. The key idea is to draw samples \(z_i \sim p(z)\), evaluate \(p(x|z_i)\) for each, and average; by the law of large numbers this sample mean converges to the expectation we want.
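
To make this concrete, here is a minimal sketch in Python. The toy model below (prior \(z \sim N(0,1)\), likelihood \(x|z \sim N(z,1)\)) is my own choice for illustration: it has a closed-form evidence, so we can check the Monte Carlo estimate against the exact answer.

```python
import numpy as np
from scipy.stats import norm

# Toy model (my own choice for illustration):
#   prior:      z ~ N(0, 1)
#   likelihood: x | z ~ N(z, 1)
# The evidence p(x) = integral of p(x|z) p(z) dz has a closed form here
# (marginally, x ~ N(0, 2)), so the estimate can be checked.

rng = np.random.default_rng(0)
x = 1.0  # the observed value; x is already known

z = rng.normal(0.0, 1.0, size=100_000)           # z_i ~ p(z)
estimate = norm.pdf(x, loc=z, scale=1.0).mean()  # average of p(x | z_i)

exact = norm.pdf(x, loc=0.0, scale=np.sqrt(2.0))
print(f"Monte Carlo estimate: {estimate:.4f}, exact: {exact:.4f}")
```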

One simple application is using it to estimate \(\pi\) (a code sketch follows these steps):

  1. Define the domain - We inscribe a circle of radius \(r\) in a square with side length \(2r\).
  2. Sample from a Uniform Distribution - We sample n points uniformly within the square.
  3. Aggregate the results - The fraction of points that land inside the circle estimates the ratio of the two areas, \(\pi r^2 / (2r)^2 = \pi/4\), so four times that fraction is our estimate of \(\pi\).
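
Here is a minimal sketch of the three steps, with the radius fixed at \(r = 1\) so the square is \([-1, 1] \times [-1, 1]\):

```python
import numpy as np

# Step 1: domain. Unit circle (r = 1) inscribed in the square [-1, 1] x [-1, 1].
# Step 2: sample n points uniformly from the square.
# Step 3: aggregate. The fraction inside the circle is about pi/4, so multiply by 4.

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.uniform(-1.0, 1.0, size=n)
y = rng.uniform(-1.0, 1.0, size=n)
inside = x**2 + y**2 <= 1.0

pi_hat = 4.0 * inside.mean()
print(f"pi estimate: {pi_hat:.4f}")  # close to 3.1416 for large n
```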

Advantages

For most distributions, especially high-dimensional ones, we may be able to write down the probability density function explicitly, yet still not know where the density is large; in reality, most of the domain has density near 0. That is the dilemma for a deterministic numerical solution: effort is wasted on regions that contribute almost nothing. Here lies the advantage of the Monte Carlo method: samples almost always come from high-density areas, so with very low computational cost we can get pretty decent estimates. In other words, the sample sequence in some sense accomplishes the task of "searching for the high-density area".

Limitation

However, for most high-dimensional distributions it is very hard to find a proper proposal distribution that covers the target distribution. That is why we need to combine it with Markov Chains.

2. Markov Chain

Definition

A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Mathematically:

\[P(X_{t+1}| X_{t},X_{t-1},...,X_{2},X_{1}) = P(X_{t+1}|X_{t})\]

And we have a transition matrix \(P\) that defines the probability of each state transition, where \(P_{ij}\) is the probability of moving from state \(i\) to state \(j\).
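
As a small illustration, here is a hand-picked 3-state transition matrix (the numbers are my own toy example) and a short simulation in which each step depends only on the current state:

```python
import numpy as np

# A toy 3-state chain (states 0, 1, 2); P[i, j] = P(next = j | current = i).
# Every row is a probability distribution, so each row sums to 1.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],
])
assert np.allclose(P.sum(axis=1), 1.0)

# Simulate a short path: the next state depends only on the current state.
rng = np.random.default_rng(0)
state = 0
path = [state]
for _ in range(10):
    state = rng.choice(3, p=P[state])  # row `state` holds the next-state probabilities
    path.append(int(state))
print(path)
```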

Markov Chain Property

One very important property of Markov Chains is the following. Suppose the chain has \(p\) states and the initial distribution is \(\pi_0 = [\pi_0(1),...,\pi_0(p)]\). After \(n\) transition steps the distribution is \(\pi_0 P^n\), and (for an irreducible, aperiodic chain) it converges to a stationary distribution. Moreover, this stationary distribution is determined solely by the transition matrix \(P\), not by the starting point \(\pi_0\). Mathematically speaking: \[\pi P = \pi\]
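
We can check this numerically with the same toy matrix as in the sketch above: start from two very different initial distributions and repeatedly apply \(P\); both converge to the same \(\pi\), which \(P\) then leaves unchanged:

```python
import numpy as np

# Same toy transition matrix as above.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],
])

# Two very different starting distributions pi_0.
pi_a = np.array([1.0, 0.0, 0.0])
pi_b = np.array([0.0, 0.0, 1.0])

for _ in range(50):  # pi_0 P^n, computed one step at a time
    pi_a = pi_a @ P
    pi_b = pi_b @ P

print(pi_a)                         # both starting points reach the same pi
print(pi_b)
print(np.allclose(pi_a @ P, pi_a))  # True: pi P = pi
```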

Based on this key property of Markov Chains, people had the following thought: we know a Markov chain converges to its stationary distribution no matter where it starts. So we want to construct a Markov Chain whose stationary distribution is the same as our target distribution. Then, starting from a random point \(x^{(0)}\), after some number of steps \(n\) the chain has converged to the stationary distribution, and \(x^{(n)}, x^{(n+1)},...\) can be treated as samples from our target distribution.

Wow, this is a very clever idea. But now the problem becomes: how can we construct such a Markov Chain?

To solve the problem above, we need another very important property of Markov Chains:

Detailed Balance Condition

Theorem:

If an aperiodic Markov Chain's transition matrix \(P\) and a distribution \(\pi(x)\) satisfy: \[\pi(i)P(i \to j) = \pi(j)P(j \to i)\] then \(\pi(x)\) is the stationary distribution of \(P\). To see why, sum both sides over \(i\): \(\sum_i \pi(i)P(i \to j) = \pi(j)\sum_i P(j \to i) = \pi(j)\), which is exactly \(\pi P = \pi\).

Intuition: This property is very easy to understand if we think of it this way: if \(\pi(x)\) is stationary, it is reasonable that the probability flow from \(i\) to \(j\) equals the probability flow from \(j\) to \(i\).
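
A minimal numerical check (the 2-state chain and distribution below are hand-picked by me so that detailed balance holds by construction): we verify that the flows \(\pi(i)P(i \to j)\) balance, and that detailed balance indeed gives \(\pi P = \pi\):

```python
import numpy as np

# A hand-picked 2-state chain and distribution (my own toy example):
# detailed balance holds by construction, since pi(0)*0.1 == pi(1)*0.2.
P = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
])
pi = np.array([2/3, 1/3])

# Detailed balance: pi(i) P(i -> j) == pi(j) P(j -> i) for every pair (i, j).
flow = pi[:, None] * P            # flow[i, j] = pi(i) * P(i -> j)
print(np.allclose(flow, flow.T))  # True: the flows balance

# ...and, as the theorem says, detailed balance implies pi P = pi.
print(np.allclose(pi @ P, pi))    # True
```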

What does this property mean? It tells us that if we can find a transition matrix \(P\) that makes our target distribution satisfy this equation, then our previous problem is solved. But how can we find it?
