Metropolis-Hastings Sampling

MCMC Review (2)

Posted by Fan Gong on Apr 28, 2019

Recap: Now we have the background knowledge about Markov Chain and Monte Carlo method; at the same time we have agreed that finding the transition matrix \(P\) is the final goal to make everything work. Therefore, this article is to tell how we finally make this happen.

1. MCMC Sampling

1.1 Intuition

Suppose we randomly choose a transition matrix \(Q(i \to j)\),then the detailed balance stationary will not be satisfied. However, we are thinking to introduce one \(\alpha(i \to j)\) that makes: \[\pi(i)Q(i \to j)\alpha(i \to j) = \pi(j)Q(j \to i)\alpha(j \to i) \] Based on the symmetry, it is easy to find \(\alpha\): \[\alpha(i \to j) = \pi(j)Q(j \to i)\] \[\alpha(j \to i) = \pi(i)Q(i \to j)\]

This is to say, our transition matrix can be obtained through any random transition matrix \(Q\). Here we call \(\alpha(i \to j)\) acceptance rate, the value will between [0,1]. The intuition behind it is that when we move from state \(i\) to \(j\) with probability \(Q(i \to j)\), we accept it with rate \(\alpha\). Therefore we have the complete MCMC algorithm:

1.2 MCMC Sampling Algorithm:

  1. Input our selected transition matrix \(Q\), stationary distribution \(\pi(x)\), stationary threshold number \(n_1\), wanted sampling number \(n_2\):
  2. Initialize the sample \(x_0\) from simple distribution
  3. for t = 0 to \(n_2+n_1-1\):
    • Sample \(x_*\) from \(Q(x|x_t)\)
    • Sample \(u\) from Uniform[0,1]
    • if \(u < \alpha(x_t \to x_*) = \pi(x_*)Q(x_* \to x_t)\), then we accept this transition: \(x_{t+1} = x_*\)
    • Otherwise we reject the transition: \(x_{t+1} = x_t\)

Then data points \((x_{n_1},x_{n_1+1},...,x_{n_1+n_2-1})\) will be the samples from our targeted distribution \(\pi(x)\)

To make a small summary about the whole process: We tried to find a Markov Chain that has its stationary distribution same with our targeted distribution \(\to\) To find that Markov Chain we only need the transition matrix to satisfy detailed balance stationary \(\to\) to get that transition matrix we use Monte Carlo method to sample from proposal distribution

1.3 My Confusion

Here I would like to talk about all the confusions I had during this learning. I think before the appearance of the acceptance rate, the whole logic of this algorithm is quite clear, but things are getting messier and messier afterwards.

1) Why not sample from \(P(i \to j)\) directly since we have both the value of \(\alpha\) and \(Q\), where \(P = \alpha \odot Q\) ?

This is the most confusing question I have had for understanding this algorithm. Actually, the real question behind the scene is why we are using Monte Carlo method here. Then by thinking about the intuition of Monte Carlo method, the answer becomes obvious:

  • For real cases (mostly continuous condition), the transition matrix \(P\) will be a high-dimensional complicated matrix, the computation and storage cost will be large;
  • At the same time, most of the transition probability will be very small; but with using acceptance rate, if we enlarge it with certain factor, detailed balance stationary is still satisfied. (That's the key idea for MH algorithm), and in that case we can force our chain move forward

2) The acceptance rate here is different with the one in reject-sampling method

The special part of \(\alpha\) here is that, if it is being rejected, we will not drop that data point, instead; we keep our state static and still store this data. Because here rejection means transiting from \(i \to i\). But still we want to have high acceptance rate because:

  • At first our distribution is not stationary yet, by moving forward we could speed up the convergence.
  • We want sampler to visit higher probability areas under the full joint density

3) Why we need MCMC if we already know the specific target distribution?

This question is fundamental but very confusing when I first know the goal of MCMC is to sample from a known distribution. Actually MCMC is used to deal with the curse of dimensionality. Think about it, if we have a one-dimensional distribution, it is easy for us to sample a data sequence with 0.1 distance with each other. And they roughly can describe what the distribution looks like. However, for high-dimensional distributions, it is almost impossible to use this equidistance sampling method: suppose we have 10 dimensions and sample 100 data points from each dimension, in total we need to sample 10^100 data points. That is a huge number which needs lots of computation power. In addition to this, it is very likely that the data points we sampled are not representative enough, meaning they have low density. With using MCMC we can see it perfectly solved this two problems.

1.4 Toy Example

Since I had a hard time to understand the whole process, I also created a discrete example to help better understand MCMC:

  • Suppose we have \(\pi(x) = [\frac{2}{3}, \frac{1}{3}]\), which means we only have two states, \(p(x=1) = \frac{2}{3}\) and \(p(x=2) = \frac{1}{3}\)
  • And suppose our unknown transition matrix P is \[P = \left(\begin{array}{cc} \frac{2}{3} & \frac{1}{3}\\ \frac{2}{3} & \frac{1}{3} \end{array}\right) \]
  • Finally suppose our random transition matrix is: \[Q = \left(\begin{array}{cc} \frac{1}{2} & \frac{1}{2}\\ \frac{1}{2} & \frac{1}{2} \end{array}\right) \]

So based on \[\alpha(i \to j) = \pi(j)Q(j \to i)\] We get: \[\alpha = \left(\begin{array}{cc} \frac{1}{3} & \frac{1}{6}\\ \frac{1}{3} & \frac{1}{6} \end{array}\right) \] Since this is a simple discrete example, we can directly calculate \(P'\) \[P' = \alpha \odot Q = \left(\begin{array}{cc} \frac{1}{6} & \frac{1}{12}\\ \frac{1}{6} & \frac{1}{12} \end{array}\right)\] We see this is actually the unnormalized \(P\), after normalization, it will be the same with \(P\).

Since we can do normalization, we can also enlarge \(\alpha\) even more. For MH algorithm: \[\alpha(i \to j) = min(\frac{\pi(j)Q(j \to i)}{\pi(i)Q(i \to j)}, 1) \] This time \[\alpha = \left(\begin{array}{cc} 1 & \frac{1}{2}\\ 1 & 1 \end{array}\right)\] and \[P' = \alpha \odot Q = \left(\begin{array}{cc} \frac{3}{4} & \frac{1}{4}\\ \frac{1}{2} & \frac{1}{2} \end{array}\right)\] And still satisfy \(\pi P' = \pi\). Notice here \(\alpha_{ii}\) can be any number, so \(P_{ii}\) can only be calculated through: \[p_{ii} = 1 - \sum_{i\neq g}{\alpha_{ij}q_{ij}}\]

So this example help us:

  • Intuitively understand that sampling using acceptance rate equals to sample from real transition matrix P
  • understand why we can enlarge acceptance rate but still achieve our goal.

2. Metropolis-Hastings Sampling

MH sampling makes improvement on top of original MCMC method. We have already seen that:

  1. \(\alpha(i \to j)\) is still very small
  2. Enlarge both \(\alpha(i \to j)\) and \(\alpha(j \to i)\) with same factor will still satisfy the detailed balance stationary

Therefore, the key idea of MH sampling is to enlarge either \(\alpha(i \to j)\) or \(\alpha(j \to i)\) to 1.

Metropolis-Hastings Algorithm

  1. Input our selected transition matrix \(Q\), stationary distribution \(\pi(x)\), stationary threshold number \(n_1\), wanted sampling number \(n_2\):
  2. Initialize the sample \(x_0\) from simple distribution
  3. for t = 0 to \(n_2+n_1-1\):
    • Sample \(x_*\) from \(Q(x|x_t)\)
    • Sample \(u\) from Uniform[0,1]
    • if \(u < \alpha(x_t \to x_*) = min(\frac{\pi(j)Q(j \to i)}{\pi(i)Q(i \to j)}, 1)\), then we accept this transition: \(x_{t+1} = x_*\)
    • Otherwise we reject the transition: \(x_{t+1} = x_t\)

Then data points \((x_{n_1},x_{n_1+1},...,x_{n_1+n_2-1})\) will be the samples from our targeted distribution \(\pi(x)\)

In most cases we will choose a symmetrical \(Q\) to further simplify the calculation of \(\alpha\):

\[\alpha(i \to j) = min(\frac{\pi(j)}{\pi(i)}, 1)\]

Limitation

In big data era, we are facing more and more high-dimensional data, so:

  • When we have many dimensions, to calculate \(\frac{\pi(j)Q(j \to i)}{\pi(i)Q(i \to j)}\) is also computational expensive
  • The \(\alpha\) in most cases are still below 1
  • Due to the high dimension, sometimes it is hard for us to have the full joint distribution. Instead, we can easily have conditional distribution for each dimension

That's the reason we are introducing Gibbs Sampling, let's talk about it in the next article.

Reference:
http://www.mit.edu/~ilkery/papers/MetropolisHastingsSampling.pdf https://towardsdatascience.com/markov-chain-monte-carlo-in-python-44f7e609be98 https://zhoujiansun.wordpress.com/2018/07/11/%E8%B4%9D%E5%8F%B6%E6%96%AF%E7%BD%91%EF%BC%9Amcmc%E7%AE%97%E6%B3%95%E5%8E%9F%E5%9E%8B/ https://www.zhihu.com/question/60437632 https://towardsdatascience.com/from-scratch-bayesian-inference-markov-chain-monte-carlo-and-metropolis-hastings-in-python-ef21a29e25a https://segmentfault.com/a/1190000012645418 https://zhuanlan.zhihu.com/p/25610149 https://www.jianshu.com/p/28d32aa7cc45 https://www.cnblogs.com/pinard/p/6638955.html http://stat.wharton.upenn.edu/~stjensen/stat542/lecture14.mcmchistory.pdf https://www.mcmchandbook.net/HandbookChapter1.pdf