Principal Component Analysis

Unsupervised Learning Method (1)

Posted by Fan Gong on February 13, 2018

Principal component analysis, or PCA, is in essence a linear projection operator that maps the data of interest to a new coordinate frame whose axes represent the directions of maximal variability. It is a simple, non-parametric method for extracting relevant low-dimensional information from confusing data sets.

In this post, I will first talk about how to derive PCA, then discuss its assumptions, and finally implement the algorithm from scratch in Python.

1. PCA Derivation

To follow the derivation of PCA, we need sufficient knowledge of linear algebra, including eigenvalues and eigenvectors, eigendecomposition, and SVD. I will not review them here; if you'd like to refresh these topics, I recommend the classic MIT linear algebra course by Prof. Gilbert Strang.

Basically, there are two ways to derive PCA: the eigenvector decomposition technique and singular value decomposition (SVD). I will mainly cover the first method since it is more intuitive to me.

1.1 Eigenvector Decomposition Method

The goal of PCA is to identify the most meaningful basis to re-express a dataset. So the first question is: How should we change the basis?

Change of Basis

Here comes our first assumption: we would like to find a basis, which is a linear combination of the original basis, that best re-expresses our data. Let me translate this sentence into math:

Let $X_{m \times n}$ be the original data set (notice that each row represents one feature's data, different from the usual layout), so each column is a single sample of our dataset. Let $Y_{m \times n}$ be the re-expressed (transformed) data matrix, and let $P_{m \times m}$ be the transformation matrix such that:
$$Y = PX$$
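
As a quick illustration of this convention (a minimal sketch with made-up numbers, not from the original post), each row of $X$ is one feature and each column is one sample, so $P$ acts on the feature axis:

```python
import numpy as np

# X has m = 2 features (rows) and n = 4 samples (columns)
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 1.0, 4.0, 3.0]])

# P is an m x m change-of-basis matrix; here simply a 45-degree rotation
theta = np.pi / 4
P = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

Y = P @ X          # re-expressed data, still 2 x 4
print(Y.shape)     # (2, 4)
```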

Then how do we find a good choice of basis $P$? It depends on what properties we would like $Y$ to have. So next we talk about how to measure the goodness of a basis.

Measuring the Goodness of a Basis

We would like to find a basis that satisfies the following two qualities:

  1. Large Variance: Here comes our second assumption:
    Combined features with a larger associated variance represent interesting structure. That makes sense, since we want our transformed data to use a small number of features that cover most of the information in the original data, and a feature with larger variance contains relatively more information.
  2. Small Redundancy: Redundancy describes the correlation between features. If two features are highly correlated, recording only one of them would express the data more concisely. Indeed, this is the central idea behind dimensionality reduction.

To combine the two criteria above, we find that the covariance matrix provides exactly this information. After we center each feature, the covariance matrix $C_X$ is: $$C_X = \frac{1}{n-1}XX^T$$ where the $ij^{th}$ element of $C_X$ is the covariance between the $i^{th}$ and $j^{th}$ measurements; also notice that the dimension of $C_X$ is $m \times m$.

So, $C_X$ captures the covariance between all possible pairs of measurements. The covariance values reflect the noise and redundancy in our measurements.

  • In the diagonal terms, by assumption, large values correspond to interesting structure.
  • In the off-diagonal terms, large magnitudes correspond to high redundancy.
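
Putting the formula above into code, here is a quick NumPy sketch (the data and variable names are my own) of computing $C_X$ after centering each feature:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))            # m = 3 features (rows), n = 100 samples

Xc = X - X.mean(axis=1, keepdims=True)   # center each feature (row)
n = Xc.shape[1]
C_X = Xc @ Xc.T / (n - 1)                # m x m covariance matrix

print(np.allclose(C_X, np.cov(X)))       # True: matches NumPy's built-in estimate
```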

Based on the summary above, our optimized $C_Y$ should be like this:

  • All off-diagonal terms in $C_Y$ should be zero. Thus, $C_Y$ must be a diagonal matrix.
  • Each successive dimension in $Y$ should be rank-ordered according to variance.

We can call this procedure diagonalization. There are many ways to solve this problem; PCA keeps it simple by assuming that $P$ is an orthonormal matrix. Basically, we first select a normalized direction along which the variance of $X$ is maximized and save it as $p_1$; we then find another direction along which the variance is maximized, but because of the orthonormality condition, we restrict the search to directions orthogonal to all previously selected ones.

Eigenvector Decomposition

Up to now, we know that our goal is to find some orthonormal matrix $P$ in $Y = PX$ such that $C_Y = \frac{1}{n-1}YY^T$ is a diagonal matrix.

Here is what we do:
We begin by rewriting $C_Y$ in terms of $P$: $$\begin{align*} C_Y & = \frac{1}{n-1}YY^T \\ &= \frac{1}{n-1} (PX)(PX)^T \\ &= \frac{1}{n-1}PXX^TP^T \\ &=P\left(\frac{1}{n-1}XX^T\right)P^T \\&= PC_XP^T \end{align*}$$

Then we use the theorem that a symmetric matrix $A$ is diagonalized by an orthogonal matrix of its eigenvectors: $A = EDE^T$, where $D$ is a diagonal matrix and $E$ is a matrix whose columns are the eigenvectors of $A$. Since $C_X$ is symmetric, we can write $C_X = EDE^T$, and we want to choose $P$ so that $C_Y$ becomes that diagonal matrix $D$.

Based on the theorem that the inverse of an orthogonal matrix is its transpose, we find that choosing $P = E^T$, i.e. letting each row $p_i$ be an eigenvector of $C_X$, gives exactly that result. Here is the proof: $$\begin{align*} C_Y & = PC_XP^T \\ &= P(EDE^T)P^T \\ &=(PE)D(E^TP^T) \\ &=(PP^T)D(PP^T) \\ &=(PP^{-1})D(PP^{-1}) \\&=D \end{align*}$$

Here is a brief summary (see the numerical check after the list):

  1. The loading vectors of $X$ are the eigenvectors of $C_X$
  2. The $i^{th}$ diagonal value of $C_Y$ is the variance of $X$ along $p_i$.
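
This claim is easy to verify numerically; the following sketch (variable names are my own) eigendecomposes $C_X$, sets $P = E^T$, and checks that $C_Y$ comes out diagonal with the eigenvalues on its diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 500))                    # features as rows
Xc = X - X.mean(axis=1, keepdims=True)
C_X = Xc @ Xc.T / (Xc.shape[1] - 1)

# Eigendecomposition of the symmetric covariance matrix
eigvals, E = np.linalg.eigh(C_X)                 # columns of E are eigenvectors
order = np.argsort(eigvals)[::-1]                # sort by decreasing variance
eigvals, E = eigvals[order], E[:, order]

P = E.T                                          # rows of P are the loading vectors
Y = P @ Xc
C_Y = Y @ Y.T / (Y.shape[1] - 1)

print(np.allclose(C_Y, np.diag(eigvals)))        # True: C_Y is diagonal
```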

1.2 SVD

SVD tells us that any matrix $X$ can be factored into an orthogonal matrix, a diagonal matrix, and another orthogonal matrix: $$X = U \Sigma V^T$$ Then (note that in this section $X$ is $n \times m$, with samples as rows): $$\begin{align*} C_X &= \frac{1}{n-1}X^TX \\&=\frac{1}{n-1}(V \Sigma U^T)(U\Sigma V^T) \\ &= \frac{1}{n-1}V\Sigma^2V^T \end{align*}$$

If we define $P = V$, then: $$\begin{align*} C_Y &= \frac{1}{n-1}(XP)^T(XP) \\ &=\frac{1}{n-1}(P^TV)\Sigma U^T U \Sigma (V^TP) \\ &=\frac{1}{n-1}\Sigma^2 \end{align*}$$

Therefore, the columns of $V$ are the loading vectors of $X$.
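
As a sanity check on the equivalence of the two derivations, here is a small sketch (made-up data; note this section's samples-as-rows layout) comparing the right singular vectors of the centered data with the eigenvectors of the covariance matrix, which agree up to sign:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))            # n = 200 samples, m = 4 features
Xc = X - X.mean(axis=0)                  # center each column (feature)

# Route 1: eigenvectors of the covariance matrix
C_X = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, E = np.linalg.eigh(C_X)
E = E[:, np.argsort(eigvals)[::-1]]      # order by decreasing variance

# Route 2: right singular vectors of the centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T

# The loading vectors agree up to sign
print(np.allclose(np.abs(E), np.abs(V)))
# And the squared singular values recover the variances: sigma^2 / (n - 1)
print(np.allclose(np.sort(S**2 / (Xc.shape[0] - 1))[::-1],
                  np.sort(eigvals)[::-1]))
```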

2. Model Assumption and Notation

2.1 Model Assumption

There are three assumptions of this model:

  • Linearity: the new basis is a linear combination of the original basis.
  • Large variances have important structure.
  • The principal components are orthogonal.

2.2 Notation

Suppose we have $X_{n\times p}$, $P_{p \times p}$, $Y_{n \times p}$, and $Y = XP$.

  • The result $Y$ is usually called the component scores; its first column is called the first principal component.
  • The transformation matrix $P$ is called the loadings, and each of its columns is called a loading vector.
  • In most cases we not only need to center the data (centering makes it easy to compute the covariance matrix), but also standardize it so that every feature is on the same scale (see the scikit-learn sketch below).
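
For reference, here is roughly how these pieces map onto scikit-learn (a sketch under the usual samples-as-rows layout; `fit_transform` returns the scores and `components_` holds the loading vectors as rows):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))               # n = 100 samples, p = 5 features

X_std = StandardScaler().fit_transform(X)   # center and scale each feature

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)           # Y: the component scores, n x 2
loadings = pca.components_.T                # P: loading vectors as columns, p x 2

print(scores.shape, loadings.shape)         # (100, 2) (5, 2)
print(pca.explained_variance_)              # variance along each component
```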

3. PCA Implementation

In practice, computing PCA entails the following steps (suppose we have $X_{m \times n}$ with features as rows; a code sketch follows the list):

  1. Subtracting off the mean of each feature
  2. Calculating the covariance matrix $C_X$
  3. Computing the eigenvectors of $C_X$
  4. Taking the eigenvectors associated with the top $k$ eigenvalues and forming them into the matrix $P$
  5. $Y = PX$ is the resulting $k$-dimensional output
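
Here is a minimal from-scratch sketch of these five steps (the function name and details are my own; it follows the features-as-rows convention $X_{m \times n}$ used in the derivation):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition. X is m x n (features as rows), k <= m."""
    # 1. Subtract off the mean of each feature
    Xc = X - X.mean(axis=1, keepdims=True)
    # 2. Calculate the covariance matrix C_X (m x m)
    n = Xc.shape[1]
    C_X = Xc @ Xc.T / (n - 1)
    # 3. Compute the eigenvectors of the symmetric matrix C_X
    eigvals, eigvecs = np.linalg.eigh(C_X)
    # 4. Take the top-k eigenvectors, ordered by decreasing eigenvalue, as rows of P
    order = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, order].T                  # k x m
    # 5. Project: Y = PX is the k-dimensional output
    Y = P @ Xc
    return Y, P, eigvals[order]

# Example usage on made-up data
X = np.random.default_rng(4).normal(size=(5, 200))
Y, P, variances = pca(X, k=2)
print(Y.shape, P.shape)                      # (2, 200) (2, 5)
```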

Here is the link to the Jupyter Notebook of PCA.

All right, that is all for this review of PCA. We will see more unsupervised learning methods next time.

Reference:

Book: An Introduction to Statistical Learning
https://arxiv.org/abs/1404.1100
sklearn.decomposition.PCA documentation
https://www.cnblogs.com/bjwu/p/9358777.html