PCA

This Jupyter notebook constructs a principal component analysis (PCA) model from scratch. I will use the SVD method here since it is more convenient. As usual, I will compare my results with sklearn.decomposition.PCA. Finally, I will use logistic regression to make predictions and compare the results before and after applying PCA.
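
For reference, the standard relationship between the SVD and PCA that the fit method below relies on is (stated in my own notation, not taken from the notebook's code):

$$X = U \Sigma V^{\top}, \qquad T = X V = U \Sigma, \qquad \operatorname{Var}(\mathrm{PC}_k) = \frac{\sigma_k^{2}}{n-1},$$

where $X$ is the column-centered data matrix with $n$ rows, the columns of $V$ are the principal directions, $T$ holds the principal component scores, and $\sigma_k$ is the $k$-th singular value.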

1. Load Library and Generate Data

In [1]:
# Load Library
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
In [3]:
# Generate Data
X, y = make_classification(n_samples=500, n_features=6, n_informative=2, n_redundant=2, n_repeated=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

2. Model Construction

In [13]:
class PCA:
    '''
    A simple PCA implementation based on the SVD of the centered data matrix.
    '''
    def __init__(self):
        self.V = None        # principal directions (loadings), one per column
        self.cov = None      # covariance matrix of the PC scores
        self.scores = None   # training data projected onto the principal directions

    def fit(self, X_train):
        '''
        Fit the PCA model on the training data.
        '''
        # Center the data (PCA requires mean-centering; scaling is left out here)
        X_train = StandardScaler(with_mean=True, with_std=False).fit_transform(X_train)
        # SVD of the centered data: the rows of v are the principal directions
        _, _, v = np.linalg.svd(X_train)
        self.V = v.transpose()
        # Project the centered data onto the principal directions to get the scores
        self.scores = np.dot(X_train, self.V)
        # The covariance of the scores is (numerically) diagonal; its diagonal holds the PC variances
        self.cov = np.cov(self.scores, rowvar=False)

    def variance_explained(self, n):
        '''
        Parameter
        ---------
        n: the index of the principal component, starting from 0
        '''
        if self.cov is None:
            print('Please fit the data first')
            return None
        variance = np.around(self.cov.diagonal()[n], 3)
        variance_explained = np.around(variance / sum(self.cov.diagonal()), 3)
        total_explained = np.around(sum(self.cov.diagonal()[:(n + 1)]) / sum(self.cov.diagonal()), 3)
        return [variance, variance_explained, total_explained]
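
The class only stores the scores of the training data. To project new data (e.g. a test set) onto the fitted directions, one would center it with the training mean and multiply by V. A minimal sketch of such a helper (my own addition, not part of the original class) is:

In [ ]:
def pca_transform(pca, X_new, X_train):
    '''Hypothetical helper: project new data onto directions fitted on X_train.'''
    # Center with the *training* mean, then rotate into the PC basis
    return np.dot(X_new - X_train.mean(axis=0), pca.V)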

3. Model Evaluation

Let us compare our results with sklearn's.

In [14]:
pca = PCA()
pca.fit(X_train)
pca.variance_explained(0)
Out[14]:
[8.193, 0.77, 0.77]
In [15]:
pca.variance_explained(1)
Out[15]:
[2.445, 0.23, 1.0]

We see that the first two PCs already explain 100% of the variance. That is reasonable: when generating this dataset we added only two informative features, and the redundant and repeated features are linear combinations (or copies) of them, so the first two PCs capture essentially all of the variance.
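
As a quick sanity check (my own addition, not part of the original notebook), the numerical rank of the centered training data should therefore be 2:

In [ ]:
# The centered data matrix should have numerical rank 2, matching the two non-trivial PCs
X_centered = X_train - X_train.mean(axis=0)
print(np.linalg.matrix_rank(X_centered))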

Let us now see how sklearn performs.

In [16]:
from sklearn.decomposition import PCA as sk_PCA
p = sk_PCA()
p = p.fit(X_train)
print(p.explained_variance_ratio_)
[7.70170840e-01 2.29829160e-01 1.25387689e-32 4.87355033e-33
 1.70166353e-33 8.83916079e-34]
In [17]:
print(np.around(p.explained_variance_ratio_/sum(p.explained_variance_ratio_),3))
[0.77 0.23 0.   0.   0.   0.  ]

We can see that the results are exactly the same!
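
As mentioned in the introduction, the next step is to compare logistic regression before and after PCA. A minimal sketch of that comparison (my own, assuming we keep the first two PCs and reusing the hypothetical pca_transform helper from section 2) could look like:

In [ ]:
from sklearn.linear_model import LogisticRegression

# Baseline: logistic regression on all six original features
clf_full = LogisticRegression().fit(X_train, y_train)
print('all features :', clf_full.score(X_test, y_test))

# Reduced: keep the first two PC scores from the pca object fitted above,
# and project the test set onto the same directions before scoring
X_train_pc = pca.scores[:, :2]
X_test_pc = pca_transform(pca, X_test, X_train)[:, :2]
clf_pca = LogisticRegression().fit(X_train_pc, y_train)
print('first two PCs:', clf_pca.score(X_test_pc, y_test))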