Naive Bayes Classifier

Fan Gong 18/01/2018

This notebook builds a Naive Bayes classifier from scratch. Last time we used the ISIR data, but it turned out to be a poor classification example because the dataset is too small. So this time I generate a dataset with sklearn instead. As before, we assume a binary classification problem.

1. Load Library and Generate Data

In [122]:
# Load Library
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from scipy.stats import norm
from sklearn.naive_bayes import GaussianNB
In [12]:
# Generate Data
X, y = make_classification(n_samples = 5000, n_features = 5, n_classes = 2, random_state = 42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

2. Model Construction

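Here we need to calculate: $$\hat{Y} = \arg \max_Y P(X|Y)P(Y) = \arg \max_Y P(x_1,...,x_p|Y)P(Y)= \arg \max_Y P(Y)\prod_{i=1}^p P(x_i|Y)$$ The last equality is the "naive" assumption that the features are conditionally independent given the class, so the prior $P(Y)$ appears once, outside the product.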

My assumption here is that each $P(x_i|Y=y)$ follows a normal distribution. Writing $x^{(1)},\dots,x^{(n)}$ for the $n$ training values of one feature within one class, the MLEs of the normal parameters are: $$\hat{\mu} = \frac{1}{n}\sum_{k=1}^n x^{(k)} $$ $$\hat{\sigma}^2=\frac{1}{n}\sum_{k=1}^n (x^{(k)} - \hat{\mu})^2$$
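As a quick sanity check on these formulas, the sketch below compares the closed-form estimates with what `scipy.stats.norm.fit` returns on synthetic data. The sample, its true parameters, and the variable names are made up for illustration only.

```python
import numpy as np
from scipy.stats import norm

# Synthetic sample (illustrative values, not from the notebook's dataset)
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=1000)

# Closed-form MLE: sample mean and the divide-by-n variance
mu_hat = sample.mean()
sigma2_hat = np.mean((sample - mu_hat) ** 2)

# norm.fit returns (loc, scale): the fitted mean and standard deviation
loc, scale = norm.fit(sample)

print(np.isclose(loc, mu_hat), np.isclose(scale ** 2, sigma2_hat))
```

Note that `scale` is a standard deviation, so it has to be squared before comparing it with $\hat{\sigma}^2$.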

In [79]:
def get_paras(X_train, y_train):
    '''
    Estimate the class proportions and the normal parameters of each feature within each class.
    
    Parameter
    ---------
    X_train: The features of the training dataset
    y_train: The labels of the training dataset
    
    Return
    ------
    P_Y: A list containing the proportion of each class
    paras: A dictionary containing the fitted parameters. The key 'c0f0' holds the (mean, standard deviation) pair for class 0, feature 0.
    '''
    
    classes = np.unique(y_train)

    # Calculate the prior probability of each class
    P_Y = [np.mean(y_train == c) for c in classes]

    # Calculate the parameters
    paras = {} # initialize dictionary for storing the fitted parameters of each class
    for i, c in enumerate(classes):
        class_temp = X_train[y_train == c] # extract each class's data
        for j in range(X_train.shape[1]):
            class_feature_temp = class_temp[:,j] # extract the feature's data
            paras_temp = norm.fit(class_feature_temp) # fit a normal distribution by MLE; returns (mean, std)
            paras["c{0}f{1}".format(i,j)] = paras_temp # store under the class/feature key
                
    return P_Y, paras
In [83]:
P_Y, paras = get_paras(X_train, y_train)

3. Model Prediction

In [111]:
def make_prediction(X, P_Y, paras):
    '''
    Compute the unnormalized posterior P(Y=c) * prod_f P(x_f|Y=c) of the given data for each class,
    based on the parameters we get from the last function.
    
    Parameter
    --------
    X: The features of the given dataset
    P_Y: A list containing the proportion of each class
    paras: A dictionary containing the fitted parameters.
    
    Return
    ------
    probs: A list with one array per class, holding the unnormalized posterior of every sample for that class
    '''
    n_features = len(paras) // len(P_Y)
    probs = []
    for c in range(len(P_Y)):
        probs_temp = P_Y[c] # the prior enters the product exactly once
        for f in range(n_features):
            mu, sigma = paras['c{0}f{1}'.format(c,f)]
            probs_temp = probs_temp * norm.pdf(X[:,f], loc=mu, scale=sigma)
        
        probs.append(probs_temp)
    
    return probs
In [112]:
probs = make_prediction(X_test, P_Y, paras)
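Multiplying many densities together can underflow to zero when the number of features grows. A common remedy is to sum log-densities instead. The sketch below is one possible variant of `make_prediction`, not part of the original notebook: the helper name `log_joint_scores` and the toy inputs are hypothetical, and it assumes `paras` uses the same `'c{class}f{feature}'` keys with `(mean, std)` values.

```python
import numpy as np
from scipy.stats import norm

def log_joint_scores(X, P_Y, paras):
    """Hypothetical helper: log P(Y=c) + sum_f log N(x_f | mu_cf, sigma_cf) for each class c."""
    n_features = X.shape[1]
    scores = []
    for c in range(len(P_Y)):
        # Start from the log prior, one value per sample
        log_score = np.full(X.shape[0], np.log(P_Y[c]))
        for f in range(n_features):
            mu, sigma = paras["c{0}f{1}".format(c, f)]
            log_score = log_score + norm.logpdf(X[:, f], loc=mu, scale=sigma)
        scores.append(log_score)
    return np.column_stack(scores)  # shape (n_samples, n_classes)

# Toy inputs (made up): two classes, two features
P_Y_demo = [0.5, 0.5]
paras_demo = {"c0f0": (0.0, 1.0), "c0f1": (0.0, 1.0),
              "c1f0": (3.0, 1.0), "c1f1": (3.0, 1.0)}
X_demo = np.array([[0.1, -0.2], [2.9, 3.3]])

scores_demo = log_joint_scores(X_demo, P_Y_demo, paras_demo)
pred_demo = np.argmax(scores_demo, axis=1)
print(pred_demo)  # [0 1]
```

Because the logarithm is monotonic, taking `np.argmax` over the log scores picks the same class as comparing the raw products, and it extends to more than two classes without changes. On exact ties `np.argmax` returns the lowest class index.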

4. Model Evaluation

Based on the probabilities we get, we can then evaluate the model.

In [127]:
def make_evaluation(probs, Y):
    '''
    Parameter
    --------
    probs: The predicting probabilities of the data in each class (Binary)
    Y: The label of testing dataset
    
    Return
    ------
    pred_label: The label prediction
    acc: The accuracy of the prediction
    '''
    
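    probs0 = probs[0] # scores for class 0
    probs1 = probs[1] # scores for class 1
    
    # Predict class 1 when its score is at least as large as the score for class 0
    pred_label = (probs0 <= probs1).astype(int)
    acc = np.mean(pred_label == Y)
    
    return pred_label, acc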
In [129]:
pred_label, acc = make_evaluation(probs, y_test)
acc
Out[129]:
0.86899999999999999

We have a pretty good prediction accuracy. Let us now compare it with sklearn's result:

In [123]:
gnb=GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
In [124]:
np.mean(y_pred==y_test)
Out[124]:
0.872

Cool! Our accuracy is almost the same as the one from the sklearn package.
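One way to see why the two results are so close is to compare the fitted parameters directly. The sketch below regenerates the same data and checks that the class priors and per-class feature means computed by hand agree with the `class_prior_` and `theta_` attributes of `GaussianNB`. The names `manual_prior` and `manual_mean` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Same data generation and split as in section 1
X, y = make_classification(n_samples=5000, n_features=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gnb = GaussianNB().fit(X_train, y_train)

# Hand-computed class proportions and per-class feature means
classes = np.unique(y_train)
manual_prior = np.array([np.mean(y_train == c) for c in classes])
manual_mean = np.array([X_train[y_train == c].mean(axis=0) for c in classes])

print(np.allclose(gnb.class_prior_, manual_prior))
print(np.allclose(gnb.theta_, manual_mean))
```

The variances are not compared here because `GaussianNB` adds a small `var_smoothing` term to them for numerical stability, so they can differ very slightly from the plain MLE.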