Fan Gong 18/01/2018
This notebook constructs a Bayes classifier from scratch. Last time we used the Iris data, but it was not a good classification example since the dataset is too small. So this time I decided to generate a dataset with sklearn. We still treat it as a binary classification problem.
# Load Library
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from scipy.stats import norm
from sklearn.naive_bayes import GaussianNB
# Generate Data
X, y = make_classification(n_samples = 5000, n_features = 5, n_classes = 2, random_state = 42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
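As a quick sanity check (an extra step, not strictly necessary), we can look at the shapes and the class balance, since the class priors $P(Y)$ enter the formula below:
print(X_train.shape, X_test.shape)          # (4000, 5) and (1000, 5) with the 80/20 split
print(np.bincount(y_train) / len(y_train))  # class proportions; make_classification is roughly balanced by default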
Here we need to calculate: $$\hat{Y} = \arg \max_Y P(X|Y)P(Y) = \arg \max_Y P(x_1,...,x_p|Y)P(Y) = \arg \max_Y P(Y)\prod_{i=1}^p P(x_i|Y)$$ where the last equality uses the naive assumption that the features are conditionally independent given $Y$.
And my assumption here is that each $P(x_i|Y=y)$ follows a normal distribution, whose MLE is: $$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i \qquad \hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2$$
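Concretely, plugging the fitted class-conditional mean and variance into the Gaussian density gives, for feature $i$ and class $y$: $$P(x_i|Y=y) = \frac{1}{\sqrt{2\pi\hat{\sigma}_{iy}^2}}\exp\left(-\frac{(x_i-\hat{\mu}_{iy})^2}{2\hat{\sigma}_{iy}^2}\right)$$ where $\hat{\mu}_{iy}$ and $\hat{\sigma}_{iy}^2$ are estimated from the training samples of class $y$ only.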
def get_paras(X_train, y_train):
    '''
    This function gets the class proportions and the normal-distribution
    parameters for each feature within each class.

    Parameters
    ----------
    X_train: The features of the training dataset
    y_train: The labels of the training dataset

    Returns
    -------
    P_Y: A list containing the class proportions
    paras: A dictionary of fitted parameters; the key 'c0f0' maps to the
           (mu, sigma) pair for feature 0 within class 0.
    '''
    # Calculate P(Y) for each class y
    P_Y = [sum(y_train == y) / len(y_train) for y in np.unique(y_train)]

    # Calculate the per-class, per-feature parameters
    paras = {}  # dictionary storing the fitted parameters for each class
    for i in range(len(np.unique(y_train))):
        class_temp = X_train[y_train == np.unique(y_train)[i]]  # extract this class's data
        for j in range(X_train.shape[1]):
            class_feature_temp = class_temp[:, j]  # extract this feature's column
            paras_temp = norm.fit(class_feature_temp)  # fit a normal distribution by MLE
            paras.update({"c{0}f{1}".format(i, j): paras_temp})  # collect all the parameters
    return P_Y, paras
P_Y, paras = get_paras(X_train, y_train)
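For example, we can peek at what was fitted; scipy's norm.fit returns a (loc, scale) tuple, i.e. $(\hat{\mu}, \hat{\sigma})$:
print(P_Y)            # class priors, roughly [0.5, 0.5] here
print(paras['c0f0'])  # (mu, sigma) fitted for feature 0 within class 0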
def make_prediction(X, P_Y, paras):
    '''
    This function makes predictions for the given data based on the
    parameters obtained from the previous function.

    Parameters
    ----------
    X: The features of the given dataset
    P_Y: A list containing the class proportions
    paras: A dictionary of fitted parameters

    Returns
    -------
    probs: The predicted (unnormalized) probabilities of the data for each class
    '''
    probs = []
    n_features = int(len(paras) / len(P_Y))
    for c in range(len(P_Y)):
        probs_temp = P_Y[c]  # multiply the prior in once, matching the formula above
        for f in range(n_features):
            loc, scale = paras['c{0}f{1}'.format(c, f)]
            probs_temp = probs_temp * norm.pdf(X[:, f], loc=loc, scale=scale)
        probs.append(probs_temp)
    return probs
probs = make_prediction(X_test, P_Y, paras)
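One caveat worth noting: multiplying many small densities can underflow as the number of features grows. A minimal log-space sketch of the same computation (an optional variant using scipy's norm.logpdf; taking the argmax over its results gives identical labels):
def make_prediction_log(X, P_Y, paras):
    '''Log-space version: sum log-densities instead of multiplying pdfs.'''
    log_probs = []
    n_features = int(len(paras) / len(P_Y))
    for c in range(len(P_Y)):
        log_prob_temp = np.log(P_Y[c])  # add the log-prior once
        for f in range(n_features):
            loc, scale = paras['c{0}f{1}'.format(c, f)]
            log_prob_temp = log_prob_temp + norm.logpdf(X[:, f], loc=loc, scale=scale)
        log_probs.append(log_prob_temp)
    return log_probs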
Then, based on the probabilities we obtained, we can evaluate the predictions.
def make_evaluation(probs, Y):
    '''
    This function turns the per-class probabilities into label predictions
    and computes the prediction accuracy.

    Parameters
    ----------
    probs: The predicted probabilities of the data for each class (binary)
    Y: The labels of the testing dataset

    Returns
    -------
    pred_label: The predicted labels
    acc: The accuracy of the prediction
    '''
    probs0 = probs[0]
    probs1 = probs[1]
    pred_label = (probs0 <= probs1).astype(int)  # pick the class with the larger probability
    acc = np.mean(pred_label == Y)
    return pred_label, acc
pred_label, acc = make_evaluation(probs, y_test)
acc
We have a pretty good prediction accuracy. Let us then compare it with sklearn's results.
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
np.mean(y_pred==y_test)
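Beyond accuracy, we can also compare the fitted parameters directly. GaussianNB exposes the per-class feature means as theta_ (the per-class variances are var_ in recent sklearn releases, sigma_ in older ones, so adjust to your installed version):
print(gnb.theta_[0])  # sklearn's fitted means for class 0
print([paras['c0f{0}'.format(f)][0] for f in range(X_train.shape[1])])  # our fitted means for class 0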
Cool! We have almost the same accuracy as the sklearn package.