Naive Bayes Classifier

Generative Classification Model (1)

Posted by Fan Gong on January 18, 2018

Last time we talked about logistic regression, but I did not give an overview of what the classification problem is and how we can categorize different classification methods. So let us talk about them this time!

Most classification models can be categorized as generative or discriminative. What is the difference?

Generative models focus on modeling the joint distribution $P(X,Y)$ and then use $\hat{Y}(X)$ as a classifier, where:
$$\hat{Y}(X) = \arg \max_{Y} P(Y|X) = \arg \max_{Y} \frac{P(X|Y)P(Y)}{\sum_i P(X|Y_i)P(Y_i)} = \arg \max_Y P(X|Y)P(Y)$$
Examples include linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and Naive Bayes.

Discriminative methods focus on modeling $P(Y|X)$ directly (or learning a decision rule that minimizes the loss) without modeling the joint distribution $P(X, Y)$. Most classification methods belong to this category, such as the logistic regression we talked about before, as well as SVMs, KNN, and classification trees.


1. Introduction

I will cover three generative classification models in the next several posts: LDA, QDA and Naive Bayes. Here are their basic differences:

  1. LDA and QDA both assume a Gaussian density for $P(X|Y)$, but LDA requires the covariance matrix to be the same across classes, while QDA does not.
  2. Naive Bayes assumes $P(X|Y) = \prod_j P(X_j|Y)$

2. Bayes Classifier

Before we talk about Naive Bayes, let us first review what the Bayes classifier is.

Suppose we know $P(Y|X)$; then, given an input $x_0$, we can predict the response as:
$$\hat{y}_0 = \arg\max_y P(Y=y|X=x_0)$$
and this is the best possible classifier under the expected 0-1 loss. Why? Here is a brief proof (I think this part is well worth knowing):

The 0-1 binary loss function has the following form:

$$ l_\boldsymbol{x}(\hat s, s^*) = 1 - \delta_{\hat ss^*} = \begin{cases} 1 & \text{if} \quad \hat s \ne s^* \\ 0 & \text{otherwise} \end{cases} $$

where $\delta$ is the Kronecker delta function. Taking the expectation of this loss with respect to the posterior $P(s = s^* \mid \boldsymbol{x})$, the expected loss of predicting $\hat s$ is:

$$ \begin{align} \mathcal{L}_\boldsymbol{x}(\hat s) &= \sum_{s^*} l_\boldsymbol{x}(\hat s, s^*) \; P(s = s^* \mid \boldsymbol{x}) \\ &= \sum_{s^*} (1 - \delta_{\hat s s^*}) \; P(s = s^* \mid \boldsymbol{x}) \\ &= \sum_{s^*} P(s = s^* \mid \boldsymbol{x}) - \sum_{s^*} \delta_{\hat s s^*} P(s = s^* \mid \boldsymbol{x}) \\ &= 1 - P(s = \hat s \mid \boldsymbol{x}) \end{align} $$

(Proof adapted from a Cross Validated answer.)

So we can see that minimizing the expected 0-1 loss is equivalent to maximizing $P(s = \hat s \mid \boldsymbol{x})$: finding $\arg \min_{\hat s} \mathcal{L}_\boldsymbol{x}(\hat s)$ is the same as finding $\arg \max_{\hat s} P(s = \hat s \mid \boldsymbol{x})$.

Knowing the relationship between the 0-1 loss and the Bayes classifier, we can now get a much clearer understanding of Naive Bayes.

Finally, since these "true" probabilities are essentially never known, the Bayes classifier is more of a theoretical concept than something we can actually use in practice.
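To make the decision rule concrete, here is a tiny sketch of what the Bayes classifier would do if we somehow knew the posterior at a single input. The class names and probabilities below are made up purely for illustration:

```python
# Hypothetical posterior P(Y | X = x0) at one input x0 -- in reality we never know this.
posterior = {"class_A": 0.2, "class_B": 0.5, "class_C": 0.3}

# The Bayes classifier picks the class with the largest posterior probability,
# which, by the derivation above, minimizes the expected 0-1 loss.
y_hat = max(posterior, key=posterior.get)
expected_loss = 1 - posterior[y_hat]

print(y_hat, expected_loss)  # class_B 0.5
```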

3. Naive Bayes Classifier

The Naive Bayes classifier can be seen as a special case of the Bayes classifier. Since the Bayes classifier is hard to compute in practice, Naive Bayes makes an assumption that makes it tractable: it assumes that the features are independent conditional on the class $Y$.

It is a strong and generally unrealistic assumption, but Naive Bayes still often works very well in practice.

3.1 Model Construction

We start from the basic generative model:
$$\hat{Y}(X) = \arg \max_{Y} P(Y|X) = \arg \max_{Y} \frac{P(X|Y)P(Y)}{\sum_i P(X|Y_i)P(Y_i)} = \arg \max_Y P(X|Y)P(Y)$$

And for naive bayes model, we have (suppose we have p features):
$$\hat{Y} = \arg \max_Y P(X|Y)P(Y) = \arg \max_Y P(x_1,...,x_p|Y)P(Y) = \arg \max_Y P(Y)\prod_{i=1}^p P(x_i|Y)$$

  • The prior $P(Y)$ is easy to estimate from the training data: $$ P(y) = \frac{\#\text{ of observations in class } y}{\#\text{ of observations}}$$
  • The class conditionals $P(x_i|Y=y)$ usually require a distributional assumption. For continuous data, a typical choice is the normal (Gaussian) distribution; for discrete data, a multinomial (or Bernoulli) model is common. The parameters are then estimated by maximum likelihood, as sketched below.
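To make these two estimation steps concrete, here is a minimal from-scratch sketch of a Gaussian Naive Bayes fit-and-predict routine. The function and variable names are my own illustration (not the code in the notebook linked below): the prior is the class frequency, and each class conditional $P(x_i|Y=y)$ is a univariate Gaussian whose mean and variance are the MLE estimates.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature Gaussian parameters for each class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": Xc.shape[0] / X.shape[0],   # P(Y = c): class frequency
            "mean": Xc.mean(axis=0),             # MLE mean of each feature
            "var": Xc.var(axis=0) + 1e-9,        # MLE variance (+ tiny smoothing)
        }
    return params

def predict_gaussian_nb(X, params):
    """Predict the class maximizing log P(Y) + sum_i log P(x_i | Y)."""
    X = np.asarray(X, dtype=float)
    classes = list(params)
    log_posteriors = []
    for c in classes:
        p = params[c]
        # Log Gaussian density, summed over features -- the "naive" independence assumption
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * p["var"]) + (X - p["mean"]) ** 2 / p["var"], axis=1
        )
        log_posteriors.append(np.log(p["prior"]) + log_lik)
    best = np.argmax(np.column_stack(log_posteriors), axis=1)
    return np.array(classes)[best]
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow.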

3.2 Model Implementation

This time I compare my Naive Bayes classifier's results with those of sklearn.naive_bayes.GaussianNB, and they both achieve about 87% accuracy, which is pretty good. Here is my code for the Naive Bayes classifier from scratch:

Jupyter Notebook For Naive Bayes Classifier
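For readers who just want a quick picture of what such a comparison looks like, here is a rough sketch using the from-scratch functions from section 3.1. The dataset and split below are placeholders of my own choosing, not necessarily what is used in the notebook:

```python
from sklearn.datasets import load_breast_cancer       # placeholder dataset for illustration
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# sklearn's built-in Gaussian Naive Bayes
clf = GaussianNB().fit(X_train, y_train)
print("sklearn accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# the from-scratch version sketched in section 3.1
params = fit_gaussian_nb(X_train, y_train)
print("from-scratch accuracy:", accuracy_score(y_test, predict_gaussian_nb(X_test, params)))
```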

Let us talk about other generative classification methods next time!