Decision trees segment the predictor space into a number of simple regions. This kind of method is simple and useful for interpretation, even though on its own it is typically not competitive with the best supervised learning methods. However, as we will see later, combining a large number of trees can often result in dramatic improvements in prediction accuracy.
1. Basic Concept of Decision Trees
The logic behind the decision tree model is very simple: we want to split our predictor space into regions so that each split leads to the greatest reduction of our loss function. At each node, we make a binary split, and we keep splitting until some stopping criterion is satisfied.
Here are some key concepts in tree models:
Decision trees are typically drawn upside down. The top node is called the root, and the bottom nodes are called the terminal nodes or leaves of the tree. The points along the tree where the predictor space is split are referred to as internal nodes. Finally, we call the connections between the nodes branches.
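To make the terminology concrete, here is a minimal sketch of how a binary tree node could be represented in Python. The names (`Node`, `feature_index`, `threshold`) are my own illustrative choices, not from any particular library:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a binary decision tree (illustrative sketch)."""
    feature_index: Optional[int] = None   # which predictor X_j an internal node splits on
    threshold: Optional[float] = None     # cutpoint s for the split X_j < s
    left: Optional["Node"] = None         # branch to the region {X_j < s}
    right: Optional["Node"] = None        # branch to the region {X_j >= s}
    value: Optional[float] = None         # prediction stored in a leaf (terminal node)

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


# The top Node is the root, nodes with children are internal nodes,
# nodes without children are the leaves, and the left/right links are the branches.
```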
2. Regression Trees
Just like with most machine learning algorithms, here is what we should consider:
- How does the model work?
- What is the loss function?
- How to prevent overfitting?
Let us answer these questions one by one, and then try to implement this model from scratch in Python.
2.1 Model structure and Loss Function
Roughly speaking, there are two steps to build a regression tree:
- Divide the predictor space $X_1,X_2,...,X_p$ into $J$ distinct and non-overlapping regions, $R_1,R_2,...,R_J$ (high-dimensional rectangles, or boxes)
- For every observation that falls into region $R_j$, we make the same prediction. Since this is a regression problem, we simply use the mean of the response values of the training observations in $R_j$ as the prediction $\hat{y}_{R_j}$. Then in each box $R_j$, the loss will be: $\sum_{i: x_i \in R_j}(y_i - \hat{y}_{R_j})^2$
So the total loss will be:
$$\sum_{j=1}^{J}\sum_{i: x_i \in R_j}(y_i - \hat{y}_{R_j})^2$$
which is the loss function (the residual sum of squares, RSS) for a regression tree.
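As a quick illustration, here is a small sketch (using NumPy; the function names are my own) that computes this loss for a given partition of the training data:

```python
import numpy as np


def region_rss(y_region):
    """Within-region loss for one region R_j: sum_i (y_i - mean(y_Rj))^2."""
    y_region = np.asarray(y_region, dtype=float)
    return np.sum((y_region - y_region.mean()) ** 2)


def total_rss(y, region_labels):
    """Total loss: sum over all regions of the within-region RSS."""
    y = np.asarray(y, dtype=float)
    region_labels = np.asarray(region_labels)
    return sum(region_rss(y[region_labels == j]) for j in np.unique(region_labels))


# Example: two regions, where each prediction is the region mean
y = np.array([1.0, 1.5, 5.0, 5.5])
labels = np.array([0, 0, 1, 1])
print(total_rss(y, labels))  # 0.125 + 0.125 = 0.25
```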
However, it is computationally infeasible to consider every possible partition of the feature space, so we take a top-down, greedy approach known as recursive binary splitting. Basically, it begins at the top of the tree and then successively splits the predictor space into two regions. It is greedy because the best split is made at each step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
In mathematical terms, this means we select the predictor $X_j$ and the cutpoint $s$ such that splitting the predictor space into the regions $\{X|X_j<s\}$ and $\{X|X_j\ge s\}$ leads to the greatest reduction of our loss function. Next, we repeat this procedure on one of the two previously identified regions, and we continue until no region contains more than a specified number of observations.
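A minimal sketch of one greedy step, exhaustively scanning every predictor and every candidate cutpoint (the function names are my own, and the data here are made up), might look like this:

```python
import numpy as np


def rss(y):
    """Residual sum of squares around the mean of y."""
    return np.sum((y - y.mean()) ** 2) if y.size else 0.0


def best_split(X, y):
    """Greedy search for the predictor j and cutpoint s minimising the combined
    RSS of the two child regions {X_j < s} and {X_j >= s}."""
    n, p = X.shape
    best = (None, None, np.inf)  # (feature index j, cutpoint s, loss)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if left.size == 0 or right.size == 0:
                continue  # skip splits that leave one region empty
            loss = rss(left) + rss(right)
            if loss < best[2]:
                best = (j, s, loss)
    return best


# Example usage on toy data: the response jumps at X_0 = 0.5
rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=20)
print(best_split(X, y))  # should recover a cutpoint near 0.5 on feature 0
```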
2.2 Regularization
Looking at the stopping criterion for this model, we can see how easily it overfits. To prevent this, a better strategy is to grow a very large tree $T_0$ and then prune it back in order to obtain a subtree. The method for doing this is called cost-complexity pruning, also known as weakest link pruning. It adds a penalty for growing a very deep tree, and uses cross-validation to find the optimal penalty level (the tuning parameter $\alpha$ below). Here is the loss function for this method:
$$\sum_{m=1}^{|T|}\sum_{i: x_i \in R_m}(y_i - \hat{y}_{R_m})^2 + \alpha|T|$$
where $|T|$ is the number of leaves (terminal nodes) of the tree $T$.
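In practice, one way to apply this idea is with scikit-learn, which implements minimal cost-complexity pruning through the `ccp_alpha` parameter. Below is a sketch (on a synthetic dataset, so the numbers are only illustrative) that grows a large tree, extracts the candidate $\alpha$ values along the pruning path, and cross-validates over them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Grow a large tree T_0 and get the candidate alpha values along its pruning path
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))  # guard against tiny negative values

# Cross-validate over alpha (the tuning parameter in the penalised loss above)
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
)
search.fit(X, y)
print(search.best_params_["ccp_alpha"], search.best_estimator_.get_n_leaves())
```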
2.3 Model Implementation
This time I did spend some time constructing this model. It turned out to be more difficult than I thought. Here is the link to my Jupyter Notebook.
Decision Tree Model
I certainly need to improve this algorithm; let me get back to it later.
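In the meantime, here is a minimal sketch of the recursive structure such an implementation could take. This is not the notebook's code: it assumes the `rss()` and `best_split()` helpers from the sketch in section 2.1 are in scope, and the remaining names are my own:

```python
def build_tree(X, y, min_samples=5):
    """Recursively split until a region has fewer than min_samples observations
    or no valid split remains."""
    if len(y) < min_samples:
        return {"leaf": True, "value": y.mean()}
    j, s, _ = best_split(X, y)          # greedy choice of predictor j and cutpoint s
    if j is None:
        return {"leaf": True, "value": y.mean()}
    mask = X[:, j] < s
    return {
        "leaf": False, "feature": j, "threshold": s,
        "left": build_tree(X[mask], y[mask], min_samples),
        "right": build_tree(X[~mask], y[~mask], min_samples),
    }


def predict_one(tree, x):
    """Walk down the branches until a leaf is reached, then return its mean value."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] < tree["threshold"] else tree["right"]
    return tree["value"]


# Example usage with the toy data from section 2.1:
# tree = build_tree(X, y)
# print(predict_one(tree, X[0]))
```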
3. Classification Trees
Basically, a classification tree is very similar to a regression tree. If we follow the same line of reasoning as for the regression tree, here are the differences:
- Change the predicted value in each region: we use the majority vote (the mode of the class labels) as the prediction in that region.
- Change the loss function.
- 0-1 loss: $\sum_{j=1}^{J}\sum_{i: x_i \in R_j}\mathbb{I}(y_i \ne \hat{y}_{R_j})$, or the classification error rate: $\sum_{j=1}^{J}\frac{1}{n_j}\sum_{i: x_i \in R_j}\mathbb{I}(y_i \ne \hat{y}_{R_j})$
- Gini Index: $\sum_{j=1}^{J}\sum_{k=1}^{K}p_{jk}(1-p_{jk})$ where $p_{jk}$ represents the proportion of the training observations in the $j^{th}$ region that are from the $k^{th}$ class.
- Cross-Entropy: $-\sum_{j=1}^{J}\sum_{k=1}^{K}p_{jk} \log(p_{jk})$
The Gini index and cross-entropy are numerically very similar; they both reflect the purity of a region and so are more sensitive to node purity than the classification error rate. However, the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.
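As a small sketch (the function names are mine), the impurity measures can be computed directly from the class proportions of a single region; in the formulas above they are then summed over the $J$ regions:

```python
import numpy as np


def class_proportions(y_region):
    """Proportions p_jk of each class k among the training observations in one region."""
    _, counts = np.unique(y_region, return_counts=True)
    return counts / counts.sum()


def gini(y_region):
    p = class_proportions(y_region)
    return np.sum(p * (1 - p))


def cross_entropy(y_region):
    p = class_proportions(y_region)
    return -np.sum(p * np.log(p))


def error_rate(y_region):
    p = class_proportions(y_region)
    return 1 - p.max()


# A pure region scores 0 for all three measures; a 50/50 region does not
y_pure, y_mixed = np.array([1, 1, 1, 1]), np.array([0, 1, 0, 1])
print(gini(y_pure), gini(y_mixed))            # 0.0 and 0.5
print(cross_entropy(y_mixed))                 # log(2) ≈ 0.693
print(error_rate(y_pure), error_rate(y_mixed))  # 0.0 and 0.5
```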
4. Summary
Trees are very easy to explain to people and can be displayed graphically. Also, trees can easily handle qualitative predictors without the need to create dummy variables.
However, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.
How can we improve this method? Let us see how ensemble methods solve this problem next time.