First, let us review what the AdaBoost method does:
- Fit an additive model $\sum_{m=1}^M\beta_mb(x;\gamma_m)$ in a forward stage-wise manner.
- In each stage, introduce a new weak learner to compensate for the shortcomings of the existing weak learners. In AdaBoost, the "shortcomings" are identified by high-weight data points.
But AdaBoost has some obvious drawbacks. For example, it cannot deal with regression problems, and it requires the classification problem to have two outputs labeled 1 and -1. So, could we extend AdaBoost to a more general boosting model?
Sure, that is exactly what the gradient boosting machine (GBM) does.
1. Gradient Boosting For Regression
1.1 Model Intuition
We know that boosting fits models sequentially to improve our prediction power. In a regression problem, after we fit a model to our original data, we can always write a formula like this:
$$y_i = f(x_i) + h(x_i)$$
where $y_i$ is our true value, $f(x_i)$ is our predicted value, and $h(x_i) = y_i - f(x_i)$ is the residual.
So, if we want to improve our prediction, it is very natural to fit a new model to the residuals, and if the combined model is still not good enough, we continue fitting models to the new residuals.
That is basically the intuition behind GBM. Very simple, right? But how does that relate to gradients?
1.2 Residual vs Gradient
Suppose our loss function is the squared loss $L(y,f(x)) = \frac{1}{2}(y-f(x))^2$, and we want to minimize $J = \sum_{i=1}^{n} L(y_i,f(x_i))$. To find the optimal $f(x_i)$, we can use gradient descent. We first calculate the gradient with respect to $f(x_i)$: $$\frac{\partial J}{\partial f(x_i)}=\frac{\partial \sum_{i=1}^{n} L(y_i,f(x_i))}{\partial f(x_i)}=\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)} = f(x_i)-y_i$$ Therefore, we can interpret the residuals as negative gradients. We can then update our prediction just like the usual gradient descent algorithm: $$f_m(x_i) = f_{m-1}(x_i) - \frac{\partial L(y_i,f_{m-1}(x_i))}{\partial f_{m-1}(x_i)}$$ Of course, we cannot adjust each training prediction individually (that would just memorize the training data), so instead we fit our next sequential model to the negative gradient values and use its predictions as the update, which also generalizes to unseen $x$.
But why do we use gradients rather than residuals? It turns out that the concept of a gradient is more general and more useful than the concept of a residual. Here is the main reason:
Negative gradients can pay less attention to outliers. We know the squared loss is not robust to outliers: it pays a lot of attention to them and tries hard to incorporate them into the model. If the loss function is the squared loss, there is no difference between the negative gradient and the residual. However, if we use a robust loss function such as the Huber loss (Google it!), fitting to the negative gradient works much better than fitting to the raw residuals, because large residuals no longer dominate the fit.
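To see this concretely, here is a minimal NumPy sketch (the function names are my own) comparing the negative gradients of the squared loss and the Huber loss on data with one outlier:

```python
import numpy as np

def neg_gradient_squared(y, f):
    # -d/df [ 1/2 (y - f)^2 ] = y - f, i.e. exactly the residual
    return y - f

def neg_gradient_huber(y, f, delta=1.0):
    # The Huber loss is quadratic for |y - f| <= delta and linear beyond,
    # so its negative gradient is the residual clipped to [-delta, delta].
    r = y - f
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

y = np.array([1.0, 1.2, 0.9, 10.0])   # the last point is an outlier
f = np.full_like(y, 1.0)              # current prediction

print(neg_gradient_squared(y, f))     # the outlier contributes a huge value (9)
print(neg_gradient_huber(y, f))       # the outlier's contribution is capped at delta
```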
1.3 Model Structure
Now that we know the intuition behind GBM, let us cover the detailed algorithm of the gradient boosting model based on regression trees (a from-scratch sketch follows the list):
- First we initialize $f_0(x) = \arg\min_{\gamma}\sum_{i=1}^{n}L(y_i,\gamma)$, the constant value that minimizes our loss function (for squared loss, this is simply the mean of $y$).
- Then for each boosting iteration $m = 1, \dots, M$:
- Calculate the negative gradient for each data point: $r_{im}= -\frac{\partial L(y_i,f_{m-1}(x_i))}{\partial f_{m-1}(x_i)}$
- Fit a new regression tree to these negative gradient values, which gives terminal regions $R_{jm}$, $j = 1, \dots, J_m$. Then in each terminal region, we find the value that minimizes the overall loss and represents that region: $\gamma_{jm} = \arg\min_{\gamma}\sum_{x_i \in R_{jm}}L(y_i, f_{m-1}(x_i)+\gamma)$, which for squared loss is simply the mean of the residuals in that region.
- Update $f_m(x)=f_{m-1}(x) + \sum_{j=1}^{J_m}\gamma_{jm}I\{x \in R_{jm}\}$
- Output $f(x) = f_M(x)$
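To make the algorithm concrete, here is a from-scratch sketch for the squared-loss case, using shallow scikit-learn regression trees as the weak learners. The function names, toy data, and hyperparameter values are my own illustrative choices, not a reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_regression(X, y, M=100, max_depth=2):
    f0 = y.mean()                      # constant model minimizing squared loss
    trees, pred = [], np.full(len(y), f0)
    for m in range(M):
        residual = y - pred            # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        pred += tree.predict(X)        # for squared loss, each leaf already stores gamma_jm (the region mean)
    return f0, trees

def predict_gbm(X, f0, trees):
    return f0 + sum(t.predict(X) for t in trees)

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
f0, trees = fit_gbm_regression(X, y)
print("training MSE:", np.mean((y - predict_gbm(X, f0, trees)) ** 2))
```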
We can see that there are two basic tuning parameters in this algorithm: the number of iterations $M$ and the size of each constituent tree $J_m$. The reason I bring up these tuning parameters is that boosting can easily overfit. So, next let me talk about how to prevent overfitting.
1.4 Shrinkage and Subsampling
1.4.1 Shrinkage
There are two general ways to prevent overfitting for GBM. The first is shrinkage. The simplest implementation of shrinkage in the context of boosting is to scale the contribution of each tree by a factor $0 < v < 1$ when it is added to the current approximation. That is, our update procedure becomes: $$f_m(x)=f_{m-1}(x) + v \cdot \sum_{j=1}^{J_m}\gamma_{jm}I\{x \in R_{jm}\}$$ The parameter $v$ can be regarded as controlling the learning rate of the boosting procedure. Empirically, it has been found that smaller values of $v$ give better test error but require correspondingly larger values of $M$, and we usually use early stopping to choose $M$.
For $J_m$, since each tree should remain a weak learner, we still use small trees such as stumps or trees with around six terminal nodes.
1.4.2 Subsampling
With stochastic gradient boosting, at each iteration, we sample only a fraction of the training observations (without replacement) and grow the next tree using that subsample. Not only does this reduce training time, it can also improve accuracy. A sketch combining shrinkage, subsampling, and early stopping is shown below.
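Here is a sketch of how shrinkage ($v$, called learning_rate), subsampling (subsample), and early stopping appear in scikit-learn's GradientBoostingRegressor; the dataset and hyperparameter values are placeholders I chose for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=2000,        # large M, pruned by early stopping below
    learning_rate=0.05,       # the shrinkage factor v
    subsample=0.5,            # stochastic gradient boosting: half the rows per tree
    max_depth=3,              # small trees keep each learner weak
    validation_fraction=0.1,  # held-out fraction used for early stopping
    n_iter_no_change=10,      # stop when the validation score stalls for 10 rounds
    random_state=0,
)
gbm.fit(X, y)
print("trees actually fitted:", gbm.n_estimators_)
```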
2. Gradient Boosting for Classification
The intuition behind GBM for classification is very similar to regression. Basically, we still fit regression trees for the classification problem, and then we transform the score of each label into a probability (softmax).
Basically, here is what we do to deal with a classification problem:
- We first one-hot encode our labels so each one looks like a probability distribution
- We then use our normal GBM regression method to train on these targets, but we use the KL-divergence as our loss function (which, for one-hot targets, is the same as cross-entropy, since our output is now a distribution) and transform the output scores into probabilities with the softmax function (see the sketch after this list)
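As a minimal sketch of this setup (the helper names are my own), the snippet below one-hot encodes the labels, turns per-class raw scores into probabilities with softmax, and computes the negative gradient of the loss, which is just the one-hot target minus the predicted probability; these are the values the next round of regression trees would be fit to:

```python
import numpy as np

def softmax(scores):
    # scores: (n_samples, n_classes) raw scores from the per-class regression trees
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

y = np.array([0, 2, 1])            # class labels
y_onehot = np.eye(3)[y]            # one-hot "distribution" targets
scores = np.zeros((3, 3))          # initial raw scores, one column per class

prob = softmax(scores)
neg_grad = y_onehot - prob         # negative gradient of the cross-entropy/KL loss
print(neg_grad)                    # the targets for the next K regression trees
```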
3. XGBoost
XGBoost is an extension of GBM. Since its introduction, XGBoost has become more and more popular in Kaggle competitions, but people often have little idea of what it really is (like me). So, this time I will give a high-level introduction to the differences between XGBoost and GBM and why it has become so popular.
Basically, the advantages of XGBoost compared with GBM are:
- XGBoost decides the optimization step directly by Newton's method (using second-order derivative information)
- XGBoost has more regularization options
Here are the differences summarized by SauceCat:
GBM has broader applicability. At each iteration, both GBM and XGBoost need to calculate the gradient at the current estimate. XGBoost also needs to calculate the Hessian, which requires the objective function to be twice differentiable (strictly convex). GBM only requires a differentiable loss function, so it can be used in more applications.
XGBoost is faster. Comparing the leaf weights calculated by GBM and XGBoost: for GBM, the weight is simply the average value of the gradients, while for XGBoost, it is the sum of the gradients scaled by the sum of the Hessians.
XGBoost provides more regularization options, including L1 ($\alpha$) and L2 ($\lambda$) regularization as well as a penalty on the number of leaf nodes ($\gamma$).
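To make the weight comparison concrete, here is a sketch of the two leaf-value formulas, writing $g_i$ and $h_i$ for the first and second derivatives of the loss at the current prediction and $I_j$ for the set of points in leaf $j$ (for GBM the squared-loss case is shown, where the leaf value is the average negative gradient): $$\gamma_{jm}^{\text{GBM}} = -\frac{1}{|I_j|}\sum_{i \in I_j} g_i, \qquad w_{j}^{\text{XGBoost}} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$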
4. Implementation
Since GBM is a bit complex to write from scratch, and its basic building blocks (regression trees and gradient descent) are things I have already implemented before, I will only implement it using sklearn.ensemble.GradientBoostingClassifier. A minimal usage sketch is shown below.
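Here is a minimal usage sketch of sklearn.ensemble.GradientBoostingClassifier; the dataset and hyperparameter values are placeholders chosen for illustration, not necessarily what the notebook uses:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# a standard binary-classification toy dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```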
Here is the link to the Jupyter Notebook for GBM.
5. Summary
Here is the summary for GBM:
- Fit an additive model in a forward stage-wise manner
- In each stage, introduce a new regression tree $h(x)$ to compensate for the shortcomings of the existing model
- The "shortcomings" are identified by negative gradients
- For any loss function, we can derive a gradient boosting algorithm
- Absolute loss and Huber loss are more robust to outliers than squared loss
References:
- http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf
- https://towardsdatascience.com/boosting-algorithm-xgboost-4d9ec0207d