I planned to write this post a long time ago. Thanks to my procrastination, I had to review the material again before writing it now. This post includes the following parts:
- Quick review of GBM
- How to decide the value in each node
  - Objective Function
  - Optimal Solution
- How to split the tree:
  - Exact Greedy Algorithm
  - Approximate Algorithm
- System Design Overview
- Parameter Explanation for XGBoost
1. Review of GBM
You may or may not know that the full name of XGBoost is eXtreme Gradient Boosting, so there is no doubt that this model is built on top of GBM. A detailed explanation of the GBM model can be found here. For this post, I will list some summarized GBM facts that will save us time when understanding XGBoost.
1.1 Theoretical Basis
Gradient boosting machine tries to fit an additive model in a forward stage-wise manner. In each stage, we introduce a new regression tree \(f(x)\) to compensate for the shortcomings of the existing model, and those "shortcomings" are identified by the negative gradients.
1.2 Negative Gradients
I don't think I explained the negative gradients part clearly last time, so this time I would like to dive a little deeper:
Remember that we have our additive model: \[\hat{y^{(t)}} = \hat{y^{(t-1)}} + f_t(X) \tag{1}\] At each step we want to train a new weak learner \(f_t\) (here we use a tree) so that the updated prediction \(\hat{y^{(t)}}\) minimizes the loss: \[\arg\min_{\hat{y^{(t)}}} \ L(y,\hat{y^{(t)}})\] To find the optimal weak learner that minimizes the loss, we can use gradient descent. If we treat \(\hat{y^{(t)}}\) as the parameter to update, we have: \[\hat{y^{(t)}} = \hat{y^{(t-1)}} - \frac{\partial L(y,\hat{y^{(t-1)}})}{\partial \hat{y^{(t-1)}}}\tag{2}\] This type of gradient descent is called gradient descent in function space.
Then comes the amazing part: if you compare formula (1) with (2), you will realize that if we fit \(f_t(X)\approx -\frac{\partial L(y,\hat{y^{(t-1)}})}{\partial \hat{y^{(t-1)}}}\), we will be minimizing the loss. In reality, \(\hat{y^{(t-1)}}\) can be treated as a known value, so the only part we train in \(\hat{y^{(t)}}\) is \(f_t(X)\).
That's why we say that at each stage, GBDT tries to fit the negative gradients of the previous stage.
And if you use the squared loss \(\frac{1}{2}(y-\hat{y})^2\), whose negative gradient is exactly \(y-\hat{y}\), you will find that every tree simply fits the residuals.
In a word: The loss function is still what is being minimized, while the gradient is how it is being minimized.
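To make this concrete, here is a minimal sketch of a GBM training loop for squared loss, where fitting the negative gradient is literally fitting the residual. It assumes scikit-learn's `DecisionTreeRegressor` as the weak learner; the function name and hyperparameters are purely illustrative.

```python
# Minimal GBM sketch for squared loss (assumes numpy and scikit-learn are available).
# For l(y, y_hat) = 1/2 * (y - y_hat)^2 the negative gradient is y - y_hat, i.e. the residual.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_rounds=50, learning_rate=0.1):
    y_pred = np.full(len(y), y.mean())      # initial constant prediction y_hat^(0)
    trees = []
    for _ in range(n_rounds):
        neg_grad = y - y_pred               # negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, neg_grad)               # f_t is trained to approximate the negative gradient
        y_pred += learning_rate * tree.predict(X)   # y_hat^(t) = y_hat^(t-1) + nu * f_t(X)
        trees.append(tree)
    return trees, y_pred
```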
After reviewing the GBM knowledge, we can summarize that the key points to generate a boosting tree are to know:
- How to decide the value in each node
- How to split the tree
Let's then look at how XGBoost solves these two problems.
2. How to decide the value in each node
2.1 Objective Function
The major difference between a standard GBM and XGBoost is the objective function: \[obj^{(t)}=\sum_{i=1}^{n}l(y_i,\hat{y_i^{(t-1)}}+f_t(x_i))+\Omega(f_t)\]
We can see that, in addition to the loss function, which is the same as in GBM, it also has a regularization term.
Regularization Term
The regularization term is very common and important for preventing overfitting. To know its specific form, we need to know which parameters control the tree \(f_t(x)\).
Here we define: \[f_t(x)= \omega_{q(x)}, \quad q: R^{d}\to\{1,2,\dots,T\}, \quad \omega \in R^{T}\] where \(q(x)\) defines the tree structure and gives the mapping from each sample \(x\) to a tree leaf, and \(\omega\) gives the leaf weights (leaf values) of that tree. In this way, a sample \(x_i\) is assigned to a leaf through \(q(x_i)\) and receives the value \(\omega_{q(x_i)}\).
Therefore, the complexity of a tree consists of two parts:
- The number of leaves \(T\)
- The weight \(\omega_j\) of each leaf
XGBoost penalizes both, giving the regularization term: \[\Omega(f_t) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^2\]
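As a small hypothetical example of this parameterization, the snippet below describes a 3-leaf tree with a made-up leaf assignment `q` and leaf weights `omega`, and evaluates \(\Omega(f_t)\) for illustrative values of \(\gamma\) and \(\lambda\).

```python
# Hypothetical 3-leaf tree: q maps each sample to a leaf index, omega holds the leaf weights.
import numpy as np

omega = np.array([0.4, -1.2, 0.7])   # leaf weights omega_j, so T = 3
q = np.array([0, 0, 2, 1, 2])        # q(x_i): leaf index for each of 5 samples
f_t_of_x = omega[q]                  # f_t(x_i) = omega_{q(x_i)} for every sample

gamma_, lambda_ = 1.0, 1.0           # illustrative values for gamma and lambda
T = len(omega)
regularization = gamma_ * T + 0.5 * lambda_ * np.sum(omega ** 2)
# Omega(f_t) = 1 * 3 + 0.5 * 1 * (0.16 + 1.44 + 0.49) = 4.045
```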
2.2 Optimal Solution
Now that the objective function is clear, let's try to answer the first question: how to decide the value in each node. From GBM we know that we minimize the loss to find the optimal value that represents each node (the mean for L2 loss and the median for L1 loss). Here we do the same, with the help of a mathematical technique: Taylor's theorem.
Taylor's theorem gives an approximation of a k-times differentiable function around a given point by a k-th order Taylor polynomial (here we show the second-order approximation): \[f(x+\Delta x) \approx f(x)+ f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2\]
2.2.1 Usage of the Taylor Formula
With this in hand, let's go back to our loss function without the regularization term: \[\sum_{i=1}^{n}l(y_i,\hat{y_i^{(t-1)}}+f_t(x_i))\] Treating \(y_i\) as a constant (it is the true value we already know) and applying the Taylor formula:
- Let \(x\) correspond to \(\hat{y_i^{(t-1)}}\) in the loss function
- Let \(\Delta x\) correspond to \(f_t(x_i)\)
Then we can rewrite the loss function as follows: \[loss^{(t)}\approx \sum_{i=1}^{n}[l(y_i,\hat{y_i^{(t-1)}})+ g_if_t(x_i) + \frac{1}{2}h_if^2_t(x_i)]\] where \(g_i = \partial_{\hat{y_i^{(t-1)}}} l(y_i,\hat{y_i^{(t-1)}})\) and \(h_i = \partial^2_{\hat{y_i^{(t-1)}}} l(y_i,\hat{y_i^{(t-1)}})\).
Next, since we are using a greedy method, the predictions of the previous \(t-1\) trees are treated as known at stage \(t\), so \(l(y_i,\hat{y_i^{(t-1)}})\) is a constant and can be dropped. After cleaning up we get: \[loss^{(t)}\approx \sum_{i=1}^{n}[g_if_t(x_i) + \frac{1}{2}h_if^2_t(x_i)]\]
We can see that the objective now only depends on the first and second derivatives of the loss function.
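For illustration, here are \(g_i\) and \(h_i\) for two common losses. The logistic case assumes \(\hat{y}\) is the raw (pre-sigmoid) score; treat these helpers as a sketch, not the library's exact implementation.

```python
import numpy as np

def grad_hess_squared(y, y_pred):
    """Squared loss l = 1/2 * (y - y_pred)^2: g = y_pred - y, h = 1."""
    return y_pred - y, np.ones_like(y, dtype=float)

def grad_hess_logistic(y, raw_score):
    """Logistic loss on raw scores: g = sigmoid(raw) - y, h = sigmoid(raw) * (1 - sigmoid(raw))."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y, p * (1.0 - p)
```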
2.2.2 From Sample to Leaf Node
Next we add the regularization term to the loss function and continue: \[\begin{align*} obj^{(t)} &\approx \sum_{i=1}^{n}[g_if_t(x_i) + \frac{1}{2}h_if^2_t(x_i)]+ \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^2 \\ &=\sum_{i=1}^{n}[g_i\omega_{q(x_i)} + \frac{1}{2}h_i \omega^2_{q(x_i)}]+ \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^2 \\ &=\sum_{j=1}^{T}[(\sum_{i\in I_j}g_i)\omega_j + \frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)\omega^2_j] + \gamma T \end{align*}\] where \(I_j\) contains the indices of all samples that belong to leaf node \(j\). Here we transform the summation from the sample level to the leaf level: at first we sum over samples \(i = 1 \to n\); then, using \(q(x)\) and \(\omega\) to describe the tree structure, we regroup the terms and sum over leaf nodes \(j = 1 \to T\). This regrouping step is really important for solving the optimization problem.
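We can sanity-check this regrouping numerically: summing \(g_i\,\omega_{q(x_i)}\) over samples gives the same number as summing \((\sum_{i\in I_j}g_i)\,\omega_j\) over leaves. A tiny made-up example:

```python
# Verify that the sample-level sum equals the leaf-level sum after regrouping.
import numpy as np

g = np.array([0.5, -0.2, 0.8, 0.1, -0.6])   # hypothetical gradients g_i
omega = np.array([0.4, -1.2, 0.7])          # leaf weights omega_j, T = 3
q = np.array([0, 0, 2, 1, 2])               # leaf assignment q(x_i)

sample_level = np.sum(g * omega[q])                           # sum over samples i = 1..n
G = np.array([g[q == j].sum() for j in range(len(omega))])    # per-leaf sums of g_i
leaf_level = np.sum(G * omega)                                # sum over leaves j = 1..T
assert np.isclose(sample_level, leaf_level)                   # both give the same value
```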
2.2.3 Optimal Solution
Finally, we can define \(G_j = \sum_{i\in I_j}g_i\) and \(H_j =\sum_{i\in I_j}h_i\) and rewrite the objective function: \[obj^{(t)}=\sum_{j=1}^{T}[G_j\omega_j + \frac{1}{2}(H_j+ \lambda)\omega^2_j] + \gamma T\] Setting the partial derivative with respect to \(\omega_j\) to 0, we get the optimal value: \[\omega^{*}_j = -\frac{G_j}{H_j+\lambda}\] Plugging it back into the objective function gives: \[obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G^2_j}{H_j+\lambda}+\gamma T\] This tells us, for a fixed tree structure, the minimum loss we can achieve, so it is also called the structure score.
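A short numerical sketch of both formulas, using made-up per-leaf sums \(G_j\) and \(H_j\):

```python
# Optimal leaf weights and the structure score for a fixed tree structure.
import numpy as np

G = np.array([0.3, 0.1, 0.2])    # G_j = sum of g_i over leaf j (hypothetical values)
H = np.array([2.0, 1.5, 3.0])    # H_j = sum of h_i over leaf j
lambda_, gamma_ = 1.0, 1.0
T = len(G)

omega_star = -G / (H + lambda_)                                  # omega*_j = -G_j / (H_j + lambda)
structure_score = -0.5 * np.sum(G ** 2 / (H + lambda_)) + gamma_ * T
# The lower the structure score, the better this fixed tree structure fits the objective.
```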
Now we can get the optimal value \(\omega^{*}_j\) that minimizes the objective function when the tree structure is fixed. The next question is: how can we get a good tree structure? In other words, how should we split the tree? Let's talk about that in the next post.
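Before closing, one practical note: the \(\lambda\) and \(\gamma\) derived above correspond to the `reg_lambda` and `gamma` (also known as `min_split_loss`) parameters of the XGBoost library. A minimal, purely illustrative usage sketch (data and parameter values are placeholders):

```python
import xgboost as xgb

# lambda and gamma from the derivation map to reg_lambda and gamma; values here are illustrative.
model = xgb.XGBRegressor(
    n_estimators=100,    # number of boosting rounds (trees f_t)
    learning_rate=0.1,   # shrinkage applied to each new tree
    reg_lambda=1.0,      # lambda: L2 penalty on the leaf weights omega_j
    gamma=0.0,           # gamma: cost per leaf / minimum loss reduction required to split
)
# model.fit(X_train, y_train)   # X_train and y_train are assumed to be provided
```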
References:
https://arxiv.org/abs/1603.02754
https://zhuanlan.zhihu.com/p/38329631
https://towardsdatascience.com/boosting-algorithm-gbm-97737c63daa3
https://blog.csdn.net/vjulyv/article/details/81410574
https://explained.ai/gradient-boosting/faq.html