XGBoost

This notebook applies XGBoost through the xgboost library and, along the way, explains its main tuning parameters

1. Load the Library

In [1]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
In [2]:
boston = load_boston()
In [3]:
# Generate Data
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data['PRICE'] = boston.target
X, y = data.iloc[:,:-1],data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
In [4]:
data.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

2. Model Training

XGBoost has many hyperparameters. They are usually divided into three groups:

  1. General Parameters:

    • booster=gbtree: choose the weak learner: gbtree (tree-based) or gblinear (linear)
    • silent=0: set to 1 to suppress the running log messages
  2. Booster Parameters:

    • learning_rate=0.3: step size shrinkage used to prevent overfitting. Range is [0,1]
    • max_depth=6: determines how deeply each tree is allowed to grow during any boosting round.
    • subsample=1: fraction of training samples used per tree. A low value can lead to underfitting. Range is (0,1]
    • colsample_bytree=1: fraction of features used per tree. A high value can lead to overfitting.
    • n_estimators: number of trees you want to build.
    • gamma=0: minimum loss reduction required to make a further split on a node; a higher value leads to fewer splits. Supported only for tree-based learners.
    • alpha: L1 regularization on leaf weights. A large value leads to more regularization.
    • lambda: L2 regularization on leaf weights and is smoother than L1 regularization.
  3. Learning Task Parameters:

    • objective=reg:squarederror: determines the loss function to be used, e.g. reg:squarederror (formerly reg:linear) for regression, binary:logistic for binary classification returning probabilities, binary:hinge for binary classification returning only the 0/1 decision.
    • eval_metric: the metric to be used on validation data, such as rmse, mae, logloss, or error
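To make these groups concrete, here is a minimal sketch (the values are arbitrary and purely illustrative) of how the parameters above map onto the scikit-learn wrapper; note that alpha and lambda are exposed there as reg_alpha and reg_lambda:

# Illustrative only: arbitrary values for the parameters described above
example_reg = xgb.XGBRegressor(
    booster='gbtree',              # general: tree-based weak learner
    objective='reg:squarederror',  # learning task: squared-error loss
    learning_rate=0.3,             # step-size shrinkage
    max_depth=6,                   # maximum depth of each tree
    subsample=1.0,                 # fraction of rows sampled per tree
    colsample_bytree=1.0,          # fraction of columns sampled per tree
    gamma=0,                       # minimum loss reduction to split a node
    reg_alpha=0,                   # L1 regularization on leaf weights
    reg_lambda=1,                  # L2 regularization on leaf weights
    n_estimators=100)              # number of boosting rounds (trees)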
In [5]:
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.7, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 10)
In [6]:
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
In [7]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: {}".format(rmse))
RMSE: 9.656435145869283
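This RMSE is fairly high, partly because the model was given only 10 trees (n_estimators = 10) and a strong L1 penalty (alpha = 10). As a quick, illustrative re-fit (the values below are arbitrary), more boosting rounds and less regularization would typically bring the error down:

# Illustrative re-fit with more trees and a weaker L1 penalty (arbitrary values)
xg_reg2 = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.7,
                           learning_rate=0.1, max_depth=5, alpha=1, n_estimators=100)
xg_reg2.fit(X_train, y_train)
print("RMSE: {}".format(np.sqrt(mean_squared_error(y_test, xg_reg2.predict(X_test)))))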

3. Model Validation

xgboost can also perform cross-validation via xgb.cv(); its main arguments are:

  • num_boost_round: denotes the number of trees you build (analogous to n_estimators)
  • metrics: the evaluation metric(s) to be watched during CV
  • as_pandas: to return the results in a pandas DataFrame.
  • early_stopping_rounds: finishes training of the model early if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds.
  • seed: for reproducibility of results.
In [8]:
data_dmatrix = xgb.DMatrix(data=X,label=y)
In [9]:
params = {'objective':'reg:squarederror','colsample_bytree': 0.7,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, 
                    params=params, 
                    nfold=3,
                    num_boost_round=50,
                    early_stopping_rounds=10,
                    metrics="rmse", 
                    as_pandas=True, 
                    seed=123)
In [10]:
cv_results.tail()
Out[10]:
test-rmse-mean test-rmse-std train-rmse-mean train-rmse-std
45 3.581326 0.339733 1.770831 0.006182
46 3.566893 0.345902 1.742025 0.004021
47 3.554664 0.345619 1.713263 0.005822
48 3.540806 0.350308 1.686923 0.006609
49 3.530053 0.349713 1.663324 0.008643
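Because cv_results is a pandas DataFrame, it is straightforward to plot the mean train and test RMSE per boosting round and check for overfitting (a small illustrative snippet, not part of the original run):

# Plot mean train/test RMSE against the boosting round (illustrative)
plt.figure(figsize=(8, 4))
plt.plot(cv_results['train-rmse-mean'], label='train RMSE')
plt.plot(cv_results['test-rmse-mean'], label='test RMSE')
plt.xlabel('Boosting round')
plt.ylabel('RMSE')
plt.legend()
plt.show()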

4. Model Visualization

XGBoost's plot_tree() function makes it easy to visualize individual trees. Once you have trained a model, you can pass it to plot_tree() together with the index of the tree you want to plot via the num_trees argument (plotting trees requires the graphviz package to be installed)

In [15]:
plt.rcParams['figure.figsize'] = [100, 20]  # set the figure size before plotting
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()
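If the matplotlib rendering is hard to read, the same tree can also be exported through graphviz to a scalable image (a small sketch; the output filename is arbitrary and the graphviz package must be installed):

# Export the first tree as a graphviz object and render it to a file (illustrative)
graph = xgb.to_graphviz(xg_reg, num_trees=0)
graph.render('tree0', format='png')  # writes tree0.png (arbitrary filename)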
In [19]:
plt.rcParams['figure.figsize'] = [20, 5]  # set the figure size before plotting
xgb.plot_importance(xg_reg)
plt.show()
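plot_importance() is built on the booster's get_score() method; if you want the raw numbers instead of a plot, you can query them directly (a small sketch; 'weight' counts how many times a feature is used to split):

# Raw importance scores keyed by feature name (illustrative)
scores = xg_reg.get_booster().get_score(importance_type='weight')
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))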