XGBoost

This notebook applies XGBoost through the xgboost library and, along the way, explains its main tuning parameters

1. Load the Library

In [1]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
In [2]:
boston = load_boston()
In [3]:
# Generate Data
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data['PRICE'] = boston.target
X, y = data.iloc[:,:-1],data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
In [4]:
data.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

2. Model Training

XGBoost has many hyperparameters. They are usually divided into three groups:

  1. General Parameters:

    • booster=gbtree: choose the weak learner: gbtree (tree-based) or gblinear (linear)
    • silent=0: set to 1 to suppress the running log messages
  2. Booster Parameters:

    • learning_rate=0.3: step size shrinkage used to prevent overfitting. Range is [0,1]
    • max_depth=6: determines how deeply each tree is allowed to grow during any boosting round.
    • subsample=1: fraction of training samples used per tree. A low value can lead to underfitting. Range is (0,1]
    • colsample_bytree=1: fraction of features used per tree. A high value can lead to overfitting.
    • n_estimators: number of trees you want to build.
    • gamma=0: minimum loss reduction required to make a further split on a node; a higher value leads to fewer splits. Supported only for tree-based learners.
    • alpha: L1 regularization on leaf weights. A large value leads to more regularization.
    • lambda: L2 regularization on leaf weights and is smoother than L1 regularization.
  3. Learning Task Parameters:

    • objective=reg:squarederror: determines the loss function to be used, e.g. reg:squarederror (formerly reg:linear) for regression, binary:logistic for binary classification returning probabilities, binary:hinge for binary classification returning only the 0/1 decision.
    • eval_metric: the metric to be used on validation data, such as rmse, mae, logloss, or error
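To make these groups concrete, here is a minimal sketch (the values are arbitrary and purely illustrative) of how the parameters above map onto the scikit-learn wrapper; note that alpha and lambda are exposed there as reg_alpha and reg_lambda:

# Illustrative only: arbitrary values for the parameters described above
example_reg = xgb.XGBRegressor(
    booster='gbtree',              # general: tree-based weak learner
    objective='reg:squarederror',  # learning task: squared-error loss
    learning_rate=0.3,             # step-size shrinkage
    max_depth=6,                   # maximum depth of each tree
    subsample=1.0,                 # fraction of rows sampled per tree
    colsample_bytree=1.0,          # fraction of columns sampled per tree
    gamma=0,                       # minimum loss reduction to split a node
    reg_alpha=0,                   # L1 regularization on leaf weights
    reg_lambda=1,                  # L2 regularization on leaf weights
    n_estimators=100)              # number of boosting rounds (trees)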
In [5]:
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.7, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 10)
In [6]:
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
In [7]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: {}".format(rmse))
RMSE: 9.656435145869283
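This RMSE is fairly high, partly because the model was given only 10 trees (n_estimators = 10) and a strong L1 penalty (alpha = 10). As a quick, illustrative re-fit (the values below are arbitrary), more boosting rounds and less regularization would typically bring the error down:

# Illustrative re-fit with more trees and a weaker L1 penalty (arbitrary values)
xg_reg2 = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.7,
                           learning_rate=0.1, max_depth=5, alpha=1, n_estimators=100)
xg_reg2.fit(X_train, y_train)
print("RMSE: {}".format(np.sqrt(mean_squared_error(y_test, xg_reg2.predict(X_test)))))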

3. Model Validation

xgboost can also perform cross-validation via xgb.cv(); its main arguments are:

  • num_boost_round: denotes the number of trees you build (analogous to n_estimators)
  • metrics: the evaluation metric(s) to be watched during CV
  • as_pandas: to return the results in a pandas DataFrame.
  • early_stopping_rounds: finishes training of the model early if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds.
  • seed: for reproducibility of results.
In [8]:
data_dmatrix = xgb.DMatrix(data=X,label=y)
In [9]:
params = {'objective':'reg:squarederror','colsample_bytree': 0.7,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, 
                    params=params, 
                    nfold=3,
                    num_boost_round=50,
                    early_stopping_rounds=10,
                    metrics="rmse", 
                    as_pandas=True, 
                    seed=123)
In [10]:
cv_results.tail()
Out[10]:
test-rmse-mean test-rmse-std train-rmse-mean train-rmse-std
45 3.581326 0.339733 1.770831 0.006182
46 3.566893 0.345902 1.742025 0.004021
47 3.554664 0.345619 1.713263 0.005822
48 3.540806 0.350308 1.686923 0.006609
49 3.530053 0.349713 1.663324 0.008643
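Because cv_results is a pandas DataFrame, it is straightforward to plot the mean train and test RMSE per boosting round and check for overfitting (a small illustrative snippet, not part of the original run):

# Plot mean train/test RMSE against the boosting round (illustrative)
plt.figure(figsize=(8, 4))
plt.plot(cv_results['train-rmse-mean'], label='train RMSE')
plt.plot(cv_results['test-rmse-mean'], label='test RMSE')
plt.xlabel('Boosting round')
plt.ylabel('RMSE')
plt.legend()
plt.show()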

4. Model Visualization

XGBoost's plot_tree() function makes it easy to visualize individual trees. Once you have trained a model, you can pass it to plot_tree() together with the index of the tree you want to plot via the num_trees argument (plotting trees requires the graphviz package to be installed)

In [15]:
plt.rcParams['figure.figsize'] = [100, 20]  # set the figure size before plotting
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()
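If the matplotlib rendering is hard to read, the same tree can also be exported through graphviz to a scalable image (a small sketch; the output filename is arbitrary and the graphviz package must be installed):

# Export the first tree as a graphviz object and render it to a file (illustrative)
graph = xgb.to_graphviz(xg_reg, num_trees=0)
graph.render('tree0', format='png')  # writes tree0.png (arbitrary filename)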
In [19]:
plt.rcParams['figure.figsize'] = [20, 5]  # set the figure size before plotting
xgb.plot_importance(xg_reg)
plt.show()
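plot_importance() is built on the booster's get_score() method; if you want the raw numbers instead of a plot, you can query them directly (a small sketch; 'weight' counts how many times a feature is used to split):

# Raw importance scores keyed by feature name (illustrative)
scores = xg_reg.get_booster().get_score(importance_type='weight')
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))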