This notebook implements XGBoost through the xgboost library and, along the way, tries to understand its tuning parameters.
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Load the Boston housing data and assemble it into a DataFrame
boston = load_boston()
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data['PRICE'] = boston.target

# Split features/target and hold out 20% of the rows for testing
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
data.head()
XGBoost has a lot of hyperparameters. They are normally divided into three groups:

General Parameters:
- booster=gbtree: choose the weak learner: gbtree or gblinear (a quick gblinear comparison is sketched after the regression example below).
- silent=0: set to 1 to suppress running log info.

Booster Parameters:
- learning_rate=0.3: step size shrinkage used to prevent overfitting. Range is [0,1].
- max_depth=6: determines how deeply each tree is allowed to grow during any boosting round.
- subsample=1: percentage of samples used per tree. A low value can lead to underfitting. Range is (0,1].
- colsample_bytree=1: percentage of features used per tree. A high value can lead to overfitting.
- n_estimators: number of trees you want to build.
- gamma=0: controls whether a given node will split based on the expected reduction in loss after the split. A higher value leads to fewer splits. Supported only for tree-based learners.
- alpha: L1 regularization on leaf weights. A larger value leads to more regularization.
- lambda: L2 regularization on leaf weights; smoother than L1 regularization.

Learning Task Parameters:
- objective=reg:squarederror: determines the loss function to be used, e.g. reg:squarederror (formerly reg:linear) for regression problems, reg:logistic for classification problems that only need a decision, binary:logistic for classification problems that need a probability.
- eval_metric: the metric to be used on validation data, such as rmse, mae, logloss, error.

xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.7, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: {}".format(rmse))
xgboost can also do cross validation with xgb.cv(). The main arguments are:
- num_boost_round: denotes the number of trees you build (analogous to n_estimators).
- metrics: tells the evaluation metrics to be watched during CV.
- as_pandas: return the results in a pandas DataFrame.
- early_stopping_rounds: finishes training of the model early if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds.
- seed: for reproducibility of results.

data_dmatrix = xgb.DMatrix(data=X, label=y)
params = {'objective':'reg:squarederror','colsample_bytree': 0.7,'learning_rate': 0.1,
'max_depth': 5, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix,
params=params,
nfold=3,
num_boost_round=50,
early_stopping_rounds=10,
metrics="rmse",
as_pandas=True,
seed=123)
cv_results.tail()
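Because early_stopping_rounds can stop training before the 50 requested rounds, the number of rows in cv_results shows how many boosting rounds were actually kept, and the last test-rmse-mean is the cross-validated score. A small sketch reading those values, assuming the standard column names produced by xgb.cv with metrics="rmse":
# Rounds actually run (may be fewer than num_boost_round due to early stopping)
print(len(cv_results))
# Cross-validated RMSE at the final round
print(cv_results["test-rmse-mean"].iloc[-1])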
XGBoost has a plot_tree()
function that makes it easy to visualize the individual boosted trees. Once you have a trained model, you can pass it to the plot_tree()
function along with the index of the tree you want to plot via the num_trees argument.
plt.rcParams['figure.figsize'] = [100, 20]  # enlarge the figure before plotting so the tree is readable
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()
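When the rendered image is too large to read comfortably, the same tree can also be inspected as plain text through the underlying booster. A minimal sketch; index 0 matches num_trees=0 above.
# Text dump of tree 0 (the tree plotted above)
print(xg_reg.get_booster().get_dump()[0])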
plt.rcParams['figure.figsize'] = [20, 5]  # switch back to a smaller figure for the importance plot
xgb.plot_importance(xg_reg)
plt.show()
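plot_importance() draws its numbers from the booster's feature scores, and the same values can be read programmatically. A small sketch, assuming the default importance_type='weight' (how often a feature is used to split):
# Feature importances as a dict, sorted highest first
importance = xg_reg.get_booster().get_score(importance_type='weight')
print(sorted(importance.items(), key=lambda kv: kv[1], reverse=True))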