Gradient Boosting Machine

This notebook implements a gradient boosting machine (GBM) for regression using scikit-learn's GradientBoostingRegressor and walks through the main tuning parameters of that class.

1. Load Libraries and Generate Data

In [33]:
# Load Library
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
In [20]:
# Generate Data
X, y = make_regression(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

2. Model Implementation

The number of weak learners (i.e. regression trees) is controlled by the parameter n_estimators. The size of each tree can be controlled either by setting the tree depth via max_depth or by setting the number of leaf nodes via max_leaf_nodes. The learning_rate is a hyper-parameter in the range (0.0, 1.0] that controls overfitting via shrinkage: each tree's contribution is scaled down, so more trees are needed but the ensemble generalizes better. Finally, the subsample parameter fits each tree on a random fraction of the training data, turning the procedure into stochastic gradient boosting. The interplay of learning_rate and n_estimators is illustrated in the sketch below.
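
To see that trade-off concretely, the following sketch (an added illustration, not part of the original run; the learning rates are arbitrary) uses staged_predict to trace the test MSE after each boosting iteration:

In [ ]:
# Sketch: trace test MSE across boosting iterations for a few learning
# rates. staged_predict yields the ensemble's prediction after each stage.
for lr in (0.05, 0.1, 0.5):
    model = GradientBoostingRegressor(n_estimators=500,
                                      learning_rate=lr,
                                      max_depth=1,
                                      random_state=42).fit(X_train, y_train)
    test_mse = [mean_squared_error(y_test, y_pred)
                for y_pred in model.staged_predict(X_test)]
    plt.plot(np.arange(1, len(test_mse) + 1), test_mse, label=f"learning_rate={lr}")
plt.xlabel("Boosting iterations")
plt.ylabel("Test MSE")
plt.legend()
plt.show()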

In [27]:
gbm = GradientBoostingRegressor(n_estimators=1000,
                                learning_rate=0.1,
                                max_depth=1,
                                subsample=0.5,
                                random_state=42)
In [28]:
gbm = gbm.fit(X_train, y_train)
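
Since tuning n_estimators by hand can be tedious, GradientBoostingRegressor also supports built-in early stopping via validation_fraction and n_iter_no_change. A minimal sketch with illustrative (untuned) values:

In [ ]:
# Sketch: let early stopping pick the effective number of trees by
# monitoring a held-out validation split inside fit.
gbm_es = GradientBoostingRegressor(n_estimators=1000,
                                   learning_rate=0.1,
                                   max_depth=1,
                                   subsample=0.5,
                                   validation_fraction=0.1,  # internal validation split
                                   n_iter_no_change=10,      # stop after 10 stagnant rounds
                                   random_state=42)
gbm_es.fit(X_train, y_train)
gbm_es.n_estimators_  # number of trees actually fitted before stopping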
In [34]:
# Evaluate on the held-out test set
mean_squared_error(y_test, gbm.predict(X_test))
Out[34]:
346.2935087147836
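
A large gap between train and test MSE would indicate overfitting; a quick check on the fitted model:

In [ ]:
# Compare train and test error as a rough overfitting check; a large gap
# would suggest lowering n_estimators or learning_rate.
print("train MSE:", mean_squared_error(y_train, gbm.predict(X_train)))
print("test MSE: ", mean_squared_error(y_test, gbm.predict(X_test)))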
In [38]:
# Impurity-based feature importances (they sum to 1)
gbm.feature_importances_
Out[38]:
array([0.118, 0.049, 0.132, 0.066, 0.138, 0.144, 0.096, 0.068, 0.088,
       0.101])
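
The impurity-based importances above are easier to compare as a bar chart; a minimal plotting sketch:

In [ ]:
# Plot the impurity-based feature importances computed by the ensemble.
plt.bar(np.arange(len(gbm.feature_importances_)), gbm.feature_importances_)
plt.xlabel("Feature index")
plt.ylabel("Importance")
plt.show()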