Exercise: Fitting a Random Forest
Learn how to fit a random forest model with cross-validation on the training data from the case study.
Extending decision trees with random forests
In this exercise, we will extend our efforts with decision trees by using the random forest model with cross-validation on the training data from the case study. We will observe the effect of increasing the number of trees in the forest and examine the feature importance that can be calculated using a random forest model. Perform the following steps to complete the exercise:
- Import the random forest classifier model class as follows:
from sklearn.ensemble import RandomForestClassifier
- Instantiate the class using these options:
rf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=3,
                            min_samples_split=2, min_samples_leaf=1,
                            min_weight_fraction_leaf=0.0, max_features='auto',
                            max_leaf_nodes=None, min_impurity_decrease=0.0,
                            bootstrap=True, oob_score=False, n_jobs=None,
                            random_state=4, verbose=0, warm_start=False,
                            class_weight=None)
For this exercise, we'll use mainly the default options. However, note that we will set max_depth=3. Here, we are only going to explore the effect of using different numbers of trees, which we will illustrate with relatively shallow trees for the sake of shorter runtimes. To find the best model performance, we'd typically try more trees and deeper trees. We also set random_state for consistent results across runs.
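If you'd like to confirm these settings before fitting, one optional check (not part of the original steps) is to inspect the estimator's parameters with scikit-learn's get_params method:

params = rf.get_params()
# Confirm the key settings for this exercise: shallow trees, 10 estimators, fixed seed
print(params['max_depth'], params['n_estimators'], params['random_state'])  # 3 10 4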
- Create a parameter grid for this exercise in order to search the numbers of trees, ranging from 10 to 100 by 10s:
rf_params_ex = {'n_estimators':list(range(10,110,10))}
We use Python's range() function to create an iterator for the integer values we want, and then convert them to a list using list().
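If you'd like to see the values this produces, a quick check (plain Python, nothing specific to the case study) prints the grid:

# The grid of tree counts we'll search over
print(list(range(10, 110, 10)))  # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]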
- Instantiate a grid search cross-validation object for the random forest model using the parameter grid from the previous step. Otherwise, you can use the same options that were used for the cross-validation of the decision tree:
cv_rf_ex = GridSearchCV(rf, param_grid=rf_params_ex, scoring='roc_auc',
                        n_jobs=None, refit=True, cv=4, verbose=1,
                        pre_dispatch=None, error_score=np.nan,
                        return_train_score=True)
- Fit the cross-validation object as follows:
cv_rf_ex.fit(X_train, y_train)
The fitting procedure should output the following:
Fitting 4 folds for each of 10 candidates, totalling 40 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 40 out of 40 | elapsed: 28.0s finished
GridSearchCV(cv=4,
             estimator=RandomForestClassifier(max_depth=3, n_estimators=10,
                                              random_state=4),
             param_grid={'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]},
             pre_dispatch=None, return_train_score=True, scoring='roc_auc',
             verbose=1)
You may have noticed that, although we are only cross-validating over 10 hyperparameter values, comparable to the 7 values that we examined for the decision tree in the previous exercise, this cross-validation took noticeably longer. Consider how many trees we are growing in this case. For the last hyperparameter value, n_estimators=100, we grow a total of 400 trees across all the cross-validation splits.

How long has model fitting taken across the various numbers of trees that we just tried? What gains in terms of cross-validation testing performance have we made by using more trees? These are good questions to examine using plots. First, we'll pull the cross-validation results out into a pandas DataFrame, as we've done before.
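As a rough, back-of-envelope illustration of why this search grows so many trees, you could tally the total across all candidates and folds (assuming the 4 folds and the grid defined above; the final refit on the full training set adds a few more):

n_folds = 4
n_estimators_grid = list(range(10, 110, 10))
# Each candidate n_estimators value grows that many trees in every fold
total_trees = n_folds * sum(n_estimators_grid)
print(total_trees)  # 2200 trees grown during the cross-validation search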
- Put the cross-validation results into a pandas DataFrame:
cv_rf_ex_results_df = pd.DataFrame(cv_rf_ex.cv_results_)
You can examine the whole DataFrame in the accompanying Jupyter Notebook. Here, we move directly to creating plots of the quantities of interest. We'll make a line plot, with symbols, of the mean fit time across the folds for each hyperparameter value, contained in the mean_fit_time column, as well as an error bar plot of testing scores, which we've already done for decision trees. Both plots will be against the number of trees on the x-axis.
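Before plotting, it can also help to look at just the columns we're about to use; the names below are standard keys that GridSearchCV stores in cv_results_, so they should appear in the DataFrame:

# Peek at only the columns used in the plots that follow
cv_rf_ex_results_df[['param_n_estimators', 'mean_fit_time',
                     'mean_test_score', 'std_test_score']]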
- Create two subplots of the mean training time and mean testing scores with standard error:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(6, 3))
axs[0].plot(cv_rf_ex_results_df['param_n_estimators'],
            cv_rf_ex_results_df['mean_fit_time'], '-o')
axs[0].set_xlabel('Number of trees')
axs[0].set_ylabel('Mean fit time (seconds)')
axs[1].errorbar(cv_rf_ex_results_df['param_n_estimators'],
                cv_rf_ex_results_df['mean_test_score'],
                yerr=cv_rf_ex_results_df['std_test_score']/np.sqrt(4))
axs[1].set_xlabel('Number of trees')
axs[1].set_ylabel(r'Mean testing ROC AUC $\pm$ 1 SE')
plt.tight_layout()
Here, we've used plt.subplots to create two axes at once, within a figure, in a one-row-by-two-column configuration. We then access the axes objects by indexing the array of axes, axs, returned from this operation in order to create the plots. The output should look similar to this plot:
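Once you've examined the plots, a natural follow-up is to check which number of trees the search actually selected. The attributes below are standard on a fitted GridSearchCV; the exact values will depend on the case study data:

# Best hyperparameter setting found by the search and its mean testing ROC AUC
print(cv_rf_ex.best_params_)
print(cv_rf_ex.best_score_)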