At this point, you should have an understanding of how a decision tree makes predictions using features, and the class fractions of training samples in the leaf nodes. Now, we will learn how decision trees are trained. The training process involves selecting features to split nodes on, and the thresholds at which to make splits, for example PAY_1 <= 1.5 for the first split in the tree of the previous exercise. Computationally, this means the samples in each node must be sorted on the values of each feature to consider a split for, and splits between each successive pair of sorted feature values are considered. All features may be considered, or only a subset as we will learn about shortly.

How are the splits decided during the training process?

Given that the method of prediction is to take the majority class of a leaf node, it makes sense that we’d like to find leaf nodes that are primarily from one class or the other; choosing the majority class will be a more accurate prediction, the closer a node is to containing just one class. In the perfect case, the training data can be split so that every leaf node contains entirely positive or entirely negative samples. Then, we will have a high level of confidence that a new sample, once sorted into one of these nodes, will be either positive or negative. In practice, this rarely, if ever, happens. However, this illustrates the goal of training decision trees—that is, to make splits so that the next two nodes after the split have a higher purity, or, in other words, are closer to containing either only positive or only negative samples.

In practice, decision trees are actually trained using the inverse of purity, or node impurity. This is some measure of how far the node is from having 100% of the training samples belonging to one class and is analogous to the concept of a cost function, which signifies how far a given solution is from a theoretical perfect solution. The most intuitive concept of node impurity is the misclassification rate. Adopting a widely used notation (see the scikit-learn documentation) for the proportion of samples in each node belonging to a certain class, we can define pmkp_{mk} as the proportion of samples belonging to the kthk^{th} class in the mthm^{th} node. In a binary classification problem, there are only two classes: k=0k = 0 and k=1k = 1. For a given node mm, the misclassification rate is simply the proportion of the less common class in that node, since all these samples will be misclassified when the majority class in that node is taken as the prediction.

Misclassification rate

Let’s visualize the misclassification rate as a way to start thinking about how decision trees are trained. Programmatically, we consider possible class fractions, pm0p_{m0}, between 0.01 and 0.99 of the negative class, k=0k = 0, in a node, mm, using NumPy’s linspace function:

pm0 = np.linspace(0.01,0.99,99) 
pm1 = 1 - pm0

Then, the fraction of the positive class for this node is one minus pm0p_{m0}:

pm1=1−pm0p_{m1}=1-p_{m0}

Now, the misclassification rate for this node will be whatever the smaller class fraction is, between pm0p_{m0} and pm1p_{m1}. We can find the smaller of the corresponding elements between two arrays with the same shape in NumPy by using the minimum function:

misclassification_rate = np.minimum(pm0, pm1)

What does the misclassification rate look like plotted against the possible class fractions of the negative class?

We can plot this using the following code:

mpl.rcParams['figure.dpi'] = 400 
plt.plot(pm0, misclassification_rate, label='Misclassification rate') 
plt.xlabel('$p_{m0}$') 
plt.legend()

You should obtain this graph:

Get hands-on with 1200+ tech skills courses.