Controlling Complexity

Learn to control the complexity of CART decision trees via hyperparameters.

The CART hyperparameters in R

The classification and regression tree (CART) algorithm is available in R via the rpart package. CART trees can be specified in tidymodels by passing the value rpart to the set_engine() function.
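As a minimal sketch (assuming the tidymodels meta-package is installed), a CART model specification looks like this:

```r
# Load the tidymodels meta-package, which includes parsnip
# (the package that provides decision_tree() and set_engine()).
library(tidymodels)

# Specify a CART decision tree that uses the rpart engine.
tree_spec <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")

tree_spec
```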

The rpart package supports many hyperparameters for controlling the complexity of decision tree models as they are built; these are the values we adjust when tuning. Of these hyperparameters, the following are the most useful in practice:

  • minsplit: The minimum number of observations that must exist in a node for a split to be attempted

  • minbucket: The minimum number of observations in any leaf node

The rpart package maintains a relationship between the above hyperparameters, which the short sketch after this list verifies:

  • If only a value for minsplit is provided, then minbucket is set to minsplit / 3.

  • If only a value for minbucket is provided, then minsplit is set to minbucket * 3.
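You can check this behavior directly with rpart's rpart.control() function, which computes the missing value for you:

```r
library(rpart)

# Only minsplit supplied: minbucket defaults to round(minsplit / 3).
rpart.control(minsplit = 30)$minbucket   # 10

# Only minbucket supplied: minsplit is set to minbucket * 3.
rpart.control(minbucket = 5)$minsplit    # 15
```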

Given these values, it’s common to tune only minsplit and allow minbucket to be set automatically.
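In tidymodels, parsnip's min_n argument maps to rpart's minsplit, so a tuning setup might look like the following sketch (tune() marks the value to be searched later; the resampling and grid-search steps are assumed to come afterward):

```r
library(tidymodels)

# min_n maps to rpart's minsplit when the engine is rpart,
# so minbucket will follow it automatically.
tune_spec <- decision_tree(min_n = tune()) |>
  set_engine("rpart") |>
  set_mode("classification")
```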

Controlling complexity

Conceptually, decision trees with more nodes are more complex than those with fewer nodes. Consider the nature of root and internal nodes within decision trees. These nodes represent rules the CART algorithm has learned from the training data. The more of these nodes a tree has, the more patterns it has learned from the data, and the more complex it is.

Now consider the nature of the minsplit hyperparameter. By setting a floor on how many observations a node must contain before a split is attempted, it limits how many nodes can be built in the tree. In general, larger values of minsplit produce smaller, less complex trees; smaller values do the opposite, producing larger, more complex trees.
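As a quick sketch of this effect (using R's built-in iris data purely as a stand-in dataset), we can grow the same tree under two minsplit values and compare their node counts:

```r
library(rpart)

# Fit the same model under two minsplit values (cp = 0 so the
# pruning threshold doesn't interfere with the comparison).
fit_small <- rpart(Species ~ ., data = iris,
                   control = rpart.control(minsplit = 2, cp = 0))
fit_large <- rpart(Species ~ ., data = iris,
                   control = rpart.control(minsplit = 50, cp = 0))

# Each row of the frame component is one node in the tree.
nrow(fit_small$frame)   # more nodes: a larger, more complex tree
nrow(fit_large$frame)   # fewer nodes: a smaller, simpler tree
```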

Let’s make these abstract ideas concrete with an example. Take the following data sample from the Adult Census Income dataset:
