Many Categories SSE

Build knowledge of how CART regression trees find the most optimal splits for categorical data.

Splitting binary categorical features

The simplest case for categorical features is where there are only two possible values (i.e., the feature is binary). In the case of binary categorical features, the CART regression tree algorithm calculates the SSE for the feature by choosing one of the categories, splitting the data, calculating the SSE for the left-hand and right-hand data, and then adding the two SSEs.

Splitting many category features

When a categorical feature has three or more categories (i.e., levels), the CART regression tree algorithm optimizes to evaluate a minimum number of potential splits. Consider a hypothetical model imputing missing values of the Age feature from the Titanic dataset.

The following table represents a sample of Titanic data for the Pclass and Age features available to the CART regression tree algorithm for a split:

Get hands-on with 1200+ tech skills courses.