4. Create and Assess Machine Learning Models


Train and Evaluate Multiple Models on the Training Set

At last! We have framed the problem, obtained the data, explored it, prepared it, and written transformation pipelines that automatically clean it up for machine learning algorithms. We are now ready for the most exciting part: selecting and training a machine learning model.

The great news is that thanks to all the previous steps, things are going to be way simpler than you might think! Scikit-learn makes it all very easy!

Create a Test Set

As a first step, we are going to split our data into two sets: a training set and a test set. We train the model on only part of the data because we need to keep the rest aside to evaluate how well the model performs on instances it has not seen.

Creating a test set is quite simple: the most common approach is to pick some instances randomly, typically 20% of the dataset, and set them aside. The simplest function for doing this is Scikit-learn’s train_test_split().

By convention, the variables holding the features have X in their names (X_train and X_test), and the variables holding the value to be predicted have y in their names (y_train and y_test).
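A minimal sketch of such a split is shown here. The data is a made-up toy array (100 instances, 3 features) rather than the dataset prepared in the earlier steps, and the random_state value of 42 is an arbitrary choice that only makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the prepared dataset:
# 100 instances with 3 features each, and one target value per instance.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Set aside 20% of the instances as the test set.
# random_state fixes the shuffling so the same split is produced on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
print(y_train.shape, y_test.shape)  # (80,) (20,)
```

Because train_test_split() shuffles the features and the targets together, each row of X_train still lines up with the matching entry of y_train. The model is then trained only on X_train and y_train, while X_test and y_test are kept for the final evaluation.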
