Exercise: Generating and Modeling Synthetic Classification Data

Learn how overfitting happens by using a synthetic dataset with many candidate features and relatively few samples.

Overfitting in binary classification

Imagine you are given a binary classification dataset with many candidate features (200) and no time to look through all of them individually. Some of these features may be highly correlated or related in some other way, but with this many variables it is difficult to explore them all effectively. The dataset also has relatively few samples: only 1,000. We are going to generate this challenging dataset using a scikit-learn function that creates synthetic datasets for conceptual explorations such as this one. Perform the following steps to complete the exercise:

  1. Import the make_classification, train_test_split, and roc_auc_score functions, along with the LogisticRegression class, using the following code:

    from sklearn.datasets import make_classification 
    from sklearn.model_selection import train_test_split 
    from sklearn.linear_model import LogisticRegression 
    from sklearn.metrics import roc_auc_score
    

    Notice that we’ve imported several familiar names from scikit-learn, in addition to a new one that we haven’t seen before: make_classification. This function does just what its name indicates: it makes data for a classification problem. Using the various keyword arguments, you can specify how many samples and features to include, and how many classes the response variable will have. There is also a range of other options that effectively control how “easy” the problem will be to solve.

    Note: For more information, refer to the scikit-learn documentation on make_classification. Suffice it to say that we’ve selected options here that make a reasonably easy-to-solve problem, with a few curveballs thrown in. In other words, we expect high model performance, but we’ll have to work a little to get it. A small illustrative call is sketched below.
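    To build intuition for these options, here is a minimal sketch (not part of the exercise) using illustrative parameter values of our own choosing, such as weights for class imbalance and flip_y for label noise:

    from sklearn.datasets import make_classification
    import numpy as np

    # Illustrative call: 500 samples, 10 features, an imbalanced response
    # (weights) and a little label noise (flip_y)
    X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                         n_informative=3, n_classes=2,
                                         weights=[0.8], flip_y=0.05,
                                         random_state=24)
    print(X_demo.shape)     # (500, 10)
    print(np.mean(y_demo))  # roughly 0.2, i.e., about 20% positive class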

  2. Generate a dataset with two variables, X_synthetic and y_synthetic. The variable X_synthetic holds the 200 candidate features and y_synthetic holds the response variable, for all 1,000 samples. Use the following code:

    X_synthetic, y_synthetic = make_classification(
        n_samples=1000, n_features=200, n_informative=3, n_redundant=10,
        n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None,
        flip_y=0.01, class_sep=0.8, hypercube=True, shift=0.0, scale=1.0,
        shuffle=True, random_state=24)
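
    Although the exercise doesn’t require it, a quick check like the following sketch can reveal whether any of the generated features are strongly correlated with each other (the n_redundant features are linear combinations of the informative ones, so some strong correlations are likely; the 0.8 threshold is an arbitrary choice):

    import numpy as np

    # Correlation matrix of the 200 features; columns are the variables
    corr = np.corrcoef(X_synthetic, rowvar=False)
    np.fill_diagonal(corr, 0)  # ignore each feature's correlation with itself

    # Count feature pairs with a strong linear relationship
    high_pairs = np.argwhere(np.abs(corr) > 0.8)
    print('Feature pairs with |correlation| > 0.8:', len(high_pairs) // 2)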
    
  3. Examine the shape of the dataset and the class fraction of the response variable using the following code:

    import numpy as np

    print(X_synthetic.shape, y_synthetic.shape) 
    print(np.mean(y_synthetic))
    

    You will obtain the following output:

    (1000, 200) (1000,) 
    0.501
    

    After checking the shape of the output, note that we’ve generated an almost perfectly balanced dataset: close to a 50/50 class balance. It is also important to note that we’ve generated all the features so that they have the same shift and scale—that is, a mean of 0 with a standard deviation of 1. Making sure that the features are on the same scale, or have roughly the same range of values, is a key point for using regularization methods—and we’ll see why later. If the features in a raw dataset are on widely different scales, it is advisable to normalize them so that they are on the same scale. Scikit-learn has the functionality to make this easy, which we’ll learn about in the challenge at the end of this section.
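    As a preview of that functionality (the details are left for the challenge), a minimal sketch of feature scaling with scikit-learn’s StandardScaler might look like the following; the unscaled matrix X_raw here is a made-up example, not part of this exercise:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Hypothetical raw data with two features on very different scales
    X_raw = np.column_stack([
        np.random.uniform(0, 1, size=100),      # values roughly 0 to 1
        np.random.uniform(0, 10000, size=100),  # values roughly 0 to 10,000
    ])

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_raw)  # each column: mean ~0, std ~1
    print(X_scaled.mean(axis=0), X_scaled.std(axis=0))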

  4. Plot the first few features as histograms to show that their ranges of values are the same, using the following code:

    import matplotlib.pyplot as plt

    # Histograms of the first four features in a 2 x 2 grid of subplots
    for plot_index in range(4): 
        plt.subplot(2, 2, plot_index+1) 
        plt.hist(X_synthetic[:, plot_index]) 
        plt.title('Histogram for feature {}'.format(plot_index+1))
    plt.tight_layout()
    plt.show()
    

    You will obtain a 2 x 2 grid of histograms, one for each of the first four features, showing that each feature has roughly the same range of values.
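Although the numbered steps end here, the imports of LogisticRegression and roc_auc_score hint at where this dataset is headed: with 200 candidate features and only 1,000 samples, a weakly penalized model can fit the training data far better than it fits new data. As a rough preview, a sketch along the following lines, using the objects imported and generated in the steps above, could expose that gap; the split fraction and the C value are illustrative choices, not the exercise’s settings:

    # Hold out a test set, then fit a weakly regularized logistic regression
    X_train, X_test, y_train, y_test = train_test_split(
        X_synthetic, y_synthetic, test_size=0.2, random_state=24)

    model = LogisticRegression(solver='liblinear', C=1000)
    model.fit(X_train, y_train)

    # Compare performance on data the model has seen vs. held-out data
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print('Train ROC AUC: {:.3f}'.format(train_auc))
    print('Test ROC AUC: {:.3f}'.format(test_auc))

    # A training AUC well above the test AUC is the signature of overfitting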
