Exercise: Obtaining Probabilities from Logistic Regression Model

Learn to obtain the probabilities of the trained logistic regression model.

Discovering predicted probabilities

How does logistic regression make predictions? Now that we’re familiar with accuracy, true and false positives and negatives, and the confusion matrix, we can explore new ways of using logistic regression to learn about more advanced binary classification metrics. So far, we’ve only considered logistic regression as a “black box” that can learn from labeled training data and then make binary predictions on new features. While we will learn about the workings of logistic regression in detail later in the course, we can begin to peek inside the black box now.

One thing to understand about how logistic regression works is that the raw predictions—in other words, the direct outputs from the mathematical equation that defines logistic regression—are not binary labels. They are actually probabilities on a scale from 0 to 1 (although, technically, the equation never allows the probabilities to be exactly equal to 0 or 1, as we’ll see later). These probabilities are only transformed into binary predictions through the use of a threshold: the probability at or above which a prediction is declared positive, and below which it is negative. The default threshold in scikit-learn is 0.5. This means any sample with a predicted probability of at least 0.5 is classified as positive, and any with a predicted probability below 0.5 is classified as negative. However, we are free to use any threshold we want. In fact, choosing the threshold is one of the key flexibilities of logistic regression, as well as of other machine learning classification algorithms that estimate probabilities of class membership.
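
To make the idea concrete, here is a minimal sketch of how a threshold converts probabilities into binary labels. The probability values here are made up for illustration; they are not output from our model:

    import numpy as np

    # Hypothetical predicted probabilities of the positive class (made-up values)
    probs = np.array([0.1, 0.45, 0.5, 0.72, 0.9])

    # The default threshold of 0.5: probabilities of at least 0.5 become positive (1)
    (probs >= 0.5).astype(int)  # array([0, 0, 1, 1, 1])

    # A stricter threshold of 0.8 yields fewer positive predictions
    (probs >= 0.8).astype(int)  # array([0, 0, 0, 0, 1])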

Predicted probabilities from logistic regression

In the following exercise, we will get familiar with the predicted probabilities of logistic regression and how to obtain them from a scikit-learn model.

We can begin to discover predicted probabilities by further examining the methods available on the logistic regression model object that we trained earlier in this section. Recall that once we trained the model, we could make binary predictions for new samples by passing their feature values to the predict method of the trained model. Those predictions assume a threshold of 0.5.
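
As a quick reminder of what that looked like (a minimal sketch, assuming the trained model and test features are still available as example_lr and X_test from earlier; y_pred is just an illustrative name):

    # Binary class predictions made with the default 0.5 threshold
    y_pred = example_lr.predict(X_test)
    y_pred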

However, we can directly access the predicted probabilities of these samples, using the predict_proba method. Perform the following steps to complete the exercise.

  1. Obtain the predicted probabilities for the test samples using this code:

    y_pred_proba = example_lr.predict_proba(X_test)
    y_pred_proba
    

    The output should be as follows:

    # array([[0.77423402, 0.22576598],
    #       [0.77423402, 0.22576598],
    #       [0.78792915, 0.21207085],
    #       ...,
    #       [0.78792915, 0.21207085],
    #       [0.78792915, 0.21207085],
    #       [0.78792915, 0.21207085]])
    

We see in the output of this, which we’ve stored in y_pred_proba, that there are two columns. This is because there are two classes in our classification problem: negative and positive. Assuming the negative labels are coded as 0 and the positive labels as 1, as they are in our data, scikit-learn reports the probability of negative class membership in the first column and the probability of positive class membership in the second.
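
If you want to confirm this column ordering yourself, the fitted model stores the class labels it learned in its classes_ attribute, and the columns of predict_proba follow that order. A quick check, assuming the example_lr model from the preceding step:

    example_lr.classes_
    # Expected output if the labels are coded as 0 and 1:
    # array([0, 1])
    # The columns of predict_proba follow this order: class 0, then class 1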

Because the two classes are mutually exclusive and are the only options, the sum of predicted probabilities for the two classes should equal 1 for every sample. Let’s confirm this.

First, we can use np.sum along axis 1 (summing across the columns) to calculate the total predicted probability for each sample.

  2. Calculate the sum of predicted probabilities for each sample with this code:

    prob_sum = np.sum(y_pred_proba, axis=1)
    prob_sum
    

    The output is as follows:

    # array([1., 1., 1., ..., 1., 1., 1.]) 
    

    It certainly looks like all 1s. We should check to see that the result is the same shape as the array of test data labels.

  3. Check the array shape with this code:

    prob_sum.shape 
    

    This should output the following:

    # (5333,) 
    

    Good; this is the expected shape. Now, to check that each value is 1, we use np.unique to show all the unique elements of this array; this is similar to DISTINCT in SQL. If all the probability sums are indeed 1, there should be only one unique element in the probability array: 1.

  4. Show all unique array elements with this code:

    np.unique(prob_sum) 
    

    This should output the following:

    # array([1.]) 
    

    Having confirmed that the predicted probabilities sum to 1 as expected, we note that it’s sufficient to consider just the second column, the predicted probability of positive class membership. Let’s capture these values in an array.

  5. Run this code to put the second column of the predicted probabilities array (the predicted probability of membership in the positive class) in an array:

    pos_proba = y_pred_proba[:,1] 
    pos_proba 
    

    The output should be as follows:

    # array([0.22576598, 0.22576598, 0.21207085, ..., 0.21207085, 0.21207085, 0.21207085])
    

    What do these probabilities look like? One way to find out, and a good diagnostic for model output, is to plot the predicted probabilities. A histogram is a natural way to do this, and we can use the matplotlib hist() function for it. Note that if you execute a cell containing only the histogram function call, the function’s return value (similar to the output of the NumPy histogram function) is displayed before the plot. This includes the number of samples in each bin and the locations of the bin edges.

  6. Execute this code to see the histogram output and an unformatted plot (not shown here):

    plt.hist(pos_proba)
    

    The output is as follows:

    # (array([1883.,    0.,    0., 2519.,    0.,    0.,  849.,    0.,    0., 82.]), 
    # array([0.21207085, 0.21636321, 0.22065556, 0.22494792, 0.22924027, 0.23353263, 0.23782498, 0.24211734, 0.24640969, 0.25070205, 0.2549944 ]), 
    # <BarContainer object of 10 artists>)
    

    This may be useful information for you, and it can also be obtained directly from the np.histogram() function (see the short sketch at the end of this exercise). However, here we’re mainly interested in the plot, so we adjust the font size and add some axis labels.

  7. Run this code for a formatted histogram plot of predicted probabilities:

    mpl.rcParams['font.size'] = 12  # increase the default font size for readability
    plt.hist(pos_proba)
    plt.xlabel('Predicted probability of positive class for test data')
    plt.ylabel('Number of samples')
    

    The plot should look like this:

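Finally, as noted above, the bin counts and bin edges can also be computed without drawing anything by calling np.histogram() directly. A minimal sketch, assuming the pos_proba array from the earlier steps:

    # Compute the same 10-bin histogram data without producing a plot
    counts, bin_edges = np.histogram(pos_proba, bins=10)
    counts     # the number of samples in each bin
    bin_edges  # the locations of the 11 bin edges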