Boolean Masks

Learn how to filter data using Boolean masks and comparison operators.

We'll cover the following

Introduction to logical mask
Try it yourself

Introduction to logical mask

To help clean the case study data, we introduce the concept of a logical mask, also known as a Boolean mask. A logical mask is a way to filter an array, or Series, by some condition. For example, we can use the “is equal to” operator in Python, ==, to find all locations of an array that contain a certain value. Other comparisons, such as “greater than” (>), “less than” (<), “greater than or equal to” (>=), and “less than or equal to” (<=), can be used similarly. The output of such a comparison is an array or Series of True/False values, also known as Boolean values. Each element of the output corresponds to an element of the input: it is True if the condition is met for that element, and False otherwise.
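As a quick sketch of the comparisons described above, the following applies three operators to a small hand-written array. The array contents and variable names are illustrative assumptions chosen so the result can be checked by eye.

```python
import numpy as np

# Small hand-written array (illustrative values, not the case study data)
values = np.array([1, 3, 2, 3, 4])

equal_to_3 = values == 3       # element-wise "is equal to"
greater_than_2 = values > 2    # element-wise "greater than"
at_most_2 = values <= 2        # element-wise "less than or equal to"

# Each mask has one True/False entry per element of values
print(equal_to_3.tolist())      # [False, True, False, True, False]
print(greater_than_2.tolist())  # [False, True, False, True, True]
print(at_most_2.tolist())       # [True, False, True, False, False]
```

Each mask has the same shape as the input array, which is what allows it to be used for counting and indexing.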

To illustrate how this works, we will use synthetic data. Synthetic data is data that is created to explore or illustrate a concept. First, we are going to import the numpy package, which has many capabilities for generating random numbers, and give it the alias np.

We’ll also import the default random number generator from the random module within numpy:

import numpy as np
from numpy.random import default_rng

Now we use what’s called a seed for the random number generator. If you set the seed, you will get the same results from the random number generator across runs. Otherwise, this is not guaranteed. This can be a helpful option if you use random numbers in some way in your work and want to have consistent results every time you run a notebook. We arbitrarily set the seed to 12345:

rg = default_rng(12345)
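To see what the seed buys us, here is a minimal sketch that builds two generators from the same seed and compares their draws. The names rg_a and rg_b are hypothetical, and the integers method used here is the one the lesson introduces in the next step.

```python
import numpy as np
from numpy.random import default_rng

# Two independent generators created from the same seed
rg_a = default_rng(12345)
rg_b = default_rng(12345)

# Each generator produces its own sequence of draws
draws_a = rg_a.integers(low=1, high=5, size=10)
draws_b = rg_b.integers(low=1, high=5, size=10)

# Same seed, same sequence
same = np.array_equal(draws_a, draws_b)
print(same)  # True
```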

Next, we generate 100 random integers, using the integers method of rg, with the appropriate arguments. We want integers from 1 to 4, inclusive. Note that the high argument specifies an open endpoint by default, that is, the upper limit of the range is not included, so we pass high=5:

random_integers = rg.integers(low=1, high=5, size=100)
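If the open endpoint feels error-prone, the integers method also accepts an endpoint argument. The sketch below uses a separate, hypothetical generator named rg_demo so it does not disturb the lesson's rg, and compares the two ways of requesting values from 1 to 4.

```python
from numpy.random import default_rng

# Separate generator for this illustration only
rg_demo = default_rng(12345)

# Default behavior: high is exclusive, so high=5 yields values 1 through 4
open_draws = rg_demo.integers(low=1, high=5, size=100)

# endpoint=True makes high inclusive, so high=4 covers the same range
closed_draws = rg_demo.integers(low=1, high=4, size=100, endpoint=True)

# Both sets of draws stay within 1..4
print(open_draws.min(), open_draws.max())
print(closed_draws.min(), closed_draws.max())
```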

Let’s look at the first five elements of this array, with random_integers[:5]. The output should appear as follows:

# array([3, 1, 4, 2, 1])

Suppose we wanted to know the locations of all elements of random_integers equal to 3. We could create a Boolean mask to do this:

is_equal_to_3 = random_integers == 3

From examining the first 5 elements, we know the first element is equal to 3, but none of the rest are. So in our Boolean mask, we expect True in the first position and False in the next 4 positions. Is this the case?

is_equal_to_3[:5]

The preceding code should give this output:

# array([ True, False, False, False, False])

This is what we expected, and it shows how a Boolean mask is created. But what else can we do with one? Suppose we wanted to know how many elements were equal to 3. To find out, we can take the sum of the Boolean mask, which interprets True as 1 and False as 0:

sum(is_equal_to_3)

This should give us the following output:

# 31
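As a side note, the built-in sum is not the only way to count True values. The following sketch, on a small illustrative array (values and names are assumptions), shows a few NumPy alternatives that rely on the same True-as-1 interpretation.

```python
import numpy as np

# Illustrative array with three elements equal to 3
values = np.array([3, 1, 4, 2, 3, 3, 1, 2])
mask = values == 3

count_builtin = sum(mask)               # Python built-in, iterates element by element
count_method = mask.sum()               # NumPy method, vectorized
count_nonzero = np.count_nonzero(mask)  # counts the True entries directly
fraction = mask.mean()                  # share of elements where the mask is True

print(count_builtin, count_method, count_nonzero)  # 3 3 3
print(fraction)  # 0.375
```

Applied to is_equal_to_3, each of the counting approaches returns the same count of 31 reported above.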

This count is plausible: with a random, equally likely choice of 4 possible values, we would expect each value to appear about 25% of the time, and 31 out of 100 is reasonably close to that. In addition to seeing how many values in the array meet the Boolean condition, we can also use the Boolean mask to select the elements of the array that meet that condition. Boolean masks can be used directly to index arrays, as shown here:

random_integers[is_equal_to_3]

This outputs the elements of random_integers meeting the Boolean condition we specified. In this case, 31 elements are equal to 3:

# array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
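Indexing with the mask returns the matching values, but not where they sit in the array. As a hedged sketch on a small illustrative array, np.flatnonzero recovers the positions, and the ~ operator inverts a mask to select everything that does not meet the condition.

```python
import numpy as np

# Illustrative array (not the lesson's random_integers)
values = np.array([3, 1, 4, 2, 3])
mask = values == 3

positions = np.flatnonzero(mask)  # indices where the mask is True
not_three = values[~mask]         # ~ flips True and False element-wise

print(positions.tolist())  # [0, 4]
print(not_three.tolist())  # [1, 4, 2]
```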

Now you know the basics of Boolean arrays, which are useful in many situations. In particular, you can use the loc indexer of a DataFrame to select rows by a Boolean mask and columns by label. This returns the values of the chosen columns for the rows that meet a condition, even when that condition is defined on a different column. Let’s continue exploring the case study data with these skills.
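A minimal sketch of that loc pattern follows. The DataFrame and its column names (account_id, education, balance) are hypothetical toy data made up for illustration, not the case study's actual columns.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "account_id": ["a1", "a2", "a3", "a4"],
    "education": [1, 3, 2, 3],
    "balance": [500, 1200, 300, 800],
})

# Boolean mask built from one column
mask = df["education"] == 3

# Rows chosen by the mask, columns chosen by label
selected = df.loc[mask, ["account_id", "balance"]]
print(selected)
```

Here the condition is defined on education, while the returned values come from account_id and balance.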

Try it yourself

You can practice running this code yourself in the Jupyter notebook below.
