Feature Extraction

Learn how to extract features from data using scikit-learn.

In ML, features are the variables used for making predictions. Feature extraction involves transforming raw data into a set of features that can be used for training ML models. The scikit-learn library provides several methods for feature extraction, including feature hashing and text feature extraction. Because most ML models provided by scikit-learn need numerical data as input, you need to convert nonnumerical data before training the models.

Feature hashing for categorical data

Feature hashing, also known as the hashing trick, is a method for transforming categorical variables into numerical features. The idea is to apply a hash function to each categorical value and use the result as a column index in a fixed-size feature vector, which then serves as input for ML models. The mapping isn’t guaranteed to be unique: two different values can hash to the same index, which is called a collision.

This conversion matters because most scikit-learn models can’t take categorical data fed directly as text: the data needs to be converted to numbers first.

The method works by applying a hash function to each categorical value and reducing the hash to a column index in a lower-dimensional feature space, typically far smaller than the number of distinct categories. You specify the number of features (columns) in this space up front, and the output is a sparse matrix.
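The mechanism described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses MD5 from the standard library as the hash function and a plain list instead of a sparse matrix, and the helper names hash_index and hash_row are hypothetical. It is not how scikit-learn implements hashing internally.

```python
import hashlib

def hash_index(value: str, n_features: int) -> int:
    """Map a string to a column index in the range [0, n_features)."""
    # MD5 is an assumption made for this sketch; any stable hash would do.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features

def hash_row(categories, n_features=8):
    """Turn a list of category strings into a fixed-length count vector."""
    row = [0] * n_features
    for value in categories:
        row[hash_index(value, n_features)] += 1
    return row

# Hypothetical sample: one record with two categorical values.
vector = hash_row(["color=red", "size=large"], n_features=8)
print(vector)
```

The vector always has n_features entries, no matter how many distinct categories appear in the data.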

The advantage of using FeatureHasher is that the number of output features stays fixed, however many unique categories the data contains, and no mapping from categories to columns has to be stored in memory. When applied to categorical data, the hash function produces a compact numerical representation, and the resulting sparse matrix can be used as input for ML algorithms. However, it’s worth noting that this method can lose information: different categories may collide in the same column, and the transformation can’t be inverted to recover the original categorical values.
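To make the information loss concrete, the following sketch hashes 20 categories into only 4 columns, so some categories must share a column. It is a standalone illustration: the MD5-based hash_index helper and the city_0 to city_19 category names are assumptions made for the example, not part of scikit-learn.

```python
import hashlib
from collections import defaultdict

def hash_index(value: str, n_features: int) -> int:
    """Map a string to a column index in the range [0, n_features)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features

# Hypothetical category names; more categories than columns forces collisions.
categories = [f"city_{i}" for i in range(20)]
n_features = 4

# Group the categories by the column they hash to.
buckets = defaultdict(list)
for category in categories:
    buckets[hash_index(category, n_features)].append(category)

# Columns that received more than one category.
collisions = {idx: vals for idx, vals in buckets.items() if len(vals) > 1}
print(len(buckets), "columns used for", len(categories), "categories")
print(len(collisions), "columns hold more than one category")
```

Choosing a larger number of features makes collisions less likely, at the cost of a wider matrix.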

The scikit-learn library provides the FeatureHasher class for performing feature hashing, as shown below:
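The following is a minimal sketch of FeatureHasher usage rather than the lesson’s original example. The sample records and the choice of n_features=8 are assumptions made for illustration; in practice, a much larger number of features is usually chosen to keep collisions rare.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical sample data: each record is a dict of categorical features.
samples = [
    {"color": "red", "size": "large"},
    {"color": "blue", "size": "small"},
    {"color": "green", "size": "large"},
]

# n_features sets the width of the hashed output; 8 is an arbitrary small value.
hasher = FeatureHasher(n_features=8, input_type="dict")

# FeatureHasher is stateless, so transform can be called without fitting first.
hashed = hasher.transform(samples)

print(hashed.shape)      # (3, 8)
print(hashed.toarray())  # dense view of the sparse matrix
```

With the default settings, the entries can be negative as well as positive, because FeatureHasher alternates the sign of hashed values to offset the effect of collisions. Passing alternate_sign=False keeps all values non-negative.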
