The traditional, or classical, approach to NLP is statistical and proceeds as a sequential flow of several key steps. A closer look at a traditional NLP pipeline reveals a set of distinct tasks: preprocessing the data to remove unwanted content, feature engineering to obtain good numerical representations of the text, training machine learning algorithms on the training data, and predicting outputs for novel, unseen data. Of these, feature engineering was the most time-consuming and the most crucial step for obtaining good performance on a given NLP task.

Understanding the traditional approach

The traditional approach to solving NLP tasks involves a collection of distinct subtasks. First, the text corpora need to be preprocessed, with a focus on reducing the vocabulary and removing distractions. By distractions, we mean the things that distract the algorithm (for example, punctuation marks and stop words) from capturing the vital linguistic information required for the task.
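To make the preprocessing step concrete, here is a minimal sketch in Python. It assumes English text, ASCII punctuation, and whitespace tokenization. The `STOP_WORDS` set and the `preprocess` function are hypothetical names introduced for illustration only; real systems typically rely on larger, curated stop word lists.

```python
import string

# Hypothetical, deliberately tiny stop word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "and", "to", "in", "it"}


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    # Lowercasing merges variants such as "The" and "the", which shrinks the vocabulary.
    lowered = text.lower()
    # Delete every ASCII punctuation character.
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and keep only the tokens that are not stop words.
    return [token for token in cleaned.split() if token not in STOP_WORDS]


tokens = preprocess("The movie was great, and the acting is superb!")
print(tokens)  # ['movie', 'great', 'acting', 'superb']
```

Only the content-bearing words survive, so the later steps see a smaller vocabulary.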

Objective of feature engineering

Next come several feature engineering steps. The main objective of feature engineering is to make learning easier for the algorithms. Often, the features are hand-engineered and biased toward the human understanding of a language. Feature engineering was of the utmost importance for classical NLP algorithms, and consequently, the best-performing systems often had the best-engineered features. For example, for a sentiment classification task, we can represent a sentence with a parse tree and assign positive, negative, or neutral labels to each node or subtree in the tree to classify that sentence as positive or negative. Additionally, the feature engineering phase can use external resources such as WordNet (a lexical database that can provide insights into how different words are related to each other, for example, as synonyms) to develop better features. We will soon look at a simple feature engineering technique known as bag of words.
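As a rough illustration of what hand-engineered features can look like, the sketch below turns a tokenized sentence into a small numeric vector for sentiment classification. It is neither the parse-tree approach described above nor bag of words. It is a simpler lexicon-counting scheme chosen for brevity. The word lists and the `extract_features` function are hypothetical, and the choice of features reflects a human guess about which cues matter for sentiment.

```python
# Hypothetical toy lexicons (assumptions for this sketch); a real system
# might build them with the help of an external resource such as WordNet.
POSITIVE_WORDS = {"good", "great", "superb", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "boring", "terrible"}
NEGATION_WORDS = {"not", "never", "no"}


def extract_features(tokens):
    """Map a list of tokens to a fixed-length, hand-designed feature vector."""
    # Count the tokens that appear in each sentiment lexicon.
    positive_count = sum(token in POSITIVE_WORDS for token in tokens)
    negative_count = sum(token in NEGATIVE_WORDS for token in tokens)
    # Flag the presence of a negation word, which can flip the sentiment.
    has_negation = int(any(token in NEGATION_WORDS for token in tokens))
    # Layout: [positive count, negative count, negation flag, sentence length]
    return [positive_count, negative_count, has_negation, len(tokens)]


features = extract_features(["movie", "not", "boring", "acting", "superb"])
print(features)  # [1, 1, 1, 5]
```

A learning algorithm would then be trained on vectors like this one rather than on the raw text, which is why the quality of the features mattered so much for classical systems.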
