Introduction

Explore the core concepts of data preprocessing in machine learning using scikit-learn. Understand the distinction between ML engineering and data science, and learn how scikit-learn supports efficient data analysis and input pipeline creation for industry applications.

A. ML engineering vs. data science

In industry, machine learning engineering and data science overlap considerably. Both roles involve working with data, including data analysis and data preprocessing.

The main task for machine learning engineers is to first analyze the data for viable trends, then create an efficient input pipeline for training a model. This process involves using libraries like NumPy and pandas for handling data, along with machine learning frameworks like TensorFlow for creating the model and input pipeline. For more information on ML engineering and the NumPy and pandas libraries, check out the previous two sections in this course.

While the NumPy and pandas libraries are also used in data science, this chapter covers one of the core libraries specific to industry-level data science: scikit-learn. Data scientists tend to work on smaller datasets than machine learning engineers, and their main goal is to analyze the data and quickly extract usable results. Therefore, they focus more on traditional data inference models (found in scikit-learn) than on deep neural networks.

The scikit-learn library includes tools for data preprocessing and data mining. It is imported in Python via the statement import sklearn.
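As a minimal sketch of what the import looks like in practice, the snippet below imports the library and applies one of its preprocessing tools. The toy array and the choice of the scale function are assumptions made for illustration, not something this lesson prescribes. Note that import sklearn on its own does not load submodules such as sklearn.preprocessing, so the example imports the function it needs explicitly.

```python
import numpy as np
import sklearn
from sklearn.preprocessing import scale

# Hypothetical toy data: 3 samples (rows), 2 features (columns)
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# Standardize each column to zero mean and unit variance
standardized = scale(data)

print(sklearn.__version__)          # installed scikit-learn version
print(standardized.mean(axis=0))    # per-column means, approximately 0
print(standardized.std(axis=0))     # per-column standard deviations, approximately 1
```

Standardization is only one of several preprocessing steps the library offers; it is used here because it is short and easy to check by hand.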
