Dataset
Learn how datasets are represented in TensorFlow's input pipeline.
Chapter Goals:
- Learn how to create a dataset in TensorFlow
- Implement a function that creates a dataset from NumPy data
A. Input pipeline
In TensorFlow, the input pipeline for executing a machine learning model is represented by the tf.data.Dataset class (which we’ll refer to as simply a dataset). A dataset can be created from a variety of input values, from NumPy arrays to protocol buffers. The most basic way to create a dataset is with the tf.data.Dataset.from_tensor_slices function.
    import numpy as np
    import tensorflow as tf

    data = np.array([[1.0, 2.1],
                     [2.0, 3.0],
                     [8.1, -10.0]])
    d1 = tf.data.Dataset.from_tensor_slices(data)
    print(d1)
In the example, d1 is a dataset containing the data from data. The dataset consists of three observations, with each observation being a row in data. Since each row of data has two columns, the observations in d1 have shape (2,).
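The shape claim above can be checked by inspecting the dataset and iterating over it. The following is a minimal sketch, assuming TensorFlow 2.x, where eager execution lets a dataset be iterated like an ordinary Python iterable. The names spec and rows are introduced here for illustration only.

```python
import numpy as np
import tensorflow as tf

data = np.array([[1.0, 2.1],
                 [2.0, 3.0],
                 [8.1, -10.0]])
d1 = tf.data.Dataset.from_tensor_slices(data)

# element_spec describes a single observation (its shape and dtype).
spec = d1.element_spec
print(spec.shape, spec.dtype)

# Assumption: TensorFlow 2.x with eager execution, so the dataset is iterable.
rows = [row.numpy() for row in d1]
for row in rows:
    print(row)  # one row of `data` per observation
```

Each element yielded by the loop is a tensor holding one row of data, which is why three observations of shape (2,) come out of a (3, 2) array.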
We can also create datasets from tuple inputs. This is useful when we want to create a dataset from both feature data and labels for each data observation.
    import numpy as np
    import tensorflow as tf

    data = np.array([[1.0, 2.0, 3.0],
                     [1.1, 0.0, 8.0]])
    labels = np.array([1, 0])
    d2 = tf.data.Dataset.from_tensor_slices((data, labels))
    print(d2)
In the example, d2 is a dataset containing the data from data and the observation labels from labels. There are two observations in total. Each observation is a (features, label) pair: the features have shape (3,), since data has three columns, and the label is a scalar.
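One of the chapter goals is a function that creates a dataset from NumPy data. A possible shape for such a function is sketched below. The name make_dataset and its labels parameter are hypothetical, chosen for this illustration, and the sketch assumes TensorFlow 2.x so that the result can be iterated directly.

```python
import numpy as np
import tensorflow as tf

def make_dataset(features, labels=None):
    """Hypothetical helper: build a dataset from NumPy features and optional labels."""
    if labels is None:
        # Features only: each observation is one row of `features`.
        return tf.data.Dataset.from_tensor_slices(features)
    # Slicing happens along the first axis, so both arrays need the same length.
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    # Tuple input: each observation is a (features_row, label) pair.
    return tf.data.Dataset.from_tensor_slices((features, labels))

data = np.array([[1.0, 2.0, 3.0],
                 [1.1, 0.0, 8.0]])
labels = np.array([1, 0])
d2 = make_dataset(data, labels)

# Assumption: TensorFlow 2.x (eager execution), so the dataset is iterable.
pairs = [(x.numpy(), y.numpy()) for x, y in d2]
for x, y in pairs:
    print(x, y)
```

Checking the row counts up front is a design choice in this sketch: it surfaces a mismatch with a clear message before TensorFlow is asked to slice the inputs.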
B. Image file dataset
The from_tensor_slices function is not limited to just taking NumPy arrays as input. For example, we can use it to create a dataset of file names. A popular application of this is creating a dataset for image files.
    import numpy as np
    import tensorflow as tf

    filenames = ['img1.jpg', 'img2.jpg']
    img_d1 = tf.data.Dataset.from_tensor_slices(filenames)
    print(img_d1)

    labels = np.array([1, 0])
    img_d2 = tf.data.Dataset.from_tensor_slices((filenames, labels))
    print(img_d2)
In the example, img_d1 represents a dataset for the input file names, while img_d2 also has a label for each image file. Note that each dataset observation is a filename, rather than the actual file contents. For more information on processing image files to retrieve the byte data, see the Image Recognition course on Educative.
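To make the point about filenames concrete, the sketch below iterates over a filename dataset and shows that each observation is a string tensor holding the name, not image bytes. It assumes TensorFlow 2.x. The files do not need to exist, because nothing here reads them. The variable items is introduced for illustration.

```python
import tensorflow as tf

filenames = ['img1.jpg', 'img2.jpg']
labels = [1, 0]
img_d2 = tf.data.Dataset.from_tensor_slices((filenames, labels))

# The first component of each observation is a scalar string tensor.
name_spec = img_d2.element_spec[0]
print(name_spec.dtype)

# Assumption: TensorFlow 2.x (eager execution), so the dataset is iterable.
# String tensors come back as bytes, so decode them for display.
items = [(name.numpy().decode('utf-8'), int(label.numpy()))
         for name, label in img_d2]
print(items)  # [('img1.jpg', 1), ('img2.jpg', 0)]
```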
C. Specialized datasets
Apart from the from_tensor_slices function, we can also use TFRecordDataset and TextLineDataset to create specialized datasets for protocol buffers and text data, respectively.
    import numpy as np
    import tensorflow as tf

    records_files = ['one.tfrecords', 'two.tfrecords']
    d1 = tf.data.TFRecordDataset(records_files)
    print(d1)

    txt_files = ['lines.txt']
    d2 = tf.data.TextLineDataset(txt_files)
    print(d2)
The TFRecordDataset takes in a list of TFRecord files and creates a dataset where each observation is an individual serialized protocol buffer. In the example, d1 contains the serialized protocol buffers from 'one.tfrecords' and 'two.tfrecords'.
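Since 'one.tfrecords' and 'two.tfrecords' are placeholder names, a round trip through a temporary file can illustrate what "serialized protocol buffer" means in practice. This is only a sketch, assuming TensorFlow 2.x. The file name example.tfrecords and the feature key 'value' are made up for the illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical file location, created only for this illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.tfrecords')

# Build one tf.train.Example protocol buffer with a single integer feature.
example = tf.train.Example(features=tf.train.Features(feature={
    'value': tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
}))

# Write the serialized protocol buffer as one record in the file.
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Each observation in the dataset is the raw serialized bytes of one record.
dataset = tf.data.TFRecordDataset([path])
serialized = [record.numpy() for record in dataset]

# Parsing the bytes recovers the original protocol buffer.
parsed = tf.train.Example.FromString(serialized[0])
value = parsed.features.feature['value'].int64_list.value[0]
print(value)  # 7
```

The dataset itself does no parsing: it hands back bytes, and turning them into usable tensors is a separate step.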
The TextLineDataset takes in a list of text files and creates a dataset where each observation is a separate line from the text files. In the example, d2 contains the lines from 'lines.txt'.
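The same kind of round trip works for text. The sketch below writes a small temporary file and reads it back line by line, assuming TensorFlow 2.x. The file contents and the temporary path are invented for the illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical text file, created only for this illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'lines.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

# Each observation is one line of the file, as a scalar string tensor.
text_dataset = tf.data.TextLineDataset([path])

# Assumption: TensorFlow 2.x (eager execution), so the dataset is iterable.
lines = [line.numpy().decode('utf-8') for line in text_dataset]
print(lines)  # ['first line', 'second line']
```

Note that the newline characters are not part of the observations.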