Checkpoint

Understand how checkpoints are used to save and load TensorFlow models.

Chapter Goals:

  • Understand how the checkpoint directory is structured
  • Learn how to restore model parameters from a checkpoint

A. Checkpoint directory

After training runs with a checkpoint directory specified, that directory will contain several files. An example checkpoint directory, named my_model, is shown below.

ls my_model
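The exact contents depend on the run; a representative listing (the step numbers here are illustrative, not output from a real run) might look like this:

checkpoint
events.out.tfevents
graph.pbtxt
model.ckpt-500.data-00000-of-00001
model.ckpt-500.index
model.ckpt-500.meta
model.ckpt-1000.data-00000-of-00001
model.ckpt-1000.index
model.ckpt-1000.meta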

The .pbtxt file stores the entire computation graph in a human-readable text format. The .tfevents file is the events file for TensorBoard (the longer file suffix, which contains the local machine's ID, is omitted here).

The actual saved model state at a particular training step consists of the following three files:

  • .data: One or more files containing the values for the model’s parameters. Larger models may require more .data files.
  • .index: Metadata descriptions for how to find a particular tensor in the .data file(s).
  • .meta: The saved graph structure in a binary (non-human-readable) format. This file can be used to restore the computation graph.

The checkpoint file records which checkpoint to use when restoring parameters, as well as every checkpoint that is still available.

cat my_model/checkpoint
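The file itself is human-readable text. A hypothetical example, consistent with the listing above, might look like:

model_checkpoint_path: "model.ckpt-1000"
all_model_checkpoint_paths: "model.ckpt-500"
all_model_checkpoint_paths: "model.ckpt-1000"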

B. Saving parameters

The traditional method to save and restore parameters in TensorFlow is to use the tf.compat.v1.train.Saver object.

To save the parameters of a given TensorFlow session, use tf.compat.v1.train.Saver.save. This function has two required arguments: the current session and the file path prefix for the saved checkpoint.

It also accepts many keyword arguments; one of them, global_step, appends the given step number to the end of the checkpoint filename.

import tensorflow as tf
tf.compat.v1.disable_eager_execution()

# Toy variable so the Saver has parameters to save
w = tf.compat.v1.get_variable('w', initializer=1.0)
saver = tf.compat.v1.train.Saver()

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    # 'my-model' is the filepath prefix; global_step appends '-1000',
    # so 'my-model-1000.*' is written to the current working directory
    saver.save(sess, 'my-model', global_step=1000)

C. Restoring parameters

When we use MonitoredTrainingSession to resume training, it automatically restores parameters from the model_checkpoint_path specified in the checkpoint file. However, if we want to evaluate the model or use it for predictions, we need another way to restore the parameters.
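For context, here is a rough sketch of what resuming with MonitoredTrainingSession looks like. The toy variable, training op, and loop count below are placeholder assumptions, not code from this lesson:

import tensorflow as tf
tf.compat.v1.disable_eager_execution()

# Toy graph: a counter variable plus the global step that
# MonitoredTrainingSession uses to number checkpoints
step = tf.compat.v1.train.get_or_create_global_step()
w = tf.compat.v1.get_variable('w', initializer=0.0)
train_op = tf.group(w.assign_add(1.0), step.assign_add(1))

# checkpoint_dir restores from the latest checkpoint in 'my_model'
# (if one exists) and periodically saves new checkpoints there
with tf.compat.v1.train.MonitoredTrainingSession(
        checkpoint_dir='my_model') as sess:
    for _ in range(10):
        sess.run(train_op)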

For evaluation and prediction, the tf.compat.v1.train.Saver.restore function serves this purpose.

import tensorflow as tf
tf.compat.v1.disable_eager_execution()  # Saver and Session need graph mode

# Rebuild the same graph structure that was used during training
inputs = tf.keras.Input(shape=(3,))
logits = tf.keras.layers.Dense(1)(inputs)
saver = tf.compat.v1.train.Saver()

ckpt = tf.compat.v1.train.get_checkpoint_state('my_model')
if ckpt is not None:  # a checkpoint file exists
    sess = tf.compat.v1.Session()
    saver.restore(sess, ckpt.model_checkpoint_path)
    sess.run(logits, feed_dict={inputs: [[0.1, 0.2, 0.3]]})

In the example above, we obtain the checkpoint state of my_model with the tf.compat.v1.train.get_checkpoint_state function. The function returns a CheckpointState object with the properties model_checkpoint_path and all_model_checkpoint_paths. The former is the checkpoint to use when restoring, while the latter lists all available checkpoints. If no checkpoint file is present in the checkpoint directory, the function returns None.
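For instance, the two properties can be inspected directly (assuming the my_model directory from earlier exists; the printed paths are hypothetical):

import tensorflow as tf

ckpt = tf.compat.v1.train.get_checkpoint_state('my_model')
if ckpt is not None:
    # The checkpoint that restore operations will use by default
    print(ckpt.model_checkpoint_path)  # e.g. my_model/model.ckpt-1000
    # Every checkpoint still available on disk
    for path in ckpt.all_model_checkpoint_paths:
        print(path)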

The Saver object contains the restore function, which restores the checkpoint from the path given in the second argument. The first argument is a tf.compat.v1.Session object that we use to execute the restoration.

We can use the save function to save the computation graph's parameters. It takes the same two required arguments as restore, and it saves the parameters to the file path prefix passed in as the second argument.
