Calculating Loss
Calculate the loss for your LSTM model.
Chapter Goals:
- Convert your LSTM model's outputs into logits
- Use a padding mask to calculate the overall loss
A. Logits & loss
As mentioned in earlier chapters, the task for a language model is no different from regular multiclass classification. Therefore, the loss function will still be the regular softmax cross entropy loss. We use a final fully-connected layer to convert model outputs into logits for each of the possible classes (i.e. vocabulary words).
```python
import tensorflow as tf

# Output from an LSTM
# Shape: (batch_size, time_steps, cell_size)
lstm_outputs = tf.compat.v1.placeholder(tf.float32, shape=(None, 10, 7))
vocab_size = 100
# print(lstm_outputs)
logits = tf.keras.layers.Dense(units=vocab_size)(lstm_outputs)

# Target tokenized sequences
# Shape: (batch_size, time_steps)
target_sequences = tf.compat.v1.placeholder(tf.int64, shape=(None, 10))
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=target_sequences,
    logits=logits)
```
The function used to calculate the softmax cross entropy loss for feed-forward neural networks is tf.nn.softmax_cross_entropy_with_logits. However, we can only use this function if the labels and logits arguments both have the same shape.
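In our example, logits has 3 dimensions while labels (target_sequences) only has 2. In this case, the labels are referred to as sparse (i.e. they represent class indexes rather than one-hot vectors), so we use the sparse version of the loss function, tf.nn.sparse_softmax_cross_entropy_with_logits. It returns one loss value per time step, so the loss has the same shape as the labels.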
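To make the sparse versus one-hot distinction concrete, here is a minimal pure-Python sketch of the loss at a single time step. It does not use TensorFlow, and the helper names (softmax, sparse_cross_entropy, dense_cross_entropy) are hypothetical, chosen only for illustration. It mirrors what the two loss variants compute rather than how the library implements them.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(label_index, logits):
    # Sparse label: a single class index (e.g. a vocabulary ID).
    return -math.log(softmax(logits)[label_index])

def dense_cross_entropy(one_hot_label, logits):
    # Dense label: a one-hot vector with the same length as logits.
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))

# One time step with a toy vocabulary of 4 words; the target is word 0.
step_logits = [2.0, 0.5, -1.0, 0.0]
sparse_loss = sparse_cross_entropy(0, step_logits)
dense_loss = dense_cross_entropy([1.0, 0.0, 0.0, 0.0], step_logits)

# Both label formats describe the same target, so the losses agree.
print(abs(sparse_loss - dense_loss) < 1e-12)
```

The sparse form avoids building a one-hot vector of length vocab_size for every time step, which matters when the vocabulary is large.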
B. Padding mask
When we calculate the loss based on the model's outputs, we don't want to include the loss for every time step in each sequence. Specifically, we want to exclude the loss calculated for the padded time steps, since those values are meaningless. Therefore, we use a padding mask to zero out the loss at padded time steps.
The padding mask will have the same shape as the labels (i.e. target batch), but it will only contain 0's and 1's. Locations containing 0 represent padded time steps while locations containing 1 represent actual input sequence tokens. We multiply the padding mask by the loss, element by element, to zero out the padded time step locations.
The code below demonstrates an example usage of a padding mask, with batch size of 2 and max sequence length of 5. Note that the padding mask is written with float values, giving it type tf.float32 so that it matches the type of the loss.
```python
import tensorflow as tf

# loss: Softmax loss for LSTM
with tf.compat.v1.Session() as sess:
    print(repr(sess.run(loss)))

# Same shape as loss
pad_mask = tf.constant([
    [1., 1., 1., 1., 0.],
    [1., 1., 0., 0., 0.]
])

new_loss = loss * pad_mask
with tf.compat.v1.Session() as sess:
    print(repr(sess.run(new_loss)))
```
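The same masking step can be traced by hand with plain Python lists. The sketch below assumes that token ID 0 is reserved for padding, and the per-time-step loss values are made up purely for illustration. It builds the mask from the padded token IDs, multiplies it into the loss, and sums what remains.

```python
# Token IDs for two target sequences, padded with 0 to length 5.
# Assumption: ID 0 is reserved for padding.
padded_ids = [[12, 7, 3, 9, 0],
              [5, 21, 0, 0, 0]]

# Hypothetical per-time-step losses with the same shape as the IDs.
step_losses = [[2.3, 1.9, 2.7, 0.4, 4.6],
               [1.1, 3.0, 4.6, 4.6, 4.6]]

# 1.0 marks a real token, 0.0 marks a padded time step.
pad_mask = [[1.0 if token > 0 else 0.0 for token in row]
            for row in padded_ids]

# Element-wise product: losses at padded positions become 0.0.
masked_losses = [[loss * keep for loss, keep in zip(loss_row, mask_row)]
                 for loss_row, mask_row in zip(step_losses, pad_mask)]

# Summing the masked losses counts only the real tokens.
total_loss = sum(sum(row) for row in masked_losses)

print(pad_mask)
print(masked_losses)
```

Without the mask, the three padded positions in the second sequence would contribute 13.8 to the sum even though the model was never asked to predict anything there.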
Time to Code!
In this chapter you'll be completing the calculate_loss function, which calculates the model loss from the LSTM outputs.
First, we'll convert the outputs of the LSTM model into logits.
Set logits equal to the output of a tf.keras.layers.Dense layer with self.vocab_size units, applied to lstm_outputs.
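When a fully-connected layer receives a 3-dimensional input like lstm_outputs, it acts on the last axis, so the same weights are applied independently at every time step. The pure-Python sketch below illustrates that idea for a single sequence. The function name dense_last_axis and the toy weights are hypothetical, and the sketch is not the Keras implementation.

```python
def dense_last_axis(sequence, weights, bias):
    # sequence: list of time steps, each a list of cell_size floats.
    # weights: cell_size x vocab_size matrix; bias: vocab_size floats.
    logits = []
    for step in sequence:
        row = [
            sum(x * weights[i][j] for i, x in enumerate(step)) + bias[j]
            for j in range(len(bias))
        ]
        logits.append(row)
    return logits

# Toy sizes: 3 time steps, cell_size = 2, vocab_size = 4.
lstm_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[0.1, 0.2, 0.3, 0.4],
           [0.5, 0.6, 0.7, 0.8]]
bias = [0.0, 0.0, 0.0, 0.0]

logits = dense_last_axis(lstm_output, weights, bias)

# One row of vocab_size logits per time step.
print(len(logits), len(logits[0]))
```

The time dimension is preserved and only the last dimension changes, from cell_size to vocab_size.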
Note that logits has shape (batch_size, self.max_length, self.vocab_size) while output_sequences (the batch of tokenized target sequences) has shape (batch_size, self.max_length). Therefore, we use a sparse softmax cross entropy to compute the loss.
Set batch_sequence_loss equal to tf.nn.sparse_softmax_cross_entropy_with_logits applied with output_sequences and logits for the labels and logits keyword arguments, respectively.
To zero out the loss for the padded time steps, we'll use binary_sequences (created in the previous chapter) as our padding mask. However, in order to use the padding mask, we first need to cast it to type tf.float32 (to match the type of the loss).
Cast binary_sequences to type tf.float32 and then multiply it by batch_sequence_loss. Store the output in the variable unpadded_loss.
The overall loss calculated by the function is the sum of the losses across every time step of each sequence.
Set overall_loss equal to tf.math.reduce_sum applied with unpadded_loss as the only argument. Then return overall_loss.
```python
import tensorflow as tf

class LanguageModel(object):
    # Model Initialization
    def __init__(self, vocab_size, max_length, num_lstm_units, num_lstm_layers):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.num_lstm_layers = num_lstm_layers
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)

    def make_lstm_cell(self, dropout_keep_prob):
        cell = tf.keras.layers.LSTMCell(self.num_lstm_units, dropout=dropout_keep_prob)
        return cell

    # Stack multiple layers for the LSTM
    def stacked_lstm_cells(self, is_training):
        dropout_keep_prob = 0.5 if is_training else 1.0
        cell_list = [self.make_lstm_cell(dropout_keep_prob) for i in range(self.num_lstm_layers)]
        cell = tf.keras.layers.StackedRNNCells(cell_list)
        return cell_list

    # Convert input sequences to embeddings
    def get_input_embeddings(self, input_sequences):
        embedding_dim = int(self.vocab_size**0.25)
        embedding = tf.keras.layers.Embedding(
            self.vocab_size+1, embedding_dim, embeddings_initializer='uniform',
            mask_zero=True, input_length=self.max_length)
        input_embeddings = embedding(input_sequences)
        return input_embeddings

    def run_lstm(self, input_sequences, is_training):
        cell = self.stacked_lstm_cells(is_training)
        input_embeddings = self.get_input_embeddings(input_sequences)
        binary_sequences = tf.math.sign(input_sequences)
        sequence_lengths = tf.math.reduce_sum(binary_sequences, axis=1)
        rnn = tf.keras.layers.RNN(
            cell,
            return_sequences=True,
            input_length=sequence_lengths,
            dtype=tf.float32)
        lstm_outputs = rnn(input_embeddings)
        return lstm_outputs, binary_sequences

    def calculate_loss(self, lstm_outputs, binary_sequences, output_sequences):
        # CODE HERE
        pass
```