The Transformer's Decoder
Formulate the transformer's decoder and learn about masked multi-head self-attention.
The decoder consists of all the aforementioned components plus two novel ones. As before:
- The output sequence is fed in its entirety, and word embeddings are computed.
- Positional encoding is again applied.
- The vectors are passed to the first decoder block (a short sketch of these steps appears after this list).
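As a rough illustration of these first steps, here is a minimal PyTorch sketch, not the course's code: the values of `d_model`, `vocab_size`, and `max_len` are placeholders, and the positional encoding follows the sinusoidal scheme of the original Transformer.

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 512, 10000, 256  # illustrative sizes

# Word embeddings for the output (target) sequence.
embedding = nn.Embedding(vocab_size, d_model)

# Sinusoidal positional encoding, as in the original Transformer.
position = torch.arange(max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pos_enc = torch.zeros(max_len, d_model)
pos_enc[:, 0::2] = torch.sin(position * div_term)
pos_enc[:, 1::2] = torch.cos(position * div_term)

# A batch of already-tokenized target sequences of shape (batch, seq_len).
tokens = torch.randint(0, vocab_size, (2, 20))
x = embedding(tokens) * math.sqrt(d_model) + pos_enc[: tokens.size(1)]
# x is now ready to be passed to the first decoder block.
```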
Each decoder block includes:
- A masked multi-head self-attention layer
- A residual connection followed by a normalization layer
- A new multi-head attention layer (known as encoder-decoder attention)
- A second residual connection and normalization layer
- A linear (feed-forward) layer with a third residual connection and normalization layer (a sketch of the whole block follows this list)
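Putting these pieces together, a decoder block might look like the following minimal PyTorch sketch. It assumes the post-norm ("Add & Norm") arrangement of the original Transformer; the class name `DecoderBlock` and all dimensions are illustrative rather than taken from the lesson.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out):
        # Causal mask: True marks positions a token is NOT allowed to attend to,
        # so position i only sees positions <= i.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
        )

        # 1) Masked multi-head self-attention, residual connection, layer norm.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)

        # 2) Encoder-decoder (cross) attention: queries come from the decoder,
        #    keys and values from the encoder output.
        cross_out, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + cross_out)

        # 3) Position-wise feed-forward layer, residual connection, layer norm.
        x = self.norm3(x + self.ff(x))
        return x
```

The boolean `causal_mask` is what makes the first attention layer "masked": each position can only attend to itself and earlier positions, so the decoder cannot peek at future tokens of the output sequence.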
The decoder block is repeated N times (six in the original Transformer). The final output is transformed through a final linear layer, and the output probabilities are calculated with the standard softmax function.
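To show the repetition and the final projection concretely, here is a hedged sketch using PyTorch's built-in `nn.TransformerDecoderLayer` and `nn.TransformerDecoder`; the values of `N`, `d_model`, `num_heads`, and `vocab_size` are illustrative.

```python
import torch
import torch.nn as nn

N, d_model, num_heads, vocab_size = 6, 512, 8, 10000  # illustrative sizes

layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=N)  # the block repeated N times
final_linear = nn.Linear(d_model, vocab_size)

# x: embedded target sequence, enc_out: encoder output, both (batch, seq, d_model).
x, enc_out = torch.rand(2, 20, d_model), torch.rand(2, 25, d_model)
causal_mask = torch.triu(torch.ones(20, 20, dtype=torch.bool), diagonal=1)

out = decoder(x, enc_out, tgt_mask=causal_mask)
# Final linear layer followed by softmax over the vocabulary.
probs = torch.softmax(final_linear(out), dim=-1)  # (batch, seq, vocab_size)
```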