The Transformer's Decoder
Formulate the transformer's decoder and learn about masked multi-head self-attention.
The decoder consists of all the aforementioned components plus two novel ones. As before:
- The output sequence is fed in its entirety, and word embeddings are computed.
- Positional encoding is again applied.
- The vectors are passed to the first decoder block (a short sketch of these steps appears after this list).
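As a rough illustration of these first steps, here is a minimal PyTorch sketch, not the course's code: the values of `d_model`, `vocab_size`, and `max_len` are placeholders, and the positional encoding follows the sinusoidal scheme of the original Transformer.

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 512, 10000, 256  # illustrative sizes

# Word embeddings for the output (target) sequence.
embedding = nn.Embedding(vocab_size, d_model)

# Sinusoidal positional encoding, as in the original Transformer.
position = torch.arange(max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pos_enc = torch.zeros(max_len, d_model)
pos_enc[:, 0::2] = torch.sin(position * div_term)
pos_enc[:, 1::2] = torch.cos(position * div_term)

# A batch of already-tokenized target sequences of shape (batch, seq_len).
tokens = torch.randint(0, vocab_size, (2, 20))
x = embedding(tokens) * math.sqrt(d_model) + pos_enc[: tokens.size(1)]
# x is now ready to be passed to the first decoder block.
```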
Each decoder block includes:
- A masked multi-head self-attention layer
- A residual connection followed by a normalization layer
- A new multi-head attention layer (known as encoder-decoder attention)
- A second residual connection and normalization layer
- A linear (feed-forward) layer with a third residual connection and normalization layer (a sketch of the whole block follows this list)
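Putting these pieces together, a decoder block might look like the following minimal PyTorch sketch. It assumes the post-norm ("Add & Norm") arrangement of the original Transformer; the class name `DecoderBlock` and all dimensions are illustrative rather than taken from the lesson.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out):
        # Causal mask: True marks positions a token is NOT allowed to attend to,
        # so position i only sees positions <= i.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
        )

        # 1) Masked multi-head self-attention, residual connection, layer norm.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)

        # 2) Encoder-decoder (cross) attention: queries come from the decoder,
        #    keys and values from the encoder output.
        cross_out, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + cross_out)

        # 3) Position-wise feed-forward layer, residual connection, layer norm.
        x = self.norm3(x + self.ff(x))
        return x
```

The boolean `causal_mask` is what makes the first attention layer "masked": each position can only attend to itself and earlier positions, so the decoder cannot peek at future tokens of the output sequence.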
The decoder block is repeated N times (six in the original Transformer). The final output is transformed through a final linear layer, and the output probabilities are calculated with the standard softmax function.
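To show the repetition and the final projection concretely, here is a hedged sketch using PyTorch's built-in `nn.TransformerDecoderLayer` and `nn.TransformerDecoder`; the values of `N`, `d_model`, `num_heads`, and `vocab_size` are illustrative.

```python
import torch
import torch.nn as nn

N, d_model, num_heads, vocab_size = 6, 512, 8, 10000  # illustrative sizes

layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=N)  # the block repeated N times
final_linear = nn.Linear(d_model, vocab_size)

# x: embedded target sequence, enc_out: encoder output, both (batch, seq, d_model).
x, enc_out = torch.rand(2, 20, d_model), torch.rand(2, 25, d_model)
causal_mask = torch.triu(torch.ones(20, 20, dtype=torch.bool), diagonal=1)

out = decoder(x, enc_out, tgt_mask=causal_mask)
# Final linear layer followed by softmax over the vocabulary.
probs = torch.softmax(final_linear(out), dim=-1)  # (batch, seq, vocab_size)
```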