The Transformer's Encoder
Formulate the encoder of a transformer by combining all the building blocks.
Even though multi-head self-attention could be a stand-alone building block, the creators of the transformer added another stack of two linear layers with an activation in between, followed by another skip connection and layer normalization.
Add linear layers to form the encoder
Suppose x is the output of the multi-head self-attention. What we depict as Linear in the diagram will look something like this:
import torch
import torch.nn as nn

dim = 512                  # model dimension
dim_linear_block = 1024    # usually a multiple of dim
dropout = 0.1

norm = nn.LayerNorm(dim)

# two linear layers with an activation in between
linear = nn.Sequential(
    nn.Linear(dim, dim_linear_block),
    nn.ReLU(),
    nn.Dropout(dropout),
    nn.Linear(dim_linear_block, dim),
    nn.Dropout(dropout)
)

# x is the output of multi-head self-attention; here a dummy tensor
# of shape [batch, tokens, dim] stands in for it
x = torch.rand(8, 10, dim)

# skip connection followed by layer normalization
out = norm(linear(x) + x)
Dropout helps avoid overfitting. Despite its name in the diagram, this block is not exactly a linear model: as we saw in the second chapter, it can be called a feedforward neural network or MLP (multi-layer perceptron). The code illustrates that it is nothing new.
The idea of the linear layers after multi-head self-attention is to project the representation into a higher-dimensional space and then back to the original space. This helps solve some stability issues and counters bad initializations.
Finally, combining multi-head self-attention, the skip connections, layer normalization, and the feedforward block gives us the transformer's encoder.
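To make the picture concrete, here is a minimal sketch of a single encoder layer that combines the blocks above. It is not the original implementation: it assumes PyTorch's built-in nn.MultiheadAttention module (with batch_first=True, available in recent PyTorch versions) for the self-attention part, and the hyperparameters (dim=512, heads=8, dim_linear_block=1024, dropout=0.1) are illustrative choices.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, dim=512, heads=8, dim_linear_block=1024, dropout=0.1):
        super().__init__()
        # multi-head self-attention (PyTorch's built-in module)
        self.attention = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                               batch_first=True)
        self.norm_1 = nn.LayerNorm(dim)
        self.norm_2 = nn.LayerNorm(dim)
        # the feedforward block from above
        self.linear = nn.Sequential(
            nn.Linear(dim, dim_linear_block),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_linear_block, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        # self-attention: queries, keys, and values all come from x
        attn_out, _ = self.attention(x, x, x)
        # first skip connection + layer normalization
        x = self.norm_1(attn_out + x)
        # feedforward block, second skip connection + normalization
        return self.norm_2(self.linear(x) + x)

# usage: a batch of 8 sequences of 10 tokens, each a 512-dim vector
x = torch.rand(8, 10, 512)
encoder_layer = EncoderLayer()
print(encoder_layer(x).shape)  # torch.Size([8, 10, 512])

The full encoder simply stacks several such layers (six in the original paper), feeding the output of one layer as the input to the next.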