Transformers Building Blocks

Learn why we use skip connections and layer normalization inside a transformer.

Short residual skip connections

Understanding language draws on a wider understanding of the world and on our ability to combine ideas. Humans rely heavily on these top-down influences (our expectations) to interpret words in different contexts.

Roughly speaking, skip connections give a transformer a limited ability to let representations from different levels of processing interact.

By forming multiple paths through the network, we can “pass” the higher-level understanding of later layers back to earlier ones, re-modulating how the model understands the input. Again, this is the same idea as human top-down understanding, which is nothing more than expectations.
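The idea above can be sketched in a few lines of Python. Here, `sublayer` is a hypothetical stand-in for any transformer sublayer (self-attention or the feed-forward network); the skip connection simply adds the sublayer's output back onto its input:

```python
def residual_block(x, sublayer):
    """Residual (skip) connection: the sublayer's output is added
    element-wise to its input, so the next layer sees both the
    transformed representation and the original one."""
    return [xi + yi for xi, yi in zip(x, sublayer(x))]

# Toy sublayer: if it learns to output ~0, the block passes x
# through unchanged -- the identity path is always available.
out = residual_block([1.0, 2.0, 3.0], lambda v: [0.0 for _ in v])
# out == [1.0, 2.0, 3.0]
```

Because the identity path is always present, information (and gradients during training) can flow around a sublayer rather than only through it.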

Layer normalization

Let’s open the Layer Norm black box.

In layer normalization (LN), the mean and variance are computed across the channel and spatial dimensions.

In language, each word is represented by a vector of K features. Since we are dealing with vectors rather than images, there is only one dimension to normalize over.

\mu_{n}=\frac{1}{K} \sum_{k=1}^{K} x_{nk}

\sigma_{n}^{2}=\frac{1}{K} \sum_{k=1}^{K}\left(x_{nk}-\mu_{n}\right)^{2}

\hat{x}_{nk}= \frac{x_{nk}-\mu_{n}}{\sqrt{\sigma_{n}^{2}+\epsilon}}, \quad \hat{x}_{nk} \in \mathbb{R}

\mathrm{LN}_{\gamma, \beta}\left(x_{n}\right) =\gamma \hat{x}_{n}+\beta, \quad x_{n} \in \mathbb{R}^{K},

where \gamma and \beta are trainable parameters.
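The equations above translate directly into code. The following is a minimal sketch for a single token vector of length K; in practice \gamma and \beta are learned per-feature vectors, but scalars are used here to keep the example short:

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Layer-normalize one token vector x of length K:
    subtract the mean, divide by the standard deviation
    (with eps for numerical stability), then scale and shift."""
    K = len(x)
    mu = sum(x) / K                                  # mean over features
    var = sum((v - mu) ** 2 for v in x) / K          # variance over features
    x_hat = [(v - mu) / math.sqrt(var + eps) for v in x]
    return [gamma * v + beta for v in x_hat]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
# With gamma=1, beta=0 the output has (approximately)
# zero mean and unit variance.
```

Note that the statistics are computed per token, over its K features, so each word's vector is normalized independently of the other tokens in the sequence.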

In a 4D tensor with merged spatial dimensions, we can visualize this with the following figure:
