Multi-Head Self-Attention
Explore how multi-head attention expands upon self-attention.
The idea of self-attention can be expanded to multi-head attention. In essence, we run through the attention mechanism several times.
Each time, we map an independent set of Key, Query, and Value matrices into different lower-dimensional spaces and compute the attention there. The individual output is called a “head”. The mapping is achieved by multiplying each matrix with a separate weight matrix, denoted as $W_i^K$, $W_i^Q$, and $W_i^V$, where $i$ is the head index.
To compensate for the extra complexity, the output vector size is divided by the number of heads. Specifically, in the vanilla transformer, they use $d_{model} = 512$ and $h = 8$ heads, which gives us vector representations of $d_k = d_v = d_{model}/h = 64$.
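As a quick sanity check on these sizes, here is a minimal NumPy sketch of a single head's projections. The sequence length of 10 and the random weights are placeholders for illustration, not trained parameters.

```python
import numpy as np

d_model, h = 512, 8          # vanilla transformer sizes
d_k = d_v = d_model // h     # 512 / 8 = 64 dimensions per head

seq_len = 10                             # hypothetical sequence length
x = np.random.randn(seq_len, d_model)    # token representations

# Per-head projection matrices (random placeholders, one set per head)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_v)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
print(Q.shape, K.shape, V.shape)         # (10, 64) each: the lower-dimensional space
```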
With multi-head attention, the model has multiple independent paths (ways) to understand the input.
The heads are then concatenated and transformed using a square weight matrix $W^O \in \mathbb{R}^{d_{model} \times d_{model}}$, since $h \cdot d_v = d_{model}$.
Putting it all together, we get:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$

where

$$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

where again:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
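To connect the formula to code, here is a minimal NumPy sketch that follows it literally, looping over the heads. The dimensions and random weights are placeholder assumptions; a real implementation would use learned parameters and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(x, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: lists with one (d_model x d_k) projection per head
    # W_o: output projection of shape (h*d_v, d_model), square when h*d_v == d_model
    heads = [attention(x @ W_q[i], x @ W_k[i], x @ W_v[i]) for i in range(len(W_q))]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage with the vanilla dimensions
d_model, h = 512, 8
d_k = d_v = d_model // h
x = np.random.randn(10, d_model)                          # 10 hypothetical tokens
W_q = [np.random.randn(d_model, d_k) for _ in range(h)]
W_k = [np.random.randn(d_model, d_k) for _ in range(h)]
W_v = [np.random.randn(d_model, d_v) for _ in range(h)]
W_o = np.random.randn(h * d_v, d_model)
print(multi_head_attention(x, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```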
Since the heads are independent of each other, we can perform the self-attention computation for all of them in parallel on different workers.
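In practice, frameworks exploit this by stacking the per-head weights and computing every head with batched tensor operations instead of an explicit loop. The sketch below shows the same computation vectorized over heads in NumPy; the einsum-based layout and random weights are illustrative assumptions, not the only way to batch it.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, h = 512, 8
d_k = d_v = d_model // h
seq_len = 10
x = np.random.randn(seq_len, d_model)

# Stack all per-head weights into single tensors of shape (h, d_model, d_k)
W_q = np.random.randn(h, d_model, d_k)
W_k = np.random.randn(h, d_model, d_k)
W_v = np.random.randn(h, d_model, d_v)
W_o = np.random.randn(h * d_v, d_model)

# One einsum projects the input for all heads at once: (h, seq_len, d_k)
Q = np.einsum('sd,hdk->hsk', x, W_q)
K = np.einsum('sd,hdk->hsk', x, W_k)
V = np.einsum('sd,hdk->hsk', x, W_v)

# Scaled dot-product attention for every head in parallel (batched over h)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, seq_len, seq_len)
heads = softmax(scores) @ V                        # (h, seq_len, d_v)

# Concatenate the heads and apply the output projection
out = heads.transpose(1, 0, 2).reshape(seq_len, h * d_v) @ W_o
print(out.shape)                                   # (10, 512)
```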