Note: I advise you to watch this video if you do not have prior knowledge of this type of representation.
The "Self-Attention with Relative Position Representations" extends the original Transformer's self-attention mechanism to efficiently consider representations of relative positions or distances between sequence elements. This enhancement proves particularly effective in tasks like machine translation, offering improved performance compared to models relying solely on absolute position representations.
- Input and Output Dimensions:
  - Input sequence $x=\left(x_{1}, \ldots, x_{n}\right)$ of $n$ elements with dimension $d_a$.
  - Output sequence $z=\left(z_{1}, \ldots, z_{n}\right)$ with dimension $d_z$.
- Computation of $z_i$:
$$z_{i}=\sum_{j=1}^{n} \alpha_{i j}\left(x_{j} W^{V}\right)$$
- Weight Coefficients $\alpha_{ij}$:
$$\alpha_{i j}=\frac{\exp e_{i j}}{\sum_{k=1}^{n} \exp e_{i k}}$$
- Compatibility Function $e_{ij}$:
$$e_{i j}=\frac{\left(x_{i} W^{Q}\right)\left(x_{j} W^{K}\right)^{T}}{\sqrt{d_{z}}}$$
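To make the three equations above concrete, here is a minimal NumPy sketch of single-head self-attention; the variable names (`x`, `W_q`, `W_k`, `W_v`) are illustrative and not taken from any particular implementation:

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Basic scaled dot-product self-attention for one head.

    x: (n, d_a) input sequence; W_q, W_k, W_v: (d_a, d_z) projections.
    Returns z of shape (n, d_z).
    """
    q, k, v = x @ W_q, x @ W_k, x @ W_v                    # (n, d_z) each
    d_z = q.shape[-1]
    e = q @ k.T / np.sqrt(d_z)                             # e_ij, shape (n, n)
    alpha = np.exp(e) / np.exp(e).sum(-1, keepdims=True)   # softmax over j
    return alpha @ v                                       # z_i = sum_j alpha_ij (x_j W^V)

# Toy usage
rng = np.random.default_rng(0)
n, d_a, d_z = 5, 8, 8
x = rng.normal(size=(n, d_a))
W_q, W_k, W_v = (rng.normal(size=(d_a, d_z)) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)              # (5, 8)
```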
- Extension to Self-Attention:
  - An extension to self-attention is proposed to consider the pairwise relationships between input elements, in the sense that the input is modeled as a labeled, directed, fully-connected graph. The edge between input elements $x_i$ and $x_j$ is represented by the vectors $a_{i j}^{V}, a_{i j}^{K} \in \mathbb{R}^{d_{a}}$, so $a_{i j}^{V}, a_{i j}^{K}$ model the interaction between positions $i$ and $j$. These representations can be shared across attention heads. We use $d_a = d_z$.
  - Edges can capture information about the relative position differences between input elements.
- Modification of $z_i$ with Edge Information $a_{ij}^V$:
$$z_{i}=\sum_{j=1}^{n} \alpha_{i j}\left(x_{j} W^{V}+a_{i j}^{V}\right)$$
- Modification of Compatibility Function with Edge Information $a_{ij}^K$:
$$e_{i j}=\frac{\left(x_{i} W^{Q}\right)\left(x_{j} W^{K}+a_{i j}^{K}\right)^{T}}{\sqrt{d_{z}}}$$
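As a sketch of how the two modified equations change the computation, assuming the edge tensors `a_K` and `a_V` of shape `(n, n, d_z)` are already given (their construction from the learnable vectors is shown further below), with variable names of my own choosing:

```python
import numpy as np

def relative_self_attention(x, W_q, W_k, W_v, a_K, a_V):
    """Self-attention with relative position representations (single head).

    x: (n, d_a); W_q, W_k, W_v: (d_a, d_z);
    a_K, a_V: (n, n, d_z) hold the edge vectors a_ij^K and a_ij^V.
    """
    q, k, v = x @ W_q, x @ W_k, x @ W_v                    # (n, d_z) each
    d_z = q.shape[-1]
    # e_ij = (x_i W^Q)(x_j W^K + a_ij^K)^T / sqrt(d_z)
    e = np.einsum('id,ijd->ij', q, k[None, :, :] + a_K) / np.sqrt(d_z)
    alpha = np.exp(e) / np.exp(e).sum(-1, keepdims=True)   # softmax over j
    # z_i = sum_j alpha_ij (x_j W^V + a_ij^V)
    return np.einsum('ij,ijd->id', alpha, v[None, :, :] + a_V)
```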
- Edge Labels for Relative Positions:
  - Edge representations $a_{ij}^K, a_{ij}^V$ capture relative position differences.
  - The relative position is clipped to a maximum absolute value of $k$. It is hypothesized that precise relative position information is not useful beyond a certain distance. Clipping the maximum distance also enables the model to generalize to sequence lengths not seen during training.
  - So, $2k+1$ unique edge labels are considered (a small worked example follows this list).
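As a small worked example (my own numbers): with $k = 3$, every relative position $j - i$ is mapped into the range $[-3, 3]$, e.g.

$$-7 \mapsto -3, \qquad -2 \mapsto -2, \qquad 0 \mapsto 0, \qquad 5 \mapsto 3,$$

so only $2 \cdot 3 + 1 = 7$ distinct labels $\{-3, \ldots, 3\}$ ever occur, regardless of sequence length.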
- Learnable Relative Position Representations:
  - $a_{ij}^K, a_{ij}^V$ are determined using learnable relative position representations $w^K, w^V$.
  - Clipping function:
$$a_{ij}^K = w_{\text{clip}(j-i, k)}^K$$
$$a_{ij}^V = w_{\text{clip}(j-i, k)}^V$$
$$\text{clip}(x, k) = \max(-k, \min(k, x))$$
- Learnable Vectors $w^K, w^V$:
$$w^K = (w_{-k}^K, \ldots, w_k^K)$$
$$w^V = (w_{-k}^V, \ldots, w_k^V)$$
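A minimal sketch (assumed variable names, continuing the `relative_self_attention` sketch above) of how the $(n, n, d_z)$ edge tensors can be built from the $2k+1$ learnable vectors via the clipping function:

```python
import numpy as np

def build_edge_representations(w_K, w_V, n, k):
    """Build a_K, a_V of shape (n, n, d_z) by table lookup.

    w_K, w_V: (2k + 1, d_z) tables whose rows correspond to the
    relative positions -k, ..., k.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    rel = np.clip(j - i, -k, k)     # clip(j - i, k), shape (n, n)
    idx = rel + k                   # shift from -k..k to 0..2k for indexing
    return w_K[idx], w_V[idx]       # each (n, n, d_z)

# Toy usage
rng = np.random.default_rng(0)
n, d_z, k = 5, 8, 2
w_K = rng.normal(size=(2 * k + 1, d_z))
w_V = rng.normal(size=(2 * k + 1, d_z))
a_K, a_V = build_edge_representations(w_K, w_V, n, k)
print(a_K.shape)                    # (5, 5, 8)
```

In a real model `w_K` and `w_V` would be trainable parameters (and the lookup would typically be done as an embedding layer), but the indexing logic is the same.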
- Shaw, Uszkoreit, and Vaswani (2018) were among the first to introduce an alternative method for incorporating both absolute and relative position encodings.
- The use of relative position representations allows the model to consider pairwise relationships and capture information about the relative position differences between input elements, enhancing its ability to understand the sequence structure.
- Although it cannot be compared directly with the effect of simply adding position embeddings, they roughly omit the position–position interaction and keep only one unit–position term. In addition, they do not share the projection matrices but directly model the pairwise position interaction with the vectors $a$. In an ablation analysis they found that adding $a_{i j}^{K}$ alone might be sufficient.
- To reduce space complexity, they share the parameters across attention heads. While it is not explicitly mentioned in the paper, we understand that they add the position information in each layer but do not share the parameters across layers. The authors find that relative position embeddings perform better in machine translation, and that combining absolute and relative embeddings does not improve performance.