Rotary Position Embedding
- In this post we will try to understand RoPE.
Why Relative Position Embedding?
It captures how two tokens are positioned with respect to each other.
For example:
my daughter called her brother last night [Sentence 1]
Positions: 0, 1 (n), 2 (m), 3, 4, 5, 6
- The Absolute Position Embedding (APE) of the token ‘daughter’ depends only on its own position, i.e. 1 (n)
- The Relative Position Embedding (RPE) of the token ‘daughter’ with respect to the token ‘called’ is -1 (n - m)
- This -1 (or n - m in general) tells whether ‘daughter’ is before or after ‘called’, and at what distance
A query at position m attends to a key at position n, and the attention score depends more on their relative position than on their absolute positions.
last night my daughter called her brother [Sentence 2]
Positions: 0, 1, 2, 3, 4, 5, 6
In the above sentence, even though the APE of ‘daughter’ and ‘called’ is different compared to the previous sentence, their RPE is still the same.
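A quick way to see this is to compute the positions directly. A toy illustration in plain Python (not tied to any tokenizer):

```python
# Absolute positions of 'daughter' and 'called' differ across the two
# sentences, but their relative offset n - m is the same.
s1 = "my daughter called her brother last night".split()
s2 = "last night my daughter called her brother".split()

for s in (s1, s2):
    n, m = s.index("daughter"), s.index("called")
    print(f"absolute: daughter={n}, called={m}, relative: {n - m}")
# absolute: daughter=1, called=2, relative: -1
# absolute: daughter=3, called=4, relative: -1
```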
In the case of APE (with sinusoidal embeddings), there is a linear relationship between the embeddings at positions pos and pos + k, and we expect our model to learn this relationship (a small check follows below):
- PE_{pos + k} = T^k \cdot PE_{pos}, where T is a fixed linear transformation
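A minimal check of this relationship, assuming the standard sinusoidal APE from the original Transformer paper; `sinusoidal_pe` and `shift_matrix` below are just illustrative helper names:

```python
import numpy as np

d_model = 8
freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))  # one frequency per 2D pair

def sinusoidal_pe(pos):
    """Sinusoidal embedding with layout [sin, cos, sin, cos, ...]."""
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def shift_matrix(k):
    """Block-diagonal T^k that maps PE_pos to PE_{pos+k}."""
    T = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # 2D rotation acting on the (sin, cos) pair of this frequency.
        T[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return T

pos, k = 5, 3
print(np.allclose(shift_matrix(k) @ sinusoidal_pe(pos), sinusoidal_pe(pos + k)))  # True
```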
In the case of RPE, we are explicitly telling our model that these two tokens are k tokens apart.
The table below summarizes APE vs RPE for the token pair ‘daughter’ / ‘called’:
| Sentence | APE | RPE |
|---|---|---|
| 1 | tokens are at positions 1 and 2 | tokens are 1 apart (offset -1), where ‘daughter’ comes before ‘called’ |
| 2 | tokens are at positions 3 and 4 | tokens are 1 apart (offset -1), where ‘daughter’ comes before ‘called’ |
- As we can generalise from the above table, for two tokens that typically occur together (or, in general, k tokens apart), the RPE stays the same, whereas the APE differs for each sentence, and the model has to learn over all those variations to generalise well
- So instead of saying ‘I am at this position or that position’, it is better to explicitly say that the two tokens are k tokens apart
RoPE
Let’s say we have a 2D vector at angle \alpha with magnitude r, and we want to rotate it by angle \theta.
We can write the initial x and y as:
- x = r\cos(\alpha)
- y = r\sin(\alpha)
If we rotate it by \theta, then:
- x' = r\cos(\alpha + \theta) = r(\cos(\alpha)\cos(\theta) - \sin(\alpha)\sin(\theta)) = x\cos(\theta) - y\sin(\theta)
- y' = r\sin(\alpha + \theta) = r(\sin(\alpha)\cos(\theta) + \cos(\alpha)\sin(\theta)) = y\cos(\theta) + x\sin(\theta)
If we represent: R(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \\ \end{bmatrix}
z' = \begin{bmatrix} x' \\ y' \\ \end{bmatrix}
z = \begin{bmatrix} x \\ y \\ \end{bmatrix}
Then we can write z' = R(\theta)z
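A minimal numerical sketch of this rotation (plain NumPy; `R` is just an illustrative helper name):

```python
import numpy as np

def R(theta):
    """2D rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# Start from a vector at angle alpha with magnitude r, rotate by theta,
# and compare against the closed-form (x', y') expressions above.
r, alpha, theta = 2.0, 0.3, 0.9
z = np.array([r * np.cos(alpha), r * np.sin(alpha)])  # (x, y)
z_expected = np.array([r * np.cos(alpha + theta), r * np.sin(alpha + theta)])
print(np.allclose(R(theta) @ z, z_expected))  # True
```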
Below are two properties of rotation matrices which will be used later (verified numerically after the list):
- R(-\theta) = R(\theta)^T which gives
- R(\theta)^T R(\theta) = I [Identity Matrix]
- R(\theta_{1}) R(\theta_{2}) = R(\theta_{1} + \theta_{2})
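Reusing the R(\theta) helper from the snippet above, these properties are easy to check numerically:

```python
# Quick checks of the rotation-matrix properties listed above.
t1, t2 = 0.4, 1.1
print(np.allclose(R(-t1), R(t1).T))             # R(-theta) = R(theta)^T
print(np.allclose(R(t1).T @ R(t1), np.eye(2)))  # R(theta)^T R(theta) = I
print(np.allclose(R(t1) @ R(t2), R(t1 + t2)))   # R(t1) R(t2) = R(t1 + t2)
```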
RoPE rotates the embeddings by an angle proportional to the token’s position.
Let’s say we have a query q at position m and a key k at position n; RoPE rotates them by m\theta and n\theta respectively.
Then the dot product between the rotated q and k would be:
- (R(m\theta)q_{m})^TR(n\theta)k_{n}
- = q_{m}^TR(m\theta)^TR(n\theta)k_{n}
- Since R(-\theta) = R(\theta)^T, so R(-m\theta) = R(m\theta)^T
- this gives q_{m}^TR(-m\theta)R(n\theta)k_{n}
- now since R(\theta_{1}) R(\theta_{2}) = R(\theta_{1} + \theta_{2})
- this gives q_{m}^TR((n-m)\theta)k_{n}
- In the above equation, the attention score now explicitly depends on the relative distance (n - m) between q and k
- So the score changes only if the relative distance changes; shifting both tokens by the same offset leaves it unchanged (a numerical check follows below)
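A minimal check of this property (redefining the R(\theta) helper so the snippet runs on its own; `score` is just an illustrative name):

```python
import numpy as np

def R(theta):
    """2D rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

theta = 0.1
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

def score(m, n):
    """Dot product between the rotated query (at m) and key (at n)."""
    return (R(m * theta) @ q) @ (R(n * theta) @ k)

# Shifting both positions by the same offset leaves the score unchanged...
print(np.isclose(score(2, 5), score(12, 15)))           # same n - m -> same score
# ...and it equals q^T R((n - m) theta) k directly.
print(np.isclose(score(2, 5), q @ (R(3 * theta) @ k)))  # True
```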
Implementation
- Treat each consecutive pair of embedding dimensions as a single 2D vector and rotate it independently
- For example, if we have a d_model of 256, then we will have 128 2D vectors, and we can rotate those pairs independently, each pair with its own angle (a sketch follows below)
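Below is a minimal RoPE sketch, assuming the RoFormer frequencies \theta_i = 10000^{-2i/d}; the function name `rope` and the example values are just for illustration:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate a single d-dimensional vector x at position `pos` (RoPE)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one angle per 2D pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                  # the two halves of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # 2D rotation applied pairwise
    out[1::2] = x1 * sin + x2 * cos
    return out

# Usage: dot products of RoPE'd queries and keys depend only on the
# relative position, not on the absolute positions.
d_model = 256                                  # -> 128 independent 2D pairs
rng = np.random.default_rng(0)
q, k = rng.normal(size=d_model), rng.normal(size=d_model)
print(np.isclose(rope(q, 3) @ rope(k, 7), rope(q, 103) @ rope(k, 107)))  # True
```

In practice, implementations typically precompute the cos/sin tables once and apply this pairwise rotation to the queries and keys of every attention head, but the per-pair rotation above is the whole trick.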