Attention Mechanisms

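The 2017 paper "Attention Is All You Need" introduced the Transformer, which abandons the recurrent neural network (RNN) and convolutional neural network (CNN) structures used before it and builds the model entirely on the self-attention mechanism.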

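Today the Transformer is the foundation of large language models such as the GPT series. It is worth noting, however, that the Transformer is not the only approach to sequence modeling. RWKV (Receptance Weighted Key-Value), for example, combines an RNN structure with an attention mechanism.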

3.1 The Problem with Modeling Long Sequences

Because languages differ in grammatical structure, translating a text word by word does not work.

For example, translating "Could you do me a favor" into Chinese word by word yields "能你做我个恩惠", which is not a grammatical Chinese sentence.

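Before the Transformer, RNNs with an encoder-decoder structure were the dominant approach to machine translation. The encoder converts the input sequence (the source-language sentence) into a fixed-length context vector that captures its semantic content. Because everything the decoder sees must pass through this single vector, details of a long sentence are easily lost.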
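To make the fixed-length bottleneck concrete, here is a minimal sketch of a one-unit RNN encoder in plain Python. The weights `w_in` and `w_rec` are hypothetical hand-picked values, not trained parameters, and the hidden state is a single number rather than a vector.

```python
import math

def encode(tokens, w_in=0.5, w_rec=0.8):
    """Toy one-unit RNN encoder: folds a whole sequence into one hidden state."""
    h = 0.0  # hidden state before any token has been read
    for x in tokens:
        # the new state mixes the current token with the previous state
        h = math.tanh(w_in * x + w_rec * h)
    return h  # plays the role of the fixed-length context vector

# Two long inputs that differ only in their first token (hypothetical values)
seq_a = [1.0] + [0.2] * 50
seq_b = [-1.0] + [0.2] * 50

print(encode(seq_a))
print(encode(seq_b))
print(abs(encode(seq_a) - encode(seq_b)))  # tiny: the first token has faded
```

Whatever the input length, the encoder returns a state of the same size, and in this toy setup the influence of the first token has almost vanished after 50 further steps.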

![[Pasted image 20250222191715.png]]

3.2 Capturing data dependencies with attention mechanisms

With an attention mechanism, the text-generating decoder in the network can selectively access all input tokens, which means that some input tokens matter more than others when a given output token is generated.

Self-attention in transformers is a technique designed to enhance input representations by enabling each position in a sequence to engage with and determine the relevance of every other position within the same sequence.

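The goal of self-attention is to improve the feature representation of the input: every element of the sequence attends to the other elements and weighs their importance, which captures the dependencies within the sequence more effectively.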

Take the English word "bank" as an example. Its meaning differs completely depending on the context:

“I need to go to the bank to deposit some money.”
“We had a picnic on the bank of the river.”

In the first sentence "bank" means a financial institution; in the second it refers to a riverbank.

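Each word in a sentence is converted into a word embedding, that is, a high-dimensional vector. Self-attention then computes an attention score between each word and every word in the sentence.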

When "picnic" and "river" appear in the sentence, for instance, "bank" most likely refers to a riverbank rather than anything else.
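The "bank" example can be sketched with a toy computation. The 2-D embeddings below are invented for illustration (axis 0 loosely stands for "finance", axis 1 for "nature") and do not come from any trained model; `contextualize` is a hypothetical helper name.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "finance", axis 1 ~ "nature"
embeddings = {
    "picnic": [0.1, 0.8],
    "bank":   [0.6, 0.6],
    "river":  [0.0, 1.0],
    "money":  [1.0, 0.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(sentence, word):
    """Attention-weighted average of the sentence's embeddings, seen from `word`."""
    query = embeddings[word]
    weights = softmax([dot(embeddings[w], query) for w in sentence])
    return [
        sum(wt * embeddings[w][d] for wt, w in zip(weights, sentence))
        for d in range(len(query))
    ]

river_bank = contextualize(["picnic", "bank", "river"], "bank")
money_bank = contextualize(["bank", "money"], "bank")
print(river_bank)  # the "nature" component is the larger one
print(money_bank)  # the "finance" component is the larger one
```

The same starting vector for "bank" ends up in two different places depending on its neighbors, which is the effect the two example sentences describe.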

3.3 Attending to different parts of the input with self-attention

3.3.1 A simple self-attention mechanism without trainable weights

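We start with a code implementation of a self-attention mechanism meant for illustration. This simplified mechanism is not the one actually used in Transformer models.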

Suppose we are given an input sequence $x^{(1)}, \dots, x^{(T)}$.

For example, $x^{(1)}$ is a multi-dimensional vector: the token embedding of the word "Your".

The goal: for each element $x^{(i)}$ of the input sequence, compute a context vector $z^{(i)}$, where $x^{(i)}$ and $z^{(i)}$ have the same dimension.

In sequence models (RNN, LSTM, Transformer), context vectors are used to capture information about the input sequence.

The context vector $z^{(i)}$ is a weighted sum of the inputs $x^{(1)}, \dots, x^{(T)}$.

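For example, when computing $z^{(2)}$, the model computes the attention weights $\alpha_{21}, \alpha_{22}, \alpha_{23}, \dots$ with respect to $x^{(2)}$:

$$z^{(2)} = \alpha_{21} x^{(1)} + \alpha_{22} x^{(2)} + \alpha_{23} x^{(3)} + \dots + \alpha_{2T} x^{(T)}$$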

![[Pasted image 20250222200141.png]]

(Note that the numbers in this figure are truncated to one digit after the decimal point to reduce visual clutter; other figures may likewise contain truncated values.)

Compute unnormalized attention scores

Suppose we use the second input token as the query, that is, $q^{(2)} = x^{(2)}$.

We compute the unnormalized attention scores with dot products:

$$
\begin{align}
\omega_{21} &= \mathbf{x}^{(1)} \mathbf{q}^{(2)\top} \\
\omega_{22} &= \mathbf{x}^{(2)} \mathbf{q}^{(2)\top} \\
&\;\;\vdots \\
\omega_{2T} &= \mathbf{x}^{(T)} \mathbf{q}^{(2)\top}
\end{align}
$$

Here $\omega$ denotes an unnormalized attention score; $\omega_{21}$ means that the second element of the input sequence is used as the query and compared against the first element.

We now show how to compute the context vector $z^{(2)}$ with $x^{(2)}$ as the query.

Step 1: compute the unnormalized attention scores by taking the dot product between the query $x^{(2)}$ and every input token, including the query itself:

Suppose the input elements below have already been embedded as 3-dimensional vectors:

import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

(In the tensor shown above, each row represents a word and each column represents an embedding dimension.)

query = inputs[1] # take 2nd input token as query
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query) # dot product (transpose not necessary here since they are 1-dim vectors)
print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
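The loop above can probably be replaced by a single matrix-vector product, since each score is just the dot product of one row of `inputs` with the query. A minimal sketch, repeating the same `inputs` tensor so the snippet runs on its own:

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
query = inputs[1]  # 2nd input token is the query

# (6, 3) matrix times (3,) vector -> (6,) vector, one dot product per row
attn_scores_2 = inputs @ query
print(attn_scores_2)
```

This should print the same six scores as the loop, without writing into a preallocated tensor element by element.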

A dot product is essentially shorthand for multiplying two vectors element by element and summing the results.

res = 0.

for idx, element in enumerate(inputs[0]):
    res += element * query[idx]  # multiply matching elements and accumulate

print(res)
print(torch.dot(inputs[0], query))

tensor(0.9544)
tensor(0.9544)
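One way to read these scores is as a rough similarity measure: a larger dot product means the input vector points in a direction closer to the query, although vector length also affects the value. The sketch below ranks the tokens by their score against the query "journey", using plain Python lists so the arithmetic stays visible. The `tokens` list and the `dot` helper are added for illustration.

```python
tokens = ["Your", "journey", "starts", "with", "one", "step"]
inputs = [
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]
query = inputs[1]  # "journey"

def dot(a, b):
    """Element-wise products, summed."""
    return sum(x * y for x, y in zip(a, b))

scores = [dot(x_i, query) for x_i in inputs]

# Sort tokens from highest to lowest score against the query
ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
for token, score in ranked:
    print(f"{token:8s} {score:.4f}")
```

The query scores highest against itself, followed closely by "starts", whose embedding is nearly identical to that of "journey".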

Step 2: normalize the unnormalized attention scores so that they sum to 1:

![[Pasted image 20250223170956.png]]

attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()

print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)

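In practice, however, the softmax function is usually recommended for normalization, because it handles extreme values better and has more favorable gradient properties during training.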

attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
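To see why softmax is the safer choice, the sketch below compares the two normalizations in plain Python. `sum_normalize` and `softmax` are illustrative helper names, the `mixed` scores are hypothetical, and subtracting the maximum before exponentiating is a common stabilization trick rather than something the text above prescribes.

```python
import math

def sum_normalize(scores):
    """Divide each score by the sum of all scores."""
    total = sum(scores)
    return [s / total for s in scores]

def softmax(scores):
    # Subtracting the maximum does not change the result mathematically,
    # but it keeps math.exp from overflowing on large scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

attn_scores_2 = [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865]
print(softmax(attn_scores_2))

mixed = [2.0, -1.0, 0.5]     # hypothetical scores with a negative entry
print(sum_normalize(mixed))  # contains a negative "weight"
print(softmax(mixed))        # all weights positive, summing to 1

# math.exp(1000.0) on its own would raise OverflowError
print(softmax([1000.0, 999.0]))
```

Softmax always yields positive weights that sum to 1, whereas dividing by the sum can produce negative weights as soon as one score is negative.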

Step 3: compute the context vector $z^{(2)}$ by multiplying each embedded input token $x^{(i)}$ by its attention weight $\alpha_{2i}$ and summing the resulting vectors:

$$z^{(2)} = \sum_{i=1}^{T} \alpha_{2i}\, x^{(i)}$$

![[Pasted image 20250224001521.png]]

query = inputs[1] # 2nd input token is the query

context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i]*x_i

print(context_vec_2)

tensor([0.4419, 0.6515, 0.5683])
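The three steps can be gathered into one function that takes the index of the query token. The sketch below uses plain Python lists instead of tensors so that each step stays explicit; `context_vector` is a hypothetical helper name, not part of PyTorch.

```python
import math

inputs = [
    [0.43, 0.15, 0.89],  # Your     (x^1)
    [0.55, 0.87, 0.66],  # journey  (x^2)
    [0.57, 0.85, 0.64],  # starts   (x^3)
    [0.22, 0.58, 0.33],  # with     (x^4)
    [0.77, 0.25, 0.10],  # one      (x^5)
    [0.05, 0.80, 0.55],  # step     (x^6)
]

def context_vector(inputs, query_index):
    query = inputs[query_index]
    # Step 1: unnormalized scores, one dot product per input token
    scores = [sum(a * b for a, b in zip(x_i, query)) for x_i in inputs]
    # Step 2: softmax normalization
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 3: weighted sum of the input vectors, dimension by dimension
    return [
        sum(w * x_i[d] for w, x_i in zip(weights, inputs))
        for d in range(len(query))
    ]

z_2 = context_vector(inputs, 1)  # index 1 is the 2nd token
print(z_2)
```

Calling `context_vector(inputs, i)` for another index `i` gives the context vector of that token by the same three steps.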

3.3.2 Computing attention weights for all input tokens

![[Pasted image 20250224002240.png]]
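Extending the computation from one query to all of them can be sketched with two matrix multiplications: one for all pairwise scores, one for all weighted sums. The variable names below are chosen for illustration, and the `inputs` tensor is repeated so the snippet runs on its own.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

# (6, 6) matrix: entry [i, j] is the dot product of x^(i) and x^(j)
attn_scores = inputs @ inputs.T

# Normalize each row so that the weights for one query sum to 1
attn_weights = torch.softmax(attn_scores, dim=-1)

# (6, 3) matrix: row i is the context vector z^(i)
all_context_vecs = attn_weights @ inputs

print(attn_weights.sum(dim=-1))
print(all_context_vecs)
```

The second row of `all_context_vecs` should agree with the `context_vec_2` computed in the previous section.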

3.4 Implementing self-attention with trainable weights