3.4 Implementing self-attention with trainable weights

This section implements a self-attention mechanism with trainable weights.

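The Transformer is a sequence-to-sequence (Seq2Seq) model built on self-attention, proposed by Vaswani et al. in the 2017 paper "Attention Is All You Need". It drops the traditional RNN and CNN structures entirely and relies only on self-attention and feed-forward networks for efficient sequence modeling.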

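Today's mainstream large language models are all based on the Transformer architecture, and any discussion of the Transformer inevitably comes back to the attention mechanism.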

3.4.1 Computing the attention weights step by step

In this section, we implement the self-attention mechanism used in the Transformer architecture of the original GPT series.

We compute each context vector as a weighted sum over the input vectors, where the weights are specific to a given input element.

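The trainable weight matrices are what make this work: by updating them during training, the model (specifically the attention module inside the model) can learn to produce "good" context vectors.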

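Before we begin, let us introduce the three trainable weight matrices $W_Q$, $W_K$, and $W_V$.

They are the query matrix, the key matrix, and the value matrix, respectively.

These three matrices project the input token embeddings into query, key, and value vectors via matrix multiplication: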

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

The embedding dimension of the input token $x$ and that of the query vector $q$ can be the same or different, depending on the model design.
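
To make the dimensions concrete, here is a sketch of the shapes involved. It assumes the input tokens are stacked as the rows of $X$; the symbol $n$ (number of tokens) is introduced here only for illustration, while $d_{in}$ and $d_{out}$ correspond to the `d_in` and `d_out` variables used in the code of this section:

$$X \in \mathbb{R}^{n \times d_{in}}, \quad W_Q, W_K, W_V \in \mathbb{R}^{d_{in} \times d_{out}}, \quad Q, K, V \in \mathbb{R}^{n \times d_{out}}$$

With the example used in this section, $n = 6$, $d_{in} = 3$, and $d_{out} = 2$.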

Let's start with some initial tensors:

import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

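In GPT models, the input and output dimensions are usually the same. For illustration purposes, however, we use different input and output dimensions here: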

x_2 = inputs[1] # second input element
d_in = inputs.shape[1] # the input embedding size, d=3
d_out = 2 # the output embedding size, d=2

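Below, we initialize the three weight matrices. Note that we set requires_grad=False to reduce clutter in the example outputs. If we were to use these weight matrices for model training, we would set requires_grad=True so that they are updated during training.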

torch.manual_seed(123)

W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

Next, let's compute the vectors $q_2$, $k_2$, and $v_2$:

query_2 = x_2 @ W_query # _2 because it's with respect to the 2nd input element
key_2 = x_2 @ W_key 
value_2 = x_2 @ W_value

print(query_2)

tensor([0.4306, 1.4551])

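As shown below, computing the keys and values for all inputs projects the six input tokens from a 3-dimensional to a 2-dimensional embedding space.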

keys = inputs @ W_key 
values = inputs @ W_value

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])

Next, we compute the unnormalized attention scores by taking the dot product between the query and each key vector.

![[Pasted image 20250224031728.png]]

keys_2 = keys[1]
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

Since we have six input tokens, there are six attention scores for the given query $q$.

attn_scores_2 = query_2 @ keys.T # All attention scores for given query
print(attn_scores_2)

![[Pasted image 20250224032137.png]]
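
As a sanity check on the step above, the following sketch recomputes the scores for the second query one dot product at a time and compares them with the single matrix product. It uses plain tensors from `torch.rand` instead of `nn.Parameter`, and the name `scores_loop` is introduced only for this illustration.

```python
import torch

torch.manual_seed(123)

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)
d_in, d_out = inputs.shape[1], 2

# Plain tensors are enough for this check (assumption: no training involved)
W_query = torch.rand(d_in, d_out)
W_key = torch.rand(d_in, d_out)

query_2 = inputs[1] @ W_query   # query for the 2nd token, shape (2,)
keys = inputs @ W_key           # keys for all tokens, shape (6, 2)

# One dot product per key vector
scores_loop = torch.stack([query_2.dot(k) for k in keys])

# The same six scores in a single matrix-vector product
scores_matmul = query_2 @ keys.T

print(scores_matmul.shape)
print(torch.allclose(scores_loop, scores_matmul))
```

The matrix form is what the compact class later in this section relies on; the loop is only there to show what each entry means.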

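Now we compute the attention weights using the softmax function. There is one extra step: we divide the attention scores by the square root of the embedding dimension of the keys, $\sqrt{d_k}$. This scaling factor keeps the dot products from growing too large, which would otherwise push softmax into a saturated regime where gradients become very small (vanishing gradients).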

d_k = keys.shape[1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)

![[Pasted image 20250224032504.png]]
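
To illustrate why the scaling helps, here is a small sketch with made-up scores and a hypothetical `d_k = 64` (neither comes from the example above). It compares how peaked the softmax output is, and how much gradient flows back to the scores, with and without the scaling. The helper `weights_and_grad_norm` is introduced here for illustration only.

```python
import torch

# Made-up attention scores with a large spread (illustration only)
scores = torch.tensor([2.0, 4.0, 8.0, 16.0])
d_k = 64  # hypothetical key dimension

def weights_and_grad_norm(scores, scale):
    """Return softmax weights and the gradient norm of the largest
    weight with respect to the raw scores."""
    s = scores.clone().requires_grad_(True)
    w = torch.softmax(s / scale, dim=-1)
    w.max().backward()
    return w.detach(), s.grad.norm().item()

w_unscaled, g_unscaled = weights_and_grad_norm(scores, 1.0)
w_scaled, g_scaled = weights_and_grad_norm(scores, d_k**0.5)

print("unscaled weights:", w_unscaled)
print("scaled weights:  ", w_scaled)
print("gradient norms:", g_unscaled, g_scaled)
```

Without scaling, almost all of the weight lands on the largest score and the gradient is close to zero. With scaling, the distribution is softer and the gradient is noticeably larger.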

Now we compute the context vector $z_2$:

context_vec_2 = attn_weights_2 @ values
print(context_vec_2)
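
The matrix product above is shorthand for a weighted sum of the value vectors, $z_2 = \sum_j \alpha_{2j} v_j$, where $\alpha_{2j}$ denotes the attention weights (this symbol is introduced here for illustration). The sketch below spells the sum out as a loop and checks it against the matrix form. It again uses plain tensors rather than `nn.Parameter`.

```python
import torch

torch.manual_seed(123)

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)
d_in, d_out = inputs.shape[1], 2

W_query = torch.rand(d_in, d_out)
W_key = torch.rand(d_in, d_out)
W_value = torch.rand(d_in, d_out)

query_2 = inputs[1] @ W_query
keys = inputs @ W_key
values = inputs @ W_value

# Scaled, softmax-normalized attention weights for the 2nd query
attn_weights_2 = torch.softmax(
    (query_2 @ keys.T) / keys.shape[-1]**0.5, dim=-1
)

# Explicit weighted sum over the value vectors
context_loop = torch.zeros(d_out)
for weight, value in zip(attn_weights_2, values):
    context_loop += weight * value

# Same result as a single matrix-vector product
context_matmul = attn_weights_2 @ values

print(attn_weights_2.sum())
print(torch.allclose(context_loop, context_matmul))
```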

Putting it all together, here is the complete code as a compact class:

import torch.nn as nn

class SelfAttention_v1(nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        
        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))

![[Pasted image 20250224033104.png]]
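
The class computes context vectors for all six tokens at once. As a minimal check, the sketch below repeats the class definition so the block runs on its own, then recomputes the second row by hand from the module's own weights. The names `q2`, `w2`, and `z2` are introduced here for illustration.

```python
import torch
import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        return attn_weights @ values

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in=3, d_out=2)
all_context = sa_v1(inputs)          # one context vector per token

# Recompute the 2nd context vector step by step with the same weights
q2 = inputs[1] @ sa_v1.W_query
keys = inputs @ sa_v1.W_key
values = inputs @ sa_v1.W_value
w2 = torch.softmax((q2 @ keys.T) / keys.shape[-1]**0.5, dim=-1)
z2 = w2 @ values

print(all_context.shape)
print(torch.allclose(all_context[1], z2))
```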

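We can streamline the implementation above with PyTorch's Linear layers, which are equivalent to a matrix multiplication when the bias units are disabled. Compared with manually using nn.Parameter(torch.rand(...)), another significant advantage of nn.Linear is that it comes with a preferred weight initialization scheme, which makes model training more stable: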

self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
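
The three lines above belong inside a module's `__init__`. A minimal sketch of how the full class might look is given below. The class name `SelfAttention_v2`, the default `qkv_bias=False`, and the seed value are assumptions made for this illustration. Note that `nn.Linear` stores its weight with shape `(d_out, d_in)`, so calling the layer computes `x @ weight.T` when the bias is disabled.

```python
import torch
import torch.nn as nn

class SelfAttention_v2(nn.Module):  # hypothetical name for this sketch
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        # Calling a Linear layer applies the projection to every row of x
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        return attn_weights @ values

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

torch.manual_seed(789)  # arbitrary seed, chosen for this sketch
sa_v2 = SelfAttention_v2(d_in=3, d_out=2)
out = sa_v2(inputs)
print(out.shape)

# With bias disabled, the layer is just a matrix multiplication
manual_queries = inputs @ sa_v2.W_query.weight.T
print(torch.allclose(sa_v2.W_query(inputs), manual_queries))
```

Because the two versions initialize their weights differently, their outputs will differ even with the same seed.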